Data validation with machine learning

Project

Summary

The aim of this project is to extend and speed up data validation in the FSO by means of machine learning algorithms and at the same time to improve data quality.

Description

Statistical offices carry out data validation to check the quality and reliability of administrative and survey data. Data that are clearly incorrect, or at least questionable, are sent back to data suppliers with a correction request or comment. Until now, data validation has mainly been carried out at two different levels: through manual checks or through automated processes using threshold values and logical tests. This process of two-way plausibility checks involves a great deal of work. In some cases, staff have to check the data manually again; in others, rules are applied that often require additional checks. The rule-based approach has grown out of previous experience, but it is not necessarily exhaustive or always precise.

Machine learning could help to make these checks faster and more accurate. The approach relies on an algorithm that first learns from historical data. Based on a prior analysis of the data, a target variable is defined that the algorithm should be able to predict. Only then can the algorithm be used for prediction. Next, the predicted and actual values of the target variable are compared and the predictive accuracy is evaluated. Finally, a feedback mechanism sends an automatic explanation to data suppliers.
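
The workflow described above (learn from historical data, predict a target variable, compare predicted and actual values, send an explanation) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the variables `employees` and `payroll`, the sample figures, the simple linear model standing in for a machine learning algorithm, and the tolerance factor `k`. It is not the method used in the project.

```python
from statistics import mean, stdev


def fit_line(xs, ys):
    """Ordinary least squares fit for y = a + b * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def validate(history, new_records, k=3.0):
    """Flag new records whose target value deviates strongly from the prediction.

    history and new_records are lists of dicts with the hypothetical keys
    "employees" (predictor) and "payroll" (target variable).
    """
    xs = [r["employees"] for r in history]
    ys = [r["payroll"] for r in history]
    a, b = fit_line(xs, ys)

    # Tolerance derived from how well the model fits the historical data.
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    tolerance = k * stdev(residuals)

    feedback = []
    for r in new_records:
        predicted = a + b * r["employees"]
        deviation = r["payroll"] - predicted
        if abs(deviation) > tolerance:
            feedback.append({
                "id": r["id"],
                "message": (
                    f"Reported payroll {r['payroll']} differs from the expected "
                    f"value of about {predicted:.0f} for {r['employees']} "
                    f"employees. Please check this figure."
                ),
            })
    return feedback


# Made-up historical data (assumption, for illustration only).
history = [
    {"employees": 10, "payroll": 610},
    {"employees": 20, "payroll": 1190},
    {"employees": 30, "payroll": 1820},
    {"employees": 40, "payroll": 2390},
    {"employees": 50, "payroll": 3010},
]

# Record "B" contains a plausible typing error (one digit too many).
new_records = [
    {"id": "A", "employees": 25, "payroll": 1500},
    {"id": "B", "employees": 35, "payroll": 21000},
]

feedback = validate(history, new_records)
for item in feedback:
    print(item["id"], item["message"])
```

Unlike a fixed threshold, the tolerance here is derived from the historical data itself, and the message returned to the data supplier states the expected value, which is one simple form of automated explanation.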

Objectives

  • Produce a solution for data validation with machine learning.
  • Create an automated feedback function that can send an interpretation or explanation of possible errors to data suppliers.
  • Develop possible solutions for the medium-term integration of machine learning into the production environment.
  • Document a solution that is as scalable as possible so that it can be adapted and applied throughout the FSO.