Data validation with machine learning

Last update: 30.04.2024


Background

Statistical offices carry out data validation to check the quality and reliability of administrative and survey data. Data that are clearly incorrect or at least questionable are sent back to the data suppliers with a correction request or comment. Until now, this validation has mainly been carried out at two levels: manual checks and automated processes using threshold values and logical tests. This two-way plausibility checking involves a great deal of work: in some cases staff have to check the data manually a second time, in other cases rules are applied that often require additional checks. The rule-based approach has grown out of previous experience, but it is neither necessarily exhaustive nor always precise.

Machine learning could make these checks faster and more accurate. In this approach, an algorithm is first trained on historical data. Based on a preliminary data analysis, a target variable is defined that the algorithm should be able to predict; only then can the algorithm be used for prediction. In the final stage, the predicted and actual values of the target variable are compared and the predictive accuracy is evaluated. Finally, a feedback mechanism sends an automatic explanation to the data suppliers.
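This workflow can be illustrated with a minimal, self-contained sketch. It uses synthetic data and scikit-learn purely as assumptions and is not the code actually used: a model is trained on historical records, the target variable is predicted for new records, and the predictions are compared with the actual values to evaluate predictive accuracy.

```python
# Minimal sketch of the general workflow, with synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for validated historical records and a new delivery.
X, y = make_classification(n_samples=2000, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)
X_hist, X_new, y_hist, y_new = train_test_split(X, y, test_size=0.25, random_state=0)

model = HistGradientBoostingClassifier(random_state=0)
model.fit(X_hist, y_hist)            # learn from historical data
y_pred = model.predict(X_new)        # predict the target variable for new records

# Compare predicted with actual values: disagreement on individual records
# points to values worth querying with the data supplier.
print("predictive accuracy:", accuracy_score(y_new, y_pred))
```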

Data and procedure

The annually updated database is an anonymised, linked dataset of university staff (including the institutes of technology) and student data from the Swiss University Information System covering the last four years, supplemented with further statistical key figures. An algorithm (gradient boosting machines) is trained to predict the personnel category of university staff for the current year. If the predicted personnel category does not match the reported one, a feedback mechanism is used to determine the variables that might be involved, and the situation is finally clarified with the universities. Whether the previous year's model can be reused is checked annually using several approaches (population stability and model monitoring).
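A hedged sketch of the flagging step is shown below. The feature and column names are placeholders, not the actual schema of the Swiss University Information System, and scikit-learn's HistGradientBoostingClassifier stands in for whichever gradient boosting implementation is used.

```python
# Illustrative sketch only: flag current-year records whose reported personnel
# category differs from the model's prediction. Column names are placeholders;
# features are assumed to be numeric.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier

def flag_category_mismatches(history: pd.DataFrame, current: pd.DataFrame,
                             features: list[str],
                             target: str = "personnel_category"):
    """Train on past validated records, return the model and the current-year
    records whose reported category differs from the prediction."""
    model = HistGradientBoostingClassifier(random_state=0)
    model.fit(history[features], history[target])

    predicted = model.predict(current[features])
    mismatch = predicted != current[target].to_numpy()

    # Mismatching records are candidates for the feedback loop with the universities.
    flagged = current.loc[mismatch].assign(predicted_category=predicted[mismatch])
    return model, flagged
```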

Results

To minimise the workload of the data providers, the feedback was also reviewed before dispatch: only problematic cases with a particularly high probability of containing an error were reported, and additional information was attached to these cases.
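One way to implement such a prioritisation, sketched under the same assumptions as above (placeholder column names, scikit-learn, and an assumed probability threshold), is to report only those mismatches where the model assigns a high probability to a different category, attaching the predicted category and that probability as additional information.

```python
# Hypothetical prioritisation of cases before dispatch; the threshold of 0.9
# is an assumption, not the value used in the project.
def prioritise_cases(model, current, features,
                     target="personnel_category", min_probability=0.9):
    proba = model.predict_proba(current[features])
    predicted = model.classes_[proba.argmax(axis=1)]   # most likely category
    confidence = proba.max(axis=1)                     # its predicted probability

    # Report only records where the model confidently disagrees with the
    # reported category.
    suspect = (predicted != current[target].to_numpy()) & (confidence >= min_probability)
    report = current.loc[suspect].assign(
        predicted_category=predicted[suspect],   # extra information sent with the case
        error_probability=confidence[suspect],
    )
    return report.sort_values("error_probability", ascending=False)
```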

Data providers were able to confirm that all selected problematic cases (potential errors) had been correctly identified, even infrequent ones. Further feedback from the data providers on possible problematic cases revealed that deviations may have several structural causes.

A module calculating a Population Stability Index (PSI) was added to compare the distributions between years and to assess whether the previous algorithm can still be used. The distributions per personnel category, per university and for the other data provided did not differ noticeably between the years.
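The PSI itself follows a standard formula; a minimal sketch (not the project's actual module) that compares the share of records per category between a reference year and the current year could look like this:

```python
# Standard PSI formula: sum over categories of
# (current share - reference share) * ln(current share / reference share).
import numpy as np
import pandas as pd

def population_stability_index(reference: pd.Series, current: pd.Series,
                               eps: float = 1e-6) -> float:
    categories = sorted(set(reference) | set(current))
    ref_share = reference.value_counts(normalize=True).reindex(categories, fill_value=0) + eps
    cur_share = current.value_counts(normalize=True).reindex(categories, fill_value=0) + eps
    return float(((cur_share - ref_share) * np.log(cur_share / ref_share)).sum())

# Common rule of thumb (an assumption, not an FSO threshold): PSI < 0.1 stable,
# 0.1-0.25 moderate shift, > 0.25 the previous model should probably be retrained.
```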

After each survey, the algorithm was retrained with the latest data. The very high accuracy of these annually retrained models remained stable, and the accuracy in predicting the personnel category hardly changed when the models were applied to data from other years.
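Such a cross-year check can be sketched as follows, assuming one DataFrame per survey year and the same placeholder names as above: each year's model is applied to every other year's data and the resulting accuracies are compared.

```python
# Hypothetical monitoring step: retrain per year, then evaluate every model on
# every year's data to see whether accuracy degrades across years.
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score

def cross_year_accuracy(data_by_year, features, target="personnel_category"):
    """Return {(train_year, test_year): accuracy} for all year combinations."""
    models = {
        year: HistGradientBoostingClassifier(random_state=0).fit(df[features], df[target])
        for year, df in data_by_year.items()
    }
    return {
        (train_year, test_year): accuracy_score(df[target], model.predict(df[features]))
        for train_year, model in models.items()
        for test_year, df in data_by_year.items()
    }
```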

The data quality of the personnel statistics is very high, which is why the machine learning plausibility checks have not been made a standard part of the regular data checks. If required by the universities, the algorithm can be used again.

When transferring the approach of this project to other projects, it is important first to identify any major structural differences in the input data so that they can, where necessary, be taken into account when the algorithm is adapted to a particular data set. The approach is also applicable under different framework conditions; it is scalable and can be reused with adaptations.


Documentation