TU Berlin

Database Systems and Information Management GroupPublications

Logo FG DIMA-new  65px

Page Content

to Navigation

Publications

Automating Data Quality Validation for Dynamic Data Ingestion
Citation key RedyukKMS21
Author Sergey Redyuk, Zoi Kaoudi, Volker Markl, Sebastian Schelter
Year 2021
Journal EDBT
Abstract Data quality validation is a crucial step in modern data-driven applications. Errors in the data lead to unexpected behavior of production pipelines and downstream services, such as deployed ML models or search engines. Typically, unforeseen data quality issues are handled via manual and tedious debugging processes in a reactive manner. The problem becomes more challenging in scenarios where large growing datasets have to be periodically ingested into non-relational stores such as data lakes. This is even worse when the characteristics of the data change over time, and domain expertise to define data quality constraints is lacking. We propose a data-centric approach to automate data quality validation in such scenarios. In contrast to existing solutions, our approach does not require domain experts to define rules and constraints or provide labeled examples, and self-adapts to temporal changes in the data characteristics. We compute a set of descriptive statistics of new data batches to ingest, and use a machine learning-based novelty detection method to monitor data quality and identify deviations from commonly observed data characteristics. We evaluate our approach against several baselines on five real-world datasets, on both real and synthetically generated errors. We show that our approach detects unspecified errors in many cases, outperforms other automated solutions in terms of predictive performance, and reaches the quality of baselines that are hand-tuned using domain expertise.
Link to original publication Download Bibtex entry

Navigation

Quick Access

Schnellnavigation zur Seite über Nummerneingabe