TU Berlin

Database Systems and Information Management Group (DIMA)

Publications

Automating Data Quality Validation for Dynamic Data Ingestion
Citation key: RedyukKMS21
Authors: Sergey Redyuk, Zoi Kaoudi, Volker Markl, Sebastian Schelter
Year: 2021
Journal: EDBT
Abstract: Data quality validation is a crucial step in modern data-driven applications. Errors in the data lead to unexpected behavior of production pipelines and downstream services, such as deployed ML models or search engines. Typically, unforeseen data quality issues are handled via manual and tedious debugging processes in a reactive manner. The problem becomes more challenging in scenarios where large growing datasets have to be periodically ingested into non-relational stores such as data lakes. This is even worse when the characteristics of the data change over time, and domain expertise to define data quality constraints is lacking. We propose a data-centric approach to automate data quality validation in such scenarios. In contrast to existing solutions, our approach does not require domain experts to define rules and constraints or provide labeled examples, and self-adapts to temporal changes in the data characteristics. We compute a set of descriptive statistics of new data batches to ingest, and use a machine learning-based novelty detection method to monitor data quality and identify deviations from commonly observed data characteristics. We evaluate our approach against several baselines on five real-world datasets, on both real and synthetically generated errors. We show that our approach detects unspecified errors in many cases, outperforms other automated solutions in terms of predictive performance, and reaches the quality of baselines that are hand-tuned using domain expertise.
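
The general idea in the abstract (describe each incoming batch by a vector of descriptive statistics, then flag batches whose vector looks unlike those of previously accepted batches) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract names neither the statistics nor the detector, so the chosen features, the nearest-neighbor distance rule, the parameters `k` and `slack`, the `1e-3` spread floor, and all function names below (`batch_features`, `is_novel`, `make_batch`) are assumptions made purely for illustration.

```python
import math
import statistics


def batch_features(values):
    """Descriptive statistics for one numeric column of one batch.

    None marks a missing value. Assumes at least one value is present.
    The feature set is an illustrative assumption, not taken from the paper.
    """
    present = [v for v in values if v is not None]
    completeness = len(present) / len(values)
    return [completeness, statistics.fmean(present),
            statistics.pstdev(present), min(present), max(present)]


def _knn_distance(point, others, k):
    """Mean Euclidean distance from point to its k nearest neighbors."""
    dists = sorted(math.dist(point, o) for o in others)
    return statistics.fmean(dists[:k])


def is_novel(history, candidate, k=3, slack=1.5):
    """Flag candidate if it lies unusually far from accepted batches.

    history: feature vectors of previously accepted batches (no labels).
    Stand-in detector (assumption): k-NN distance compared against the
    largest leave-one-out k-NN distance observed within the history.
    """
    cols = list(zip(*history))
    means = [statistics.fmean(c) for c in cols]
    # Floor the spread so features that were constant so far do not divide by 0.
    spreads = [max(statistics.pstdev(c), 1e-3) for c in cols]

    def scale(vec):
        return [(x - m) / s for x, m, s in zip(vec, means, spreads)]

    hist = [scale(h) for h in history]
    baseline = max(_knn_distance(p, hist[:i] + hist[i + 1:], k)
                   for i, p in enumerate(hist))
    return _knn_distance(scale(candidate), hist, k) > slack * baseline


def make_batch(i, size=50):
    """Hypothetical deterministic batch with values between 95 and 105."""
    return [100 + ((7 * j + 3 * i) % 11) - 5 for j in range(size)]


history = [batch_features(make_batch(i)) for i in range(10)]
clean = batch_features(make_batch(10))
# Simulated ingestion error: 40% of the values go missing.
broken = batch_features(
    [None if j % 5 < 2 else v for j, v in enumerate(make_batch(10))])
```

In this sketch the threshold is derived from the accepted batches themselves, so no expert-defined constraint or labeled example is needed. Appending each accepted batch to `history` (and optionally dropping the oldest entries) would be one simple way to let the reference set follow gradual changes in the data.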
