Jeunes Chercheurs Jeunes Chercheuses (JCJC) grant
InfClean: Effective Inference of Cleaning Programs from Data Annotations
Dates: Oct 2018 – Feb 2022
This project addresses a pressing need in data science applications: besides reliable models for decision making, we need data that has been processed from its original, raw state into a curated form, a process referred to as “data cleaning”. In this process, data engineers collaborate with domain experts to collect specifications, such as business rules on salaries, physical constraints for molecules, or representative training data. These specifications are then encoded in cleaning programs that are executed over the raw data to identify and fix errors. This human-centric process is expensive and, given the overwhelming amount of today’s data, is conducted with a best-effort approach that does not provide any formal guarantee on the ultimate quality of the data. The goal of InfClean is to rethink the data cleaning field from its assumptions with an inclusive formal framework that radically reduces the human effort in cleaning data.
More details and a list of publications are available on the project website.