Jeunes Chercheurs Jeunes Chercheuses (JCJC) grant

InfClean: Effective Inference of Cleaning Programs from Data Annotations

Dates: Oct 2018 – Feb 2022

This project addresses a pressing need in data science applications: besides reliable models for decision making, we need data that has been processed from its original, raw state into a curated form, a process referred to as “data cleaning”. In this process, data engineers collaborate with domain experts to collect specifications, such as business rules on salaries, physical constraints for molecules, or representative training data. These specifications are then encoded in cleaning programs that are executed over the raw data to identify and fix errors. This human-centric process is expensive and, given the overwhelming amount of today’s data, is conducted with a best-effort approach that does not provide any formal guarantee on the ultimate quality of the data. The goal of InfClean is to rethink the data cleaning field from its assumptions with an inclusive formal framework that radically reduces the human effort in cleaning data.

More details and a list of publications are available at the project website.


Paolo Papotti
Associate Professor of Computer Science

His research covers the broad area of scalable data management, with a focus on data and information quality.