Data Preparation for Duplicate Detection

Ioannis Koumarelas, Lan Jiang, Felix Naumann

July 2020

Abstract

Data errors are a major issue in most application workflows. Before any important task can take place, a certain level of data quality has to be guaranteed by eliminating the different errors that may appear in the data. Typically, most of these errors are fixed with data preparation methods, such as whitespace removal. However, the particular error of duplicate records, where multiple records refer to the same entity, is usually eliminated independently with specialized techniques. Our work is the first to bring these two areas together by applying data preparation operations under a systematic approach prior to performing duplicate detection. Our workflow can be summarized as follows: it begins with the user providing as input a sample of the gold standard, the actual dataset, and optionally some constraints on domain-specific data preparations, such as address normalization. The preparation selection operates in two consecutive phases. First, to vastly reduce the search space, ineffective data preparations are discarded based on whether they improve or worsen pair similarities. Second, using the remaining data preparations, an iterative leave-one-out classification process removes preparations one by one and determines the redundant ones based on the achieved area under the precision-recall curve (AUC-PR). Using this workflow, we improve the results of duplicate detection by up to 19% in AUC-PR.
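The two-phase selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the preparation functions, the sample pairs, and the helper names (`phase_one`, `phase_two`, `auc_pr`) are hypothetical, the string similarity from Python's `difflib` stands in for whatever similarity measures the paper uses, the similarity score is ranked directly instead of training a classifier, and average precision is used as an approximation of AUC-PR.

```python
from difflib import SequenceMatcher

# Hypothetical preparations (assumption): name -> function applied to a string value.
PREPARATIONS = {
    "lowercase": str.lower,
    "strip_whitespace": lambda s: " ".join(s.split()),
    "remove_vowels": lambda s: "".join(c for c in s if c.lower() not in "aeiou"),
}

def similarity(a, b):
    """String similarity in [0, 1]; a stand-in for the paper's similarity measures."""
    return SequenceMatcher(None, a, b).ratio()

def apply_all(value, names):
    """Apply the named preparations to a value, in order."""
    for name in names:
        value = PREPARATIONS[name](value)
    return value

def average_precision(scores, labels):
    """Average precision, used here as an approximation of AUC-PR."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, is_dup) in enumerate(ranked, start=1):
        if is_dup:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def auc_pr(pairs, names):
    """Score each labeled pair by its similarity after preparation, then rank."""
    scores = [similarity(apply_all(a, names), apply_all(b, names)) for a, b, _ in pairs]
    return average_precision(scores, [label for _, _, label in pairs])

def phase_one(pairs, names):
    """Phase 1 (assumed rule): keep a preparation if it raises the similarity of
    duplicate pairs more than it raises the similarity of non-duplicate pairs."""
    kept = []
    for name in names:
        gain = 0.0
        for a, b, is_dup in pairs:
            delta = similarity(apply_all(a, [name]), apply_all(b, [name])) - similarity(a, b)
            gain += delta if is_dup else -delta
        if gain > 0:
            kept.append(name)
    return kept

def phase_two(pairs, names):
    """Phase 2: iterative leave-one-out. Drop a preparation as redundant if
    leaving it out does not lower the AUC-PR estimate, then repeat."""
    selected = list(names)
    changed = True
    while changed and selected:
        changed = False
        baseline = auc_pr(pairs, selected)
        for name in selected:
            reduced = [n for n in selected if n != name]
            if auc_pr(pairs, reduced) >= baseline:
                selected, changed = reduced, True
                break
    return selected

# Hypothetical gold-standard sample: (record A, record B, is_duplicate).
pairs = [
    ("John  Smith", "john smith", True),
    ("ACME Corp", "acme  corp", True),
    ("John Smith", "Jane Smyth", False),
    ("ACME Corp", "Apex Corp", False),
]

candidates = phase_one(pairs, list(PREPARATIONS))
selected = phase_two(pairs, candidates)
print("after phase 1:", candidates)
print("after phase 2:", selected)
```

Running the cheap similarity-based filter first means the costlier leave-one-out loop only evaluates preparations that already look useful, which mirrors the ordering of the two phases in the abstract.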

Type

Journal article

Publication

In ACM Journal of Data and Information Quality 2020

Ioannis Koumarelas

PhD graduate in Data Cleaning

My research interests include Data Cleaning, Artificial Intelligence, and Machine Learning.
