Welcome to Learn2Clean’s documentation


Learn2Clean: The Python library for optimizing data preprocessing and cleaning pipelines based on Q-Learning

Overview

Learn2Clean is a Python library for data preprocessing and cleaning based on Q-Learning, a model-free reinforcement learning technique. Given a dataset, an ML model, and a quality performance metric, it selects the optimal sequence of data preparation tasks such that the quality of the ML model's result is maximized.
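To make the idea concrete, here is a minimal sketch of tabular Q-Learning choosing an order for three preprocessing tasks. It is an illustration only and does not use the Learn2Clean API: the function names (toy_quality, learn_pipeline), the reward values, and the hyperparameters are all assumptions invented for this example. In a real setting, the reward would come from training the ML model on the prepared data and computing the chosen quality metric.

```python
import random

# Task labels borrowed from the abbreviations used in this overview:
# MM = Min-Max normalization, IQR = outlier detection, MF = imputation.
TASKS = ["MM", "IQR", "MF"]


def toy_quality(pipeline):
    """Hypothetical stand-in for 'train the ML model and score it'.

    Assumption: the scores are invented so that imputing first, then
    removing outliers, then normalizing is the single best ordering.
    """
    score = 0.5
    if pipeline.index("MF") < pipeline.index("IQR"):
        score += 0.2
    if pipeline.index("IQR") < pipeline.index("MM"):
        score += 0.2
    return score


def learn_pipeline(episodes=2000, alpha=0.5, gamma=1.0, epsilon=0.3, seed=0):
    """Learn a task ordering with tabular, epsilon-greedy Q-Learning."""
    rng = random.Random(seed)
    # Q-table: (state, action) -> estimated value.
    # A state is the tuple of tasks applied so far.
    q = {}

    def remaining(state):
        return [t for t in TASKS if t not in state]

    def best_action(state):
        return max(remaining(state), key=lambda a: q.get((state, a), 0.0))

    for _ in range(episodes):
        state = ()
        while remaining(state):
            # Explore with probability epsilon, otherwise exploit.
            if rng.random() < epsilon:
                action = rng.choice(remaining(state))
            else:
                action = best_action(state)
            next_state = state + (action,)

            if remaining(next_state):
                # Intermediate step: no reward yet, bootstrap from the Q-table.
                reward = 0.0
                future = max(q.get((next_state, a), 0.0)
                             for a in remaining(next_state))
            else:
                # Pipeline complete: the reward is the (toy) quality score.
                reward = toy_quality(next_state)
                future = 0.0

            # Standard Q-Learning update rule.
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (reward + gamma * future - old)
            state = next_state

    # Read off the greedy pipeline from the learned Q-table.
    state = ()
    while remaining(state):
        state = state + (best_action(state),)
    return list(state)


if __name__ == "__main__":
    print(learn_pipeline())
```

The state here is simply the sequence of tasks applied so far, which keeps the Q-table small enough to inspect by hand. This state encoding is a simplification chosen for the sketch, not a description of how the library represents its search space.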

_images/figure_Learn2Clean.jpeg

In Learn2Clean, representative preprocessing techniques are available for the following tasks:

  • Normalization. Min-Max (MM), Z-score (ZS), and decimal scale normalization (DS);
  • Feature selection. Filtering by a user-defined acceptable ratio of missing values (MR), removal of collinear features (LC), a wrapper subset evaluator (WR), and model-based feature selection (Tree-Based or SVC);
  • Imputation. Expectation-Maximization (EM), K-Nearest Neighbours (KNN), Multiple Imputation by Chained Equations (MICE), and replacement by the most frequent value (MF);
  • Outlier detection. Interquartile Range (IQR), the Z-score-based method (ZSB), and Local Outlier Factor (LOF);
  • Deduplication. Exact duplicate (ED) and approximate duplicate (AD) detection based on Jaccard similarity distance;
  • Consistency checking. Two methods based on constraint discovery and checking (CC) and pattern checking (PC).
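
For readers unfamiliar with some of the techniques listed above, the following sketch shows textbook formulations of the three normalization methods (MM, ZS, DS), IQR-based outlier detection, and token-level Jaccard similarity. These are standard-library illustrations only: the function names are hypothetical, the inputs are assumed to be non-constant numeric lists, and nothing here reflects how Learn2Clean implements these steps internally.

```python
import statistics


def min_max(values):
    """MM: rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """ZS: center on the mean and divide by the population std deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def decimal_scale(values):
    """DS: divide by the smallest power of 10 that brings every |v| below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


def iqr_outliers(values, k=1.5):
    """IQR: flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]


def jaccard_similarity(a, b):
    """Jaccard similarity of two strings, compared as sets of lowercase tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 1.0  # two empty strings are treated as identical
    return len(tokens_a & tokens_b) / len(union)


if __name__ == "__main__":
    print(min_max([2, 4, 6]))
    print(decimal_scale([12, -345, 7]))
    print(iqr_outliers([10, 12, 11, 13, 12, 95]))
    print(jaccard_similarity("data cleaning pipeline", "data cleaning library"))
```

For approximate duplicate detection, a pair of records would typically be flagged when its similarity exceeds a chosen threshold; the threshold value is a tuning decision and is not specified in this overview.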