Getting started

The main Learn2Clean package contains the following sub-packages for data preprocessing: loading, normalization, feature-selection, outlier-detection, duplicate-detection, consistency-checking, imputation and qlearning. It also contains ML sub-packages for clustering, classification, and regression.

Here are a few lines to import the Learn2Clean modules used below:

import learn2clean.loading.reader as rd
import learn2clean.normalization.normalizer as nl
import learn2clean.feature_selection.feature_selector as fs
import learn2clean.duplicate_detection.duplicate_detector as dd
import learn2clean.outlier_detection.outlier_detector as od
import learn2clean.imputation.imputer as imp
import learn2clean.classification.classifier as cl

Then, you need to provide:

  • the list of paths to your train and test datasets,
  • the name of the target you are trying to predict (classification or regression),
  • or only one dataset, which Learn2Clean will split into train and test datasets.

paths = ["<file_1>.csv", "<file_2>.csv", ..., "<file_n>.csv"]  # to modify
target_name = "<my_target>"  # to modify
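When only one dataset is given, it has to be divided into a train part and a test part. The sketch below illustrates such a split with the standard library only. It is not Learn2Clean's code: the `split_rows` helper, the 0.25 test fraction and the fixed seed are assumptions chosen for illustration.

```python
import random

def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle rows and split them into (train, test) lists.

    Hypothetical helper: the test fraction and the seed are illustrative
    assumptions, not values taken from Learn2Clean.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                  # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Toy dataset: 8 rows with an id and a binary label.
rows = [{"id": i, "label": i % 2} for i in range(8)]
train, test = split_rows(rows)
print(len(train), len(test))
```

Every row ends up in exactly one of the two parts, and fixing the seed makes the split repeatable across runs.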

Now you can write your own pipeline

… to preprocess your files:

data = rd.Reader(sep=',',verbose = True, encoding = True).train_test_split(paths, target_name) #reading
# replace numerical missing values by median
d1 = imp.Imputer(dataset = data, strategy = 'MEDIAN', verbose = False).transform()
# decimal scaling for numerical variables
d2 = nl.Normalizer(dataset = d1, strategy = 'DS', exclude = None, verbose = False).transform()
# remove the outliers detected by LOF (threshold = 0.2)
d3 = od.Outlier_detector(dataset = d2, strategy = 'LOF', threshold= 0.2, verbose= False).transform()
# classify with LDA
cl.Classifier(dataset = d3, strategy = 'LDA', target = target_name, verbose = True).transform()
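To make the first two steps of this pipeline concrete, here is a minimal standalone sketch of median imputation and decimal scaling on a single numeric column. It uses only the standard library and is not Learn2Clean's implementation: the `impute_median` and `decimal_scale` helpers are hypothetical names, and decimal scaling is taken in its usual sense (divide by the smallest power of 10 that brings every absolute value below 1).

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]

def decimal_scale(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:  # find the smallest non-negative exponent
        j += 1
    return [v / 10 ** j for v in values]

column = [120.0, None, 45.0, 980.0, None]
filled = impute_median(column)  # the median of 45, 120 and 980 is 120
scaled = decimal_scale(filled)  # the largest value is 980, so divide by 10**3
print(filled)
print(scaled)
```

Imputing before scaling matters here: the scaling factor is computed from the largest absolute value, which cannot be evaluated while the column still contains missing entries.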

… or to test the automatic optimization of the preprocessing pipeline by Learn2Clean, based on Q-learning:

import learn2clean.qlearning.qlearner as ql
l2c_classif=ql.Qlearner(dataset = data, goal = 'LDA',target_goal = target_name,
   target_prepare=None, file_name = 'results_file_name', verbose = False)

… finally, Learn2Clean will select the best preprocessing strategy for the given ML task and its default quality metric.
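For readers unfamiliar with Q-learning, the sketch below shows the generic tabular update rule on a toy two-step pipeline. It is only an illustration of the technique: the state and action names ("start", "impute", "model") and the reward of 0.8 are hypothetical, and nothing here reflects how Learn2Clean actually encodes its states, actions or rewards.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q(state, action)."""
    # Value of the best action available in the next state (0 if terminal).
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]

q = defaultdict(float)  # every unseen (state, action) pair starts at 0

# Hypothetical episode, replayed backwards: running the model after
# imputation yields a reward of 0.8 (for example an accuracy score).
q_update(q, "impute", "model", reward=0.8, next_state="done", next_actions=[])
# The imputation step earns no direct reward but leads to a valuable state.
q_update(q, "start", "impute", reward=0.0, next_state="impute",
         next_actions=["model"])
print(dict(q))
```

The second update shows how the reward obtained at the end of a pipeline propagates back to earlier preprocessing choices, which is what lets a Q-learning agent rank whole sequences of steps.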

That’s all! You can have a look at the Jupyter notebook examples in the folder “examples”, and at the folder “save”, where you can find:

  • the results in ‘results_file_name’, including the best preprocessing strategy found by Learn2Clean,
  • the results of random preprocessing and of no preprocessing,
  • and the patterns and constraints discovered from the data by the Consistency_checker.