Getting started¶
Learn2Clean main package contains the following sub-packages for data preprocessing: loading, normalization, feature-selection, outlier-detection, duplicate-detection, consistency-checking, imputation and qlearning. And ML packages for clustering, classification, and regression.
Here are a few lines to import Learn2Clean:
import learn2clean.normalization.normalizer as nl
import learn2clean.feature_selection.feature_selector as fs
import learn2clean.duplicate_detection.duplicate_detector as dd
import learn2clean.outlier_detection.outlier_detector as od
import learn2clean.imputation.imputer as imp
import learn2clean.classification.classifier as cl
Then, you need to give :
- the list of paths to your train datasets and test datasets
- the name of the target you try to predict (classification or regression)
- or you can submit only one dataset and Leanr2Clean will split it into train and test datasets
paths = ["<file_1>.csv", "<file_2>.csv", ..., "<file_n>.csv"] #to modify
target_name = "<my_target>" #to modify
Now, you can write your own pipeline
… to preprocess your files :
data = rd.Reader(sep=',',verbose = True, encoding = True).train_test_split(paths, target_name) #reading
# replace numerical missing values by median
d1 = imp.Imputer(dataset = data, strategy = 'MEDIAN', verbose = False).transform()
# decimal scaling for numerical variables
d2 = nl.Normalizer(dataset = d1, strategy = 'DS', exclude = None, verbose = False).transform()
# eliminate 20 LOF outliers
d3 = od.Outlier_detector(dataset = d2, strategy = 'LOF', threshold= 0.2, verbose= False).transform()
# classify with LDA
cl.Classifier(dataset = d3, strategy = 'LDA', target = target_name, verbose = True).transform()
… or to test the automatic optimization the preprocessing pipeline by Learn2Clean based on Qlearning:
import learn2clean.qlearning.qlearner as ql
l2c_classif=ql.Qlearner(dataset = data, goal = 'LDA',target_goal = target_name,
target_prepare=None, file_name = 'results_file_name', verbose = False)
l2c_classif.learn2clean()
… finally, Learn2Clean will select the best preprocessing strategy for the given ML task and its by-default quality metric.
That’s all ! You can have a look at the Jupyter notebook examples in the folder “examples” and also the folder “save” where you can find :
- the results in ‘results_file_name’ including the best processing strategy found by Learn2Clean,
- the results of random preprocessing and no-preprocessing,
- and the discovered patterns and constraints from the data by the Consistency_checker