ml_preprocessing 4.0.0


ml_preprocessing #

Data preprocessing algorithms

What is data preprocessing? #

Data preprocessing is a set of techniques for preparing data before it can be used in machine learning algorithms.

Why is it needed? #

Let's say you have a dataset:

    ----------------------------------------------------------------------------------------
    | Gender | Country | Height (cm) | Weight (kg) | Diabetes (1 - Positive, 0 - Negative) |
    ----------------------------------------------------------------------------------------
    | Female | France  |     165     |     55      |                    1                  |
    ----------------------------------------------------------------------------------------
    | Female | Spain   |     155     |     50      |                    0                  |
    ----------------------------------------------------------------------------------------
    | Male   | Spain   |     175     |     75      |                    0                  |
    ----------------------------------------------------------------------------------------
    | Male   | Russia  |     173     |     77      |                   N/A                 |
    ----------------------------------------------------------------------------------------

Everything seems fine so far. Say you're about to train a classifier to predict whether a person has diabetes. But there is an obstacle: how can the string-valued columns (Gender, Country) be used in mathematical equations? Things get even worse because of the empty (N/A) value in the Diabetes column. There should be a way to convert this data to a valid numerical representation. This is where data preprocessing techniques come into play. You have to decide how to convert string data (also known as categorical data) to numbers and how to treat empty values. Of course, you can come up with your own algorithms for all of these operations, but there are a number of well-known techniques that perform well for these conversions.

The aim of the library is to give data scientists who are interested in the Dart programming language these preprocessing techniques.
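To make the conversion problem concrete, here is a minimal plain-Dart sketch that turns the table above into numbers. It does not use this library at all: the helpers `uniqueValues` and `oneHot` are hypothetical names introduced only for illustration, one-hot encoding is just one of several possible encodings, and dropping the row with the missing Diabetes value is only one possible way to treat empty values.

```dart
// Plain-Dart illustration; it does not use ml_preprocessing or ml_dataframe.

/// Collects the distinct values of a column in order of first appearance.
List<String> uniqueValues(List<String> column) {
  final seen = <String>[];
  for (final value in column) {
    if (!seen.contains(value)) {
      seen.add(value);
    }
  }
  return seen;
}

/// Turns one categorical value into a one-hot vector over [categories]:
/// 1 at the position of the matching category, 0 everywhere else.
List<int> oneHot(String value, List<String> categories) =>
    [for (final category in categories) category == value ? 1 : 0];

void main() {
  // Rows from the table above: gender, country, height, weight, diabetes.
  final rows = [
    ['Female', 'France', '165', '55', '1'],
    ['Female', 'Spain', '155', '50', '0'],
    ['Male', 'Spain', '175', '75', '0'],
    ['Male', 'Russia', '173', '77', 'N/A'],
  ];

  // One possible treatment of empty values: drop rows without a target value.
  final complete = rows.where((row) => row[4] != 'N/A').toList();

  // Learn the categories of each string column from the remaining rows.
  final genders = uniqueValues([for (final row in complete) row[0]]);
  final countries = uniqueValues([for (final row in complete) row[1]]);

  // Every row becomes a purely numerical vector.
  for (final row in complete) {
    final numeric = <num>[
      ...oneHot(row[0], genders),
      ...oneHot(row[1], countries),
      num.parse(row[2]),
      num.parse(row[3]),
      num.parse(row[4]),
    ];
    print(numeric);
  }
}
```

Each printed row contains only numbers, so it can be fed into a numerical algorithm. The price of one-hot encoding is that every categorical column expands into as many columns as it has distinct values.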

Prerequisites #

The library depends on the DataFrame class from the ml_dataframe repo. You need to add it as a dependency in your project, because the data must be packed into a DataFrame before preprocessing. An example excerpt from pubspec.yaml:

dependencies:
  ...
  ml_dataframe: ^0.0.3
  ...

A simple usage example #

Let's download some data from Kaggle, for example the Black Friday dataset. It's an interesting dataset with a large number of observations (approx. 538000 rows) and a good number of categorical features.

First, import all necessary libraries:

import 'package:ml_dataframe/ml_dataframe.dart';
import 'package:ml_preprocessing/ml_preprocessing.dart';
import 'package:xrange/zrange.dart';

Then we should read the CSV file and create a data frame:

final dataFrame = await fromCsv('example/black_friday/black_friday.csv', 
  columns: [2, 3, 5, 6, 7, 11]);

Once we have a data frame, we can encode the features that need it. Let's analyze the dataset and decide which features should be encoded. In our case these are:

final featureNames = ['Gender', 'Age', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status'];

Let's fit the encoder.

Why should we fit it? Fitting a categorical data encoder is the process of finding all the unique category values in order to create a list of encoded labels. Once fitting is complete, the fitted encoder can be used on new data from the same source. To fit the encoder, create it and pass the fitting data to the constructor, along with the names of the features to be encoded:

final encoder = Encoder.oneHot(
  dataFrame,
  featureNames: featureNames,
);

Let's encode the features:

final encoded = encoder.encode(dataFrame);

We used the same data frame here. That's perfectly normal: when we created the encoder, we only fitted it with the data frame, and now we pass the data frame to the fitted encoder to get it encoded.
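The difference between fitting and encoding can be sketched in plain Dart. The class below is a hypothetical `LabelEncoderSketch`, not the library's `Encoder`: it assumes integer labels instead of one-hot vectors, works on a single column of made-up values, and simply throws on categories it has not seen during fitting.

```dart
/// Hypothetical label encoder for illustration; not the package's Encoder.
class LabelEncoderSketch {
  /// "Fitting": remember every distinct category found in [fittingData].
  LabelEncoderSketch(Iterable<String> fittingData)
      : _labels = _buildLabels(fittingData);

  final Map<String, int> _labels;

  /// Assigns 0, 1, 2, ... to categories in order of first appearance.
  static Map<String, int> _buildLabels(Iterable<String> values) {
    final labels = <String, int>{};
    for (final value in values) {
      labels.putIfAbsent(value, () => labels.length);
    }
    return labels;
  }

  /// "Encoding": replace each value with the label learned during fitting.
  /// Throws an ArgumentError for a value absent from the fitting data.
  List<int> encode(Iterable<String> values) => [
        for (final value in values)
          _labels[value] ?? (throw ArgumentError('Unknown category: $value')),
      ];
}

void main() {
  // Made-up values standing in for one categorical column.
  final cityCategory = ['A', 'C', 'B', 'A', 'C'];

  // Fit once on the available data...
  final encoder = LabelEncoderSketch(cityCategory);

  // ...then encode the same data, as in the example above...
  print(encoder.encode(cityCategory)); // [0, 1, 2, 0, 1]

  // ...or new data from the same source.
  print(encoder.encode(['B', 'B', 'A'])); // [2, 2, 0]
}
```

Because the labels are fixed at fitting time, the same category always maps to the same number, no matter which data is encoded later.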

It's time to take a look at our processed data! Let's read it:

final data = encoded.toMatrix();

print(data);

In the output we will see only numerical data, which is exactly what we wanted to achieve.

Changelog #

4.0.0 #

  • DataFrame class split up into separate smaller entities
  • DataFrame class core moved to separate repository
  • Pipeline entity created
  • Categorical data encoders implemented Pipeable interface

3.4.0 #

  • DataFrame: encodedColumnRanges added

3.3.0 #

  • ml_linalg 10.0.0 supported

3.2.0 #

  • ml_linalg 9.0.0 supported

3.1.0 #

  • Categorical data processing: encoders parameter added to DataFrame.fromCsv constructor

3.0.0 #

  • xrange library supported: it's possible to provide ZRange object now instead of tuple2 to specify a range of indices

2.0.0 #

  • DataFrame introduced

1.1.0 #

  • Float32x4InterceptPreprocessor added
  • readme updated

1.0.0 #

  • Package published

example/main.dart

import 'dart:async';

import 'package:ml_dataframe/ml_dataframe.dart';
import 'package:ml_preprocessing/ml_preprocessing.dart';
import 'package:ml_preprocessing/src/encoder/pipeable/label_encode.dart';
import 'package:ml_preprocessing/src/encoder/pipeable/one_hot_encode.dart';
import 'package:ml_preprocessing/src/pipeline/pipeline.dart';

Future main() async {
  final dataFrame = await fromCsv('example/dataset.csv',
      columns: [0, 1, 2, 3]);

  final pipeline = Pipeline(dataFrame, [
    encodeAsOneHotLabels(
      columnNames: ['position'],
      headerPostfix: '_position',
    ),
    encodeAsIntegerLabels(
      columnNames: ['country'],
    ),
  ]);

  print(pipeline.process(dataFrame).toMatrix());
}
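The idea behind the example above, a list of encoding steps that are fitted once and then applied in order, can be sketched without the library. Everything below is an assumption made for illustration: `PipelineSketch`, `StepFactory`, `Transform` and `labelEncode` are hypothetical names, rows are plain lists instead of a DataFrame, both columns get integer labels (the real example one-hot encodes `position`), unseen categories are mapped to -1, and the package's actual Pipeline may work differently internally.

```dart
/// A fitted step: turns rows into new rows.
typedef Transform = List<List<Object>> Function(List<List<Object>> rows);

/// An unfitted step: inspects the fitting rows and returns a fitted step.
typedef StepFactory = Transform Function(List<List<Object>> fittingRows);

/// Hypothetical pipeline: fits every step once, then applies them in order.
class PipelineSketch {
  PipelineSketch(List<List<Object>> fittingRows, List<StepFactory> factories)
      : _steps = [for (final create in factories) create(fittingRows)];

  final List<Transform> _steps;

  /// Feeds the output of each step into the next one.
  List<List<Object>> process(List<List<Object>> rows) =>
      _steps.fold<List<List<Object>>>(rows, (current, step) => step(current));
}

/// Hypothetical step: replaces the values in [column] with integer labels.
/// Values not seen during fitting are encoded as -1.
StepFactory labelEncode(int column) {
  return (fittingRows) {
    final labels = <Object, int>{};
    for (final row in fittingRows) {
      labels.putIfAbsent(row[column], () => labels.length);
    }
    return (rows) => [
          for (final row in rows)
            [
              for (var i = 0; i < row.length; i++)
                i == column ? (labels[row[i]] ?? -1) : row[i],
            ],
        ];
  };
}

void main() {
  // Made-up rows: position, country, age.
  final rows = <List<Object>>[
    ['manager', 'Spain', 30],
    ['developer', 'France', 25],
    ['developer', 'Spain', 41],
  ];

  final pipeline = PipelineSketch(rows, [labelEncode(0), labelEncode(1)]);

  print(pipeline.process(rows)); // [[0, 0, 30], [1, 1, 25], [1, 0, 41]]
}
```

Separating the step factory from the fitted step is what lets one list of steps be declared up front, fitted on one data set, and reused on another.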

Use this package as a library

1. Depend on it

Add this to your package's pubspec.yaml file:


dependencies:
  ml_preprocessing: ^4.0.0

2. Install it

You can install packages from the command line:

with pub:


$ pub get

with Flutter:


$ flutter pub get

Alternatively, your editor might support pub get or flutter pub get. Check the docs for your editor to learn more.

3. Import it

Now in your Dart code, you can use:


import 'package:ml_preprocessing/ml_preprocessing.dart';
  
  • Popularity: 31 (how popular the package is relative to other packages)
  • Health: 100 (code health derived from static analysis)
  • Maintenance: 90 (how tidy and up-to-date the package is)
  • Overall: 64 (weighted score of the above)

We analyzed this package on Sep 16, 2019, and provided a score, details, and suggestions below. Analysis was completed with status completed using:

  • Dart: 2.5.0
  • pana: 0.12.21

Platforms

Detected platforms: Flutter, other

Primary library: package:ml_preprocessing/ml_preprocessing.dart with components: io.

Health suggestions

Format lib/src/encoder/encoder.dart.

Run dartfmt to format lib/src/encoder/encoder.dart.

Format lib/src/encoder/encoder_impl.dart.

Run dartfmt to format lib/src/encoder/encoder_impl.dart.

Format lib/src/encoder/encoder_type.dart.

Run dartfmt to format lib/src/encoder/encoder_type.dart.

Fix additional 11 files with analysis or formatting issues.

Additional issues in the following files:

  • lib/src/encoder/helpers/create_encoder_to_series_mapping.dart (Run dartfmt to format lib/src/encoder/helpers/create_encoder_to_series_mapping.dart.)
  • lib/src/encoder/helpers/get_series_names_by_indices.dart (Run dartfmt to format lib/src/encoder/helpers/get_series_names_by_indices.dart.)
  • lib/src/encoder/pipeable/label_encode.dart (Run dartfmt to format lib/src/encoder/pipeable/label_encode.dart.)
  • lib/src/encoder/pipeable/one_hot_encode.dart (Run dartfmt to format lib/src/encoder/pipeable/one_hot_encode.dart.)
  • lib/src/encoder/series_encoder/label_series_encoder.dart (Run dartfmt to format lib/src/encoder/series_encoder/label_series_encoder.dart.)
  • lib/src/encoder/series_encoder/one_hot_series_encoder.dart (Run dartfmt to format lib/src/encoder/series_encoder/one_hot_series_encoder.dart.)
  • lib/src/encoder/series_encoder/series_encoder.dart (Run dartfmt to format lib/src/encoder/series_encoder/series_encoder.dart.)
  • lib/src/encoder/series_encoder/series_encoder_factory.dart (Run dartfmt to format lib/src/encoder/series_encoder/series_encoder_factory.dart.)
  • lib/src/encoder/series_encoder/series_encoder_factory_impl.dart (Run dartfmt to format lib/src/encoder/series_encoder/series_encoder_factory_impl.dart.)
  • lib/src/encoder/unknown_value_handling_type.dart (Run dartfmt to format lib/src/encoder/unknown_value_handling_type.dart.)
  • lib/src/pipeline/pipeline.dart (Run dartfmt to format lib/src/pipeline/pipeline.dart.)

Maintenance issues and suggestions

Support latest dependencies. (-10 points)

The version constraint in pubspec.yaml does not support the latest published versions for 1 dependency (ml_linalg).

Dependencies

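    -------------------------------------------------------------------
    | Package                 | Constraint     | Resolved | Available |
    -------------------------------------------------------------------
    | Direct dependencies     |                |          |           |
    | Dart SDK                | >=2.4.0 <3.0.0 |          |           |
    | ml_dataframe            | ^0.0.3         | 0.0.3    | 0.0.4     |
    | ml_linalg               | ^10.0.3        | 10.3.7   | 11.0.0    |
    | quiver                  | ^2.0.2         | 2.0.5    |           |
    | tuple                   | ^1.0.2         | 1.0.2    |           |
    | xrange                  | ^0.0.4         | 0.0.6    |           |
    -------------------------------------------------------------------
    | Transitive dependencies |                |          |           |
    | csv                     |                | 4.0.3    |           |
    | matcher                 |                | 0.12.5   |           |
    | meta                    |                | 1.1.7    |           |
    | path                    |                | 1.6.4    |           |
    | stack_trace             |                | 1.9.3    |           |
    -------------------------------------------------------------------
    | Dev dependencies        |                |          |           |
    | benchmark_harness       | >=1.0.0 <2.0.0 |          |           |
    | build_runner            | ^1.1.2         |          |           |
    | build_test              | ^0.10.2        |          |           |
    | grinder                 | ^0.8.3         |          |           |
    | ml_tech                 | ^0.0.5         |          |           |
    | mockito                 | ^3.0.0         |          |           |
    | test                    | ^1.2.0         |          |           |
    -------------------------------------------------------------------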