ml_preprocessing 3.4.0+1


ml_preprocessing #

Data preprocessing algorithms

What is data preprocessing? #

Data preprocessing is a set of techniques for preparing data before it can be used in Machine Learning algorithms.

Why is it needed? #

Let's say you have a dataset:

    | Gender | Country | Height (cm) | Weight (kg) | Diabetes (1 - Positive, 0 - Negative) |
    | Female | France  |     165     |     55      |                    1                  |
    | Female | Spain   |     155     |     50      |                    0                  |
    | Male   | Spain   |     175     |     75      |                    0                  |
    | Male   | Russia  |     173     |     77      |                   N/A                 |

Everything seems fine so far. Say you're about to train a classifier to predict whether a person has diabetes. But there is an obstacle: how can the data be used in mathematical equations when some columns (Gender, Country) hold string values? Things get even worse because of the empty (N/A) value in the Diabetes column. There should be a way to convert this data to a valid numerical representation. This is where data preprocessing techniques come into play. You have to decide how to convert string data (also known as categorical data) to numbers and how to treat empty values. Of course, you can come up with your own algorithms for all of these operations, but there are a number of well-known, proven techniques for doing these conversions.
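To make the idea concrete, here is a minimal sketch in plain Dart that applies two common techniques to the dataset above: rows with a missing label are dropped, and the string columns are one-hot encoded. It does not use the ml_preprocessing API; the helper names (distinct, oneHot, encodeRows) are hypothetical, and dropping incomplete rows is only one of several possible policies for empty values.

```dart
// Illustrative only: plain Dart, not the ml_preprocessing API.

/// Distinct values in order of first appearance.
List<String> distinct(Iterable<String> values) => values.toSet().toList();

/// One-hot vector for [value]: 1.0 at the position of the matching category,
/// 0.0 everywhere else.
List<double> oneHot(String value, List<String> categories) =>
    [for (final category in categories) category == value ? 1.0 : 0.0];

/// Drops rows whose last column (the label) is 'N/A', one-hot encodes the
/// first two columns (Gender, Country) and parses the rest as numbers.
List<List<double>> encodeRows(List<List<String>> rows) {
  final complete = rows.where((row) => row.last != 'N/A').toList();
  final genders = distinct(complete.map((row) => row[0]));
  final countries = distinct(complete.map((row) => row[1]));
  return [
    for (final row in complete)
      [
        ...oneHot(row[0], genders),
        ...oneHot(row[1], countries),
        double.parse(row[2]),
        double.parse(row[3]),
        double.parse(row[4]),
      ],
  ];
}

void main() {
  final rows = [
    ['Female', 'France', '165', '55', '1'],
    ['Female', 'Spain', '155', '50', '0'],
    ['Male', 'Spain', '175', '75', '0'],
    ['Male', 'Russia', '173', '77', 'N/A'],
  ];
  // Three rows survive; each becomes a list of seven numbers.
  encodeRows(rows).forEach(print);
}
```

Note that dropping the incomplete row also removes Russia from the set of countries, which is one reason the choice of missing-value policy matters.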

In this library, all the data preprocessing operations are narrowed to just one entity - DataFrame.

DataFrame #

DataFrame is a factory that creates instances of different data adapters. For example, one can create a csv reader that makes working with csv data easier: you only need to point to where the dataset resides, and then you get the features and labels in a convenient, data-science-friendly format. One can also specify how to treat categorical data.

A simple usage example #

Let's download some data from Kaggle, for example the Black Friday dataset. It's an interesting dataset with a huge number of observations (approx. 538,000 rows) and a good number of categorical features.

First, import all necessary libraries:

import 'package:ml_preprocessing/ml_preprocessing.dart';
import 'package:xrange/zrange.dart';

Then, we should read the csv and create a data frame:

final dataFrame = DataFrame.fromCsv('example/black_friday/black_friday.csv',
  labelName: 'Purchase\r',
  columns: [ZRange.closed(2, 3), ZRange.closed(5, 7), ZRange.closed(11, 11)],
  rows: [ZRange.closed(0, 20)],
  categories: {
    'Gender': CategoricalDataEncoderType.oneHot,
    'Age': CategoricalDataEncoderType.oneHot,
    'City_Category': CategoricalDataEncoderType.oneHot,
    'Stay_In_Current_City_Years': CategoricalDataEncoderType.oneHot,
    'Marital_Status': CategoricalDataEncoderType.oneHot,
  },
);

The input parameters need some explanation:

  • labelName - the name of the column that contains the dependent variable (the label)
  • columns - a set of intervals specifying which columns to read
  • rows - the same as columns, but specifying which rows to read
  • categories - the columns that contain categorical data, and the encoders these columns should be processed with. In this particular case we want to encode all the categorical columns with a one-hot encoder
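The way a set of closed intervals picks out columns (or rows) can be sketched in plain Dart. This is an illustration only, not the library's implementation: ClosedRange and selectByRanges are hypothetical stand-ins, and the sketch assumes zero-based indices with both interval ends included.

```dart
// Illustrative only: a hypothetical stand-in for a closed integer range.
class ClosedRange {
  final int start;
  final int end;
  const ClosedRange(this.start, this.end);

  /// True if [index] lies within the range, both ends included.
  bool contains(int index) => index >= start && index <= end;
}

/// Keeps only the items whose index falls into at least one of [ranges].
List<T> selectByRanges<T>(List<T> items, List<ClosedRange> ranges) => [
      for (var i = 0; i < items.length; i++)
        if (ranges.any((range) => range.contains(i))) items[i],
    ];

void main() {
  // Header of a hypothetical 12-column csv, with columns named by index.
  final header = List.generate(12, (i) => 'col$i');
  const columns = [ClosedRange(2, 3), ClosedRange(5, 7), ClosedRange(11, 11)];
  print(selectByRanges(header, columns));
  // prints [col2, col3, col5, col6, col7, col11]
}
```

With the same three intervals as in the example above, six of the twelve columns are kept.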

It's time to take a look at our processed data! Let's read it:

final features = await dataFrame.features;
final labels = await dataFrame.labels;

print(features);
print(labels);

In the output we will see only numerical data, which is exactly what we wanted.

Changelog #

3.4.0 #

  • DataFrame: encodedColumnRanges added

3.3.0 #

  • ml_linalg 10.0.0 supported

3.2.0 #

  • ml_linalg 9.0.0 supported

3.1.0 #

  • Categorical data processing: encoders parameter added to DataFrame.fromCsv constructor

3.0.0 #

  • xrange library supported: it's possible to provide ZRange object now instead of tuple2 to specify a range of indices

2.0.0 #

  • DataFrame introduced

1.1.0 #

  • Float32x4InterceptPreprocessor added
  • readme updated

1.0.0 #

  • Package published


import 'dart:async';

import 'package:ml_preprocessing/ml_preprocessing.dart';
import 'package:xrange/zrange.dart';

Future main() async {
  // Let's create a data frame from a csv file.
  // `labelIdx: 3` means that the label (the dependent variable, in Machine
  // Learning terms) is the column with index 3
  // `headerExists: true` means that our csv file has a header row
  // `categories: {...}` means that we want to encode the values of the
  // `position` column with a one-hot encoder and the `country` column with
  // an ordinal encoder
  // `rows: [ZRange.closed(0, 6)]` means that we want to read the csv's rows
  // with indices from 0 to 6
  // `columns: [ZRange.closed(0, 3)]` means that we want to read the csv's
  // columns with indices from 0 to 3
  final data = DataFrame.fromCsv('example/dataset.csv', labelIdx: 3,
    headerExists: true,
    categories: {
      'position': CategoricalDataEncoderType.oneHot,
      'country': CategoricalDataEncoderType.ordinal,
    },
    rows: [ZRange.closed(0, 6)],
    columns: [ZRange.closed(0, 3)],
  );

  // Let's read the header of the dataset, preprocessed features and labels
  final header = await data.header;
  final features = await data.features;
  final labels = await data.labels;

  // And print the result
  print(header);
  print(features);
  print(labels);

  // That's all you have to do to use the data further in different
  // applications
}
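
The example above uses an ordinal encoder for the `country` column. As a rough illustration of what ordinal encoding means, the plain-Dart sketch below assigns each distinct category an integer code. It is not the library's implementation: ordinalCodes is a hypothetical helper, and numbering the codes from 0 in order of first appearance is an assumption.

```dart
// Illustrative only: plain Dart, not the ml_preprocessing API.

/// Assigns each distinct category an integer code, starting from 0,
/// in order of first appearance.
Map<String, int> ordinalCodes(Iterable<String> values) {
  final codes = <String, int>{};
  for (final value in values) {
    // The next free code equals the number of categories seen so far.
    codes.putIfAbsent(value, () => codes.length);
  }
  return codes;
}

void main() {
  final countries = ['France', 'Spain', 'Spain', 'Russia'];
  final codes = ordinalCodes(countries);
  print(codes); // prints {France: 0, Spain: 1, Russia: 2}
  print(countries.map((country) => codes[country]).toList());
  // prints [0, 1, 1, 2]
}
```

Unlike one-hot encoding, ordinal codes take a single column but impose an order on the categories, so they suit models and features for which such an order is harmless or meaningful.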

Use this package as a library

1. Depend on it

Add this to your package's pubspec.yaml file:

dependencies:
  ml_preprocessing: ^3.4.0+1

2. Install it

You can install packages from the command line:

with pub:

$ pub get

with Flutter:

$ flutter pub get

Alternatively, your editor might support pub get or flutter pub get. Check the docs for your editor to learn more.

3. Import it

Now in your Dart code, you can use:

import 'package:ml_preprocessing/ml_preprocessing.dart';

We analyzed this package on Jul 17, 2019, and provided a score, details, and suggestions below. Analysis was completed with status completed using:

  • Dart: 2.4.0
  • pana: 0.12.19


Detected platforms: Flutter, other

Primary library: package:ml_preprocessing/ml_preprocessing.dart with components: io.

Health suggestions

Fix lib/src/categorical_encoder/one_hot_encoder.dart. (-0.50 points)

Analysis of lib/src/categorical_encoder/one_hot_encoder.dart reported 1 hint:

line 1 col 8: Unused import: 'dart:typed_data'.

Fix lib/src/categorical_encoder/ordinal_encoder.dart. (-0.50 points)

Analysis of lib/src/categorical_encoder/ordinal_encoder.dart reported 1 hint:

line 1 col 8: Unused import: 'dart:typed_data'.

Fix lib/src/data_frame/data_frame.dart. (-0.50 points)

Analysis of lib/src/data_frame/data_frame.dart reported 1 hint:

line 10 col 3: Prefer using /// for doc comments.

Fix additional 14 files with analysis or formatting issues.

Additional issues in the following files:

  • lib/src/categorical_encoder/encoder_factory_impl.dart (Run dartfmt to format lib/src/categorical_encoder/encoder_factory_impl.dart.)
  • lib/src/categorical_encoder/encoder_mixin.dart (Run dartfmt to format lib/src/categorical_encoder/encoder_mixin.dart.)
  • lib/src/data_frame/csv_data_frame.dart (Run dartfmt to format lib/src/data_frame/csv_data_frame.dart.)
  • lib/src/data_frame/encoders_processor/encoders_processor.dart (Run dartfmt to format lib/src/data_frame/encoders_processor/encoders_processor.dart.)
  • lib/src/data_frame/encoders_processor/encoders_processor_factory.dart (Run dartfmt to format lib/src/data_frame/encoders_processor/encoders_processor_factory.dart.)
  • lib/src/data_frame/encoders_processor/encoders_processor_factory_impl.dart (Run dartfmt to format lib/src/data_frame/encoders_processor/encoders_processor_factory_impl.dart.)
  • lib/src/data_frame/encoders_processor/encoders_processor_impl.dart (Run dartfmt to format lib/src/data_frame/encoders_processor/encoders_processor_impl.dart.)
  • lib/src/data_frame/header_extractor/header_extractor_factory_impl.dart (Run dartfmt to format lib/src/data_frame/header_extractor/header_extractor_factory_impl.dart.)
  • lib/src/data_frame/header_extractor/header_extractor_impl.dart (Run dartfmt to format lib/src/data_frame/header_extractor/header_extractor_impl.dart.)
  • lib/src/data_frame/index_ranges_combiner/index_ranges_combiner_factory_impl.dart (Run dartfmt to format lib/src/data_frame/index_ranges_combiner/index_ranges_combiner_factory_impl.dart.)
  • lib/src/data_frame/index_ranges_combiner/index_ranges_combiner_impl.dart (Run dartfmt to format lib/src/data_frame/index_ranges_combiner/index_ranges_combiner_impl.dart.)
  • lib/src/data_frame/validator/error_messages.dart (Run dartfmt to format lib/src/data_frame/validator/error_messages.dart.)
  • lib/src/data_frame/validator/params_validator_impl.dart (Run dartfmt to format lib/src/data_frame/validator/params_validator_impl.dart.)
  • lib/src/data_frame/variables_extractor/variables_extractor_impl.dart (Run dartfmt to format lib/src/data_frame/variables_extractor/variables_extractor_impl.dart.)


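Direct dependencies

    | Package   | Constraint     | Resolved | Available |
    | Dart SDK  | >=2.3.0 <3.0.0 |          |           |
    | csv       | ^4.0.0         | 4.0.3    |           |
    | ml_linalg | ^10.0.3        | 10.3.7   |           |
    | tuple     | ^1.0.2         | 1.0.2    |           |
    | xrange    | ^0.0.4         | 0.0.6    |           |

Transitive dependencies

    | Package     | Constraint | Resolved | Available |
    | matcher     |            | 0.12.5   |           |
    | meta        |            | 1.1.7    |           |
    | path        |            | 1.6.2    |           |
    | quiver      |            | 2.0.3    |           |
    | stack_trace |            | 1.9.3    |           |

Dev dependencies

    | Package           | Constraint     | Resolved | Available |
    | benchmark_harness | >=1.0.0 <2.0.0 |          |           |
    | build_runner      | ^1.1.2         |          |           |
    | build_test        | ^0.10.2        |          |           |
    | mockito           | ^3.0.0         |          |           |
    | test              | ^1.2.0         |          |           |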