ml_algo 16.7.1

Machine learning algorithms, Machine learning models performance evaluation functionality

Machine learning algorithms for Dart developers - ml_algo library #

The library is a part of ecosystem:

What is ml_algo for? #

The main purpose of the library is to give a native Dart implementation of machine learning algorithms to those who are interested in both the Dart language and data science. The library targets the Dart VM and Flutter; it cannot be used in web applications.

The library's content #

  • Model selection

    • CrossValidator. Factory that creates instances of cross validators. Cross validation allows researchers to fit different hyperparameters of machine learning algorithms assessing prediction quality on different parts of a dataset.
  • Classification algorithms

    • LogisticRegressor. A class that performs linear binary classification of data. To use this kind of classifier your data has to be linearly separable.

    • SoftmaxRegressor. A class that performs linear multiclass classification of data. To use this kind of classifier your data has to be linearly separable.

    • DecisionTreeClassifier. A class that performs classification using decision trees. May work with data with non-linear patterns.

    • KnnClassifier. A class that performs classification using the k nearest neighbours algorithm: it makes a prediction based on the k observations closest to the given one.

  • Regression algorithms

    • LinearRegressor. A general class for finding a linear pattern in training data and predicting outcome as real numbers.

    • LinearRegressor.lasso. Implementation of the linear regression algorithm based on coordinate descent with lasso regularisation.

    • LinearRegressor.SGD. Implementation of the linear regression algorithm based on stochastic gradient descent with L2 regularisation.

    • KnnRegressor. A class that makes a prediction for each new observation based on the k closest observations from the training data. It may catch non-linear patterns in the data.

  • Clustering and retrieval algorithms

For more information on the library's API, please visit API reference

Examples #

Logistic regression #

Let's classify records from a well-known dataset, the Pima Indians Diabetes Database, using a logistic regressor.

Important note:

Pay attention to which problems the library's classifiers and regressors solve. For example, the logistic regressor solves only binary classification problems, so you can't use it with a dataset that has more than two classes. To find out more about the regressors and classifiers, please refer to the API documentation of the package.

Import all necessary packages. First, make sure that you have the ml_preprocessing and ml_dataframe packages in your dependencies:

dependencies:
  ml_dataframe: ^0.5.1
  ml_preprocessing: ^6.0.0

We need these packages to parse raw data before using it further. For more details, please visit the ml_preprocessing repository page.

Important note:

Regressors and classifiers exposed by the library do not handle strings, booleans or nulls; they can only deal with numbers. You must convert all such values in your dataset to numbers. Please refer to the ml_preprocessing library to find out more about data preprocessing.

import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';
import 'package:ml_preprocessing/ml_preprocessing.dart';

Read a dataset's file #

Download the dataset from Pima Indians Diabetes Database.

For a desktop application:

Just provide a proper path to your downloaded file and use a function-factory fromCsv from ml_dataframe package to read the file:

final samples = await fromCsv('datasets/pima_indians_diabetes_database.csv');

For a flutter application:

Be sure that you have ml_dataframe package version at least 0.5.1 and ml_algo package version at least 16.0.0 in your pubspec.yaml:

dependencies:
  ...
  ml_algo: ^16.0.0
  ml_dataframe: ^0.5.1
  ...

Then add the dataset to the flutter assets by adding the following config to the pubspec.yaml:

flutter:
  assets:
    - assets/datasets/pima_indians_diabetes_database.csv

You need to create the assets directory in the file system and put the dataset's file there. After that you can access the dataset:

import 'package:flutter/services.dart' show rootBundle;
import 'package:ml_dataframe/ml_dataframe.dart';

final rawCsvContent = await rootBundle.loadString('assets/datasets/pima_indians_diabetes_database.csv');
final samples = DataFrame.fromRawCsv(rawCsvContent);

Prepare datasets for training and testing #

The data in this file consists of 768 records and 8 features. The 9th column is the label column; it contains either 0 or 1 on each row. This column is our target: we should predict a class label for each observation. The column's name is class variable (0 or 1). Let's store it:

final targetColumnName = 'class variable (0 or 1)';

Now it's time to prepare data splits. Since we have a smallish dataset (only 768 records), a single split into train and test sets would give an unreliable estimate of the model's quality, so the best approach in our case is cross-validation. With that in mind, let's split the data in the following way using the library's splitData function:

final splits = splitData(samples, [0.7]);
final validationData = splits[0];
final testData = splits[1];

splitData accepts DataFrame instance as the first argument and ratio list as the second one. Now we have 70% of our data as a validation set and 30% as a test set for evaluating generalization error.
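
To make the ratio argument concrete, here is a minimal pure-Dart sketch of the same idea on a plain list. The helper name splitByRatio is hypothetical and is not part of ml_algo; the library's splitData may differ in details such as how the boundary is rounded.

```dart
/// Splits [items] into two consecutive parts: the first holds roughly
/// [ratio] of the items, the second holds the rest.
/// Hypothetical helper for illustration only; not part of ml_algo.
List<List<T>> splitByRatio<T>(List<T> items, double ratio) {
  if (ratio <= 0 || ratio >= 1) {
    throw ArgumentError.value(ratio, 'ratio', 'must be between 0 and 1');
  }
  // Rounding to the nearest row is an assumption of this sketch.
  final boundary = (items.length * ratio).round();
  return [items.sublist(0, boundary), items.sublist(boundary)];
}

void main() {
  final rows = List<int>.generate(10, (i) => i);
  final parts = splitByRatio(rows, 0.7);

  print(parts[0].length); // 7
  print(parts[1].length); // 3
}
```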

Set up a model selection algorithm #

Then we may create an instance of CrossValidator class to fit hyperparameters of our model. We should pass validation data (our validationData variable), and a number of folds into CrossValidator constructor.

final validator = CrossValidator.kFold(validationData, numberOfFolds: 5);
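
As a rough illustration of what k-fold validation does with the rows, the following pure-Dart sketch partitions row indices into folds. The helper kFoldIndices is hypothetical; it only shows the general technique and makes no claim about how CrossValidator is implemented internally.

```dart
/// Returns, for each fold, the row indices held out for validation.
/// Hypothetical helper for illustration only; not part of ml_algo.
List<List<int>> kFoldIndices(int rowCount, int numberOfFolds) {
  if (numberOfFolds < 2 || numberOfFolds > rowCount) {
    throw ArgumentError('numberOfFolds must be between 2 and rowCount');
  }
  final baseSize = rowCount ~/ numberOfFolds;
  final remainder = rowCount % numberOfFolds;
  final folds = <List<int>>[];
  var start = 0;
  for (var fold = 0; fold < numberOfFolds; fold++) {
    // The first `remainder` folds get one extra row.
    final size = baseSize + (fold < remainder ? 1 : 0);
    folds.add(List<int>.generate(size, (i) => start + i));
    start += size;
  }
  return folds;
}

void main() {
  const rowCount = 10;
  for (final validation in kFoldIndices(rowCount, 3)) {
    // Every row outside the validation fold is used for training.
    final training = [
      for (var i = 0; i < rowCount; i++)
        if (!validation.contains(i)) i
    ];
    print('train: $training, validate: $validation');
  }
}
```

Each fold serves as the validation part exactly once, so every row contributes to one of the collected scores.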

Let's create a factory for the classifier with the desired hyperparameters. After the cross validation we have to decide whether the selected hyperparameters are good enough or not:

final createClassifier = (DataFrame samples) =>
  LogisticRegressor(
    samples,
    targetColumnName,
    optimizerType: LinearOptimizerType.gradient,
    iterationsLimit: 90,
    learningRateType: LearningRateType.timeBased,
    batchSize: samples.rows.length,
    probabilityThreshold: 0.7,
  );

Let's describe our hyperparameters:

  • optimizerType - type of optimization algorithm that will be used to learn coefficients of our model, this time we decided to use vanilla gradient ascent algorithm
  • iterationsLimit - number of learning iterations. Selected optimization algorithm (gradient ascent in our case) will be run this amount of times
  • learningRateType - a strategy for learning rate update. In our case the learning rate will decrease after every iteration
  • batchSize - size of data (in rows) that will be used per each iteration. As we have a really small dataset we may use full-batch gradient ascent, that's why we used samples.rows.length here - the total amount of data.
  • probabilityThreshold - lower bound for positive label probability
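
To show how a probability threshold turns a predicted probability into a class label, here is a small pure-Dart sketch. The functions sigmoid and toLabel are hypothetical helpers, and the use of >= for the comparison is an assumption; the library may treat the boundary differently.

```dart
import 'dart:math' as math;

/// Logistic (sigmoid) function: maps any real score into (0, 1).
double sigmoid(double score) => 1 / (1 + math.exp(-score));

/// Returns 1 when [probability] reaches [threshold], otherwise 0.
/// Hypothetical helper; the >= comparison is an assumption of this sketch.
int toLabel(double probability, double threshold) =>
    probability >= threshold ? 1 : 0;

void main() {
  final probability = sigmoid(0.5); // about 0.62

  print(toLabel(probability, 0.5)); // 1
  print(toLabel(probability, 0.7)); // 0
}
```

A higher threshold, such as the 0.7 used above, makes the classifier more reluctant to assign the positive label.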

If we want to evaluate the learning process more thoroughly, we may pass collectLearningData argument to the classifier constructor:

final createClassifier = (DataFrame samples) =>
  LogisticRegressor(
    ...,
    collectLearningData: true,
  );

This argument activates collecting costs per each optimization iteration, and you can see the cost values right after the model creation.

Evaluate performance of the model #

Let's assume that we chose really good hyperparameters. To validate this hypothesis, let's use the CrossValidator instance created before:

final scores = await validator.evaluate(createClassifier, MetricType.accuracy);

Since the CrossValidator instance returns a Vector of scores as a result of our predictor evaluation, we may choose any way to reduce all the collected scores to a single number, for instance we may use Vector's mean method:

final accuracy = scores.mean();

Let's print the score:

print('accuracy on k fold validation: ${accuracy.toStringAsFixed(2)}');

We can see something like this:

accuracy on k fold validation: 0.65
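
For reference, accuracy is the fraction of correctly predicted labels, and the final number above is the mean of the per-fold scores. A minimal pure-Dart sketch with made-up labels (the helpers accuracy and mean are hypothetical, not library functions):

```dart
/// Fraction of predictions that match the actual labels.
double accuracy(List<int> predicted, List<int> actual) {
  if (predicted.isEmpty || predicted.length != actual.length) {
    throw ArgumentError('lists must be non-empty and of equal length');
  }
  var correct = 0;
  for (var i = 0; i < predicted.length; i++) {
    if (predicted[i] == actual[i]) correct++;
  }
  return correct / predicted.length;
}

/// Arithmetic mean of per-fold scores.
double mean(List<double> values) =>
    values.reduce((a, b) => a + b) / values.length;

void main() {
  // Made-up predictions for two folds, for illustration only.
  final foldScores = [
    accuracy([1, 0, 1, 1], [1, 0, 0, 1]), // 0.75
    accuracy([0, 0, 1, 1], [0, 1, 0, 1]), // 0.5
  ];

  print(mean(foldScores)); // 0.625
}
```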

Let's assess our hyperparameters on test set in order to evaluate the model's generalization error:

final testSplits = splitData(testData, [0.8]);
final classifier = createClassifier(testSplits[0]);
final finalScore = classifier.assess(testSplits[1], MetricType.accuracy);

The final score is like:

print(finalScore.toStringAsFixed(2)); // approx. 0.75

If we specified collectLearningData parameter, we may see costs per each iteration in order to evaluate how our cost changed from iteration to iteration during the learning process:

print(classifier.costPerIteration);

Write the model to a json file #

It seems that our model has a good generalization ability, which means we may use it in the future. To do so, we may store the model in a file as JSON:

await classifier.saveAsJson('diabetes_classifier.json');

After that we can simply read the model from the file and make predictions:

import 'dart:io';
import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';

final fileName = 'diabetes_classifier.json';
final file = File(fileName);
final encodedModel = await file.readAsString();
final classifier = LogisticRegressor.fromJson(encodedModel);
final unlabelledData = await fromCsv('some_unlabelled_data.csv');
final prediction = classifier.predict(unlabelledData);

print(prediction.header); // ('class variable (0 or 1)')
print(prediction.rows); // [ 
                        //   (1),
                        //   (0),
                        //   (0),
                        //   (1),
                        //   ...,
                        //   (1),
                        // ]

Please note that all the hyperparameters that we used to generate the model are persisted as the model's read-only fields, and we can access them at any time:

print(classifier.iterationsLimit);
print(classifier.probabilityThreshold);
// and so on

All the code for a desktop application:

import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';
import 'package:ml_preprocessing/ml_preprocessing.dart';

void main() async {
  final samples = await fromCsv('datasets/pima_indians_diabetes_database.csv', headerExists: true);
  final targetColumnName = 'class variable (0 or 1)';
  final splits = splitData(samples, [0.7]);
  final validationData = splits[0];
  final testData = splits[1];
  final validator = CrossValidator.kFold(validationData, numberOfFolds: 5);
  final createClassifier = (DataFrame samples) =>
    LogisticRegressor(
      samples,
      targetColumnName,
      optimizerType: LinearOptimizerType.gradient,
      iterationsLimit: 90,
      learningRateType: LearningRateType.timeBased,
      batchSize: samples.rows.length,
      probabilityThreshold: 0.7,
    );
  final scores = await validator.evaluate(createClassifier, MetricType.accuracy);
  final accuracy = scores.mean();
  
  print('accuracy on k fold validation: ${accuracy.toStringAsFixed(2)}');

  final testSplits = splitData(testData, [0.8]);
  final classifier = createClassifier(testSplits[0]);
  final finalScore = classifier.assess(testSplits[1], MetricType.accuracy);
  
  print(finalScore.toStringAsFixed(2));

  await classifier.saveAsJson('diabetes_classifier.json');
}

All the code for a flutter application:

import 'package:flutter/services.dart' show rootBundle;
import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';
import 'package:ml_preprocessing/ml_preprocessing.dart';

void main() async {
  final rawCsvContent = await rootBundle.loadString('assets/datasets/pima_indians_diabetes_database.csv');
  final samples = DataFrame.fromRawCsv(rawCsvContent);
  final targetColumnName = 'class variable (0 or 1)';
  final splits = splitData(samples, [0.7]);
  final validationData = splits[0];
  final testData = splits[1];
  final validator = CrossValidator.kFold(validationData, numberOfFolds: 5);
  final createClassifier = (DataFrame samples) =>
    LogisticRegressor(
      samples,
      targetColumnName,
      optimizerType: LinearOptimizerType.gradient,
      iterationsLimit: 90,
      learningRateType: LearningRateType.timeBased,
      batchSize: samples.rows.length,
      probabilityThreshold: 0.7,
    );
  final scores = await validator.evaluate(createClassifier, MetricType.accuracy);
  final accuracy = scores.mean();
  
  print('accuracy on k fold validation: ${accuracy.toStringAsFixed(2)}');

  final testSplits = splitData(testData, [0.8]);
  final classifier = createClassifier(testSplits[0]);
  final finalScore = classifier.assess(testSplits[1], MetricType.accuracy);
  
  print(finalScore.toStringAsFixed(2));

  await classifier.saveAsJson('diabetes_classifier.json');
}

Linear regression #

Let's try to predict house prices using linear regression and the famous Boston Housing dataset. The dataset contains 13 independent variables and 1 dependent variable - medv which is the target one (you can find the dataset in e2e/_datasets/housing.csv).

Again, first we need to download the file and create a dataframe. The dataset has no header row, so we may either use an autogenerated header or provide our own. Let's use the autogenerated header in our example:

For a desktop application:

Just provide a proper path to your downloaded file and use a function-factory fromCsv from ml_dataframe package to read the file:

final samples = await fromCsv('datasets/housing.csv', headerExists: false, columnDelimiter: ' ');

For a flutter application:

Add the dataset to the flutter assets by adding the following config to the pubspec.yaml:

flutter:
  assets:
    - assets/datasets/housing.csv

You need to create the assets directory in the file system and put the dataset's file there. After that you can access the dataset:

import 'package:flutter/services.dart' show rootBundle;
import 'package:ml_dataframe/ml_dataframe.dart';

final rawCsvContent = await rootBundle.loadString('assets/datasets/housing.csv');
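// headerExists: false matches the desktop example: the file has no header row
final samples = DataFrame.fromRawCsv(rawCsvContent, headerExists: false, fieldDelimiter: ' ');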

Prepare the dataset for training and testing #

The data in this file consists of 505 records and 13 features. The 14th column is the target. Since we use an autogenerated header, the target's name is autogenerated too: it is col_13. Let's store it in a variable:

final targetName = 'col_13';

Then let's shuffle the data:

samples.shuffle();

Now it's the time to prepare data splits. Let's split the data into train and test subsets using the library's splitData function:

final splits = splitData(samples, [0.8]);
final trainData = splits[0];
final testData = splits[1];

splitData accepts DataFrame instance as the first argument and ratio list as the second one. Now we have 80% of our data as a train set and 20% as a test set.

Let's train the model:

final model = LinearRegressor(trainData, targetName);

By default, LinearRegressor uses closed-form solution to train the model. One can also use a different solution type, e.g. stochastic gradient descent algorithm:

final model = LinearRegressor.SGD(
  trainData, // train on the training split only
  targetName,
  iterationLimit: 90,
);

or linear regression based on coordinate descent with Lasso regularization:

final model = LinearRegressor.lasso(
  trainData, // train on the training split only
  targetName,
  iterationLimit: 90,
);
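
For intuition about what a closed-form solution means, here is a pure-Dart sketch for the simplest case of a single feature, where the least-squares slope and intercept can be computed directly without any iterations. The helper fitLine is hypothetical and the data is made up; the library solves the general multi-feature case and its implementation is not shown here.

```dart
/// Fits y = slope * x + intercept by ordinary least squares.
/// Returns [slope, intercept]. Hypothetical helper for illustration only.
List<double> fitLine(List<double> xs, List<double> ys) {
  if (xs.length < 2 || xs.length != ys.length) {
    throw ArgumentError('need at least two points and equal lengths');
  }
  final n = xs.length;
  final meanX = xs.reduce((a, b) => a + b) / n;
  final meanY = ys.reduce((a, b) => a + b) / n;
  var covariance = 0.0;
  var variance = 0.0;
  for (var i = 0; i < n; i++) {
    covariance += (xs[i] - meanX) * (ys[i] - meanY);
    variance += (xs[i] - meanX) * (xs[i] - meanX);
  }
  // slope = cov(x, y) / var(x); the intercept follows from the means.
  final slope = covariance / variance;
  return [slope, meanY - slope * meanX];
}

void main() {
  // Points lying exactly on y = 2x + 1.
  final coefficients = fitLine([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0]);

  print(coefficients); // [2.0, 1.0]
}
```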

Next, we should evaluate performance of our model:

final error = model.assess(testData, MetricType.mape);

print(error);
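
MAPE stands for mean absolute percentage error. The following pure-Dart sketch shows the usual definition on made-up numbers. The helper mape is hypothetical, and returning a fraction rather than a value multiplied by 100 is an assumption; check the API reference for the exact scale used by MetricType.mape.

```dart
/// Mean absolute percentage error, returned as a fraction (0.1 means 10%).
/// Hypothetical helper; actual values must be non-zero.
double mape(List<double> predicted, List<double> actual) {
  if (predicted.isEmpty || predicted.length != actual.length) {
    throw ArgumentError('lists must be non-empty and of equal length');
  }
  var total = 0.0;
  for (var i = 0; i < actual.length; i++) {
    total += ((actual[i] - predicted[i]) / actual[i]).abs();
  }
  return total / actual.length;
}

void main() {
  // Both predictions are off by 10% of the actual value.
  print(mape([90.0, 220.0], [100.0, 200.0])); // 0.1
}
```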

If we are fine with the error, we can save the model for future use:

await model.saveAsJson('housing_model.json');

Later we may use our trained model for prediction:

import 'dart:io';
import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';

final file = File('housing_model.json');
final encodedModel = await file.readAsString();
final model = LinearRegressor.fromJson(encodedModel);
final unlabelledData = await fromCsv('some_unlabelled_data.csv');
final prediction = model.predict(unlabelledData);

print(prediction.header);
print(prediction.rows);

All the code for a desktop application:

import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';

void main() async {
  final samples = (await fromCsv('datasets/housing.csv', headerExists: false, columnDelimiter: ' '))
    ..shuffle();
  final targetName = 'col_13';
  final splits = splitData(samples, [0.8]);
  final trainData = splits[0];
  final testData = splits[1];
  final model = LinearRegressor(trainData, targetName);
  final error = model.assess(testData, MetricType.mape);
  
  print(error);

  await model.saveAsJson('housing_model.json');
}

All the code for a flutter application:

import 'package:flutter/services.dart' show rootBundle;
import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';

void main() async {
  final rawCsvContent = await rootBundle.loadString('assets/datasets/housing.csv');
  final samples = DataFrame.fromRawCsv(rawCsvContent, headerExists: false, fieldDelimiter: ' ')
    ..shuffle();
  final targetName = 'col_13';
  final splits = splitData(samples, [0.8]);
  final trainData = splits[0];
  final testData = splits[1];
  final model = LinearRegressor(trainData, targetName);
  final error = model.assess(testData, MetricType.mape);
    
  print(error);
  
  await model.saveAsJson('housing_model.json');
}

Models retraining #

Someday our previously shining model can degrade in terms of prediction accuracy - in this case we can retrain it. Retraining means simply re-running the same learning algorithm that was used to generate our current model keeping the same hyperparameters but using a new data set with the same features:

import 'dart:io';

final fileName = 'diabetes_classifier.json';
final file = File(fileName);
final encodedModel = await file.readAsString();
final classifier = LogisticRegressor.fromJson(encodedModel);

// ... 
// here we do something and realize that our classifier performance is not so good
// ...

final newData = await fromCsv('path/to/dataset/with/new/data/to/retrain/the/classifier');
final retrainedClassifier = classifier.retrain(newData);

The workflow with other predictors (SoftmaxRegressor, DecisionTreeClassifier and so on) is quite similar to the one described above for LogisticRegressor. Feel free to experiment with other models.

A couple of words about linear models which use gradient optimisation methods #

Sometimes you may get NaN or Infinity as the value of your score, or the score may be implausibly large or small. To prevent this, you need to find a proper value for the initial learning rate. You may also choose between the following learning rate strategies: constant, timeBased, stepBased and exponential:

final createClassifier = (DataFrame samples) =>
    LogisticRegressor(
      ...,
      initialLearningRate: 1e-5,
      learningRateType: LearningRateType.timeBased,
      ...,
    );
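
To give a feeling for how such strategies differ, the sketch below implements common textbook forms of the decay schedules in pure Dart. The function names, the decay, dropRate and stepSize parameters and the exact formulas are assumptions for illustration; the formulas used by the library may differ.

```dart
import 'dart:math' as math;

/// Rate shrinks as 1 / (1 + decay * iteration).
double timeBased(double initial, double decay, int iteration) =>
    initial / (1 + decay * iteration);

/// Rate is multiplied by [dropRate] once every [stepSize] iterations.
double stepBased(
        double initial, double dropRate, int stepSize, int iteration) =>
    initial * math.pow(dropRate, iteration ~/ stepSize);

/// Rate decays exponentially with the iteration number.
double exponential(double initial, double decay, int iteration) =>
    initial * math.exp(-decay * iteration);

void main() {
  const initial = 1e-5;
  for (final iteration in [0, 10, 50]) {
    // The constant strategy would keep `initial` for every iteration.
    print('iteration $iteration: '
        'timeBased=${timeBased(initial, 0.1, iteration)}, '
        'stepBased=${stepBased(initial, 0.5, 10, iteration)}, '
        'exponential=${exponential(initial, 0.1, iteration)}');
  }
}
```

With any decaying schedule, a too-large initial learning rate can still make the first few updates diverge, which is why the initial value matters as much as the strategy.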

Contacts #

If you have questions, feel free to text me on

Publisher

verified publisher ml-algo.com


Dependencies

collection, injector, json_annotation, ml_dataframe, ml_linalg, ml_preprocessing, quiver, xrange
