dsbuild 0.1.0-alpha.2 copy "dsbuild: ^0.1.0-alpha.2" to clipboard
dsbuild: ^0.1.0-alpha.2 copied to clipboard

Dataset preparation for LM training.

Dataset Preparation for LM Training #

Designed to help users clean and prepare their datasets for use in language model (LM) training. It is built using the Dart programming language and utilizes YAML configuration to describe the steps used to construct the dataset. The goal is to enable easy review, replication, and iteration upon the dataset, making it easier for users to train high-quality language models.

This package can be used either as a standalone tool, or as a library to introduce complex build steps.

Key Features:

  • YAML configuration for easy review, replication, and iteration.
  • Common text transformers for pruning or stripping messages.

The processing pipeline is as follows:

Reader -> Preprocessor -> Concatenate -> Postprocessor -> Writer

The responsibilities for each component of the pipeline are as follows:

  • Readers: Responsible for acquiring and formatting the data in a unified way to be fed to the preprocessors. Readers output a MessageEnvelope stream. Each MessageEnvelope contains information about its parent conversation, in addition to the message itself. This step may be run in parallel for each input.

  • Preprocessors: Responsible for preparing the data for merge. Multiple preprocessors may be applied sequentially. Preprocessors receive and output a MessageEnvelope stream. This step runs sequentially after the Reader or a previous preprocessor, and runs in parallel for each input.

  • Concatenate: Data output from preprocessors are collected in the order of their input. This step operates on a MessageEnvelope stream and outputs a Conversation stream. During this step, all conversations are fully stored in RAM.

  • Postprocessors: Performs transformations on the final Conversation stream. This happens sequentially after the merge and is currently sequential for each Conversation. Unlike preprocessors, these receive the entire conversation in context. Receives and outputs a Conversation stream.

  • Writers: Formats and outputs the final Conversation stream.

For installing the tool from source see here.

1
likes
0
pub points
0%
popularity

Publisher

verified publishersquish.tech

Dataset preparation for LM training.

Repository (GitHub)
View/report issues

Topics

#ai #nlp #lm

License

unknown (license)

Dependencies

bloc, crypto, csv, html, http, logging, yaml

More

Packages that depend on dsbuild