transform method

Stream<Chunk> transform(
  Stream<String> rawFeed
)

Transforms a text stream into non-overlapping chunks of approximately equal size.

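This method breaks the input text stream into chunks that are close to the target chunk size (chunkSize). It first uses the cleanChunks extension from the bpe package to prepare the text, requesting pieces of about chunkSize ~/ 2 characters with a grace of chunkSize ~/ 4 (both at least 1). It then buffers those pieces and, once the buffered length reaches chunkSize, emits everything buffered before the latest piece as one chunk. The latest piece starts the next chunk, and any remaining buffered text is emitted as a final chunk when the stream ends.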

@param rawFeed The input stream of text.
@return A stream of non-overlapping chunks.
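
The buffering behaviour described above can be tried in isolation with the minimal sketch below. It is not the library's code: the Chunk class (apart from the length field, which the implementation reads) and its field names are hypothetical stand-ins, fixedPieces is a hypothetical replacement for cleanChunks that only re-cuts the text into fixed-size pieces without cleaning it or applying a grace, and chunkSize is passed as a parameter rather than read from the enclosing object.

```dart
import 'dart:math';

/// Hypothetical stand-in for the library's Chunk type; the positional
/// order (id, start, length, text) is assumed from the constructor calls.
class Chunk {
  final int id;
  final int start;
  final int length;
  final String text;
  const Chunk(this.id, this.start, this.length, this.text);
}

/// Hypothetical stand-in for cleanChunks: re-cuts the feed into pieces of
/// exactly [size] characters (the last piece may be shorter).
Stream<String> fixedPieces(Stream<String> feed, int size) async* {
  var pending = '';
  await for (final s in feed) {
    pending += s;
    while (pending.length >= size) {
      yield pending.substring(0, size);
      pending = pending.substring(size);
    }
  }
  if (pending.isNotEmpty) yield pending;
}

/// Sketch of the buffering loop, with chunkSize as an explicit parameter.
Stream<Chunk> transformSketch(Stream<String> rawFeed, int chunkSize) async* {
  var start = 0;
  var lengthBuffer = 0;
  var buffer = <String>[];
  var id = 0;
  await for (final piece in fixedPieces(rawFeed, max(1, chunkSize ~/ 2))) {
    buffer.add(piece);
    lengthBuffer += piece.length;
    if (lengthBuffer >= chunkSize && lengthBuffer - piece.length <= chunkSize) {
      // Emit everything buffered before the latest piece.
      final c = Chunk(id++, start, lengthBuffer - piece.length,
          buffer.sublist(0, buffer.length - 1).join());
      start += c.length;
      lengthBuffer = piece.length;
      buffer = [piece]; // the latest piece starts the next chunk
      yield c;
    }
  }
  // Flush whatever is left when the feed ends.
  if (buffer.isNotEmpty) {
    yield Chunk(id++, start, lengthBuffer, buffer.join());
  }
}

Future<void> main() async {
  final feed = Stream.fromIterable(['abcdefghij', 'klmnopqrst', 'uvwxyz']);
  await for (final c in transformSketch(feed, 10)) {
    print('${c.id} start=${c.start} length=${c.length} "${c.text}"');
  }
}
```

With a target of 10 and fixed 5-character pieces, this sketch yields five chunks: "abcde", "fghij", "klmno", "pqrst", and a final "uvwxyz" of length 6 starting at offset 20. Because each emitted chunk leaves out the piece that pushed the buffer to the target, chunks here come out shorter than chunkSize; with the real cleanChunks pieces, sizes may differ.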

Implementation

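Stream<Chunk> transform(Stream<String> rawFeed) async* {
  int start = 0; // sum of the lengths of the chunks emitted so far
  int lengthBuffer = 0; // total length of the pieces currently in buffer
  List<String> buffer = []; // pieces collected for the current chunk
  int id = 0; // sequential chunk id
  // Re-cut the feed into pieces of about half the target chunk size.
  await for (String i in rawFeed.cleanChunks(
    size: max(1, chunkSize ~/ 2),
    grace: max(1, chunkSize ~/ 4),
  )) {
    buffer.add(i);
    lengthBuffer += i.length;

    // Emit once the buffer reaches chunkSize, provided the buffer without
    // the latest piece does not itself exceed chunkSize.
    if (lengthBuffer >= chunkSize && lengthBuffer - i.length <= chunkSize) {
      // The chunk holds everything buffered before i. If i is the only
      // buffered piece, the chunk text is empty.
      Chunk c = Chunk(
        id++,
        start,
        lengthBuffer - i.length,
        buffer.sublist(0, buffer.length - 1).join(),
      );
      start += c.length;
      // The latest piece starts the next chunk.
      lengthBuffer = i.length;
      buffer = [i];
      yield c;
    }
  }

  // Flush whatever is left when the feed ends.
  if (buffer.isNotEmpty) {
    yield Chunk(id++, start, lengthBuffer, buffer.join());
  }
}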