ViTBackbone class

A Vision Transformer (ViT) backbone for extracting image features.

This model splits an image into patches, linearly embeds each patch, adds position embeddings, and passes the resulting sequence through a Transformer encoder. It outputs the contextualized embeddings of all patches (and optionally a CLS token) for downstream tasks such as object detection.
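
The patch-splitting step described above can be sketched as follows. This is a minimal illustration, not the library's own code: the helper name extractPatches is hypothetical, and the flat channel-major pixel layout (index = c*H*W + y*W + x) is an assumption, since this page does not state how imageData is ordered.

```dart
/// Hypothetical helper: splits a flat, channel-major image
/// (index = c*H*W + y*W + x, an assumed layout) into flattened square
/// patches, scanning the patch grid row by row.
List<List<double>> extractPatches(
  List<double> image, {
  required int imageSize,
  required int patchSize,
  int numChannels = 3,
}) {
  if (imageSize % patchSize != 0) {
    throw ArgumentError('imageSize must be divisible by patchSize');
  }
  if (image.length != numChannels * imageSize * imageSize) {
    throw ArgumentError('unexpected image length ${image.length}');
  }
  final perSide = imageSize ~/ patchSize;
  final patches = <List<double>>[];
  for (var py = 0; py < perSide; py++) {
    for (var px = 0; px < perSide; px++) {
      // Collect every channel of the patch at grid position (py, px).
      final patch = <double>[];
      for (var c = 0; c < numChannels; c++) {
        for (var dy = 0; dy < patchSize; dy++) {
          for (var dx = 0; dx < patchSize; dx++) {
            final y = py * patchSize + dy;
            final x = px * patchSize + dx;
            patch.add(image[c * imageSize * imageSize + y * imageSize + x]);
          }
        }
      }
      patches.add(patch);
    }
  }
  return patches;
}

void main() {
  // A 4x4 single-channel image whose pixel values equal their flat index.
  final image = List<double>.generate(16, (i) => i.toDouble());
  final patches =
      extractPatches(image, imageSize: 4, patchSize: 2, numChannels: 1);
  print(patches.length); // 4 patches, each of length 4
  print(patches.first); // the top-left patch holds pixels 0, 1, 4 and 5
}
```

Each flattened patch would then be mapped to an embedSize-dimensional vector by a linear projection, which is presumably the role of patchProjection.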

Constructors

ViTBackbone({required int imageSize, required int patchSize, int numChannels = 3, required int embedSize, int numLayers = 2, int numHeads = 4})
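
To see how the constructor arguments relate to one another, the sketch below derives the sizes a standard ViT would imply. The function vitShapes is hypothetical and not part of this class. The divisibility checks and the extra CLS slot in the sequence length follow common ViT conventions and are assumptions about this implementation.

```dart
/// Hypothetical helper: derives conventional ViT sizes from the same
/// arguments the ViTBackbone constructor takes.
Map<String, int> vitShapes({
  required int imageSize,
  required int patchSize,
  int numChannels = 3,
  required int embedSize,
  int numHeads = 4,
}) {
  // Assumed constraints, typical for ViT implementations.
  if (imageSize % patchSize != 0) {
    throw ArgumentError('imageSize must be divisible by patchSize');
  }
  if (embedSize % numHeads != 0) {
    throw ArgumentError('embedSize must be divisible by numHeads');
  }
  final perSide = imageSize ~/ patchSize;
  final numPatches = perSide * perSide;
  return {
    'numPatches': numPatches,
    // Length of one flattened patch before projection.
    'patchDim': patchSize * patchSize * numChannels,
    // Patches plus one CLS token (assumed to be prepended).
    'sequenceLength': numPatches + 1,
    // Per-head width inside multi-head attention.
    'headDim': embedSize ~/ numHeads,
  };
}

void main() {
  final shapes = vitShapes(imageSize: 32, patchSize: 8, embedSize: 64);
  print(shapes['numPatches']); // 16
  print(shapes['patchDim']); // 192
  print(shapes['sequenceLength']); // 17
  print(shapes['headDim']); // 16
}
```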

Properties

clsToken ValueVector
final
embedSize int
final
hashCode int
The hash code for this object.
no setter, inherited
imageSize int
final
numChannels int
final
numHeads int
final
numLayers int
final
patchProjection Layer
final
patchSize int
final
positionEmbeddings List<ValueVector>
final
runtimeType Type
A representation of the runtime type of the object.
no setter, inherited
transformerEncoder TransformerEncoder
final

Methods

forward(List<double> imageData) → List<ValueVector>
The forward pass for the ViT Backbone.
noSuchMethod(Invocation invocation) → dynamic
Invoked when a nonexistent method or property is accessed.
inherited
parameters() → List<Value>
override
toString() → String
A string representation of this object.
inherited
zeroGrad() → void
inherited
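
Because forward takes a flat List<double>, a caller holding pixels in nested form has to flatten them first. The sketch below shows one way to do that. The helper flattenImage is hypothetical, and the [channel][row][column] ordering is an assumption, so the layout should be checked against the implementation before relying on it.

```dart
/// Hypothetical helper: flattens pixels stored as [channel][row][column]
/// into one list, channel by channel and row by row (assumed layout).
List<double> flattenImage(List<List<List<double>>> pixels) {
  return [
    for (final channel in pixels)
      for (final row in channel) ...row,
  ];
}

void main() {
  // A 3-channel 4x4 image filled with placeholder values in [0, 1].
  final pixels = List.generate(
    3,
    (c) => List.generate(
      4,
      (y) => List.generate(4, (x) => (c * 16 + y * 4 + x) / 47.0),
    ),
  );
  final imageData = flattenImage(pixels);
  print(imageData.length); // 48 = numChannels * imageSize * imageSize

  // With a constructed backbone, the call would then look like:
  //   final tokens = backbone.forward(imageData);
}
```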

Operators

operator ==(Object other) → bool
The equality operator.
inherited