# 🚀 G-Tensor: High-Performance Dart & CUDA Deep Learning Engine

G-Tensor is a custom deep learning framework that combines the developer productivity of Dart with the raw computational power of NVIDIA CUDA.

Unlike standard wrappers, G-Tensor features a custom Attention Free Transformer (AFT) implementation, a manual Autograd engine, and hand-optimized CUDA kernels for operations like Causal Masking, Layer Normalization, and Cross-Entropy with Label Smoothing.

## 🏗 System Architecture

The engine is split into three distinct layers:

- Dart API (Frontend): High-level Tensor class with operator overloading (+, -, *, matmul) and Module classes for building neural networks.
- FFI Bridge: A low-level Dart FFI (Foreign Function Interface) layer that handles memory addresses and dispatches calls to compiled C++/CUDA binaries.
- CUDA Kernels (Backend): Hand-written .cu kernels optimized for parallel execution on the GPU, featuring custom broadcasting logic and stable gradient calculations.

## 🧠 The AFT Causal Mechanism (Mathematical Derivation)

The core of this engine is the Attention Free Transformer (AFT). Standard Multi-Head Attention builds a $T \times T$ attention matrix per head, which costs $O(T^2)$ memory. AFT never materializes that matrix and reduces the memory cost to $O(Td)$ by re-arranging the interaction between Queries, Keys, and Values.

### The Formulation

In this implementation, the attention-like operation is defined as:

$$Z_t = \sigma(Q_t) \odot \frac{\sum_{i=1}^t \exp(K_i + w_{t,i}) \odot V_i}{\sum_{i=1}^t \exp(K_i + w_{t,i})}$$

Where:

- $\sigma$ is the Sigmoid activation.
- $\odot$ is the element-wise (Hadamard) product (verified in test_tensor2.dart).
- $w_{t,i}$ is the learned pairwise position bias.

### The Causal Mask

To ensure the model cannot "cheat" by looking at future tokens, we apply a triangular causal mask.
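As a point of reference for the formulation above, the causal sum can be written out directly on the CPU. This is only a minimal sketch over plain Dart lists, not the G-Tensor kernel: the names `aftCausalForward` and `sigmoid` are hypothetical, the inputs are assumed to be `[T][d]` matrices with a `[T][T]` bias `w`, and the max-subtraction step is a standard stabilization added here as an assumption.

```dart
import 'dart:math' as math;

/// Numerically stable logistic sigmoid (hypothetical helper).
double sigmoid(double x) =>
    x >= 0 ? 1.0 / (1.0 + math.exp(-x)) : math.exp(x) / (1.0 + math.exp(x));

/// CPU reference for the causal AFT formula (hypothetical, not the G-Tensor API).
/// q, k, v have shape [T][d]; w has shape [T][T], and w[t][i] is read only for i <= t.
List<List<double>> aftCausalForward(
  List<List<double>> q,
  List<List<double>> k,
  List<List<double>> v,
  List<List<double>> w,
) {
  final seqLen = q.length;
  final dim = q[0].length;
  final z = List.generate(seqLen, (_) => List<double>.filled(dim, 0.0));
  for (var t = 0; t < seqLen; t++) {
    for (var j = 0; j < dim; j++) {
      // Find the largest logit among visible positions so exp() cannot overflow.
      var maxLogit = double.negativeInfinity;
      for (var i = 0; i <= t; i++) {
        maxLogit = math.max(maxLogit, k[i][j] + w[t][i]);
      }
      var numer = 0.0;
      var denom = 0.0;
      // Causal mask: the sums stop at i == t, so future positions never contribute.
      for (var i = 0; i <= t; i++) {
        final e = math.exp(k[i][j] + w[t][i] - maxLogit);
        numer += e * v[i][j];
        denom += e;
      }
      z[t][j] = sigmoid(q[t][j]) * numer / denom;
    }
  }
  return z;
}
```

Because the inner loops stop at `i == t`, changing any row of `k` or `v` after position `t` leaves `z[t]` unchanged, which is the property the mask is meant to guarantee. A reference like this can be handy for checking a GPU kernel's output on small inputs.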
In G-Tensor, this is handled by a specialized engine.mulTensors call during the forward pass, which ensures that for any time step $t$, positions $t+1 \dots T$ contribute nothing to $Z_t$ and therefore receive exactly zero gradient from it.

## ✨ Key Features

- Custom Autograd: Fully functional backpropagation through computational graphs.
- Efficient Memory Management: Explicit tracker and dispose system to prevent VRAM leaks in Dart's garbage-collected environment.
- Broadcasting: Support for adding row-vector biases to activation matrices via custom CUDA indexing (e.g., adding a [1, 128] bias to [64, 128] activations).
- Advanced Loss Kernels: Stable Cross-Entropy with built-in LogSoftmax and Label Smoothing ($\epsilon = 0.1$).

## 🛠 Installation & Setup

### Prerequisites

- Dart SDK (v3.0+)
- NVIDIA CUDA Toolkit (v11.0+)
- CMake (for building the C++ backend)

### Building the Backend

1. Navigate to the src directory.
2. Compile the CUDA shared library:

```bash
mkdir build && cd build
cmake ..
make
```
Ensure the generated .so or .dll is on your library search path (LD_LIBRARY_PATH on Linux, PATH on Windows).

## 💻 Usage Example

### 1. Training with Memory Management

Because Dart is garbage collected but CUDA memory is not, you must use the tracker pattern:

```dart
for (int step = 0; step < 1000; step++) {
  // Collects every intermediate tensor created during this step.
  List<Tensor> tracker = [];

  // Forward pass
  final logits = gpt.forward(inputIdx, dummyEnc, tracker);
  final loss = logits.crossEntropy(targetIds);

  // Backward pass
  loss.backward();
  optimizer.step();

  // CLEANUP: Free intermediate tensors to prevent CUDA OOM.
  // Model parameters are skipped so they survive into the next step.
  for (var t in tracker) {
    if (!gpt.parameters().contains(t)) t.dispose();
  }
  loss.dispose();
}
```

### 2. Autoregressive Generation

The engine supports greedy and nucleus sampling for text generation:

```dart
void generate(String prompt) {
  List
  // Fetch only the last row for prediction
  List
```
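The Key Features list mentions a stable Cross-Entropy with built-in LogSoftmax and Label Smoothing ($\epsilon = 0.1$). The sketch below shows one common way to compute that loss for a single example on the CPU. It is an illustration under assumptions, not the G-Tensor kernel: the function names are hypothetical, and the smoothing convention (mass $1 - \epsilon$ on the target plus $\epsilon / C$ spread over all $C$ classes) is assumed, since the source does not state which variant the kernel uses.

```dart
import 'dart:math' as math;

/// Log-softmax via the log-sum-exp trick (hypothetical helper).
/// Subtracting the max logit keeps exp() from overflowing.
List<double> logSoftmax(List<double> logits) {
  final maxLogit = logits.reduce(math.max);
  var sumExp = 0.0;
  for (final x in logits) {
    sumExp += math.exp(x - maxLogit);
  }
  final logZ = maxLogit + math.log(sumExp);
  return [for (final x in logits) x - logZ];
}

/// Label-smoothed cross-entropy for one example (hypothetical, assumed convention):
/// loss = (1 - epsilon) * (-log p[target]) + epsilon * mean_c(-log p[c]).
double smoothedCrossEntropy(
  List<double> logits,
  int target, {
  double epsilon = 0.1,
}) {
  final logP = logSoftmax(logits);
  var uniformTerm = 0.0;
  for (final lp in logP) {
    uniformTerm -= lp;
  }
  uniformTerm /= logits.length;
  return (1.0 - epsilon) * (-logP[target]) + epsilon * uniformTerm;
}
```

Working in log space means the loss stays finite even for very large logits, which is the usual motivation for fusing LogSoftmax into the loss rather than taking the log of a separately computed softmax.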
Libraries
- adam
- aft
- aft_cross_attention
- aft_multi_head_attention
- aft_multi_head_cross_attention
- aft_muzero_transformer_decoder
- aft_text_decoder_block
- aft_transformer_decoder
- aft_transformer_decoder_block
- aft_transformer_encoder
- aft_transformer_encoder_block
- aft_vit_backbone
- aft_vit_face_embeding
- apps/face_embeddings
- apps/face_training
- apps/images
- apps/triplet_loader
- apps/triplet_loader2
- audio_transformer
- chess/mcts
- chess/uci
- core/engine
- core/matrix
- core/tensor
- dart_cuda
- dataset/chess
- dataset/dataset
- example_audio_video
- feed_forward
- gpu_tensor
- hungarian_algorithm
- layer_norm
- main_face_gpu
- mlp
- mlp2
- mlp3
- mlp_learn
- mu_zero/example
- mu_zero/example2
- mu_zero/example3
- mu_zero/mu_zero_greedy_agent2
- mu_zero/muzero_greedy_agent
- mu_zero/shakespear_example
- mu_zero/training
- multi_modal_transformer
- multi_modal_transformer2
- multi_modal_trnasformer_encoder
- network_utils
- nn
- nn/conv_2d
- open_cv/open_cv
- optimizers/cross_entropy
- optimizers/stochastic_grad_desc
- overfit
- persistence
- tests/tensor/mat_mul
- tests/test_tensor
- tests/test_tensor2
- text_decoder
- text_transformer
- train_xor
- train_xor_2
- train_xor_3
- triplet_loss
- video_transformer
- vit_object_detector