🕯️ Kandle
JavaScript Native PyTorch-aligned Machine Learning Framework
Bringing the true PyTorch experience to the JavaScript ecosystem
Quick Start • Core Features • Example Projects • Architecture • Roadmap
📖 Introduction
Kandle is a JavaScript Native machine learning framework that adopts an Eager Mode (dynamic graph) execution pattern, deeply referencing PyTorch’s ATen/c10 architectural design. I view PyTorch not just as a Python framework, but as the API specification standard for modern AI frameworks. Kandle is dedicated to implementing an API system highly aligned with PyTorch within the JavaScript ecosystem.
🎯 Core Value Proposition
- 🔄 Dynamic Graph Execution: True Eager Mode, supporting layer-by-layer debugging, intermediate state inspection, and dynamic control flow.
- 🎨 PyTorch API Alignment: Aligned at the architectural level rather than simple API wrapping, reducing migration costs and learning curves.
- ⚡ Hybrid Backend Architecture: Supports both WebGPU (GPU acceleration) and pure JS (CPU computation) backends under a unified interface.
- 🧩 Complete Tensor System: Implements a full Stride mechanism, broadcasting, view operations, and non-contiguous memory support.
- 🎵 Rich Operator Library: 200+ tensor operations covering arithmetic, linear algebra, convolution, FFT, audio processing, and more.
- 🚀 Out-of-the-Box Models: Native support for mainstream models like Qwen3 and Whisper, capable of loading Safetensor weights directly.
💡 Why Choose Kandle?
Current inference engines in the JavaScript ecosystem, such as ONNX Runtime and WebLLM, are excellent but are fundamentally Blackbox Systems focused on static graph inference. Kandle, as a Whitebox Framework, fills the following gaps:
| Requirement | Blackbox Inference Engines | Kandle (Whitebox Framework) |
|---|---|---|
| Intermediate Computation | ❌ Cannot intervene after static graph compilation | ✅ Pause/Inspect at any layer via dynamic graph |
| Model Interpretability | ❌ Blackbox, internal states inaccessible | ✅ Hooks, layer-by-layer state export |
| Custom Compute Flow | ❌ Limited to predefined Pipelines | ✅ Fully programmable control flow |
| Pre/Post-processing | ⚠️ Requires extra toolchains / ONNX export | ✅ Unified tensor operation system |
| API Learning Curve | ⚠️ Framework-proprietary APIs | ✅ Zero cost for PyTorch users |
| Debugging Experience | ❌ Hard to pinpoint issues in a blackbox | ✅ "Breakpoint-style" step-by-step debugging |
| Inference Performance | ✅ Static graph global optimization | ⚠️ Eager Mode trade-off |
What Whitebox can do that Blackbox cannot:
- 🔬 Layer-wise Feature Extraction: Export intermediate Tensors at any layer for visual analysis.
- 🎨 Runtime Layer Replacement: Dynamically replace/skip certain layers to implement model pruning or A/B testing.
- 🧪 Custom Loss Functions: Design special computation paths combined with business logic.
- 🎯 Precise Memory Control: Manually manage Tensor lifecycles to optimize VRAM usage.
- 🌐 Deep Integration with DOM API: Hooks directly bind to Canvas/WebGL for real-time rendering.
Suitable Scenarios: Research, prototype development, model debugging, applications requiring intermediate calculations, audio/visual pre-processing, interpretability analysis. Unsuitable Scenarios: High-performance production inference (please use ONNX Runtime or WebLLM).
🚨 Technical Verification Prototype Disclaimer
⚠️ This is a technical verification prototype, not a production-ready preview.
- ✅ The current version focuses on Forward Propagation Architecture Verification, implementing 200+ operators and a complete nn.Module system.
- 🚧 Autograd (Backpropagation) is under development and will be fully implemented in the next version.
- ⚠️ Happy Path Disclaimer: The current implementation mainly verifies the main flow (Happy Path); edge cases and error handling are not yet perfect.
- 🔒 No PRs Accepted Yet: The current development branch has completely diverged from the public version with breaking changes. Contributions will be opened after the architecture stabilizes.
- 💬 Feedback Welcome: I have been working somewhat in isolation, so I am very eager to hear the community’s thoughts and suggestions on "what a JavaScript version of PyTorch should look like."
- 🎯 Operator Demand Collection: Besides primitive operators, I want to know which specific operators the community needs supported early on.
🌐 Online Experience
Try Kandle immediately with no installation required. We provide a visual interactive demo based on Qwen3-0.6B, fully showcasing the unique advantages of an Eager Mode framework in Model Interpretability:
📍 Access Addresses
- 🤗 HuggingFace Spaces: https://huggingface.co/spaces/finalkk/kandle-demo
- ⚡ Vercel: http://kandle-demo.vercel.app/
✨ Demo Core Features
| Feature | Description |
|---|---|
| 🎯 Step-by-Step Execution | Execute forward propagation step by step |
| ⏮️ Time Travel | Step back and re-select the generation path |
| 🎲 Manual Intervention | Manually select candidate words at each token generation to explore different branches |
| 🔍 Logit Lens | Visualize the probability distribution of each layer’s output in the vocabulary space |
| 🔗 Attention Links | Interactively view Self-Attention weight connection relationships |
| 🔥 Heatmap Visualization | Real-time display of Attention Maps and activation value distributions |
💡 This is the meaning of a Whitebox framework: not just inference, but the ability to "dissect" every step of the computation.
🎬 Usage Suggestions
- Explore the Model’s Thought Process: Observe the top-k tokens of each layer’s output during single-step execution to understand how the model gradually "focuses" on the final answer.
- Compare Different Paths: Backtrack and select different candidate words to observe the bifurcation points of the generation results.
- Discover Attention Patterns: Use Attention Links to discover key tokens the model focuses on (e.g., pronoun resolution, context dependencies).
- Debugging and Teaching: Suitable for researchers to understand the internal mechanisms of Transformers, or for teaching demonstrations.
⚠️ Demo Limitations
- Original Pre-trained Version Only: Currently, techniques like quantization are not implemented; it only loads original bf16 weights.
- Relatively Large Model Size: The original model size is about 1.5GB. It is recommended to download the model manually and load it using WebFile or Upload. Qwen3-0.6B Link
🚀 Quick Start
Installation
# Browser environment only needs the core library
# Using pnpm (Recommended)
pnpm add @kandle/core @kandle/backend-webgpu
# Optional: type definitions, utilities, and model-building helpers
pnpm add @kandle/types @kandle/utils @kandle/model-utils
# Or using npm
npm install @kandle/core @kandle/backend-webgpu
# If running in a Node.js environment, install webgpu polyfill additionally
npm install webgpu
Environment Requirements
- Node.js: ≥ 18.0.0 (ES2020+ support required)
- Browser: Chrome/Edge ≥ 113 (WebGPU support)
- TypeScript: ≥ 5.0 (Optional)
Basic Usage Examples
1️⃣ Initialize Backend (WebGPU)
import { env } from "@kandle/core";
import { WebGPUBackend } from "@kandle/backend-webgpu";
export async function initWebGPU() {
const backend = await WebGPUBackend.create();
env.setBackend(backend);
env.setDefaultDevice(backend.name);
}
2️⃣ Tensor Operations and Broadcasting
import * as k from '@kandle/core';
import { Tensor } from '@kandle/core';
// Create Tensor
const a = new Tensor([[1, 2, 3], [4, 5, 6]], { dtype: 'float32' });
const b = k.randn([2, 3]);
// Arithmetic operations (supports broadcasting)
const result = a.add(b).mul(2).softmax(-1);
// Get data (WebGPU asynchronous read)
const data = await result.dataAsync();
console.log(data); // Float32Array [...]
// Shape operations (Zero-copy views)
const transposed = a.transpose(0, 1);
console.log(transposed.shape); // [3, 2]
console.log(a.storageId === transposed.storageId); // true
console.log(a.id === transposed.id); // false
const reshaped = a.reshape([3, 2]);
console.log(reshaped.shape); // [3, 2]
console.log(a.storageId === reshaped.storageId); // true
console.log(a.id === reshaped.id); // false
// Advanced Indexing (Python style)
const slicedContiguous = a.slice(":1, 1:"); // a[:1, 1:]
console.log(slicedContiguous.shape); // [1, 2]
console.log(a.storageId === slicedContiguous.storageId); // true
console.log(a.id === slicedContiguous.id); // false
console.log(slicedContiguous.isContiguous); // true (contiguous here)
// Non-contiguous slicing
const slicedNonContiguous = a.slice("::2, ::-1"); // a[::2, ::-1]
console.log(slicedNonContiguous.shape); // [1, 3]
console.log(a.storageId === slicedNonContiguous.storageId); // true
console.log(a.id === slicedNonContiguous.id); // false
console.log(slicedNonContiguous.isContiguous); // false (non-contiguous here)
3️⃣ Linear Algebra and Matrix Operations
import * as k from '@kandle/core';
// Matrix Multiplication
const x = k.randn([128, 512]);
const weight = k.randn([512, 256]);
const output = k.matmul(x, weight); // [128, 256]
console.log(output.shape);
// Batch Matrix Multiplication
const batch = k.randn([4, 64, 128]);
const weights = k.randn([4, 128, 64]);
const batchOut = k.bmm(batch, weights); // [4, 64, 64]
console.log(batchOut.shape);
// Linear Layer (with bias)
const weightLinear = k.randn([256, 512]);
const bias = k.randn([256]);
const result = k.linear(x, weightLinear, bias);
console.log(result.shape); // [128, 256]
4️⃣ Building Models with nn.Module
import { nn, Tensor, randn } from '@kandle/core';
class MLP extends nn.Module {
fc1: nn.Linear;
fc2: nn.Linear;
constructor(inputDim: number, hiddenDim: number, outputDim: number) {
super();
this.fc1 = new nn.Linear(inputDim, hiddenDim);
this.fc2 = new nn.Linear(hiddenDim, outputDim);
}
async forward(x: Tensor): Promise<Tensor> {
// JS cannot overload operators, so .call() stands in for Python's model(x)
x = await this.fc1.call(x);
x = x.relu();
x = await this.fc2.call(x);
return x;
}
}
// Using the model
const model = new MLP(784, 256, 10);
const input = randn([32, 784]);
const output = await model.call(input);
console.log(output.shape); // [32, 10]
5️⃣ Memory Management (Like tf.tidy)
import * as k from '@kandle/core';
// Automatically release intermediate tensors
const result = k.tidy( () => {
const a = k.randn([1000, 1000]);
const temp1 = a.mul(2);
const temp2 = temp1.add(3);
return temp2.sum(); // Only the sum result is kept, temp1/temp2 are automatically released
});
console.log('Result:', await result.dataAsync());
📦 Monorepo Package Structure
Kandle uses a Monorepo architecture organized with pnpm workspaces. The responsibilities of each package are as follows:
| Package Name | Function Description | Core File |
|---|---|---|
| @kandle/core | 🎨 User-side API, Tensor class, Operators, nn.Module | src/tensor.ts |
| @kandle/backend-webgpu | ⚡ WebGPU Backend Implementation (GPU Compute) | src/index.ts |
| @kandle/types | 📐 Type definitions, Interfaces, OpSchema | src/opschema/ |
| @kandle/utils | 🛠️ Utility functions, dtype handling, shape inference | src/index.ts |
| @kandle/model-utils | 🤖 Model building tools (Qwen3, Whisper) | src/index.ts |
✨ Core Features
1. Complete Tensor Primitive System
Stride Mechanism & Non-Contiguous Memory Support
- ✅ Stride Mechanism: Fully implements PyTorch-style memory layout management.
- ✅ Zero-Copy View Operations: Operations like `transpose`, `permute`, and `slice` do not copy data.
- ✅ Non-Contiguous Memory Computation: Supports direct computation after reshape or slice.
- ✅ Memory Format: Supports Contiguous and ChannelsLast layouts.
// Non-contiguous memory example
const x = randn([4, 3, 224, 224]);
const transposed = x.transpose(1, 2); // Zero-copy, strides changed
const sliced = x.slice("1:-1"); // View operation
// Automatically handles non-contiguous memory computation
const result = transposed.add(1).relu(); // Backend handles strides automatically
Broadcasting Mechanism
Fully compatible with NumPy/PyTorch broadcasting rules:
const a = randn([4, 1, 3]);
const b = randn([3]);
const result = a.add(b); // Automatically broadcasts b to [4, 1, 3]
2. Rich DType Support
💡 Design Philosophy: Logical dtype is separated from physical dtype; the backend automatically selects storage format based on device capabilities.
💡 Quantized types are planned, and storage optimization for bool / int8 / int16 / float16 will be added later.
| DType | TypedArray | WebGPU Storage | Status | Notes |
|---|---|---|---|---|
| `float32` | `Float32Array` | `f32` | ✅ Full | Direct hardware support |
| `float64` | `Float64Array` | `f32` | ⚠️ Downgrade | Downgrades to f32, precision loss exists |
| `float16` | `Uint16Array` | `f16` / `f32` | ⚠️ Device Dependent | Requires `shader-f16` extension |
| `int32` | `Int32Array` | `i32` | ✅ Full | Direct support |
| `uint32` | `Uint32Array` | `u32` | ✅ Full | Direct support |
| `int8` / `uint8` | `Int8Array` / `Uint8Array` | `i32` / `u32` | ⚠️ Extended | Extended storage to 32-bit |
| `int16` / `uint16` | `Int16Array` / `Uint16Array` | `i32` / `u32` | ⚠️ Downgrade | Downgraded storage |
| `complex64` / `complex128` | `Float32Array` / `Float64Array` | `vec2<f32>` | ⚠️ Rudimentary | Interleaved storage `[r0,i0,r1,i1,...]` |
| `bool` | `Uint8Array` | `u32` | ⚠️ Extended | Extended storage |
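A minimal sketch of the logical/physical split in practice; the `to('float32')` call assumes the dtype-conversion signature implied by the utilities list below:

```typescript
import * as k from '@kandle/core';

// Logical dtype is what the user sees; on WebGPU, float64 is physically
// stored as f32 (see the table above — precision loss applies).
const x = k.randn([8], { dtype: 'float64' });
console.log(x.dtype); // 'float64' (logical dtype preserved)

// Explicit conversion via to() (assumed signature: to(dtype)).
const y = x.to('float32');
console.log(y.dtype); // 'float32'
```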
3. 200+ Tensor Operations
💡 This list was compiled with AI-assisted retrieval and may contain omissions or unimplemented items; treat it as indicative rather than authoritative.
💡 The names below are the torch operator names. To match JavaScript conventions, snake_case names are exposed as camelCase.
📐 Arithmetic & Math Operations
Basic Arithmetic: add, sub, mul, div, pow, sqrt, abs, neg, reciprocal, floor, ceil, round, trunc, frac, sign
Trigonometric: sin, cos, tan, asin, acos, atan, atan2
Hyperbolic: sinh, cosh, tanh, asinh, acosh, atanh
Exponential & Logarithmic: exp, exp2, expm1, log, log10, log2, log1p
Special Functions: erf, erfc, sigmoid, logit, i0
🔢 Linear Algebra
Matrix Operations: matmul, mm, bmm, dot, mv, outer, addmm, addmv, baddbmm
Matrix Manipulation: diag, diagonal, trace, tril, triu
Decomposition & Solving (Planned): svd, qr, cholesky, solve
🎲 Reduction Operations
sum, mean, std, var, min, max, argmin, argmax, logsumexp, prod, norm, median, mode, all, any
Supports reduction on specific dimensions and keepdim parameter:
const x = randn([4, 5, 6]);
const result = x.sum(1, true); // Reduce on dim 1, keep dim -> [4, 1, 6]
🔍 Comparison & Logic
Comparison: eq, ne, lt, le, gt, ge, maximum, minimum, clamp
Logic: logical_and, logical_or, logical_not, logical_xor
Conditional Selection: where, masked_fill, masked_select
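A small sketch of conditional selection, assuming PyTorch-like signatures (`maskedFill` is the assumed camelCase form of `masked_fill`, per the naming note above):

```typescript
import * as k from '@kandle/core';

const x = k.randn([2, 3]);
const mask = x.gt(0);                                // elementwise comparison -> boolean mask
const positives = k.where(mask, x, k.zeros([2, 3])); // keep positives, zero the rest
const filled = x.maskedFill(mask, 1);                // assumed camelCase for masked_fill
console.log(positives.shape, filled.shape);          // [2, 3] [2, 3]
```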
🔀 Shape Operations
View Operations (Zero Copy): view, reshape, transpose, permute, squeeze, unsqueeze, flatten
Concatenation & Splitting: cat, stack, split, chunk, unbind
Indexing & Slicing: slice, select, index_select, gather, scatter, masked_select
Repetition & Expansion: repeat, repeat_interleave, expand, tile
Flipping & Rotating: flip, fliplr, flipud, rot90, roll
Advanced: as_strided (Direct stride manipulation)
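A short sketch of concatenation, stacking, and expansion, assuming PyTorch-like signatures:

```typescript
import * as k from '@kandle/core';

const a = k.randn([2, 3]);
const b = k.randn([2, 3]);
const stacked = k.stack([a, b], 0); // new leading dim -> [2, 2, 3]
const joined = k.cat([a, b], 0);    // along an existing dim -> [4, 3]
const row = k.randn([1, 3]);
const grid = row.expand([4, 3]);    // zero-copy broadcast view
console.log(stacked.shape, joined.shape, grid.shape);
```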
🧮 Convolution & Pooling
Convolution: conv1d, conv2d, conv3d, conv_transpose2d, conv_transpose3d
Pooling: max_pool1d, max_pool2d, max_pool3d, avg_pool1d, avg_pool2d, avg_pool3d
Adaptive Pooling: adaptive_avg_pool2d, adaptive_max_pool2d
Padding: pad (Supports constant, reflect, replicate, circular modes)
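A sketch of a convolution plus pooling step, assuming PyTorch-like functional signatures (`maxPool2d` is the assumed camelCase form; stride/padding options may differ in the actual API):

```typescript
import * as k from '@kandle/core';

const img = k.randn([1, 3, 32, 32]);    // NCHW input
const kernels = k.randn([16, 3, 3, 3]); // 16 filters, 3 input channels, 3x3 kernel
const feat = k.conv2d(img, kernels);    // no padding -> [1, 16, 30, 30]
const pooled = k.maxPool2d(feat, 2);    // 2x2 window -> [1, 16, 15, 15]
console.log(pooled.shape);
```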
📊 Normalization
batch_norm, layer_norm, group_norm, instance_norm, rms_norm, normalize
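For instance, a layer-norm call over the last dimension might look like this (a sketch assuming a PyTorch-like `layerNorm(input, normalizedShape)` signature; the actual parameter shape is an assumption):

```typescript
import * as k from '@kandle/core';

const h = k.randn([4, 256]);
const normed = k.layerNorm(h, [256]); // normalize over the last dimension
console.log(normed.shape);            // [4, 256]
```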
⚡ Activation Functions
relu, gelu, silu (swish), elu, selu, leaky_relu, prelu, rrelu, hardtanh, relu6, softplus, softsign, softmax, log_softmax, softmin, sigmoid, tanh, log_sigmoid, hardsigmoid, hardswish, mish, dropout
🎵 FFT (Fast Fourier Transform)
Real FFT: rfft, irfft, rfft2, irfft2
Complex FFT: fft, ifft, fft2, ifft2
Application: Audio signal processing, spectrum analysis
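A small sketch of a real FFT round trip; output sizes follow the usual one-sided convention, and the complex result uses the interleaved storage described in the dtype table:

```typescript
import * as k from '@kandle/core';

const signal = k.randn([16000]);     // 1 second of audio at 16 kHz
const spectrum = k.rfft(signal);     // one-sided spectrum -> [8001], complex dtype
const restored = k.irfft(spectrum);  // back to the time domain -> [16000]
console.log(spectrum.shape, restored.shape);
```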
📈 Cumulative Operations
cumsum, cumprod, cummax, cummin, diff
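For example (assuming `diff` defaults to the last dimension, as in PyTorch):

```typescript
import * as k from '@kandle/core';

const x = k.randn([5]);
const running = x.cumsum(0); // running totals -> [5]
const deltas = x.diff();     // first differences -> [4]
console.log(running.shape, deltas.shape);
```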
🔧 Other Utilities
Sorting: sort, argsort, topk, kthvalue
Unique Values: unique, unique_consecutive
Filling & Cloning: fill_, zero_, clone, detach
Type Conversion: to (dtype/device conversion), contiguous (force contiguous memory)
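A quick sketch of top-k selection, assuming it mirrors PyTorch and returns a values/indices pair (the exact return shape is an assumption):

```typescript
import * as k from '@kandle/core';

const scores = k.randn([100]);
const [values, indices] = k.topk(scores, 5); // assumed values/indices pair
console.log(values.shape, indices.shape);    // [5] [5]
```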
4. Complete nn.Module Ecosystem
Core Base Classes
- `nn.Module`: Base class, supports `forward`, `parameters()`
- `nn.Parameter`: Learnable parameter wrapper
- Containers: `Sequential`, `ModuleList`, `ModuleDict`

💡 `state_dict()` and `load_state_dict()` are hard to align perfectly; refer to the `IO` class API below for model loading.
Implemented Layers
Linear & Embedding Layers
- `nn.Linear`: Fully connected layer
- `nn.Embedding`: Embedding layer

Convolution Layers
- `nn.Conv1d`, `nn.Conv2d`, `nn.Conv3d`
- `nn.ConvTranspose2d`, `nn.ConvTranspose3d`

Pooling Layers
- `nn.MaxPool1d`, `nn.MaxPool2d`, `nn.MaxPool3d`
- `nn.AvgPool1d`, `nn.AvgPool2d`, `nn.AvgPool3d`

Normalization Layers
- `nn.LayerNorm`
- `nn.RMSNorm`

Activation Layers
- `nn.ReLU`, `nn.GELU`, `nn.SiLU`
- `nn.LeakyReLU`, `nn.PReLU`, `nn.Softmax`, `nn.LogSoftmax`
- `nn.Sigmoid`, `nn.Tanh`, `nn.Softplus`, `nn.Mish`
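The containers compose like their PyTorch counterparts. Here is a sketch of the earlier MLP rebuilt with `Sequential`, assuming it accepts modules as constructor arguments:

```typescript
import { nn, randn } from '@kandle/core';

const mlp = new nn.Sequential(
  new nn.Linear(784, 256),
  new nn.ReLU(),
  new nn.Linear(256, 10),
);
const out = await mlp.call(randn([32, 784])); // .call() replaces Python's mlp(x)
console.log(out.shape); // [32, 10]
```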
Hook Mechanism
Supports Forward and Backward Hooks (Backward requires Autograd support):
// Register forward Hook, register_forward_hook
model.registerForwardHook(async (module, input, output) => {
console.log('Layer output shape:', output.shape);
});
// Register forward pre-hook, register_forward_pre_hook
model.registerForwardPreHook(async (module, input) => {
console.log('Layer input shape:', input.shape);
});
Use Cases:
- Feature Visualization (e.g., CAM, Grad-CAM)
- Intermediate Layer Output Extraction
- Model Debugging and Profiling
- Dynamic Layer Replacement
5. audio Module (modeled after torchaudio)
Implements core functionality of PyTorch’s audio processing library:
Transforms
Class API:
- `audio.Spectrogram`: Spectrogram
- `audio.MelScale`: Mel filter bank
- `audio.MelSpectrogram`: Mel spectrogram
- `audio.MFCC`: Mel-frequency cepstral coefficients
- `audio.AmplitudeToDB`: Amplitude to decibels
- `audio.InverseMelScale`: Inverse Mel transform
- `audio.GriffinLim`: Phase reconstruction
- `audio.FrequencyMasking`: Frequency masking (data augmentation)
- `audio.TimeMasking`: Time masking (data augmentation)
Functional API: Corresponding audio.functional.* functions
Usage Example
import { audio, Tensor } from '@kandle/core';
// Assume 3 seconds of audio data
const audioData = new Float32Array(16000 * 3);
const waveform = new Tensor(audioData, { shape: [1, audioData.length] });
// Compute Mel Spectrogram
const melSpec = new audio.MelSpectrogram({
sample_rate: 16000,
n_fft: 400,
hop_length: 160,
n_mels: 80,
});
const melOutput = await melSpec.call(waveform);
console.log(melOutput.shape); // [1, 80, 301]
// Convert to log scale
const ampToDB = new audio.AmplitudeToDB();
const logMel = await ampToDB.call(melOutput);
console.log(logMel.shape); // [1, 80, 301]
6️⃣ Audio Signal Processing
import { audio, Tensor } from '@kandle/core';
// Assume 3 seconds of audio data
const audioData = new Float32Array(16000 * 3);
const waveform = new Tensor(audioData, { shape: [1, audioData.length] });
// Compute Spectrogram
const spectrogram = new audio.Spectrogram({
n_fft: 512,
hop_length: 256,
power: 2.0,
});
const spec = await spectrogram.call(waveform);
console.log(spec.shape); // [1, 257, 188]
// Apply Mel Filter
const melScale = new audio.MelScale({
n_mels: 80,
sample_rate: 16000,
n_stft: 257,
});
const melSpec = await melScale.call(spec);
console.log(melSpec.shape); // [1, 80, 188]
// Compute MFCC
const mfcc = new audio.MFCC({
sample_rate: 16000,
n_mfcc: 13,
n_mels: 40
});
const mfccFeatures = await mfcc.call(waveform);
console.log(mfccFeatures.shape); // [1, 13, 241]
// Data Augmentation: Time Masking
const timeMask = new audio.TimeMasking({ time_mask_param: 10 });
const augmented = await timeMask.call(melSpec);
console.log(augmented.shape); // [1, 80, 188]
6. I/O System
Supported Model Formats
- ✅ Safetensor: HuggingFace's mainstream format, supports shard index (`.safetensors.index.json`)
- ✅ NumPy (`.npy`): Used for test data loading
ByteSource Abstraction
Unified data source interface across platforms:
- `FileByteSource` (Node.js)
- `BlobByteSource` (Web)
- `BufferByteSource` (Memory)
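A hypothetical sketch of wiring a browser file pick into the loader; the constructor shape and whether `loadSafetensor` accepts a ByteSource directly are assumptions, not confirmed API:

```typescript
import { io } from '@kandle/core';

async function loadFromPickedFile(file: File) {
  const source = new io.BlobByteSource(file);    // assumed Blob-backed source
  const group = await io.loadSafetensor(source); // assumed to accept a ByteSource
  group.dumpWeightMap();
  return group;
}
```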
Safetensor Loading Example
import { io } from '@kandle/core';
// Load safetensor (read header only, data not loaded)
const group = await io.loadSafetensor('./model.safetensors');
// View all weights
group.dumpWeightMap();
// Load specific tensor
const layer = group.getLayer('model.embed_tokens.weight');
const tensor = await io.tensorFromSafetensorLayer(layer!, { device: 'webgpu' });
console.log(tensor.shape, tensor.dtype);
// Release resources
group.close();
For full IO usage, see the IO Documentation.
7. Showcase: Full Model Implementation (Aligned with PyTorch)
💡 Design Goal: Constructing these models is not to replace dedicated inference engines, but to demonstrate how Kandle, as a Whitebox Framework, implements model architectures highly aligned with PyTorch.
🤖 Qwen3 (Text Generation)
Qwen3MLP (SwiGLU) Code Comparison: HuggingFace Transformers Official vs. Kandle Implementation
🐍 Python (HuggingFace Transformers)
# Source: huggingface/transformers
# https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen3/modeling_qwen3.py
class Qwen3MLP(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
self.hidden_size = config.hidden_size
self.intermediate_size = config.intermediate_size
self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
self.act_fn = ACT2FN[config.hidden_act]
def forward(self, x):
down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
return down_proj
📘 TypeScript (Kandle)
// @kandle/model-utils
// src/mlp/swiglu.ts
export class SwiGLUMLP extends nn.Module {
gate_proj: nn.Linear;
up_proj: nn.Linear;
down_proj: nn.Linear;
constructor(options: SwiGLUMLPOptions) {
super();
const {
hiddenSize,
intermediateSize,
bias = false,
} = options;
this.hiddenSize = hiddenSize;
this.intermediateSize = intermediateSize;
this.gate_proj = new nn.Linear(hiddenSize, intermediateSize, bias);
this.up_proj = new nn.Linear(hiddenSize, intermediateSize, bias);
this.down_proj = new nn.Linear(intermediateSize, hiddenSize, bias);
this.addModule('gate_proj', this.gate_proj);
this.addModule('up_proj', this.up_proj);
this.addModule('down_proj', this.down_proj);
}
async forward(x: Tensor): Promise<Tensor> {
const gateProj = await this.gate_proj.call(x);
const gate = functional.silu(gateProj);
const up = await this.up_proj.call(x);
const hidden = gate.mul(up);
const output = await this.down_proj.call(hidden);
return output;
}
}
📌 Source Note: Python code referenced from huggingface/transformers - modeling_qwen3.py
Architecture Completeness:
- ✅ `Qwen3DecoderLayer`: Fully implements Attention + MLP + LayerNorm
- ✅ `GroupedQueryAttention`: GQA with RoPE + Q/K RMSNorm
- ✅ `SwiGLUMLP`: SwiGLU activation (`silu(gate) * up`)
- ✅ `nn.RMSNorm`: RMS Normalization
- ✅ Complete forward propagation flow, including KV Cache and Causal Mask
Full Example: playground-web/qwen3/, playground-node/src/qwen3/
import { Qwen3ForCausalLM } from '@kandle/model-utils';
const model = new Qwen3ForCausalLM(config, /* useCausalMask */ true);
await model.loadFromSafetensor(safetensorGroup);
const output = await model.forward(inputIds, {
positionIds,
pastKeyValues,
attentionMask,
});
🎤 Whisper (Speech Recognition)
- Architecture Components: `WhisperEncoder`, `WhisperDecoder`, `WhisperModel`
- Audio Processing: Integrated Mel Spectrogram preprocessing
- Decoding Strategy: Greedy Decoding
- Full Example: playground-node/src/whisper/
import { Whisper, prepareAudioInput } from '@kandle/model-utils';
const model = new Whisper(WHISPER_BASE_CONFIG);
await model.loadFromSafetensor(safetensorGroup);
const melInput = await prepareAudioInput(audioFloat32Array);
const result = await transcribe(model, tokenizer, melInput);
console.log(result.text);
Utility Components
- RoPE: `applyRotaryPosEmb`
- Sinusoidal Positional Encoding: `sinusoidalPositionEncoding`
- KV Cache: `KVCache` (inference acceleration)
- Attention Variants: `multiHeadAttention`, `groupedQueryAttention`, `multiQueryAttention`
- MLP Variants: `SwiGLU`, `GeGLU`
🏗️ Architecture Design
Layered Architecture Diagram
┌─────────────────────────────────────────────────────────┐
│ User API Layer (@kandle/core) │
│ Tensor, zeros, randn, nn.Module, audio... │
└────────────────────┬────────────────────────────────────┘
│
┌────────────────────▼────────────────────────────────────┐
│ Dispatch Layer │
│ Operation routing, dtype resolution, broadcasting │
└────────────────────┬────────────────────────────────────┘
│
┌───────────────┼───────────────┐
│ │ │
┌────▼──────┐ ┌────▼──────┐ ┌────▼──────┐
│ Handler 1 │ │ Handler 2 │ │ Handler N │ (Mechanism-based)
│ Map/Reduce│ │ Composite │ │ FFT │
└────┬──────┘ └────┬──────┘ └────┬──────┘
│ │ │
└───────────────┼───────────────┘
│
┌────────────────────▼────────────────────────────────────┐
│ Kernel Layer │
│ Backend-specific implementations │
└────────────────────┬────────────────────────────────────┘
│
┌───────────┴───────────┐
┌────────▼─────────┐ ┌─────────▼──────────┐
│ @kandle/backend- │ │ @kandle/backend-js │
│ webgpu │ │ (CPU fallback) │
└──────────────────┘ └────────────────────┘
Core Concepts
Separation of Storage & Computation (Storage & Handle)
Referencing PyTorch’s ATen/c10 design:
// 1. Storage: Physical memory
interface IStorage {
data: TypedArray;
byteOffset: number;
byteLength: number;
}
// 2. TensorHandle: Metadata
interface ITensorHandle {
storage: IStorage;
shape: number[];
strides: number[];
offset: number;
dtype: DType;
}
// 3. Tensor: User-side wrapper
class Tensor {
constructor(public handle: ITensorHandle) {}
// View operations modify handle only, no storage copy
transpose(dim0: number, dim1: number): Tensor {
const newStrides = swapStrides(this.handle.strides, dim0, dim1);
return new Tensor({ ...this.handle, strides: newStrides });
}
}
Advantages:
- ✅ Zero-copy view operations
- ✅ Supports non-contiguous memory layouts
- ✅ Flexible memory management strategies
Dispatch System (Simplified Distribution Mechanism)
⚠️ Difference from PyTorch: PyTorch uses a complex Dispatch Key system (e.g., `AutogradCPU`, `AutogradCUDA`) supporting multi-dimensional dispatch (backend, layout, autograd). Kandle currently implements a simplified version based on `opName + device` dispatch.
📝 Architecture Evolution: The current dispatch routing mechanism will be rewritten in future versions, but the core mechanism-based routing philosophy will remain.
Routing by Computation Mechanism:
// packages/utils/src/dispatchUtils.ts
const handlers = {
'map_reduce': MapReduceHandler, // Element-wise + Reduction
'composite': CompositeHandler, // Pure JS composite operations
'fft': FFTHandler, // FFT specialized processing
'conv': ConvolutionHandler, // Convolution specialized
'matmul': MatmulHandler, // Matrix Multiplication specialized
  // ...
};
// Simplified dispatch logic (Non-Dispatch Key)
function dispatch(opSchema: OpSchema, ...args) {
const handler = handlers[opSchema.mechanism];
const backend = getBackendByDevice(args[0].device);
return handler.execute(backend, opSchema, ...args);
}
Current Implementation:
- ✅ Routes to different Handlers by the `mechanism` field
- ✅ Gets the corresponding Backend (webgpu / js) by `device`
- ❌ Does not support PyTorch-style multi-dimensional Dispatch Key
- ❌ Does not support runtime dynamic registration of Dispatch rules (Under development)
DType Resolver (Logical vs. Physical Separation)
Automatically handles dtype conversion and device compatibility:
// User code
const x = randn([100], { dtype: 'float64' });
// Backend actual storage (WebGPU does not support f64)
// Logical dtype: float64
// Physical dtype: float32 (downgrade)
// Upload: Float64Array -> Float32Array (precision loss warning)
// Download: Float32Array -> Float64Array
Features:
- Auto-detects the `shader-f16` extension
- Supports
vec2<f32>mapping for complex types
Codegen System (Reference PyTorch native_functions.yaml)
💡 Design Inspiration: PyTorch uses `native_functions.yaml` to define operator signatures and generates C++ code via torchgen. Kandle implements a similar idea, using TypeScript interfaces as OpSchema and generating user-side APIs via Codegen.
Generator: File Location
Generated Files: File Location
Reduces boilerplate and ensures API consistency:
pnpm codegen
OpSchema Definition Example:
// packages/types/src/opschema/ops/activation.ts
export const gelu: OpEntry = {
name: 'gelu',
mechanism: 'Iterator',
iteratorType: 'Map',
signature: {
params: [
{ name: 'self', type: SchemaT.Tensor() },
{ name: 'approximate', type: SchemaT.String(['none', 'tanh']), default: 'none' },
],
returns: { single: SchemaT.Tensor() },
},
iteratorConfig: {
factory: 'unary',
tensorInputs: ['self'],
scalarArgs: ['approximate'],
},
shape: SchemaShape.same('self'),
dtype: SchemaDtype.same('self'),
dispatchKey: 'gelu',
codegen: { tensorMethod: 'gelu', namespace: 'nn.functional' },
};
Generated Content:
- `methods-gen.ts`: Tensor prototype methods (e.g., `tensor.add()`)
- `ops-gen.ts`: Top-level operation functions (e.g., `add(tensor, other)`)
- `types-gen.ts`: OpSchema type definition summary
Comparison with PyTorch:
| Feature | PyTorch (YAML) | Kandle (TypeScript Interface) |
|---|---|---|
| Definition Format | native_functions.yaml | TypeScript Interface |
| Generation Target | C++ / Python Binding | TypeScript API |
| Type Check | Runtime | Compile-time (TypeScript) |
| Extensibility | ✅ Supports Complex Dispatch | ⚠️ Current Simplified Version |
🎯 Special Handling
1. Python-style Slice Syntax
import { randn } from '@kandle/core';
const x = randn([3, 4, 5]);
// Python: x[:, 1:5, ::2]
// Kandle:
const result = x.slice(":,1:5,::2");
console.log(result.shape); // [3,3,3]
// Supports negative indexing
const tail = x.slice("-5:"); // x[-5:]
console.log(tail.shape); // [3,4,5]
⚠️ Known Limits and Issues
For detailed documentation, see knownIssues/.
1. Async Propagation
Issue: WebGPU’s buffer.mapAsync() forces all data reading to be asynchronous. Impact:
- ✅ `forward` methods are uniformly `async`.
- ❌ Cannot directly read the values of other Tensors inside a kernel (e.g., for conditional logic).
- ❌ Complexity of implementing composite operators increases.
Mitigation:
- Provide synchronous JS backend (Under development).
- Design to avoid operations requiring synchronous reading.
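In practice this means any host-side control flow on tensor values goes through an explicit readback, e.g.:

```typescript
import * as k from '@kandle/core';

const logits = k.randn([10]);
const probs = logits.softmax(-1);
// There is no synchronous read on WebGPU; await the readback first:
const data = await probs.dataAsync();
if (data[0] > 0.5) {
  console.log('branching on a tensor value happens in JS, after the await');
}
```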
Details: knownIssues/async.md
2. DType Downgrading
Issue: WebGPU does not support some dtypes, requiring downgrading or extended storage. Impact:
- `float64` → `float32`: Precision loss.
- `int8` → `i32`: Memory waste (4x).
- `complex128` → `vec2<f32>`: Precision loss.
Recommendation:
- Prioritize `float32` and `int32`.
- Use the JS backend for high precision (under development).
Details: See Core Features - DType Support
3. Rudimentary Complex Support
Issue: Current complex type implementation is basic, only supporting basic arithmetic. Plan: Will refactor the complex number calculation system in future versions.
Details: knownIssues/complex.md
4. Type System Needs Strengthening
Issue: Significant use of as any type assertions. Plan: Gradually strengthen TypeScript type inference and generic constraints.
Details: knownIssues/type.md
5. Dispatch Layer Responsibility Mix
Issue: The current dispatch layer mixes scheduling logic with some computation logic. Plan: Refactor into a pure routing layer.
Details: knownIssues/dispatch.md, knownIssues/opschema.md
6. WebGPU Numerical Stability Issues
Issue: WebGPU backend may produce numerical differences across different hardware/drivers, especially in certain activation functions (like GELU, softmax) and mathematical operations, leading to NaN or precision issues.
Impact:
- ⚠️ Identical models may produce slightly different outputs on different GPU devices.
- ❌ Extreme cases may produce NaN values (e.g., unclamped GELU, softmax exp overflow).
- 🔴 Numerical instability caused by hardware/driver implementation differences appears to be unavoidable at the framework level.
Known Cases:
- GELU Activation NaN: Without limiting tanh input range, large activation values in certain layers can produce NaN (See knownIssues/shader.md).
- Softmax Overflow: If input is not subtracted by max value, exp may overflow to Infinity.
- Precision Loss Accumulation: Float32 precision loss may accumulate after multi-layer computation.
Mitigation:
- ✅ Numerical stability protection added to key operators (e.g., clamp for GELU, subtract max for softmax).
- ⚠️ Use identical hardware for testing and deployment to avoid cross-device result differences.
- 📊 Monitor numerical ranges of key outputs to detect anomalies in time.
- 🔍 Refer to knownIssues/shader.md for detailed troubleshooting guides.
Current Limitations:
- Since the WebGPU specification does not mandate precise floating-point behavior, implementations across drivers/hardware may vary.
- There is currently no excellent solution to completely eliminate this difference; this is an inherent limitation of the WebGPU ecosystem.
Details: knownIssues/shader.md
7. WebGPU VRAM Leaks and Memory Management
Issue: The WebGPU backend suffers from VRAM leaks because the JavaScript side cannot perceive WebGPU side memory pressure.
Root Causes:
- ❌ JS & WebGPU Memory Isolation: JavaScript’s Garbage Collection (GC) mechanism cannot perceive GPU VRAM pressure.
- ❌ FinalizationRegistry Timing Uncontrollable: Even when `FinalizationRegistry` is used to register destructors, the callback timing is entirely decided by GC and may fire only after VRAM is already exhausted.
- ⚠️ Complex View Tensor References: View Tensors created by `transpose`, `slice`, etc., share Storage with the original Tensor, creating reference relationships that make the precise release timing difficult to determine.
Impact:
- ❌ Long inference sessions (e.g., generating 1000+ tokens) may crash due to VRAM exhaustion.
- ⚠️ Even after loading large models, intermediate Tensors that are no longer used may still occupy VRAM.
- ⚠️ View operations (like `view()`, `transpose()`) extend the lifecycle of the original Storage even though they don't copy data.
My Optimization Attempts:
- ⚠️ Implemented a complex Memory Pool mechanism to reuse GPU Buffers, but it didn’t achieve practical results, so it is disabled in the current release. See File Location.
- ✅ Provided `tidy()` and manual `dispose()` APIs.
- ✅ Attempted to optimize reference counting for View Tensors.
- ⚠️ But problems persist: Due to the inherent limitation of JS/WebGPU memory isolation, perfect automatic management is impossible.
Mitigation (User Cooperation Required):
- Highly Recommended: Use `tidy()` to wrap computation logic so intermediate Tensor lifecycles are managed automatically.
const result = tidy(() => {
const temp1 = a.mul(2);
const temp2 = temp1.add(3);
return temp2.sum(); // Only the sum result is kept
});
- Explicitly call `dispose()` to release unused Tensors.
const temp = a.mul(2);
const result = temp.add(3);
temp.dispose(); // Manual release
- Periodically monitor VRAM usage (Chrome DevTools → Performance Monitor).
- Avoid creating massive temporary Tensors in loops without releasing them.
Long-term Plan:
- Optimize Memory Pool strategy for more aggressive memory reclamation.
- Improve reference tracking mechanism for View Tensors.
Looking for expert advice!
Details: knownIssues/cache.md
🌐 Browser Compatibility
WebGPU Support Status
| Browser | Minimum Version | Notes |
|---|---|---|
| Chrome | 113+ | ✅ Full Support |
| Edge | 113+ | ✅ Full Support |
| Safari | Preview | ⚠️ Partial Support (macOS 14+) |
| Firefox | Experimental | ⚠️ Requires Manual Enable |
📚 Example Projects
Web Environment: Qwen3 Text Generation
Location: playground-web/qwen3/
cd playground-web
pnpm install
pnpm dev
# Access http://localhost:5173/qwen3/
Features:
- WebGPU accelerated text generation
- Streaming output support
- Visualized Attention weights
Node.js Environment: Whisper Speech Recognition
Location: playground-node/src/whisper/
cd playground-node
pnpm install
pnpm start
Features:
- Loads local audio files
- Mel Spectrogram preprocessing
- End-to-end speech-to-text
🚀 Roadmap
🔨 In Development (Current Version)
Architecture Refactoring: Further optimize layered design, refine Codegen system and type inference.
Autograd (Automatic Differentiation): Backpropagation system supporting gradient calculation and parameter optimization.
- Currently implementing an auto-differentiation system based on derivatives.yaml.
- Designing a TypeScript version of the parser for PyTorch's derivative DSL (complex; AI may implement all primitive operators faster).
- Automatically generating backpropagation operators from derivatives.yaml to keep behavior consistent with PyTorch.
- Goal: cover gradient definitions for most common forward operators and support higher-order derivatives.
nn.Module Enhancements:
- ✅ Generator-implemented layer-by-layer debugging.
- 🚧 Runtime Module Swapping.
- 🚧 State Checkpoints.
Custom Kernel Registration: Runtime custom kernel registration, supporting Fused Kernel optimization.
Pure JS Backend Completion: Fully synchronous CPU computation backend (analogous to PyTorch CPU).
Domain Module Completion: Continue refining the audio module (modeled after torchaudio) and the vision module (modeled after torchvision).
📅 Short-term Plan (3-6 Months)
Quantization Support:
- int4 / int8 quantization dtypes.
- Dynamic Quantization.
- Static Quantization.

Independent Scalar Math Library: Solve type conversion issues for mixed-dtype calculations in JS.

Performance Optimization:
- Kernel Fusion.
- Memory Pool Optimization.
- Shader Cache System.
🌟 Long-term Plan (6-12 Months)
Remote Backend: Distributed computing backend based on WebSocket/gRPC.
Training API: Complete training loop support (requires Autograd completion).
NumPy API Compatibility Layer: Reuse computation dispatch architecture, add numpy operators, exposed via namespace import { np } from '@kandle/core'.
Model Interpretability UI Component Library (React-based):
- Heatmap Visualization.
- Feature Map display.
- Attention Weight Visualization.
- Inference Process Animation.

Pre-trained Model Ecosystem:
- Launch an independent @kandle/models package, implementing functionality similar to HuggingFace Transformers.
- Provide out-of-the-box pre-trained models (LLaMA, BERT, ViT, Whisper, etc.).
- Support loading models and configs directly from HuggingFace Hub.
- Unified model loading and inference interface.

GitHub Agent Automated Workflow:
- Implement an intelligent GitHub Agent that listens for specific Issue/PR formats.
- When an operator request matches, automatically trigger the Agent to:
  - Search relevant technical docs and PyTorch implementations.
  - Generate operator definitions (OpSchema).
  - Implement kernels (WebGPU/JS dual backend).
  - Automatically generate functional tests and numerical validation cases.
  - Submit a Pull Request for human review.
- Lower the community contribution threshold and accelerate operator ecosystem construction.
🎭 API Design Principles
Code Style Note
⚠️ Naming Convention Transition: As a side effect of Vibe Coding, the current code contains a mix of `snake_case` and `camelCase`. I will gradually unify this to `camelCase` in future versions to match JavaScript/TypeScript community conventions.
Compromises for JavaScript Localization
Due to language differences between JavaScript and Python, some APIs cannot be perfectly aligned:
1. Parameter Naming
Python (Keyword Arguments):
torch.zeros(size=(3, 4), dtype=torch.float32, device='cuda')
JavaScript (Object Arguments):
zeros([3, 4], { dtype: 'float32', device: 'webgpu' })
2. Operator Overloading
Since JavaScript does not support operator overloading, basic operations require explicit method calls:
| Python | TypeScript (Kandle) |
|---|---|
| `a + b` | `add(a, b)` or `a.add(b)` |
| `a - b` | `sub(a, b)` or `a.sub(b)` |
| `a * b` | `mul(a, b)` or `a.mul(b)` |
| `a / b` | `div(a, b)` or `a.div(b)` |
| `a @ b` | `matmul(a, b)` or `a.matmul(b)` |
| `model(x)` | `model.call(x)` |
💡 `nn.Module`'s `__call__` must be invoked explicitly via the `.call()` method.
3. Slicing Syntax
Python:
x[:, 1:5]
JavaScript (Function Simulation):
x.slice(":,1:5")
API Evolution in Future Versions
Regarding parameter positioning, two options are considered:
- Full Alignment with Torch: Attempt complete alignment via complex overloading.
  Most APIs are feasible, but the implementation is overly complex, and a few APIs will still fail to align, requiring separate memorization and leading to an inconsistent experience.
- Design a JS Specification: Define a JS-native reference specification and enforce "alignment after translation" via rules.
  Simpler to develop, but degrades the experience and lowers alignment with Torch.
⚡ Performance
Design Trade-offs
Kandle uses Eager Mode (dynamic graph) execution, which differs fundamentally from static graph inference engines:
| Feature | Eager Mode (Kandle) | Static Graph (ONNX) |
|---|---|---|
| Execution Style | Op-by-Op execution | One-time graph optimization |
| Intermediate State | ✅ Accessible anytime | ❌ Invisible after compilation |
| Dynamic Control Flow | ✅ Supports if/loop | ⚠️ Limited |
| Memory Overhead | ⚠️ High (keeps intermediate results) | ✅ Low after optimization |
| Inference Speed | ⚠️ Slower (no global optimization) | ✅ Extreme optimization |
| Debugging Experience | ✅ Excellent | ❌ Difficult |
Applicable Scenarios
✅ Recommend Kandle:
- Research and Prototype Development
- Model Debugging and Interpretability Analysis
- Applications requiring intermediate calculations (e.g., Audio Preprocessing + Model Inference)
- Teaching and Learning
❌ Do Not Recommend Kandle:
- High-performance production inference (Please use ONNX Runtime)
- Mobile/Edge devices (Please use WebLLM or TFLite)
- Real-time applications strictly sensitive to latency
Performance Optimization Suggestions
- Avoid Unnecessary Data Reads: Reduce `dataAsync()` calls.
- Use `tidy()` for Memory: Automatically release intermediate tensors.
- Batch Inference: Increase batch size to improve GPU utilization (see the sketch below).
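A sketch combining these suggestions: batch the inputs, keep intermediates inside `tidy()`, and read back once at the end:

```typescript
import * as k from '@kandle/core';

const result = k.tidy(() => {
  const batch = k.randn([32, 784]);   // one batched input instead of 32 separate calls
  const weights = k.randn([784, 10]);
  return k.matmul(batch, weights).softmax(-1); // only the returned tensor is kept
});
const host = await result.dataAsync(); // single readback at the end
console.log(host.length);              // 320
```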
🤖 About AI Assisted Development
Vibe Coding Practice and Exploration
💡 This is also an exploration of the limits of Vibe Coding.
This project adopts the Vibe Coding development mode, attempting to explore the boundaries of AI-assisted development:
- **Architec