🕯️ Kandle
JavaScript Native PyTorch-aligned Machine Learning Framework
Bringing the true PyTorch experience to the JavaScript ecosystem
Quick Start • Core Features • Example Projects • Architecture • Roadmap
📖 Introduction
Kandle is a JavaScript Native machine learning framework that adopts an Eager Mode (dynamic graph) execution pattern, deeply referencing PyTorch’s ATen/c10 architectural design. I view PyTorch not just as a Python framework, but as the API specification standard for modern AI frameworks. Kandle is dedicated to implementing an API system highly aligned with PyTorch within the JavaScript ecosystem.
🎯 Core Value Proposition
- 🔄 Dynamic Graph Execution: True Eager Mode, supporting layer-by-layer debugging, intermediate state inspection, and dynamic control flow.
- 🎨 PyTorch API Alignment: Aligned at the architectural level rather than simple API wrapping, reducing migration costs and learning curves.
- ⚡ Hybrid Backend Architecture: Supports both WebGPU (GPU acceleration) and pure JS (CPU computation) backends under a unified interface.
- 🧩 Complete Tensor System: Implements a full Stride mechanism, broadcasting, view operations, and non-contiguous memory support.
- 🎵 Rich Operator Library: 200+ tensor operations covering arithmetic, linear algebra, convolution, FFT, audio processing, and more.
- 🚀 Out-of-the-Box Models: Native support for mainstream models like Qwen3 and Whisper, capable of loading Safetensor weights directly.
💡 Why Choose Kandle?
Current inference engines in the JavaScript ecosystem, such as ONNX Runtime and WebLLM, are excellent but are fundamentally Blackbox Systems focused on static graph inference. Kandle, as a Whitebox Framework, fills the following gaps:
| Requirement | Blackbox Inference Engines | Kandle (Whitebox Framework) |
|---|---|---|
| Intermediate Computation | ❌ Cannot intervene after static graph compilation | ✅ Pause/Inspect at any layer via dynamic graph |
| Model Interpretability | ❌ Blackbox, internal states inaccessible | ✅ Hooks, layer-by-layer state export |
| Custom Compute Flow | ❌ Limited to predefined Pipelines | ✅ Fully programmable control flow |
| Pre/Post-processing | ⚠️ Requires extra toolchains / ONNX export | ✅ Unified tensor operation system |
| API Learning Curve | ⚠️ Framework-proprietary APIs | ✅ Zero cost for PyTorch users |
| Debugging Experience | ❌ Hard to pinpoint issues in a blackbox | ✅ "Breakpoint-style" step-by-step debugging |
| Inference Performance | ✅ Static graph global optimization | ⚠️ Eager Mode trade-off |
What Whitebox can do that Blackbox cannot:
- 🔬 Layer-wise Feature Extraction: Export intermediate Tensors at any layer for visual analysis.
- 🎨 Runtime Layer Replacement: Dynamically replace/skip certain layers to implement model pruning or A/B testing.
- 🧪 Custom Loss Functions: Design special computation paths combined with business logic.
- 🎯 Precise Memory Control: Manually manage Tensor lifecycles to optimize VRAM usage.
- 🌐 Deep Integration with DOM API: Hooks directly bind to Canvas/WebGL for real-time rendering.
Suitable Scenarios: Research, prototype development, model debugging, applications requiring intermediate calculations, audio/visual pre-processing, interpretability analysis. Unsuitable Scenarios: High-performance production inference (please use ONNX Runtime or WebLLM).
🚨 Technical Verification Prototype Disclaimer
⚠️ This is a technical verification prototype, not a production-ready preview.
- ✅ The current version focuses on Forward Propagation Architecture Verification, implementing 200+ operators and a complete nn.Module system.
- 🚧 Autograd (Backpropagation) is under development and will be fully implemented in the next version.
- ⚠️ Happy Path Disclaimer: The current implementation mainly verifies the main flow (Happy Path); edge cases and error handling are not yet perfect.
- 🔒 No PRs Accepted Yet: The current development branch has completely diverged from the public version with breaking changes. Contributions will be opened after the architecture stabilizes.
- 💬 Feedback Welcome: I have been working somewhat in isolation, so I am very eager to hear the community’s thoughts and suggestions on "what a JavaScript version of PyTorch should look like."
- 🎯 Operator Demand Collection: Besides primitive operators, I want to know which specific operators the community needs supported early on.
🌐 Online Experience
Try Kandle immediately with no installation required. We provide a visual interactive demo based on Qwen3-0.6B, fully showcasing the unique advantages of an Eager Mode framework in Model Interpretability:
📍 Access Addresses
- 🤗 HuggingFace Spaces: https://huggingface.co/spaces/finalkk/kandle-demo
- ⚡ Vercel: http://kandle-demo.vercel.app/
✨ Demo Core Features
| Feature | Description |
|---|---|
| 🎯 Step-by-Step Execution | Execute forward propagation step by step |
| ⏮️ Time Travel | Step back and re-select the generation path |
| 🎲 Manual Intervention | Manually select candidate words at each token generation to explore different branches |
| 🔍 Logit Lens | Visualize the probability distribution of each layer’s output in the vocabulary space |
| 🔗 Attention Links | Interactively view Self-Attention weight connection relationships |
| 🔥 Heatmap Visualization | Real-time display of Attention Maps and activation value distributions |
💡 This is the meaning of a Whitebox framework: not just inference, but the ability to "dissect" every step of the computation.
🎬 Usage Suggestions
- Explore the Model’s Thought Process: Observe the top-k tokens of each layer’s output during single-step execution to understand how the model gradually "focuses" on the final answer.
- Compare Different Paths: Backtrack and select different candidate words to observe the bifurcation points of the generation results.
- Discover Attention Patterns: Use Attention Links to discover key tokens the model focuses on (e.g., pronoun resolution, context dependencies).
- Debugging and Teaching: Suitable for researchers to understand the internal mechanisms of Transformers, or for teaching demonstrations.
⚠️ Demo Limitations
- Original Pre-trained Version Only: Currently, techniques like quantization are not implemented; it only loads original bf16 weights.
- Relatively Large Model Size: The original model size is about 1.5GB. It is recommended to download the model manually and load it using WebFile or Upload. Qwen3-0.6B Link
🚀 Quick Start
Installation
# Browser environment only needs the core library
# Using pnpm (Recommended)
pnpm add @kandle/core @kandle/backend-webgpu
# Optional: type definitions, utilities, and model-building helpers
pnpm add @kandle/types @kandle/utils @kandle/model-utils
# Or using npm
npm install @kandle/core @kandle/backend-webgpu
# If running in a Node.js environment, install webgpu polyfill additionally
npm install webgpu
Environment Requirements
- Node.js: ≥ 18.0.0 (ES2020+ support required)
- Browser: Chrome/Edge ≥ 113 (WebGPU support)
- TypeScript: ≥ 5.0 (Optional)
Basic Usage Examples
1️⃣ Initialize Backend (WebGPU)
import { env } from "@kandle/core";
import { WebGPUBackend } from "@kandle/backend-webgpu";
export async function initWebGPU() {
const backend = await WebGPUBackend.create();
env.setBackend(backend);
env.setDefaultDevice(backend.name);
}
2️⃣ Tensor Operations and Broadcasting
import * as k from '@kandle/core';
import { Tensor } from '@kandle/core';
// Create Tensor
const a = new Tensor([[1, 2, 3], [4, 5, 6]], { dtype: 'float32' });
const b = k.randn([2, 3]);
// Arithmetic operations (supports broadcasting)
const result = a.add(b).mul(2).softmax(-1);
// Get data (WebGPU asynchronous read)
const data = await result.dataAsync();
console.log(data); // Float32Array [...]
// Shape operations (Zero-copy views)
const transposed = a.transpose(0, 1);
console.log(transposed.shape); // [3, 2]
console.log(a.storageId === transposed.storageId); // true
console.log(a.id === transposed.id); // false
const reshaped = a.reshape([3, 2]);
console.log(reshaped.shape); // [3, 2]
console.log(a.storageId === reshaped.storageId); // true
console.log(a.id === reshaped.id); // false
// Advanced Indexing (Python style)
const slicedContiguous = a.slice(":1, 1:"); // a[:1, 1:]
console.log(slicedContiguous.shape); // [1, 2]
console.log(a.storageId === slicedContiguous.storageId); // true
console.log(a.id === slicedContiguous.id); // false
console.log(slicedContiguous.isContiguous); // true (contiguous here)
// Non-contiguous slicing
const slicedNonContiguous = a.slice("::2, ::-1"); // a[::2, ::-1]
console.log(slicedNonContiguous.shape); // [1, 3]
console.log(a.storageId === slicedNonContiguous.storageId); // true
console.log(a.id === slicedNonContiguous.id); // false
console.log(slicedNonContiguous.isContiguous); // false (non-contiguous here)
3️⃣ Linear Algebra and Matrix Operations
import * as k from '@kandle/core';
// Matrix Multiplication
const x = k.randn([128, 512]);
const weight = k.randn([512, 256]);
const output = k.matmul(x, weight); // [128, 256]
console.log(output.shape);
// Batch Matrix Multiplication
const batch = k.randn([4, 64, 128]);
const weights = k.randn([4, 128, 64]);
const batchOut = k.bmm(batch, weights); // [4, 64, 64]
console.log(batchOut.shape);
// Linear Layer (with bias)
const weightLinear = k.randn([256, 512]);
const bias = k.randn([256]);
const result = k.linear(x, weightLinear, bias);
console.log(result.shape); // [128, 256]
4️⃣ Building Models with nn.Module
import { nn, Tensor, randn } from '@kandle/core';
class MLP extends nn.Module {
fc1: nn.Linear;
fc2: nn.Linear;
constructor(inputDim: number, hiddenDim: number, outputDim: number) {
super();
this.fc1 = new nn.Linear(inputDim, hiddenDim);
this.fc2 = new nn.Linear(hiddenDim, outputDim);
}
async forward(x: Tensor): Promise<Tensor> {
// JS cannot overload operators, so .call() stands in for Python's model(x)
x = await this.fc1.call(x);
x = x.relu();
x = await this.fc2.call(x);
return x;
}
}
// Using the model
const model = new MLP(784, 256, 10);
const input = randn([32, 784]);
const output = await model.call(input);
console.log(output.shape); // [32, 10]
5️⃣ Memory Management (Like tf.tidy)
import * as k from '@kandle/core';
// Automatically release intermediate tensors
const result = k.tidy( () => {
const a = k.randn([1000, 1000]);
const temp1 = a.mul(2);
const temp2 = temp1.add(3);
return temp2.sum(); // Only the sum result is kept, temp1/temp2 are automatically released
});
console.log('Result:', await result.dataAsync());
📦 Monorepo Package Structure
Kandle uses a Monorepo architecture organized with pnpm workspaces. The responsibilities of each package are as follows:
| Package Name | Function Description | Core File |
|---|---|---|
| @kandle/core | 🎨 User-side API, Tensor class, Operators, nn.Module | src/tensor.ts |
| @kandle/backend-webgpu | ⚡ WebGPU Backend Implementation (GPU Compute) | src/index.ts |
| @kandle/types | 📐 Type definitions, Interfaces, OpSchema | src/opschema/ |
| @kandle/utils | 🛠️ Utility functions, dtype handling, shape inference | src/index.ts |
| @kandle/model-utils | 🤖 Model building tools (Qwen3, Whisper) | src/index.ts |
✨ Core Features
1. Complete Tensor Primitive System
Stride Mechanism & Non-Contiguous Memory Support
- ✅ Stride Mechanism: Fully implements PyTorch-style memory layout management.
- ✅ Zero-Copy View Operations: Operations like `transpose`, `permute`, and `slice` do not copy data.
- ✅ Non-Contiguous Memory Computation: Supports direct computation after reshape or slice.
- ✅ Memory Format: Supports Contiguous and ChannelsLast layouts.
// Non-contiguous memory example
const x = randn([4, 3, 224, 224]);
const transposed = x.transpose(1, 2); // Zero-copy, strides changed
const sliced = x.slice("1:-1"); // View operation
// Automatically handles non-contiguous memory computation
const result = transposed.add(1).relu(); // Backend handles strides automatically
Broadcasting Mechanism
Fully compatible with NumPy/PyTorch broadcasting rules:
const a = randn([4, 1, 3]);
const b = randn([3]);
const result = a.add(b); // Automatically broadcasts b to [4, 1, 3]
2. Rich DType Support
💡 Design Philosophy: Logical dtype is separated from physical dtype; the backend automatically selects storage format based on device capabilities.
💡 Quantized types are planned, and storage optimization for bool / int8 / int16 / float16 will be added later.
| DType | TypedArray | WebGPU Storage | Status | Notes |
|---|---|---|---|---|
| `float32` | `Float32Array` | `f32` | ✅ Full | Direct hardware support |
| `float64` | `Float64Array` | `f32` | ⚠️ Downgrade | Downgrades to f32, precision loss exists |
| `float16` | `Uint16Array` | `f16` / `f32` | ⚠️ Device Dependent | Requires `shader-f16` extension |
| `int32` | `Int32Array` | `i32` | ✅ Full | Direct support |
| `uint32` | `Uint32Array` | `u32` | ✅ Full | Direct support |
| `int8` / `uint8` | `Int8Array` / `Uint8Array` | `i32` / `u32` | ⚠️ Extended | Extended storage to 32-bit |
| `int16` / `uint16` | `Int16Array` / `Uint16Array` | `i32` / `u32` | ⚠️ Downgrade | Downgraded storage |
| `complex64` / `complex128` | `Float32Array` / `Float64Array` | `vec2<f32>` | ⚠️ Rudimentary | Interleaved storage `[r0,i0,r1,i1,...]` |
| `bool` | `Uint8Array` | `u32` | ⚠️ Extended | Extended storage |
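A minimal sketch of the logical/physical split in practice; the `to('float32')` call assumes the dtype-conversion signature implied by the utilities list below:

```typescript
import * as k from '@kandle/core';

// Logical dtype is what the user sees; on WebGPU, float64 is physically
// stored as f32 (see the table above — precision loss applies).
const x = k.randn([8], { dtype: 'float64' });
console.log(x.dtype); // 'float64' (logical dtype preserved)

// Explicit conversion via to() (assumed signature: to(dtype)).
const y = x.to('float32');
console.log(y.dtype); // 'float32'
```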
3. 200+ Tensor Operations
💡 This list was compiled with AI-assisted retrieval and may contain omissions or unimplemented items; treat it as indicative rather than authoritative.
💡 The names below are the torch operator names. To match JavaScript conventions, snake_case names are exposed as camelCase.
📐 Arithmetic & Math Operations
Basic Arithmetic: add, sub, mul, div, pow, sqrt, abs, neg, reciprocal, floor, ceil, round, trunc, frac, sign
Trigonometric: sin, cos, tan, asin, acos, atan, atan2
Hyperbolic: sinh, cosh, tanh, asinh, acosh, atanh
Exponential & Logarithmic: exp, exp2, expm1, log, log10, log2, log1p
Special Functions: erf, erfc, sigmoid, logit, i0
🔢 Linear Algebra
Matrix Operations: matmul, mm, bmm, dot, mv, outer, addmm, addmv, baddbmm
Matrix Manipulation: diag, diagonal, trace, tril, triu
Decomposition & Solving (Planned): svd, qr, cholesky, solve
🎲 Reduction Operations
sum, mean, std, var, min, max, argmin, argmax, logsumexp, prod, norm, median, mode, all, any
Supports reduction on specific dimensions and keepdim parameter:
const x = randn([4, 5, 6]);
const result = x.sum(1, true); // Reduce on dim 1, keep dim -> [4, 1, 6]
🔍 Comparison & Logic
Comparison: eq, ne, lt, le, gt, ge, maximum, minimum, clamp
Logic: logical_and, logical_or, logical_not, logical_xor
Conditional Selection: where, masked_fill, masked_select
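A small sketch of conditional selection, assuming PyTorch-like signatures (`maskedFill` is the assumed camelCase form of `masked_fill`, per the naming note above):

```typescript
import * as k from '@kandle/core';

const x = k.randn([2, 3]);
const mask = x.gt(0);                                // elementwise comparison -> boolean mask
const positives = k.where(mask, x, k.zeros([2, 3])); // keep positives, zero the rest
const filled = x.maskedFill(mask, 1);                // assumed camelCase for masked_fill
console.log(positives.shape, filled.shape);          // [2, 3] [2, 3]
```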
🔀 Shape Operations
View Operations (Zero Copy): view, reshape, transpose, permute, squeeze, unsqueeze, flatten
Concatenation & Splitting: cat, stack, split, chunk, unbind
Indexing & Slicing: slice, select, index_select, gather, scatter, masked_select
Repetition & Expansion: repeat, repeat_interleave, expand, tile
Flipping & Rotating: flip, fliplr, flipud, rot90, roll
Advanced: as_strided (Direct stride manipulation)
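A short sketch of concatenation, stacking, and expansion, assuming PyTorch-like signatures:

```typescript
import * as k from '@kandle/core';

const a = k.randn([2, 3]);
const b = k.randn([2, 3]);
const stacked = k.stack([a, b], 0); // new leading dim -> [2, 2, 3]
const joined = k.cat([a, b], 0);    // along an existing dim -> [4, 3]
const row = k.randn([1, 3]);
const grid = row.expand([4, 3]);    // zero-copy broadcast view
console.log(stacked.shape, joined.shape, grid.shape);
```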
🧮 Convolution & Pooling
Convolution: conv1d, conv2d, conv3d, conv_transpose2d, conv_transpose3d
Pooling: max_pool1d, max_pool2d, max_pool3d, avg_pool1d, avg_pool2d, avg_pool3d
Adaptive Pooling: adaptive_avg_pool2d, adaptive_max_pool2d
Padding: pad (Supports constant, reflect, replicate, circular modes)
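A sketch of a convolution plus pooling step, assuming PyTorch-like functional signatures (`maxPool2d` is the assumed camelCase form; stride/padding options may differ in the actual API):

```typescript
import * as k from '@kandle/core';

const img = k.randn([1, 3, 32, 32]);    // NCHW input
const kernels = k.randn([16, 3, 3, 3]); // 16 filters, 3 input channels, 3x3 kernel
const feat = k.conv2d(img, kernels);    // no padding -> [1, 16, 30, 30]
const pooled = k.maxPool2d(feat, 2);    // 2x2 window -> [1, 16, 15, 15]
console.log(pooled.shape);
```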
📊 Normalization
batch_norm, layer_norm, group_norm, instance_norm, rms_norm, normalize
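For instance, a layer-norm call over the last dimension might look like this (a sketch assuming a PyTorch-like `layerNorm(input, normalizedShape)` signature; the actual parameter shape is an assumption):

```typescript
import * as k from '@kandle/core';

const h = k.randn([4, 256]);
const normed = k.layerNorm(h, [256]); // normalize over the last dimension
console.log(normed.shape);            // [4, 256]
```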
⚡ Activation Functions
relu, gelu, silu (swish), elu, selu, leaky_relu, prelu, rrelu, hardtanh, relu6, softplus, softsign, softmax, log_softmax, softmin, sigmoid, tanh, log_sigmoid, hardsigmoid, hardswish, mish, dropout
🎵 FFT (Fast Fourier Transform)
Real FFT: rfft, irfft, rfft2, irfft2
Complex FFT: fft, ifft, fft2, ifft2
Application: Audio signal processing, spectrum analysis
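A small sketch of a real FFT round trip; output sizes follow the usual one-sided convention, and the complex result uses the interleaved storage described in the dtype table:

```typescript
import * as k from '@kandle/core';

const signal = k.randn([16000]);     // 1 second of audio at 16 kHz
const spectrum = k.rfft(signal);     // one-sided spectrum -> [8001], complex dtype
const restored = k.irfft(spectrum);  // back to the time domain -> [16000]
console.log(spectrum.shape, restored.shape);
```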
📈 Cumulative Operations
cumsum, cumprod, cummax, cummin, diff
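For example (assuming `diff` defaults to the last dimension, as in PyTorch):

```typescript
import * as k from '@kandle/core';

const x = k.randn([5]);
const running = x.cumsum(0); // running totals -> [5]
const deltas = x.diff();     // first differences -> [4]
console.log(running.shape, deltas.shape);
```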
🔧 Other Utilities
Sorting: sort, argsort, topk, kthvalue
Unique Values: unique, unique_consecutive
Filling & Cloning: fill_, zero_, clone, detach
Type Conversion: to (dtype/device conversion), contiguous (force contiguous memory)
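A quick sketch of top-k selection, assuming it mirrors PyTorch and returns a values/indices pair (the exact return shape is an assumption):

```typescript
import * as k from '@kandle/core';

const scores = k.randn([100]);
const [values, indices] = k.topk(scores, 5); // assumed values/indices pair
console.log(values.shape, indices.shape);    // [5] [5]
```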
4. Complete nn.Module Ecosystem
Core Base Classes
- `nn.Module`: Base class, supports `forward`, `parameters()`
- `nn.Parameter`: Learnable parameter wrapper
- Containers: `Sequential`, `ModuleList`, `ModuleDict`

💡 `state_dict()` and `load_state_dict()` are hard to align perfectly; refer to the `IO` class API below for model loading.
Implemented Layers
Linear & Embedding Layers
- `nn.Linear`: Fully connected layer
- `nn.Embedding`: Embedding layer

Convolution Layers
- `nn.Conv1d`, `nn.Conv2d`, `nn.Conv3d`
- `nn.ConvTranspose2d`, `nn.ConvTranspose3d`

Pooling Layers
- `nn.MaxPool1d`, `nn.MaxPool2d`, `nn.MaxPool3d`
- `nn.AvgPool1d`, `nn.AvgPool2d`, `nn.AvgPool3d`

Normalization Layers
- `nn.LayerNorm`
- `nn.RMSNorm`

Activation Layers
- `nn.ReLU`, `nn.GELU`, `nn.SiLU`
- `nn.LeakyReLU`, `nn.PReLU`, `nn.Softmax`, `nn.LogSoftmax`
- `nn.Sigmoid`, `nn.Tanh`, `nn.Softplus`, `nn.Mish`
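The containers compose like their PyTorch counterparts. Here is a sketch of the earlier MLP rebuilt with `Sequential`, assuming it accepts modules as constructor arguments:

```typescript
import { nn, randn } from '@kandle/core';

const mlp = new nn.Sequential(
  new nn.Linear(784, 256),
  new nn.ReLU(),
  new nn.Linear(256, 10),
);
const out = await mlp.call(randn([32, 784])); // .call() replaces Python's mlp(x)
console.log(out.shape); // [32, 10]
```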
Hook Mechanism
Supports Forward and Backward Hooks (Backward requires Autograd support):
// Register forward Hook, register_forward_hook
model.registerForwardHook(async (module, input, output) => {
console.log('Layer output shape:', output.shape);
});
// Register forward pre-hook, register_forward_pre_hook
model.registerForwardPreHook(async (module, input) => {
console.log('Layer input shape:', input.shape);
});
Use Cases:
- Feature Visualization (e.g., CAM, Grad-CAM)
- Intermediate Layer Output Extraction
- Model Debugging and Profiling
- Dynamic Layer Replacement
5. audio Module (modeled after torchaudio)
Implements core functionality of PyTorch’s audio processing library:
Transforms
Class API:
- `audio.Spectrogram`: Spectrogram
- `audio.MelScale`: Mel filter bank
- `audio.MelSpectrogram`: Mel spectrogram
- `audio.MFCC`: Mel-frequency cepstral coefficients
- `audio.AmplitudeToDB`: Amplitude to decibels
- `audio.InverseMelScale`: Inverse Mel transform
- `audio.GriffinLim`: Phase reconstruction
- `audio.FrequencyMasking`: Frequency masking (data augmentation)
- `audio.TimeMasking`: Time masking (data augmentation)
Functional API: Corresponding audio.functional.* functions
Usage Example
import { audio, Tensor } from '@kandle/core';
// Assume 3 seconds of audio data
const audioData = new Float32Array(16000 * 3);
const waveform = new Tensor(audioData, { shape: [1, audioData.length] });
// Compute Mel Spectrogram
const melSpec = new audio.MelSpectrogram({
sample_rate: 16000,
n_fft: 400,
hop_length: 160,
n_mels: 80,
});
const melOutput = await melSpec.call(waveform);
console.log(melOutput.shape); // [1, 80, 301]
// Convert to log scale
const ampToDB = new audio.AmplitudeToDB();
const logMel = await ampToDB.call(melOutput);
console.log(logMel.shape); // [1, 80, 301]
6️⃣ Audio Signal Processing
import { audio, Tensor } from '@kandle/core';
// Assume 3 seconds of audio data
const audioData = new Float32Array(16000 * 3);
const waveform = new Tensor(audioData, { shape: [1, audioData.length] });
// Compute Spectrogram
const spectrogram = new audio.Spectrogram({
n_fft: 512,
hop_length: 256,
power: 2.0,
});
const spec = await spectrogram.call(waveform);
console.log(spec.shape); // [1, 257, 188]
// Apply Mel Filter
const melScale = new audio.MelScale({
n_mels: 80,
sample_rate: 16000,
n_stft: 257,
});
const melSpec = await melScale.call(spec);
console.log(melSpec.shape); // [1, 80, 188]
// Compute MFCC
const mfcc = new audio.MFCC({
sample_rate: 16000,
n_mfcc: 13,
n_mels: 40
});
const mfccFeatures = await mfcc.call(waveform);
console.log(mfccFeatures.shape); // [1, 13, 241]
// Data Augmentation: Time Masking
const timeMask = new audio.TimeMasking({ time_mask_param: 10 });
const augmented = await timeMask.call(melSpec);
console.log(augmented.shape); // [1, 80, 188]
6. I/O System
Supported Model Formats
- ✅ Safetensor: HuggingFace's mainstream format, supports shard index (`.safetensors.index.json`)
- ✅ NumPy (`.npy`): Used for test data loading
ByteSource Abstraction
Unified data source interface across platforms:
- `FileByteSource` (Node.js)
- `BlobByteSource` (Web)
- `BufferByteSource` (Memory)
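A hypothetical sketch of wiring a browser file pick into the loader; the constructor shape and whether `loadSafetensor` accepts a ByteSource directly are assumptions, not confirmed API:

```typescript
import { io } from '@kandle/core';

async function loadFromPickedFile(file: File) {
  const source = new io.BlobByteSource(file);    // assumed Blob-backed source
  const group = await io.loadSafetensor(source); // assumed to accept a ByteSource
  group.dumpWeightMap();
  return group;
}
```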
Safetensor Loading Example
import { io } from '@kandle/core';
// Load safetensor (read header only, data not loaded)
const group = await io.loadSafetensor('./model.safetensors');
// View all weights
group.dumpWeightMap();
// Load specific tensor
const layer = group.getLayer('model.embed_tokens.weight');
const tensor = await io.tensorFromSafetensorLayer(layer!, { device: 'webgpu' });
console.log(tensor.shape, tensor.dtype);
// Release resources
group.close();
For full IO usage, see the IO Documentation.
7. Showcase: Full Model Implementation (Aligned with PyTorch)
💡 Design Goal: Constructing these models is not to replace dedicated inference engines, but to demonstrate how Kandle, as a Whitebox Framework, implements model architectures highly aligned with PyTorch.
🤖 Qwen3 (Text Generation)
Qwen3MLP (SwiGLU) Code Comparison: HuggingFace Transformers Official vs. Kandle Implementation
🐍 Python (HuggingFace Transformers)
# Source: huggingface/transformers
# https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen3/modeling_qwen3.py
class Qwen3MLP(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
self.hidden_size = config.hidden_size
self.intermediate_size = config.intermediate_size
self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
self.act_fn = ACT2FN[config.hidden_act]
def forward(self, x):
down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
return down_proj
📘 TypeScript (Kandle)
// @kandle/model-utils
// src/mlp/swiglu.ts
export class SwiGLUMLP extends nn.Module {
gate_proj: nn.Linear;
up_proj: nn.Linear;
down_proj: nn.Linear;
constructor(options: SwiGLUMLPOptions) {
super();
const {
hiddenSize,
intermediateSize,
bias = false,
} = options;
this.hiddenSize = hiddenSize;
this.intermediateSize = intermediateSize;
this.gate_proj = new nn.Linear(hiddenSize, intermediateSize, bias);
this.up_proj = new nn.Linear(hiddenSize, intermediateSize, bias);
this.down_proj = new nn.Linear(intermediateSize, hiddenSize, bias);
this.addModule('gate_proj', this.gate_proj);
this.addModule('up_proj', this.up_proj);
this.addModule('down_proj', this.down_proj);
}
async forward(x: Tensor): Promise<Tensor> {
const gateProj = await this.gate_proj.call(x);
const gate = functional.silu(gateProj);
const up = await this.up_proj.call(x);
const hidden = gate.mul(up);
const output = await this.down_proj.call(hidden);
return output;
}
}
📌 Source Note: Python code referenced from huggingface/transformers - modeling_qwen3.py
Architecture Completeness:
- ✅ `Qwen3DecoderLayer`: Fully implements Attention + MLP + LayerNorm
- ✅ `GroupedQueryAttention`: GQA with RoPE + Q/K RMSNorm
- ✅ `SwiGLUMLP`: SwiGLU activation (`silu(gate) * up`)
- ✅ `nn.RMSNorm`: RMS Normalization
- ✅ Complete forward propagation flow, including KV Cache and Causal Mask
Full Example: playground-web/qwen3/, playground-node/src/qwen3/
import { Qwen3ForCausalLM } from '@kandle/model-utils';
const model = new Qwen3ForCausalLM(config, /* useCausalMask */ true);
await model.loadFromSafetensor(safetensorGroup);
const output = await model.forward(inputIds, {
positionIds,
pastKeyValues,
attentionMask,
});
🎤 Whisper (Speech Recognition)
- Architecture Components: `WhisperEncoder`, `WhisperDecoder`, `WhisperModel`
- Audio Processing: Integrated Mel Spectrogram preprocessing
- Decoding Strategy: Greedy Decoding
- Full Example: playground-node/src/whisper/
import { Whisper, prepareAudioInput } from '@kandle/model-utils';
const model = new Whisper(WHISPER_BASE_CONFIG);
await model.loadFromSafetensor(safetensorGroup);
const melInput = await prepareAudioInput(audioFloat32Array);
const result = await transcribe(model, tokenizer, melInput);
console.log(result.text);
Utility Components
- RoPE: `applyRotaryPosEmb`
- Sinusoidal Positional Encoding: `sinusoidalPositionEncoding`
- KV Cache: `KVCache` (inference acceleration)
- Attention Variants: `multiHeadAttention`, `groupedQueryAttention`, `multiQueryAttention`
- MLP Variants: `SwiGLU`, `GeGLU`
🏗️ Architecture Design
Layered Architecture Diagram
┌─────────────────────────────────────────────────────────┐
│ User API Layer (@kandle/core) │
│ Tensor, zeros, randn, nn.Module, audio... │
└────────────────────┬────────────────────────────────────┘
│
┌────────────────────▼────────────────────────────────────┐
│ Dispatch Layer │
│ Operation routing, dtype resolution, broadcasting │
└────────────────────┬────────────────────────────────────┘
│
┌───────────────┼───────────────┐
│ │ │
┌────▼──────┐ ┌────▼──────┐ ┌────▼──────┐
│ Handler 1 │ │ Handler 2 │ │ Handler N │ (Mechanism-based)
│ Map/Reduce│ │ Composite │ │ FFT │
└────┬──────┘ └────┬──────┘ └────┬──────┘
│ │ │
└───────────────┼───────────────┘
│
┌────────────────────▼────────────────────────────────────┐
│ Kernel Layer │
│ Backend-specific implementations │
└────────────────────┬────────────────────────────────────┘
│
┌───────────┴───────────┐
┌────────▼─────────┐ ┌─────────▼──────────┐
│ @kandle/backend- │ │ @kandle/backend-js │
│ webgpu │ │ (CPU fallback) │
└──────────────────┘ └────────────────────┘
Core Concepts
Separation of Storage & Computation (Storage & Handle)
Referencing PyTorch’s ATen/c10 design:
// 1. Storage: Physical memory
interface IStorage {
data: TypedArray;
byteOffset: number;
byteLength: number;
}
// 2. TensorHandle: Metadata
interface ITensorHandle {
storage: IStorage;
shape: number[];
strides: number[];
offset: number;
dtype: DType;
}
// 3. Tensor: User-side wrapper
class Tensor {
constructor(public handle: ITensorHandle) {}
// View operations modify handle only, no storage copy
transpose(dim0: number, dim1: number): Tensor {
const newStrides = swapStrides(this.handle.strides, dim0, dim1);
return new Tensor({ ...this.handle, strides: newStrides });
}
}
Advantages:
- ✅ Zero-copy view operations
- ✅ Supports non-contiguous memory layouts
- ✅ Flexible memory management strategies
Dispatch System (Simplified Distribution Mechanism)
⚠️ Difference from PyTorch: PyTorch uses a complex Dispatch Key system (e.g., `AutogradCPU`, `AutogradCUDA`) supporting multi-dimensional dispatch (backend, layout, autograd). Kandle currently implements a simplified version based on `opName + device` dispatch.
📝 Architecture Evolution: The current dispatch routing mechanism will be rewritten in future versions, but the core mechanism-based routing philosophy will remain.
Routing by Computation Mechanism:
// packages/utils/src/dispatchUtils.ts
const handlers = {
'map_reduce': MapReduceHandler, // Element-wise + Reduction
'composite': CompositeHandler, // Pure JS composite operations
'fft': FFTHandler, // FFT specialized processing
'conv': ConvolutionHandler, // Convolution specialized
'matmul': MatmulHandler, // Matrix Multiplication specialized
  // ...
};
// Simplified dispatch logic (Non-Dispatch Key)
function dispatch(opSchema: OpSchema, ...args) {
const handler = handlers[opSchema.mechanism];
const backend = getBackendByDevice(args[0].device);
return handler.execute(backend, opSchema, ...args);
}
Current Implementation:
- ✅ Routes to different Handlers by the `mechanism` field
- ✅ Gets the corresponding Backend (webgpu / js) by `device`
- ❌ Does not support PyTorch-style multi-dimensional Dispatch Key
- ❌ Does not support runtime dynamic registration of Dispatch rules (Under development)
DType Resolver (Logical vs. Physical Separation)
Automatically handles dtype conversion and device compatibility:
// User code
const x = randn([100], { dtype: 'float64' });
// Backend actual storage (WebGPU does not support f64)
// Logical dtype: float64
// Physical dtype: float32 (downgrade)
// Upload: Float64Array -> Float32Array (precision loss warning)
// Download: Float32Array -> Float64Array
Features:
- Auto-detects the `shader-f16` extension
- Supports
vec2<f32>mapping for complex types
Codegen System (Reference PyTorch native_functions.yaml)
💡 Design Inspiration: PyTorch uses `native_functions.yaml` to define operator signatures and generates C++ code via torchgen. Kandle implements a similar idea, using TypeScript interfaces as OpSchema and generating user-side APIs via Codegen.
Generator: File Location
Generated Files: File Location
Reduces boilerplate and ensures API consistency:
pnpm codegen
OpSchema Definition Example:
// packages/types/src/opschema/ops/activation.ts
export const gelu: OpEntry = {
name: 'gelu',
mechanism: 'Iterator',
iteratorType: 'Map',
signature: {
params: [
{ name: 'self', type: SchemaT.Tensor() },
{ name: 'approximate', type: SchemaT.String(['none', 'tanh']), default: 'none' },
],
returns: { single: SchemaT.Tensor() },
},
iteratorConfig: {
factory: 'unary',
tensorInputs: ['self'],
scalarArgs: ['approximate'],
},
shape: SchemaShape.same('self'),
dtype: SchemaDtype.same('self'),
dispatchKey: 'gelu',
codegen: { tensorMethod: 'gelu', namespace: 'nn.functional' },
};
Generated Content:
- `methods-gen.ts`: Tensor prototype methods (e.g., `tensor.add()`)
- `ops-gen.ts`: Top-level operation functions (e.g., `add(tensor, other)`)
- `types-gen.ts`: OpSchema type definition summary
Comparison with PyTorch:
| Feature | PyTorch (YAML) | Kandle (TypeScript Interface) |
|---|---|---|
| Definition Format | native_functions.yaml | TypeScript Interface |
| Generation Target | C++ / Python Binding | TypeScript API |
| Type Check | Runtime | Compile-time (TypeScript) |
| Extensibility | ✅ Supports Complex Dispatch | ⚠️ Current Simplified Version |
🎯 Special Handling
1. Python-style Slice Syntax
import { randn } from '@kandle/core';
const x = randn([3, 4, 5]);
// Python: x[:, 1:5, ::2]
// Kandle:
const result = x.slice(":,1:5,::2");
console.log(result.shape); // [3,3,3]
// Supports negative indexing
const tail = x.slice("-5:"); // x[-5:]
console.log(tail.shape); // [3,4,5]
⚠️ Known Limits and Issues
For detailed documentation, see knownIssues/.
1. Async Propagation
Issue: WebGPU’s buffer.mapAsync() forces all data reading to be asynchronous. Impact:
- ✅ `forward` methods are uniformly `async`.
- ❌ Cannot directly read the values of other Tensors inside a kernel (e.g., for conditional logic).
- ❌ Complexity of implementing composite operators increases.
Mitigation:
- Provide synchronous JS backend (Under development).
- Design to avoid operations requiring synchronous reading.
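In practice this means any host-side control flow on tensor values goes through an explicit readback, e.g.:

```typescript
import * as k from '@kandle/core';

const logits = k.randn([10]);
const probs = logits.softmax(-1);
// There is no synchronous read on WebGPU; await the readback first:
const data = await probs.dataAsync();
if (data[0] > 0.5) {
  console.log('branching on a tensor value happens in JS, after the await');
}
```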
Details: knownIssues/async.md
2. DType Downgrading
Issue: WebGPU does not support some dtypes, requiring downgrading or extended storage. Impact:
- `float64` → `float32`: Precision loss.
- `int8` → `i32`: Memory waste (4x).
- `complex128` → `vec2<f32>`: Precision loss.
Recommendation:
- Prioritize `float32` and `int32`.
- Use the JS backend for high precision (under development).
Details: See Core Features - DType Support
3. Rudimentary Complex Support
Issue: Current complex type implementation is basic, only supporting basic arithmetic. Plan: Will refactor the complex number calculation system in future versions.
Details: knownIssues/complex.md
4. Type System Needs Strengthening
Issue: Significant use of as any type assertions. Plan: Gradually strengthen TypeScript type inference and generic constraints.
Details: knownIssues/type.md
5. Dispatch Layer Responsibility Mix
Issue: The current dispatch layer mixes scheduling logic with some computation logic. Plan: Refactor into a pure routing layer.
Details: knownIssues/dispatch.md, knownIssues/opschema.md
6. WebGPU Numerical Stability Issues
Issue: WebGPU backend may produce numerical differences across different hardware/drivers, especially in certain activation functions (like GELU, softmax) and mathematical operations, leading to NaN or precision issues.
Impact:
- ⚠️ Identical models may produce slightly different outputs on different GPU devices.
- ❌ Extreme cases may produce NaN values (e.g., unclamped GELU, softmax exp overflow).
- 🔴 Numerical instability caused by hardware/driver implementation differences appears to be unavoidable at the framework level.
Known Cases:
- GELU Activation NaN: Without limiting tanh input range, large activation values in certain layers can produce NaN (See knownIssues/shader.md).
- Softmax Overflow: If input is not subtracted by max value, exp may overflow to Infinity.
- Precision Loss Accumulation: Float32 precision loss may accumulate after multi-layer computation.
Mitigation:
- ✅ Numerical stability protection added to key operators (e.g., clamp for GELU, subtract max for softmax).
- ⚠️ Use identical hardware for testing and deployment to avoid cross-device result differences.
- 📊 Monitor numerical ranges of key outputs to detect anomalies in time.
- 🔍 Refer to knownIssues/shader.md for detailed troubleshooting guides.
Current Limitations:
- Since the WebGPU specification does not mandate precise floating-point behavior, implementations across drivers/hardware may vary.
- There is currently no excellent solution to completely eliminate this difference; this is an inherent limitation of the WebGPU ecosystem.
Details: knownIssues/shader.md
7. WebGPU VRAM Leaks and Memory Management
Issue: The WebGPU backend suffers from VRAM leaks because the JavaScript side cannot perceive WebGPU side memory pressure.
Root Causes:
- ❌ JS & WebGPU Memory Isolation: JavaScript’s Garbage Collection (GC) mechanism cannot perceive GPU VRAM pressure.
- ❌ FinalizationRegistry Timing Uncontrollable: Even when `FinalizationRegistry` is used to register destructors, the callback timing is entirely decided by GC and may fire only after VRAM is already exhausted.
- ⚠️ Complex View Tensor References: View Tensors created by `transpose`, `slice`, etc., share Storage with the original Tensor, creating reference relationships that make the precise release timing difficult to determine.
Impact:
- ❌ Long inference sessions (e.g., generating 1000+ tokens) may crash due to VRAM exhaustion.
- ⚠️ Even after loading large models, intermediate Tensors that are no longer used may still occupy VRAM.
- ⚠️ View operations (like `view()`, `transpose()`) extend the lifecycle of the original Storage even though they don't copy data.
My Optimization Attempts:
- ⚠️ Implemented a complex Memory Pool mechanism to reuse GPU Buffers, but it didn’t achieve practical results, so it is disabled in the current release. See File Location.
- ✅ Provided `tidy()` and manual `dispose()` APIs.
- ✅ Attempted to optimize reference counting for View Tensors.
- ⚠️ But problems persist: Due to the inherent limitation of JS/WebGPU memory isolation, perfect automatic management is impossible.
Mitigation (User Cooperation Required):
- Highly Recommended: Use `tidy()` to wrap computation logic so intermediate Tensor lifecycles are managed automatically.
const result = tidy(() => {
const temp1 = a.mul(2);
const temp2 = temp1.add(3);
return temp2.sum(); // Only the sum result is kept
});
- Explicitly call `dispose()` to release unused Tensors.
const temp = a.mul(2);
const result = temp.add(3);
temp.dispose(); // Manual release
- Periodically monitor VRAM usage (Chrome DevTools → Performance Monitor).
- Avoid creating massive temporary Tensors in loops without releasing them.
Long-term Plan:
- Optimize Memory Pool strategy for more aggressive memory reclamation.
- Improve reference tracking mechanism for View Tensors.
Looking for expert advice!
Details: knownIssues/cache.md
🌐 Browser Compatibility
WebGPU Support Status
| Browser | Minimum Version | Notes |
|---|---|---|
| Chrome | 113+ | ✅ Full Support |
| Edge | 113+ | ✅ Full Support |
| Safari | Preview | ⚠️ Partial Support (macOS 14+) |
| Firefox | Experimental | ⚠️ Requires Manual Enable |
📚 Example Projects
Web Environment: Qwen3 Text Generation
Location: playground-web/qwen3/
cd playground-web
pnpm install
pnpm dev
# Access http://localhost:5173/qwen3/
Features:
- WebGPU accelerated text generation
- Streaming output support
- Visualized Attention weights
Node.js Environment: Whisper Speech Recognition
Location: playground-node/src/whisper/
cd playground-node
pnpm install
pnpm start
Features:
- Loads local audio files
- Mel Spectrogram preprocessing
- End-to-end speech-to-text
🚀 Roadmap
🔨 In Development (Current Version)
Architecture Refactoring: Further optimize layered design, refine Codegen system and type inference.
Autograd (Automatic Differentiation): Backpropagation system supporting gradient calculation and parameter optimization.
- Currently implementing an auto-differentiation system based on derivatives.yaml.
- Designing a TypeScript version of the parser for PyTorch's derivative DSL (complex; AI may implement all primitive operators faster).
- Automatically generating backpropagation operators from derivatives.yaml to keep behavior consistent with PyTorch.
- Goal: cover gradient definitions for most common forward operators and support higher-order derivatives.
nn.Module Enhancements:
- ✅ Generator-implemented layer-by-layer debugging.
- 🚧 Runtime Module Swapping.
- 🚧 State Checkpoints.
Custom Kernel Registration: Runtime custom kernel registration, supporting Fused Kernel optimization.
Pure JS Backend Completion: Fully synchronous CPU computation backend (analogous to PyTorch CPU).
Domain Module Completion: Continue refining the audio module (modeled after torchaudio) and the vision module (modeled after torchvision).
📅 Short-term Plan (3-6 Months)
Quantization Support:
- int4 / int8 quantization dtypes.
- Dynamic Quantization.
- Static Quantization.

Independent Scalar Math Library: Solve type conversion issues for mixed-dtype calculations in JS.

Performance Optimization:
- Kernel Fusion.
- Memory Pool Optimization.
- Shader Cache System.
🌟 Long-term Plan (6-12 Months)
Remote Backend: Distributed computing backend based on WebSocket/gRPC.
Training API: Complete training loop support (requires Autograd completion).
NumPy API Compatibility Layer: Reuse computation dispatch architecture, add numpy operators, exposed via namespace import { np } from '@kandle/core'.
Model Interpretability UI Component Library (React-based):
- Heatmap Visualization.
- Feature Map display.
- Attention Weight Visualization.
- Inference Process Animation.

Pre-trained Model Ecosystem:
- Launch an independent @kandle/models package, implementing functionality similar to HuggingFace Transformers.
- Provide out-of-the-box pre-trained models (LLaMA, BERT, ViT, Whisper, etc.).
- Support loading models and configs directly from HuggingFace Hub.
- Unified model loading and inference interface.

GitHub Agent Automated Workflow:
- Implement an intelligent GitHub Agent that listens for specific Issue/PR formats.
- When an operator request matches, automatically trigger the Agent to:
  - Search relevant technical docs and PyTorch implementations.
  - Generate operator definitions (OpSchema).
  - Implement kernels (WebGPU/JS dual backend).
  - Automatically generate functional tests and numerical validation cases.
  - Submit a Pull Request for human review.
- Lower the community contribution threshold and accelerate operator ecosystem construction.
🎭 API Design Principles
Code Style Note
⚠️ Naming Convention Transition: As a side effect of Vibe Coding, the current code contains a mix of `snake_case` and `camelCase`. I will gradually unify this to `camelCase` in future versions to match JavaScript/TypeScript community conventions.
Compromises for JavaScript Localization
Due to language differences between JavaScript and Python, some APIs cannot be perfectly aligned:
1. Parameter Naming
Python (Keyword Arguments):
torch.zeros(size=(3, 4), dtype=torch.float32, device='cuda')
JavaScript (Object Arguments):
zeros([3, 4], { dtype: 'float32', device: 'webgpu' })
2. Operator Overloading
Since JavaScript does not support operator overloading, basic operations require explicit method calls:
| Python | TypeScript (Kandle) |
|---|---|
| `a + b` | `add(a, b)` or `a.add(b)` |
| `a - b` | `sub(a, b)` or `a.sub(b)` |
| `a * b` | `mul(a, b)` or `a.mul(b)` |
| `a / b` | `div(a, b)` or `a.div(b)` |
| `a @ b` | `matmul(a, b)` or `a.matmul(b)` |
| `model(x)` | `model.call(x)` |
💡 `nn.Module`'s `__call__` must be invoked explicitly via the `.call()` method.
3. Slicing Syntax
Python:
x[:, 1:5]
JavaScript (Function Simulation):
x.slice(":,1:5")
API Evolution in Future Versions
Regarding parameter positioning, two options are considered:
- Full Alignment with Torch: Attempt complete alignment via complex overloading.
  Most APIs are feasible, but the implementation is overly complex, and a few APIs will still fail to align, requiring separate memorization and leading to an inconsistent experience.
- Design a JS Specification: Define a JS-native reference specification and enforce "alignment after translation" via rules.
  Simpler to develop, but degrades the experience and lowers alignment with Torch.
⚡ Performance
Design Trade-offs
Kandle uses Eager Mode (dynamic graph) execution, which differs fundamentally from static graph inference engines:
| Feature | Eager Mode (Kandle) | Static Graph (ONNX) |
|---|---|---|
| Execution Style | Op-by-Op execution | One-time graph optimization |
| Intermediate State | ✅ Accessible anytime | ❌ Invisible after compilation |
| Dynamic Control Flow | ✅ Supports if/loop | ⚠️ Limited |
| Memory Overhead | ⚠️ High (keeps intermediate results) | ✅ Low after optimization |
| Inference Speed | ⚠️ Slower (no global optimization) | ✅ Extreme optimization |
| Debugging Experience | ✅ Excellent | ❌ Difficult |
Applicable Scenarios
✅ Recommend Kandle:
- Research and Prototype Development
- Model Debugging and Interpretability Analysis
- Applications requiring intermediate calculations (e.g., Audio Preprocessing + Model Inference)
- Teaching and Learning
❌ Do Not Recommend Kandle:
- High-performance production inference (Please use ONNX Runtime)
- Mobile/Edge devices (Please use WebLLM or TFLite)
- Real-time applications strictly sensitive to latency
Performance Optimization Suggestions
- Avoid Unnecessary Data Reads: Reduce `dataAsync()` calls.
- Use `tidy()` for Memory: Automatically release intermediate tensors.
- Batch Inference: Increase batch size to improve GPU utilization (see the sketch below).
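A sketch combining these suggestions: batch the inputs, keep intermediates inside `tidy()`, and read back once at the end:

```typescript
import * as k from '@kandle/core';

const result = k.tidy(() => {
  const batch = k.randn([32, 784]);   // one batched input instead of 32 separate calls
  const weights = k.randn([784, 10]);
  return k.matmul(batch, weights).softmax(-1); // only the returned tensor is kept
});
const host = await result.dataAsync(); // single readback at the end
console.log(host.length);              // 320
```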
🤖 About AI Assisted Development
Vibe Coding Practice and Exploration
💡 This is also an exploration of the limits of Vibe Coding.
This project adopts the Vibe Coding development mode, attempting to explore the boundaries of AI-assisted development:
- **Architec