tensors and such
Tensor primitives with arbitrary views, shapes, and strides. CUDA and CPU backends.
The goal is a high-performance ML stack with minimal dependencies and maximal flexibility.
todo
- Backend abstraction
- Nicer syntax. Macro time: tset!(tensor.view_mut(), v: 99, 1, 2) and tget!(tensor.view(), 1, 2)
- Basic GPU backend
- Slicing with ranges: tensor.view().slice(0, 1..3), etc.
- Test more slicing syntaxes
- Slicing macro
- Elementwise broadcasting
- Basic linear algebra helpers
- Accelerated backends (GPU / parallel) in progress
- x86 SIMD paths (currently relying on LLVM auto-vectorization on CPU, which only works for contiguous memory)
- Multiple GPU devices allowed
- Do not lock thread on GPU dispatch
- Strides and offset as bytes
- Broadcasting
- Idx should not be taken by ref; that makes it less ergonomic
- Pull ops out into crate-defined traits that return Result, and call those from the Add and AddAssign impls (panic there); see the sketch after this list
- Figure out bool types, and in general types without Add, Sub, and Mul impls
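A minimal sketch of that trait split, assuming a fallible crate-defined op trait that the std operator impls forward to; the names (TryAdd, ShapeMismatch) and the toy Tensor type below are illustrative stand-ins, not this crate's API.

use std::ops::Add;

#[derive(Debug)]
struct ShapeMismatch;

struct Tensor { data: Vec<f32>, shape: Vec<usize> }

// Crate-defined fallible op trait (illustrative).
trait TryAdd<Rhs = Self> {
    type Output;
    fn try_add(self, rhs: Rhs) -> Result<Self::Output, ShapeMismatch>;
}

impl TryAdd for Tensor {
    type Output = Tensor;
    fn try_add(self, rhs: Tensor) -> Result<Tensor, ShapeMismatch> {
        if self.shape != rhs.shape {
            return Err(ShapeMismatch);
        }
        let data = self.data.iter().zip(&rhs.data).map(|(a, b)| a + b).collect();
        Ok(Tensor { data, shape: self.shape })
    }
}

// The std Add impl just forwards to the fallible trait and panics on error.
impl Add for Tensor {
    type Output = Tensor;
    fn add(self, rhs: Tensor) -> Tensor {
        self.try_add(rhs).expect("tensor add failed")
    }
}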
to optimize
- Collapse dims to allow the backend to access contiguous fast paths (see the sketch after this list)
- view_to_owned can probably be optimized to copy larger chunks at once
- Restrict tensor values to require basic operations
- perf sucks for contiguous memory in unary CPU - fix
- perf sucks for contiguous memory in unary CUDA - fix
- perf sucks for non-contiguous memory in unary CPU - fix
- perf sucks for non-contiguous memory in unary CUDA - fix
- O(rank * size) instead of O(size) broadcasting ops is bad
- Broadcast CUDA kernel puts a cap on tensor dim size - fix
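A sketch of the dim-collapse idea from the list above, assuming per-dimension strides stored in element units (the function and its signature are illustrative, not this crate's code): adjacent dims whose strides line up contiguously are merged, so the backend sees fewer, larger dims and can hit contiguous fast paths.

// Illustrative only: merge dim i into the dim before it whenever the outer
// stride steps exactly over it (prev_stride == shape[i] * stride[i]).
fn collapse_dims(shape: &[usize], strides: &[isize]) -> (Vec<usize>, Vec<isize>) {
    let mut out_shape: Vec<usize> = Vec::new();
    let mut out_strides: Vec<isize> = Vec::new();
    for (&dim, &stride) in shape.iter().zip(strides) {
        let mergeable = out_strides
            .last()
            .map_or(false, |&prev| prev == stride * dim as isize);
        if mergeable {
            *out_shape.last_mut().unwrap() *= dim;     // fuse into the previous dim
            *out_strides.last_mut().unwrap() = stride; // fused dim steps by the inner stride
        } else {
            out_shape.push(dim);
            out_strides.push(stride);
        }
    }
    (out_shape, out_strides)
}

For example, a contiguous (2, 3, 4) tensor with element strides (12, 4, 1) collapses to a single dim of 24 with stride 1, which is the case the unary CPU and CUDA paths above want to hit.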
missing tests
- broadcasting large tensor
- broadcasting scalar
- broadcast after slicing
- broadcast after negative step slicing
Some examples
Note that they may not reflect the most up-to-date API and syntax (assume reality can only be better).
Creating Tensors
let zeros = TensorBase::<f32, Cpu>::zeros((3, 4)); // 3x4 tensor of zeros
let cpu_ones = CpuTensor::<f32>::ones((2, 2)); // 2x2 tensor of ones
let gpu_max = CudaTensor::<f32>::max((5, 5)); // 5x5 tensor of maximum f32 values on GPU
let gpu_min = CudaTensor::<i32>::min((10, 1)); // 10x1 tensor of minimum i32 values on GPU
let cpu_tensor = CpuTensor::<f64>::from_buf(vec![1.0, 2.0, 3.0, 4.0], (2, 2));
let gpu_tensor = cpu_tensor.cuda();
let cpu2 = gpu_tensor.cpu();
let buf = vec![0.0f32; 16];
let tensor = TensorBase::<f32, Cuda>::from_buf(buf, (4, 4));
Arithmetic Operations
let mut a = CpuTensor::<f32>::ones((2, 2));
a *= 3.0;
a += 2.0;
a -= 1.0;
let b: CpuTensor<f32> = a * 2.0;
Slicing Syntax
// Basic range slicing
// slice(dimension, range), so modifications happen on the dimension specified
tensor.slice(0, 2..5) // Select elements at indices 2, 3, 4
tensor.slice(0, 0..3) // First 3 elements
tensor.slice(0, ..5) // Elements from start to index 5 (exclusive)
tensor.slice(0, 2..) // Elements from index 2 to end
tensor.slice(0, ..) // All elements (full dimension)
// Inclusive ranges
tensor.slice(0, 2..=5) // Select elements 2, 3, 4, 5 (inclusive end)
tensor.slice(0, 0..=2) // First 3 elements (indices 0, 1, 2)
// Single index slicing
tensor.slice(0, 3) // Select only element at index 3 (reduces dimension)
// Auto-reverse: reversed ranges automatically use negative step
tensor.slice(0, 5..2) // Elements 5, 4, 3 (auto step=-1)
tensor.slice(0, 9..=0) // Elements 9, 8, 7, ..., 1, 0 (auto step=-1)
// Explicit negative step using Slice builder
tensor.slice(0, Slice::from(..).step(-1)) // Reverse entire dimension
tensor.slice(0, Slice::from(8..).step(-1)) // From index 8 to start, reversed
tensor.slice(0, Slice::from(..5).step(-1)) // last element to index 5, reversed
// Custom step values
tensor.slice(0, Slice::from(..).step(2)) // Every other element (0, 2, 4, 6, ...)
tensor.slice(0, Slice::from(..).step(-2)) // Every other element, reversed
tensor.slice(0, Slice::from(1..8).step(3)) // Elements at 1, 4, 7
// Manual slice construction
tensor.slice(0, Slice::new(Some(8), Some(2), Some(-2))) // From 8 to 2, step -2: [8, 6, 4]
tensor.slice(0, Slice::new(None, None, Some(-1))) // Full reverse
tensor.slice(0, Slice::new(Some(5), None, Some(1))) // From 5 to end
// Edge cases
tensor.slice(0, Slice::from(1..3).step(-1)) // Empty slice (start < end with negative step)
tensor.slice(0, Slice::from(5..5)) // Empty slice (start == end)
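Under the hood, slicing a strided view should only require adjusting metadata rather than copying data. A rough sketch of the positive-step case, using a hypothetical View struct rather than this crate's actual types:

// Hypothetical view metadata; field names are illustrative.
struct View { offset: isize, shape: Vec<usize>, strides: Vec<isize> }

// slice(dim, start..end stepped by `step`) only moves the offset, shortens
// the sliced dim, and scales that dim's stride.
fn slice_dim(v: &View, dim: usize, start: usize, end: usize, step: usize) -> View {
    let mut out = View {
        offset: v.offset + start as isize * v.strides[dim],
        shape: v.shape.clone(),
        strides: v.strides.clone(),
    };
    out.shape[dim] = (end - start).div_ceil(step); // e.g. 1..8 step 3 -> 3 elements (1, 4, 7)
    out.strides[dim] *= step as isize;
    out
}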
Broadcasting
Broadcasting Rules
- If tensors have different ranks, prepend 1s to the shape of the smaller rank tensor
- For each dimension, the sizes must either:
- Be equal, OR
- One of them must be 1 (which gets broadcasted)
For in-place ops, the tensor being modified must have the same shape as the result shape after broadcasting.
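A minimal sketch of how the two rules combine into a result shape (the helper name and signature are illustrative, not this crate's API):

// Illustrative only: compute the broadcast result shape from the rules above,
// or None if the shapes are incompatible.
fn broadcast_shape(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let rank = a.len().max(b.len());
    let mut out = Vec::with_capacity(rank);
    for i in 0..rank {
        // Rule 1: missing leading dims of the smaller-rank shape count as 1.
        let da = if i < rank - a.len() { 1 } else { a[i - (rank - a.len())] };
        let db = if i < rank - b.len() { 1 } else { b[i - (rank - b.len())] };
        // Rule 2: dims must be equal, or one of them must be 1.
        out.push(match (da, db) {
            (x, y) if x == y => x,
            (1, y) => y,
            (x, 1) => x,
            _ => return None,
        });
    }
    Some(out)
}

// broadcast_shape(&[1, 3, 1, 5], &[2, 1, 4, 1]) == Some(vec![2, 3, 4, 5])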
Basic Broadcasting Examples
// Scalar to any shape - broadcasts everywhere
let scalar = CpuTensor::<f32>::from_buf(vec![5.0], vec![]).unwrap();
let matrix = CpuTensor::<f32>::ones((3, 4));
let result = scalar + matrix; // Shape: (3, 4), all values are 6.0
// Vector to matrix - broadcasts along rows
let vector = CpuTensor::<f32>::from_buf(vec![1.0, 2.0, 3.0], vec![3]).unwrap();
let matrix = CpuTensor::<f32>::ones((4, 3));
let result = vector + matrix; // Shape: (4, 3), each row is [2, 3, 4]
// Row vector + Column vector = Matrix (also, this works with views - for example, slices)
let row = CpuTensor::<f32>::ones((1, 4)); // Shape: (1, 4)
let col = CpuTensor::<f32>::ones((3, 1)); // Shape: (3, 1)
let result = row.view() + col.view(); // Shape: (3, 4), all 2.0
// Matrix broadcasts to 3D tensor
let matrix = CpuTensor::<f32>::ones((3, 4)); // Shape: (3, 4)
let tensor_3d = CpuTensor::<f32>::ones((2, 3, 4)); // Shape: (2, 3, 4)
let result = matrix + tensor_3d; // Shape: (2, 3, 4)
// Singleton dimensions broadcast
let a = CpuTensor::<f32>::ones((1, 3, 1)); // Shape: (1, 3, 1)
let b = CpuTensor::<f32>::ones((2, 3, 4)); // Shape: (2, 3, 4)
let result = a.view() + b; // Shape: (2, 3, 4)
// 4D Broadcasting
let vector = CpuTensor::<f32>::from_buf(vec![10.0, 20.0], vec![2]).unwrap();
let tensor_4d = CpuTensor::<f32>::ones((3, 4, 5, 2));
let result = vector + tensor_4d.view(); // Shape: (3, 4, 5, 2)
// Complex singleton pattern
let a = CpuTensor::<f32>::ones((1, 1, 5, 1)); // Shape: (1, 1, 5, 1)
let b = CpuTensor::<f32>::ones((2, 3, 5, 4)); // Shape: (2, 3, 5, 4)
let result = a + b; // Shape: (2, 3, 5, 4)
// Classic: row vector + column vector
let row = CpuTensor::<f32>::ones((1, 4)); // Shape: (1, 4)
let col = CpuTensor::<f32>::ones((3, 1)); // Shape: (3, 1)
let result = row + col; // Shape: (3, 4)
// 3D mutual broadcasting
let a = CpuTensor::<f32>::ones((1, 4, 5)); // Shape: (1, 4, 5)
let b = CpuTensor::<f32>::ones((3, 1, 5)); // Shape: (3, 1, 5)
let result = a + b; // Shape: (3, 4, 5)
// Complex 4D mutual broadcasting
let a = CpuTensor::<f32>::ones((1, 3, 1, 5)); // Shape: (1, 3, 1, 5)
let b = CpuTensor::<f32>::ones((2, 1, 4, 1)); // Shape: (2, 1, 4, 1)
let result = a + b; // Shape: (2, 3, 4, 5)
Inplace Broadcasting Operations
let mut a = CpuTensor::<f32>::zeros((3, 4));
let b = CpuTensor::<f32>::ones((4,)); // Vector of shape (4)
a += b; // b broadcasts along rows of a
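Per the in-place rule above, the opposite direction is not allowed: the target of an in-place op cannot itself be broadcast up to a larger result shape. Illustrative only; the exact failure mode is not shown here.

// let mut v = CpuTensor::<f32>::zeros((4,));
// v += CpuTensor::<f32>::ones((3, 4)); // rejected: result shape (3, 4) != target shape (4,)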