LUT Compression
APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing
⚡Parallel Computing Content type: Academic
Less-relevant results
zhongkaifu/TensorSharp: A C# inference engine for running large language models (LLMs) locally using GGUF model files. TensorSharp provides a console application, a web-based chatbot interface, and Ollama/OpenAI-compatible HTTP APIs for programmatic access. It supports Windows/MacOS/Linux with full GPU capability
💻Local LLMs Content type: Code
SET: Stream-Event-Triggered Scheduling for Efficient CUDA Graph Pipelines
⚡Parallel Computing Content type: Academic
On GPU Implementation for Multi-Precision Integer Division
⚡Parallel Computing Content type: Academic
CodegenBench: Can LLMs Write Efficient Code Across Architectures?
⚡Parallel Computing Content type: Academic
Graph Traversal on Tensor Cores: A BFS Framework for Modern GPUs
⚡Parallel Computing Content type: Academic