Please don't follow the ubergarm2 organization; it is just a temporary workaround until Hugging Face helps me clear up this error I get when trying to upload:

403 Forbidden: Your storage patterns tripped our internal systems! Please contact us at website@huggingface.co so we can verify your account and unlock more storage for your use-case.

This repo will hopefully become ubergarm/Ling-1T-GGUF sooner rather than later so I can continue uploading some more quants.
ik_llama.cpp imatrix Quantizations of inclusionAI/Ling-1T

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!
NOTE: ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants.
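For example, something like this minimal sketch should work with a quant you already have lying around (the file path and quant name here are placeholders, not files from this repo):

# Hypothetical example: serve an existing mainline GGUF with ik_llama.cpp's llama-server
# (the model path below is a placeholder; point it at whatever GGUF you already have)
./build/bin/llama-server \
    --model /path/to/your/existing-model-Q4_K_M.gguf \
    --ctx-size 8192 \
    -fa \
    --host 127.0.0.1 \
    --port 8080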
Some of ik's new quants are supported by the Nexesenex/croco.cpp fork of KoboldCpp, which ships Windows builds for CUDA 12.9. Also check for Windows builds by Thireus here, which have been CUDA 12.8.
These quants provide best in class perplexity for the given memory footprint.
Big Thanks
Shout out to Wendell and the Level1Techs crew, the community Forums, and the YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!
Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!
Quant Collection
Perplexity computed against wiki.test.raw.
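For reference, these numbers come from ik_llama.cpp's llama-perplexity tool run against wiki.test.raw, roughly along these lines (the model path and thread count here are placeholders, not the exact command used):

# Sketch of a perplexity run against wiki.test.raw (paths/threads are placeholders)
./build/bin/llama-perplexity \
    --model /path/to/Ling-1T-smol-IQ2_KS.gguf \
    -f wiki.test.raw \
    -fa \
    --threads 128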
This one is just a test quant for baseline perplexity comparison:

- Q8_0 989.678 GiB (8.504 BPW)
  Final estimate: PPL = 1.9859 +/- 0.00907

- smol-IQ4_KSS TODO
  Final estimate: PPL = TODO

- IQ2_K TODO
  Final estimate: PPL = TODO
  This one will use full q8_0 for the VRAM layers and likely suit about 384 GiB of combined RAM+VRAM.

- smol-IQ2_KS 264.984 GiB (2.277 BPW)
  Final estimate: PPL = 2.4429 +/- 0.01191
  Should hopefully fit in 250 GiB RAM + 15 GiB VRAM + kv-cache/context... 🤞
  Leaving the attn.*/first 4 dense layers/shexp at full q8_0 would take about 20.1 GiB VRAM; I might do some other quants like that for folks with more VRAM (see the sketch after the recipe below).
👈 Secret Recipe
custom="
# 80 Repeating Layers [0-79]
# Attention
blk\.(0|1|2|3)\.attn_qkv.*=q8_0
blk\.(0|1|2|3)\.attn_output.*=q8_0
blk\..*\.attn_qkv.*=iq6_k
blk\..*\.attn_output.*=iq6_k
# First 4 Dense Layers [0-3]
blk\..*\.ffn_down\.weight=iq5_ks
blk\..*\.ffn_(gate|up)\.weight=iq5_ks
# Shared Expert Layers [3-79]
blk\..*\.ffn_down_shexp\.weight=iq5_ks
blk\..*\.ffn_(gate|up)_shexp\.weight=iq5_ks
# Routed Experts Layers [3-79]
blk\..*\.ffn_down_exps\.weight=iq2_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_ks
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
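# the block above strips the comment lines and joins the remaining rules into a single comma-separated string for --custom-q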
numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/data/models/ubergarm/Ling-1T-GGUF/imatrix-Ling-1T-Q8_0.dat \
/mnt/data/models/ubergarm/Ling-1T-GGUF/Ling-1T-BF16-00001-of-00046.gguf \
/mnt/data/models/ubergarm/Ling-1T-GGUF/Ling-1T-smol-IQ2_KS.gguf \
IQ2_KS \
192
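As a rough sketch of the "more VRAM" variant mentioned above (an assumption of what such a recipe could look like, not a quant I have published), the attention, first-4-dense-layer, and shexp rules would simply move up to full q8_0 while the routed experts stay at iq2_ks:

# Hypothetical recipe variant: all attn/dense/shexp tensors at full q8_0 (~20.1 GiB of VRAM-resident weights),
# routed experts unchanged at iq2_ks
custom="
# Attention
blk\..*\.attn_qkv.*=q8_0
blk\..*\.attn_output.*=q8_0
# First 4 Dense Layers [0-3]
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0
# Shared Expert Layers [3-79]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0
# Routed Experts Layers [3-79]
blk\..*\.ffn_down_exps\.weight=iq2_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_ks
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"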
Quick Start
# Clone and checkout
$ git clone https://github.com/ikawrakow/ik_llama.cpp
$ cd ik_llama.cpp
# Build for hybrid CPU+CUDA
$ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
$ cmake --build build --config Release -j $(nproc)
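If you are running CPU-only and do not need CUDA at all, the build is the same minus the CUDA flag (GGML_CUDA is off by default):

# CPU-only build: simply drop -DGGML_CUDA=ON
$ cmake -B build -DCMAKE_BUILD_TYPE=Release
$ cmake --build build --config Release -j $(nproc)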
# CPU-Only Inference
# `-ger` is still fresh:
# https://github.com/ikawrakow/ik_llama.cpp/pull/836
# Omit numactl and `--numa ...` if you have only a single NUMA node
# set batches/threads/kv cache as desired
# NOTE: multiple slots e.g. `--parallel 2` may cause an error after canceling a generation and then starting a new one at the moment
SOCKET=0
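# set $model to the path of your downloaded Ling-1T GGUF before running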
numactl -N "$SOCKET" -m "$SOCKET" \
./build/bin/llama-server \
--model "$model" \
--alias ubergarm/Ling-1T-GGUF \
--ctx-size 65536 \
-fa -fmoe -ger \
-ctk q8_0 -ctv q8_0 \
-ub 4096 -b 4096 \
--parallel 1 \
--threads 128 \
--threads-batch 192 \
--numa numactl \
--host 127.0.0.1 \
--port 8080 \
--no-mmap \
--no-display-prompt
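Once the server is up, you can sanity-check it against the OpenAI-compatible chat endpoint that llama-server exposes (the model field just echoes the --alias set above):

# Quick smoke test against the running server (uses the host/port configured above)
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ubergarm/Ling-1T-GGUF",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 64
      }'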
# Hybrid GPU+CPU Inference
# WARNING: Haven't tested this personally yet...
# `-ger` on CUDA may not be merged yet:
# https://github.com/ikawrakow/ik_llama.cpp/pull/838
# Omit numactl and `--numa ...` if you have only a single NUMA node
# set batches/threads/kv cache as desired
# NOTE: multiple slots e.g. `--parallel 2` may cause an error after canceling a generation and then starting a new one at the moment
SOCKET=0
numactl -N "$SOCKET" -m "$SOCKET" \
./build/bin/llama-server \
--model "$model" \
--alias ubergarm/Ling-1T-GGUF \
--ctx-size 65536 \
-fa -fmoe -ger \
-ctk q8_0 -ctv q8_0 \
-ub 4096 -b 4096 \
-ngl 99 \
-ot "blk\.(4|5|6)\.ffn_.*=CUDA0" \
-ot "blk\.(7|8|9)\.ffn_.*=CUDA1" \
-ot exps=CPU \
--parallel 1 \
--threads 128 \
--threads-batch 192 \
--numa numactl \
--host 127.0.0.1 \
--port 8080 \
--no-mmap \
--no-display-prompt
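The -ot overrides above pin the FFN tensors of layers 4-6 and 7-9 onto CUDA0 and CUDA1 and keep every remaining routed expert on the CPU; if your VRAM budget differs, widening or narrowing those layer ranges is the usual knob to turn. A hypothetical single-GPU variant (untested, layer range is illustrative only) would swap in something like:

# Hypothetical single-GPU override flags: offload the layer 4-8 FFN tensors to CUDA0,
# keep all other routed experts on CPU (adjust the layer range to your VRAM)
-ngl 99 \
-ot "blk\.(4|5|6|7|8)\.ffn_.*=CUDA0" \
-ot exps=CPU \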
# Optional: add this flag once after downloading to confirm the files are good
--validate-quants
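If you still need to grab the files in the first place, the usual Hugging Face CLI pattern below should work; note the repo id reflects the temporary ubergarm2 organization mentioned at the top, and the --include pattern is only a guess at the folder layout, so check the actual repo page first:

# Hedged download example: repo id and include pattern are assumptions, adjust as needed
pip install -U "huggingface_hub[cli]"
huggingface-cli download ubergarm2/Ling-1T-GGUF \
  --include "smol-IQ2_KS/*" \
  --local-dir ./Ling-1T-GGUF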