Quick update: I’ve fixed an issue where the chat template wasn’t included in the quants; the first shard of each quant has been updated to include it. Please re-download the first shard to pick up the fix. Sorry for the inconvenience.
This is a "derestricted" abliteration of GLM-4.6, using Jim Lai’s norm-preserving biprojected abliteration technique. For more information, you can read his blog post here
Essentially, I was going for a lighter abliteration. This doesn’t mean the model is 100% unrestricted zero-shot: it should be more "permissive" than stock GLM-4.6, but it probably still needs a system prompt to nudge it in the right direction. In my own testing I’ve mainly used this model for creative writing, and I’ve noticed a positive change in sentence structure compared to base GLM-4.6; the output feels more varied and organic. It does not particularly reduce or alter "slop", since this isn’t a finetune, but there’s much less of an "assistant" voice performing soft-censorship in certain scenarios, and it feels less like "LLM writing". I’ve only done some light technical assistant work with it, and it still feels competent there, but I haven’t benchmarked it exhaustively.
Visualized here is the analysis of the refusal direction:
Provided in this repository are several quants produced from the abliteration I performed, as well as the measurements and the config I used, so you can produce your own abliteration if you want. I chose to ablate layers 30-45, using the measurement from layer 37 because of its SNR peak. Other measurements I tried showed an interesting dual-peak phenomenon, with a second peak forming around layer 46, but there the overall SNR magnitude was only ~0.16, compared to the much better 0.25 peak present here.
If you want to abliterate GLM-4.6 yourself, you will need to download the safetensors for the model and use this PR.
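For a rough sense of what the ablation step does mechanically, below is a minimal PyTorch sketch of norm-preserving directional ablation over the chosen layer range. This is a simplified, single-projection illustration under my own assumptions, not the exact biprojected procedure implemented in the PR; the module names in the usage comment are hypothetical and don’t reflect GLM-4.6’s actual MoE layout.

```python
import torch

@torch.no_grad()
def ablate_norm_preserving(W: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of a weight matrix that writes into the
    residual stream (shape [d_model, d_in], nn.Linear convention), then rescale
    each column back to its original L2 norm. Simplified sketch only."""
    r = refusal_dir / refusal_dir.norm()        # unit refusal direction in d_model space
    col_norms = W.norm(dim=0, keepdim=True)     # per-column norms before ablation
    W_abl = W - torch.outer(r, r @ W)           # remove the component along r from the output space
    new_norms = W_abl.norm(dim=0, keepdim=True).clamp_min(1e-8)
    return W_abl * (col_norms / new_norms)      # restore the original per-column norms

# Hypothetical usage: ablate layers 30-45 with the direction measured at layer 37.
# for i in range(30, 46):
#     layer = model.model.layers[i]
#     layer.self_attn.o_proj.weight.copy_(
#         ablate_norm_preserving(layer.self_attn.o_proj.weight, refusal_dir_layer37)
#     )
```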
For quants, I’ve provided a Q8_0 as well as others that follow the MoE quantization scheme I’ve been using. The idea is that, because the conditional-expert FFN tensors dwarf the rest of the tensors in the model, quantizing them more aggressively while keeping everything else at higher precision should achieve better quality at a smaller overall model size than a comparable naive quantization.
The naming convention is as follows: [Default Type]-[FFN_UP]-[FFN_GATE]-[FFN_DOWN], e.g. Q8_0-Q4_K-Q4_K-Q5_K. This means:
- Q8_0 is the default type (attention, shared expert, etc.)
- Q4_K was used for the FFN_UP and FFN_GATE conditional expert tensors
- Q5_K was used for the FFN_DOWN conditional expert tensors
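To make the mapping concrete, here is a small, purely illustrative Python sketch of how a quant name translates into per-tensor quantization types. The tensor-name suffixes (ffn_up_exps, ffn_gate_exps, ffn_down_exps for the conditional experts) follow llama.cpp’s GGUF naming; the function itself is just an illustration, not part of any tool.

```python
# Illustrative only: map a quant name like "Q8_0-Q4_K-Q4_K-Q5_K" to the
# quantization type used for a given GGUF tensor name.
def quant_type_for_tensor(quant_name: str, tensor_name: str) -> str:
    default_t, up_t, gate_t, down_t = quant_name.split("-")
    # Conditional (routed) expert tensors, per llama.cpp's GGUF naming
    if "ffn_up_exps" in tensor_name:
        return up_t
    if "ffn_gate_exps" in tensor_name:
        return gate_t
    if "ffn_down_exps" in tensor_name:
        return down_t
    # Everything else (attention, shared expert, embeddings, ...) uses the default type
    return default_t

print(quant_type_for_tensor("Q8_0-Q4_K-Q4_K-Q5_K", "blk.10.ffn_down_exps.weight"))  # Q5_K
print(quant_type_for_tensor("Q8_0-Q4_K-Q4_K-Q5_K", "blk.10.attn_q.weight"))         # Q8_0
```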
| Quant | Size | PPL | KLD (vs Q8_0) |
|---|---|---|---|
| Q8_0 | 353.26 GiB (8.51 BPW) | 8.4801 ± 0.15099 | 0 |
| Q8_0-Q5_K-Q5_K-Q6_K | 248.61 GiB (5.99 BPW) | 8.4881 ± 0.15112 | 0.009449 ± 0.000677 |
| Q8_0-Q4_K-Q4_K-Q5_K | 208.24 GiB (5.01 BPW) | 8.5182 ± 0.15172 | 0.016299 ± 0.000839 |
| Q8_0-IQ3_S-IQ3_S-IQ4_XS | 163.74 GiB (3.94 BPW) | 8.7101 ± 0.15534 | 0.041096 ± 0.001202 |
| Q6_K-IQ2_XS-IQ2_XS-IQ3_S | 119.79 GiB (2.88 BPW) | 9.3447 ± 0.16732 | 0.131974 ± 0.002384 |