Do Instruction-Tuned Models Always Perform Better Than Base Models? Evidence from Math and Domain-Shifted Benchmarks

arXiv:2601.13244v1 Announce Type: new Abstract: Instruction finetuning is standard practice for improving LLM performance, yet it remains unclear whether it enhances reasoning or merely induces surface-level pattern matching. We investigate this by evaluating base and instruction-tuned models on standard math benchmarks, structurally perturbed variants, and domain-shifted tasks. Our analysis highlights two key (often overlooked) limitations of instruction tuning. First, the performance advantage is unstable and depends heavily on evaluation settings. In zero-shot CoT settings on GSM8K, base models consistently outperform instruction-tuned variants, with drops as high as 32.67% (Llama3-70B). Instruction-tuned models only match or exceed this performance when provided with few-shot exemplar…

Similar Posts