Less is More: Benchmarking LLM Based Recommendation Agents

arXiv:2601.20316v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed for personalized product recommendations, with practitioners commonly assuming that longer user purchase histories lead to better predictions. We challenge this assumption through a systematic benchmark of four state of the art LLMs GPT-4o-mini, DeepSeek-V3, Qwen2.5-72B, and Gemini 2.5 Flash across context lengths ranging from 5 to 50 items using the REGEN dataset. Surprisingly, our experiments with 50 users in a within subject design reveal no significant quality improvement with increased context length. Quality scores remain flat across all conditions (0.17–0.23). Our findings have significant practical implications: practitioners can reduce inference costs by approximately 88% by us…

Similar Posts