Hi all,

I’m running Ollama with Gemma 3 12B locally on my 4080, but I’d like my endpoint to expose an interface similar to OpenAI’s batch API. I’m trying to do this with a wrapper around vLLM, but I’m having issues.
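
For reference, this is roughly the shape I’m aiming for: one OpenAI-style batch request per profile in a JSONL file. This is just a sketch of my intent (the model ID is the Hugging Face one I’ve been pointing at, and the file names are placeholders):

```python
# Rough sketch: build an OpenAI-batch-style JSONL file, one request per profile.
# Model ID / file names are placeholders from my setup.
import json

profiles = [{"id": "p1", "text": "likes hiking, photography, indie games"}]  # ~200k of these in reality

with open("batch_input.jsonl", "w") as f:
    for p in profiles:
        request = {
            "custom_id": p["id"],
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "google/gemma-3-12b-it",
                "messages": [
                    {"role": "system", "content": "Return 5-15 classification labels as a JSON list."},
                    {"role": "user", "content": p["text"]},
                ],
                "max_tokens": 128,
            },
        }
        f.write(json.dumps(request) + "\n")
```

My understanding (possibly wrong) is that vLLM can consume a file like this with its OpenAI-compatible batch runner, something like `python -m vllm.entrypoints.openai.run_batch -i batch_input.jsonl -o results.jsonl --model ...`, which is why I went down the wrapper route in the first place.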

I’m not super deep in this space and have been using agents to help me set everything up.

My use case is to send 200k small profiles to a recommendation engine and get 5-15 classifications on each profile.
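
Here’s the kind of thing the agents and I have been sketching with vLLM’s offline API; I haven’t gotten it running cleanly yet, and the prompt/model names are just what I’ve been trying:

```python
# Sketch of the offline vLLM route the agents suggested (not working cleanly yet for me).
# Model name is the Hugging Face ID for Gemma 3 12B instruct.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-3-12b-it", max_model_len=2048)  # profiles are small, so short context
params = SamplingParams(temperature=0.0, max_tokens=128)

profiles = ["likes hiking, photography, indie games"]  # ~200k short strings in practice
prompts = [
    f"Classify this profile with 5-15 short labels, one per line:\n{p}"
    for p in profiles
]

# vLLM batches internally, so in theory the whole list can be handed over at once
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```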

Any advice on how to get this accomplished?

Currently the agents are running into trouble; they say the engine isn’t handling memory well. vLLM’s supported-models list doesn’t include the latest Gemma models either.

Am I barking up the wrong tree? Any advice would be much appreciated.
