and recorded it running at 857 tokens/second. It was an exciting model, but Google made no further announcements about it. That research has returned in the best possible way: as a new open weight (Apache 2 licensed) Gemma model. NVIDIA are currently hosting it on their NIM cloud API. A request I made through that API took 4.4s (according to time uv run generate.py) to return 2,409 tokens, so at least 500 tokens/second.
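
For anyone checking the arithmetic behind that "at least 500 tokens/second" figure, here is a minimal sketch. The function name is my own, not from generate.py, and the only inputs are the two numbers quoted above. Because the 4.4s was measured around the whole uv run invocation, it presumably also covers interpreter startup and network round trips, which is why the result reads as a floor rather than the model's true decode speed.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Throughput floor: total tokens divided by total wall-clock time.

    Hypothetical helper for illustration; not part of the original script.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


# Figures quoted in the post: 2,409 tokens returned in 4.4s of wall-clock time.
rate = tokens_per_second(2409, 4.4)
print(f"{rate:.1f} tokens/second")
```

Any overhead inside those 4.4 seconds only pushes the real generation speed higher than this number.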
