are LLMs automatically good at poetry yet?

07-Nov-2025 *

In November 2024, I spent too many hours and API credits trying to get Claude to write me something comparable to Pablo Neruda’s poems.

I think probably this would be much easier if I was a computer toucher who knew how to finetune models? But we (Claude and I) managed something that I thought was passable by the end, using just prompt engineering. Though looking back at the outputs now, I wonder if I was drinking dizzy juice when I said that I thought they were approaching good.

It is now a year later, and we have new models that claim to be more powerful than Claude Sonnet 3.7. We’ve all seen the METR charts, but I’m here to investigate something much more important: how much better a…


Methodology

I ran each of the models below through two tests using OpenRouter’s chat interface, starting a fresh conversation for each trial (meaning ten poems total):

Gemini 2.5 Pro
Claude Sonnet 4.5
GPT 5
Kimi K2 (just for fun)
Claude Opus 4 (lowkey i think 4 is better than 4.1)

Easy mode used the engineered prompt that emerged from extensive back-and-forth in the original post, which consistently generated passable-to-me Neruda facsimiles. Here is that prompt:

Generate a Spanish-to-English translation of what could be a previously unknown Pablo Neruda poem.

Hard mode is where I will just ask for what I want with no handholding, as god the frontier AI companies intended:

Write me a poem in the style of Pablo Neruda.

(When I requested this in November 2024, what I received was extruded tumblr poetry product, instead of anything like Neruda.)

I pasted all ten generated poems into a document in a lazily randomized order. Then I read some actual Neruda as a palate cleanser, before reading the poems and rating each one out of ten. I also got two other guys (gender neutral) who were hanging around and were familiar with the works of Neruda to rate the generated poems as well.

[Image: words you hear more than you might naively expect at lighthaven]

Here are the guidelines I used for rating the poems:

10/10: this is a really good neruda poem
8-9/10: i can believe that neruda wrote this
6-7/10: strong traces of neruda
4-5/10: weak traces of neruda
1-3/10: whomst?

(I tried to get Claude to evaluate the outputs too, on the idea that AIs are at least better at judging outputs than generating them, but they’re too nice! They kept rating every single poem between a 6-8/10 even with very specific evaluation instructions, a rubric, and a strong exhortation to be mean. So their useless ratings have been excised from the data.)
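For anyone who wants to repeat the setup above, here is a minimal sketch of the trial grid and the "lazily randomized" ordering. It assumes the point of the shuffle was to hide which model and prompt produced which poem; `build_blind_order` and the seed are hypothetical names for illustration, and the model names are just the labels from the list above, not API identifiers.

```python
import random
from itertools import product

# Display labels copied from the post; NOT OpenRouter model slugs.
MODELS = [
    "Gemini 2.5 Pro",
    "Claude Sonnet 4.5",
    "GPT 5",
    "Kimi K2",
    "Claude Opus 4",
]

# The two prompts quoted in the post.
PROMPTS = {
    "easy": "Generate a Spanish-to-English translation of what could be "
            "a previously unknown Pablo Neruda poem.",
    "hard": "Write me a poem in the style of Pablo Neruda.",
}


def build_blind_order(seed=None):
    """Shuffle the (model, mode) trials and number them for blind rating.

    Returns an answer key mapping "Poem N" to (model, mode). Raters only
    see the poem numbers; the key stays hidden until the ratings are in.
    """
    trials = list(product(MODELS, PROMPTS))  # 5 models x 2 modes = 10 trials
    rng = random.Random(seed)  # a fixed seed makes the order reproducible
    rng.shuffle(trials)
    return {f"Poem {i}": trial for i, trial in enumerate(trials, start=1)}


answer_key = build_blind_order(seed=0)
for label, (model, mode) in answer_key.items():
    print(label, model, mode)
```

Each trial would still need its own fresh conversation; this only covers the bookkeeping for which poem is which.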

Results


Easy mode average: 3.5

Hard mode average: 5.3(!)

Gemini 2.5 Pro average: 4.95

Claude Sonnet 4.5 average: 2.9

GPT 5 average: 5.0

Kimi K2 average: 5.0

Claude Opus 4 average: 4.25
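As a quick sanity check on the averages above: each model wrote two poems and each mode covers five, so if every poem got the same number of ratings (an assumption, the raw scores are not listed here), the mean of the two mode averages and the mean of the five model averages should both equal the overall mean. A small sketch using only the reported numbers:

```python
# Averages exactly as reported above (already rounded by the author).
MODE_AVERAGES = {"easy": 3.5, "hard": 5.3}
MODEL_AVERAGES = {
    "Gemini 2.5 Pro": 4.95,
    "Claude Sonnet 4.5": 2.9,
    "GPT 5": 5.0,
    "Kimi K2": 5.0,
    "Claude Opus 4": 4.25,
}


def mean(values):
    """Plain arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Assumes equal weighting: every poem received the same number of ratings.
grand_from_modes = mean(MODE_AVERAGES.values())
grand_from_models = mean(MODEL_AVERAGES.values())
gap = abs(grand_from_modes - grand_from_models)

print(f"overall mean via modes:  {grand_from_modes:.2f}")
print(f"overall mean via models: {grand_from_models:.2f}")
print(f"gap: {gap:.2f}")
```

The two routes land within a couple of hundredths of each other, which is about what rounding in the reported figures would produce.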

For the GPT models, I attempted to use o3, but OpenRouter kept telling me that I had insufficient funds even though I had thirty dollars in there, which is why I went with GPT 5. I also don’t know which GPT 5 I ended up getting with their auto-routing, so it’s possible that GPT 5 could have done a much better job (i.e. if they autorouted me to a terrible model).

Discussion

Okay! First thing, the more generic prompt actually got significantly better results on average than the more specific prompt, so the models seem to have gotten more implicit understanding of what is requested compared to a year ago. Makes sense; I think thinking helps with this a lot.

Secondly, as an Anthropic booster, there is so much egg on my face. God damn, they got smoked by the competition lol. I am transferring all of the project knowledge into a SillyTavern lorebook as we speak.

Reading the poems is an interesting time. In 2019, Sarah Constantin wrote the following about GPT-2:

The scary thing about GPT-2-generated text is that it flows very naturally if you’re just skimming, reading for writing style and key, evocative words. The “unicorn” sample reads like a real science press release. The “theft of nuclear material” sample reads like a real news story. The “Miley Cyrus shoplifting” sample reads like a real post from a celebrity gossip site…

If I just skim, without focusing, they all look totally normal. I would not have noticed they were machine-generated. I would not have noticed anything amiss about them at all.

It is six years later, and the frontier AI companies have done a good job making the generated text usefully non-hallucinatory. But if you’re like “ok give me a thing that’s 30% hallucination” (what poetry approximately is), it has a hard time being 30% hallucinatory instead of 60% or 85%, and you end up with a lot of poetic metaphors that don’t make sense upon close reading (“and from the mines of the heart/comes copper enough to build a bridge to your pulse”...?)

Poem 1

[Images: Poem 1, parts a and b]

This one is actually quite strong, though it doesn’t look that way at literal first glance: the lines are too long, so it doesn’t visually look like a Neruda. There’s a slight bend towards 2015 spoken word poetry - the focus on wrists, knees, etc. is very Sarah Kay. Some metaphors are too trite (“laugh spills its coins into my pockets”); others don’t make any sense or are way too abstract (“a lemon burns calmly”). But overall, I think it gets the themes and tone fairly accurately? At least compared to the other contenders.

I rated this one 7/10.

Poem 2

[Images: Poem 2, parts a and b]

This one is dark and moody. I was pleasantly surprised by the erotic edge in the middle verses, because in my experience LLMs struggle with anything even remotely lewd. The ending kind of falls apart; the poet is asking the lover to stay, but the metaphors all gesture at apartness, which of course is a rookie mistake that Neruda would never make.

Honestly better than what you get at most open poetry nights, though that’s not exactly sufficient.

I rated this one 6/10.

Poem 3

[Image: Poem 3]

This is like, poetry you would find in a Les Mis fanfiction? It’s way too trite, and Neruda’s political poetry tends more towards bombast than melancholy.

I rated this one 3/10.

Poem 4

[Images: Poem 4, parts a and b]

The subject matter is on point, and the poem’s first half actually seems pretty close to one of Neruda’s actual odes. But then the second half just... goes off in a blah tumblr poetry direction? The exact turning point is literally “a wounded bird of paper” - that and everything that follows just needs to be thrown out and rewritten.

Some weird stuff that would be caught by a line-by-line edit, like “spilled salt like a galaxy of tiny, bitter stars”. Ah yes, salt, the bitter mineral.

Note that Neruda did actually write an Ode to the Table! (This is not that surprising; Pablo Neruda wrote over 200 odes to things over his long career.) You can compare the two yourself:

[Screenshot: “Oda a la Mesa” (Ode to the Table) by Pablo Neruda, via Seedy Sedgewick]

The first half of Poem 4 does a good job approaching the presentness and specificity of detail that Neruda enjoys, but then it goes all abstract.

I rated this one 6/10.

Poem 5

[Image: Poem 5]

Honestly, this one’s vibe is interesting. It’s Neruda-esque in its subject matter (bread, salt, the sea, the lover), but a little too ungrounded in its metaphors: Neruda’s writing has a way of pulling you into a very specific location with him, but reading this felt like floating on air. Upon second reading, more cracks start to show. Some parts are “he would not say that” (I don’t think Neruda is an enjoyer of the scent of freshly cut fields); some parts are “these metaphors don’t belong together” (words are warm like fish are... guiltless?).

I rated this one 5/10.

Poem 6

[Image: Poem 6]

Another ode! LETS GO i love the odes. and this one is saccharine and simple, and not exactly neruda-esque, but i did actually find it a pretty sweet read. Like every other piece, it has some weird hallucinations (“shadows pool in the valleys” of the orange. Because oranges, as we all know, have valleys).

Neruda also wrote an ode to the orange! But not a forgotten one. Here is one translation:

[Images: translation of Neruda’s ode to the orange, in two parts]

I rated this one 4/10.

Poem 7

[Image: Poem 7]

Okay, so this one just straight up plagiarizes what’s possibly neruda’s most famous line, but despite that it fumbles the style so bad that I feel like I’m reading Taylor Swift lyrics.

[Images: “sonnet” and “sonnet BUNGLED”]

1/10.

Poem 8

[Image: Poem 8]

Not sure how to describe the wrongness except that it’s too self-consciously “poetry”, and the sentiment is like, anti-horny in a way that neruda never is?

3/10.

Poem 9

[Image: Poem 9]

This is a poem from a 2019 small press chapbook by a woman of colour.

2/10.

Poem 10

[Image: Poem 10]

seas, waves, sand, fire, bread, wine, air. It’s ostensibly all the Neruda things! But this is Sonnet writing you a sonnet, not Neruda.

4/10.

#ai #longform
