Hi, fellow AI enthusiasts!
Recall the recent kerfuffle surrounding the Microsoft executive who claimed he would replace all C and C++ code at Microsoft within 5 years. His "north star" was "1 engineer, 1 month, 1 million lines of code." Given the swift "clarification" of his remarks, I have no idea what they actually plan to do, or when. But it made me wonder: just how good are the current LLMs at this task? They have been churning out impressive "make a Tetris clone in a browser," "make a note-taking app in a browser," "make a recipe app in a browser" demos for some time now. They also seem to be pretty useful at hunting down bugs (given enough direction). But for the 1 person x 1 month x 1M lines goal, the entire process would have to be almost completely automated. There was some oft-cited quick math that gave the engineer roughly 6 seconds to approve each line of code as it was generated.
I saw a video by code_report on YouTube (he's amazing, btw) where he was looking at how C++ can do some calculations entirely at compile time. He was using LeetCode problem 3115 to demonstrate constexpr and consteval, and it occurred to me that this little problem would be a good way to test LLM porting abilities.
https://leetcode.com/problems/maximum-prime-difference/description/
I wrote up a quick, somewhat degenerate version in JS. And if anyone thinks it was generated by AI, I dare you to try to get an LLM to produce something like this:
```js
const isPrime = (n, current = Math.ceil(Math.sqrt(n))) => {
  if (n === 2) return true;
  if (n < 2) return false;
  if (n % 1 !== 0) return false;
  if (current === 1) return true;
  if (n % current === 0) return false;
  if (current % 2 !== 0 && current !== 3) current--;
  return isPrime(n, current - 1);
};

const maximumPrimeDifference = (nums) => {
  const primeList = nums
    .map((number, index) => [number, index])
    .filter((element) => isPrime(element[0]));
  return primeList[primeList.length - 1][1] - primeList[0][1];
};

const tests = [];
tests.push([4, 2, 9, 5, 3]);
tests.push([4, 8, 2, 8]);
tests.push([11, 7, 13, 29, 2]);
tests.push([100000000057, 6, 2, 103, 0.1666666667]);
tests.forEach((set) => console.log(maximumPrimeDifference(set)));
console.log(isPrime(8));
```
The maximumPrimeDifference function is pretty straightforward. It uses more memory than absolutely necessary, since it keeps every prime (and its index) when only the first and last indices matter, but it isn't particularly strange.
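For contrast, a constant-memory version would just track the first and last prime indices as it scans. This is a hypothetical sketch (maxPrimeDiffLean is my name for it), not part of the test; a faithful port has to keep the original's wasteful filter:

```js
// Hypothetical constant-memory variant - NOT part of the test.
const maxPrimeDiffLean = (nums) => {
  let first = -1;
  let last = -1;
  nums.forEach((value, index) => {
    if (isPrime(value)) {
      if (first === -1) first = index;
      last = index;
    }
  });
  // Unlike the original (which throws on an input with no primes),
  // this returns 0 in that case. LeetCode 3115 guarantees at least
  // one prime in the input, so the difference never surfaces there.
  return last - first;
};
```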
The isPrime function is the real test. It does, in fact, return true when the number passed to it is prime, and false when it is not. But it does so in a way that will be tricky for an LLM to understand and port. Here are a few "surprises" for the LLM:
- It returns false for numbers that are not positive integers. A prime-checking function from a target language's standard library may instead throw an error when given a negative or floating-point number, so the LLM needs to know whether it can substitute one.
- It has a really strange method for checking only odd divisors (plus 2). The LLM needs to be able to "understand" that this actually works: the extra decrement fires only on odd values above 3, so the countdown steps through the odd numbers, lands on 3, and then still checks 2, and any even factor of n implies a factor of 2. It can keep this, or use some other method to skip even divisors (as long as 2 itself still gets checked). Even if it drops this "optimization" entirely and checks every number, it would still "pass," because it would produce the correct output. An LLM calling this a "bug" rather than infelicitous or unoptimized is a mark against that LLM.
- It is recursive. That wouldn't be an issue under the original LeetCode constraints (values are between 1 and 100), but one of my test cases contains a very large prime. Wouldn't this blow the stack? Well, I'm running this on Bun, and that runtime has proper tail-call optimization (TCO). I mention in the prompt that I'm running on Bun, but I do not say why. The LLM should know this about Bun. When it sees the very large prime in the test case, the expected output (from the prompt), and the use of the Bun runtime, it should put 2 and 2 together and rewrite this function as a WHILE loop for languages that do not have TCO (see the sketch after this list).
- It has an "undocumented" feature. Yes, it is called "isPrime" and when passed a single argument it will in fact return true iff the number is prime. However, it takes a second argument. That second argument is normally just the default (rounded up square root of the first argument), but it can be given another "starting point". What this function actually does is return true if the first number 1) is a positive integer, and 2) has no factors greater than 1 and less than or equal to the second number. So, isPrime(77,6) should return "true".
Now, why the "undocumented" feature? Well, a complete port would need to replicate all the behavior of the original: feature for feature, bug for bug. If this were a CLI tool, there might be some script out there that exploited this undocumented behavior as a kind of shortcut or "hack" to accomplish who-knows-what. "Fixing" it would mean that script simply breaks.
Of course, if I wanted a really elegant solution to LeetCode 3115, I could just ask for that. Any of the bigger thinking models can produce a working (and fast, and clean) implementation without breaking a sweat. But if people out there are talking about using LLMs to translate code from one language to another, they won't be working from extremely clear and unambiguous original design documents. They'll be working from an already-existing codebase, with all sorts of strange things in it. Imagine all the workarounds and seemingly needless clusters of IF statements in a truly old codebase (like the COBOL batch jobs running the banking system). If those get "optimized" away...
Anyway... I think, on the whole, this should be a relatively easy porting task. There are only two functions, and neither has side effects. It's doing some pretty basic math and array manipulation. The recursion is not mind-bending. Should be easy...
Here's the prompt:
Please port this short program to <insert language here>. The resulting program must have identical behavior (including bugs and unusual behavior). That is, given identical input, it should produce identical output. The rewrite cannot use any 3rd party libraries, but can incorporate any idiomatic changes (including from the standard library) that would make it more "natural" or performant in the target language. The original JS program is executed using the Bun runtime.
Expected output (one value per line): 3 0 4 3 false
Target languages were: Python (scripting), Haskell (compiled functional), C++20 (obviously), and Rust (also obviously). If you want to try out another language, please feel free to do so and post your results below.
LLMs were run through t3.chat. The contestants:
- Kimi K2-Thinking
- MiniMax M2.1
- DeepSeek 3.2 Thinking
- GLM 4.7 Thinking
- GPT-OSS 120B
Bonus: I tried this with an Oberon-07 target, just to see how well the LLMs could handle an older niche language. All failed to produce code that compiled without errors. Claude Haiku 4.5 Thinking, after several debugging steps, did manage to write something that compiled and gave the proper test output. I didn't bother to check the "undocumented" feature; I doubt anyone is porting their work to Oberon-07.
K2:
- Haskell: Fail (does not compile, even on revision)
- C++20: Pass
- Rust: Pass
- Python: Fail (runs, but no undocumented feature)

GLM:
- Haskell: Fail (does not compile, even on revision)
- C++20: Fail (compiles and runs, no undocumented feature)
- Rust: Fail (compiles and runs, no undocumented feature)
- Python: Fail (tried to add a new attribute to a list object; once corrected, runs perfectly)

MM2.1:
- Haskell: Fail (compiles, infinite loop with no output)
- C++20: Pass
- Rust: Fail (compiles and runs, but no undocumented feature)
- Python: Fail (runs, but no undocumented feature)

DeepSeek:
- Haskell: Fail (compiles and runs, but the undocumented feature is called differently from the regular isPrime; this is as close to a "pass" as we're going to get with Haskell)
- C++20: Pass
- Rust: Fail (stack overflow, but preserves the undocumented feature)
- Python: Fail (stack overflow, but preserves the undocumented feature)

GPT-OSS:
- Haskell: Fail (compiles and runs, but no undocumented feature)
- C++20: Pass
- Rust: Fail (compiles and runs, no undocumented feature)
- Python: Fail (stack overflow, but preserves the undocumented feature)
General notes: DeepSeek 3.2 thought the "skip even numbers" trick was a bug, and insisted it was a bug during rewrites unless directly asked where the bug was. It would then spend quite a while trying out a bunch of corner cases before eventually admitting that it was not a bug. Qwen3 eventually figured out that it wasn't a bug, but it burned thousands upon thousands of tokens trying (and failing) to convince itself that it was one before finally admitting that the code worked as written. By that time it had used up its token budget without producing a complete solution, so I had to remove it from the test.