I am tired of hearing the fallacious claim that, because certain recent machine-learned generative chatbots can emit valid syntax for a variety of programming languages, those same chatbots are able to develop any software whatsoever. I’ve decided to put up a little challenge based on my current side projects. It is wholly uninteresting to me whether a computer can emit thousands of lines of Python 3 based on a whiteboard diagram in the office like a boilerplate generator. Y’all claim that it’s a smartie, so show me how well it thinks about anything difficult. To avoid the John Henry problem, we’ll be working on things that I care about, rather than things that exploitative employers want. Also, I used to write interview problems for employers, and I know how to pick problems that aren’t amenable to chatbots.
While I’m issuing this challenge to the denizens of Lobsters, I’m also sharing it on Lemmy. Also, links to private Gists are capability URLs, so if you have the URL then you may participate.
The rules
I’m numbering these so that you can more easily reference them when complaining.
1. Solutions must be vibecoded. That’s the whole point. I don’t care how skilled you are as a developer or hacker since this is supposed to be how good your prompts are.
2. Solutions must work. I’m leaving this unspecified formally, but I do have a few private source files that I can use to test any candidates. We’re working with Turing-complete languages here, so Rice’s theorem will prohibit me from automatically verifying your candidates. You’re allowed to write your own tests in order to provoke your agent into investigating errors. However, obviously…
3. Solutions must compile. My workflow revolves around nix build and nix flake check. I don’t see any reason to shift that for vibecoding tools. Getting Nix flakes into your chatbot’s harness is wholly your problem.
4. Anything may be context. That’s right, you can put anything into your prompt, context, file sets, RAG harness, or code-completer. Included in each task, I’ve left many links to useful docs which you should consider adding as context. I know that I’ll be reading them! You can also add this top-level readme to your context. We’ll know if you did anything questionable like provide my solution (or your non-vibecoded solution) as context, because…
5. You must show your work. Sorry if this seems harsh, but I’m straight-up not going to believe you if you present a valid solution without any chat logs. You also have to provide any URLs that you used; if your agent e.g. uses a tool to view docs.python.org then we need a log of that event. I also want to know how much time was taken; you’re free to itemize that, but wall-clock time matters here. To be fair, I’m going to take notes on my approach and I’ll try to hold myself to the same standard that I want to see from candidates.
6. All entries will be graded for readability and security. Neither readability nor security are optional when developing software, so your entries will be manually reviewed in addition to fitting a task-specific preregistered rubric.
Submit solutions to tasks as comments on the Lobsters thread or this Gist.
Task 1: Propagate the Fuck (Can’t Propagate the Fuck)
In my rpypkgs Nix flake, I have a brainfuck interpreter. This interpreter, bf.py, is already fairly fast but it could be faster.
The task: Switch from the current abstract representation to pointer propagation. That link points to my explanation of pointer propagation as well as some completely untested Python 2.7 code which I wrote for demonstration purposes. Benchmark the improved interpreter’s code generation using bench.b and its runtime using mandel.b, both installed under share/ with the interpreter.
Time estimate: one weekend (two working days)
RPython is just Python 2.7, formerly one of the most popular languages among developers. How hard could it be? It will be straightforward for anybody who knows how RPython’s JIT works.
- S tier: under 200 lines of code, faster than previous version
- A tier: under 250 lines of code, faster than previous version
- B tier: under 300 lines of code, as fast as current version
- C tier: under 400 lines of code, as fast as current version
- D tier: under 500 lines of code, less than 2x slowdown
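For readers who haven’t met the idea before, here is a rough sketch of what pointer propagation in Task 1 might look like. It assumes that "pointer propagation" means tracking the pointer’s position as a compile-time offset and folding it into each instruction, so that actual pointer moves only happen at loop boundaries. This is not the code from bf.py or from the linked explanation; the names propagate and run and the tuple layout are hypothetical, and nothing here carries RPython JIT hints.

```python
def propagate(source):
    """Compile brainfuck to (name, offset, argument) tuples.

    Assumption: pointer moves are deferred into a running offset and
    only emitted as 'move' instructions at loop boundaries.
    """
    ops = []      # emitted instructions
    stack = []    # indices of unmatched 'open' instructions
    offset = 0    # pointer movement seen but not yet emitted
    for ch in source:
        if ch == '>':
            offset += 1
        elif ch == '<':
            offset -= 1
        elif ch == '+' or ch == '-':
            delta = 1 if ch == '+' else -1
            # Merge runs of +/- that touch the same cell.
            if ops and ops[-1][0] == 'add' and ops[-1][1] == offset:
                ops[-1] = ('add', offset, ops[-1][2] + delta)
            else:
                ops.append(('add', offset, delta))
        elif ch == '.':
            ops.append(('out', offset, 0))
        elif ch == ',':
            ops.append(('in', offset, 0))
        elif ch == '[' or ch == ']':
            # Loops test the cell under the real pointer, so the
            # deferred offset has to be materialized first.
            if offset:
                ops.append(('move', 0, offset))
                offset = 0
            if ch == '[':
                stack.append(len(ops))
                ops.append(('open', 0, -1))  # target patched below
            else:
                start = stack.pop()
                ops[start] = ('open', 0, len(ops))
                ops.append(('close', 0, start))
    if offset:
        ops.append(('move', 0, offset))
    return ops


def run(ops, inputs=()):
    """Tiny reference evaluator for the tuples above; returns output text."""
    tape, ptr, pc, out = [0] * 30000, 0, 0, []
    pending = list(inputs)
    while pc < len(ops):
        name, offset, arg = ops[pc]
        if name == 'add':
            tape[ptr + offset] = (tape[ptr + offset] + arg) % 256
        elif name == 'move':
            ptr += arg
        elif name == 'out':
            out.append(tape[ptr + offset])
        elif name == 'in':
            tape[ptr + offset] = pending.pop(0) if pending else 0
        elif name == 'open' and tape[ptr] == 0:
            pc = arg    # skip past the matching 'close'
        elif name == 'close' and tape[ptr] != 0:
            pc = arg    # jump back to just after the matching 'open'
        pc += 1
    return ''.join(chr(c) for c in out)


program = "++++++++[>++++++++<-]>+."
print(run(propagate(program)))  # prints "A"
```

In this toy version the loop body compiles to two add instructions at offsets 1 and 0 with no pointer moves at all, which is the effect the technique is after. A real entry would still have to fit the existing interpreter and the JIT.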
Task 2: Late, as in the late unknown-linux-musl
For my nascent programming environment Vixen, I’ve recently hacked up an expression compiler, as covered previously, on Lobsters. Along with a few support methods, the Raku script allows me to have a Vixen compiler which is callable from Vixen. However, now I’d like a statically-linked version for initramfs and it seems that I can’t build a sufficiently-static Raku binary. Very technically, I could write an NQP-to-native compiler, but that’s a lot of work and the Raku team isn’t really excited about that.
The task: Research languages which statically compile for Linux, choose a language with good support for parsing and tree transformations, and port the Raku prototype compiler from this gist to that language.
Time estimate: two weekends, ish (five working days)
This task’s difficulty stems from the tradeoffs that must be made while shopping for a toolchain that can emit static binaries. I’m leaving it open-ended; I’ll let you deal with the ethical weight of prompting the bot to emit C++ or other bad choices. One fun complication is that the Raku compiler actually calls Vixen mid-compile to emit blocks to the Nix store, and that functionality must be preserved.
- S tier: actually, Raku can be statically compiled and linked!
- A tier: literally the same grammar as the Raku version
- B tier: recognizable as the same grammar, mostly same compiler
- C tier: grammar completely reimplemented, mostly same compiler
- D tier: completely reimplemented from scratch, parsing libraries used
- F tier: parser written from scratch
Task 3: Don’t you know? Python makes you fast. (Haha, one!)
Previously, on Lobsters, we discussed going faster by porting from Python to Rust. Previously, on Lobsters, we followed that up by going fast with Python. Now, it is time to go fast once again.
The task: Figure out what the task was again, because it’s been over a year and I intentionally forgot it in case this scenario ever came up. Then, implement the task and make it as fast as possible while technically still Python.
Time estimate: two weekends (four working days)
For this one, I’m almost certainly returning to RPython, which is technically still Python. I’m going to set the bar fairly high here, but I genuinely have no idea what the ceiling is. The lack of definition in the task is part of the challenge.
- S tier: 200,000x speedup
- A tier: 20,000x speedup
- B tier: 2,000x speedup
- C tier: 200x speedup
- D tier: 20x speedup
- F tier: 2x speedup
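The Task 3 tiers are powers of ten apart, so placing an entry is just a ratio of two wall-clock timings. A small sketch of that arithmetic follows; the function names are hypothetical, and it assumes that each threshold is an inclusive minimum and that anything under 2x is simply unranked, neither of which the rubric states.

```python
# Thresholds copied from the Task 3 tier list, highest first.
TIERS = [("S", 200000), ("A", 20000), ("B", 2000),
         ("C", 200), ("D", 20), ("F", 2)]


def speedup(baseline_seconds, candidate_seconds):
    """Ratio of baseline wall-clock time to candidate wall-clock time."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline_seconds / float(candidate_seconds)


def tier_for(ratio):
    """Return the best tier whose threshold the ratio meets, else None."""
    for name, minimum in TIERS:
        if ratio >= minimum:
            return name
    return None  # assumption: below 2x is unranked


# Example: a one-hour baseline cut to 0.9 seconds is roughly 4,000x.
print(tier_for(speedup(3600.0, 0.9)))  # prints "B"
```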
Conclusions
Give it a few months for folks to try this out and then we’ll summarize.