# Go vs Python vs TypeScript — Which Is the Most Efficient in LLM-Assisted Programming?

Source: Image by the author

TypeScript is the most popular language when it comes to LLM-assisted programming. In the last year or so it has been the go-to language for building just about any app, frontend or backend. This time, though, I was starting on a command-line tool and thought I should use Python, because of the features it offers and how well it plays with the system.

I started thinking: Python should use fewer tokens, because LLMs have been trained on a lot of public Python code and it's also more concise, so fewer output tokens. But then again Go, while more verbose, might converge faster (its explicitness tends to surface errors earlier), maybe saving tokens in the debugging phase. Now I was genuinely confused, so I decided to take a stab at the problem myself.

But wait a second, why was I even concerned with tokens? Because usage of agentic coding tools (Claude Code, Codex) is not infinite; they come with hourly and weekly rate limits. More tokens = more compute = more hours, which means you hit the rate limits sooner.

I knew a few things:

- I needed to run this in a controlled and deterministic way
- I needed a plan that remains consistent and does not diverge
- I needed a way to count tokens
- LLMs are stochastic, so I would need multiple runs

So I got started with my best friend CC (Claude Code), wrote a specification in a Markdown file, and then asked CC to generate test cases that would serve as the exit criteria. I used OpenAI's tiktoken initially to get a rough estimate, but it only counted output tokens. Later I switched to session JSONL files to count all tokens (input, output, cache), with each implementation running exclusively in its own session (a sketch of this counting is at the end of the post).

The first shot was a basic implementation of a blog's HTTP APIs. I let CC generate Python, Go, and TypeScript implementations of the same thing by following the specification. Looking at the results, I quickly realized that counting only output tokens meant Go, being the most verbose, would always lose. On top of that, the LLM would produce different kinds of implementations: at times it would write monolithic code in TypeScript and a much more maintainable structure in Go. That was another unfair advantage. So I corrected the mistakes:

- Made the specification very strict, with file-structure requirements and modular responsibilities
- Forbade third-party libraries for core tasks
- Started counting all tokens (input, output, cache) from the session JSONLs

That throwaway test taught me what NOT to do. Now I was ready to measure this properly. I started with kvstore, a key-value store with TTL expiration and write-ahead logging.

## kvstore: The First Shock (52 tests, ~500 LOC)

My hypothesis going in: Python should win. It's concise, LLMs have seen tons of Python code, and less to generate means fewer tokens. Made sense.

Source: Image by the author [kvstore run 1]

Go won. Python burned 3.5x more tokens. I stared at the session files trying to figure out what happened: Python needed 16 test runs with 89 FAILED mentions, Go needed 3 test runs with zero failures.

Python's problems:

- threading.Lock used incorrectly
- The TTL background thread had race conditions with the main thread
- Expired keys were returned before deletion
- Each fix broke something else
- WAL replay logic corrupted data
- Dictionary modifications during iteration threw RuntimeError

Go's success:

- Goroutines and channels worked from the start
- Explicit error handling in file I/O caught edge cases early

Okay, I thought. Maybe concurrency favors Go.
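For context on how small that surface area really is, here is a minimal sketch of the pattern the Python runs kept fumbling: a single lock around every dictionary access, lazy expiry checked on read, and a sweep that snapshots keys before deleting. This is my own illustration (and it leaves out the WAL entirely), not code from any of the generated implementations.

```python
import threading
import time


class TTLStore:
    """Minimal TTL key-value store: one lock, lazy expiry on read, no WAL."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        expires_at = time.monotonic() + ttl if ttl is not None else None
        with self._lock:
            self._data[key] = (value, expires_at)

    def get(self, key, default=None):
        with self._lock:
            item = self._data.get(key)
            if item is None:
                return default
            value, expires_at = item
            if expires_at is not None and time.monotonic() >= expires_at:
                # Expired: delete under the same lock so stale data is never returned.
                del self._data[key]
                return default
            return value

    def purge_expired(self):
        """Background sweep: snapshot expired keys first, never mutate while iterating."""
        now = time.monotonic()
        with self._lock:
            expired = [k for k, (_, exp) in self._data.items()
                       if exp is not None and exp <= now]
            for k in expired:
                del self._data[k]
```

Getting exactly this right on the first pass is what separated Go's 3 test runs from Python's 16.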
Let me run it again to confirm.

Source: Image by the author [kvstore run 2]

Go won again, the cleanest run across all my experiments, while Python turned in the worst performance I would see in all 36 implementations.

Python's disaster (22 test runs, 108 FAILED):

- WAL writes buffered incorrectly, causing data loss on crash
- Missing fsync calls
- File truncation during compaction corrupted the entire database
- Each iteration fixed 3–5 failures but introduced 2–3 new ones
- Cascading persistence failures

I started thinking Go might just dominate everything. But then the third run completely flipped the script.

Source: Image by the author [kvstore run 3]

Python won. TypeScript was catastrophic. Go struggled. Wait, what? Same task. Same spec. Python went from disaster (12.7M tokens) to winner (4.3M)? I couldn't believe the variance.

TypeScript's catastrophe (28 test runs, 260 FAILED):

- The LLM's async/await implementation reordered writes
- Promises weren't awaited properly, leaving partial writes
- setTimeout-based TTL cleanup didn't fire reliably under load

Go's struggles:

- Channel deadlocks from mixing buffered and unbuffered channels
- Goroutine leaks from unclosed channels
- Mutex lock ordering caused deadlocks between the WAL writer and reader

Python's win:

- Simple threading.Lock and Queue avoided Go's channel complexity
- Synchronous file I/O avoided TypeScript's async chaos

That's when it clicked. Language choice doesn't determine token usage; first-pass correctness does. The LLM either gets the concurrency model right on the first try or it doesn't, and when it doesn't, you're in for a debugging marathon.

## graphlib: Python's Clean Sweep (150 tests, ~900 LOC)

Okay, concurrency was clearly messy and unpredictable. Let me try something algorithmic, where variance should be lower: a graph library with DFS, BFS, Tarjan's strongly connected components, and topological sort.

My new hypothesis: Python should dominate here. Clean syntax for algorithms, dict-based adjacency lists, intuitive list and set operations.

Source: Image by the author [graphlib first run]

Python won the first run. Go burned over 4x more tokens.

Go's struggles (14 test runs, 818 FAILED):

- DFS/BFS tracked visited nodes via map keys incorrectly, causing infinite loops on cyclic graphs
- Tarjan's SCC algorithm had off-by-one errors in the lowlink calculations
- Topological sort failed to detect cycles properly
- Manual slice management for graph structures (adjacency lists as slices of slices) was error-prone
- Each algorithm required 2–3 complete rewrites after CC read Wikipedia references

Python's advantages:

- Dict-based adjacency lists just made sense
- Built-in recursion limit handling
- The heapq standard library provided a correct min-heap for the shortest-path algorithms
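To make "dict-based adjacency lists just made sense" concrete, here is a hedged sketch of the kind of routine this spec calls for: a Kahn's-algorithm topological sort with the cycle check Go kept getting wrong. It's my own minimal illustration, not the generated code.

```python
from collections import deque


def topological_sort(graph):
    """Kahn's algorithm over a dict-based adjacency list.

    graph: {node: [neighbors]}. Raises ValueError if the graph has a cycle.
    """
    # Compute in-degrees, making sure nodes that only appear as neighbors are counted.
    indegree = {node: 0 for node in graph}
    for neighbors in graph.values():
        for v in neighbors:
            indegree[v] = indegree.get(v, 0) + 1

    queue = deque(node for node, deg in indegree.items() if deg == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for v in graph.get(node, []):
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)

    # Nodes stuck on a cycle never reach in-degree zero, so the order comes up short.
    if len(order) != len(indegree):
        raise ValueError("graph contains a cycle")
    return order


# Example: topological_sort({"a": ["b"], "b": ["c"], "c": []}) -> ["a", "b", "c"]
```

In Python the graph really is just `{node: [neighbors]}`, so there is no slice-of-slices bookkeeping to get wrong.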
I ran it two more times expecting at least one upset. Python won both, with 3.7M and 2.5M tokens. Go struggled every time, averaging 15.6M tokens to Python's 3.8M.

Finally, some consistency! In my limited runs, Python seemed to do better for algorithms, and Go for concurrency when it works. But TypeScript, for what exactly?

## diffmerge: TypeScript's Surprise Dominance (125 tests, ~1,550 LOC)

A diff and merge library: LCS algorithm, three-way merge. I honestly didn't know what to expect here. Algorithms, but also heavy string manipulation.

Source: Image by the author [diffmerge]

TypeScript won, with the most concise output I'd seen across any run. Go burned 5.4x more tokens, with 9 test runs and 200 FAILED mentions. But here's the kicker: Go's problems weren't even in the diff logic.

Go's infrastructure disaster:

- The initial implementation serialized output as Go structs instead of JSON
- All 125 tests failed immediately (wrong output format, not a wrong algorithm)
- CLI flag parsing used the wrong types (int instead of string for filenames)
- JSON marshaling had incorrect field names (Line instead of line)
- Each fix revealed another layer of infrastructure problems before even reaching the diff logic

TypeScript's advantages:

- Native JSON handling, so the output format worked from the start
- Only had to fix actual diff-algorithm bugs
- JavaScript's native string and array handling
- An object-based hunk representation fit better

I ran it two more times and TypeScript won both, averaging 4.1M tokens to Go's 11.6M. Three for three: TypeScript swept diffmerge just like Python swept graphlib.

At this point I'm thinking okay, patterns are emerging. Each language has its strength. But how do they handle something complex that mixes everything?

## minigit: The Chaos of Complexity (125 tests, ~4,000 LOC)

Time to test on something real: a git implementation. Blob storage, commits, branches, merge.

My hypothesis at this point: Go should dominate if there's concurrency, Python should win if it's algorithmic, TypeScript if it's heavy string manipulation. In a complex project mixing all three, Python might edge out because of its overall conciseness.

Source: Image by the author [minigit run 1]

First run: Python won, in 2 test runs. TypeScript had branch-management bugs and blob decompression errors; Go basically tied with Python. Okay, larger projects favor Python slightly, as I expected.

Source: Image by the author [minigit run 2]

Second run: Go won. Wait, what happened to Python's dominance?

Failures:

- TypeScript: 63% more code generated
- Python: staging-area corruption on binary files, incorrect SHA-1 hash computation

Go's advantages:

- Explicit error handling caught issues early
- The standard library's crypto/sha1 and compress/zlib worked on the first try (a sketch of that blob format follows at the end of this section)

Source: Image by the author [minigit run 3]

Third run: Python won with a clean run. Go was catastrophic, with 11 test runs and 41 FAILED mentions.

Go's cascade:

- The merge3 function had fundamental logic errors in conflict detection
- It failed to distinguish clean merges from conflicting changes
- That infected every merge operation
- Each fix exposed new edge cases: empty-file merges, whitespace-only changes, binary-file conflicts

Python's win:

- List slicing and tuple unpacking made the merge algorithm more intuitive
- Got it mostly right within 2 test runs

Same complexity, three completely different outcomes. No consistency whatsoever.
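Before moving on to the takeaways, one aside: the blob hashing that tripped up Python in run 2 (and that Go's crypto/sha1 and compress/zlib nailed on the first try) is a tiny amount of code when it goes right. A hedged sketch of the standard git blob object format, my own illustration rather than any of the generated implementations:

```python
import hashlib
import zlib
from pathlib import Path


def hash_blob(content: bytes, objects_dir=None) -> str:
    """Compute a git-style blob hash; optionally write the compressed object."""
    # Git hashes "blob <size>\0" plus the raw bytes, not the file contents alone;
    # skipping that header is an easy way to get every object ID wrong at once.
    store = b"blob " + str(len(content)).encode() + b"\0" + content
    sha = hashlib.sha1(store).hexdigest()
    if objects_dir is not None:
        obj_path = Path(objects_dir) / sha[:2] / sha[2:]
        obj_path.parent.mkdir(parents=True, exist_ok=True)
        obj_path.write_bytes(zlib.compress(store))  # same zlib deflate git uses
    return sha
```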
## What I Actually Learned

I ran 12 experiments in total, 36 implementations, and tracked every single token. Python won 6 out of 12 (50%); TypeScript and Go each won 3 (25%). But here's what the numbers don't show, and what completely changed how I think about this.

Python's token usage on kvstore alone ranged from 4.3M (winner) to 12.7M (disaster). A 3x spread on the exact same task. Go ranged from 1.5M to 21.1M across different projects, a 14x spread. The variance within a language is bigger than the variance between languages. Let that sink in. I spent all this time trying to pick the "most efficient" language when the same language on the same task can vary by 3x depending on whether the LLM has a good day or a bad day.

Language strengths are real, at least in my limited runs. The LLM's Go implementations did well with concurrency when it got the patterns right, its Python implementations handled algorithms better, and its TypeScript implementations worked better with strings and JSON. But the LLM has to get it right on the first pass for those strengths to matter. First-pass correctness determines everything. A concise language that needs 22 debugging iterations will burn 8x more tokens than a verbose language that works on try 3.

Test run count tells the real story:

- 2–3 test runs: 1M–7M tokens, things went well
- 4–6 test runs: 3M–10M tokens, normal debugging
- 8–11 test runs: 10M–13M tokens, struggling
- 14–28 test runs: 9M–19M tokens, complete disaster

Project type does matter more than I expected. In my runs, Python swept graphlib 3 out of 3, TypeScript swept diffmerge 3 out of 3, and Go won kvstore 2 out of 3. But even within those patterns, individual runs varied wildly. You can't predict it.

## The Honest Limits

I could only run each project 3 times, because a single implementation burns millions of tokens. To get real statistical significance I'd need 10+ runs per language per project; that's 120 implementations, which would cost a fortune and take weeks. This is exploratory analysis, not a benchmark.

I tested 4 project types, when there are dozens more (web servers, parsers, compilers) that might show different patterns.

I'm biased toward Claude Code. Codex or Copilot might produce completely different results. CC used Claude Opus 4.5 for most experiments (graphlib, diffmerge, minigit) and Claude Sonnet 4.5 for kvstore. Within each experiment, all three languages used the same model, keeping the comparison fair.

I haven't gone beyond ~5K lines of code because of cost and time; larger codebases might behave differently.

And the variance means you can't really predict. Even if I ran each 10 times, the next run could still be an outlier. That's the nature of LLMs.

## So What Should You Actually Do?

Pick the language your team knows best. I'm serious. The LLM variance (a 3x spread on the same task, and up to 14x across projects, within a single language) is way bigger than the average efficiency difference between languages (usually 20–40%). If your team debugs Python fastest, use Python. If you're comfortable with Go's error handling, use Go. If your whole stack is TypeScript, stick with TypeScript.

Any token efficiency you might gain from "picking the optimal language" will get drowned out by whether Claude has a good day or a bad day on your specific implementation. What matters is how fast you can recognize when the LLM went down the wrong path and steer it back. That's a skill that comes from knowing your language deeply, not from picking the theoretically most efficient one.

I started this thinking I could optimize language choice for token efficiency. I ended it realizing that first-pass correctness is variable enough that language choice is almost noise. Build in what you debug fastest, because you'll be doing a lot of debugging either way.

## Full Data and Results

All session JSONLs, test results, and raw data are available on GitHub: https://github.com/vikas-t/llms-token-perf

Feel free to run your own experiments and prove me wrong.
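If you want to tally tokens from your own sessions, here is roughly how I would do it. This is a hedged sketch that assumes each line of a Claude Code session JSONL is a JSON event that may carry a `message.usage` object with `input_tokens`, `output_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens`; check the schema of your own session files (and the repo above) before trusting the totals.

```python
import json
from collections import Counter
from pathlib import Path

# Usage fields as they appear in Anthropic API responses; assumed to be present
# in session JSONL events -- verify against your own files.
USAGE_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def count_session_tokens(session_path):
    """Sum token usage across every event in one session JSONL file."""
    totals = Counter()
    for line in Path(session_path).read_text().splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        message = event.get("message")
        usage = message.get("usage", {}) if isinstance(message, dict) else {}
        for field in USAGE_FIELDS:
            totals[field] += usage.get(field, 0) or 0
    return totals


if __name__ == "__main__":
    totals = count_session_tokens("session.jsonl")  # hypothetical path
    print(dict(totals), "total:", sum(totals.values()))
```

This mirrors the switch described earlier: an output-only tiktoken estimate misses the input and cache tokens entirely.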