In my previous article, I chronicled the journey of turning a discarded office tower server (Fujitsu TX1310 M3) into a quiet, capable local AI lab. That was a journey toward perfecting a “single” entity.
But recently, I decided to destroy that silence. I added two small chassis under my desk and bound them together with Ethernet cables.
My goal was simple: instead of hitting a cloud API, I wanted to reproduce the smallest unit of a “supercomputer” with my own hands. The result was not a dream-like linear performance boost. What I got instead was a hard wall called 1GbE, fan noise, and a physical lesson in “the true nature of the latency that the cloud hides from us.”
This is the log of building a diskless cluster from $300 worth of e-waste.
1. The Cast: The “Samurai” Rejected by Windows 11
Source: Image by the author. A photo of the machines under the desk. The tower server (TX1310) and two small PCs (Q556) connected with cables.
New residents under my desk. The TX1310 M3 acts as the master, flanked by two Q556 nodes. It looks humble, but this is a microcosm of a data center.
In the world of High-Performance Computing (HPC), the latest GPUs get all the glory. But in my lab, the protagonists are far more modest.
- Master Node: Fujitsu TX1310 M3 (Xeon E3–1270 v6, 4C/8T)
- Worker Nodes: Fujitsu ESPRIMO Q556/M x2 (Core i5–6500T, 4C/4T)
Let me be honest about the cost of these worker nodes. The first one cost me 37 USD (5,776 JPY). The second one was 70 USD (11,000 JPY). They came with 16GB of RAM but no storage.
The key component here is the Core i5–6500T. The “T” suffix indicates a low-power 35W version. Being a 6th Gen (Skylake) CPU, it falls outside Microsoft’s list of supported processors for Windows 11, which generally requires 8th Gen or newer. To the corporate world, these are “decommissioned leftovers” that cannot be upgraded — essentially e-waste.
However, to the Linux kernel, a Skylake quad-core is still a perfectly viable warrior. I don’t need it to render a GUI. If the job is simply to crunch floating-point operations (FLOPs), silicon doesn’t rot. I rescued them at junk prices and gave them a new job.
2. Architecture: Booting the “Ghosts” (Diskless Boot)
My cluster configuration follows one aesthetic rule: “Worker nodes shall possess no local storage.”
The Q556 cases are empty. No SSDs, no HDDs. When powered on, they look for the Master node over the network, boot via PXE, and mount their root filesystem over NFS. Thanks to this configuration, the worker nodes behave as effectively stateless compute resources at the hardware level.
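For anyone who wants to reproduce this, the skeleton of that setup lives in two files on the Master. The sketch below uses dnsmasq to serve both DHCP and TFTP, which is one common way to do it; the interface name, addresses, and paths are placeholders rather than my exact values.

```
# /etc/dnsmasq.conf -- minimal sketch of DHCP + TFTP on the Master
# (interface, address range, and paths are placeholders)
interface=enp1s0
dhcp-range=192.168.10.100,192.168.10.150,12h
dhcp-boot=pxelinux.0          # BIOS PXE bootloader served from tftp-root
enable-tftp
tftp-root=/srv/tftp

# /etc/exports -- share the workers' root filesystem over NFS
/srv/nfsroot 192.168.10.0/24(rw,no_root_squash,no_subtree_check)
```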
Source: Image by the author. A photo of the Q556 display showing “START PXE Over IPv4”
The long road to connectivity. I had to prevent the Windows installer from rising like a ghost and guide the machine toward a network boot.
This design drastically lowers management costs. To update the OS, I only need to modify the image on the Master node, and the change is reflected across all machines.
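As a sketch of what that looks like in practice (the path /srv/nfsroot is an assumption, and your distribution's package manager may differ), updating every worker is one chroot away:

```
# Update the shared NFS root from the Master; all diskless nodes see the result
# (some packages expect /proc and /dev to be bind-mounted into the chroot first)
sudo chroot /srv/nfsroot apt-get update
sudo chroot /srv/nfsroot apt-get -y upgrade
```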
However, implementing this required a gritty struggle. In the cloud, things like DHCP, TFTP, and NFS Exports are abstracted away behind a single checkbox. Here, I had to write every configuration file by hand.
The hardest part was simply establishing communication. I would turn on the power, but the node stubbornly refused to pick up the Master’s OS image. After a few seconds of waiting, it would blankly fall back to the BIOS screen, or launch the Windows 10 setup screen I thought I had wiped.
This is where “Vibe Infrastructure-ing” came in.
My day job is Project Management (PM). I manage resources and deadlines; writing code is just a hobby. I don’t possess the deep, specialized knowledge of a veteran infrastructure engineer. But today, with AI as a “Co-pilot,” even a hobbyist programmer can dive into the abyss of infrastructure.
I built this infrastructure through a long, deep dialogue with an LLM. Troubleshooting PXE boot issues used to be a “secret sauce” of seasoned SysAdmins. Now, with AI, it is a weekend project that can be completed by a software engineer.
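If you hit the same wall, the quickest sanity check is to watch the wire from the Master's side while the worker powers on. This is a generic sketch rather than my exact procedure; the interface name and the dnsmasq service are assumptions.

```
# Watch for the worker's PXE traffic arriving at the Master
# (enp1s0 is a placeholder interface name)
sudo tcpdump -ni enp1s0 port 67 or port 68 or port 69   # DHCP (67/68) and TFTP (69)
journalctl -u dnsmasq -f                                 # leases handed out, TFTP transfers
```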
Source: Image by the author. A photo of the console showing “Titan Compute Node (Diskless)” and a successful login
The ghost gets a soul. The moment the diskless compute node was recognized as part of the cluster.
3. Benchmarking: Colliding with Physics (The Reality Check)
The system was complete. I proceeded to run the HPL (High-Performance Linpack) benchmark.
Note: What is HPL?
HPL is the traditional benchmark used to determine the rankings of the TOP500 supercomputers.
(While benchmarks like HPCG are gaining importance for measuring modern, complex workloads, HPL remains the standard for testing the maximum load limits of a system’s CPU and interconnects.)
The mechanism is simple yet brutal. It solves a massive system of linear equations (Ax = b). While the calculation itself is performed furiously by each node, boundary data must be exchanged (communicated) with neighboring nodes during the calculation. In other words, the HPL score is decided by a tug-of-war between “CPU Calculation Speed” and “Network Communication Speed.”
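For reference, HPL turns wall-clock time into a score using the standard operation count for solving an N × N dense system; nothing below is specific to my setup:

```latex
% HPL's reported rate: operation count divided by runtime
\mathrm{FLOPs}(N) \approx \tfrac{2}{3}N^{3} + 2N^{2},
\qquad
\mathrm{GFLOPS} = \frac{\tfrac{2}{3}N^{3} + 2N^{2}}{t \times 10^{9}}
```

Plugging in N = 10,000 and a runtime of about 3.8 seconds gives roughly 175 GFLOPS, which lines up with the baseline measured below.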
Checking the Launchpad: The Reality that “Single is Faster”
For comparison, I first tested the Master node (TX1310 M3) alone, using a standard problem size of N=10,000 as a baseline.
- Condition: N=10,000, Process Grid P × Q = 2 × 2 (Total 4 processes)
- Result: 174.80 GFLOPS. The calculation finished in just 3.8 seconds. This was my baseline.
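For completeness, here is roughly what launching these runs looks like with Open MPI. It is a sketch: the hostnames, slot counts, and binary path are placeholders, and HPL additionally reads its parameters from HPL.dat in the working directory.

```
# Sketch of a hostfile for the three machines (names and slots are placeholders)
$ cat hosts
master  slots=4
node-a  slots=4
node-b  slots=4

# Baseline: 4 processes on the Master only
$ mpirun -np 4 ./xhpl

# Cluster run: 12 processes across all three nodes
$ mpirun -np 12 --hostfile hosts ./xhpl
```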
Next, I combined the power of all three machines and ran the command with the same N=10,000.
- Condition: N=10,000, Process Grid P × Q = 3 × 4 (Total 12 processes)
- Config: Master (4 cores) + Node A (4 cores) + Node B (4 cores) = 12 Cores
The fan noise of all three machines rose in unison. I looked at the output number, and I was stunned.
49 GFLOPS.
Source: Image by the author. The benchmark result numbers
This wasn’t just “slow.” It was “broken.” A single machine could do 174, but bundling three of them dropped the performance to 49. At first, I suspected a configuration error. Was the OpenMPI flag wrong? Was the BLAS link broken? Or was it the heterogeneous mix of CPUs?
But when I re-read the SSH logs and watched the frantic blinking of the switch LEDs, a hypothesis crossed my mind.
“The calculation is too fast, and the communication can’t keep up.”
A problem size of N=10,000 is a light workload for the Master node. However, when split 12 ways and thrown across a network, the time spent waiting for data (synchronization) exceeds the time spent actually calculating. While process mapping and memory bandwidth play a role, the latency of 1GbE is likely the dominant factor dragging the computation down here.
I learned the hard way how inefficient it is to distribute small-scale calculations across a slow network.
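A rough back-of-envelope calculation makes the imbalance obvious. Treat 1GbE as roughly 125 MB/s of theoretical bandwidth, ignore latency entirely (so reality is even worse), and compare against the compute time implied by the single-node rate:

```latex
% Compute side: ~2/3 N^3 FLOPs at the single-node rate of ~175 GFLOPS
t_{\mathrm{compute}} \approx \frac{\tfrac{2}{3}\,(10^{4})^{3}}{175 \times 10^{9}} \approx 3.8\ \mathrm{s}

% Network side: the N x N matrix alone is 10^{8} \times 8\ \mathrm{B} = 0.8\ \mathrm{GB};
% moving that much data even once over 1GbE already exceeds the entire computation
t_{\mathrm{net}} \gtrsim \frac{0.8 \times 10^{9}\ \mathrm{B}}{125 \times 10^{6}\ \mathrm{B/s}} \approx 6.4\ \mathrm{s}
```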
4. Tuning: Changing the “Etiquette” of Calculation
Since I couldn’t change the hardware (network bandwidth), I had to change the software (the etiquette of calculation). I opened the HPL parameter file HPL.dat and adjusted the problem size (N).
- Before: N = 10,000 (Score: 49 GFLOPS)
- After: N = 25,000
Why increase it? In this kind of matrix calculation, the computational work grows as O(N³), while the communication volume grows only as O(N²). In other words, my strategy was: “Force the calculation time to be longer so that the relative share of communication wait shrinks (crush latency with brute-force calculation).”
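In HPL.dat this is just a handful of positional lines. The excerpt below shows the ones that matter for this run; the block size NB=192 is a commonly used starting point rather than a value I am claiming is tuned for these CPUs.

```
1            # of problems sizes (N)
25000        Ns
1            # of NBs
192          NBs
1            # of process grids (P x Q)
3            Ps
4            Qs
```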
I ran it again. The processing time stretched from roughly 15 seconds to about 60 seconds. And then the result appeared.
174.85 GFLOPS. This is the limit of this hardware; this is as far as the road goes.
You might notice a certain fact here. Yes, the 3-node cluster (174.85 GFLOPS) is almost exactly the same speed as the single Master node running N=10,000 (174.80 GFLOPS). (Just to be sure, I ran N=25,000 on the single Master node alone, and it hit 188.58 GFLOPS. So the single node is actually faster!)
I used extra power, consumed more space, increased the noise, and struggled with wiring, only to successfully lower the calculation speed by about 8%. Normally, this project would be considered a failure. If I saw this ROI (Return on Investment) in my job as a PM, I would kill the project immediately.
But this is exactly the “learning” I was looking for. The moment when Amdahl’s Law ruthlessly punishes “communication waits.” There is immense value in witnessing this not in a textbook, but on your own console.
5. Requiem for Zombie Processes (Troubleshooting)
There was one final physical drama at the end of operations. When I sent the shutdown command after the benchmark, the SSH connection dropped. However, the Q556 fans kept spinning at full speed, and the power LEDs stayed on. Because the root filesystem was mounted over NFS, the moment the OS tore down the network at the end of its shutdown sequence, it lost sight of its own root filesystem and froze.
To deal with this, I added a single “spell” to the kernel parameters: acpi=force reboot=pci (Note: this is a workaround specific to my hardware environment. It forces a specific reset mechanism via PCI so that the system halts correctly instead of hanging once the network disappears).
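Because the workers are diskless, the kernel command line does not live on the nodes themselves; it lives in the PXE boot entry on the Master. A sketch of what that entry can look like follows; the filenames, addresses, and paths are placeholders, and only the last two parameters are the point here.

```
# /srv/tftp/pxelinux.cfg/default -- sketch of a diskless boot entry (values are placeholders)
DEFAULT titan-node
LABEL titan-node
  KERNEL vmlinuz
  APPEND initrd=initrd.img root=/dev/nfs nfsroot=192.168.10.1:/srv/nfsroot ip=dhcp rw acpi=force reboot=pci
```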
With this setting, the OS commands the hardware to forcibly reset. “Software controlling the physical state.” This sensation is the reality behind the cloud’s “Stop Instance” button.
Source: Image by the author. A photo of the Switching Hub
Every packet passes through here. This cheap 1GbE switch is the heart of my supercomputer, and also its Achilles’ heel.
Conclusion: Beyond Efficiency — “Suikyo”
I spent $300 and a weekend to acquire a “supercomputer that is slower than a single node.” If you only think about efficiency, this is nonsense.
However, in Japanese, there is a word: “Suikyo” (酔狂). It translates roughly to “eccentricity” or “whimsy,” but implies a refined curiosity that goes beyond practical utility.
If I used infinite resources in the cloud, 174 GFLOPS would be instant. But there is no texture there. Here, there is “my number,” squeezed out after I wired the cables, fought with the BIOS, and tuned the kernel arguments.
And above all, this cluster has a path other than “HPL (Tightly Coupled Calculation).” For tasks like Forex backtesting or distributed AI inference, which are “Loosely Coupled” (requiring little communication), the 1GbE bottleneck fades away, and the power of 12 pure cores should finally breathe fire. The fact that it is “slower than a single node” is a message from the system: “Choose your use case wisely.”
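As a concrete sketch of what “loosely coupled” means here: independent jobs can be fanned out over plain SSH with GNU parallel, with essentially no traffic beyond arguments and results. The hostnames and the backtest script are placeholders, and the script and its inputs are assumed to be reachable on every node, for example via an NFS share.

```
# Fan independent backtest jobs out across the cluster over SSH
# (node-a / node-b are placeholder hostnames; ':' also runs jobs on the local machine)
parallel -S node-a -S node-b -S : ./backtest.sh {} ::: configs/*.json
```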
In the now-quiet room, I am planning my next move. If there is a bottleneck, I just need to upgrade the network to 2.5GbE.
Constraints are, after all, the mother of creativity.