Hands On Most GenAI models are trained and run in massive datacenter clusters, but the ability to build, test, and prototype AI systems locally is no less relevant today.
Until recently, this required high-end, multi-GPU workstations often costing tens of thousands of dollars. With the launch of the GB10-based DGX Spark in October, Nvidia set out to change this. While nowhere near as powerful as those machines, the Spark’s 128 GB of video memory makes it essentially an AI lab in a box, capable of running just about any AI workload you throw at it.
As we mentioned in our initial hands-on, the Spark isn’t the first or even the cheapest option out there. AMD and Apple also offer systems with large quantities of unified memory which are shared between the CPU and GPU, something that has made them incredibly popular among AI developers and enthusiasts.
AMD’s Ryzen AI Max+ 395 APU, which for the sake of brevity we’re simply going to refer to as "Strix Halo" from here on out, is particularly interesting. In addition to selling for between three-quarters and half the cost of the Spark, Strix Halo builds on roughly the same ROCm and HIP software stack as the company’s datacenter products. This provides a clearer, though not necessarily seamless migration path from desktop to datacenter.
To see how Strix Halo stacks up against the Spark, HP sent over its Z2 Mini G1a workstation so we could find out how each of these little boxes of TOPS fares in a variety of AI workloads, ranging from single-user and batched inference to fine-tuning and image generation.
System overview
Compared to the Spark, HP’s Z2 Mini G1a is a fair bit bigger thanks in part to its integrated PSU and larger cooling solution
The first thing you’ll notice is the HP is significantly larger than the Spark. This is partly because Nvidia opted for an external power brick that connects over USB-C while HP opted for a slightly larger chassis with an integrated power supply.
We generally prefer HP’s approach here, especially because the larger chassis allows for a beefier cooling solution, though the Spark’s fit and finish definitely comes across as the more premium of the two.
While the Spark uses an all-metal chassis that doubles as a heat sink, the G1a feels much more like an HP product with a clean albeit plastic shell covering a stiff metal chassis. The benefit of this design philosophy is serviceability. Getting into the G1a is as easy as pressing a button at the back of the machine and sliding off the top cover.
However, because the machine uses soldered LPDDR5x memory, there’s not actually that much to do in either system. The HP does feature two standard 2280 PCIe 4.0 x4 M.2 SSDs which are user-serviceable.
For comparison, the Spark is much more appliance-like, though the SSD can also be swapped by removing a magnetic plate and four screws at the bottom of the system.
HP’s design is far easier to service than the Spark’s, with a lid that comes off at the press of a button
Inside the machines are a pair of blower fans which pull in cool air from the front and exhaust it out the back. If you’re curious, the G1a’s dual M.2 SSDs are located directly under those fans, which should keep them from overheating under heavy load.
Around the back of the machines, we see HP has taken a very different approach to Nvidia in terms of I/O.
While the Spark prioritizes high-speed networking, the I/O on HP’s G1a is far more pedestrian
From left to right, we see a 2.5 GbE RJ45 port, four standard USB ports (2x 10 Gbps, 2x USB 2.0), a pair of 40 Gbps Thunderbolt ports alongside two mini DisplayPorts. On the side of the machine, you’ll find a 3.5 mm headphone-microphone combo jack and two additional 10 Gbps USB 3.0 ports in both standard and USB-C form factors.
You’ll also notice two blank spaces that can be configured with any number of HP’s Flex IO modules, including serial, USB, and gigabit, 2.5 GbE or 10 GbE ports.
The Spark, meanwhile, prioritizes high-speed networking for multinode AI compute environments. Alongside the power button are four USB-C ports, the leftmost of which is used for power delivery. For display out, there’s an HDMI port along with a 10 GbE RJ45 network port and a pair of QSFP cages offering a combined 200 Gbps of network bandwidth via the system’s onboard ConnectX-7 NIC.
These ports are designed to enable clustering of multiple Spark or other GB10 systems using the same hardware and software you’d find in the datacenter.
As we understand it, you could also use the G1a’s Thunderbolt ports as a high-speed network interface for interconnecting multiple systems together, though we weren’t able to test that use case.
Speeds and feeds
| | DGX Spark | HP Z2 Mini G1a (As Tested) |
| --- | --- | --- |
| Platform | Grace Blackwell (GB10) | AMD Ryzen AI Max+ Pro 395 |
| GPU | Blackwell architecture | Radeon 8060S |
| CPU | 20-core Arm (10x X925 + 10x A725) | 16 Zen 5 cores clocked at up to 5.1 GHz |
| CUDA cores / Stream processors | 6,144 | 2,560 |
| Tensor cores / Compute units | 192 5th-gen | 40 |
| RT cores | 48 4th-gen | 40 |
| Tensor perf (GPU) | 1 petaFLOPS sparse FP4 | 56 teraFLOPS dense BF16/FP16 (est) |
| NPU | NA | XDNA 2, 50 TOPS |
| System memory | 128 GB LPDDR5x 8533 MT/s | 128 GB LPDDR5x ECC 8000 MT/s |
| Memory bus | 256-bit | 256-bit |
| Memory bandwidth | 273 GB/s | 256 GB/s |
| Storage | 4 TB NVMe | 2x 1 TB M.2 TLC NVMe |
| USB | 4x USB 3.2 Type-C (20 Gbps) | 3x USB 3.2 Type-A (10 Gbps), 2x USB 2.0 Type-A |
| Thunderbolt | NA | 2x Thunderbolt 4 (40 Gbps) |
| Ethernet | 1x RJ-45 (10 GbE) | 1x RJ-45 (2.5 GbE) |
| NIC | ConnectX-7 200 Gbps | Realtek RTL8125BPH-CG |
| WiFi | WiFi 7 | MediaTek WiFi 7 |
| Bluetooth | Bluetooth 5.4 | Bluetooth 5.4 |
| Audio out | HDMI multi-channel | 3.5 mm headphone/microphone combo port |
| Peak power | 240 W (adapter) | 300 W (PSU rating) |
| Display | 1x HDMI 2.1a | 2x Mini DisplayPort 2.1, 2x Thunderbolt, 1x USB Type-C (side) |
| OS | Nvidia DGX OS | Windows 11 Pro / Ubuntu 24.04 |
| Dimensions | 150 mm x 150 mm x 50.5 mm | 85 mm x 168 mm x 200 mm |
| Weight | 1.2 kg | 2.3 kg |
| Price | $3,999 (MSRP) | $2,949 (retail) |
To be clear, neither system is the cheapest chariot for their respective silicon. The DGX Spark retails for $3,999 while HP’s Z2 Mini G1a, as specced, is currently selling for about $2,950.
You can find similarly equipped GB10 and Strix Halo boxes that can be had for significantly less if you’re willing to compromise on storage, connectivity, or I/O.
HP, ASUS, and a few others have OEM versions of the Spark that start around $3,000 for 1 TB of storage. We’ve also seen Strix Halo systems with 128 GB for a little over $2,000, though the memory shortage appears to have driven up prices some, and you’ll be missing out on the Enterprise features like ECC offered by the "Pro" variant of the chip.
So, if either of these systems strikes your fancy but you’re unconvinced by the pricing, you may be able to find a better deal from one of the other OEMs. In the case of the GB10 systems, you’re not giving up much apart from aesthetics by opting for an OEM rebadge over the Founders Edition.
CPU perf
Before we dig into generative AI performance, which we expect most folks to care about, we’d like to take a moment to talk about the machines’ respective CPUs.
Strix Halo is a rather interesting processor. Much like its desktop counterparts, it features 16 full-fat Zen 5 cores spread across two core-complex dies (CCDs) that are capable of clocking to 5.1 GHz. Those CCDs are bonded using advanced packaging to an I/O die that handles memory, PCIe, and graphics processing.
The Z2 Mini G1a actually uses the Pro variant of the chip, which adds a number of hardware security and management capabilities that may be attractive to enterprises deploying these systems in volume or in sensitive environments.
The Spark’s GB10 Grace Blackwell superchip meanwhile features an Arm CPU die developed in collaboration with MediaTek containing 10 X925 performance cores and 10 Cortex A725 efficiency cores for a total of 20.
While these cores are by no means slow, in our admittedly limited testing, AMD’s Zen 5 microarchitecture delivered between 10 and 15 percent higher performance across our Sysbench, 7zip compression/decompression, and HandBrake transcoding workloads.
However, in the High-Performance Linpack benchmark, which is representative of many HPC workloads, the G1a achieved more than twice the double-precision performance at 1.6 teraFLOPS versus 708 gigaFLOPS on the Spark. We’ll note that this score was achieved using only the X925 cores as enabling the A725 for the test actually nerfed performance, suggesting there may be room for improvement.
While GenAI performance is heavily dependent on low-precision GPU FLOPS, Strix Halo’s beefier CPU may make it a more flexible option for those looking for a PC that can run GenAI models rather than an appliance for AI.
GenAI performance
Moving on to GenAI, we should talk for a minute about some of the performance claims being made about both of these systems.
While Nvidia may claim a petaFLOPS of AI compute, the reality is most users will never get close to that. The reason is simple: achieving that level of performance requires structured sparsity, a feature with little if any benefit to inference workloads.
Because of this, Spark’s peak performance is really closer to 500 dense teraFLOPS, and only for workloads that can take advantage of the FP4 data type. More often than not that means the Spark will actually run at 8 or 16-bit precision, limiting peak performance to 250 and 125 teraFLOPS respectively.
Sustained performance usually falls a bit short of the theoretical. Testing the GB10 in the Max Achievable MatMul FLOPS (MAMF) benchmark, we achieved 101 teraFLOPS at BF16 and 207 teraFLOPS at FP8.
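If you want a rough reality check on your own silicon, a crude stand-in for this kind of measurement takes only a few lines of PyTorch. This is an illustrative sketch rather than the MAMF benchmark itself, and the matrix size is an arbitrary pick:

```python
import time
import torch

# Time repeated large BF16 matmuls; on ROCm builds of PyTorch, "cuda" maps to the Radeon GPU
n = 8192
a = torch.randn(n, n, dtype=torch.bfloat16, device="cuda")
b = torch.randn(n, n, dtype=torch.bfloat16, device="cuda")

for _ in range(5):  # warm-up
    a @ b
torch.cuda.synchronize()

iters = 50
start = time.time()
for _ in range(iters):
    a @ b
torch.cuda.synchronize()
elapsed = time.time() - start

# Each n x n matmul is roughly 2 * n^3 floating-point operations
print(f"~{2 * n**3 * iters / elapsed / 1e12:.0f} sustained BF16 teraFLOPS")
```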
But what about the Strix Halo part powering the G1a? Well, here we see one of AMD’s biggest weaknesses. While the House of Zen claims 126 platform TOPS for its top-specced Strix Halo SKUs, you’ll be hard pressed to find any app that can take full advantage of that. Fifty of those TOPS are delivered by the NPU, which requires specialized software to harness – more on that later. The remaining TOPS are achieved using the CPU and GPU.
Strix Halo’s GPU is no slouch. By our estimate – AMD doesn’t actually give peak floating point performance for the chip – the GPU is capable of churning out about 56 teraFLOPS of peak BF16 performance. In MAMF, we achieved about 82 percent of that at 46 teraFLOPS, which again isn’t bad.
But because the GPU is based on AMD’s older RDNA 3.5 architecture, it lacks support for the lower-precision data types offered by the Spark.
Technically, the architecture does support INT8, but the performance is essentially the same as BF16. In theory, it should deliver about 112 TOPS of INT4, but the trick is finding software that actually does computation at that precision. Sixteen distinct values just doesn’t offer much granularity.
On paper, this gives the Spark a 2.2-9x performance advantage over the Strix Halo in raw AI compute capacity.
And while this played out repeatedly in our testing, compute is only one side of the GenAI coin. The other is memory bandwidth. Depending on your use case, it may even render the performance gap between the AMD and Nvidia systems a non-issue.
LLM inference
We’re going to start by talking about large language model (LLM) inference precisely because it illustrates why more TOPS and FLOPS don’t always translate into better AI performance.
For consistency, we ran most of our tests in Linux: Ubuntu 24.04 LTS on the HP and Nvidia’s lightly customized version of the distro, DGX OS.
Editor’s note: We ran into some issues with GPU hangs in Ubuntu when testing the G1a. However, by adding a few kernel arguments, we were able to resolve the issue.
The tweak can be made by editing the Grub boot config by running:
sudo nano /etc/default/grub
Then update the GRUB_CMDLINE_LINUX_DEFAULT entry to look like so:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.cwsr_enable=0 amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432"
Once complete, save and exit the editor by pressing Ctrl X. Finally, update the bootloader and restart the machine by running:
sudo update-grub
sudo reboot
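Once the system is back up, you can confirm the new arguments actually took effect by checking the live kernel command line:
cat /proc/cmdline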
If you’re still running into GPU hangs, we recommend checking out this thread on the Framework forum.
Just looking at single-batch performance in Llama.cpp – one of the most popular frameworks for running LLMs on consumer CPUs and GPUs – we can see that the GB10 and Strix Halo churn out tokens at a similar pace, with the AMD box pulling off a narrow lead when using the Vulkan backend.
In single batch inference, AMD’s Strix Halo APU trades blows with the Spark on token generation, but falls well behind on time to first token
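If you want to generate comparable numbers on your own hardware, llama.cpp’s bundled llama-bench tool is the easiest way to do it. A representative invocation looks something like this — the model file is a placeholder, -p and -n set the prompt and generation lengths, and -ngl 99 offloads every layer to the GPU:

```bash
llama-bench -m models/llama-3.1-8b-instruct-Q4_K_M.gguf -p 256 -n 128 -ngl 99
```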
In single-user scenarios, token generation is usually bottlenecked by memory bandwidth. The GB10 claims about 273 GB/s of memory bandwidth while AMD’s Strix Halo manages about 256 GB/s.
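A quick back-of-envelope calculation shows why bandwidth is the limiter: each generated token requires streaming the model’s full set of weights from memory, so dividing bandwidth by the weight footprint gives a hard ceiling on decode speed. The figures below are illustrative rather than measured:

```python
# Decode-rate ceiling ≈ memory bandwidth / bytes streamed per token (roughly the model's weight footprint)
weights_gb = 17.0  # e.g. a ~30B-parameter model quantized to about 4.5 bits per weight
for name, bw_gbs in {"DGX Spark": 273, "Strix Halo": 256}.items():
    print(f"{name}: ~{bw_gbs / weights_gb:.0f} tok/s upper bound")
```

With near-identical ceilings, it’s no surprise the two boxes decode at a similar pace.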
This is likely one of the reasons why many AI enthusiasts were so disappointed in the Spark when it first debuted. For between two-thirds and half the price, you could get a Strix Halo box that churns out tokens just as quickly.
However, if you turn your attention to the time-to-first-token column, you’ll notice the GB10’s GPU is roughly 2-3x faster than the one in the Strix Halo box, and that’s when processing a relatively short 256-token prompt. With larger sequence lengths, this gap becomes more pronounced. This is because prompt processing tends to become compute bound quite quickly.
For shorter prompts or multi-turn conversations, Llama.cpp’s prompt caching mitigates a lot of this performance deficit. In this scenario, we’re only talking about waiting a second or two longer on the AMD platform, something customers just looking to run LLMs at home may be willing to overlook considering Strix Halo’s lower average selling price.
For those whose workloads require feeding large documents into the model’s context, the Spark’s more potent GPU gives it a clear advantage here, just one that customers will need to weigh against its higher price.
Multi-batch performance
Alongside single-batch performance, we also tested the two machines at larger batch sizes. It’s not uncommon for users to batch up jobs like extracting information from a stack of documents or emails rather than processing them sequentially, one after another.
In this case, we’re using vLLM, which in our experience handles large batch sizes and concurrency more gracefully than Llama.cpp, which is better optimized for single-user applications. We’re also using Qwen3-30B-A3B-Instruct-2507 at its native BF16 precision to avoid quantization overheads.
To see how the machines performed, we tasked them with processing a 1,024-token input and generating a 1,024-token response at batch sizes ranging from one to 64.
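For anyone looking to run a comparable test, vLLM’s offline Python API makes this straightforward to approximate. This is a minimal sketch rather than our exact harness — the prompt text and sampling settings below are stand-ins:

```python
from vllm import LLM, SamplingParams

# Load the model at its native BF16 precision, as in our testing
llm = LLM(model="Qwen/Qwen3-30B-A3B-Instruct-2507", dtype="bfloat16")

# Submit 64 requests at once; vLLM batches and schedules them internally
prompts = ["Summarize the following report in one paragraph: ..."] * 64
params = SamplingParams(max_tokens=1024, temperature=0.7)

outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text[:80])
```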
This graph charts overall throughput (tok/s) against end-to-end latency at various batch sizes ranging from 1-64
On the X axis, we’ve plotted the time in seconds required to complete the batch job, while on the Y axis we show the overall throughput in tokens per second at each batch size.
Once again, the Spark’s faster graphics processor gives it a leg up over the G1a. While this is clearly a win for the Spark, unless you’re routinely running batch jobs, the performance advantage is likely to go unnoticed, especially if you can schedule them to run overnight. Batch inference isn’t exactly interactive and so you can easily walk away and come back when it’s done.
Fine-tuning
It’s a similar story when we look at using fine-tuning techniques to teach models new skills by exposing them to new information.
Fine-tuning requires a lot of memory, potentially as much as 100 GB for a model like Mistral 7B. As we’ve previously discussed, techniques like LoRA or QLoRA can dramatically reduce the memory required to train a model.
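To put a rough number on that, a full fine-tune with a standard AdamW setup has to hold BF16 weights and gradients plus FP32 optimizer state in memory at once, before you even count activations. The accounting below is one common approximation, not an exact figure:

```python
# Rough memory accounting for a full fine-tune with AdamW (activations and framework overhead excluded)
params = 7.2e9                   # e.g. Mistral 7B
bytes_per_param = 2 + 2 + 4 + 4  # BF16 weights + gradients, plus two FP32 Adam moments per parameter
print(f"~{params * bytes_per_param / 1e9:.0f} GB for weights, gradients, and optimizer state")  # ≈ 86 GB
```

Add activations and overhead, and a 7B-class model quickly closes in on the 100 GB figure above.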
With up to 128 GB of memory available on either platform, both the Spark and G1a are well suited to this workload, though they aren’t particularly fast.
For full fine-tuning, the GB10 managed to pull well ahead of AMD’s Strix Halo, but falls short of even last-gen workstation cards like the W7900 or RTX 6000 Ada
Running a full fine-tune of Meta’s Llama 3.2 3B, we see that the Spark completes the job in roughly two-thirds the time of the G1a. However, compared to workstation cards like the Radeon Pro W7900 or RTX 6000 Ada, which offer both higher floating point performance as well as much faster GDDR6 memory, the Spark and G1a are simply outclassed.
Where things really get interesting is when we start looking at using QLoRA on larger models. To fine-tune a model like Llama 3.1 70B at home, you’d normally need multiple workstation cards. But thanks to their massive memory footprint, this job is entirely possible using either the AMD or Nvidia boxes.
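The basic recipe for this kind of job is well established in the Hugging Face ecosystem: quantize the base model to 4-bit and train small LoRA adapters on top. A minimal sketch of that setup — not the exact script or hyperparameters we used — looks something like this:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the 70B base model quantized to 4-bit NF4, keeping compute in BF16
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    quantization_config=bnb,
    device_map="auto",
)

# Attach small trainable LoRA adapters; the rank and target modules here are illustrative choices
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# From here, hand the model to a standard Trainer/SFTTrainer loop with your dataset
```

Note that on the AMD box, BitsandBytes is one of the libraries that currently needs a ROCm-specific build, as we discuss later.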
Moving up to a larger 70B-parameter model, the GB10’s faster GPU gives it a leg up in QLoRA fine-tuning
With a relatively small dataset – something we’ve previously shown can be more than adequate to tweak the style of a model – performance was more in line with what we were expecting. The G1a completed the job in a little over 50 minutes, compared to the Spark at around 20.
For bigger fine-tuning jobs using larger databases or LoRA ranks, this could easily extend to hours or potentially days, making the Spark’s performance advantage more significant.
But just like we discussed with our multi-batch inference tests, unless you’re fine-tuning models regularly, the Spark’s higher performance may not be worth the price over a similarly equipped Strix Halo system from HP, Minisforum, Framework, or any one of the other mini-PC vendors.
Image generation
One area where the Spark’s higher performance does give it a definitive advantage is image and video generation. Like fine-tuning, image gen is an especially compute- and memory-hungry workload, but it doesn’t tend to be bandwidth bound.
This is partially because image models aren’t as easily compressed as LLMs without major concessions to output quality. As such, many prefer to run these models at their native precision, whether that be FP32, BF16, or FP8.
If you’re planning to generate images or videos using ComfyUI, Nvidia GPUs are still your best bet
Running Black Forest Lab’s FLUX.1 Dev in ComfyUI, our test systems scale almost exactly as expected relative to their 16-bit floating point performance.
With roughly 125 and 120 teraFLOPS of BF16 grunt respectively, the Spark roughly matches AMD’s Radeon Pro W7900, while achieving a roughly 2.5x lead over the Strix Halo-based G1a, which in our testing achieved about 46 teraFLOPS of real-world performance.
Suffice to say image generation is clearly not the Strix box’s strong suit.
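If you’d rather script image generation than drive ComfyUI’s graph interface, the same model can also be run through Hugging Face’s diffusers library. A minimal sketch under that assumption — not the workflow we benchmarked:

```python
import torch
from diffusers import FluxPipeline

# FLUX.1 Dev at BF16; it's a gated model, so accept the licence on Hugging Face first
pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
pipe.to("cuda")  # the same device string works on ROCm builds of PyTorch

image = pipe(
    "a photo of a tiny workstation on a cluttered desk, studio lighting",
    height=1024,
    width=1024,
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("flux_test.png")
```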
But what about the NPU?
AMD’s Strix Halo APUs are also equipped with a pretty competent neural processing unit (NPU) courtesy of the company’s Xilinx acquisition. The XDNA 2 NPU is capable of churning out an extra 50 TOPS of AI performance. The trick, of course, is finding software that can take advantage of it. Most NPU use cases focus on minimizing the power consumption of things like noise reduction in audio and video, background blurring, and optical character recognition.
However, AMD and others have started to utilize the NPU for generative AI applications with mixed results. Thanks to apps like Lemonade Server, you can now run LLMs entirely on the NPU. Unless you’re trying to save on power, you probably won’t want to just yet.
As of this writing, model support is somewhat limited, and it doesn’t appear that the NPU has access to all of the platform’s 256 GB/s of memory bandwidth. Running Mistral 7B on the NPU in Windows, we observed decode performance of just 4-5 tok/s, where we would have expected to see closer to 40 tok/s.
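Lemonade Server exposes an OpenAI-compatible endpoint, so if you do want to experiment, pointing an existing client at the NPU-backed model is straightforward. The base URL and model identifier below are placeholders — check what your install actually registers:

```python
from openai import OpenAI

# Point a stock OpenAI client at the local, OpenAI-compatible server (URL and port are placeholders)
client = OpenAI(base_url="http://localhost:8000/api/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="Mistral-7B-Instruct-NPU",  # hypothetical model name; list what's loaded via client.models.list()
    messages=[{"role": "user", "content": "Give me one sentence about NPUs."}],
)
print(resp.choices[0].message.content)
```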
However, AMD is clearly pushing the idea of disaggregated inference, where compute-heavy prompt processing is offloaded to the NPU while the memory bandwidth-intensive decode phase is handled by the GPU. Performance was better, but still not as good as if you’d just run the model on the GPU.
This disaggregated approach makes a lot of sense for power-constrained notebooks, but less so for a desktop system like the G1a. Having said that, we’re interested to see where AMD takes this.
We were also able to get the NPU working in Amuse, a beginner-friendly image generation suite. AMD recently added support for running Stable Diffusion 3 directly on the NPU and, in this case, performance was actually quite a bit better than running the same model on the GPU.
Here we see Amuse using the XDNA 2 NPU in Strix Halo to generate an image using Stable Diffusion 3
Running on the NPU, Amuse was able to generate a 1,024 x 1,024 image using 20 steps in a little over a minute, while running that same test on the GPU required roughly twice that.
There were some caveats worth pointing out. The integration is quite limited at this point, available only in the beginner mode with the performance slider set to balanced. Switching to the "expert mode" disabled the NPU, forcing the model to run on the graphics processor.
The integration is also limited to Stable Diffusion 3, which is growing rather long in the tooth at this point having made its debut over a year ago. Still, it’s good to see more applications taking advantage of the NPU for more than background blurring in video calls.
Nvidia’s CUDA moat is getting shallower
One selling point that frequently comes up in any comparison between AMD and Nvidia is software compatibility, aka the CUDA moat.
While you can expect just about any software that runs on CUDA to work on the Spark without issue, that’s not guaranteed on the Strix Halo-based G1a.
Nearly two decades of development on CUDA is hard to overlook, but, while AMD has traditionally trailed in software support for its ROCm and HIP libraries, the company has made significant gains in recent months.
A year ago, we faced numerous headaches with libraries that either weren’t available or relied on forks built specifically for AMD’s CDNA-based datacenter chips, which meant they didn’t run on consumer platforms. Today, this isn’t nearly as big a problem. In fact, most of our PyTorch test scripts ran without modification on the AMD platform. However, we’d be lying if we said the experience was anywhere close to as seamless as on the Spark.
A lot of software can be made to work on AMD’s consumer hardware, but it’s not always as simple as running something like pip install xyz-package. We still needed to build libraries from source or use forks made specifically for Radeon GPUs on several occasions — vLLM, BitsandBytes, and Flash Attention 2 are just a few examples.
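Before building anything from source, it’s worth confirming that the ROCm stack actually sees the GPU and that you’ve got a ROCm build of PyTorch rather than the default CUDA wheel. The ROCm version in the index URL below is illustrative and will change over time:

```bash
rocminfo | grep -i gfx     # should report gfx1151 on Strix Halo
pip install torch --index-url https://download.pytorch.org/whl/rocm6.2
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```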
In many cases, particularly when working with software written closer to the hardware, software needs to be compiled specifically for that generation of Radeon graphics. Llama.cpp is just one example where we needed to compile against a gfx1151 target in order to get the software running.
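As a concrete example, a HIP build of Llama.cpp targeting Strix Halo looks roughly like the following. The exact CMake flags have shifted between releases, so treat this as a sketch rather than gospel:

```bash
# Build llama.cpp's HIP/ROCm backend for the gfx1151 (Strix Halo) GPU
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```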
Wrangling these dependencies isn’t easy, regardless of the platform you’re working with, so it’s nice to see AMD and Nvidia offering Docker containers that have been pre-configured with everything you need to get started. For our vLLM tests, we used both team Red’s and team Green’s vLLM Docker containers to ensure we were getting the best possible performance.
Perhaps our biggest software challenges weren’t actually software related. Strix Halo is based on AMD’s older RDNA 3.5 architecture, which means it lacks support for many of the lower-precision data types offered by the Spark’s Blackwell GPU. As a result, we were often forced to run models at 16-bit precision, even when FP8 or FP4 would have been preferable.
AMD’s RDNA 4 architecture should resolve some of this by adding support for both sparsity and FP8. However, much of the industry is now reorienting around microscaling data types, like MXFP4, for their smaller memory footprint and wider effective range.
While AMD is rapidly closing the gap, Nvidia still holds a meaningful lead on both hardware and software.
The answer you’ve all been waiting for
Yes, indeed the DGX Spark can run Crysis
We know you’re all going to ask. Yes. Both of these boxes run Crysis.
At 1440p, medium settings, Crysis Remastered ran at a very respectable 90-100 FPS on the G1a. No real surprises here, as the HP is using an x86 CPU and GPU from a company with a long graphics pedigree.
Getting the game running on the DGX Spark was a little bit more involved because of the GB10’s Arm CPU, which, for better or worse, doesn’t support 32-bit instructions. Thankfully, we were able to get it running using a utility called FEX. If you’re curious, you can find the install script we used here.
Unfortunately, we couldn’t get the Steam performance overlay working on the Spark, which meant we couldn’t gather concrete performance metrics. At medium settings, the game was perfectly playable even without resorting to Nvidia’s AI upscaling tech, which, incidentally, did work in-game.
While you can get games running on the Spark or other GB10 systems, we’re not sure we’d recommend it over the Strix Halo box or any number of cheaper gaming PCs out there.
Summing up
Which of these systems is right for you really depends on how much you care about GenAI
Which of these systems is right for you really depends on whether you want a machine specifically for AI or a PC that just happens to be able to run most AI workloads you might throw at it.
We suspect many folks who’ve made it this far likely fall into the latter camp. If you’re going to spend $2K-4K on a new PC, we don’t think it’s unreasonable to expect it to do more than one thing well.
- AMD taking AI fight to Nvidia with Helios rack-scale system
- AMD red-faced over random-number bug that kills cryptographic security
- AMD Ryzen CPUs fry twice in the face of heavy math load, GMP says
- AMD warns of new Meltdown, Spectre-like bugs affecting CPUs
In this respect, HP’s Z2 Mini G1a is one of the better options out there, especially if you’re mostly interested in running single-batch LLM inference as opposed to fine-tuning or image gen. AMD’s Strix Halo SoCs may not have the computational grunt of Nvidia’s GB10 boxes, but they run Windows and Linux competently and don’t require jumping through hoops just to play your favorite games.
Despite the performance gap, for software engineers building apps for the growing AI PC segment, the AMD-based system may still be the better development platform if for no other reason than Microsoft’s NPU mandate.
But for those who really want an AI appliance for prototyping agents, fine-tuning models, or generating text, image, and video content, the Spark or one of its GB10 siblings is probably the better choice, assuming you can stomach the asking price.
In our testing, the machine consistently delivered performance 2-3x that of the AMD-based HP system, while also benefiting from a significantly more mature and active software ecosystem. As we’ve shown, you can also get non-AI workloads running on the Spark in a pinch, but that’s not what it’s meant for. At its heart, the Spark is an AI lab in a box and is best used as such. ®