< Back to Blog

How to Run llama.cpp on a Mac Pro 6,1 with Dual AMD FirePro D300 GPUs on Ubuntu

Running 3B LLMs at 22 tok/s on a 2013 Mac Pro with dual FirePro D300 GPUs under Ubuntu. Covers Vulkan setup, llama.cpp build, systemd config, and three hardware-specific traps that silently break inference.

How to Run llama.cpp on a Mac Pro 6,1 with Dual AMD FirePro D300 GPUs on Ubuntu

The 2013 Mac Pro with dual FirePro D300 GPUs can run 3 billion parameter LLMs at 22 tokens per second under Ubuntu with llama.cpp and Vulkan. That is fast enough for single-user agentic workloads, and the hardware costs nothing if you already own one.

Getting there requires navigating a few hardware-specific traps that are not documented anywhere. This guide covers the full setup and the three configuration mistakes that will silently break your inference if you don't know about them.

Hardware overview

The Mac Pro 6,1 ships with one of three GPU configurations: D300, D500, or D700. All three share the same AMD GCN 1.0 (Southern Islands) architecture -- Pitcairn and Tahiti LE silicon. Despite lspci reporting "Curacao XT" on some D300 SKUs, the RADV driver identifies them as PITCAIRN. PCI ID: 1002:6810.

GPU Silicon VRAM per card VRAM total
D300 Pitcairn 2 GB 4 GB
D500 Tahiti LE 3 GB 6 GB
D700 Tahiti XT 6 GB 12 GB

This guide covers D300. The setup process is identical for D500 and D700 -- only the VRAM ceiling changes. D700 owners can fit 7B Q4_K_M models.

What fits in 4 GB of VRAM

Plan model selection before installing anything.

At 4096-token context, the KV cache adds roughly 0.5-1 GB on top of the model weights. Models tested on D300:

Model Size tok/s (production) Verdict
3B Q4_K_M 1.87 GB ~30 1.1 GB free for KV -- fits comfortably
3B Q8_0 3.18 GB 22 Tight at ctx=4096 but reliable
4B Q4_K_M ~2.5-2.8 GB ~20-22 Fits with KV margin
8B+ >4 GB N/A Does not fit

On 8B and larger: do not attempt partial offload. The PCIe 2.0 mezzanine bus in the Mac Pro means bus crossings from mixing CPU and GPU layers exceed whatever GPU benefit you gain. Pure CPU inference is faster for models that don't fit in VRAM.

The stack that makes this possible

Before the steps, it is worth understanding what you are actually assembling. This stack is the reason a 12-year-old workstation GPU can run modern LLM inference at all -- and it is also the reason the configuration gotchas exist.

amdgpu and Southern Islands support

The amdgpu kernel driver was originally written for GCN 2.0 and newer AMD silicon. Support for Southern Islands (GCN 1.0 -- the Pitcairn and Tahiti chips in the Mac Pro) was added later and eventually landed in mainline Linux via the amdgpu.si_support kernel parameter. Ubuntu 24.04 with the HWE kernel ships with this enabled by default, which is why the D300s bind to amdgpu without any manual configuration.

The older radeon driver also supports GCN 1.0, but it does not expose a Vulkan interface. amdgpu is required for everything that follows.

RADV: the open-source AMD Vulkan driver

RADV is AMD's open-source Vulkan implementation, maintained as part of the Mesa project. It implements the Vulkan API on top of the amdgpu kernel driver, with support going back to GCN 1.0. When you run vulkaninfo --summary and see RADV PITCAIRN, you are looking at Mesa's Vulkan driver exposing a full Vulkan device interface on hardware from 2013.

Two environment variables unlock better performance on RADV:

  • ACO (RADV_PERFTEST=aco) is RADV's in-house shader compiler. The alternative is LLVM, which produces correct code but with higher compilation overhead. On older hardware, ACO produces faster shader code and lower startup latency when llama.cpp compiles its compute shaders.
  • GPL (RADV_PERFTEST=gpl, for General Pipeline Library) enables pipelined shader creation. It reduces the stall when the Vulkan pipeline objects are being built at server startup.

You can set both together: RADV_PERFTEST=aco,gpl.

llama.cpp's GGML Vulkan backend

llama.cpp has a compute backend (ggml-vulkan) that targets the Vulkan API directly. When built with -DGGML_VULKAN=ON, it compiles the transformer's core operations -- matrix-vector multiply, softmax, layer normalization -- as Vulkan compute shaders. At runtime, those shaders dispatch to whatever Vulkan device is available. RADV PITCAIRN is a valid Vulkan device. That is the entire bridge.

GGML_VK_VISIBLE_DEVICES=0,1 tells the GGML Vulkan backend to enumerate and use both D300s as separate Vulkan devices. Without it, llama.cpp defaults to device 0 only and your second GPU sits idle. Combined with --split-mode layer, llama.cpp distributes transformer layers across both devices -- roughly half the layers on each card -- and handles the inter-GPU transfers automatically.

Why the threading gotcha exists

Vulkan is an explicit, low-level API. Unlike CUDA or Metal, GPU-CPU synchronization is not managed by the driver -- it must be written into the application. In llama.cpp's Vulkan backend, CPU threads submit compute work and then wait on Vulkan synchronization primitives (fences and semaphores) between layer dispatches.

With --n-gpu-layers 99, every transformer layer runs on GPU. The CPU threads have no computation to do -- they exist only to submit work and wait for completion. With 20 threads, all 20 are simultaneously spinning on sync waits. With two GPUs in split-mode, there are twice as many synchronization points per forward pass. The per-thread overhead compounds. Dropping to 2 threads eliminates 18 idle spinning threads while leaving the GPU submission pipeline intact. Throughput is unaffected because the GPU, not the CPU, is the bottleneck.

Why flash attention regresses on this hardware

Flash attention rewrites the standard attention algorithm to use smaller, tiled operations that fit in fast on-chip SRAM. The implementation assumes FP16 (half-precision) arithmetic is hardware-accelerated -- on modern GPUs, FP16 throughput is 2-4x higher than FP32, which is where flash attention's speed advantage comes from.

GCN 1.0 predates native FP16 acceleration. On Pitcairn, FP16 operations fall back to FP32 emulation. The throughput advantage flash attention was designed to capture does not exist on this silicon. What remains is the additional algorithmic overhead of the tiled rewrite with none of the speed benefit -- which is why enabling --flash-attn produces a 79% regression rather than an improvement.

Step 1: Verify the driver

Ubuntu 24.04 with the HWE kernel ships with amdgpu.si_support=1 enabled by default. The amdgpu driver handles Southern Islands without any kernel flags or GRUB edits.

Verify both cards are bound to amdgpu:

You should see Kernel driver in use: amdgpu on both entries. If you see radeon instead, add radeon.si_support=0 amdgpu.si_support=1 to GRUB_CMDLINE_LINUX in /etc/default/grub and run sudo update-grub. On Ubuntu 24.04 this is not necessary, but it is harmless.

Step 2: Install Vulkan userspace and verify

Verify both GPUs are accessible to Vulkan:

You should see two devices, both reporting as RADV PITCAIRN. If you see one or zero, stop and debug the driver stack before proceeding. A missing device at this step means inference runs silently on one GPU or falls back to CPU.

Step 3: Build llama.cpp with Vulkan

Install build dependencies:

Clone and build:

Note: the spirv-headers apt package (installed in Step 2) is required during the Vulkan shader compilation step -- llama.cpp pulls in unified1/spirv.hpp. If you get a header-not-found error, confirm the apt package is installed, not just the spirv-tools binary.

Step 4: Run with the correct configuration

This is where most guides send you the wrong direction.

The instinct is to maximize CPU threads. With both D300s handling full GPU offload (--n-gpu-layers 99) and --split-mode layer (the default), setting --threads 20 produces 1717% CPU usage at idle. The threads are not computing anything -- they are busy-spinning on Vulkan sync barriers between every layer dispatch. Dual-GPU split-mode doubles the number of sync points, so the problem scales with GPU count. The machine becomes sluggish even though inference technically works.

The correct thread count for fully GPU-offloaded dual-Vulkan inference is 2:

Throughput at --threads 2 vs --threads 20 is identical for fully GPU-offloaded inference. Thread count only earns its keep when there are CPU-resident layers to compute -- and with -ngl 99, there are none.

Step 5: Set up the systemd service

Running via systemd gives you automatic restart and the correct environment variables for dual-GPU Vulkan. Create /etc/systemd/system/llama-server.service:

GGML_VK_VISIBLE_DEVICES=0,1 tells llama.cpp to use both GPUs. Without it, inference uses one card and you leave half your VRAM idle. RADV_PERFTEST=aco,gpl enables the ACO shader compiler and General Pipeline Library on RADV -- both reduce Vulkan dispatch latency.

Enable and start:

Confirm the server is up and both GPUs are loaded:

Cooling

The Mac Pro 6,1 on Ubuntu does not manage fans automatically under inference load without help. The macfanctld daemon reads the applesmc sensors and ramps the fan accordingly.

Edit /etc/macfanctl.conf (note: no 'd' in the filename):

Set temperature floors at 45°C and ceilings at 58°C for TC0P (CPU proximity) and TG0P (GPU proximity). The D300s run at 32-39°C idle and reach roughly 60°C under sustained inference with this configuration -- about 25°C of thermal headroom before throttling.

Enable log output to confirm the daemon is reading sensors correctly:

Tail /var/log/macfanctl.log -- lines show Speed: NNNN, AVG: XX.XC, [*]TC0P: XX.XC, [*]TG0P: XX.XC. The * marks which sensor source is currently driving fan speed.

Three things that will silently break your setup

Flash attention. Do not enable --flash-attn. Flash attention relies on FP16 operations that GCN 1.0 silicon does not accelerate. On a 3B model, enabling it produces a 79% throughput regression -- from ~22 tok/s down to ~4-5 tok/s.

KV offload. Do not set --no-kv-offload. On RADV with D300 hardware, this does not fall back gracefully. It produces garbage output immediately -- the attention layer reads corrupted values from the CPU-resident cache. The KV cache must live on GPU.

Partial offload. Do not mix CPU and GPU layers for models that exceed your VRAM. Set either --n-gpu-layers 99 (full GPU for models that fit) or --n-gpu-layers 0 (full CPU for models that don't). Anything in between is slower than pure CPU on this hardware due to PCIe 2.0 bus crossings.

Benchmarking safely

Before running llama-bench, stop the production server:

A running server holding ~2.5 GB VRAM will skew benchmark results by roughly 6x. Confirm the GPUs are clear before benching:

Do not run llama-bench with -t 20 on this hardware. High-thread CPU inference combined with the FirePro thermal load inside the Mac Pro's small chassis can cause a hard crash requiring a physical power cycle. Safe thread counts for benchmarking: 2 or 10.

What to expect

On Qwen 2.5 3B Q8_0 at ctx=4096 with full dual-GPU offload:

Metric Value
Decode -- llama-bench ~28 tok/s
Decode -- production server 22 tok/s
Prompt processing (short) ~75 tok/s

The ~22% gap between benchmark and production is structural. llama-server adds per-token overhead -- slot management, sampling pipeline, tokenization -- that no exposed flag eliminates. Plan production throughput as bench × 0.75.

Power draw is approximately 110 W under average load. In Texas at $0.13/kWh, that runs to roughly $13/month for a 24/7 endpoint. Cloud API alternatives (gpt-4o-mini, Claude Haiku) are cheaper at low volumes. The break-even depends on your request volume and whether latency or privacy requirements make local inference worth the overhead.

I won't retire hardware that still produces useful output. The purchase cost is already sunk. The question is whether the electricity cost is worth the API spend you avoid -- and at $13/month for a reliable agentic fallback, the answer is yes for every use case a 3B model handles well.

If you hit something this guide doesn't cover, drop a comment below.