Skip to main content
GPUTinyGradmacOSAMDMLeGPUExperiment

We Got a Radeon RX 7900 XTX Running on macOS Then Stopped Using It

A technical experiment in doing what everyone said was impossible β€” and learning why they had a point.

February 2026FreeRiverHouse9 min read
Share

⚑ TL;DR

We got a Radeon RX 7900 XTX (24GB VRAM) running ML inference on a MacBook Pro M1 via Thunderbolt eGPU, using TinyGrad with a small patch. Everyone said it was impossible β€” we asked Claude, Grok, ChatGPT, and Perplexity. All of them said it couldn't be done. It technically works β€” but it's slow, impractical, and we ended up switching to MLX on Apple Silicon for actual production use. This post documents the experiment for anyone curious about the technical details.

βœ… GPT-2 inference (~3.6 tok/s)βœ… LLaMA 3.1 8B (works, barely)❌ Slower than native Apple Silicon❌ Not practical for daily use

🧩The Problem

Apple Silicon Macs are great, but if you want to run larger ML models, you're limited to the unified memory of your Mac. An M1 has 16GB shared between CPU and GPU β€” not enough for serious LLM work.

External GPUs (eGPUs) seem like the obvious solution: plug in a powerful desktop GPU via Thunderbolt and get 24GB of dedicated VRAM. But:

1.

Apple dropped eGPU support in macOS Ventura for Apple Silicon

2.

NVIDIA doesn't work on macOS at all (no drivers since Mojave)

3.

AMD Radeon + TinyGrad on macOS? β€œNot possible” β€” said everyone on Discord, Reddit, GitHub Issues, and every AI we asked (Claude, Grok, ChatGPT, Perplexity)

We're stubborn, so we tried anyway. Spoiler: they were mostly right β€” not about whether it's possible, but about whether it's practical.

πŸ”§Hardware Setup

GPUAMD Radeon RX 7900 XTX (24GB GDDR6)
EnclosureRazer Core X V2 (Thunderbolt 3)
HostMacBook Pro M1 (2021)
ConnectionThunderbolt 3/4
Key SoftwareTinyGPU.app
AMD Radeon RX 7900 XTX with Corsair PSU in eGPU enclosure
The Radeon RX 7900 XTX nestled next to the Corsair 750W PSU
Radeon 7900 XTX power connectors and cabling detail
Power delivery: dual 8-pin connectors feeding the beast

Critical Detail: TinyGPU.app

The magic ingredient is TinyGPU.app β€” a macOS app that creates a virtual device interface, allowing TinyGrad's AMD backend to communicate with the Radeon GPU over Thunderbolt. Without it, the GPU is invisible to TinyGrad.

You must:

  1. Open TinyGPU.app before any GPU operations
  2. Connect to the correct Thunderbolt port (Port 2, Receptacle 2 on M1)
  3. Verify connection with the system profiler command below
bash
system_profiler SPDisplaysDataType | grep -i AMD

Yes, just getting the GPU recognized is already an achievement. That should tell you something about how β€œsupported” this setup is.

🩹The Patch: float16 Casting for GGML

The core issue: TinyGrad's GGML quantization decoder returns float32 tensors. On NVIDIA and CPU, this works fine. On AMD via the LLVM backend, certain operations fail silently or produce garbage output.

The fix: Force-cast all GGML dequantization outputs to float16.

python
# Before (fails on AMD):
if ggml_type == 2: 
    return (q_to_uint8(blocks[:,2:], 4).bitcast(dtypes.int8) - 8) \
        * blocks[:,:2].bitcast(dtypes.float16).cast(dtypes.float32)

# After (works on AMD):
if ggml_type == 2: 
    return ((q_to_uint8(blocks[:,2:], 4).bitcast(dtypes.int8) - 8) \
        * blocks[:,:2].bitcast(dtypes.float16).cast(dtypes.float32)) \
        .cast(dtypes.float16)

This applies to GGML types Q4_0, Q4_1, Q8_0, Q4_K, Q5_K, and IQ4_NL β€” covering all common quantization formats.

Why it works: The AMD LLVM backend handles float16 natively and efficiently on RDNA3 architecture. The intermediate float32 computation causes buffer alignment issues that the float16 cast resolves.

The patch is ~20 lines changed in tinygrad/nn/state.py. Small change, big impact β€” at least for getting it to run at all.

πŸš€Running Models

Environment Variables

bash
export AMD=1        # Use AMD backend
export AMD_LLVM=1   # Use LLVM compiler (required for eGPU)

GPT-2 (Quick Test)

bash
cd ~/Projects/tinygrad-fix
PYTHONPATH=. AMD=1 AMD_LLVM=1 python3 examples/gpt2.py \
    --model_size gpt2 --prompt "Hello" --count 20
# ~3.6 tokens/second

LLaMA 3.1 8B (The β€œReal” Test)

bash
PYTHONPATH=. AMD=1 AMD_LLVM=1 python3 examples/llama3.py \
    --model /path/to/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    --prompt "Explain quantum computing" --count 200

πŸ“ŠPerformance (The Honest Numbers)

Here's where the dream meets reality. Thunderbolt bandwidth is a severe bottleneck, and TinyGrad's AMD LLVM backend on macOS is not optimized for this kind of setup.

ModelTokens/secVRAM
GPT-2 (124M)~3.6~500MB
GPT-2 XL (1.5B)~1.5~3GB
LLaMA 3.1 8B Q4~0.8~5GB

Now here's the uncomfortable comparison:

The same models on native Apple Silicon:

  • LLaMA 3.1 8B via llama.cpp on M1: ~15 tok/s β€” that's nearly 20x faster than our eGPU setup
  • LLaMA 3.1 8B via MLX on M4 Pro: ~30+ tok/s β€” in a different universe entirely
  • No special drivers, no TinyGPU.app, no correct-port-guessing, no patching source code

The 24GB VRAM argument sounds good on paper: β€œrun models that don't fit in 16GB!” In practice, even a base M1 can run 8B models just fine via quantization, and Apple Silicon machines with 32GB+ unified memory are common now. The VRAM advantage doesn't compensate for being 20x slower.

πŸ’€What We Also Tried (And Failed)

NVIDIA RTX 5060 Ti via eGPU

We also attempted an NVIDIA RTX 5060 Ti (Blackwell/GB206) via the same Thunderbolt setup. It doesn't work. The GSP (GPU System Processor) firmware fails during Display Engine initialization:

Unboxing: PNY RTX 5060 Ti, Corsair RM750e PSU, and ADT-Link adapter
The NVIDIA attempt: RTX 5060 Ti + Corsair RM750e + ADT-Link USB4 adapter
PNY RTX 5060 Ti connected to ADT-Link T3G USB4 to PCIe x16 adapter
The T3G adapter board: USB4 to PCIe x16 β€” the bridge that didn't work
PNY GeForce RTX 5060 Ti front view with ADT-Link adapter
PNY GeForce RTX 5060 Ti β€” fans spinning but no compute happening
NVIDIA RTX 5060 Ti on open test bench overhead view
The open test bench setup β€” as DIY as it gets
GPU comparison table showing options: RTX 4070 Ti Super, RTX 4080, RTX 3090, RX 7900 XTX
Our research notes: comparing GPU options for eGPU use on Mac
log
FBFLCN error: UNRECOGNIZED_CLIENT -> HUBCLIENT_CE0 -> HUBCLIENT_VIP
GSP_INIT_DONE returns NV_ERR_TIMEOUT
PNY RTX 5060 Ti 16GB GDDR7 and Corsair RM750e boxes
Brand new hardware, zero compute: the RTX 5060 Ti (16GB GDDR7) and Corsair RM750e

The 570.x firmware doesn't support Thunderbolt 4/USB4 bus types. This is a firmware-level limitation that can't be worked around.

πŸ”¬How to Reproduce

We're documenting this for the technically curious β€” not because we recommend it. If you want to run LLMs on a Mac, just use MLX or llama.cpp. Seriously.

1

Hardware

Any AMD RDNA2/RDNA3 GPU + Thunderbolt eGPU enclosure + Apple Silicon Mac

2

Software

TinyGPU.app (contact us for access)

3

Clone & patch

Clone TinyGrad and apply the float16 patch to tinygrad/nn/state.py

4

Run

bash
AMD=1 AMD_LLVM=1 python3 -c \
  "from tinygrad import Device; print(Device['AMD'])"

If you see the AMD device, you're good. If not, check:

  • Is TinyGPU.app running?
  • Is the eGPU on the correct Thunderbolt port?
  • Does system_profiler SPDisplaysDataType show AMD?

πŸͺ¦Why We Moved On

After the initial excitement of β€œholy shit, it actually works,” we had to be honest with ourselves. This setup is:

❌

Slow β€” 0.8 tok/s for LLaMA 8B vs ~15 tok/s on the same machine's native GPU via llama.cpp

❌

Fragile β€” wrong Thunderbolt port? Doesn't work. TinyGPU.app not running? Doesn't work. macOS update? Might break everything.

❌

Expensive β€” a 7900 XTX + eGPU enclosure costs more than upgrading to a Mac with more unified memory

❌

Unnecessary β€” MLX on Apple Silicon is genuinely excellent now, well-supported, and getting better fast

We switched to MLX on an M4 Pro for all our production ML workloads. It's faster, simpler, and doesn't require praying to the Thunderbolt gods every time you plug in a cable. The eGPU now collects dust.

So why publish this?

  • The technical details are real β€” the float16 patch, the TinyGPU.app approach, the GGML buffer alignment issue. Someone might find this useful.
  • The NVIDIA failure data is valuable β€” if you're considering an RTX card via Thunderbolt, don't waste your money.
  • Honest experiment logs matter β€” too many blog posts only share successes. We tried something ambitious, it worked technically, but it wasn't worth using. That's a valid outcome.

If Thunderbolt 5 (80 Gbps) and better AMD drivers ever materialize on macOS, this approach might become practical. But we're not holding our breath.

πŸ“„The Full Patch

diff
--- a/tinygrad/nn/state.py
+++ b/tinygrad/nn/state.py
@@ -324,15 +324,15 @@
-    if ggml_type == 2: return (q_to_uint8(...) - 8) * blocks[:,:2]...
+    if ggml_type == 2: return ((q_to_uint8(...) - 8) * blocks[:,:2]...).cast(dtypes.float16)
     # Same pattern for types 3, 8, 12, 14, 39
     # Add .cast(dtypes.float16) to the return value
Share

Built by FreeRiverHouse β€” sometimes you learn more from the experiments that don't stick.

πŸ”₯0
🧠0
πŸ’‘0
πŸ‘0
πŸ€”0

πŸ“–Read Next