< Back to Blog

Local AI Models: A Plain-English Guide to the Jargon

bf16, Q4_K_M, GGUF, MTP — model names are a wall of acronyms. Here's what they actually mean and whether the model will run on your Mac.

If you've tried running an AI model locally on your Mac, you've probably stared at something like this and felt nothing:

This guide decodes it. By the end, you'll be able to look at any model name on HuggingFace or the Ollama library and know immediately whether it'll run on your machine and how good it'll be.

The anatomy of a model name

Model names aren't random. They follow a loose but consistent pattern:

[publisher]/[model-family][size]-[precision or quantization]-[format]

Let's break down each piece.

The publisher prefix

mlx-community/ or meta-llama/ or mistralai/ — this is just the organization on HuggingFace that uploaded the file. Think of it like a GitHub username. The one worth knowing:

  • mlx-community — weights converted specifically for Apple Silicon (MLX format). If you're on a Mac with M1/M2/M3/M4, this is usually what you want.

The model family

Qwen3, Llama3, Mistral, Gemma, Phi4 — this is the base model. Think of it as the brand. Each one was trained differently, has different strengths, and comes from a different lab (Alibaba, Meta, Mistral AI, Google, Microsoft, respectively).

For practical purposes at 64GB RAM: size matters more than family for most use cases. A well-quantized 32B model from any major family will outperform a full-precision 7B.

The size: B = billion parameters

7B, 27B, 70B — billion parameters. Parameters are the numbers that make up the model's "knowledge." More parameters generally means smarter, but also requires more RAM.

Rough RAM requirements:

Model size Full precision (bf16) Quantized (Q4)
7B ~14GB ~4GB
14B ~28GB ~8GB
27-32B ~54GB ~18GB
70B Too large for 64GB ~40GB

With 64GB RAM, you can comfortably run up to 32B quantized, or a 14B at full precision.

Precision: the number format

When you see bf16, fp16, or fp32 in a model name, it describes how precisely each parameter is stored.

  • fp32 — 32 bits per number, 4 bytes. Full precision. Most accurate, but uses the most memory. Rarely needed for local inference.
  • bf16 (Brain Float 16) — 16 bits, 2 bytes. Same dynamic range as fp32 at half the memory. The standard full-quality format for inference.
  • fp16 — 16 bits, 2 bytes. Slightly more prone to numerical instability than bf16, but widely supported.

If a model is labeled bf16 or fp16, no quality has been sacrificed. It's the uncompressed version. You just need the RAM to load it.

Quantization: the compression tier

Quantization is lossy compression for model weights. Instead of 16 or 32 bits per number, you use 4, 5, 6, or 8 bits. You lose some precision but get dramatically lower memory usage, and in practice the quality loss is often hard to notice.

The format: Q[bits]_[variant]

Common variants:

  • Q4_K_M — 4-bit quantization, K-means method, medium size/quality balance. The most popular. Good default.
  • Q4_K_S — 4-bit, small variant. Slightly smaller file, slightly lower quality than _M.
  • Q5_K_M — 5-bit, better quality than Q4 at a modest size increase.
  • Q6_K — 6-bit. Near-full quality. Use this if you have the RAM.
  • Q8_0 — 8-bit. Nearly indistinguishable from bf16 in most benchmarks. Larger files.

Practical guide:

  • If you want the best quality your RAM can handle, go as high a Q number as fits.
  • Q4_K_M is the community default for a reason. Good balance everywhere.
  • Avoid Q2 and Q3 for serious use. Noticeable degradation.

Per-layer precision (advanced)

Some model files go further and specify different quantization for different parts of the model. A letter prefix tells you which layer:

  • o (output) — the final layer that converts the model's internal state into actual token probabilities. More sensitive to quantization than the bulk weights.
  • e (embedding) — the input side, where tokens get converted into the model's internal representation.

So Q4_K_M-oQ8 means most weights are 4-bit, but the output layer is kept at 8-bit precision. You get Q4 memory usage with less quality degradation where it matters most.

The IQ prefix (like IQ4_XS) is a related idea. "Importance-matrix" quantization uses a smarter algorithm to allocate bits based on which weights actually matter most, rather than treating them all equally.

The file format

GGUF — the current standard file format for running models locally via llama.cpp, which powers most local inference tools. If you're using Ollama or LM Studio, you want GGUF files.

Safetensors — HuggingFace's native format. Used when working directly with the Python transformers library. Not what you want for casual local use.

MLX — Apple's format for their Metal-accelerated inference framework. Models in this format run natively on the M-series GPU. If you're using the mlx_lm Python library, this is what you want. The mlx-community namespace on HuggingFace hosts these. Note: MLX models do not run in Ollama. They require the MLX runtime.

instruct vs base

You'll often see two versions of the same model:

  • base (or no suffix) — trained only to predict the next token. Good for text completion tasks and fine-tuning. Not what you want for chat.
  • instruct (sometimes chat or it) — fine-tuned to follow instructions and have conversations. This is almost always what you want.

When in doubt, grab the instruct version.

Speed tricks: MTP and speculative decoding

Getting a model to run is one thing. Getting it to run fast is another.

Standard inference is one token at a time. The model generates a word, feeds it back in, generates the next word, and repeats. Two techniques can significantly speed this up.

Multi-Token Prediction (MTP) — some models (Qwen3 is the main example) are trained with extra "draft heads" baked in. Instead of generating one token and waiting, the model speculates several tokens ahead simultaneously, then verifies them. You'll sometimes see +MTP in the model name or a separate -Draft file alongside the main model. The result is 2-3x faster generation with no quality loss.

Speculative decoding — a more general version of the same idea, but using an external draft model. A small, fast model predicts several tokens ahead. The main model checks them in parallel. If the draft was right, you skip ahead. If wrong, you fall back. Ollama and llama.cpp both support this via a --draft-model flag.

Why this matters for a 64GB Mac: Apple Silicon is memory-bandwidth constrained for inference. Tokens per second is often limited by how fast weights move through memory, not raw compute. MTP and speculative decoding can meaningfully close that gap. If you're running a 27B model and it feels slow, enabling MTP (if your runtime supports it) is the first thing to try.

The runtime: where you actually run it

The model file is inert without something to load and run it. Your options:

Ollama — the easiest path. Download the app, run ollama run llama3 in Terminal, and it handles everything: downloads the model, serves it locally, exposes an OpenAI-compatible API on localhost:11434. GGUF under the hood. Good default for most people.

LM Studio — a GUI app. Good if you want to browse and download models visually, adjust parameters, and test chat without touching a terminal.

MLX / mlx_lm — a Python library from Apple. Runs models in MLX format natively on the M-series GPU. Fastest option on Apple Silicon for models that have been converted. Requires some comfort with Python.

llama.cpp — the engine that powers Ollama and LM Studio. You can use it directly if you want maximum control, but most people don't need to.

Putting it together

Back to the original example: mlx-community/Qwen3-27B-Q4_K_M

  • mlx-community — uploaded by the MLX community, converted for Apple Silicon
  • Qwen3 — Alibaba's third-generation Qwen model family
  • 27B — 27 billion parameters
  • Q4_K_M — quantized to 4-bit, medium quality variant

Will it run on a 64GB Mac? Yes. A 27B Q4_K_M needs roughly 18GB of RAM. You have room.

Will the quality be good? Q4_K_M on a 27B model is excellent.

Quick reference cheat sheet

Term Meaning Notes
bf16 / fp16 Full precision (16-bit) Only if you have the RAM
Q4_K_M 4-bit quantized, best common balance Your default
Q8_0 8-bit, near-lossless If RAM allows
oQ8 Output layer kept at Q8 Often combined: Q4_K_M-oQ8
IQ4_XS Importance-matrix 4-bit Smarter bit allocation than standard Q4
GGUF File format for Ollama/LM Studio What you want for most tools
MLX Apple Silicon native format Use with mlx_lm, not Ollama
instruct Fine-tuned for chat/instructions Always pick this
base Raw pre-trained weights Only for fine-tuning
B (7B, 27B) Billion parameters Larger = smarter + more RAM
MTP Multi-Token Prediction Built-in speed boost (Qwen3, others)
Speculative decoding External draft model for speed Supported in Ollama + llama.cpp