
gpt-oss-120b-mxfp4

A ~120-billion-parameter open-weights GPT-style model in 4-bit mixed-floating-point (MXFP4) quantization, designed to fit very large models onto fewer GPUs / less RAM.

Approx memory usage:

  • ~70-80 GB VRAM or RAM depending on runtime and KV cache

gpt-oss-120b

  • "gpt-oss" indicates an open-weights GPT-style model family/distribution name, as used by whoever packaged the GGUF.
  • "120b" indicates ~120 billion parameters (very large model class).

mxfp4

  • Indicates a 4-bit mixed floating-point quantization variant ("MXFP4").
  • Quantization reduces memory footprint and can improve throughput, with some potential quality loss versus FP16/FP32.
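The back-of-the-envelope arithmetic below shows why 4-bit weights land a 120B model in the range above. (MXFP4 also stores per-block scales, and GGUF files keep some tensors at higher precision, so real files run a bit larger than the raw figure; the KV cache and runtime buffers account for the rest of the ~70-80 GB.)

```python
# Rough weight-storage arithmetic for a ~120B-parameter model.
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate raw weight storage in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

params = 120e9
print(f"FP16:  {weight_gb(params, 16):.0f} GB")   # 240 GB
print(f"MXFP4: {weight_gb(params, 4):.0f} GB")    # 60 GB
```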


The model is sharded into 3 files. You need ALL shards present in the same directory for loading to succeed; pass only the first shard to --model and llama.cpp discovers the rest automatically:

gpt-oss-120b-mxfp4-00001-of-00003.gguf
gpt-oss-120b-mxfp4-00002-of-00003.gguf
gpt-oss-120b-mxfp4-00003-of-00003.gguf
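A quick preflight check like the sketch below can catch a missing shard before llama-server fails at load time (the directory and base name match this guide; the helper itself is illustrative):

```python
from pathlib import Path

def missing_shards(directory: str, base: str, total: int) -> list[str]:
    """Return the filenames from an N-of-M GGUF shard set that are absent on disk."""
    missing = []
    for i in range(1, total + 1):
        # Shard names follow the GGUF split convention: BASE-00001-of-00003.gguf
        name = f"{base}-{i:05d}-of-{total:05d}.gguf"
        if not (Path(directory) / name).is_file():
            missing.append(name)
    return missing

# Layout used in this guide:
gone = missing_shards("/usr/local/models/openai", "gpt-oss-120b-mxfp4", 3)
if gone:
    print("Refusing to start llama-server; missing:", *gone)
```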

Profile A: “Safe baseline”

/usr/local/bin/llama-server \
  --device CUDA0 \
  --model /usr/local/models/openai/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
  --host 0.0.0.0 --port 10000 \
  --n-gpu-layers 99 \
  --ctx-size 16384 \
  --batch-size 1024 \
  --ubatch-size 256 \
  --no-mmap

Why:

  • 16k context keeps KV cache sane while you validate throughput/latency.

  • Moderate batching won’t explode memory but still benefits from continuous batching once you enable it.
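To keep this baseline running across reboots and crashes, a systemd unit along these lines can wrap the command (the unit path/name, user defaults, and Restart policy are illustrative, not something llama.cpp ships):

```ini
# /etc/systemd/system/llama-server.service  (illustrative path/name)
[Unit]
Description=llama.cpp server (gpt-oss-120b-mxfp4)
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/llama-server \
  --device CUDA0 \
  --model /usr/local/models/openai/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
  --host 0.0.0.0 --port 10000 \
  --n-gpu-layers 99 \
  --ctx-size 16384 \
  --batch-size 1024 \
  --ubatch-size 256 \
  --no-mmap
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

After installing the file: systemctl daemon-reload && systemctl enable --now llama-server.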

Profile B: “High throughput / many concurrent users”

/usr/local/bin/llama-server \
  --device CUDA0 \
  --model /usr/local/models/openai/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
  --host 0.0.0.0 --port 10000 \
  --n-gpu-layers 99 \
  --ctx-size 16384 \
  --cont-batching \
  --batch-size 2048 \
  --ubatch-size 512 \
  --no-mmap

Notes:

  • --batch-size is the logical max; --ubatch-size is the physical max. 

  • If you aren’t consistently saturating the server, huge batch sizes can hurt responsiveness.

Profile C: “64k context (careful)”

64k is doable, but treat it like a special mode, not the default:

/usr/local/bin/llama-server \
 --device CUDA0 \
 --model /usr/local/models/openai/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
 --host 0.0.0.0 --port 10000 \
 --n-gpu-layers 99 \
 --ctx-size 65536 \
 --cont-batching \
 --batch-size 1024 \
 --ubatch-size 256 \
 --no-mmap

If you try 64k and massive batches and many concurrent users, KV cache will dominate memory and you’ll hit OOM/slowdowns.

Practical tuning workflow on DGX Spark

  • Start with Profile A (16k, moderate batch) until it’s rock solid.
  • If you need more throughput:
    * increase --batch-size (1024 → 2048 → 4096)
    * increase --ubatch-size (256 → 512 → 1024)
  • Only then move to 64k context, and expect to reduce batch/concurrency.

How to estimate memory risk (rule of thumb)

Total memory pressure = model weights + KV cache + workspace/batching overhead.

For huge models, KV cache grows roughly linearly with:

  • context length (--ctx-size)
  • number of simultaneous sequences (parallel requests)
  • batch/ubatch choices

So if you go from 16k → 64k, KV cache can jump ~4× for the same concurrency.
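The rule of thumb can be sketched as code. The architecture numbers below (layer count, KV heads, head dim) are illustrative placeholders, NOT confirmed values for gpt-oss-120b, and real runtimes may shrink the cache further with sliding-window attention or cache quantization; the point is the linear scaling in context and concurrency:

```python
def kv_cache_gb(ctx: int, n_seq: int, n_layers: int,
                n_kv_heads: int, head_dim: int,
                bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GB.

    2 tensors (K and V) * layers * kv_heads * head_dim, per token,
    per sequence. bytes_per_elem=2 assumes an FP16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx * n_seq * bytes_per_elem / 1e9

# Illustrative architecture (placeholder values):
arch = dict(n_layers=36, n_kv_heads=8, head_dim=64)

gb_16k = kv_cache_gb(16384, n_seq=4, **arch)
gb_64k = kv_cache_gb(65536, n_seq=4, **arch)
print(f"16k ctx, 4 seqs: {gb_16k:.1f} GB")
print(f"64k ctx, 4 seqs: {gb_64k:.1f} GB  ({gb_64k / gb_16k:.0f}x)")
```

Whatever the exact constants, the 16k → 64k jump multiplies the cache by exactly 4 at fixed concurrency, which is the budgeting fact that matters.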

“Best flags” that often matter with llama-server

  • --cont-batching for server workloads with overlapping requests (you already use it).
  • Keep --ubatch-size reasonable; it’s the “real” batch that hits memory first.  
  • --n-gpu-layers 99 effectively means “offload all layers”; it’s fine as long as the model fits on your GPU with your build.
  • Consider --ctx-size as your “budget knob”: bigger ctx = fewer concurrent users or smaller batches.