
gpt-oss-120b-mxfp4

A ~120-billion-parameter open-weights GPT-style model in 4-bit mixed-floating-point (MXFP4) quantization, designed to fit very large models onto fewer GPUs / less RAM.

Approx memory usage:

  • ~70-80 GB VRAM or RAM depending on runtime and KV cache

gpt-oss-120b

  • "gpt-oss" indicates an open-weights GPT-style model family/distribution name, as used by whoever packaged the GGUF.
  • "120b" indicates ~120 billion parameters (very large model class).

mxfp4

  • Indicates a 4-bit mixed floating-point quantization variant ("MXFP4").
  • Quantization reduces memory footprint and can improve throughput, with some potential quality loss versus FP16/FP32.
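The back-of-the-envelope arithmetic below shows why 4-bit weights land a 120B model in the range above. (MXFP4 also stores per-block scales, and GGUF files keep some tensors at higher precision, so real files run a bit larger than the raw figure; the KV cache and runtime buffers account for the rest of the ~70-80 GB.)

```python
# Rough weight-storage arithmetic for a ~120B-parameter model.
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate raw weight storage in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

params = 120e9
print(f"FP16:  {weight_gb(params, 16):.0f} GB")   # 240 GB
print(f"MXFP4: {weight_gb(params, 4):.0f} GB")    # 60 GB
```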


The model is sharded into 3 files. You need ALL shards present in the same directory for loading to succeed; pass only the first shard to --model and llama.cpp discovers the rest automatically:

gpt-oss-120b-mxfp4-00001-of-00003.gguf
gpt-oss-120b-mxfp4-00002-of-00003.gguf
gpt-oss-120b-mxfp4-00003-of-00003.gguf
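A quick preflight check like the sketch below can catch a missing shard before llama-server fails at load time (the directory and base name match this guide; the helper itself is illustrative):

```python
from pathlib import Path

def missing_shards(directory: str, base: str, total: int) -> list[str]:
    """Return the filenames from an N-of-M GGUF shard set that are absent on disk."""
    missing = []
    for i in range(1, total + 1):
        # Shard names follow the GGUF split convention: BASE-00001-of-00003.gguf
        name = f"{base}-{i:05d}-of-{total:05d}.gguf"
        if not (Path(directory) / name).is_file():
            missing.append(name)
    return missing

# Layout used in this guide:
gone = missing_shards("/usr/local/models/openai", "gpt-oss-120b-mxfp4", 3)
if gone:
    print("Refusing to start llama-server; missing:", *gone)
```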

Profile A: “Safe baseline”

/usr/local/bin/llama-server \
  --device CUDA0 \
  --model /usr/local/models/openai/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
  --host 0.0.0.0 --port 10000 \
  --n-gpu-layers 99 \
  --ctx-size 16384 \
  --batch-size 1024 \
  --ubatch-size 256 \
  --no-mmap

Why:

  • 16k context keeps KV cache sane while you validate throughput/latency.

  • Moderate batching won’t explode memory but still benefits from continuous batching once you enable it.
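To keep this baseline running across reboots and crashes, a systemd unit along these lines can wrap the command (the unit path/name, user defaults, and Restart policy are illustrative, not something llama.cpp ships):

```ini
# /etc/systemd/system/llama-server.service  (illustrative path/name)
[Unit]
Description=llama.cpp server (gpt-oss-120b-mxfp4)
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/llama-server \
  --device CUDA0 \
  --model /usr/local/models/openai/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
  --host 0.0.0.0 --port 10000 \
  --n-gpu-layers 99 \
  --ctx-size 16384 \
  --batch-size 1024 \
  --ubatch-size 256 \
  --no-mmap
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

After installing the file: systemctl daemon-reload && systemctl enable --now llama-server.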

Profile B: “High throughput / many concurrent users”

/usr/local/bin/llama-server \
  --device CUDA0 \
  --model /usr/local/models/openai/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
  --host 0.0.0.0 --port 10000 \
  --n-gpu-layers 99 \
  --ctx-size 16384 \
  --cont-batching \
  --batch-size 2048 \
  --ubatch-size 512 \
  --no-mmap

Notes:

  • --batch-size is the logical max; --ubatch-size is the physical max. 

  • If you aren’t consistently saturating the server, huge batch sizes can hurt responsiveness.

Profile C: “64k context (careful)”

64k is doable, but treat it like a special mode, not the default:

/usr/local/bin/llama-server \
 --device CUDA0 \
 --model /usr/local/models/openai/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
 --host 0.0.0.0 --port 10000 \
 --n-gpu-layers 99 \
 --ctx-size 65536 \
 --cont-batching \
 --batch-size 1024 \
 --ubatch-size 256 \
 --no-mmap

If you try 64k and massive batches and many concurrent users, KV cache will dominate memory and you’ll hit OOM/slowdowns.

Practical tuning workflow on DGX Spark

  • Start with Profile A (16k, moderate batch) until it’s rock solid.
  • If you need more throughput:
    * increase --batch-size (1024 → 2048 → 4096)
    * increase --ubatch-size (256 → 512 → 1024)
  • Only then move to 64k context, and expect to reduce batch/concurrency.

How to estimate memory risk (rule of thumb)

Total memory pressure = model weights + KV cache + workspace/batching overhead.

For huge models, KV cache grows roughly linearly with:

  • context length (--ctx-size)
  • number of simultaneous sequences (parallel requests)
  • batch/ubatch choices

So if you go from 16k → 64k, KV cache can jump ~4× for the same concurrency.
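The rule of thumb can be sketched as code. The architecture numbers below (layer count, KV heads, head dim) are illustrative placeholders, NOT confirmed values for gpt-oss-120b, and real runtimes may shrink the cache further with sliding-window attention or cache quantization; the point is the linear scaling in context and concurrency:

```python
def kv_cache_gb(ctx: int, n_seq: int, n_layers: int,
                n_kv_heads: int, head_dim: int,
                bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GB.

    2 tensors (K and V) * layers * kv_heads * head_dim, per token,
    per sequence. bytes_per_elem=2 assumes an FP16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx * n_seq * bytes_per_elem / 1e9

# Illustrative architecture (placeholder values):
arch = dict(n_layers=36, n_kv_heads=8, head_dim=64)

gb_16k = kv_cache_gb(16384, n_seq=4, **arch)
gb_64k = kv_cache_gb(65536, n_seq=4, **arch)
print(f"16k ctx, 4 seqs: {gb_16k:.1f} GB")
print(f"64k ctx, 4 seqs: {gb_64k:.1f} GB  ({gb_64k / gb_16k:.0f}x)")
```

Whatever the exact constants, the 16k → 64k jump multiplies the cache by exactly 4 at fixed concurrency, which is the budgeting fact that matters.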

“Best flags” that often matter with llama-server

  • --cont-batching for server workloads with overlapping requests (you already use it).
  • Keep --ubatch-size reasonable; it’s the “real” batch that hits memory first.  
  • --n-gpu-layers 99 effectively means “offload all layers”; it’s fine as long as the model fits on your GPU with your build.
  • Consider --ctx-size as your “budget knob”: bigger ctx = fewer concurrent users or smaller batches.