Published: 3/5/2026 · Tags: LLM, Llama

# gpt-oss-120b-mxfp4

A 120-billion-parameter open-weights GPT-style model in 4-bit mixed floating-point (MXFP4) quantization, packaged so that very large models can run on fewer GPUs / less RAM.

Approximate memory usage: ~70-80 GB of VRAM or RAM, depending on runtime and KV cache.

## Name breakdown

- **gpt-oss**: an open-weights GPT-style model family/distribution name, as used by whoever packaged the GGUF.
- **120b**: ~120 billion parameters (very large model class).
- **mxfp4**: a 4-bit mixed floating-point quantization variant ("MXFP4"). Quantization reduces the memory footprint and can improve throughput, with some potential quality loss versus FP16/FP32.

## Sharded files

The model is sharded into 3 files. You need ALL shards present in the same directory for loading to succeed:

```
gpt-oss-120b-mxfp4-00001-of-00003.gguf
gpt-oss-120b-mxfp4-00002-of-00003.gguf
gpt-oss-120b-mxfp4-00003-of-00003.gguf
```

## Profile A: "Safe baseline" (example llama-server command)

```
/usr/local/bin/llama-server \
  --device CUDA0 \
  --model /usr/local/models/openai/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
  --host 0.0.0.0 --port 10000 \
  --n-gpu-layers 99 \
  --ctx-size 16384 \
  --batch-size 1024 \
  --ubatch-size 256 \
  --no-mmap
```

Why:

- A 16k context keeps the KV cache sane while you validate throughput/latency.
- Moderate batching won't explode memory but still benefits from continuous batching once you enable it.

## Profile B: "High throughput / many concurrent users"

```
/usr/local/bin/llama-server \
  --device CUDA0 \
  --model /usr/local/models/openai/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
  --host 0.0.0.0 --port 10000 \
  --n-gpu-layers 99 \
  --ctx-size 16384 \
  --cont-batching \
  --batch-size 2048 \
  --ubatch-size 512 \
  --no-mmap
```

Notes:

- `--batch-size` is the logical max; `--ubatch-size` is the physical max.
- If you aren't consistently saturating the server, huge batch sizes can hurt responsiveness.
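Since loading fails unless every shard listed above is present, a small pre-flight check before launching any of these profiles can save a wasted multi-minute load. A minimal sketch; the helper name `check_shards` is mine, not part of llama.cpp:

```shell
# check_shards <dir> <base> <total>: succeed only if every
# <base>-NNNNN-of-NNNNN.gguf shard exists in <dir>.
check_shards() {
  local dir="$1" base="$2" total="$3"
  local i f missing=0
  for i in $(seq 1 "$total"); do
    f="$dir/$base-$(printf '%05d' "$i")-of-$(printf '%05d' "$total").gguf"
    if [ ! -f "$f" ]; then
      echo "missing shard: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage: launch only when all shards are in place, e.g.
# check_shards /usr/local/models/openai gpt-oss-120b-mxfp4 3 \
#   && exec /usr/local/bin/llama-server ...
```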
## Profile C: "64k context (careful)"

64k is doable, but treat it as a special mode, not the default:

```
/usr/local/bin/llama-server \
  --device CUDA0 \
  --model /usr/local/models/openai/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
  --host 0.0.0.0 --port 10000 \
  --n-gpu-layers 99 \
  --ctx-size 64000 \
  --cont-batching \
  --batch-size 1024 \
  --ubatch-size 256 \
  --no-mmap
```

If you combine a 64k context with massive batches and many concurrent users, the KV cache will dominate memory and you will hit OOMs/slowdowns.

## Practical tuning workflow on DGX Spark

1. Start with Profile A (16k context, moderate batch) until it is rock solid.
2. If you need more throughput:
   - increase `--batch-size` (1024 → 2048 → 4096)
   - increase `--ubatch-size` (256 → 512 → 1024)
3. Only then move to a 64k context, and expect to reduce batch size/concurrency.

## How to estimate memory risk (rule of thumb)

Total memory pressure = model weights + KV cache + workspace/batching overhead.

For huge models, the KV cache grows roughly linearly with:

- context length (`--ctx-size`)
- number of simultaneous sequences (parallel requests)
- batch/ubatch choices

So if you go from a 16k to a 64k context, the KV cache can jump ~4× for the same concurrency.

## "Best flags" that often matter with llama-server

- `--cont-batching` for server workloads with overlapping requests (you already use it).
- Keep `--ubatch-size` reasonable; it is the "real" batch that hits memory first.
- Set `--n-gpu-layers` only once. (99 is fine if it works for your build.)
- Treat `--ctx-size` as your "budget knob": a bigger context means fewer concurrent users or smaller batches.
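The rule of thumb above can be turned into a rough calculator. A sketch in shell arithmetic; the layer/head/precision numbers below are placeholder assumptions, not gpt-oss-120b's actual architecture, so substitute the values from your model's GGUF metadata:

```shell
# Rough KV-cache size: 2 (K and V) * layers * context length * KV heads
# * head dim * bytes per element * concurrent sequences.
# The architecture numbers here are ASSUMED placeholders, not gpt-oss-120b's.
kv_cache_mib() {
  local ctx="$1" n_seq="$2"
  local n_layers=36 n_kv_heads=8 head_dim=128 bytes_per_elem=2  # f16 K/V assumed
  echo $(( 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem * n_seq / 1024 / 1024 ))
}

# With these placeholder numbers:
#   kv_cache_mib 16384 1  -> 2304 MiB
#   kv_cache_mib 64000 1  -> 9000 MiB (~3.9x, matching the "~4x" jump above)
```

Because every factor is linear, doubling either the context or the number of concurrent sequences doubles the estimate, which is why `--ctx-size` works as a "budget knob".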