OpenAI has released two open-weight language models, gpt-oss-120b and gpt-oss-20b, its first openly licensed LLMs since GPT-2. They aim to deliver state-of-the-art reasoning and tool use in an openly available package, and they launched to considerable fanfare in the AI community.
By releasing gpt-oss under the Apache 2.0 license, OpenAI lets anyone freely use, adapt, and fine-tune the models for local, domain-specific tasks. In this guide, we'll go through how to access gpt-oss-120b and gpt-oss-20b, and when to use which model.
These open-weight models are OpenAI's most capable public release since GPT-2. They borrow techniques from its most advanced systems and are built to be practical: easy to run, adapt, and deploy.
Both gpt-oss models are Transformer-based networks employing a Mixture-of-Experts (MoE) design. In an MoE, only a subset of the parameters (the "experts") is active for each input token, which reduces computation per token.
The architecture also includes several advanced features: all attention layers use Rotary Positional Embeddings (RoPE) to handle very long contexts (up to 128,000 tokens). Attention itself alternates between a full-global and a 128-token sliding window, similar to GPT-3’s design.
These models use grouped multi-query attention with a group size of 8 to save memory while maintaining fast inference. Activations are SwiGLU. Importantly, all expert weights are quantized to a 4-bit MXFP4 format, allowing the large model to fit in one 80GB GPU and the smaller model in 16GB without a major accuracy loss.
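As a rough sanity check on those memory figures, here is a back-of-envelope sketch only; it ignores the 16-bit attention and embedding weights, KV cache, and runtime overhead, and assumes MXFP4's 4-bit values with one 8-bit scale shared per 32-value block:

# Back-of-envelope weight-memory estimate for MXFP4-quantized parameters.
# Assumption: ~4.25 bits/parameter (4-bit values + an 8-bit scale per 32-value block).
def approx_weight_gb(params_billion: float, bits_per_param: float = 4.25) -> float:
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

print(f"gpt-oss-120b: ~{approx_weight_gb(117):.0f} GB of weights")  # ~62 GB, within 80 GB
print(f"gpt-oss-20b:  ~{approx_weight_gb(21):.0f} GB of weights")   # ~11 GB, within 16 GB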
The table below summarizes the core specs:
| Specification | gpt-oss-120b | gpt-oss-20b |
|---|---|---|
| Total Parameters | 117 billion | 21 billion |
| Active Parameters per Token | 5.1 billion | 3.6 billion |
| Architecture | Mixture-of-Experts with 128 experts (4 active/token) | Mixture-of-Experts with 32 experts (4 active/token) |
| Transformer Blocks | 36 layers | 24 layers |
| Context Window | 128,000 tokens | 128,000 tokens |
| Memory Requirements | 80 GB (fits on a single H100 GPU) | 16 GB |
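To make the MoE routing idea concrete, here is a minimal, illustrative top-k router in PyTorch. This is a sketch of the general technique only; the real gpt-oss layers differ (SwiGLU experts, different dimensions, 128 or 32 experts with 4 active, different routing details), and every size below (d_model=64, num_experts=8, top_k=4) is chosen purely for illustration:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Illustrative Mixture-of-Experts layer: each token is routed to its top-k experts."""
    def __init__(self, d_model=64, d_hidden=128, num_experts=8, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # scores every expert for each token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.router(x)                 # (tokens, num_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)   # normalize over the chosen experts only
        out = torch.zeros_like(x)
        # Only the selected experts run for each token; that is where the compute saving comes from
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

tokens = torch.randn(5, 64)           # 5 example token embeddings
print(TinyMoELayer()(tokens).shape)   # torch.Size([5, 64])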
Here are the ways to get started with gpt-oss:
1. Hugging Face Transformers: Install the latest libraries and load the model directly. The following command installs the necessary prerequisites:
pip install --upgrade accelerate transformers
The code below downloads the required model from the Hugging Face hub.
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
model = AutoModelForCausalLM.from_pretrained(
"openai/gpt-oss-20b", device_map="auto", torch_dtype="auto")
Once the model has been downloaded, you can test it out using:
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain why the sky is blue."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))
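The decoded output above includes the prompt as well as the reply. If you only want the newly generated text, one optional tweak (relying on the return_dict=True form above) is to slice off the prompt tokens before decoding:

# Optional: decode only the newly generated tokens
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))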
This setup follows OpenAI's guide and runs on any GPU with enough memory. (For best speed on NVIDIA A100/H100-class cards, install the Triton MXFP4 kernels; otherwise the weights are dequantized and the model runs in 16-bit, which uses considerably more memory.)
2. vLLM: For high-throughput or multi-GPU serving, you can use the vLLM library. OpenAI notes that gpt-oss-120b can be served this way on two H100-class GPUs. You can install vLLM using:
pip install vllm
One can start a server with:
vllm serve openai/gpt-oss-120b --tensor-parallel-size 2
Or in Python:
from vllm import LLM

# Load gpt-oss-120b across two GPUs with tensor parallelism
llm = LLM("openai/gpt-oss-120b", tensor_parallel_size=2)
outputs = llm.generate("San Francisco is a")
print(outputs[0].outputs[0].text)
This uses optimized attention kernels on Hopper GPUs.
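Once the `vllm serve` command above is running, it exposes an OpenAI-compatible endpoint (port 8000 by default), so you can query it with the standard OpenAI SDK. A minimal sketch, assuming the default host and port:

from openai import OpenAI

# vLLM's OpenAI-compatible server; the api_key value is arbitrary for a local server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Give me one fun fact about the Moon."}],
)
print(response.choices[0].message.content)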
3. Ollama (Local on Mac/Windows): Ollama is a turnkey local chat server. After installing Ollama, simply run:
ollama pull gpt-oss:20b
ollama run gpt-oss:20b
This will download the quantized model and start an interactive chat session. Ollama applies the chat template (the "harmony" format) automatically. You can also call it via API, for example, using Python and the OpenAI SDK pointed at Ollama's local endpoint:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
model="gpt-oss:20b",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain what MXFP4 quantization is."}
]
)
print(response.choices[0].message.content)
This sends the prompt to the local gpt-oss model, just like the official API.
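Because the endpoint is OpenAI-compatible, streaming works the same way as with the hosted API. A small sketch (the prompt is just an example):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
stream = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Write a haiku about open-weight models."}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental piece of the reply
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()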
4. Llama.cpp (CPU/ARM): Pre-built GGUF versions of the models are available (e.g., ggml-org/gpt-oss-120b-GGUF on Hugging Face). After installing llama.cpp, you can serve the model locally:
# macOS:
brew install llama.cpp
# Start a local HTTP server for inference:
llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 -fa --jinja --reasoning-format none
Then send chat messages to http://localhost:8080 in the same OpenAI-compatible format. This option works even on CPU-only machines, or on GPU backends such as Metal or Vulkan.
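For example, you can point the OpenAI SDK at the llama.cpp server just as with Ollama above. A minimal sketch, assuming the default port 8080 (the model name is informational when llama-server hosts a single model):

from openai import OpenAI

# llama-server exposes an OpenAI-compatible API on port 8080 by default
client = OpenAI(base_url="http://localhost:8080/v1", api_key="llama.cpp")
response = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[{"role": "user", "content": "Summarize the benefits of MoE models in two sentences."}],
)
print(response.choices[0].message.content)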
Overall, gpt-oss models can be used with most common frameworks. The above methods (Transformers, vLLM, Ollama, llama.cpp) cover desktop and server setups. You can mix and match – for instance, run one setup for fast inference (vLLM on GPU) and another for on-device testing (Ollama or llama.cpp).
Prompt: “”” Select the option that is related to the third term in the same way as the second term is related to the first term.
IVORY : ZWSPJ :: CREAM : ?
A. NFDQB
B. SNFDB
C. DSFCN
D. BQDZL
”””
import os
os.environ['HF_TOKEN'] = 'HF_TOKEN'  # replace 'HF_TOKEN' with your actual Hugging Face access token
from openai import OpenAI
client = OpenAI(
base_url="https://router.huggingface.co/v1",
api_key=os.environ["HF_TOKEN"],
)
completion = client.chat.completions.create(
model="openai/GPT-Oss-20b", # openai/GPT-Oss-120b Change to use 120b model
messages=[
{
"role": "user",
"content": """Select the option that is related to the third term in the same way as the second term is related to the first term.
IVORY : ZWSPJ :: CREAM : ?
A. NFDQB
B. SNFDB
C. DSFCN
D. BQDZL
"""
}
],
)
# Check if there's content in the main content field
if completion.choices[0].message.content:
print("Content:", completion.choices[0].message.content)
else:
# If content is None, check reasoning_content
print("Reasoning Content:", completion.choices[0].message.reasoning_content)
# For Markdown display in Jupyter
from IPython.display import display, Markdown

# Display whichever field actually contains text
content_to_display = (completion.choices[0].message.content or
                      completion.choices[0].message.reasoning_content or
                      "No content available")
display(Markdown(content_to_display))
gpt-oss-120b Response:

gpt-oss-20b Response:

Comparative Analysis
gpt-oss-120b correctly identifies the pattern in the analogy and selects option C, methodically working through the character transformation between the word pairs to arrive at the mapping. gpt-oss-20b, by contrast, fails to return an answer, most likely because it hits the output token limit.
This suggests difficulties with long outputs and less efficient reasoning on the smaller model. Overall, gpt-oss-120b handles symbolic reasoning with far more control and accuracy, making it the more reliable choice of the two for this verbal-analogy task.
Prompt: “”” Given two sorted arrays nums1 and nums2 of size m and n respectively, return the median of the two sorted arrays.
The overall run time complexity should be O(log (m+n)) in C++.
Example 1:
Input: nums1 = [1,3], nums2 = [2]
Output: 2.00000
Explanation: merged array = [1,2,3] and median is 2.
Example 2:
Input: nums1 = [1,2], nums2 = [3,4]
Output: 2.50000
Explanation: merged array = [1,2,3,4] and median is (2 + 3) / 2 = 2.5.
Constraints:
nums1.length == m
nums2.length == n
0 <= m <= 1000
0 <= n <= 1000
1 <= m + n <= 2000
-10^6 <= nums1[i], nums2[i] <= 10^6
”””
import os
from openai import OpenAI
client = OpenAI(
base_url="https://router.huggingface.co/v1",
api_key=os.environ["HF_TOKEN"],
)
completion = client.chat.completions.create(
model="openai/GPT-Oss-120b", # openai/GPT-Oss-20b change to use 20b model
messages=[
{
"role": "user",
"content": """Given two sorted arrays nums1 and nums2 of size m and n respectively, return the median of the two sorted arrays.
The overall run time complexity should be O(log (m+n)) in C++.
Example 1:
Input: nums1 = [1,3], nums2 = [2]
Output: 2.00000
Explanation: merged array = [1,2,3] and median is 2.
Example 2:
Input: nums1 = [1,2], nums2 = [3,4]
Output: 2.50000
Explanation: merged array = [1,2,3,4] and median is (2 + 3) / 2 = 2.5.
Constraints:
nums1.length == m
nums2.length == n
0 <= m <= 1000
0 <= n <= 1000
1 <= m + n <= 2000
-10^6 <= nums1[i], nums2[i] <= 10^6
"""
}
],
)
# Check if there's content in the main content field
if completion.choices[0].message.content:
print("Content:", completion.choices[0].message.content)
else:
# If content is None, check reasoning_content
print("Reasoning Content:", completion.choices[0].message.reasoning_content)
# For Markdown display in Jupyter
from IPython.display import display, Markdown

# Display whichever field actually contains text
content_to_display = (completion.choices[0].message.content or
                      completion.choices[0].message.reasoning_content or
                      "No content available")
display(Markdown(content_to_display))
gpt-oss-120b Response
#include <bits/stdc++.h>
using namespace std;
class Solution {
public:
double findMedianSortedArrays(const vector<int>& nums1,
const vector<int>& nums2) {
// Let A be the shorter array.
const vector<int>* A = &nums1;
const vector<int>* B = &nums2;
if (A->size() > B->size()) swap(A, B); // now |A| ≤ |B|
int m = A->size();
int n = B->size();
int total = m + n;
int half = (total + 1) / 2; // size of left part
int low = 0, high = m; // we search over number of elements taken from A
while (low <= high) {
int i = (low + high) / 2; // elements from A on the left side
int j = half - i; // must come from B
// Edge handling: use infinities for out‑of‑range indices
int a_left = (i == 0 ? INT_MIN : (*A)[i - 1]);
int a_right = (i == m ? INT_MAX : (*A)[i]);
int b_left = (j == 0 ? INT_MIN : (*B)[j - 1]);
int b_right = (j == n ? INT_MAX : (*B)[j]);
if (a_left > b_right) { // i too large
high = i - 1;
} else if (b_left > a_right) { // i too small
low = i + 1;
} else { // correct split
int max_left = max(a_left, b_left);
if (total % 2 == 1) // odd
return static_cast<double>(max_left);
int min_right = min(a_right, b_right);
return (max_left + min_right) / 2.0;
}
}
// The code never reaches here because the constraints guarantee at least one element.
return 0.0;
}
};
/* Example usage
int main() {
Solution s;
vector<int> a = {1, 3};
vector<int> b = {2};
cout << fixed << setprecision(5) << s.findMedianSortedArrays(a, b) << endl; // 2.00000
a = {1,2};
b = {3,4};
cout << fixed << setprecision(5) << s.findMedianSortedArrays(a, b) << endl; // 2.50000
}
*/
gpt-oss-20b Response
# Reference Implementation (C++17)
#include <bits/stdc++.h>
using namespace std;
class Solution {
public:
double findMedianSortedArrays(vector<int>& nums1,
vector<int>& nums2) {
// Ensure nums1 is the shorter array
if (nums1.size() > nums2.size())
return findMedianSortedArrays(nums2, nums1);
int m = nums1.size();
int n = nums2.size();
int low = 0, high = m;
int halfLen = (m + n + 1) / 2; // size of left part
while (low <= high) {
int i = (low + high) / 2; // partition in nums1
int j = halfLen - i; // partition in nums2
// i too small - must move right
if (i < m && nums2[j-1] > nums1[i])
low = i + 1;
// i too big - must move left
else if (i > 0 && nums1[i-1] > nums2[j])
high = i - 1;
else {
// correct partition found
int maxLeft;
if (i == 0)
maxLeft = nums2[j-1];
else if (j == 0)
maxLeft = nums1[i-1];
else
maxLeft = max(nums1[i-1], nums2[j-1]);
// odd total, median is maxLeft
if ((m + n) % 2 == 1)
return static_cast<double>(maxLeft);
// even total, need minRight as well
int minRight;
if (i == m)
minRight = nums2[j];
else if (j == n)
minRight = nums1[i];
else
minRight = min(nums1[i], nums2[j]);
return (maxLeft + minRight) / 2.0;
}
}
// Should never reach here for valid input
return 0.0;
}
};
/* Example usage:
int main() {
Solution s;
vector<int> a = {1, 3};
vector<int> b = {2};
cout << fixed << setprecision(5) << s.findMedianSortedArrays(a, b) << endl; // 2.00000
}
*/
Comparative Analysis
Both models produce the standard O(log(min(m, n))) binary-search partition solution. gpt-oss-120b's version swaps pointers so the shorter array is searched and uses INT_MIN/INT_MAX sentinels to keep the boundary handling compact, and its usage snippet covers both worked examples. gpt-oss-20b reaches the same algorithm, reordering the arrays via recursion and handling the edges with explicit branches. On this well-known coding problem the two responses are comparable in quality; the gap between the models shows up more clearly on the open-ended reasoning task above and on the benchmarks below.
Choosing between the 120B and 20B models depends on your project's needs and the task at hand.
OpenAI has shared results on standard benchmarks. The 120B model consistently scores higher than the 20B model on hard reasoning and knowledge tasks, though both perform strongly.
| Benchmark | gpt-oss-120b | gpt-oss-20b | OpenAI o3 | OpenAI o4-mini |
|---|---|---|---|---|
| MMLU | 90.0 | 85.3 | 93.4 | 93.0 |
| GPQA Diamond | 80.1 | 71.5 | 83.3 | 81.4 |
| Humanity’s Last Exam | 19.0 | 17.3 | 24.9 | 17.7 |
| AIME 2024 | 96.6 | 96.0 | 95.2 | 98.7 |
| AIME 2025 | 97.9 | 98.7 | 98.4 | 99.5 |


Here are some applications where gpt-oss shines:
The 120B model clearly outperforms across the board: sharper content generation, harder problem-solving, better code, and faster adaptation in research and agentic tasks. Its main tradeoff is resource intensity, which makes local deployment a challenge. But if you have the infrastructure, there's no contest; it isn't just an upgrade, it's a different tier of capability.