# Reasoning
Modern language models can "think before answering" — generating an internal chain of thought before producing a final response. In msgFlux there are two distinct mechanisms for reasoning, and understanding the difference is key to using them effectively.
## Two Kinds of Reasoning
| | Model-level reasoning | Schema-level reasoning |
|---|---|---|
| What it is | The model's native thinking capability (e.g. `reasoning_effort="high"`) | A `generation_schema` that forces structured thinking (CoT, ReAct, SelfConsistency) |
| Where reasoning lives | `response.reasoning` — a first-class field on the response object | Inside `response.consume()` — as a field of the structured output (e.g. `result.reasoning`) |
| Configured via | Model params: `reasoning_effort`, `return_reasoning`, `reasoning_max_tokens` | Agent param: `generation_schema=ChainOfThought` |
| Works with any model | No — requires a reasoning-capable model (Groq gpt-oss, OpenAI o-series, etc.) | Yes — any model that supports structured output |
| Controllable budget | Yes — `reasoning_effort` and `reasoning_max_tokens` | No — the model decides how much to write in the schema field |
| Can combine with tools | Yes — `reasoning_in_tool_call=True` preserves the chain across calls | Yes — ReAct is specifically designed for tool use |
> **Tip:** You can use both simultaneously. A reasoning model with `reasoning_effort="high"` and `generation_schema=ChainOfThought` will think internally and produce a structured `reasoning` field. The internal trace goes to `response.reasoning`; the schema field goes to `response.consume().reasoning`.
For schema-level reasoning (CoT, ReAct, SelfConsistency), see Generation Schemas. This page focuses on model-level reasoning — the first-class `response.reasoning` field.
## 1. Configuration
Model-level reasoning is configured at model initialization through parameters forwarded to the provider:
```python
import msgflux as mf

model = mf.Model.chat_completion(
    "groq/openai/gpt-oss-120b",
    reasoning_effort="low",       # How much to think: "minimal", "low", "medium", "high"
    return_reasoning=True,        # Store the trace in response.reasoning (default: True)
    reasoning_max_tokens=1024,    # Cap the thinking budget in tokens
    reasoning_in_tool_call=True,  # Preserve reasoning across tool call rounds
)
```
At the Agent level, one additional config key controls how reasoning is surfaced in the Agent's output:
| Config key | Type | Default | Effect |
|---|---|---|---|
| `reasoning_in_response` | `bool` | `False` | When `True`, the Agent wraps its output as `dotdict(answer=raw_response, reasoning=reasoning)` instead of returning the raw response. |
```python
import msgflux.nn as nn

class Solver(nn.Agent):
    model = model
    instructions = "Solve the problem step by step."
    config = {"reasoning_in_response": True}
```
> **Why `reasoning_in_response` exists:** By default, the Agent returns the raw response (`str` for text, `dict` for structured output). The reasoning is still available on the underlying `ModelResponse`, but the Agent's `forward()` returns only the content. If your downstream code needs both the answer and the reasoning in a single object, enable `reasoning_in_response`. The contract is uniform: always `dotdict(answer=..., reasoning=...)`, regardless of whether the raw response is a `str` or a `dict`.
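The wrapping contract can be sketched in a few lines. This is an illustrative stand-in only — `dotdict` is modeled as a plain `dict` subclass and `apply_reasoning_in_response` is a hypothetical helper, not the library's internals:

```python
# Simplified stand-ins for illustration; msgFlux's actual dotdict and
# Agent internals may differ.
class dotdict(dict):
    """A dict that also allows attribute access (answer, reasoning)."""
    __getattr__ = dict.__getitem__

def apply_reasoning_in_response(raw_response, reasoning, enabled):
    # Wrap only when the config is on AND the model actually produced
    # reasoning; otherwise the raw response passes through unchanged.
    if enabled and reasoning is not None:
        return dotdict(answer=raw_response, reasoning=reasoning)
    return raw_response

wrapped = apply_reasoning_in_response("108", "15 * 7 = 105, then 105 + 3 = 108", True)
print(wrapped.answer)  # 108
plain = apply_reasoning_in_response("4", None, True)
print(plain)           # 4 — no reasoning, no wrapping
```

Because the wrapped value is still a `dict`, downstream code can branch on `isinstance(response, dict)` when it is unsure whether wrapping occurred.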
## 2. Non-Streaming
In non-streaming mode, the model completes its full response before returning. Reasoning is available immediately as a string field.
### Basic usage

**Example**
```python
import msgflux as mf
import msgflux.nn as nn

model = mf.Model.chat_completion(
    "groq/openai/gpt-oss-120b",
    reasoning_effort="low",
    return_reasoning=True,
)

class Solver(nn.Agent):
    model = model
    instructions = "Solve the problem. Answer with just the result."

agent = Solver()
response = agent("What is 15 * 7 + 3?")

print(type(response))  # <class 'str'>
print(response)        # "108"
```
The reasoning trace is not in the Agent's return value. It lives on the `ModelResponse` inside the pipeline. To access it, use `reasoning_in_response`.
```python
import msgflux as mf
import msgflux.nn as nn

model = mf.Model.chat_completion(
    "groq/openai/gpt-oss-120b",
    reasoning_effort="low",
    return_reasoning=True,
)

class Solver(nn.Agent):
    model = model
    instructions = "Solve the problem. Answer with just the result."
    config = {"reasoning_in_response": True}

agent = Solver()
response = agent("What is 15 * 7 + 3?")

print(type(response))      # <class 'dotdict'>
print(response.answer)     # "108"
print(response.reasoning)  # "15 * 7 = 105, then 105 + 3 = 108"
```
```python
import msgflux as mf
import msgflux.nn as nn

model = mf.Model.chat_completion(
    "groq/openai/gpt-oss-120b",
    reasoning_effort="low",
    return_reasoning=False,  # Discard reasoning
)

class Solver(nn.Agent):
    model = model
    instructions = "Solve the problem."
    config = {"reasoning_in_response": True}

agent = Solver()
response = agent("What is 2 + 2?")

# No reasoning available → no wrapping, plain str
print(type(response))  # <class 'str'>
print(response)        # "4"
```
When `return_reasoning=False` (or the model simply doesn't reason), `reasoning` is `None` and `reasoning_in_response` has no effect — the Agent returns the raw response as-is.
```python
import msgflux as mf
import msgflux.nn as nn

model = mf.Model.chat_completion(
    "groq/openai/gpt-oss-120b",
    reasoning_effort="low",
    return_reasoning=True,
)

class Solver(nn.Agent):
    model = model
    instructions = "Solve the problem. Answer with just the result."
    config = {"reasoning_in_response": True}

agent = Solver()
response = await agent.acall("What is 15 * 7 + 3?")  # inside an async context

print(response.answer)     # "108"
print(response.reasoning)  # "15 * 7 = 105, then 105 + 3 = 108"
```
### How it works internally
When the Agent calls the model in non-streaming mode, the pipeline is:
```
Agent.forward("What is 15 * 7?")
│
├── _execute_model() → ModelResponse
│     ├── .data = "108"                ← the answer
│     ├── .reasoning = "15*7=105..."   ← the trace
│     ├── .has_reasoning = True
│     └── .response_type = "text_generation"
│
└── _process_model_response()
      ├── raw_response = model_response.consume()  → "108"
      ├── reasoning = model_response.reasoning     → "15*7=105..."
      │
      └── _prepare_response()
            └── _apply_reasoning_in_response(raw_response, reasoning)
                  ├── reasoning_in_response=True  → dotdict(answer="108", reasoning="15*7=105...")
                  └── reasoning_in_response=False → "108"
```
The key insight is that `model_response.reasoning` is always populated (when the provider returns it), regardless of the `reasoning_in_response` config. The config only controls whether the Agent wraps the output.
## 3. Streaming
Streaming with reasoning introduces a dual-queue architecture. Content and reasoning flow through independent queues, allowing consumers to process them in parallel or sequentially.
### Consuming streams

When `stream=True`, the Agent returns a `ModelStreamResponse`. Both `consume()` and `consume_reasoning()` are async generators:

**Example**
```python
import msgflux as mf
import msgflux.nn as nn

model = mf.Model.chat_completion(
    "groq/openai/gpt-oss-120b",
    reasoning_effort="low",
    return_reasoning=True,
)

class Assistant(nn.Agent):
    model = model
    instructions = "Answer concisely."
    config = {"stream": True}

agent = Assistant()
response = await agent.acall("What is 2+2?")  # inside an async context

# Consume content chunks
async for chunk in response.consume():
    print(chunk, end="", flush=True)
print()

# Consume reasoning chunks
async for chunk in response.consume_reasoning():
    print(chunk, end="", flush=True)
```
The queues are independent — you can consume reasoning before content. This is useful when you want to display the chain of thought in a UI before showing the answer.
In sync contexts, the stream runs in a background thread. After the stream completes, the accumulated fields are available:
```python
import time

agent = Assistant()
response = agent("What is 2+2?")

# first_chunk_event fires on the first token (often reasoning)
response.first_chunk_event.wait(timeout=10)

# Wait for the stream to complete
for _ in range(50):
    if response.metadata is not None:
        break
    time.sleep(0.1)

# After completion
print(response.reasoning)      # Full accumulated reasoning
print(response.has_reasoning)  # True
```
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import msgflux as mf
import msgflux.nn as nn

app = FastAPI()

model = mf.Model.chat_completion(
    "groq/openai/gpt-oss-120b",
    reasoning_effort="low",
    return_reasoning=True,
)

class Assistant(nn.Agent):
    model = model
    instructions = "Answer concisely."
    config = {"stream": True}

agent = Assistant()

@app.get("/chat")
async def chat(query: str):
    response = await agent.acall(query)
    return StreamingResponse(
        response.consume(),
        media_type="text/plain",
    )

@app.get("/chat/reasoning")
async def chat_reasoning(query: str):
    response = await agent.acall(query)
    return StreamingResponse(
        response.consume_reasoning(),
        media_type="text/plain",
    )
```
### The two-event system
Streaming responses use two events to signal different stages:
| Event | Fires when | Purpose |
|---|---|---|
| `first_chunk_event` | First token arrives (reasoning or content) | Signals the stream is alive. Fires early — often on the first reasoning token. |
| `_response_type_event` | `response_type` is determined (`"text_generation"` or `"tool_call"`) | The Agent waits on this event before deciding how to process the response. |
This separation matters because reasoning models emit reasoning tokens before any content. With a single event, anything waiting for a sign of life would block until the first content token arrives — potentially seconds after the model began streaming reasoning.
Timeline:

```
┌─ reasoning tokens ──────────────────┐┌── content tokens ───────────┐
│ think think think think think ...   ││ Hello, the answer is ...    │
  ▲                                    ▲                            ▲
  │                                    │                            │
  first_chunk_event        _response_type_event              metadata set
  (fires here)                  (fires here)                 (stream done)
```
Inside the Agent, the flow is:
```python
# Agent._process_model_response():
if isinstance(model_response, ModelStreamResponse):
    # Blocks here — on the response-type event, not on first_chunk_event.
    wait_for_event(model_response._response_type_event)
    # Now response_type is guaranteed to be set.
    if "tool_call" in model_response.response_type:
        ...  # enter the tool call loop
    else:
        ...  # return the stream response to the caller
```
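The timing relationship between the two events can be reproduced with plain `threading.Event`s. This is a toy sketch of the pattern, not msgFlux code — `fake_stream` and `state` are invented for illustration:

```python
# Toy sketch of the two-event pattern using standard threading primitives.
import threading
import time

first_chunk_event = threading.Event()
response_type_event = threading.Event()
state = {"response_type": None}

def fake_stream():
    # Reasoning tokens arrive first: the stream is alive,
    # but the response type is still unknown.
    first_chunk_event.set()
    time.sleep(0.05)  # the model is still "thinking"
    # The first content token determines the response type.
    state["response_type"] = "text_generation"
    response_type_event.set()

producer = threading.Thread(target=fake_stream)
producer.start()

first_chunk_event.wait(timeout=5)    # fires early — the stream is alive
response_type_event.wait(timeout=5)  # fires later — now safe to branch on the type
print(state["response_type"])        # text_generation
producer.join()
```

A UI can react to the first event (show a "thinking…" indicator) while the dispatch logic waits on the second.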
### How the dual-queue works internally
```
Provider stream (background thread)
│
├── reasoning chunk → add_reasoning(chunk)
│     ├── has_reasoning = True       (first time only)
│     ├── first_chunk_event.set()    (first time only)
│     └── chunk → reasoning queue
│
├── content chunk → add(chunk)
│     ├── set_response_type("text_generation")
│     │     └── _response_type_event.set()   (first time only)
│     ├── first_chunk_event.set()            (if not already)
│     └── chunk → content queue
│
└── finally:
      ├── stream_response.reasoning = accumulated   ← full text available
      ├── add_reasoning(None)          ← sentinel (end of reasoning)
      ├── add(None)                    ← sentinel (end of content)
      ├── _response_type_event.set()   ← safety net
      └── set_metadata(usage)
```
Each queue uses a `deque` as a pending buffer that is flushed into an `asyncio.Queue` when a consumer first calls `consume()` / `consume_reasoning()`. The `None` sentinel signals end-of-stream to the async generator.
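A minimal model of that queue design is sketched below. `ChunkChannel` is a hypothetical name, assuming the deque-then-`asyncio.Queue` handoff and `None` sentinel described above; the real implementation may differ:

```python
# Minimal sketch (assumed design, not the library's code): chunks buffer
# in a deque until a consumer attaches, then flow through an asyncio.Queue;
# a None sentinel marks end-of-stream.
import asyncio
from collections import deque

class ChunkChannel:
    def __init__(self):
        self._pending = deque()  # buffers chunks before a consumer attaches
        self._queue = None       # created lazily when consumption starts

    def add(self, chunk):
        if self._queue is None:
            self._pending.append(chunk)
        else:
            self._queue.put_nowait(chunk)

    async def consume(self):
        self._queue = asyncio.Queue()
        while self._pending:  # flush anything buffered before attachment
            self._queue.put_nowait(self._pending.popleft())
        while True:
            chunk = await self._queue.get()
            if chunk is None:  # sentinel: end of stream
                return
            yield chunk

async def main():
    channel = ChunkChannel()
    for part in ["15*7", "=105", None]:  # None terminates the stream
        channel.add(part)
    return [c async for c in channel.consume()]

print(asyncio.run(main()))  # ['15*7', '=105']
```

The lazy `asyncio.Queue` matters because the provider stream may run in a thread without an event loop; buffering in a `deque` defers loop-bound work until a consumer is present.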
> **`reasoning_in_response` has no effect in streaming.** When `stream=True`, the Agent returns the `ModelStreamResponse` directly — it does not wrap it. The consumer accesses reasoning through `consume_reasoning()` and content through `consume()`. The `reasoning_in_response` config only applies to non-streaming responses.
## 4. Reasoning Across Tool Calls

When a reasoning model calls tools, it normally loses its chain of thought between rounds. Enable `reasoning_in_tool_call=True` on the model to preserve the reasoning context.

### How it works

After each tool call round, the `ToolCallAggregator` formats the assistant message that goes back into the conversation history. When `reasoning_in_tool_call=True`, the reasoning is embedded in `<think>` tags inside that message:
Message history:

```python
[
    {"role": "user", "content": "What is (14+28)*3-7?"},
    {"role": "assistant",
     "content": "<think>I need to compute (14+28) first, then multiply by 3, then subtract 7. Let me use the calculator.</think>",
     "tool_calls": [{"function": {"name": "calc", "arguments": {"expr": "14+28"}}}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "42"},
    {"role": "assistant",
     "content": "<think>14+28=42. Now I need 42*3. Let me call calc again.</think>",
     "tool_calls": [{"function": {"name": "calc", "arguments": {"expr": "42*3"}}}]},
    {"role": "tool", "tool_call_id": "call_2", "content": "126"},
    {"role": "assistant", "content": "The answer is 119."}
]
```
The model sees its own previous reasoning at each step, enabling coherent multi-step problem solving.
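The shape of those assistant messages can be sketched with a small helper. `format_assistant_turn` is hypothetical — the real `ToolCallAggregator` API differs — but it shows the `<think>` embedding the history above relies on:

```python
# Hypothetical helper illustrating the <think> embedding; not msgFlux's API.
def format_assistant_turn(reasoning, tool_calls):
    """Build an assistant message whose content carries the reasoning in <think> tags."""
    content = f"<think>{reasoning}</think>" if reasoning else ""
    return {"role": "assistant", "content": content, "tool_calls": tool_calls}

msg = format_assistant_turn(
    "14+28=42. Now I need 42*3.",
    [{"function": {"name": "calc", "arguments": {"expr": "42*3"}}}],
)
print(msg["content"])  # <think>14+28=42. Now I need 42*3.</think>
```

Because the reasoning travels as ordinary message content, any provider that accepts plain chat history replays it to the model on the next round.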
> **Two separate reasoning stores.** The `ToolCallAggregator` keeps its own copy of the reasoning for message formatting (`<think>` tags in the conversation). The `ModelResponse.reasoning` field on the final model call reflects only the reasoning from that last call. These are intentionally separate — the conversation history needs the full chain, while the response field exposes the latest trace.
**Example**
```python
import msgflux as mf
import msgflux.nn as nn

def add(a: int, b: int) -> int:
    """Add two numbers together."""
    return a + b

def multiply(a: int, b: int) -> int:
    """Multiply two numbers together."""
    return a * b

model = mf.Model.chat_completion(
    "groq/openai/gpt-oss-120b",
    reasoning_effort="high",
    return_reasoning=True,
    reasoning_in_tool_call=True,
)

class Calculator(nn.Agent):
    model = model
    instructions = "Use the tools to compute the result. Answer with just the number."
    tools = [add, multiply]

agent = Calculator()
response = agent("What is (3 + 5) * 4?")
print(response)  # "32"
```
```python
class Calculator(nn.Agent):
    model = model
    instructions = "Use the tools to compute the result. Answer with just the number."
    tools = [add, multiply]
    config = {"reasoning_in_response": True}

agent = Calculator()
response = agent("What is (3 + 5) * 4?")

# If the final model call includes reasoning:
if isinstance(response, dict):
    print(response.answer)     # "32"
    print(response.reasoning)  # "3+5=8, 8*4=32"
else:
    # Final call had no reasoning (model may skip it on simple answers)
    print(response)  # "32"
```
> **Reasoning on the final call is not guaranteed.** After the tool call loop completes, the model makes a final call to produce the answer. This final call may or may not include reasoning — it depends on the model and the complexity of the remaining task. When `reasoning_in_response=True` and the final call has no reasoning, the Agent returns the raw response (no `dotdict` wrapping).
## 5. Combining with Generation Schemas
Model-level reasoning and schema-level reasoning serve different purposes and can be combined:
| Approach | Reasoning lives in | Use case |
|---|---|---|
| Model-level only | `response.reasoning` | When you want the model's native thinking without constraining the output format |
| Schema-level only | `response.consume().reasoning` | When you want explicit, structured reasoning visible in the output |
| Both combined | Both fields populated | Maximum reasoning quality — the model thinks internally and produces structured reasoning |
**Example**
```python
import msgflux as mf
import msgflux.nn as nn

model = mf.Model.chat_completion(
    "groq/openai/gpt-oss-120b",
    reasoning_effort="high",
    return_reasoning=True,
)

class Solver(nn.Agent):
    model = model
    instructions = "Solve the problem."
    config = {"reasoning_in_response": True}

agent = Solver()
response = agent("What is 25 * 4 + 17?")
# response.answer    = "117"
# response.reasoning = "25 * 4 = 100, 100 + 17 = 117" (model's internal trace)
```
```python
import msgflux as mf
import msgflux.nn as nn
from msgflux.generation.reasoning import ChainOfThought

model = mf.Model.chat_completion("openai/gpt-4.1-mini")  # no reasoning_effort

class Solver(nn.Agent):
    model = model
    generation_schema = ChainOfThought

agent = Solver()
result = agent("What is 25 * 4 + 17?")
# result.reasoning    = "Step 1: 25 * 4 = 100..." (schema field)
# result.final_answer = "117"
```
```python
import msgflux as mf
import msgflux.nn as nn
from msgflux.generation.reasoning import ChainOfThought

model = mf.Model.chat_completion(
    "groq/openai/gpt-oss-120b",
    reasoning_effort="high",
    return_reasoning=True,
)

class Solver(nn.Agent):
    model = model
    generation_schema = ChainOfThought
    config = {"reasoning_in_response": True}

agent = Solver()
response = agent("What is 25 * 4 + 17?")
# response.answer    = {"reasoning": "Step 1: ...", "final_answer": "117"} (schema)
# response.reasoning = "The user asks 25*4+17. Let me compute..." (model trace)
```
The schema reasoning is a structured, user-facing explanation. The model reasoning is the raw internal trace — often more detailed and less polished.
## 6. Verbose Mode

When `verbose=True`, the Agent prints both the reasoning trace and the response to the console:
```python
class Solver(nn.Agent):
    model = model
    instructions = "Solve the problem."
    config = {"verbose": True}

agent = Solver()
agent("What is 15 * 7?")
```
The console output shows the reasoning trace alongside the final answer, which is useful for debugging the relationship between the model's thinking and its response.
## 7. Quick Reference

### Model parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `reasoning_effort` | `str` | — | `"minimal"`, `"low"`, `"medium"`, `"high"` |
| `return_reasoning` | `bool` | `True` | Store reasoning in `response.reasoning` |
| `reasoning_max_tokens` | `int` | — | Cap reasoning token budget |
| `reasoning_in_tool_call` | `bool` | `False` | Embed reasoning in `<think>` tags across tool call rounds |
| `enable_thinking` | `bool` | `False` | Provider-level switch (e.g. Anthropic) |
### Agent config
| Config key | Type | Default | Description |
|---|---|---|---|
| `reasoning_in_response` | `bool` | `False` | Wrap output as `dotdict(answer=..., reasoning=...)` |
### Response API
| Non-streaming | Streaming | Description |
|---|---|---|
| `response.consume()` → `str` or `dict` | `response.consume()` → `AsyncGenerator[str, None]` | Final answer |
| `response.consume_reasoning()` → `str` or `None` | `response.consume_reasoning()` → `AsyncGenerator[str, None]` | Reasoning trace |
| `response.reasoning` → `str` or `None` | `response.reasoning` → `str` or `None` (after stream ends) | Direct attribute |
| `response.has_reasoning` → `bool` (property) | `response.has_reasoning` → `bool` (mutable flag) | Discoverability |
## See also
- Chat Completion — Reasoning Models — Model-level reasoning reference with internal architecture details
- Generation Schemas — Reasoning Schemas — CoT, ReAct, SelfConsistency
- Tools — Tool calling overview
- Streaming — General streaming overview