
nn.Transcriber

Overview

The nn.Transcriber module wraps speech-to-text models to transcribe audio into text or structured data.


1. Quick Start

Initialization styles

Class-based (declarative):

```python
import msgflux as mf
import msgflux.nn as nn

class Speech2Text(nn.Transcriber):
    """Transcribes user voice notes."""
    model          = mf.Model.speech_to_text("openai/whisper-1")
    response_mode  = "content"
    message_fields = {"task_multimodal": {"audio": "user_audio"}}

transcriber = Speech2Text()
result = transcriber("/path/to/audio.mp3")
```

Direct instantiation:

```python
import msgflux as mf
import msgflux.nn as nn

transcriber = nn.Transcriber(
    model=mf.Model.speech_to_text("openai/whisper-1")
)
result = transcriber("/path/to/audio.mp3")
```

2. Input Types

Supported audio inputs

```python
# Local file path
result = transcriber("/path/to/audio.mp3")

# Remote URL
result = transcriber("https://example.com/audio.wav")

# Raw bytes
with open("audio.mp3", "rb") as f:
    audio_bytes = f.read()

result = transcriber(audio_bytes)
```

Extract audio from a structured message via message_fields:

import msgflux as mf
import msgflux.nn as nn

class Speech2Text(nn.Transcriber):
    model          = mf.Model.speech_to_text("openai/whisper-1")
    message_fields = {"task_multimodal": {"audio": "user_audio"}}
    response_mode  = "transcription"

transcriber = Speech2Text()

msg = mf.dotdict(user_audio="/path/to/audio.mp3")
transcriber(msg)
print(msg.transcription)
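The `mf.dotdict` above behaves like a dictionary whose keys are also readable and writable as attributes, which is how the transcriber writes `msg.transcription` back onto the message. A minimal sketch of that behavior in plain Python (an illustration only, not msgflux's actual implementation):

```python
class DotDict(dict):
    """Dict with attribute-style access, mimicking mf.dotdict."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError as exc:
            raise AttributeError(name) from exc

    def __setattr__(self, name, value):
        self[name] = value

msg = DotDict(user_audio="/path/to/audio.mp3")
msg.transcription = "hello world"   # a module writing its output back
print(msg["transcription"])         # → hello world
```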

3. Parameters

| Parameter | Description |
| --- | --- |
| `model` | Speech-to-text model client instance |
| `message_fields` | Maps inputs (e.g. audio) from Message fields |
| `response_mode` | Where to write the output in the Message |
| `response_template` | Jinja template to format the output string |
| `response_format` | `"text"` (default), `"json"`, `"verbose_json"`, `"srt"`, `"vtt"` |
| `prompt` | Optional text prompt to guide style or vocabulary |
| `config` | Runtime params passed to the model: `language`, `stream`, `timestamp_granularities` |

4. Configuration

Controlling transcription behavior

Specify the spoken language to improve accuracy and speed via config:

class PortugueseTranscriber(nn.Transcriber):
    model  = mf.Model.speech_to_text("openai/whisper-1")
    config = {"language": "pt"}  # ISO 639-1 code

transcriber = PortugueseTranscriber()
result = transcriber("audio.mp3")

Request word- or segment-level timestamps; this requires response_format="verbose_json" and the whisper-1 model:

class TimestampTranscriber(nn.Transcriber):
    model           = mf.Model.speech_to_text("openai/whisper-1")
    response_format = "verbose_json"
    config          = {"timestamp_granularities": ["word"]}

transcriber = TimestampTranscriber()
result = transcriber("audio.mp3")
# result["text"]  — full transcript
# result["words"] — [{"word": "Hello", "start": 0.0, "end": 0.5}, ...]
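The `words` list above can be post-processed in plain Python, for example to render a simple per-word timeline. The sample data mirrors the structure shown; no fields beyond `word`/`start`/`end` are assumed:

```python
def format_words(words):
    """Render [{'word', 'start', 'end'}, ...] as 'start-end  word' lines."""
    return "\n".join(
        f"{w['start']:6.2f}-{w['end']:6.2f}  {w['word']}" for w in words
    )

words = [
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "world", "start": 0.5, "end": 0.9},
]
print(format_words(words))
```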

Export directly as SRT or VTT for video workflows:

class SubtitleGenerator(nn.Transcriber):
    model           = mf.Model.speech_to_text("openai/whisper-1")
    response_format = "srt"

gen = SubtitleGenerator()
srt_content = gen("video_audio.mp3")
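SRT cues use `HH:MM:SS,mmm` timestamps. If you instead receive `verbose_json` output and want to assemble subtitles yourself, the timestamp formatter is the core piece. A sketch in plain Python:

```python
def srt_timestamp(seconds):
    """Convert float seconds to the SRT 'HH:MM:SS,mmm' format."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(srt_timestamp(3671.25))  # → 01:01:11,250
```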

5. Integration with Agents

Transcribers are often the first step in a voice processing pipeline.

Transcriber → Agent pipeline

import msgflux as mf
import msgflux.nn as nn

class Speech2Text(nn.Transcriber):
    model          = mf.Model.speech_to_text("openai/whisper-1")
    message_fields = {"task_multimodal": {"audio": "user_audio"}}
    response_mode  = "content"

class Analyzer(nn.Agent):
    """Analyzes the transcribed text."""
    model          = mf.Model.chat_completion("openai/gpt-4.1-mini")
    message_fields = {"task": "content"}
    response_mode  = "analysis"

transcriber = Speech2Text()
analyzer = Analyzer()

pipeline = mf.Inline(
    "{user_audio is not None? transcriber} -> analyzer",
    {"transcriber": transcriber, "analyzer": analyzer},
)

msg = mf.dotdict(user_audio="/path/to/voice_note.mp3")
pipeline(msg)
print(f"Transcript: {msg.content}")
print(f"Analysis: {msg.analysis}")
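The Inline expression above routes the message through the transcriber only when `user_audio` is set, then through the analyzer. In plain Python the control flow is roughly equivalent to the following sketch, with stub callables standing in for the real modules:

```python
def run_pipeline(msg, transcriber, analyzer):
    """Rough control-flow equivalent of the Inline expression above."""
    if msg.get("user_audio") is not None:
        transcriber(msg)   # writes msg["content"]
    analyzer(msg)          # writes msg["analysis"]
    return msg

# Stub modules for illustration only.
def stub_transcriber(msg):
    msg["content"] = f"transcript of {msg['user_audio']}"

def stub_analyzer(msg):
    msg["analysis"] = f"analysis of {msg.get('content', '')}"

msg = {"user_audio": "/path/to/voice_note.mp3"}
run_pipeline(msg, stub_transcriber, stub_analyzer)
print(msg["content"])
print(msg["analysis"])
```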

6. Async

Use the acall coroutine for non-blocking transcription:

result = await transcriber.acall("/path/to/audio.mp3")
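Because acall is a coroutine, several files can be transcribed concurrently with asyncio.gather. The sketch below substitutes a stub coroutine for transcriber.acall so it runs standalone:

```python
import asyncio

async def acall_stub(path):
    """Stands in for transcriber.acall(path)."""
    await asyncio.sleep(0)  # yield control, as real I/O would
    return f"transcript of {path}"

async def transcribe_all(paths):
    # With a real transcriber: asyncio.gather(*(transcriber.acall(p) for p in paths))
    return await asyncio.gather(*(acall_stub(p) for p in paths))

results = asyncio.run(transcribe_all(["a.mp3", "b.mp3"]))
print(results)  # → ['transcript of a.mp3', 'transcript of b.mp3']
```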

7. Debugging

Inspect the parameters the module would pass to the model for a given input:

params = transcriber.inspect_model_execution_params("/path/to/audio.mp3")
print(params)