Multilingual Support

The vocode project can be configured to support multiple languages for speech synthesis and automatic speech recognition (ASR).

Speech Synthesis

The speech synthesizer used in vocode is configurable. By default, vocode uses the AzureSynthesizer, which supports more than 75 voices across more than 45 languages.

To configure a different language, modify the SynthesizerConfig when initializing the conversation:

from vocode.streaming.models.synthesizer import AzureSynthesizerConfig

synthesizer_config = AzureSynthesizerConfig(
    voice_name="de-DE-KatjaNeural",  # German voice
    language_code="de-DE",  # Language of text to be synthesized
)
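If an application serves several languages, one way to keep the voice and language code consistent is a small lookup table. The sketch below is illustrative only: the helper name synthesizer_kwargs and the table are assumptions, not part of vocode, and the Spanish voice name should be checked against Azure's current voice list. It returns plain keyword arguments so it runs without vocode installed.

```python
from typing import Dict

# Hypothetical table mapping a language code to an Azure voice name.
# "de-DE-KatjaNeural" and "en-US-AriaNeural" appear elsewhere on this page;
# the Spanish entry is an assumption to verify against Azure's voice list.
VOICES_BY_LANGUAGE: Dict[str, str] = {
    "en-US": "en-US-AriaNeural",
    "de-DE": "de-DE-KatjaNeural",
    "es-ES": "es-ES-ElviraNeural",
}


def synthesizer_kwargs(language_code: str) -> Dict[str, str]:
    """Return keyword arguments for a synthesizer config in the given language."""
    try:
        voice_name = VOICES_BY_LANGUAGE[language_code]
    except KeyError:
        supported = ", ".join(sorted(VOICES_BY_LANGUAGE))
        raise ValueError(
            f"No voice configured for {language_code!r}; choose one of: {supported}"
        ) from None
    return {"voice_name": voice_name, "language_code": language_code}
```

The result can then be unpacked into the config, for example AzureSynthesizerConfig(**synthesizer_kwargs("de-DE")), together with any other arguments your setup requires.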

Transcription

The transcriber used in vocode is also configurable. By default, vocode uses the DeepgramTranscriber, which supports more than 35 languages.

To transcribe a different language, set the language code passed to the TranscriberConfig when you create it (en-US is the default):

from vocode.streaming.models.transcriber import DeepgramTranscriberConfig

transcriber_config = DeepgramTranscriberConfig(
    language="es",  # Spanish
    model="nova-2",  # Most languages are supported on the Nova 2 model
)

Note: the default model for Deepgram is Nova, so you must pass model="nova-2" explicitly to use Nova 2.
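Because the model has to be passed explicitly, it can help to build the transcriber arguments in one place. The following is a minimal sketch under stated assumptions: deepgram_kwargs is a hypothetical helper, and the pattern only checks that the value looks like a language tag (such as es or en-US), not that Deepgram actually supports it.

```python
import re
from typing import Dict

# Loose shape check for tags such as "es", "en-US" or "es-419".
# This is an illustrative assumption, not a list of supported languages.
_LANGUAGE_TAG = re.compile(r"[A-Za-z]{2,3}(-[A-Za-z0-9]{2,8})*")


def deepgram_kwargs(language: str = "en-US", model: str = "nova-2") -> Dict[str, str]:
    """Return transcriber keyword arguments with the model always set."""
    language = language.strip()
    if not _LANGUAGE_TAG.fullmatch(language):
        raise ValueError(f"{language!r} does not look like a language code")
    return {"language": language, "model": model}
```

A call such as DeepgramTranscriberConfig(**deepgram_kwargs("es")) would then request Spanish on Nova 2, and a value like "Spanish" is rejected early instead of being sent to the service.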

Other transcription services, such as Google Cloud Speech or AssemblyAI, can also be used by configuring the appropriate TranscriberConfig.

Language Configuration

It is recommended to load the speech synthesizer voice and transcription model from environment variables or configuration to avoid hard-coding language choices:

import os
from vocode.streaming.models.synthesizer import AzureSynthesizerConfig
from vocode.streaming.models.transcriber import DeepgramTranscriberConfig

# Fall back to an English voice and the Nova 2 model when the variables are unset
SYNTHESIZER_VOICE = os.getenv("SYNTHESIZER_VOICE", default="en-US-AriaNeural")
DEEPGRAM_MODEL = os.getenv("DEEPGRAM_MODEL", default="nova-2")

synthesizer_config = AzureSynthesizerConfig(voice_name=SYNTHESIZER_VOICE)
transcriber_config = DeepgramTranscriberConfig(model=DEEPGRAM_MODEL)

This lets you change the speech language per deployment without code changes.
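To cover the language codes as well as the voice and model, the settings can be gathered into one object. The sketch below is only one possible layout: SpeechSettings, load_speech_settings, and the variable names SYNTHESIZER_LANGUAGE and DEEPGRAM_LANGUAGE are assumptions introduced for illustration. It accepts any mapping so it can be exercised without touching the real environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SpeechSettings:
    voice_name: str            # Azure voice, e.g. "de-DE-KatjaNeural"
    language_code: str         # Language of the text to be synthesized
    transcriber_language: str  # Language code passed to the transcriber
    transcriber_model: str     # Deepgram model, e.g. "nova-2"


def load_speech_settings(env: Optional[Mapping[str, str]] = None) -> SpeechSettings:
    """Read speech settings from a mapping, defaulting to os.environ."""
    env = os.environ if env is None else env
    return SpeechSettings(
        voice_name=env.get("SYNTHESIZER_VOICE", "en-US-AriaNeural"),
        # The two *_LANGUAGE variable names are hypothetical.
        language_code=env.get("SYNTHESIZER_LANGUAGE", "en-US"),
        transcriber_language=env.get("DEEPGRAM_LANGUAGE", "en-US"),
        transcriber_model=env.get("DEEPGRAM_MODEL", "nova-2"),
    )
```

The fields map directly onto the config arguments shown earlier: voice_name and language_code for the synthesizer, language and model for the transcriber.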