Overview

Synthesizers are used to convert text into speech; this guide will show you how to configure and use synthesizers in Vocode.

Supported Synthesizers

Vocode currently supports the following synthesizers:

  1. Azure (Microsoft)
  2. Google
  3. Cartesia
  4. Eleven Labs
  5. Rime
  6. Play.ht
  7. GTTS (Google Text-to-Speech)
  8. Stream Elements
  9. Bark
  10. Amazon Polly

These synthesizers are defined using their respective configuration classes, which are subclasses of the SynthesizerConfig class.
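
For reference, every example below imports its configuration class from vocode.streaming.models.synthesizer. A minimal sketch of that pattern (the Cartesia and Play.ht import locations are inferred from the same convention rather than stated explicitly in this guide):

from vocode.streaming.models.synthesizer import (
    SynthesizerConfig,
    AzureSynthesizerConfig,
    CartesiaSynthesizerConfig,
    ElevenLabsSynthesizerConfig,
    PlayHtSynthesizerConfig,
)

# Each provider config subclasses SynthesizerConfig, so any of them can be
# passed wherever a SynthesizerConfig is expected.
assert issubclass(ElevenLabsSynthesizerConfig, SynthesizerConfig)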

Configuring Synthesizers

To use a synthesizer, you first need to create a configuration object for it. Here are some examples of how to create configuration objects for different synthesizers:

Example 1: Using Eleven Labs with a phone call

import os

from vocode.streaming.telephony.hosted.inbound_call_server import InboundCallServer
from vocode.streaming.models.synthesizer import ElevenLabsSynthesizerConfig

server = InboundCallServer(
    ...
    synthesizer_config=ElevenLabsSynthesizerConfig.from_telephone_output_device(
        api_key=os.getenv("ELEVENLABS_API_KEY"),
        voice_id=os.getenv("YOUR VOICE ID"),
    ),
    ...
)

In this example, the ElevenLabsSynthesizerConfig.from_telephone_output_device() method is used to create a configuration object for the Eleven Labs synthesizer. The method hardcodes some values like the sampling_rate and audio_encoding for compatibility with telephone output devices.
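
A quick way to see what was fixed is to construct the config and inspect those fields. A sketch (the concrete values are chosen by Vocode for telephony; a low sampling rate with mu-law encoding is typical for phone carriers, though that detail is an assumption here):

import os

from vocode.streaming.models.synthesizer import ElevenLabsSynthesizerConfig

config = ElevenLabsSynthesizerConfig.from_telephone_output_device(
    api_key=os.getenv("ELEVENLABS_API_KEY"),
    voice_id=os.getenv("YOUR VOICE ID"),
)

# Fixed by the factory for compatibility with telephone output devices.
print(config.sampling_rate, config.audio_encoding)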

ElevenLabs Input Streaming

You can try out our experimental implementation of ElevenLabs’ input streaming API by passing experimental_websocket=True into the config and using the ElevenLabsWSSynthesizer, like so:

import os

from vocode.streaming.synthesizer.eleven_labs_websocket_synthesizer import ElevenLabsWSSynthesizer
from vocode.streaming.models.synthesizer import ElevenLabsSynthesizerConfig

# In a telephony server:
...
synthesizer_config=ElevenLabsSynthesizerConfig.from_telephone_output_device(
    api_key=os.getenv("ELEVENLABS_API_KEY"),
    voice_id=os.getenv("YOUR VOICE ID"),
    experimental_websocket=True,
)
...

# In a local StreamingConversation:
synthesizer=ElevenLabsWSSynthesizer(ElevenLabsSynthesizerConfig.from_output_device(
    speaker_output,
    api_key=os.getenv("ELEVENLABS_API_KEY"),
    voice_id=os.getenv("YOUR VOICE ID"),
    experimental_websocket=True,
))
...

Play.ht v2

We now support Play.ht’s new gRPC streaming API, which runs much faster than their HTTP API and is designed for realtime communication.

from vocode.streaming.models.synthesizer import PlayHtSynthesizerConfig

...
synthesizer_config=PlayHtSynthesizerConfig.from_telephone_output_device(
    api_key=os.getenv("PLAY_HT_API_KEY"),
    user_id=os.getenv("PLAY_HT_USER_ID"),
)
...
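
Wiring this into a phone call follows the same pattern as the Eleven Labs example above. A minimal, self-contained sketch:

import os

from vocode.streaming.telephony.hosted.inbound_call_server import InboundCallServer
from vocode.streaming.models.synthesizer import PlayHtSynthesizerConfig

server = InboundCallServer(
    ...
    synthesizer_config=PlayHtSynthesizerConfig.from_telephone_output_device(
        api_key=os.getenv("PLAY_HT_API_KEY"),
        user_id=os.getenv("PLAY_HT_USER_ID"),
    ),
    ...
)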

Example 2: Using Cartesia’s streaming synthesizer

We support Cartesia’s low-latency streaming API enabled by WebSockets. You can use the CartesiaSynthesizer with the CartesiaSynthesizerConfig to enable this feature.

Telephony

from vocode.streaming.models.synthesizer import CartesiaSynthesizerConfig

synthesizer_config=CartesiaSynthesizerConfig.from_telephone_output_device(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id=os.getenv("CARTESIA_VOICE_ID"),
)

In this example, the CartesiaSynthesizerConfig.from_telephone_output_device() method is used to create a configuration object for the Cartesia synthesizer. As with the Eleven Labs example above, the method fixes the sampling_rate and audio_encoding for compatibility with telephone output devices.
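
For local (non-telephony) use, the from_output_device() variant takes a speaker_output object and extracts the sampling_rate and audio_encoding from it, as in the Azure example below. A sketch, assuming CartesiaSynthesizer can be imported from vocode.streaming.synthesizer.cartesia_synthesizer (the module path is an assumption based on the naming convention of the other synthesizers):

import os

from vocode.streaming.models.synthesizer import CartesiaSynthesizerConfig
# Assumption: module path follows the convention of the other synthesizers.
from vocode.streaming.synthesizer.cartesia_synthesizer import CartesiaSynthesizer

synthesizer = CartesiaSynthesizer(
    CartesiaSynthesizerConfig.from_output_device(
        speaker_output,  # e.g. from create_microphone_input_and_speaker_output (Example 3)
        api_key=os.getenv("CARTESIA_API_KEY"),
        voice_id=os.getenv("CARTESIA_VOICE_ID"),
    )
)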

Controlling Speed & Emotions

You can set the speed and emotion parameters in the CartesiaSynthesizerConfig object to control how the agent’s voice sounds! See Cartesia’s documentation for more details.

CartesiaSynthesizerConfig(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id=os.getenv("CARTESIA_VOICE_ID"),
    experimental_voice_controls={
        "speed": "slow",
        "emotion": "positivity: high"
    }
)
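
Constructing the config directly like this leaves sampling_rate and audio_encoding for you to fill in. For telephony, you can instead pass the voice controls through the factory from the previous snippet; a sketch, assuming from_telephone_output_device forwards extra keyword arguments to the config the same way it forwards api_key and voice_id:

synthesizer_config=CartesiaSynthesizerConfig.from_telephone_output_device(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id=os.getenv("CARTESIA_VOICE_ID"),
    # Assumption: extra keyword arguments are forwarded to the config.
    experimental_voice_controls={
        "speed": "slow",
        "emotion": "positivity: high",
    },
)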

Example 3: Using Azure in StreamingConversation locally

from vocode.streaming.streaming_conversation import StreamingConversation
from vocode.streaming.models.synthesizer import AzureSynthesizerConfig
from vocode.streaming.synthesizer.azure_synthesizer import AzureSynthesizer
from vocode.helpers import create_microphone_input_and_speaker_output

microphone_input, speaker_output = create_microphone_input_and_speaker_output(
    streaming=True, use_default_devices=False
)

conversation = StreamingConversation(
    ...
    synthesizer=AzureSynthesizer(
        AzureSynthesizerConfig.from_output_device(speaker_output)
    ),
    ...
)

In this example, the AzureSynthesizerConfig.from_output_device() method is used to create a configuration object for the Azure synthesizer. The method takes a speaker_output object as an argument and extracts the sampling_rate and audio_encoding from the output device.
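
Because the config copies these values from the device, the synthesizer produces audio in exactly the format the speaker expects. A quick sanity check, assuming the output device exposes sampling_rate and audio_encoding attributes (which from_output_device implies):

config = AzureSynthesizerConfig.from_output_device(speaker_output)

# from_output_device copies these from the device, so playback needs no
# resampling or re-encoding.
assert config.sampling_rate == speaker_output.sampling_rate
assert config.audio_encoding == speaker_output.audio_encoding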