How it works
Core concepts that power Vocode.
Conversation orchestration as a service
To have a back-and-forth conversation, you have to do several things:
- Stream input and output audio asynchronously
- Generate responses and understand when a response is called for
- Handle inaccuracies and interruptions
And all of this is done via orchestration of:
- Speech Recognition
- AI/NLU Layer
- Speech Synthesis
Vocode conveniently abstracts away much of the complexity while giving developers the flexibility to control every piece of the conversation.
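To make the orchestration concrete, here is a minimal, self-contained sketch of the pattern Vocode manages for you: three asynchronous stages (speech recognition, AI/NLU, speech synthesis) connected by queues. Everything here is illustrative; none of these function names are Vocode APIs.

```python
import asyncio


async def transcribe(audio_chunks: asyncio.Queue, transcripts: asyncio.Queue):
    # Speech recognition stage: audio chunks in, transcripts out
    while True:
        chunk = await audio_chunks.get()
        transcripts.put_nowait(f"<transcript of {len(chunk)} bytes>")


async def respond(transcripts: asyncio.Queue, responses: asyncio.Queue):
    # AI/NLU stage: decides what to say in response to each transcript
    while True:
        transcript = await transcripts.get()
        responses.put_nowait(f"<reply to: {transcript}>")


async def synthesize(responses: asyncio.Queue):
    # Speech synthesis stage: would stream audio to the output device
    while True:
        response = await responses.get()
        print(f"<audio for: {response}>")


async def main():
    audio_chunks, transcripts, responses = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    tasks = [
        asyncio.create_task(transcribe(audio_chunks, transcripts)),
        asyncio.create_task(respond(transcripts, responses)),
        asyncio.create_task(synthesize(responses)),
    ]
    audio_chunks.put_nowait(b"\x00" * 3200)  # fake microphone chunk
    await asyncio.sleep(0.1)  # let the pipeline drain
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)


asyncio.run(main())
```

The point of the pattern is that each stage runs concurrently, so the agent can start generating a response while audio is still being transcribed, and playback can be cut short when the user interrupts.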
Our core abstraction: the Conversation
Vocode breaks down a Conversation into 5 core pieces:
- Transcriber (used for speech recognition)
- Agent (AI/NLU layer)
- Synthesizer (used for speech synthesis)
- Input Device (microphone for audio in)
- Output Device (speaker for audio out)
To run an entire conversation, developers specify each of these 5 pieces using the types Vocode provides.
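As a sketch of what this looks like in practice, the snippet below wires together a Deepgram transcriber, a ChatGPT agent, an Azure synthesizer, and the default microphone and speaker, following Vocode's Python quickstart. Exact module paths and helper names may differ between Vocode versions.

```python
import asyncio
import signal

from vocode.helpers import create_streaming_microphone_input_and_speaker_output
from vocode.streaming.agent.chat_gpt_agent import ChatGPTAgent
from vocode.streaming.models.agent import ChatGPTAgentConfig
from vocode.streaming.models.message import BaseMessage
from vocode.streaming.models.synthesizer import AzureSynthesizerConfig
from vocode.streaming.models.transcriber import (
    DeepgramTranscriberConfig,
    PunctuationEndpointingConfig,
)
from vocode.streaming.streaming_conversation import StreamingConversation
from vocode.streaming.synthesizer.azure_synthesizer import AzureSynthesizer
from vocode.streaming.transcriber.deepgram_transcriber import DeepgramTranscriber


async def main():
    # Input Device and Output Device: the default microphone and speaker
    microphone_input, speaker_output = create_streaming_microphone_input_and_speaker_output(
        use_default_devices=True,
    )
    conversation = StreamingConversation(
        output_device=speaker_output,
        # Transcriber: Deepgram for speech recognition
        transcriber=DeepgramTranscriber(
            DeepgramTranscriberConfig.from_input_device(
                microphone_input,
                endpointing_config=PunctuationEndpointingConfig(),
            )
        ),
        # Agent: the AI/NLU layer that decides what to say
        agent=ChatGPTAgent(
            ChatGPTAgentConfig(
                initial_message=BaseMessage(text="Hello!"),
                prompt_preamble="Have a pleasant conversation",
            )
        ),
        # Synthesizer: Azure for speech synthesis
        synthesizer=AzureSynthesizer(
            AzureSynthesizerConfig.from_output_device(speaker_output)
        ),
    )
    await conversation.start()
    print("Conversation started, press Ctrl+C to end")
    signal.signal(
        signal.SIGINT, lambda _sig, _frame: asyncio.create_task(conversation.terminate())
    )
    # Pump microphone audio into the conversation until it ends
    while conversation.is_active():
        chunk = await microphone_input.get_audio()
        conversation.receive_audio(chunk)


if __name__ == "__main__":
    asyncio.run(main())
```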
As an example, there are several Transcriber options (e.g. `DeepgramTranscriber`, `AssemblyAITranscriber`, `GoogleTranscriber`) that let you choose which provider to use and set its parameters.
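For instance, a Deepgram-backed transcriber might be configured as below; the specific parameter values (sampling rate, encoding, chunk size, model name) are illustrative and depend on your audio source and Deepgram account.

```python
from vocode.streaming.models.audio_encoding import AudioEncoding
from vocode.streaming.models.transcriber import (
    DeepgramTranscriberConfig,
    PunctuationEndpointingConfig,
)
from vocode.streaming.transcriber.deepgram_transcriber import DeepgramTranscriber

# Pick Deepgram as the speech recognition provider and set its parameters.
transcriber = DeepgramTranscriber(
    DeepgramTranscriberConfig(
        sampling_rate=16000,               # illustrative: match your input device
        audio_encoding=AudioEncoding.LINEAR16,
        chunk_size=2048,
        model="nova",                      # illustrative Deepgram model name
        endpointing_config=PunctuationEndpointingConfig(),
    )
)
```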
Once you have specified all of the types, Vocode handles everything else needed to run the conversation.