Fully local conversation
💃 Run a voice agent on your computer without Internet
Introduction
You can use Vocode to interact with open-source transcription, large language, and synthesis models. Many of these models have been optimized to run on CPU, which means that you can have a conversation with an AI locally without Internet (and thus for free!).
Disclaimer: Many of these models are optimized for Apple Silicon, so this may work best on an M1 or M2 Mac.
Setting up the conversation
Start by copying the StreamingConversation quickstart.
This example uses Deepgram for transcription, ChatGPT as the LLM, and Azure for synthesis; we'll replace each piece with a corresponding open-source model.
Whisper.cpp
Follow the steps in the whisper.cpp repo to download one of the models.
As of now (2023/05/01), here’s an example flow to do this:
- Clone the whisper.cpp repo
- From the whisper.cpp directory, run:
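One plausible set of commands as of that date (the tiny model and the libwhisper.so make target are assumptions here; check the whisper.cpp README for the current targets):

```bash
# Download the tiny model (writes models/ggml-tiny.bin)
bash ./models/download-ggml-model.sh tiny

# Build the shared library that the transcriber loads
make libwhisper.so
```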
Find your (absolute) paths for the whisper.cpp shared library file and the model you’ve just downloaded.
If whisper.cpp is downloaded at /whisper.cpp, the paths from the previous example would be:
- /whisper.cpp/libwhisper.so
- /whisper.cpp/models/ggml-tiny.bin
Set up your streaming WhisperCPPTranscriber in StreamingConversation as follows:
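A minimal sketch, assuming WhisperCPPTranscriberConfig takes the shared-library and model paths as `libname` and `fname_model`; check the field and module names against your installed Vocode version. `microphone_input` comes from the quickstart's microphone/speaker helper.

```python
from vocode.streaming.models.transcriber import WhisperCPPTranscriberConfig
from vocode.streaming.transcriber.whisper_cpp_transcriber import WhisperCPPTranscriber

transcriber = WhisperCPPTranscriber(
    WhisperCPPTranscriberConfig.from_input_device(
        microphone_input,
        buffer_size_seconds=1,
        libname="/whisper.cpp/libwhisper.so",  # shared library built above
        fname_model="/whisper.cpp/models/ggml-tiny.bin",  # model downloaded above
    )
)
```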
GPT4All
Install the pygpt4all package by running:
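```bash
pip install pygpt4all
```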
Download the latest GPT4All-J model from the pygpt4all repo.
As of today (2023/05/01), you can download it by visiting: https://gpt4all.io/models/ggml-gpt4all-j-v1.3-groovy.bin
Set up your agent in StreamingConversation as follows:
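A sketch of the agent setup; the GPT4AllAgentConfig class and its `model_path` field are assumptions and should be checked against your installed Vocode version.

```python
from vocode.streaming.agent.gpt4all_agent import GPT4AllAgent
from vocode.streaming.models.agent import GPT4AllAgentConfig
from vocode.streaming.models.message import BaseMessage

agent = GPT4AllAgent(
    GPT4AllAgentConfig(
        model_path="path/to/ggml-gpt4all-j-v1.3-groovy.bin",  # the model downloaded above
        initial_message=BaseMessage(text="Hello!"),
        prompt_preamble="The AI is having a pleasant conversation about life.",
    )
)
```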
Llama.cpp
You can use any model supported by llama.cpp with Vocode. This includes LLaMA, Alpaca, Vicuna, Koala, WizardLM, and more. We will use NousResearch/Nous-Hermes-13b in this example because it currently ranks highly on Hugging Face’s Open LLM Leaderboard.
Our implementation is built on top of langchain, which integrates with llama.cpp through llama-cpp-python.
Install llama-cpp-python by running the following:
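```bash
pip install llama-cpp-python
```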
Or run the following to install it with support for offloading model layers to a GPU via CUDA:
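```bash
# cuBLAS-enabled build; these flags follow the llama-cpp-python README at the time and may have changed
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python
```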
llama-cpp-python has more installation commands for different BLAS backends.
Set up your agent in StreamingConversation as follows:
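A sketch assuming LlamacppAgentConfig passes `llamacpp_kwargs` through to llama.cpp and accepts a `prompt_template` name; the GGML filename below is illustrative, so substitute whichever Nous-Hermes-13b weights you downloaded.

```python
from vocode.streaming.agent.llamacpp_agent import LlamacppAgent
from vocode.streaming.models.agent import LlamacppAgentConfig
from vocode.streaming.models.message import BaseMessage

agent = LlamacppAgent(
    LlamacppAgentConfig(
        initial_message=BaseMessage(text="Hello!"),
        prompt_preamble="The AI is having a pleasant conversation about life.",
        llamacpp_kwargs={"model_path": "path/to/nous-hermes-13b.ggmlv3.q4_0.bin"},
        prompt_template="alpaca",
    )
)
```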
You can add the key n_gpu_layers to the llamacpp_kwargs to offload some of the model’s layers to a GPU.
Coqui TTS
Install the Coqui TTS package by running:
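```bash
# Coqui TTS is published on PyPI as "TTS"
pip install TTS
```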
See the Coqui TTS repo for more detailed installation instructions if you run into issues.
Find which open-source speech synthesis model you’d like to use. One way to do this is to list the available models:
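```bash
tts --list_models
```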
For this example, we’ll use Tacotron2.
Set up your synthesizer in StreamingConversation as follows:
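A sketch assuming CoquiTTSSynthesizerConfig forwards `tts_kwargs` to Coqui’s TTS constructor; the Tacotron2 model name below is one of the entries from `tts --list_models` and is used as an example. `speaker_output` comes from the quickstart's microphone/speaker helper.

```python
from vocode.streaming.models.synthesizer import CoquiTTSSynthesizerConfig
from vocode.streaming.synthesizer.coqui_tts_synthesizer import CoquiTTSSynthesizer

synthesizer = CoquiTTSSynthesizer(
    CoquiTTSSynthesizerConfig.from_output_device(
        speaker_output,
        tts_kwargs={"model_name": "tts_models/en/ljspeech/tacotron2-DDC_ph"},
    )
)
```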
Run the conversation
Putting this all together, our StreamingConversation instance looks like:
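A rough end-to-end sketch combining the quickstart scaffolding with the pieces above (using the Llama.cpp agent here; the GPT4All agent can be swapped in the same way). The helper and config names follow the quickstart and the assumptions noted in each section, so check them against your installed Vocode version.

```python
import asyncio
import signal

from vocode.helpers import create_streaming_microphone_input_and_speaker_output
from vocode.streaming.streaming_conversation import StreamingConversation
from vocode.streaming.models.message import BaseMessage
from vocode.streaming.models.transcriber import WhisperCPPTranscriberConfig
from vocode.streaming.models.agent import LlamacppAgentConfig
from vocode.streaming.models.synthesizer import CoquiTTSSynthesizerConfig
from vocode.streaming.transcriber.whisper_cpp_transcriber import WhisperCPPTranscriber
from vocode.streaming.agent.llamacpp_agent import LlamacppAgent
from vocode.streaming.synthesizer.coqui_tts_synthesizer import CoquiTTSSynthesizer


async def main():
    # Default microphone and speaker, as in the quickstart
    microphone_input, speaker_output = create_streaming_microphone_input_and_speaker_output(
        use_default_devices=True
    )

    conversation = StreamingConversation(
        output_device=speaker_output,
        transcriber=WhisperCPPTranscriber(
            WhisperCPPTranscriberConfig.from_input_device(
                microphone_input,
                buffer_size_seconds=1,
                libname="/whisper.cpp/libwhisper.so",
                fname_model="/whisper.cpp/models/ggml-tiny.bin",
            )
        ),
        agent=LlamacppAgent(
            LlamacppAgentConfig(
                initial_message=BaseMessage(text="Hello!"),
                prompt_preamble="The AI is having a pleasant conversation about life.",
                llamacpp_kwargs={"model_path": "path/to/nous-hermes-13b.ggmlv3.q4_0.bin"},
                prompt_template="alpaca",
            )
        ),
        synthesizer=CoquiTTSSynthesizer(
            CoquiTTSSynthesizerConfig.from_output_device(
                speaker_output,
                tts_kwargs={"model_name": "tts_models/en/ljspeech/tacotron2-DDC_ph"},
            )
        ),
    )

    await conversation.start()
    print("Conversation started, press Ctrl+C to end")
    # Terminate the conversation on Ctrl+C
    signal.signal(
        signal.SIGINT, lambda _0, _1: asyncio.create_task(conversation.terminate())
    )
    # Stream microphone audio into the conversation until it ends
    while conversation.is_active():
        chunk = await microphone_input.get_audio()
        conversation.receive_audio(chunk)


if __name__ == "__main__":
    asyncio.run(main())
```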
Start the conversation by running:
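Assuming you saved the script above as, say, local_conversation.py:

```bash
python local_conversation.py
```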