How to quickly build a voice-activated chatbot to interact with Vertex AI models

pushpdeep_g
Staff

With the boom in artificial intelligence (AI), businesses are looking for a quick interface to validate their machine learning models, APIs, or data science workflows. Chatbots such as Google’s Bard are a popular application of large language models (LLMs), and as LLMs gain traction, teams need a fast way to put a conversational interface in front of them.

To make the conversation even more natural, businesses are now beginning to use voice-based chatbots, or voice bots. Voice bots have been on the rise for a couple of years because of the convenience they bring: it’s much easier to speak than to type. A voice-activated chatbot delivers a frictionless experience for the end customer.

In this article, we’ll learn how to quickly launch a bot application that understands not only keyboard input but also voice messages.

Considerations

  • The bot’s interface is built using the Gradio framework.
  • Automatic speech recognition (ASR), the conversion of speech to text, is handled by Google’s Speech-to-Text.
  • For this article, the bot converses in US English. However, the language code can be changed to match your locale.
  • The steps presented in this article are for a Linux platform.
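Localizing the bot only requires changing the BCP-47 language code that is later passed to the recognizer in main.py. The mapping below is purely illustrative (the label names are not part of any API); the codes themselves are standard Speech-to-Text language codes:

```python
# A small sample of BCP-47 language codes accepted by Speech-to-Text.
# See the Speech-to-Text language support page for the full list.
LANGUAGE_CODES = {
    "US English": "en-US",
    "British English": "en-GB",
    "Hindi (India)": "hi-IN",
    "Spanish (Spain)": "es-ES",
}

# This article uses US English; swap the key to match your locale.
language_code = LANGUAGE_CODES["US English"]
print(language_code)  # en-US
```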

Prerequisites

Before you can send a request to the Speech-to-Text API, you must have completed the following actions:

1. Enable Speech-to-Text on a Google Cloud project.

  • Make sure billing is enabled for Speech-to-Text.
  • Create and/or assign one or more service accounts to Speech-to-Text.
  • Download a service account credential key.

2. Set your authentication environment variable. 

Note: You can skip creating a service account if you plan to use default login/application credentials.
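For example, with a downloaded service account key, the authentication environment variable can be set as below. The key path is a placeholder; substitute the location of your own key file:

```shell
# Point the client library at your downloaded service account key.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/speech-sa-key.json"

# Alternatively, use Application Default Credentials and skip the key file:
# gcloud auth application-default login
```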

Install libraries

 

sudo apt-get install python3-pip python3-dev
sudo apt-get install ffmpeg

pip install gradio==3.38.0 --use-deprecated=legacy-resolver
pip install --upgrade google-cloud-speech==2.21.0
pip install torch

# [If you encounter a disk-space issue on your VM, create a temp folder (/home/user/tmp) and use it as the pip cache, as shown below]
pip install --cache-dir=/home/user/tmp torch

 

Code sample

config.py to store static values: 

 

bot = {
    "banner": """<h1 align="left" style="min-width:200px; margin-top:0;"> Chat with Expert Advisor </h1>""",
    "title": "Expert Advisor",
    "initial_message": "Hi, I'm your expert advisor. How may I help you today?",
    "temp_response": "Apologies, I'm not ready yet :(",
    "text_placeholder": "Enter Text"
}

 

main.py, the entry point for the application: 

 

import time
import gradio as gr
import config as cfg
from google.cloud import speech


def transcribe_file(speech_file: str) -> str:
    """Transcribe the given audio file and return its transcript."""
    text = ""
    client = speech.SpeechClient()

    with open(speech_file, "rb") as audio_file:
        content = audio_file.read()

    audio = speech.RecognitionAudio(content=content)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        language_code="en-US"
    )

    response = client.recognize(config=config, audio=audio)

    # Each result is for a consecutive portion of the audio. Iterate through
    # them to get the transcripts for the entire audio file.
    for result in response.results:
        # The first alternative is the most likely one for this portion.
        text += result.alternatives[0].transcript
        print(f"Transcript: {text}")

    return text

def add_user_input(history, text):
    """Add user input to chat history."""
    history = history + [(text, None)]
    return history, gr.update(value="", interactive=False)

def bot_response(history):
    """Returns updated chat history with the Bot response."""

    # Integrate with ML models to load the response.
    response = cfg.bot["temp_response"]
    history[-1][1] = response
    time.sleep(2)
    return history

with gr.Blocks() as bot_interface:
    with gr.Row():
        gr.HTML(cfg.bot["banner"])
    
    with gr.Row(scale=1):
        chatbot=gr.Chatbot([(cfg.bot["initial_message"], None)], elem_id="chatbot").style(height=750)
    with gr.Row(scale=1):
        with gr.Column(scale=12):
            user_input = gr.Textbox(
                show_label=False, placeholder=cfg.bot["text_placeholder"],
            ).style(container=False)
        with gr.Column(min_width=70, scale=1):
            submitBtn = gr.Button("Send")
    with gr.Row(scale=1):
        audio_input=gr.Audio(source="microphone", type="filepath")

    
    input_msg = user_input.submit(
        add_user_input, [chatbot, user_input], [chatbot, user_input], queue=False
    ).then(bot_response, chatbot, chatbot)
    input_msg.then(lambda: gr.update(interactive=True), None, [user_input], queue=False)

    submit_event = submitBtn.click(
        add_user_input, [chatbot, user_input], [chatbot, user_input], queue=False
    ).then(bot_response, chatbot, chatbot)
    submit_event.then(lambda: gr.update(interactive=True), None, [user_input], queue=False)

    inputs_event = audio_input.stop_recording(
        transcribe_file, audio_input, user_input
    ).then(
        add_user_input, [chatbot, user_input], [chatbot, user_input], queue=False
    ).then(bot_response, chatbot, chatbot)
    inputs_event.then(lambda: gr.update(interactive=True), None, [user_input], queue=False)

bot_interface.title = cfg.bot["title"]
bot_interface.launch(share=True)
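To see how the two callbacks cooperate, here is the chat-history shape that Gradio's Chatbot component hands to them, stepped through in plain Python (the sample user message is made up for illustration):

```python
# Each chat turn is a [user_message, bot_message] pair; None marks a pending
# reply. (Gradio passes the Chatbot value to callbacks as two-item lists.)
history = [["Hi, I'm your expert advisor. How may I help you today?", None]]

# add_user_input appends the new user turn with no bot reply yet.
history = history + [["What services do you offer?", None]]

# bot_response then fills in the reply slot of the last turn.
history[-1][1] = "Apologies, I'm not ready yet :("
print(history[-1])
```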

 

If you want to expose the bot externally, change the launch call as follows: 

 

bot_interface.launch(
    server_name="0.0.0.0",
    share=False,
    ssl_certfile="localhost.crt",
    ssl_keyfile="localhost.key",
    ssl_verify=False,
)
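The localhost.crt/localhost.key pair referenced above can be a self-signed certificate for testing. One way to generate it, assuming openssl is installed:

```shell
# Generate a throwaway self-signed certificate valid for one year.
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout localhost.key -out localhost.crt \
  -days 365 -subj "/CN=localhost"
```

Because ssl_verify=False is set, Gradio will accept this self-signed pair; browsers will still show a security warning you must click through.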

 

Launch

 

python3 main.py
or 
gradio main.py

 

Below is how the bot should appear in your browser. You can start chatting using the keyboard or microphone. Feel free to integrate the bot with your machine learning models to get desired answers from the bot.

[Screenshot: the Expert Advisor bot in the browser, showing the chat window, the text box with a Send button, and the microphone recorder]

Congratulations! You have successfully launched a voice-based chatbot. 
