Build Your Own SIRI with LLAMA-3, Like a Pro! 🧙♂️ 🪄
TL;DR✨
In this easy-to-follow tutorial, you will learn how to build your own Siri-style voice assistant using the LLAMA-3 AI model. 😎
What you will learn: 👀
- How to set up TTS in a Python project using OpenAI TTS / Pyttsx3 / gTTS.
- How to generate chat responses with Groq and the LLAMA-3 model.
- How to capture webcam images and process them with Google Generative AI.
- How to automate all the manual steps with a shell script.
Let's make some magic together! 😵💫
Setting Up the Environment 🛠️
Create a folder to hold all of the project's source code:
mkdir siri-voice-llama3
cd siri-voice-llama3
Create a few new subfolders where we will store the source code, shell scripts, logs, and chat history:
mkdir -p src logs src/scripts data/ai_response data/chat_history
With the initial folder structure in place, it's time to create a new virtual environment and install all the modules we will use in the project.
Run the following commands to create and activate a new virtual environment in the project root:
python3 -m venv .venv
source .venv/bin/activate # If you are using the fish shell, source activate.fish instead
Run this command to install all the necessary modules we will use in the project:
pip3 install SpeechRecognition opencv-python openai google-generativeai gTTS pyttsx3 groq faster-whisper numpy python-dotenv pyperclip pydub PyAudio pillow
⚠️ Note: Installing the packages this way may cause problems if they change in the future. For the exact versions, find my requirements.txt file here. Copy its contents into a requirements.txt file in the project root. To install from it, run:
pip3 install -r requirements.txt
Here is what each module is used for:
- SpeechRecognition: enables speech recognition from audio files or streams.
- opencv-python: used to process webcam images.
- groq: a client library for Groq, used to generate responses with LLAMA-3.
- google-generativeai: used for image processing to provide context.
- faster-whisper: a faster implementation of Whisper speech recognition.
- python-dotenv: reads key-value pairs from a .env file.
- pyperclip: convenient clipboard operations (copy and paste).
- pydub: handles audio processing tasks.
- PyAudio: manages audio input/output.
- numpy: supports numerical computation and efficient array handling.
- pillow: a fork of the Python Imaging Library (PIL) for image processing.
Optional modules:
ℹ️ You only need one of these. A quick import sanity check follows the list.
- openai: text-to-speech using OpenAI's streaming audio.
- gTTS: the Google Text-to-Speech library, for generating speech from text.
- pyttsx3: a Python text-to-speech library for offline speech synthesis.
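Optionally, before moving on, you can run a small sanity check from a throwaway script to confirm that everything installed cleanly. This is a minimal sketch; the strings below are the import names the packages above expose, so trim the list if you skipped any optional module:
# Verify that the key modules import cleanly in the virtual environment.
import importlib

modules = [
    "speech_recognition",  # SpeechRecognition
    "cv2",                 # opencv-python
    "groq",
    "google.generativeai",
    "faster_whisper",
    "dotenv",              # python-dotenv
    "pyperclip",
    "pydub",
    "pyaudio",             # PyAudio
    "numpy",
    "PIL",                 # pillow
]

for module_name in modules:
    try:
        importlib.import_module(module_name)
        print(f"OK: {module_name}")
    except ImportError as error:
        print(f"MISSING: {module_name} ({error})")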
Let's Start Coding 💻
Setting Up Chat History Support 📋
💡 We will log each day's chat history to its own log file.
Inside the src directory, add a file named utils.py with the code below.
In this file, we will store all the helper functions we need throughout the program.
# 👇 siri-voice-llama3/src/utils.py
import os
import sys
from datetime import datetime
from pathlib import Path
from typing import Literal, NoReturn, Optional
import pyperclip
from PIL import ImageGrab
# utils imports itself so helpers can be called as utils.<name> from within this module.
import utils
def get_log_file_for_today(project_root_folder_path: Path) -> Path:
"""
Retrieves the log file path for today's date, ensuring that the necessary
directories are created. If the log file for the current day does not exist,
it creates an empty log file.
Args:
project_root_folder_path (Path): The root folder of the project, where the 'data'
directory resides.
Returns:
Path: The absolute path to the log file for today's date.
"""
today = datetime.today()
    # With this format, the year is always 4 digits and the month and day are always 2 digits.
year = today.strftime("%Y")
month = today.strftime("%m")
day = today.strftime("%d")
base_folder = os.path.join(
project_root_folder_path, "data", "chat_history", year, month
)
os.makedirs(base_folder, exist_ok=True)
chat_log_file = os.path.join(base_folder, f"{day}.log")
Path(chat_log_file).touch(exist_ok=True)
return Path(os.path.abspath(chat_log_file))
def log_chat_message(
log_file_path: Path,
user_message: Optional[str] = None,
ai_message: Optional[str] = None,
) -> None:
"""
Logs user and assistant chat messages to the provided log file, along with
a timestamp. Either the user message or the assistant message (or both) can
be provided.
Args:
log_file_path (Path): The absolute path to the log file where messages will be logged.
user_message (Optional[str]): The message sent by the user. Defaults to None.
ai_message (Optional[str]): The message generated by the assistant. Defaults to None.
Returns:
None: This function appends the messages to the log file in a readable format
with a timestamp. It does not return anything.
"""
    # If neither message is given, return.
if not user_message and not ai_message:
return
timestamp = datetime.now().strftime("[%H : %M : %S]")
with open(log_file_path, "a") as log_file:
if user_message:
            log_file.write(f"{timestamp} - USER: {user_message}\n")
if ai_message:
log_file.write(f"{timestamp} - ASSISTANT: {ai_message}\n")
log_file.write("\n")
The get_log_file_for_today function takes the path to the project root folder, which is usually where our main.py file lives.
It builds the path to today's log file, stored at data/chat_history/{year}/{month}/{day}.log. If the file does not exist, it creates an empty one and returns the path. If the file already exists, it simply returns the existing path.
The log_chat_message function takes the path to the log file, a user message, and an AI message, and logs whichever messages it receives with a timestamp.
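To see the two helpers in action, here is a quick throwaway test. It is a minimal sketch that assumes you run it from inside the src directory so that utils is importable:
# Minimal smoke test for the logging helpers (run from inside src/).
from pathlib import Path

import utils

project_root = Path.cwd().parent  # assumes the current directory is src/
log_file = utils.get_log_file_for_today(project_root_folder_path=project_root)
utils.log_chat_message(log_file, user_message="Hello there!")
utils.log_chat_message(log_file, ai_message="Hi! How can I help you today?")
print(log_file.read_text())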
API Key Configuration 🔑
For this project, we need a few API keys: a Groq key, a Google Generative AI key, and, optionally, an OpenAI key.
Create a new .env file in the project root and populate it with your API keys.
# Required
GROQ_API_KEY=
GOOGLE_GENERATIVE_AI_API_KEY=
# Optional
OPENAI_API_KEY=
With the .env file populated with your API keys, they can now be accessed from the Python code.
Inside the src directory, create a new file named setup.py with the following code:
# 👇 siri-voice-llama3/src/setup.py
import os
from dotenv import load_dotenv
import utils
def get_credentials() -> tuple[str, str, str | None]:
"""
Load API keys from environment variables and return them as a tuple.
This function loads environment variables from a `.env` file using `dotenv`.
It retrieves the Groq API key, Google Generative AI API key, and OpenAI API key.
If any of the keys are missing, it exits the program with an error message.
Returns:
tuple[str, str, str | None]: A tuple containing the Groq API key, Google Generative AI API key,
and OpenAI API key.
Raises:
SystemExit: If any of the required API keys are not found, the program exits with an error message.
"""
load_dotenv()
groq_api_key: str | None = os.getenv("GROQ_API_KEY")
google_gen_ai_api_key: str | None = os.getenv("GOOGLE_GENERATIVE_AI_API_KEY")
openai_api_key: str | None = os.getenv("OPENAI_API_KEY")
if groq_api_key is None or google_gen_ai_api_key is None:
return utils.exit_program(
status_code=1,
message="Missing required API key(s). Make sure to set them in `.env` file. If you are using the OpenAI approach, then populate the OpenAI api key as well.",
)
return groq_api_key, google_gen_ai_api_key, openai_api_key
The get_credentials function loads the API keys from environment variables using the dotenv library and returns them as a tuple.
If the Groq or Google API key is missing, the function exits the program with an error message prompting the user to set the required keys in the .env file. The OpenAI key is returned as optional: if it is not set, it is None.
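As a quick check, you can call the function from a throwaway script inside the src directory. This assumes your .env file is already populated; the OpenAI key may legitimately be None:
# Confirm the keys load from .env without printing the secrets themselves.
import setup

groq_api_key, google_gen_ai_api_key, openai_api_key = setup.get_credentials()
print("Groq key loaded:", bool(groq_api_key))
print("Google Generative AI key loaded:", bool(google_gen_ai_api_key))
print("OpenAI key provided:", openai_api_key is not None)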
Defining Additional Helper Functions 👷
In the get_credentials function in setup.py above, we used utils.exit_program, but we haven't defined it yet.
Let's keep going and add a few more helper functions we will need throughout the project.
Inside the utils.py file in the src directory, add the following lines of code:
# 👇 siri-voice-llama3/src/utils.py
# Rest of the code...
def exit_program(status_code: int = 0, message: str = "") -> NoReturn:
"""
Exit the program with an optional error message.
Args:
status_code (int): The exit status code. Defaults to 0 (success).
message (str): An optional error message to display before exiting.
"""
if message:
print(f"ERROR: {message}\n")
sys.exit(status_code)
def get_path_to_folder(folder_type: Literal["webcam", "screenshot"]) -> Path:
"""
Get the path to the specified folder type (webcam or screenshot).
Args:
folder_type (Literal["webcam", "screenshot"]): The type of folder to retrieve the path for.
Returns:
Path: The path to the specified folder.
Raises:
ValueError: If the folder_type is not valid.
"""
base_path = Path(os.path.join(Path.home(), "Pictures", "llama3.1"))
folder_map = {
"screenshot": Path(os.path.join(base_path, "Screenshots")),
"webcam": Path(os.path.join(base_path, "Webcam")),
}
if folder_type not in folder_map:
raise ValueError(
f"ERROR: Invalid folder_type: {folder_type}. Expected 'webcam' or 'screenshot'."
)
return folder_map[folder_type]
As its name suggests, exit_program exits the program with the given status code and an optional error message. If a message is provided, it prints it before exiting.
The get_path_to_folder function builds and returns the folder path for the given folder type ("webcam" or "screenshot"). It joins the user's home directory with a predefined base path ("Pictures/llama3.1") and appends the corresponding folder name. We will use this function to store webcam captures and screenshots in their respective folders.
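For instance, on a typical Linux or macOS machine the mapping resolves as follows (a quick illustrative check, not part of the project code):
import utils

print(utils.get_path_to_folder(folder_type="webcam"))      # e.g. ~/Pictures/llama3.1/Webcam
print(utils.get_path_to_folder(folder_type="screenshot"))  # e.g. ~/Pictures/llama3.1/Screenshots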
Next, we will define a few more functions for capturing and deleting screenshots and for reading the clipboard text.
# 👇 siri-voice-llama3/src/utils.py
# Rest of the code...
def capture_screenshot() -> Path:
"""
Captures a screenshot and saves it to the designated folder.
Returns:
Path: The file path of the saved screenshot.
"""
screenshot_folder_path = utils.get_path_to_folder(folder_type="screenshot")
os.makedirs(screenshot_folder_path, exist_ok=True)
screen = ImageGrab.grab()
time_stamp = datetime.now().strftime("%Y_%m_%d_%H_%M_%S")
rgb_screenshot = screen.convert("RGB")
image_filename = f"screenshot_{time_stamp}.png"
image_file_path = Path(os.path.join(screenshot_folder_path, image_filename))
rgb_screenshot.save(image_file_path, quality=20)
return image_file_path
def remove_last_screenshot() -> None:
"""
Remove the most recent screenshot file from the designated screenshots folder.
The function checks if the folder exists and if there are any .png files. If
found, it deletes the most recently created screenshot.
"""
folder_path = utils.get_path_to_folder(folder_type="screenshot")
if not os.path.exists(folder_path):
return
files = [
file
for file in os.listdir(folder_path)
if os.path.isfile(os.path.join(folder_path, file)) and file.endswith(".png")
]
if not files:
return
most_recent_file = max(
files, key=lambda f: os.path.getctime(os.path.join(folder_path, f))
)
os.remove(os.path.join(folder_path, most_recent_file))
def get_clipboard_text() -> str:
"""
Retrieves the current text content from the system clipboard.
This function uses the `pyperclip` module to access the clipboard. If the clipboard
content is a valid string, it returns the content. If the content is not a string,
it returns an empty string.
Returns:
str: The text content from the clipboard, or an empty string if the content is
not a string or the clipboard is empty.
"""
clipboard_content = pyperclip.paste()
if isinstance(clipboard_content, str):
return clipboard_content
return ""
The capture_screenshot function uses the ImageGrab module to capture the current screen, saves it as a PNG file in the designated screenshots folder, and returns the full file path of the saved screenshot. 📸
It builds the filename from a timestamp to ensure uniqueness and saves the image with the quality set to 20. We lower the image quality so the image can be processed quickly later on.
The remove_last_screenshot function finds and deletes the most recently created screenshot in the designated folder. It first checks that the folder exists and contains .png files. If any are found, it determines the newest file by creation time and then deletes it. 🚮
The get_clipboard_text function uses the pyperclip module to access the clipboard. If the clipboard content is a valid string, it returns it; otherwise, it returns an empty string.
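Here is a small optional snippet to try the three helpers together; it takes a real screenshot and then deletes it again, so nothing is left behind:
# Capture a screenshot, remove it, and peek at the clipboard.
import utils

screenshot_path = utils.capture_screenshot()
print("Saved screenshot to:", screenshot_path)

utils.remove_last_screenshot()  # deletes the screenshot we just took
print("Clipboard preview:", utils.get_clipboard_text()[:80])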
Integrating Webcam Support 📸
To capture images from the webcam, we have to add support for it.
Inside the src directory, create a new file named webcam.py and add the following lines of code:
# 👇 siri-voice-llama3/src/webcam.py
import os
from datetime import datetime
from pathlib import Path
from typing import NoReturn, Union
import cv2
import utils
def get_available_webcam() -> cv2.VideoCapture | None:
"""
Checks for available webcams and returns the first one that is opened.
This function attempts to open the first 10 webcam indices. If a webcam is found
and successfully opened, it returns a VideoCapture object. If no webcams are found,
it exits the program with an error message.
Returns:
cv2.VideoCapture: The opened webcam object.
None: If no webcam is found, the program exits with an error message.
"""
# Assuming that we are checking the first 10 webcams.
for index in range(10):
web_cam = cv2.VideoCapture(index)
if web_cam.isOpened():
return web_cam
return utils.exit_program(status_code=1, message="No webcams found.")
def capture_webcam_image() -> Union[Path, NoReturn]:
"""
Captures an image from the available webcam and saves it to the specified folder.
This function first checks for an available webcam using `get_available_webcam`.
If a webcam is successfully opened, it creates a folder for saving the images if
it does not already exist, generates a timestamped filename, captures a frame,
and saves the image to the specified folder. The function then releases the webcam.
Returns:
Path: The file path of the saved image.
NoReturn: If there was an error capturing the image, the program exits with an error message.
"""
webcam = get_available_webcam()
if webcam is None or not webcam.isOpened():
return utils.exit_program(
status_code=1, message="There was an error capturing the image."
)
webcam_folder_path = utils.get_path_to_folder(folder_type="webcam")
os.makedirs(webcam_folder_path, exist_ok=True)
timestamp = datetime.now().strftime("%Y_%m_%d_%H_%M_%S")
image_filename = f"webcam_{timestamp}.png"
_, frame = webcam.read()
image_file_path_str = os.path.join(webcam_folder_path, image_filename)
cv2.imwrite(image_file_path_str, frame)
webcam.release()
return Path(image_file_path_str)
The get_available_webcam function checks for available webcams, assuming the first ten indices. If one opens successfully, it returns the corresponding cv2.VideoCapture object. 🎥 If no webcam is found, it exits the program with an error message.
The capture_webcam_image function captures an image from the available webcam. It first calls the get_available_webcam helper we wrote earlier to get an available webcam. If that succeeds, it creates the folder for saving images (if it doesn't already exist), generates a timestamped filename, captures a frame, and saves the image. Finally, it releases the webcam and returns the path of the saved image. 🖼️
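You can test the module on its own before wiring it into the assistant. This assumes a webcam is attached; otherwise, the program exits with the error message above:
# Grab a single frame and print where it was saved.
import webcam

image_path = webcam.capture_webcam_image()
print("Saved webcam capture to:", image_path)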
Implementing the Main Program Logic
Now that we have written all the utilities the project needs, let's get started on the main program logic.
Inside the src directory, create a new file named siri.py and add the following lines of code:
# 👇 siri-voice-llama3/src/siri.py
import os
import re
import time
from pathlib import Path
from typing import List
import google.generativeai as genai
import pyaudio
import pyttsx3
import speech_recognition as sr
from faster_whisper import WhisperModel
from groq import Groq
from groq.types.chat import ChatCompletionMessageParam
from gtts import gTTS
from openai import OpenAI
from PIL import Image
from pydub import AudioSegment
from pydub.playback import play
import utils
import webcam
class Siri:
"""
A multi-modal AI voice assistant that responds to user prompts
by processing voice commands and context from images or clipboard content.
"""
def __init__(
self,
log_file_path: Path,
project_root_folder_path: Path,
groq_api_key: str,
google_gen_ai_api_key: str,
openai_api_key: str | None,
) -> None:
"""
Initializes the Siri assistant with API clients for Groq, OpenAI, and Google Generative AI.
Args:
log_file_path (Path): Path to the log file.
project_root_folder_path (Path): Root folder of the project.
groq_api_key (str): API key for Groq.
google_gen_ai_api_key (str): API key for Google Generative AI.
openai_api_key (str): API key for OpenAI.
"""
self.log_file_path = log_file_path
self.project_root_folder_path = project_root_folder_path
self.pyttsx3_engine = pyttsx3.init()
self.groq_client = Groq(api_key=groq_api_key)
self.openai_client = OpenAI(api_key=openai_api_key)
# Configure Google Generative AI model
genai_generation_config = genai.GenerationConfig(
temperature=0.7, top_p=1, top_k=1, max_output_tokens=2048
)
genai.configure(api_key=google_gen_ai_api_key)
self.genai_model = genai.GenerativeModel(
"gemini-1.5-flash-latest",
generation_config=genai_generation_config,
safety_settings=[
{"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
{"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
{
"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
"threshold": "BLOCK_NONE",
},
{
"category": "HARM_CATEGORY_DANGEROUS_CONTENT",
"threshold": "BLOCK_NONE",
},
],
)
# Initialize conversation context for the AI
self.conversation: List[ChatCompletionMessageParam] = [
{
"role": "user",
"content": (
"You are a multi-modal AI voice assistant. Your user may have attached a photo (screenshot or webcam capture) "
"for context, which has already been processed into a detailed text prompt. This will be attached to their transcribed "
"voice input. Generate the most relevant and factual response by carefully considering all previously generated text "
"before adding new information. Do not expect or request additional images; use the provided context if available. "
"Please do not include newlines in your response. Keep it all in one paragraph. "
"Ensure your responses are clear, concise, and relevant to the ongoing conversation, avoiding any unnecessary verbosity."
),
}
]
total_cpu_cores = os.cpu_count() or 1
# Initialize the audio transcription model
self.audio_transcription_model = WhisperModel(
device="cpu",
compute_type="int8",
model_size_or_path="base",
            cpu_threads=max(1, total_cpu_cores // 2),
            num_workers=max(1, total_cpu_cores // 2),
)
# Initialize speech recognition components
self.speech_recognizer = sr.Recognizer()
self.mic_audio_source = sr.Microphone()
self.wake_word = "siri"
The Siri class sets up a multi-modal AI voice assistant that can handle not only voice commands but also images and clipboard content for added context.
It takes a few key parameters: log_file_path for logging the conversation, and project_root_folder_path, which is needed to store the AI response as an mp3 file when using gTTS. You also need API keys for Groq, Google Generative AI, and, optionally, OpenAI. 🤖
The class sets up clients for Groq, OpenAI, and Google Generative AI. For Google GenAI, it uses the gemini-1.5-flash model with a few safety settings relaxed.
A built-in initial conversation prompt guides the AI on how to respond to the user's voice commands and any processed images. 💬
For audio transcription, it uses the Faster Whisper model, which runs on the CPU with specific performance settings. It also sets up speech recognition with a Recognizer and a Microphone, and the assistant listens for the wake word "siri" before acting on a command.
With the initial configuration done, let's add some methods to the class. We will define methods for transcribing audio, extracting the user's prompt, listening for prompts, and generating chat responses with Groq. 🛠️
Add the following methods to the siri.py file inside the src directory.
# 👇 siri-voice-llama3/src/siri.py
# Rest of the code...
def transcribe_audio_to_text(self, audio_file_path: Path) -> str:
"""
Transcribes audio from a file to text.
Args:
audio_file_path (Path): Path to the audio file.
Returns:
str: The transcribed text from the audio.
"""
segments, _ = self.audio_transcription_model.transcribe(
audio=str(audio_file_path)
)
return "".join(segment.text for segment in segments)
def extract_prompt(self, transcribed_text: str) -> str | None:
"""
Extracts the user's prompt from the transcribed text after the wake word.
Args:
transcribed_text (str): The transcribed text from audio input.
Returns:
str | None: The extracted prompt if found, otherwise None.
"""
pattern = rf"\b{re.escape(self.wake_word)}[\s,.?!]*([A-Za-z0-9].*)"
regex_match = re.search(
pattern=pattern, string=transcribed_text, flags=re.IGNORECASE
)
if regex_match is None:
return None
return regex_match.group(1).strip()
def listen(self) -> None:
"""
Starts listening for the wake word and processes audio input in the background.
"""
with self.mic_audio_source as mic:
self.speech_recognizer.adjust_for_ambient_noise(source=mic, duration=2)
self.speech_recognizer.listen_in_background(
source=self.mic_audio_source, callback=self.handle_audio_processing
)
while True:
time.sleep(0.5)
def generate_chat_response_with_groq(
self, prompt: str, image_context: str | None
) -> str:
"""
Generates a response from the Groq model based on user input and optional image context.
Args:
prompt (str): The user's prompt.
image_context (str | None): Optional image context for the response.
Returns:
str: The generated response from the assistant.
"""
if image_context:
prompt = f"USER_PROMPT: {prompt}\n\nIMAGE_CONTEXT: {image_context}"
self.conversation.append({"role": "user", "content": prompt})
completion = self.groq_client.chat.completions.create(
messages=self.conversation, model="llama-3.1-8b-instant"
)
ai_response = completion.choices[0].message.content
self.conversation.append({"role": "assistant", "content": ai_response})
return ai_response or "Sorry, I'm not sure how to respond to that."
The transcribe_audio_to_text method takes the path to an audio file and transcribes its contents to text. It processes the audio file in segments using the WhisperModel and returns a single string that joins the transcribed text of all segments. 🎧
The extract_prompt method extracts the user's spoken prompt from the transcribed text, specifically whatever follows the wake word (e.g. "siri"). It uses a regular expression to find and capture the prompt after the wake word, returning the cleaned prompt, or None if no prompt is found. 🗣️
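To make the wake-word pattern concrete, here is a small standalone demo of the same regex; the sample transcript is made up:
import re

wake_word = "siri"
pattern = rf"\b{re.escape(wake_word)}[\s,.?!]*([A-Za-z0-9].*)"

transcript = "Hey Siri, what's the weather like today?"
match = re.search(pattern, transcript, flags=re.IGNORECASE)
print(match.group(1).strip() if match else None)
# Prints: what's the weather like today?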
The listen method continuously listens for the wake word and processes audio input. It first adjusts for ambient noise, then starts listening in the background with a callback (handle_audio_processing). The method then enters an infinite loop, pausing briefly on each iteration so it keeps listening. 🔄
The generate_chat_response_with_groq method uses the Groq model to generate a response based on the user's prompt and optional image context. It formats the prompt with the image context (when available), appends it to the conversation, sends the conversation to the model, and appends the AI's response to the ongoing conversation. It then returns the generated response, or a default message if nothing was generated. 💬
Text-to-Speech Generation 🗣️
For text-to-speech generation, we will implement three different approaches: pyttsx3, OpenAI, and gTTS (Google Text-to-Speech). Feel free to pick whichever one fits your needs.
- The Pyttsx3 Approach
Here, for text-to-speech generation, we will use the well-known Python module Pyttsx3.
In the siri.py file, add the following method.
# 👇 siri-voice-llama3/src/siri.py
# Rest of the code...
# Pyttsx3 Approach (Weaker Audio Quality)
def text_to_speech(self, text: str) -> None:
"""
Converts text to speech using Pyttsx3's text-to-speech API.
Args:
text (str): The text to convert to speech.
"""
self.pyttsx3_engine.setProperty("volume", 1.0)
self.pyttsx3_engine.setProperty("rate", 125)
voices = self.pyttsx3_engine.getProperty("voices")
        # Pick a voice; indices vary by system (index 0 is often male, index 1 female).
        self.pyttsx3_engine.setProperty("voice", voices[0].id)
self.pyttsx3_engine.say(text)
self.pyttsx3_engine.runAndWait()
self.pyttsx3_engine.stop()
This text_to_speech method uses the pyttsx3_engine we initialized inside the Siri class, sets a few engine properties, and finally speaks the text we pass to it.
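Note that which voice lives at which index differs from system to system. If you want to pick a specific voice, this short optional snippet lists what pyttsx3 can see on your machine:
# List the voices available to pyttsx3 so you can choose an index.
import pyttsx3

engine = pyttsx3.init()
for index, voice in enumerate(engine.getProperty("voices")):
    print(index, voice.id, getattr(voice, "name", ""))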
- The OpenAI Approach
For this approach, we will use OpenAI's streaming audio. Overall, it offers the best experience of the three, but it requires setting up the OpenAI API and having some OpenAI credits on your account.
In the siri.py file, add the following method.
# 👇 siri-voice-llama3/src/siri.py
# Rest of the code...
# OpenAI Approach (Best Quality Audio with multiple voice available).
def text_to_speech(self, text: str) -> None:
"""
Converts text to speech using OpenAI's text-to-speech API.
Args:
text (str): The text to convert to speech.
"""
stream = pyaudio.PyAudio().open(
format=pyaudio.paInt16, channels=1, rate=24000, output=True
)
stream_start = False
with self.openai_client.audio.speech.with_streaming_response.create(
model="tts-1", voice="nova", response_format="pcm", input=text
) as openai_response:
silence_threshold = 0.1
for chunk in openai_response.iter_bytes(chunk_size=1024):
if stream_start:
stream.write(chunk)
elif max(chunk) > silence_threshold:
stream.write(chunk)
stream_start = True
This text_to_speech method converts text to speech using OpenAI's text-to-speech (TTS) API.
It first opens an audio stream with PyAudio, configured to output audio at a 24,000 Hz sample rate with 16-bit resolution. The method then calls the OpenAI TTS API with the provided text, specifying the model ("tts-1"), the voice ("nova"), and the response format ("pcm"). The audio data is streamed in real time. 🚀
You can change the voice to your liking. For the list of available options, visit here.
In the loop, the method inspects the audio chunks returned by the OpenAI API. The stream starts playing chunks only once the audio exceeds a silence threshold, ensuring the text is spoken only when meaningful sound is detected. This prevents the stream from starting with silence. 🔊
- The gTTS Approach
For this approach, we will use the Google Text-to-Speech engine.
This approach is rather slow, and it requires saving the AI's response as an mp3 file and then playing that audio file.
In the siri.py file, add the following method.
# 👇 siri-voice-llama3/src/siri.py
# Rest of the code...
def text_to_speech(self, text: str) -> None:
"""
Converts text to speech using Google's text-to-speech API.
Args:
text (str): The text to convert to speech.
"""
tts = gTTS(text=text, lang="en", slow=False)
response_folder_path = Path(
os.path.abspath(
os.path.join(self.project_root_folder_path, "data", "ai_response")
)
)
os.makedirs(response_folder_path, exist_ok=True)
response_audio_file_path = Path(
os.path.join(response_folder_path, "ai_response_audio.mp3")
)
tts.save(response_audio_file_path)
response_audio = AudioSegment.from_mp3(response_audio_file_path)
play(response_audio)
# After the audio is played, delete the audio file.
if os.path.exists(response_audio_file_path):
os.remove(response_audio_file_path)
This text_to_speech method converts text to speech using Google's TTS API. It first generates speech from the given English text, with slow=False for normal playback speed. The method then builds a folder path in the data/ai_response directory to store the response audio file. After making sure the directory exists, it saves the speech as an mp3 file.
Once the mp3 file is saved, the method loads the audio with AudioSegment and plays it. After playback, the method deletes the mp3 file to clean up.
Now that we have covered the text_to_speech methods as well, we need a few more: one to analyze image prompts, one to select the relevant assistant action when the user attaches image context to a prompt, and one to process the audio and take the corresponding action.
Add the following code to the siri.py file in the src directory.
# 👇 siri-voice-llama3/src/siri.py
# Rest of the code...
def analyze_image_prompt(self, prompt: str, image_path: Path) -> str:
"""
Analyzes an image based on the user prompt to extract semantic information.
Args:
prompt (str): The user's prompt related to the image.
image_path (Path): The path to the image file.
Returns:
str: The analysis result from the image based on the prompt.
"""
image = Image.open(image_path)
prompt = (
"You are an image analysis AI tasked with extracting semantic meaning from images to assist another AI in "
"generating a user response. Your role is to analyze the image based on the user's prompt and provide all relevant, "
"objective data without directly responding to the user. Focus solely on interpreting the image in the context of "
f"the user’s request and relay that information for further processing. \nUSER_PROMPT: {prompt}"
)
genai_response = self.genai_model.generate_content([prompt, image])
return genai_response.text
def select_assistant_action(self, prompt: str) -> str:
"""
Determines the appropriate action for the assistant to take based on user input.
Args:
prompt (str): The user's prompt.
Returns:
str: The selected action for the assistant.
"""
system_prompt_message = (
"You are an AI model tasked with selecting the most appropriate action for a voice assistant. Based on the user's prompt, "
"choose one of the following actions: ['extract clipboard', 'take screenshot', 'delete screenshot', 'capture webcam', 'generic']. "
"Assume the webcam is a standard laptop webcam facing the user. Provide only the action without explanations or additional text. "
"Respond strictly with the most suitable option from the list."
)
function_conversation: List[ChatCompletionMessageParam] = [
{"role": "system", "content": system_prompt_message},
{"role": "user", "content": prompt},
]
completion = self.groq_client.chat.completions.create(
messages=function_conversation, model="llama-3.1-8b-instant"
)
ai_response = completion.choices[0].message.content
return ai_response or "generic"
def handle_audio_processing(self, recognizer: sr.Recognizer, audio: sr.AudioData):
"""
Callback function to process audio input once recognized.
Args:
recognizer (sr.Recognizer): The speech recognizer instance.
audio (sr.AudioData): The audio data captured by the microphone.
"""
data_folder_path = Path(os.path.abspath(os.path.join(".", "data")))
os.makedirs(data_folder_path, exist_ok=True)
audio_prompt_file_path = Path(
os.path.abspath(os.path.join(data_folder_path, "user_audio_prompt.wav"))
)
with open(audio_prompt_file_path, "wb") as f:
f.write(audio.get_wav_data())
transcribed_text = self.transcribe_audio_to_text(
audio_file_path=audio_prompt_file_path
)
parsed_prompt = self.extract_prompt(transcribed_text=transcribed_text)
if parsed_prompt:
utils.log_chat_message(
log_file_path=self.log_file_path, user_message=parsed_prompt
)
skip_response = False
selected_assistant_action = self.select_assistant_action(
prompt=parsed_prompt
)
if "capture webcam" in selected_assistant_action:
image_path = webcam.capture_webcam_image()
image_analysis_result = self.analyze_image_prompt(
prompt=parsed_prompt, image_path=image_path
)
elif "take screenshot" in selected_assistant_action:
image_path = utils.capture_screenshot()
image_analysis_result = self.analyze_image_prompt(
prompt=parsed_prompt, image_path=image_path
)
elif "delete screenshot" in selected_assistant_action:
utils.remove_last_screenshot()
image_analysis_result = None
ai_response = "Screenshot deleted successfully."
self.text_to_speech(text=ai_response)
utils.log_chat_message(
log_file_path=self.log_file_path, ai_message=ai_response
)
skip_response = True
elif "extract clipboard" in selected_assistant_action:
clipboard_content = utils.get_clipboard_text()
parsed_prompt = (
f"{parsed_prompt}\n\nCLIPBOARD_CONTENT: {clipboard_content}"
)
image_analysis_result = None
else:
image_analysis_result = None
# If the response is not supposed to be skipped, then generate the response and speak it out.
if not skip_response:
response = self.generate_chat_response_with_groq(
prompt=parsed_prompt, image_context=image_analysis_result
)
utils.log_chat_message(
log_file_path=self.log_file_path, ai_message=response
)
self.text_to_speech(text=response)
# Remove the user prompt audio after the response is generated.
if os.path.exists(audio_prompt_file_path):
os.remove(audio_prompt_file_path)
The analyze_image_prompt method analyzes an image based on the user's prompt to extract semantic information. First, it opens the given image file with the PIL library. The method then builds a prompt instructing the image-analysis AI to focus on extracting relevant data from the image without responding to the user directly.
The method sends the constructed prompt and the image to the Google Generative AI model for processing. Finally, it returns the image analysis as text. 📝
The select_assistant_action method determines the appropriate assistant action based on the user's input. 🤔 It first creates a system prompt instructing the AI model to choose from a predefined list of actions: 'extract clipboard', 'take screenshot', 'delete screenshot', 'capture webcam', or 'generic'.
Next, the method builds a conversation list containing the system prompt and the user's prompt. It then sends this conversation to the Groq client to generate a response with the specified model.
The response should be one of the actions from the predefined list.
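As a rough illustration, here is how a few prompts might map to actions. This is a hypothetical sketch: assistant stands for an already initialized Siri instance, and the actual choice always depends on the model:
# Hypothetical probe of the action classifier; outputs depend on the LLM.
prompts = [
    "what's written on my screen right now?",  # likely 'take screenshot'
    "what am I holding in my hand?",           # likely 'capture webcam'
    "summarize the text I just copied",        # likely 'extract clipboard'
    "tell me a joke",                          # likely 'generic'
]
for prompt in prompts:
    print(prompt, "->", assistant.select_assistant_action(prompt=prompt))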
The handle_audio_processing method processes the audio input once the assistant recognizes it. First, it saves the captured audio as a .wav file in the data folder. It then transcribes the audio to text with transcribe_audio_to_text and extracts the user's prompt from the text with extract_prompt.
If a prompt is found, it logs the user's message and determines the appropriate assistant action with select_assistant_action. Depending on the action, it may capture a webcam image, take a screenshot, delete a screenshot, or extract the clipboard content. For the image-based actions, it analyzes the image with analyze_image_prompt. 🔍
The skip_response variable controls whether the assistant should skip generating and speaking a response after certain actions. It starts out as False, meaning a response is expected.
For example, when the action is "delete screenshot", the method deletes the screenshot and speaks a predefined response ("Screenshot deleted successfully.") directly via text-to-speech. In that case, skip_response is set to True so the assistant doesn't generate a separate response to the user's prompt, since the action itself is enough. ✅
For the other actions, it generates a response with generate_chat_response_with_groq and converts it to speech. Once the response has been generated, the method deletes the user's prompt audio file. 🚮
Writing the main.py File
This will be the entry point of our program. It performs the setup and initialization the assistant needs to run.
Create a new file named main.py in the project root and add the following lines of code:
# 👇 siri-voice-llama3/main.py
import os
import sys
from pathlib import Path
# Add the src directory to the module search path
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), "src"))
from src import setup, siri, utils
"""
Main entry point for the AI llama3 siri voice assistant.
This script loads the necessary API credentials from environment variables,
initializes the Siri assistant with the provided keys, and starts listening
for user input. The program will exit if any of the required API keys are
missing.
To run the application, execute this script in an environment where the
`.env` file is properly configured with the required API keys.
"""
if __name__ == "__main__":
# Determine the current directory of the script
project_root_folder_path = Path(os.path.dirname(os.path.abspath(__file__)))
chat_log_file_path = utils.get_log_file_for_today(
project_root_folder_path=project_root_folder_path
)
all_api_keys = setup.get_credentials()
groq_api_key, google_gen_ai_api_key, openai_api_key = all_api_keys
siri = siri.Siri(
log_file_path=chat_log_file_path,
project_root_folder_path=project_root_folder_path,
groq_api_key=groq_api_key,
google_gen_ai_api_key=google_gen_ai_api_key,
openai_api_key=openai_api_key,
)
siri.listen()
👀 Note that we insert the path to our src directory with sys.path.insert() to make sure Python can locate and import modules from the src directory.
The main block first determines the project root folder and then gets today's log file path for recording chat messages with utils.get_log_file_for_today.
Next, we retrieve the API keys (for Groq, Google Generative AI, and OpenAI) with the setup.get_credentials function we wrote earlier alongside the other helper functions.
Then we create an instance of the Siri class, passing the log file path, the project root folder path, and the API keys.
Finally, we call the siri.listen method, which starts the assistant and listens for user input.
By now, you should have a working version of your own voice assistant. 🥂
Optional: Building a Shell Script
🤔 Why write a shell script at all?
Honestly, you don't need to. I wrote this shell script thinking I could run it automatically on system restart through a scheduler like a Linux service or a cron job. However, because it needs access to hardware components (like the microphone), I couldn't get that to work, so it never actually spoke the responses (if you find a fix, please let me know). Still, the script really comes in handy if you want to automate all the manual steps: creating the virtual environment, installing the dependencies, and finally running the program. You can also add the script to your PATH via a symlink and run it from anywhere on your system. 😉
Create a new file named start_siri_llama3.sh inside the src/scripts directory with the following lines of code:
💁 If you use the fish shell, you can find the same code in fish syntax here. Create a new file named start_siri_llama3.fish in the src/scripts directory and add the code from the link.
# 👇 siri-voice-llama3/src/scripts/start_siri_llama3.sh
#!/usr/bin/env bash
# Using this above way of writing shebang can have some security concerns.
# See this stackoverflow thread: https://stackoverflow.com/a/72332845
# Since, I want this script to be portable for most of the users, instead of hardcoding like '#!/usr/bin/bash', I am using this way.
# Determine the script directory, virtual environment directory, and log file
SCRIPT_DIR="$(dirname "$(realpath "$0")")"
VENV_DIR="$(realpath "$SCRIPT_DIR/../../.venv")"
LOG_FILE="$(realpath "$SCRIPT_DIR/../../logs/shell-error-bash.log")"
REQUIREMENTS_FILE_PATH="$(realpath "$SCRIPT_DIR/../../requirements.txt")"
# Define the error messages after the path variables above so that $VENV_DIR,
# $SCRIPT_DIR, and $SHELL expand to their actual values inside the strings.
ERROR_USAGE="ERROR: Usage: $0 {path_to_main.py}"
ERROR_FILE_NOT_FOUND="ERROR: The main.py file does not exist or is not a valid file."
ERROR_PYTHON_NOT_FOUND="ERROR: No suitable Python executable found."
ERROR_BASH_NOT_INSTALLED="ERROR: Bash shell is not installed. Please install Bash."
ERROR_ACTIVATE_NOT_FOUND="ERROR: activate file not found in '$VENV_DIR/bin'"
ERROR_UNSUPPORTED_SHELL="ERROR: Unsupported shell: '$SHELL'"
ERROR_REQUIREMENTS_NOT_FOUND="ERROR: requirements.txt file not found in '$SCRIPT_DIR'"
log_and_exit() {
local message="$1"
    echo "[$(date +"%Y-%m-%d %H:%M:%S")] $message" | tee -a "$LOG_FILE"
exit 1
}
# Check if the main.py file is provided as an argument
if [ $# -ne 1 ]; then
log_and_exit "$ERROR_USAGE"
fi
# Function to check if a file exists and has the correct extension
check_file() {
local file_path="$1"
local expected_extension="$2"
if [ ! -f "$file_path" ]; then
log_and_exit "$ERROR_FILE_NOT_FOUND"
fi
if ! [[ "$file_path" == *".$expected_extension" ]]; then
log_and_exit "The file '$file_path' must be a '.$expected_extension' file."
fi
}
# Validate the provided main.py file
check_file "$1" "py"
# Extract and validate arguments
MAIN_FILE_PATH="$(realpath "$1")"
# Find the appropriate Python executable
PYTHON_EXEC="$(command -v python3 || command -v python)"
# Ensure that the Python executable is available before creating the virtual environment
if [ ! -d "$VENV_DIR" ]; then
if [ -z "$PYTHON_EXEC" ]; then
log_and_exit "$ERROR_PYTHON_NOT_FOUND"
fi
"$PYTHON_EXEC" -m venv "$VENV_DIR"
# Activate the virtual environment after creating it
if [ -f "$VENV_DIR/bin/activate" ]; then
source "$VENV_DIR/bin/activate"
else
log_and_exit "$ERROR_ACTIVATE_NOT_FOUND"
fi
    PIP_EXEC_VENV="$(command -v pip3 || command -v pip)"
# Check if requirements.txt exists and install dependencies
if [ -f "$REQUIREMENTS_FILE_PATH" ]; then
"$PIP_EXEC_VENV" install -r "$REQUIREMENTS_FILE_PATH"
else
log_and_exit "$ERROR_REQUIREMENTS_NOT_FOUND"
fi
fi
# Ensure that the Bash shell is installed.
if ! command -v bash &> /dev/null; then
log_and_exit "$ERROR_BASH_NOT_INSTALLED"
fi
# Activate the virtual environment based on the shell type
if [[ "$SHELL" == *"/bash" ]]; then
# Check if the activate file exists before sourcing it
if [ -f "$VENV_DIR/bin/activate" ]; then
source "$VENV_DIR/bin/activate"
else
log_and_exit "$ERROR_ACTIVATE_NOT_FOUND"
fi
else
log_and_exit "$ERROR_UNSUPPORTED_SHELL"
fi
# Set the python executable to the one from the virtual environment
PYTHON_EXEC="$(command -v python3 || command -v python)"
# Run the main.py file
"$PYTHON_EXEC" "$MAIN_FILE_PATH"
This script automates the setup and execution of the Python program, making sure the necessary environment is ready before main.py runs. ⚙️
First, it checks that the argument passed in is a valid Python file (main.py). If not, it logs an error and exits. It also verifies that the file exists and has the correct extension (.py). 🐍
The script then searches for a Python executable (python3 or python) and, if the virtual environment (venv) does not exist, creates one with Python's venv module. Once the venv is created, it activates it and installs the dependencies from requirements.txt if the file is found. The script makes sure both Python and Bash are installed on the system, as it only supports the Bash shell.
If the user's shell is not Bash, it logs an error and exits. Otherwise, it activates the virtual environment and runs the Python script (main.py) with the Python executable found inside it.
Now, to be able to run this script from anywhere on your system, you can add it to your PATH with a symlink. 🔗
Typically, /usr/local/bin is where custom scripts go. First, make sure it is in your PATH by running:
echo $PATH
If /usr/local/bin is in your PATH, you can then add this script as a symlink with the following command:
ln -s {absolute_path_to_script_sh/fish} /usr/local/bin/start_siri_llama3
After running this command, you should be able to run the program from anywhere on your system. 🎉
Conclusion
Whew! 😮💨 We've gotten quite a lot done together! If you've made it this far, give yourself a well-deserved pat on the back. You have now built your own personal SIRI voice assistant with the LLAMA-3 AI model.
The complete source code for this article can be found here:
https://github.com/shricodev/siri-voice-llama3.git
Thank you so much for reading! 🎉 🫡
Share your thoughts in the comments section below. 👇
Source: https://dev.to/shricodev/build-your-personal-siri-with-llama-3-like-a-pro-5h1o