Build a voice assistant with OpenAI Whisper and TTS (text to speech) in 5 minutes

  • Published: Jan 24, 2025

Comments • 67

  • @TestTalk
    @TestTalk a year ago +6

    My word, I can't tell you how much I now look forward to your videos! Keep up the great work!

    • @ralfelfving
      @ralfelfving  a year ago +1

      @TestTalk thank you so much for the kind words, hopefully many more coming over the weekend and in the months ahead :)

    • @TestTalk
      @TestTalk a year ago

      Windows user here, I'm not sure if you mentioned it in your article, but I had to download SoX and then edit the environment variables. Not sure if that helps or not, but figured I would share and help the YT algorithm for you. @@ralfelfving

  • @mahtabalam9604
    @mahtabalam9604 a year ago +3

    Immense value bro thanks for the informative videos!

    • @ralfelfving
      @ralfelfving  a year ago

      Glad it helps, thanks for the comment! ♥️

  • @JoJoAcrylicArtwork
    @JoJoAcrylicArtwork a year ago +3

    fantastic! thanks so much for sharing, this is exactly what I was looking to do

    • @ralfelfving
      @ralfelfving  a year ago +1

      Great, that's what my tutorials are for! :)

    • @JoJoAcrylicArtwork
      @JoJoAcrylicArtwork a year ago

      @@ralfelfving love it! Open source baby yeah!!

  • @nabgilby
    @nabgilby 4 months ago

    Just tried this, works great, thanks and I liked it too!

  • @biancapietersz
    @biancapietersz a year ago +3

    I just found your content and am glad you are making tutorials on this. Have you been able to mitigate the latency?

    • @ralfelfving
      @ralfelfving  a year ago

      Which latency are you thinking of?

    • @biancapietersz
      @biancapietersz a year ago +1

      for example, when someone responds, it takes generation time for the API requests to get the proper info and generate the text and then the speech, so there is a 5-10 second lag in response time. I'm trying to figure out a way to make it respond faster.

    • @ralfelfving
      @ralfelfving  a year ago +3

      If I remember correctly, the way I set it up in this tutorial is the fastest currently possible with OpenAI. You have these processing components:
      1. The person speaks for 10 seconds
      2. Send audio to Whisper
      3. Whisper processes said audio and responds with a transcript
      4. Send transcript to GPTx (I used 3.5 turbo)
      5. GPTx processes it and returns a response
      6. Send response to TTS
      7. TTS responds with audio, which is played back to the user.
      In 1 & 2 you could technically stream chunks of audio and get them transcribed as the user speaks, such that much of the transcription is done once the user has stopped talking, and then join that all together for step 4.
      Step 4 has to happen after all of steps 1-3 have completed. For GPTx to give you a useful answer, it needs to receive the full question from the user.
      Step 5 supports streaming output, but iirc step 6 doesn't support streaming input (yet). That means that, as of today, you have to wait for GPTx to give you the entire output before you can process the TTS response. You could look into something similar to what I mentioned above: chunk GPTx responses into sentences and have TTS generate the audio piece by piece (see the sketch below). The TTS response itself streams in my script, so it will start playing when it has the first few words.
      The only clear handover point where the full information is needed is 3-4, the rest is solvable -- and OpenAI will make it better over time.
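
      A minimal sketch of that sentence-chunking idea, assuming the openai npm package (v4) and Node 18+; speakSentence() is a hypothetical playback helper, not the tutorial's actual code:

      import OpenAI from 'openai';

      const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

      async function streamChatToTts(question) {
        // Step 5 with streaming output: tokens arrive as they are generated.
        const stream = await openai.chat.completions.create({
          model: 'gpt-3.5-turbo',
          messages: [{ role: 'user', content: question }],
          stream: true,
        });

        let pending = '';
        for await (const chunk of stream) {
          pending += chunk.choices[0]?.delta?.content ?? '';
          // Flush to TTS (step 6) as soon as a full sentence is buffered,
          // instead of waiting for the whole GPT response.
          const m = pending.match(/^(.*?[.!?])\s+(.*)$/s);
          if (m) {
            await speakSentence(m[1]);
            pending = m[2];
          }
        }
        if (pending.trim()) await speakSentence(pending);
      }

      // Hypothetical helper: one TTS request per sentence (step 7).
      async function speakSentence(sentence) {
        const speech = await openai.audio.speech.create({
          model: 'tts-1',
          voice: 'alloy',
          input: sentence,
        });
        const audio = Buffer.from(await speech.arrayBuffer());
        // ...pipe `audio` to the speaker package, as in the tutorial...
      }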

    • @biancapietersz
      @biancapietersz a year ago +1

      @@ralfelfving yeah, I've considered chunking it in bits, but it's possible the responses would be inaccurate without the full scope and context of what is being said.
      It's helpful that you've mentioned this with step 4.
      This is a wildly helpful answer. I so appreciate it!

  • @marcuscarter
    @marcuscarter 11 months ago +1

    Hi, great video, well above my level, but I have a quick question: could you actually have a 'meaningful' conversation with it as you would with ChatGPT?

    • @ralfelfving
      @ralfelfving  11 months ago

      Yes, it's OpenAI's GPT models under the hood of both, so they'd be very similar.

    • @marcuscarter
      @marcuscarter 11 months ago

      ok great, thanks for the information. I'm trying to work out how to put this tech into an app, so this could be the way. Many thanks and good luck with the channel

  • @armankarambakhsh4456
    @armankarambakhsh4456 a year ago +1

    Could someone pleaaaase tell me if they have successfully run this on Windows? I use VS Community 2022 and I constantly get dependency errors, like for node-microphone.
    I have the .js + .env file in the project + node.js installed and configured for VS + the ffmpeg path listed in the Windows environment variables.
    Feels so stupid to be stuck at such a simple thing 😭

    • @ralfelfving
      @ralfelfving  a year ago

      Someone commented on the linked Medium article that they got it working on Windows. Did you install the dependencies, like the Node microphone package?

    • @armankarambakhsh4456
      @armankarambakhsh4456 a year ago +1

      @@ralfelfving I ran them all and it said successful. Like 25 dependencies. But when I ran app.js, it gave an error for microphone. And when I ran npm install for microphone, it gave like tons of errors 😕

    • @ralfelfving
      @ralfelfving  a year ago

      You'd need to resolve the errors for the microphone npm install.

  • @AndAllTravel
    @AndAllTravel a year ago +1

    Excellent content... I'm also having an issue with 'npm install speaker'. Rosetta didn't seem to help. Any other ideas? Without speaker, the app otherwise seems to work but fails after hitting 'enter'

    • @ralfelfving
      @ralfelfving  a year ago

      Thanks. I think I forgot to mention it in the blog post because it's not an npm package -- but did you get prompted to install SoX (Sound eXchange)? It would be done using brew.

    • @AndAllTravel
      @AndAllTravel a year ago

      @ralfelfving sox installed but doesn't seem to make a difference. (gyp is not happy lol) It seems to be a common problem but also appears unfixed in the community. I tried to edit 'node-gypi' with the proper MACOSX version to no avail. Here is the log if you are interested: drive.google.com/file/d/1_aNOfPjiAfIBqf2KvUHUVx-Hd9JJu6lJ/view?usp=share_link

  • @zoltanfejedelem9372
    @zoltanfejedelem9372 2 months ago

    Great work, thank you.
    I have a question: if I want it to recite a text of, for example, 3999 characters and save it to mp3 in a given language, how does that work?

  • @MariastellaALBARELLI
    @MariastellaALBARELLI 6 months ago

    Hello, how can I attach the audio to an assistant using thread messages? Thank you

  • @EL-tirol
    @EL-tirol a year ago +1

    As I understand it, this is connected to the general GPT-3.5 model, not to a customized API Assistant? It would be cool to create the same voice-input/voice-output but with your own customized assistant, similar to the way they did during the DevDay presentation :)

    • @ralfelfving
      @ralfelfving  a year ago

      The GPT model you choose to use is just an API call; you can switch it out for whichever model you prefer by changing the API call -- GPT-4, Assistants API, a custom model running locally, ....
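
      For example, a sketch assuming the openai npm package (v4), where `transcript` is the Whisper output from the tutorial:

      const completion = await openai.chat.completions.create({
        model: 'gpt-4', // was 'gpt-3.5-turbo'; any chat model name works here
        messages: [{ role: 'user', content: transcript }],
      });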

    • @EL-tirol
      @EL-tirol a year ago

      @@ralfelfving yep, but calling the Assistants API seems trickier as it does not support streaming as of now

  • @firaunic
    @firaunic 5 months ago +1

    Can we do the speech-to-text part with Whisper from OpenAI but get the actual response from some other model, like Gemini or any other local model endpoint other than ChatGPT?

    • @ralfelfving
      @ralfelfving  5 months ago

      Yeah, just chain in another API call.
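
      A sketch of that chaining, assuming a local server that exposes an OpenAI-compatible endpoint (e.g. Ollama); the base URL and model name are assumptions:

      import fs from 'fs';
      import OpenAI from 'openai';

      const openaiClient = new OpenAI(); // for Whisper
      const localClient = new OpenAI({
        baseURL: 'http://localhost:11434/v1', // e.g. Ollama's OpenAI-compatible API
        apiKey: 'unused-locally',
      });

      // Speech-to-text still goes to OpenAI Whisper...
      const transcription = await openaiClient.audio.transcriptions.create({
        file: fs.createReadStream('output.wav'),
        model: 'whisper-1',
      });

      // ...but the answer comes from the local model instead of ChatGPT.
      const reply = await localClient.chat.completions.create({
        model: 'llama3', // whatever model the local endpoint serves
        messages: [{ role: 'user', content: transcription.text }],
      });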

  • @pennychewer8931
    @pennychewer8931 10 months ago

    Is there a way to customise the voice?

  • @aranthos
    @aranthos a year ago +1

    Are there ways to tweak the output in terms of pacing and vocal intensity?

    • @ralfelfving
      @ralfelfving  a year ago

      No, not with OpenAI TTS right now/yet. The only option with that API is the speed of the audio in the file, but that's not pacing/vocal intensity.

  • @AI_Escaped
    @AI_Escaped a year ago +2

    Awesome, can't wait to try. Too bad GPT is all jacked lately. How would one do this using a wakeup word or other stimulation to get the program's attention?

    • @ralfelfving
      @ralfelfving  a year ago +1

      I'm not sure about wakeup words, because you'd need a process listening at all times to recognize a word. A shorthand would probably be a keyboard shortcut, which you could do if you packaged it with e.g. Electron.
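
      A sketch of the keyboard-shortcut route, assuming Electron; toggleRecording() is a hypothetical hook into the tutorial's recording logic:

      const { app, globalShortcut } = require('electron');

      app.whenReady().then(() => {
        // Fire the assistant from anywhere with a global hotkey.
        globalShortcut.register('CommandOrControl+Shift+Space', () => {
          toggleRecording(); // hypothetical: start/stop the mic capture
        });
      });

      app.on('will-quit', () => globalShortcut.unregisterAll());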

    • @AI_Escaped
      @AI_Escaped a year ago +2

      @@ralfelfving I guess leaving the mic open would work, but you would be paying for the API for everything it processes. Maybe a small open model running locally to just listen for the wakeup word, and then it's passed to the OpenAI API?

  • @user-us2um3zk7n
    @user-us2um3zk7n a year ago +1

    unfortunately I got stuck with an error:
    Press Enter when you're ready to start speaking.
    Recording... Press Enter to stop
    Recording stopped, processing audio...
    Error: 400 - Bad Request

    • @ralfelfving
      @ralfelfving  a year ago

      Console log the API inputs before the call and the errors of the API call to the terminal to find out what's causing the 400. I suspect the root cause is that you're not appending an audio file because the app doesn't have access to the microphone, or that the microphone source is incorrect and you're sending a silent file.
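
      A sketch of that debugging step, assuming the tutorial's output.wav and the openai npm package (v4):

      import fs from 'fs';
      import OpenAI from 'openai';

      const openai = new OpenAI();

      // A zero-byte (or near-zero) file means the mic was never captured.
      console.log('output.wav size:', fs.statSync('output.wav').size, 'bytes');

      try {
        const transcription = await openai.audio.transcriptions.create({
          file: fs.createReadStream('output.wav'),
          model: 'whisper-1',
        });
        console.log('Transcript:', transcription.text);
      } catch (err) {
        // Surfaces the status and message of the failing request.
        console.error('Whisper call failed:', err.status, err.message);
      }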

    • @Shardus
      @Shardus a year ago +1

      I had the same issue. It was because nothing was getting recorded and the output.wav file was empty. On my Linux system I had to set the device to 'default' by changing the new Microphone line to: mic = new Microphone({device:'default'});
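
      In the tutorial's setup that one-line change would look roughly like this (a sketch, assuming the node-microphone package):

      import Microphone from 'node-microphone';

      // On Linux, point the capture at ALSA's default device; without this,
      // output.wav can end up empty and Whisper returns a 400.
      const mic = new Microphone({ device: 'default' });
      const micStream = mic.startRecording();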

  • @mickelodiansurname9578
    @mickelodiansurname9578 11 months ago

    You need more subscribers mate, 2.5k is a shame to be honest given the knowledge you are sharing, what is the YT algo up to?

  • @musumo1908
    @musumo1908 a year ago +1

    Hey, great vid. Any way to add TTS as a function to the new GPT-4 preview OpenAI assistant? thx

    • @ralfelfving
      @ralfelfving  a year ago

      I don't understand your question, can you describe it in an example?

    • @musumo1908
      @musumo1908 a year ago

      @@ralfelfving hey, my reply seems to have gone? Let me rephrase. I was hoping to use TTS with my OpenAI assistant that uses the new GPT-4 preview (the Assistants post 06/11/23). What's the best way to integrate this? So basically I want a talking OpenAI assistant…

  • @burakince4283
    @burakince4283 24 days ago

    Can I use my own data for TTS?

  • @crististanciu7708
    @crististanciu7708 5 months ago

    Hi there, thanks for this great job.
    Can you tell us how we can make this 2-in-1, meaning it gives audio responses also when users type their questions, not only when they speak them?
    Thank you!
    Edit:
    Never mind, ChatGPT updated the code, and now it works via messages. Thanks.

  • @doston8795
    @doston8795 a year ago

    hey, can I add this to a UI, and how can I do that? can you advise me please? thank you

  • @ventureaddict
    @ventureaddict a year ago +1

    Love this! Thank you! How would I swap out OpenAI TTS for the Eleven Labs TTS model?

    • @ralfelfving
      @ralfelfving  a year ago

      You'd just change the OpenAI TTS call to an ElevenLabs API call instead.
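
      A sketch of that swap, assuming Node 18+ fetch; the voice ID is a placeholder and `gptResponse` is the text returned by the GPT step:

      import fs from 'fs';

      const res = await fetch(
        'https://api.elevenlabs.io/v1/text-to-speech/YOUR_VOICE_ID',
        {
          method: 'POST',
          headers: {
            'xi-api-key': process.env.ELEVENLABS_API_KEY,
            'Content-Type': 'application/json',
          },
          body: JSON.stringify({ text: gptResponse }),
        }
      );
      // The response body is the audio itself (mp3 by default).
      fs.writeFileSync('reply.mp3', Buffer.from(await res.arrayBuffer()));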

  • @greendsnow
    @greendsnow a year ago +4

    Pricing:
    Google
    - Transcription: $0.024 / minute
    - TTS: $0.016 / 1K characters
    OpenAI
    - Whisper: $0.006 / minute
    - TTS: $0.015 / 1K characters
    - TTS HD: $0.030 / 1K characters

    • @dorg9502
      @dorg9502 a year ago +1

      Or you could use one of the non-GPT alternatives and run it locally or from your own server.

    • @yantaosong
      @yantaosong a year ago

      good idea, which alternatives? Whisper for speech-to-text and Llama to answer? @@dorg9502

    • @greendsnow
      @greendsnow a year ago

      @@dorg9502 I don't have an Nvidia GPU, I'm not planning to buy one

  • @snot8783
    @snot8783 a year ago +1

    can I do the same using Python?

    • @LearnCode_withAI
      @LearnCode_withAI a year ago

      Yes, of course, you can find all the details on the OpenAI platform

    • @ralfelfving
      @ralfelfving  a year ago

      Absolutely. The OpenAI community has a lot of people building with Python, and sharing examples.

  • @kamalkamals
    @kamalkamals a year ago +1

    The question is how you install the speaker package?

    • @ralfelfving
      @ralfelfving  a year ago

      Try running Terminal with Rosetta.

    • @AndAllTravel
      @AndAllTravel a year ago

      same problem... Terminal with Rosetta didn't seem to help

    • @kamalkamals
      @kamalkamals a year ago

      @@ralfelfving I can't understand your answer, what is the relation between installing the speaker package and running Terminal with Rosetta?!

    • @ralfelfving
      @ralfelfving  a year ago

      @@kamalkamals Some packages may only work/be compatible with running Terminal with Rosetta.

    • @kamalkamals
      @kamalkamals a year ago

      that's not best practice to force using a particular terminal, probably you need to update your code :) @@ralfelfving

  • @Hazar-bt6nf
    @Hazar-bt6nf 6 months ago

    Can it be run on a Raspberry Pi 5?

  • @irangasamarakoon4160
    @irangasamarakoon4160 a year ago

    this is amazing...

  • @Mirkolinori
    @Mirkolinori 8 months ago

    Perfect