Build a voice assistant with OpenAI Whisper and TTS (text to speech) in 5 minutes

  • Published: Jan 24, 2025

Comments • 67

  • @TestTalk
    @TestTalk a year ago +6

    My word, I can't tell you how much I now look forward to your videos! Keep up the great work!

    • @ralfelfving
      @ralfelfving  a year ago +1

      @TestTalk thank you so much for the kind words, hopefully many more coming over the weekend and in the months ahead :)

    • @TestTalk
      @TestTalk a year ago

      Windows user here, I'm not sure if you mentioned it in your article, but I had to download SoX and then edit the environment variables. Not sure if that helps or not, but figured I would share and help the YT algorithm for you. @@ralfelfving

  • @mahtabalam9604
    @mahtabalam9604 a year ago +3

    Immense value bro thanks for the informative videos!

    • @ralfelfving
      @ralfelfving  a year ago

      Glad it helps, thanks for the comment! ♥️

  • @JoJoAcrylicArtwork
    @JoJoAcrylicArtwork a year ago +3

    fantastic! thanks so much for sharing, this is exactly what I was looking to do

    • @ralfelfving
      @ralfelfving  a year ago +1

      Great, that's what my tutorials are for! :)

    • @JoJoAcrylicArtwork
      @JoJoAcrylicArtwork a year ago

      @@ralfelfving love it! Open source baby yeah!!

  • @nabgilby
    @nabgilby 4 months ago

    Just tried this, works great, thanks and I liked it too!

  • @biancapietersz
    @biancapietersz a year ago +3

    I just found your content and am glad you are making tutorials on this. Have you been able to mitigate the latency?

    • @ralfelfving
      @ralfelfving  a year ago

      Which latency are you thinking of?

    • @biancapietersz
      @biancapietersz a year ago +1

      for example, when someone responds, it takes generation time for the API requests to get the proper info and generate the text and then the speech, so there is a 5-10 second lag in response time. I'm trying to figure out a way to make it respond faster.

    • @ralfelfving
      @ralfelfving  a year ago +3

      If I remember correctly, the way I set it up in this tutorial is the fastest currently possible with OpenAI. You have these processing components:
      1. The person speaks for 10 seconds
      2. Send audio to Whisper
      3. Whisper processes said audio and responds with a transcript
      4. Send transcript to GPTx (I used 3.5 turbo)
      5. GPTx processes it and returns a response
      6. Send response to TTS
      7. TTS responds with audio, which is played back to the user.
      In 1 & 2 you could technically stream chunks of audio and get them transcribed as the user speaks, such that much of the transcription is done once the user has stopped talking, and then join that all together for step 4.
      Step 4 has to happen after all of steps 1-3 have completed. For GPTx to give you a useful answer, it needs to receive the full question from the user.
      Step 5 supports streaming output, but iirc step 6 doesn't support streaming input (yet). That means that, as of today, you have to wait for GPTx to give you the entire output before you can process the TTS response. You could look into something similar to what I mentioned above: chunk GPTx responses into sentences and have TTS generate the audio piece by piece (see the sketch below). The TTS response itself streams in my script, so it will start playing when it has the first few words.
      The only clear handover point where the full information is needed is 3-4, the rest is solvable -- and OpenAI will make it better over time.
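
      A minimal sketch of that sentence-chunking idea, assuming the openai npm package (v4) and Node 18+; speakSentence() is a hypothetical playback helper, not the tutorial's actual code:

      import OpenAI from 'openai';

      const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

      async function streamChatToTts(question) {
        // Step 5 with streaming output: tokens arrive as they are generated.
        const stream = await openai.chat.completions.create({
          model: 'gpt-3.5-turbo',
          messages: [{ role: 'user', content: question }],
          stream: true,
        });

        let pending = '';
        for await (const chunk of stream) {
          pending += chunk.choices[0]?.delta?.content ?? '';
          // Flush to TTS (step 6) as soon as a full sentence is buffered,
          // instead of waiting for the whole GPT response.
          const m = pending.match(/^(.*?[.!?])\s+(.*)$/s);
          if (m) {
            await speakSentence(m[1]);
            pending = m[2];
          }
        }
        if (pending.trim()) await speakSentence(pending);
      }

      // Hypothetical helper: one TTS request per sentence (step 7).
      async function speakSentence(sentence) {
        const speech = await openai.audio.speech.create({
          model: 'tts-1',
          voice: 'alloy',
          input: sentence,
        });
        const audio = Buffer.from(await speech.arrayBuffer());
        // ...pipe `audio` to the speaker package, as in the tutorial...
      }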

    • @biancapietersz
      @biancapietersz a year ago +1

      @@ralfelfving yeah, I've considered chunking it in bits, but it's possible the responses would be inaccurate without the full scope and context of what is being said.
      It's helpful that you've mentioned this with step 4.
      This is a wildly helpful answer. I so appreciate it!

  • @marcuscarter
    @marcuscarter 11 months ago +1

    Hi, great video, well above my level, but I have a quick question: could you actually have a 'meaningful' conversation with it as you would with ChatGPT?

    • @ralfelfving
      @ralfelfving  11 months ago

      Yes, it's OpenAI's GPT models under the hood of both, so they'd be very similar.

    • @marcuscarter
      @marcuscarter 11 months ago

      ok great, thanks for the information. I'm trying to work out how to put this tech into an app, so this could be the way. Many thanks and good luck with the channel

  • @armankarambakhsh4456
    @armankarambakhsh4456 a year ago +1

    Could someone pleaaaase tell me if they have successfully run this on Windows? I use VS Community 2022 and I constantly get dependency errors, like for node-microphone.
    I have the .js + .env file in the project + node.js installed and configured for VS + the ffmpeg path listed in the Windows environment variables.
    Feels so stupid to be stuck at such a simple thing 😭

    • @ralfelfving
      @ralfelfving  a year ago

      Someone commented on the linked Medium article that they got it working on Windows. Did you install the dependencies, like the Node microphone package?

    • @armankarambakhsh4456
      @armankarambakhsh4456 a year ago +1

      @@ralfelfving I ran them all and it said successful. Like 25 dependencies. But when I ran app.js, it gave an error for microphone. And when I ran npm install for microphone, it gave like tons of errors 😕

    • @ralfelfving
      @ralfelfving  a year ago

      You'd need to resolve the errors for the microphone npm install.

  • @AndAllTravel
    @AndAllTravel a year ago +1

    Excellent content... I'm also having an issue with 'npm install speaker'. Rosetta didn't seem to help. Any other ideas? Without speaker, the app otherwise seems to work but fails after hitting 'enter'

    • @ralfelfving
      @ralfelfving  a year ago

      Thanks. I think I forgot to mention it in the blog post because it's not an npm package -- but did you get prompted to install SoX (Sound eXchange)? It would be done using brew.

    • @AndAllTravel
      @AndAllTravel a year ago

      @ralfelfving sox installed but doesn't seem to make a difference. (gyp is not happy lol) It seems to be a common problem but also appears unfixed in the community. I tried to edit 'node-gypi' with the proper MACOSX version to no avail. Here is the log if you are interested: drive.google.com/file/d/1_aNOfPjiAfIBqf2KvUHUVx-Hd9JJu6lJ/view?usp=share_link

  • @zoltanfejedelem9372
    @zoltanfejedelem9372 2 months ago

    Great work, thank you.
    I have a question: if I want it to recite a text of, for example, 3999 characters and save it to mp3 in a given language, how does that work?

  • @MariastellaALBARELLI
    @MariastellaALBARELLI 6 months ago

    Hello, how can I attach the audio to an assistant using thread messages? Thank you

  • @EL-tirol
    @EL-tirol a year ago +1

    As I understand it, this is connected to the general GPT-3.5 model, not to a customized API Assistant? It would be cool to create the same voice-input/voice-output but with your own customized assistant, similar to the way they did during the DevDay presentation :)

    • @ralfelfving
      @ralfelfving  a year ago

      The GPT model you choose to use is just an API call; you can switch it out for whichever model you prefer by changing the API call -- GPT-4, Assistants API, a custom model running locally, ....
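
      For example, a sketch assuming the openai npm package (v4), where `transcript` is the Whisper output from the tutorial:

      const completion = await openai.chat.completions.create({
        model: 'gpt-4', // was 'gpt-3.5-turbo'; any chat model name works here
        messages: [{ role: 'user', content: transcript }],
      });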

    • @EL-tirol
      @EL-tirol a year ago

      @@ralfelfving yep, but calling the Assistants API seems trickier as it does not support streaming as of now

  • @firaunic
    @firaunic 5 months ago +1

    Can we do the speech-to-text part with Whisper from OpenAI but get the actual response from some other model, like Gemini or any other local model endpoint other than ChatGPT?

    • @ralfelfving
      @ralfelfving  5 months ago

      Yeah, just chain in another API call.
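
      A sketch of that chaining, assuming a local server that exposes an OpenAI-compatible endpoint (e.g. Ollama); the base URL and model name are assumptions:

      import fs from 'fs';
      import OpenAI from 'openai';

      const openaiClient = new OpenAI(); // for Whisper
      const localClient = new OpenAI({
        baseURL: 'http://localhost:11434/v1', // e.g. Ollama's OpenAI-compatible API
        apiKey: 'unused-locally',
      });

      // Speech-to-text still goes to OpenAI Whisper...
      const transcription = await openaiClient.audio.transcriptions.create({
        file: fs.createReadStream('output.wav'),
        model: 'whisper-1',
      });

      // ...but the answer comes from the local model instead of ChatGPT.
      const reply = await localClient.chat.completions.create({
        model: 'llama3', // whatever model the local endpoint serves
        messages: [{ role: 'user', content: transcription.text }],
      });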

  • @pennychewer8931
    @pennychewer8931 10 months ago

    Is there a way to customise the voice?

  • @aranthos
    @aranthos a year ago +1

    Are there ways to tweak the output in terms of pacing and vocal intensity?

    • @ralfelfving
      @ralfelfving  a year ago

      No, not with OpenAI TTS right now/yet. The only option with that API is the speed of the audio in the file, but that's not pacing/vocal intensity.

  • @AI_Escaped
    @AI_Escaped a year ago +2

    Awesome, can't wait to try. Too bad GPT is all jacked lately. How would one do this using a wakeup word or other stimulation to get the program's attention?

    • @ralfelfving
      @ralfelfving  a year ago +1

      I'm not sure about wakeup words, because you'd need a process listening at all times to recognize a word. A shorthand would probably be a keyboard shortcut, which you could do if you packaged it with e.g. Electron.
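
      A sketch of the keyboard-shortcut route, assuming Electron; toggleRecording() is a hypothetical hook into the tutorial's recording logic:

      const { app, globalShortcut } = require('electron');

      app.whenReady().then(() => {
        // Fire the assistant from anywhere with a global hotkey.
        globalShortcut.register('CommandOrControl+Shift+Space', () => {
          toggleRecording(); // hypothetical: start/stop the mic capture
        });
      });

      app.on('will-quit', () => globalShortcut.unregisterAll());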

    • @AI_Escaped
      @AI_Escaped a year ago +2

      @@ralfelfving I guess leaving the mic open would work, but you would be paying for the API for everything it processes. Maybe a small open model running locally to just listen for the wakeup word, and then it's passed to the OpenAI API?

  • @user-us2um3zk7n
    @user-us2um3zk7n a year ago +1

    unfortunately I got stuck with an error:
    Press Enter when you're ready to start speaking.
    Recording... Press Enter to stop
    Recording stopped, processing audio...
    Error: 400 - Bad Request

    • @ralfelfving
      @ralfelfving  a year ago

      Console log the API inputs before the call and the errors of the API call to the terminal to find out what's causing the 400. I suspect the root cause is that you're not appending an audio file because the app doesn't have access to the microphone, or that the microphone source is incorrect and you're sending a silent file.
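
      A sketch of that debugging step, assuming the tutorial's output.wav and the openai npm package (v4):

      import fs from 'fs';
      import OpenAI from 'openai';

      const openai = new OpenAI();

      // A zero-byte (or near-zero) file means the mic was never captured.
      console.log('output.wav size:', fs.statSync('output.wav').size, 'bytes');

      try {
        const transcription = await openai.audio.transcriptions.create({
          file: fs.createReadStream('output.wav'),
          model: 'whisper-1',
        });
        console.log('Transcript:', transcription.text);
      } catch (err) {
        // Surfaces the status and message of the failing request.
        console.error('Whisper call failed:', err.status, err.message);
      }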

    • @Shardus
      @Shardus a year ago +1

      I had the same issue. It was because nothing was getting recorded and the output.wav file was empty. On my Linux system I had to set the device to 'default' by changing the new Microphone line to: mic = new Microphone({device:'default'});
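
      In the tutorial's setup that one-line change would look roughly like this (a sketch, assuming the node-microphone package):

      import Microphone from 'node-microphone';

      // On Linux, point the capture at ALSA's default device; without this,
      // output.wav can end up empty and Whisper returns a 400.
      const mic = new Microphone({ device: 'default' });
      const micStream = mic.startRecording();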

  • @mickelodiansurname9578
    @mickelodiansurname9578 11 months ago

    You need more subscribers mate, 2.5k is a shame to be honest given the knowledge you are sharing, what is the YT algo up to?

  • @musumo1908
    @musumo1908 a year ago +1

    Hey, great vid. Any way to add TTS as a function to the new GPT-4 preview OpenAI assistant? thx

    • @ralfelfving
      @ralfelfving  a year ago

      I don't understand your question, can you describe it in an example?

    • @musumo1908
      @musumo1908 a year ago

      @@ralfelfving hey, my reply seems to have gone? Let me rephrase. I was hoping to use TTS with my OpenAI assistant that uses the new GPT-4 preview (the Assistants post 06/11/23). What's the best way to integrate this? So basically I want a talking OpenAI assistant…

  • @burakince4283
    @burakince4283 24 days ago

    Can I use my own data for TTS?

  • @crististanciu7708
    @crististanciu7708 5 months ago

    Hi there, thanks for this great job.
    Can you tell us how we can make this 2-in-1, meaning it gives audio responses also when users type their questions, not only when they speak them?
    Thank you!
    Edit:
    Never mind, ChatGPT updated the code, and now it works via messages. Thanks.

  • @doston8795
    @doston8795 a year ago

    hey, can I add this to a UI, and how can I do that? can you advise me please? thank you

  • @ventureaddict
    @ventureaddict a year ago +1

    Love this! Thank you! How would I swap out OpenAI TTS for the Eleven Labs TTS model?

    • @ralfelfving
      @ralfelfving  a year ago

      You'd just change the OpenAI TTS call to an ElevenLabs API call instead.
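
      A sketch of that swap, assuming Node 18+ fetch; the voice ID is a placeholder and `gptResponse` is the text returned by the GPT step:

      import fs from 'fs';

      const res = await fetch(
        'https://api.elevenlabs.io/v1/text-to-speech/YOUR_VOICE_ID',
        {
          method: 'POST',
          headers: {
            'xi-api-key': process.env.ELEVENLABS_API_KEY,
            'Content-Type': 'application/json',
          },
          body: JSON.stringify({ text: gptResponse }),
        }
      );
      // The response body is the audio itself (mp3 by default).
      fs.writeFileSync('reply.mp3', Buffer.from(await res.arrayBuffer()));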

  • @greendsnow
    @greendsnow a year ago +4

    Pricing:
    Google
    - Transcription: $0.024 / minute
    - TTS: $0.016 / 1K characters
    OpenAI
    - Whisper: $0.006 / minute
    - TTS: $0.015 / 1K characters
    - TTS HD: $0.030 / 1K characters

    • @dorg9502
      @dorg9502 a year ago +1

      Or you could use one of the non-GPT alternatives and run it locally or from your own server.

    • @yantaosong
      @yantaosong a year ago

      good idea, which alternatives? Whisper for speech-to-text and Llama to answer? @@dorg9502

    • @greendsnow
      @greendsnow a year ago

      @@dorg9502 I don't have an Nvidia GPU, I'm not planning to buy one

  • @snot8783
    @snot8783 a year ago +1

    can I do the same using Python?

    • @LearnCode_withAI
      @LearnCode_withAI a year ago

      Yes, of course, you can find all the details on the OpenAI platform

    • @ralfelfving
      @ralfelfving  a year ago

      Absolutely. The OpenAI community has a lot of people building with Python, and sharing examples.

  • @kamalkamals
    @kamalkamals a year ago +1

    The question is how you install the speaker package?

    • @ralfelfving
      @ralfelfving  a year ago

      Try running Terminal with Rosetta.

    • @AndAllTravel
      @AndAllTravel a year ago

      same problem... Terminal with Rosetta didn't seem to help

    • @kamalkamals
      @kamalkamals a year ago

      @@ralfelfving I can't understand your answer, what is the relation between installing the speaker package and running Terminal with Rosetta?!

    • @ralfelfving
      @ralfelfving  a year ago

      @@kamalkamals Some packages may only work/be compatible with running Terminal with Rosetta.

    • @kamalkamals
      @kamalkamals a year ago

      that's not best practice to force using a particular terminal, probably you need to update your code :) @@ralfelfving

  • @Hazar-bt6nf
    @Hazar-bt6nf 6 months ago

    Can it be run on a Raspberry Pi 5?

  • @irangasamarakoon4160
    @irangasamarakoon4160 a year ago

    this is amazing...

  • @Mirkolinori
    @Mirkolinori 8 months ago

    Perfect