Local Low Latency Speech to Speech - Mistral 7B + OpenVoice / Whisper | Open Source AI

  • Published: 25 Jan 2025

Comments • 199

  • @JohnSmith762A11B
    @JohnSmith762A11B 1 year ago +60

    More suggestions: add a "thought completed" detection layer that decides when the user has finished speaking based on the STT input so far (context, natural pauses, and such) and auto-submits the text to the AI backend. Then have the app immediately begin listening to the microphone at the conclusion of playback of the AI's TTS-converted response. Yes, sometimes the AI will interrupt the speaker if they hadn't entirely finished what they wanted to say, but that is how real human conversations work when one person perceives the other has finished their thought and chooses to respond. Also, if the user says "What?" or "(Could you) repeat that?" or "Please repeat?" or "Say again?" or "Sorry, I missed that," the system should simply play the last WAV file again without going for another round trip to the AI inference server and another TTS conversion of the text. Reserve Ctrl-C for stopping and starting this continuous auto-voice recording and response process instead. This will shave many precious milliseconds of latency and make the conversation much more natural and less like using a walkie-talkie.

    • @SaiyD
      @SaiyD 1 year ago +1

      Nice. Let me give one suggestion for your suggestion: add a random choice with a 50% chance to either replay the audio or send the input to the backend.

    • @ChrizzeeB
      @ChrizzeeB 1 year ago

      so it'd be sending the STT input again and again with every new word detected? rather than just at the end of a sentence or message?

    • @deltaxcd
      @deltaxcd 11 months ago +3

      I have a better idea: feed it a partial prompt without waiting for the user to finish, and have it start generating a response at the slightest pause. If the user continues talking, more text is added to the prompt and the output is regenerated. If the user talks over the speaking AI, the AI terminates its response and continues listening.
      This will improve things twofold, because the model will have a chance to process the partial prompt, and it will reduce the time required to process the prompt later.
      If we combine that with not waiting for the full reply, the conversation will be completely natural.
      There is no need for any of that "say again" handling, because the AI will do it by itself if asked.
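The replay shortcut suggested above can be sketched as a small routing step in front of the pipeline. This is a minimal illustrative sketch, not the video's code; `handle_turn`, `last_reply_wav`, and the phrase list are all assumptions:

```python
import re

# Phrases that should trigger a replay of the cached reply
# instead of a new LLM round trip (illustrative list).
REPLAY_PATTERNS = re.compile(
    r"^\s*(what|say again|(could you |please )?repeat( that)?|sorry,? i missed that)"
    r"\s*[.?!]*\s*$",
    re.IGNORECASE,
)

def should_replay(transcript: str) -> bool:
    """Return True if the user is asking to hear the last answer again."""
    return bool(REPLAY_PATTERNS.match(transcript))

def handle_turn(transcript: str, last_reply_wav):
    """Decide between replaying the cached WAV and a full round trip."""
    if last_reply_wav and should_replay(transcript):
        return ("replay", last_reply_wav)   # skip STT->LLM->TTS round trip
    return ("infer", transcript)            # normal pipeline
```

The player would then reuse the cached file on a "replay" decision and only call the inference server on "infer".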

  • @BruceWayne15325
    @BruceWayne15325 1 year ago +19

    Very impressive! I'd love to see them implement this in smartphones for real-time translation when visiting foreign countries / restaurants.

    • @optimyse
      @optimyse 1 year ago +1

      S24 Ultra?

    • @deltaxcd
      @deltaxcd 11 months ago

      There are models that do speech-to-speech translation.

  • @williamjustus2654
    @williamjustus2654 1 year ago +11

    Some of the best work and fun that I have seen so far. Can't wait to try on my own. Keep up the great work!!

  • @nyny
    @nyny 1 year ago +13

    That's supah cool, I actually built something almost exactly like this yesterday, and I get about the same performance. The hard part is figuring out threading/process pools/asyncio to get the latency down. I used small instead of base; I think I get about the same response or better.
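The overlap that threading/asyncio buys can be sketched with a producer/consumer queue: synthesis of one sentence runs while the next is still being generated. A minimal sketch with stand-in stages (the real calls to Whisper/Mistral/OpenVoice are replaced by no-ops here):

```python
import asyncio

async def pipeline(sentences):
    """Overlap 'LLM' generation and 'TTS' synthesis via a queue, so
    synthesis of sentence N can run while sentence N+1 is generated.
    The llm/tts bodies are stand-ins for the real calls."""
    queue: asyncio.Queue = asyncio.Queue()
    spoken = []

    async def llm():
        for s in sentences:              # stand-in for token streaming
            await asyncio.sleep(0)       # yield control, as a real call would
            await queue.put(s)
        await queue.put(None)            # sentinel: generation finished

    async def tts():
        while (s := await queue.get()) is not None:
            await asyncio.sleep(0)       # stand-in for audio synthesis
            spoken.append(s)

    await asyncio.gather(llm(), tts())
    return spoken
```

Usage: `asyncio.run(pipeline(["Hi.", "How are you?"]))` plays the sentences in order while generation continues in the background.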

  • @Pure_Science_and_Technology
    @Pure_Science_and_Technology 1 year ago +25

    Awesome! Time to replace my slow speech-to-speech code using OpenAI. Also added ElevenLabs for a bit of a comedic touch. Thanks for putting this together.

    • @mayushi7792
      @mayushi7792 5 months ago +1

      How much did it cost you to integrate ElevenLabs?

  • @zyxwvutsrqponmlkh
      @zyxwvutsrqponmlkh 11 months ago +2

    I have tried open voice and bark, but VITS by far makes the most natural sounding voices.

  • @PhillipThomas87
    @PhillipThomas87 1 year ago +7

    I mean, this is dependent on your hardware... Are the specs anywhere for this "inference server"?

  • @avi7278
    @avi7278 1 year ago +5

    In the US we have this concept: if you watch a football game, which is notorious for having a shizload of commercials (i.e., latency), and you start watching the game 30 minutes late but from the beginning, you can skip most of the commercials. If you just shift the latency to the beginning, 15 seconds of "loading" would probably be sufficient for a 5-10 minute conversation between the two chatbots. You could also avoid loops by having a third-party observer who reviews the last 5 messages, determines if the conversation has gone "stale," and interjects a new idea into one of the interlocutors.
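The "stale conversation" observer described above could be as simple as a word-overlap check over the last few messages. An illustrative sketch; the Jaccard metric and the 0.6 threshold are arbitrary choices, not anything from the video:

```python
def is_stale(messages, window=5, threshold=0.6):
    """Heuristic staleness check: if the recent messages mostly repeat
    each other's words (high average pairwise Jaccard overlap), the
    conversation has gone stale and an observer should interject a
    new topic. Thresholds are illustrative, not tuned."""
    recent = [set(m.lower().split()) for m in messages[-window:]]
    if len(recent) < 2:
        return False
    pairs = [(a, b) for i, a in enumerate(recent) for b in recent[i + 1:]]
    overlap = sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)
    return overlap >= threshold
```

When `is_stale` fires, the observer would inject a fresh prompt into one of the two chatbots instead of relaying the last reply.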

  • @MelindaGreen
    @MelindaGreen 11 months ago +13

    I'm daunted by the idea of setting up these development systems just to use a model. Any chance people can bundle them into one big executable for Windows and iOS? I sure would love to just load-and-go.

  • @ales240
    @ales240 1 year ago +2

    Just subscribed! can't wait to get my hands on it, looks super cool!

  • @gabrielsandstedt
    @gabrielsandstedt 1 year ago +9

    If you are fine venturing into C# or C++, then I know how you can improve the latency and create a single .exe that includes all of your different parts here, including using local models for the Whisper voice recognition. I have done this myself using LLamaSharp to run the GGUF file, and then embedding all the external Python into a batch process which it calls.

    • @gabrielsandstedt
      @gabrielsandstedt 9 months ago +2

      @@matthewfuller9760 I should put it there, actually. I have been jumping between projects lately without sharing much. Will send a link when it is up.

  • @LFPGaming
    @LFPGaming 1 year ago +2

    Do you know of any offline/local way to do translations? I've been searching but haven't found a way to do local translations of video or audio using large language models.

    • @deltaxcd
      @deltaxcd 11 months ago +1

      There is a program, "Subtitle Edit", which can do that.

  • @SaveTheHuman5
    @SaveTheHuman5 11 months ago +5

    Hello, can you please tell us your CPU, GPU, RAM, etc.?

  • @JohnSmith762A11B
    @JohnSmith762A11B 1 year ago +4

    I wonder if you are (or can, if not) caching the processed .mp3 voice model after the speech engine processes it and turns it into partials. That would cut out a lot of latency if it didn't need to process those 20 seconds of recorded voice audio every time. Right now it's pretty fast but the latency still sounds more like they are using walkie talkies than speaking on a phone.

    • @levieux1137
      @levieux1137 1 year ago +3

      it could go way further by using the native libs and dropping all the python-based wrappers that pass data between stages using files and that copy, copy, copy and recopy data all the time. For example llama.cpp is clearly recognizable in the lower layers, all the tunable parameters match it. I don't know for openvoice for example however, but the state the presenter arrived at shows that we're pretty close to reaching a DIY conversational robot, which is pretty cool.

    • @JohnSmith762A11B
      @JohnSmith762A11B 1 year ago

      @@levieux1137 By native libs, you mean the system tts speech on say Windows and macOS?

    • @levieux1137
      @levieux1137 1 year ago +2

      @@JohnSmith762A11B Not necessarily that, but I'm speaking about the underlying components that are used here. In fact, if you look, this is essentially Python code built as a wrapper on top of other parts that already run natively. The llama.cpp server, for example, is apparently used here. And once wrapped into layers and layers, it becomes heavy to transport contents from one layer to another (particularly when passing via files, but even memcpy is expensive). It might even be possible that some elements are re-loaded from scratch and re-initialized after each sentence. The Python script here appears to be mostly a wrapper around all such components, working like a shell script: recording input from the microphone to a file, then sending it to OpenVoice, then sending that output to a file, then loading another component with that file, etc. This is just like a shell script working with files and heavy initialization at every step. Dropping all that layering and directly using the native APIs of the various libs and components would be way more efficient. And it's very possible that past a point the author will discover that Python is not needed at all, which could suddenly offer more possibilities for lighter embedded processing.
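The embedding-caching idea raised earlier in this thread can be sketched independently of any particular TTS stack: hash the reference clip, extract once, reuse. `extract_fn` stands in for whatever slow extraction step the stack performs; nothing here is OpenVoice's actual API:

```python
import hashlib
import os
import pickle

def cached_embedding(audio_path, extract_fn, cache_dir=".voice_cache"):
    """Cache the speaker embedding extracted from a reference clip,
    keyed by the clip's content hash, so the (slow) extraction runs
    once instead of on every response. extract_fn is a stand-in for
    the stack's audio-to-embedding step (assumption)."""
    os.makedirs(cache_dir, exist_ok=True)
    with open(audio_path, "rb") as f:
        key = hashlib.sha256(f.read()).hexdigest()
    cache_file = os.path.join(cache_dir, key + ".pkl")
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as f:
            return pickle.load(f)   # cache hit: skip re-processing the clip
    emb = extract_fn(audio_path)    # cache miss: do the expensive extraction
    with open(cache_file, "wb") as f:
        pickle.dump(emb, f)
    return emb
```

Keying on the file's content hash means editing or swapping the reference clip automatically invalidates the cache.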

  • @deeplearningdummy
    @deeplearningdummy 11 months ago +4

    I've been trying to figure out how to do this. Great job. I want to support your work and get this up and running for myself, but is YouTube membership the only option?

  • @MavihMindsEverything
    @MavihMindsEverything 2 months ago +2

    What's the GPU you are using for it?

  • @irraz1
    @irraz1 9 months ago +2

    Wow! I would love to have such an assistant to practice languages. The "python hub" code, do you plan to share it at some point?

  • @3amknighttraders
    @3amknighttraders 20 days ago

    LOL , love the video bro. "gimme a second while i hack this shyt"

  • @swannschilling474
    @swannschilling474 1 year ago +3

    I am still using Tortoise but Open Voice seems to be promising! 😊 Thanks for this video!! 🎉🎉🎉

  • @ywueeee
    @ywueeee 1 month ago

    is there an updated version of this?

  • @kleber1983
    @kleber1983 10 months ago +1

    Hi, I'd like to know the computer specs required to run your speech-to-speech system. I'm quite interested, but I first need to know if my computer can handle it. Thanks.

  • @tommoves3385
    @tommoves3385 1 year ago +1

    Hey Kris - that is awesome. I like it very much. Great that you do this open source stuff. Very cool 😎.

  • @arkdirfe
    @arkdirfe 11 months ago

    Interesting, this is similar to a small project I made for myself. But instead of a chatbot conversation, the Whisper output is fed into SAM (yes, the funny robot voice) and sent to an audio output. Basically it makes SAM say whatever I say with a slight delay. I'm chopping up the speech into small segments so it can start transcribing while I speak for longer, which introduces occasional weirdness, but I'm fine with that.

  • @squiddymute
    @squiddymute 1 year ago +3

    no api = pure genius

  • @microponics2695
    @microponics2695 1 year ago +1

    I have the same uncensored model, and when I ask it to list curse words it says it can't do that. ???

    • @jungen1093
      @jungen1093 11 months ago

      Lmao that’s annoying

  • @denisblack9897
    @denisblack9897 1 year ago +1

    I've known about this for more than a year now and it still blows my mind. wtf

  • @yoagcur
    @yoagcur 1 year ago +1

    Fascinating. Any chance you could upgrade it so that specific voices could be used and a recording made automatically? Could make for some interesting Biden v. Trump debates.

  • @VincentDeLaCroix-p2z
    @VincentDeLaCroix-p2z 2 months ago

    Can you do the exact same thing for a non-coder? My LM Studio also doesn't look like yours; is it the updates?

  • @ryanjames3907
    @ryanjames3907 1 year ago +1

    Very cool, low-latency voice. Thanks for sharing. I watch all your videos, and I look forward to the next one.

  • @jacoballessio5706
    @jacoballessio5706 1 year ago

    I wonder if you could directly convert embeddings to speech to skip text inference

  • @nfrancisj2122
    @nfrancisj2122 2 months ago

    Do you have plans for a voice changer for video games?

  • @JohnGallie
    @JohnGallie 11 months ago +1

    Is there any way you can give the Python process 90% of system resources so it would be faster?

  • @MrScoffins
    @MrScoffins 11 months ago +2

    So if you disconnect your computer from the Internet, will it still work?

    • @jephbennett
      @jephbennett 11 months ago +1

      Yes, this code package is not calling remote APIs (which is why the latency is low), so it doesn't need an internet connection. The downside is that it cannot access info outside of its core dataset, so no current events or anything like that.

  • @TomM-p3o
    @TomM-p3o 1 year ago

    This is great, but personally I think speech recognition with push-to-talk or push-to-toggle is most useful.

  • @saitheagarajah
    @saitheagarajah 2 months ago

    I am a member and I don't see your GitHub repo for this project. Can you please share it with me?

  • @codygaudet8071
    @codygaudet8071 10 months ago

    Just earned yourself a sub sir!

  • @mertgundogdu211
    @mertgundogdu211 9 months ago +1

    How can I try this on my computer? I couldn't find talk.py in the GitHub code.

    • @Warz-cx6zk
      @Warz-cx6zk 5 months ago

      It's his own code; you need to become a member and wait for an invite to the GitHub community.

  • @jeffniekamp1044
    @jeffniekamp1044 3 months ago

    Once again, the requirements won't install, kicking off a couple of hours of digging through versions before anything might work. I wish it were a little more standard to clearly denote the Python version and the package manager being used. Neither conda nor venv would work for me. Beyond that, the project is very interesting, as most are...

  • @JayGee1
    @JayGee1 19 days ago

    Hilarious and amazing. I will try and make something like this. I'm new to this AI stuff, so this will be interesting.
    Good stuff.

  • @arvsito
    @arvsito 1 year ago +1

    It would be very interesting to see this in a web application.

  • @乾淨核能
    @乾淨核能 6 months ago

    What's the GPU requirement to achieve real-time response?
    Thank you.

  • @googlenutzer3384
    @googlenutzer3384 11 months ago

    Is it also possible to adjust it to different languages?

  • @kumar.jayanti9700
    @kumar.jayanti9700 7 months ago

    Hi Kris, where is the GitHub code for this one? I could not locate it in the member GitHub.

  • @skullseason1
    @skullseason1 11 months ago

    How can I do this with the Apple M1? This is soooo awesome, I need to figure it out!

  • @DihelsonMendonca
    @DihelsonMendonca 6 months ago

    That's wonderful. I wish I had the knowledge to implement that on my LLMs in LM Studio.

  • @darik31
    @darik31 8 months ago

    Thanks for sharing this, mate! I wonder if the code is available somewhere? If so, could you please provide a link? Thanks.

  • @darcwader
    @darcwader 8 months ago

    This was more comedy show than tech, lol. Such hilarious responses from Johnny.

  • @khajask8113
    @khajask8113 7 months ago

    Hindi and Telugu language support?

  • @suminlee6576
    @suminlee6576 11 months ago

    Do you have a video showing how to do this step by step? I was going to become a paid member, but I couldn't see a how-to video in your paid channel.

  • @Abhi-l6r1k
    @Abhi-l6r1k 4 months ago

    Where is the code available?
    I want to try it on my local machine.

  • @SonGoku-pc7jl
    @SonGoku-pc7jl 1 year ago

    Thanks, good project. Can Whisper translate my Spanish to English and back to Spanish directly with little change in the code? And do I need to change something in the TTS as well? Thanks!

  • @OdikisOdikis
    @OdikisOdikis 11 months ago

    The predefined answer timing is what makes it not a real conversation. It should answer questions at random timings, like a human who thinks of something and only then answers. Randomizing timings would create more realistic conversations.
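The randomized "thinking" pause suggested above is a one-liner wrapped around the playback call. A tiny sketch; the bounds are illustrative, not tuned:

```python
import random
import time

def humanized_delay(min_s: float = 0.2, max_s: float = 1.2) -> float:
    """Sleep for a random 'thinking' interval before playing the reply,
    so responses don't always land with machine-like regularity.
    Returns the pause actually used."""
    pause = random.uniform(min_s, max_s)
    time.sleep(pause)
    return pause
```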

  • @LadyTink
    @LadyTink 11 months ago

    Kinda feels like something the "rabbit R1" does
    with the whole fast speech to speech thing

  • @ProjCRys
    @ProjCRys 1 year ago +1

    Nice! I was about to create something like this for myself, but I still couldn't use OpenVoice because I keep failing to run it in my venv instead of conda.

    • @Zvezdan88
      @Zvezdan88 1 year ago

      How do you even install OpenVoice?

  • @TanvirsTechTalk
    @TanvirsTechTalk 7 months ago

    How did you actually set it up?

  • @edgarl.mardal8256
    @edgarl.mardal8256 6 months ago

    I'll buy a Patreon membership if you set up Rasa with this model. Since she lacks IQ and structure, I would recommend Rasa and using sales techniques to make her sound more logical. By that I mean spin.

  • @fire17102
    @fire17102 1 year ago +2

    Would love to see some realtime animations to go with the voice, could be a face, but also can be minimalistic (like the R1 rabbit).

    • @wurstelei1356
      @wurstelei1356 1 year ago

      You need a second GPU for this. Let's say you put on Stable Diffusion. Displaying a robot face with emotions would be nice.

    • @leucome
      @leucome 1 year ago

      Try Amica AI. It has a VRM 3D/VTuber character and multiple options for the voice and the LLM backend.

    • @fire17102
      @fire17102 10 months ago

      @@leucome Does it work locally in real time?

    • @fire17102
      @fire17102 10 months ago

      @@wurstelei1356 Again, I think a minimalistic animation would also do the trick, or prerendering the images once and using them in the appropriate sequence in realtime.

    • @leucome
      @leucome 10 months ago +1

      @@fire17102 Yes, it can work in real time locally as long as the GPU is fast and has enough VRAM to run the AI + voice. It can also connect to an online service if required. I uploaded a video where I play Minecraft and talk to the AI at the same time, with all the components running on a single GPU.

  • @JG27Korny
    @JG27Korny 1 year ago

    I run Oobabooga with Silero plus Whisper, but those take forever to make voice from text, especially Silero.

  • @NirmalEleQtra
    @NirmalEleQtra 8 months ago

    Where can I find the whole GitHub repo?

  • @mickelodiansurname9578
    @mickelodiansurname9578 1 year ago

    Can the LLM handle being told in a system prompt that it will be taking in sentences in small chunks, say, cut up into 2-second audio chunks per transcript? Can the Mistral model do that? If so, you might even be able to get it to "butt in" to your prompt. Now that's low latency!

    • @deltaxcd
      @deltaxcd 11 months ago

      No, it can't be told that, but it isn't necessary.
      Just feed it the chunk, and then, if the user speaks before it has managed to reply, restart and feed it more.
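The restart-on-new-speech scheme in this thread amounts to tracking which version of the prompt a reply was generated for, and discarding replies that arrive for an outdated version. An illustrative sketch; cancelling the actual in-flight LLM call is left abstract:

```python
class IncrementalPrompt:
    """Accumulate transcript chunks; whenever a new chunk arrives, any
    in-flight generation for the older, shorter prompt is considered
    stale and should be cancelled and restarted."""

    def __init__(self):
        self.chunks = []
        self.generation = 0   # bumped on every new chunk

    def add_chunk(self, text: str) -> int:
        """Append newly transcribed speech; returns a token that
        identifies the freshest version of the prompt."""
        self.chunks.append(text)
        self.generation += 1
        return self.generation

    def prompt(self) -> str:
        """Full prompt text accumulated so far."""
        return " ".join(self.chunks)

    def is_stale(self, generation: int) -> bool:
        """A reply computed for an older generation must be discarded."""
        return generation != self.generation
```

The inference loop would check `is_stale` before playing a finished reply, dropping it silently if the user spoke again in the meantime.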

  • @normanalc
    @normanalc 7 months ago

    I'd like to get a copy of the script please, this one is really cool! thanks for sharing this.

  • @MegaMijit
    @MegaMijit 11 months ago

    This is awesome, but the voice could use some fine-tuning to sound more realistic.

  • @duffy666
    @duffy666 8 months ago

    I really like it! Is this already on GitHub for members (I could not find it)?

  • @Jesulex82
    @Jesulex82 6 months ago

    Is this a model you can download to talk with the AI? Can you play ro? Does it speak Spanish?

  • @Bigjuergo
    @Bigjuergo 3 months ago

    Why do you lock up open source???

  • @inLofiLife
    @inLofiLife 1 year ago

    Looks interesting, but where is this community link you mentioned? :)

  • @_-JR01
    @_-JR01 11 months ago

    Does OpenVoice perform better than Whisper's TTS?

  • @tag_of_frank
    @tag_of_frank 11 months ago

    Why LM Studio over Oobabooga? What are the pros/cons of each? I have been using Ooba, but wondering why one might switch.

  • @monsterfan-j2m
    @monsterfan-j2m 2 months ago

    I want to make a whispered-speech-to-normal-voice system. Can anyone help me?

  • @64jcl
    @64jcl 11 months ago

    Surely the response time is a function of what rig you are doing this on - an RTX 4080 as you have is no doubt a major contributor here, and I would guess you have a beast of a CPU and high speed memory on a newer motherboard.

  • @deltaxcd
    @deltaxcd 11 months ago

    I think that to decrease latency more, you need to make it speak before the AI finishes its sentence.
    Unfortunately there is no obvious way to feed it a partial prompt, but waiting until it finishes generating the reply takes way too long.
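A common workaround on the output side is to start TTS as soon as the first sentence of the streamed reply is complete, instead of waiting for the full generation. A sketch of the sentence-splitting side, assuming the LLM yields text tokens incrementally:

```python
import re

# A sentence ends at . ! or ? followed by whitespace (illustrative rule).
SENTENCE_END = re.compile(r"([.!?])\s")

def stream_sentences(tokens):
    """Yield complete sentences from a token stream as soon as they end,
    so TTS can speak the first sentence while the LLM is still
    generating the rest of the reply."""
    buf = ""
    for tok in tokens:
        buf += tok
        while (m := SENTENCE_END.search(buf)):
            yield buf[: m.end(1)].strip()   # emit the finished sentence
            buf = buf[m.end():]             # keep the unfinished remainder
    if buf.strip():
        yield buf.strip()                   # flush whatever is left
```

Each yielded sentence would be handed to the TTS engine immediately, overlapping synthesis with the remaining generation.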

  • @musumo1908
    @musumo1908 1 year ago

    Hey, cool... any way to run this self-hosted for an online speech-to-speech setup? I want to drop this into a chatbot project. What membership level gives access to the code? Thanks.

  • @ExploreTogetherYT
    @ExploreTogetherYT 11 months ago

    How much RAM do you need to run Mistral 7B locally? Using GPU or CPU?

  • @mastershake2782
    @mastershake2782 1 year ago

    I am trying to clone a voice from a reference audio file, but despite following the standard process, the output doesn't seem to change according to the reference. When I change the reference audio to a different file, there's no noticeable change in the voice characteristics of the output. The script successfully extracts the tone color embeddings, but the conversion process doesn't seem to reflect these in the final output. I'm using the demo reference audio provided by OpenVoice (male voice), but the output synthesized speech remains in a female voice, typical of the base speaker model. I've double-checked the script, model checkpoints, and audio file paths, but the issue persists. If anyone has encountered a similar problem or has suggestions on what might be going wrong, I would greatly appreciate your insights. Thank you in advance!

    • @UaintDoxinme
      @UaintDoxinme 3 months ago

      Same issue. Did you figure it out?

    • @mastershake2782
      @mastershake2782 3 months ago

      @@UaintDoxinme I did eventually end up fixing this but I'm sorry I don't remember the details. It's been too long.

    • @UaintDoxinme
      @UaintDoxinme 3 months ago

      @@mastershake2782 np dude

  • @TheDailyMemesShow
    @TheDailyMemesShow 6 months ago

    Would this work on the cloud? If so, how?

  • @zedboiii
    @zedboiii 8 months ago +1

    that's some Bethesda level of conversation

  • @RhythmAndRiffsJazz
    @RhythmAndRiffsJazz 1 year ago +1

    Hi, I don't have talk.py; is there another way of running it that I'm missing?

    • @Warz-cx6zk
      @Warz-cx6zk 5 months ago

      It's his own code; you need to become a member of the channel through a subscription and wait for the invite to the GitHub community.

  • @binthem7997
    @binthem7997 1 year ago

    Great tutorial, but I wish you would share gists or your code.

  • @DoNotTredOnMe
    @DoNotTredOnMe 6 months ago

    I'd love to see a video of two AIs conversing with one another.

  • @Nursultan_karazhigit
    @Nursultan_karazhigit 11 months ago +1

    Thanks. Is the Whisper API free?

    • @m0nxt3r
      @m0nxt3r 8 months ago

      It's open source.

  • @MiguelCayazaya
    @MiguelCayazaya 6 months ago

    Thanks. There are those who go to war and become heroes, and those who don't but still write programs.

  • @MetaphoricMinds
    @MetaphoricMinds 11 months ago +1

    What GPU are you running?

  • @josephtilly258
    @josephtilly258 9 months ago

    Really interesting. A lot of it I can't understand because I don't know coding, but speech-to-speech could be a big thing within a few years.

  • @alexander191297
    @alexander191297 11 months ago +1

    I swear on my mother’s grave lol… this AI is hilarious! 😂😂😂

  • @aladinmovies
    @aladinmovies 1 year ago

    Good job. Interesting video

  • @jerryqueen6755
    @jerryqueen6755 10 months ago +1

    How can I install this on my PC? I am a member of the channel.

    • @AllAboutAI
      @AllAboutAI 10 months ago

      Did you get the GitHub invite?

    • @jerryqueen6755
      @jerryqueen6755 10 months ago

      @@AllAboutAI yes, thanks

    • @miaohf
      @miaohf 9 months ago

      @@AllAboutAI I am a member of the channel too; how do I get the GitHub invite?

  • @cmcdonough2
    @cmcdonough2 8 months ago

    This was great 😃👍

  • @weisland2807
    @weisland2807 11 months ago

    It would be funny if you had this in games, like the people on the streets of GTA having convos fueled by something like this. Maybe it's already happening, though; I'm not in the know. Awesomesauce!

  • @laalbujhakkar
    @laalbujhakkar 9 months ago

    How is a system that goes out to OpenAI "local"?

    • @seRko123
      @seRko123 8 months ago

      OpenAI Whisper runs locally.

  • @aestendrela
    @aestendrela 1 year ago +2

    It would be interesting to make a real-time translator. I think it could be very useful. The language barrier would end.

    • @deltaxcd
      @deltaxcd 11 months ago

      Meta did it already; they created a speech-to-speech translation model.

  • @ArnaudMEURET
    @ArnaudMEURET 11 months ago

    Just to paraphrase your models: “Dude ! Are you actually grabbing the gorram scrollbars to scroll down an effing window !? What is this? 1996 ? Ever heard of a mouse wheel? You know it’s even emulated by double drag on track pads, right?” 🤘

  • @researchforumonline
    @researchforumonline 11 months ago

    wow very cool! Thanks

  • @NoLimitYou
    @NoLimitYou 11 months ago +113

    Too bad you take open source and make it closed.

    • @mblend27
      @mblend27 11 months ago +1

      Explain?

    • @NoLimitYou
      @NoLimitYou 11 months ago

      @@mblend27 You take code that's openly available and ask people to become a member to receive the code of what you demo using that open source code. The whole idea of open source is that everyone contributes without putting it behind walls.

    • @Ms.Robot.
      @Ms.Robot. 11 months ago +3

      You can in several ways.

    • @NoLimitYou
      @NoLimitYou 11 months ago +13

      You take open source and make something with that and put it behind a wall.

    • @TheGrobe
      @TheGrobe 11 months ago

      @@mblend27 You make someone pay to access something on GitHub that you composed of open source components.

  • @TheRottweiler_Gemii
    @TheRottweiler_Gemii 8 months ago

    Has anybody gotten this working who has code or a link they can share, please?

  • @jeffsmith9384
    @jeffsmith9384 1 year ago

    I would like to see how a chat room full of different models would problem solve... ChatGPT + Claude + * 7B + Grok + Bard... all in a room, trying to decide what you should have for lunch

  • @TheDailyMemesShow
    @TheDailyMemesShow 6 months ago

    OMG, I just noticed I've watched a gazillion videos of yours.
    Why haven't I subscribed, though?
    I swear I thought I had done it before?
    Something's not adding up here...

  • @aboudezoa
    @aboudezoa 11 months ago

    Running on a 4080 🤣 makes sense the damn thing is very fast.

  • @smthngsmthngsmthngdarkside
    @smthngsmthngsmthngdarkside 11 months ago +2

    So where's the source code mate?
    Or is this just a hook for your newsletter marketing and crap website?

    • @Skystunt123
      @Skystunt123 8 months ago

      Just a hook; the code is not shared.

  • @Ms.Robot.
      @Ms.Robot. 11 months ago +1

    ❤❤❤🎉 nice

  • @Edward_ZS
    @Edward_ZS 1 year ago

    I don't see Dan.mp3.