More suggestions: add a "thought completed" detection layer that decides when the user has finished speaking based on the stt input so far (based upon context and natural pauses and such). It will auto-submit the text to the AI backend. Then have the app immediately begin listening to the microphone at the conclusion of playback of the AI's tts-converted response. Yes, sometimes the AI will interrupt the speaker if they hadn't entirely finished what they wanted to say, but that is how real human conversations work when one person perceives the other has finished their thought and chooses to respond. Also, if the user says "What?" or "(could you) Repeat that"" or "please repeat?" or "Say again?" Or "Sorry I missed that." the system should simply play the last WAV file again without going for another round trip to the AI inference server and doing another tts conversion of the text. Reserve the Control-C for stopping and starting this continuous auto-voice recording and response process instead. This will shave a many precious milliseconds of latency and make the conversation much more natural and less like using a walkie-talkie.
I have better idea to feed it partial prompt without waiting user to finish and it starts generating response if there is a slightest pause if user continues talking more text is added to the prompt and output is regenerated. If user talks on top of speaking AI. Ai terminates its response and continues listening this will improve things 2 fold because moel will have a chance to process partial prompt and it will reduce time required to process the prompt later if we combine that to now wasting for full reply conversation will be completely natural there is no need for any of that say again because AI will do that by itself if asked
Thats supah cool, I actually built something almost exactly like this yesterday. I get about the same performance. The hard part is needing to figure out threading/process pools/asyncio. To get that latency down. I used small instead of base. I think I get about the same response or better.
Awesome! Time to replace my slow speech to speech code using openAI. Also, added eleven labs for a bit of a comedic touch. Thanks for putting this together.
In the US we have this concept, if you watch a football game which is notorious for having a shizload of commercials (ie latency), if you start watching the game 30 minutes late but from the beginning, you can skip most of the commercials. If you just shift the latency to the beginning, 15 seconds of "loading" would probably be sufficient enough for a 5-10 minute conversation between the two chatbots, and also avoid loops by having a third party observer who reviews the last 5 messages and determines if the conversation has gone "stale" and interjects a new idea into one of the interlocutors.
I'm daunted by the idea of setting up these development systems just to use a model. Any chance people can bundle them into one big executable for Windows and iOS? I sure would love to just load-and-go.
If you are fine venturing into c# or c++ then I know how you can improve the latency and create a single .exe that includes all of your different parts here, including using local models for the whisper voice recognition. I have done this myself using LLama sharp for runnign the GGUF file, and then embedding all external python into a batch process which it calls.
do you know of any offline/local way to do translations? i've been searching but haven't found a way to do local translations of video or audio using LargeLanguageModels
I wonder if you are (or can, if not) caching the processed .mp3 voice model after the speech engine processes it and turns it into partials. That would cut out a lot of latency if it didn't need to process those 20 seconds of recorded voice audio every time. Right now it's pretty fast but the latency still sounds more like they are using walkie talkies than speaking on a phone.
it could go way further by using the native libs and dropping all the python-based wrappers that pass data between stages using files and that copy, copy, copy and recopy data all the time. For example llama.cpp is clearly recognizable in the lower layers, all the tunable parameters match it. I don't know for openvoice for example however, but the state the presenter arrived at shows that we're pretty close to reaching a DIY conversational robot, which is pretty cool.
@@JohnSmith762A11B not necessarily that, but I'm speaking about the underlying components that are used here. In fact if you look, this is essentially python code built as wrapper on top of other parts that already run natively. The llama.cpp server for example is used here apparently. And once wrapped into layers and layers, you see that it becomes heavy to transport contents from one layer to another (particularly when passing via files, but even memcpy is expensive). It might even be possible that some elements are re-loaded from scratch and re-initialized after each sentence. The python script here appears to be mostly a wrapper around all such components,working like a shell script recording input from the microphone to a file then sending it to openvoice, then send that output to a file, then load another component with that file, etc... This is just like a shell script working with files and heavy initialization at every step. Dropping all that layer and directly using the native APIs of the various libs and components would be way more efficient. And it's very possible that past a point the author will discover that Python is not needed at all, which could suddenly offer more possibilities for lighter embedded processing.
I've been trying to figure out how to do this. Great job. I want to support your work and get this up and running for myself, but is RUclips membership the only option?
Hi, I´d like to know the computer specs required to run your speech to speech system, I m quite interested but I need to know first I my computer can handle it. thanks.
Interesting, this is similar to a small project I made for myself. But instead of a chatbot conversation, the whisper output is fed into SAM (yes, the funny robot voice) and sent to an audio output. Basically makes SAM say whatever I say with a slight delay. I'm chopping up the speech into small segments so it can start transcribing while I speak for longer, but that introduces occasional weirdness, but I'm fine with that.
Fascinating. Any chance you could upgrade it so that specific voices could be used and a recording made automatically, Could make for some interesting Biden v Trump debates
Yes, this code package is not pulling APIs (which is why the latency is low), so it doesn't need internet connection. Downside is, it cannot access info outside of it's core dataset, so no current events or anything like that.
once again, requirements won't install kicking off a couple hours of digging through versions before anything might work. Wish it was a little more standard to clearly denote python version being used and the package manager being used. Neither conda nor venv would work for me. Beyond that, project is very interesting as most are...
thanks, good project. Whisper can translate my spanish to english to spanish directly with little change in code? and tts i need change something also? thanks!
the predefined answer timing is what makes it not real conversation. It should spit answer questions at random timings like any human can think of something and only then answer. Randomizing timings would create more realistic conversations
Nice! I was about to create something like this for myself but I still couldn't use OpenVoice because I keep failing to run it on my venv instead of conda.
Jeg kjøper meg patron medlemskap om du setter opp rasa med denne modellen, ettersom hun mangler IQ og structur vil jeg anbefale rasa og bruke salgs teknikk for å få henne til å høres mer logisk ut. Med det mener jeg spinning.
@@wurstelei1356Again, I think a minimalistic animation would also do the trick , or prerendeing the images once, and using them in the appropriate sequence in realtime.
@@fire17102 Yes it can work in real-time locally as long as the GPU is fast and has enough vram to run the AI+Voice. It can also connect to online service if required. I uploaded a video where I play Minecraft and talk to the AI at same time with all the component running on a single GPU.
can the llm handle being told in a system prompt that it will be taking in the sentences in small chunks? say cut up into 2 second audio chunks per transcript. Can the mistral model do that? Anyway if so you might even be able to get it to 'butt in' to your prompt. now thats low latency!
Surely the response time is a function of what rig you are doing this on - an RTX 4080 as you have is no doubt a major contributor here, and I would guess you have a beast of a CPU and high speed memory on a newer motherboard.
I think to decrease latency more you need to make it speak before AI finishes its sentence unfortunately there is no obvious way to feed it partial prompt but waiting until it will finish generating reply takes asy too long
Hey cool…anyway to run this self hosted for an online speech to speech setup? Want to drop this into a chatbot project…what level membership to access the code thanks
I am trying to clone a voice from a reference audio file, but despite following the standard process, the output doesn't seem to change according to the reference. When I change the reference audio to a different file, there's no noticeable change in the voice characteristics of the output. The script successfully extracts the tone color embeddings, but the conversion process doesn't seem to reflect these in the final output. I'm using the demo reference audio provided by OpenVoice (male voice), but the output synthesized speech remains in a female voice, typical of the base speaker model. I've double-checked the script, model checkpoints, and audio file paths, but the issue persists. If anyone has encountered a similar problem or has suggestions on what might be going wrong, I would greatly appreciate your insights. Thank you in advance!
would be funny if you had this in games - like the people on the streets of gta having convos fueled by somthing like this. maybe it's already happening tho, i'm not in the know. awesomesauce!
Just to paraphrase your models: “Dude ! Are you actually grabbing the gorram scrollbars to scroll down an effing window !? What is this? 1996 ? Ever heard of a mouse wheel? You know it’s even emulated by double drag on track pads, right?” 🤘
@@mblend27 You take code openly available, and ask people to become a member, to receive the code of what you demo using the open source code. The whole idea of open source is that everyone contributes without putting it behind walls
I would like to see how a chat room full of different models would problem solve... ChatGPT + Claude + * 7B + Grok + Bard... all in a room, trying to decide what you should have for lunch
OMG, I just noticed I've watched gazillion videos of yours. Why haven't subscribed, though? I swear I thought I had done it before? Something's not adding up here...
More suggestions: add a "thought completed" detection layer that decides when the user has finished speaking based on the stt input so far (based upon context and natural pauses and such). It will auto-submit the text to the AI backend. Then have the app immediately begin listening to the microphone at the conclusion of playback of the AI's tts-converted response. Yes, sometimes the AI will interrupt the speaker if they hadn't entirely finished what they wanted to say, but that is how real human conversations work when one person perceives the other has finished their thought and chooses to respond. Also, if the user says "What?" or "(could you) Repeat that"" or "please repeat?" or "Say again?" Or "Sorry I missed that." the system should simply play the last WAV file again without going for another round trip to the AI inference server and doing another tts conversion of the text. Reserve the Control-C for stopping and starting this continuous auto-voice recording and response process instead. This will shave a many precious milliseconds of latency and make the conversation much more natural and less like using a walkie-talkie.
Nice, let me give one suggestion for your suggestion: add a random choice with a 50% chance to either replay the audio or send your input to the backend.
So it'd be sending the STT input again and again with every new word detected, rather than just at the end of a sentence or message?
I have a better idea: feed it a partial prompt without waiting for the user to finish, and have it start generating a response as soon as there is the slightest pause. If the user continues talking, more text is added to the prompt and the output is regenerated. If the user talks over the speaking AI, the AI terminates its response and continues listening.
This improves things two-fold, because the model gets a chance to process the partial prompt early, which reduces the time needed to process the full prompt later.
If we combine that with not waiting for the full reply, the conversation will feel completely natural.
There is no need for any of that "say again" logic either, because the AI will do that by itself if asked.
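A rough sketch of that incremental-prompt loop, assuming an OpenAI-compatible local server (the LM Studio endpoint shown in the video listens on port 1234 by default, though that's an assumption here) and a hypothetical mic_has_speech() voice-activity check:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

def respond_incrementally(transcript_chunks, mic_has_speech):
    prompt = ""
    for chunk in transcript_chunks:        # keeps yielding text while the user talks
        prompt += " " + chunk
        stream = client.chat.completions.create(
            model="local-model",           # LM Studio generally ignores the name
            messages=[{"role": "user", "content": prompt.strip()}],
            stream=True,
        )
        reply = ""
        for event in stream:
            if mic_has_speech():           # the user talked over the AI:
                break                      # abandon this reply, regenerate later
            reply += event.choices[0].delta.content or ""
        else:
            yield reply                    # only replies that finished get spoken
```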
very impressive! I'd love to see them implement this in smartphones for real-time translation when visiting foreign countries / restaurants.
S24 Ultra?
there are models that do speech-to-speech translation
Some of the best work and fun that I have seen so far. Can't wait to try on my own. Keep up the great work!!
That's supah cool, I actually built something almost exactly like this yesterday, and I get about the same performance. The hard part is figuring out threading/process pools/asyncio to get that latency down. I used small instead of base; I think I get about the same response or better.
Hi! Very impressive!! Do you have a GitHub repo to share your code?
Can we see your code, please?
I'm interested in it as well.
Awesome! Time to replace my slow speech to speech code using openAI. Also, added eleven labs for a bit of a comedic touch. Thanks for putting this together.
How much did it cost you? For integrating eleven labs?
I have tried open voice and bark, but VITS by far makes the most natural sounding voices.
I mean, this is dependent on your hardware... Are the specs for this "inference server" posted anywhere?
In the US we have this concept: if you watch a football game, which is notorious for having a shizload of commercials (i.e., latency), and you start watching 30 minutes late but from the beginning, you can skip most of the commercials. If you just shift the latency to the beginning, 15 seconds of "loading" would probably be enough for a 5-10 minute conversation between the two chatbots. You could also avoid loops by having a third-party observer who reviews the last 5 messages, decides whether the conversation has gone "stale", and injects a new idea into one of the interlocutors.
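A quick sketch of that observer idea, with ask_llm() standing in as a hypothetical wrapper around whatever chat endpoint the project uses and history as a plain list of message strings:

```python
def maybe_interject(history, ask_llm):
    # Review the last five messages and inject a fresh topic if things are stale.
    recent = history[-5:]
    verdict = ask_llm(
        "You are an observer. Answer only YES or NO: has this conversation "
        "become repetitive or stale?\n\n" + "\n".join(recent)
    )
    if verdict.strip().upper().startswith("YES"):
        new_idea = ask_llm(
            "Suggest, in one sentence, a new concrete topic these two speakers "
            "have not covered yet."
        )
        history.append("Moderator: " + new_idea)   # fed to one of the bots next turn
    return history
```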
I'm daunted by the idea of setting up these development systems just to use a model. Any chance people can bundle them into one big executable for Windows and iOS? I sure would love to just load-and-go.
Just subscribed! can't wait to get my hands on it, looks super cool!
If you are fine venturing into C# or C++, then I know how you can improve the latency and create a single .exe that includes all of your different parts here, including using local models for the Whisper voice recognition. I have done this myself using LLamaSharp for running the GGUF file, and then embedding all the external Python into a batch process which it calls.
@@matthewfuller9760 I should put it there, actually. I have been jumping between projects lately without sharing much. Will send a link when it is up.
Do you know of any offline/local way to do translations? I've been searching but haven't found a way to do local translations of video or audio using large language models.
There is a program, "Subtitle Edit", which can do that.
Hello, could you please tell us your CPU, GPU, RAM, etc.?
I wonder if you are caching (or could cache) the processed .mp3 voice model after the speech engine processes it and turns it into partials. That would cut out a lot of latency if it didn't need to process those 20 seconds of recorded voice audio every time. Right now it's pretty fast, but the latency still sounds more like they are using walkie-talkies than speaking on a phone.
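One hedged way to do that caching, with extract_voice_embedding() as a placeholder for whatever call actually analyses the reference clip (e.g. OpenVoice's tone-colour extraction), keyed by a hash of the reference file so the 20-second clip is only processed once:

```python
import hashlib
import pickle
from pathlib import Path

CACHE_DIR = Path("voice_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_embedding(reference_wav, extract_voice_embedding):
    digest = hashlib.sha256(Path(reference_wav).read_bytes()).hexdigest()
    cache_file = CACHE_DIR / f"{digest}.pkl"
    if cache_file.exists():
        return pickle.loads(cache_file.read_bytes())     # skip the slow analysis
    embedding = extract_voice_embedding(reference_wav)   # the expensive step
    cache_file.write_bytes(pickle.dumps(embedding))
    return embedding
```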
It could go way further by using the native libs and dropping all the Python-based wrappers that pass data between stages using files and that copy, copy, copy and recopy data all the time. For example, llama.cpp is clearly recognizable in the lower layers; all the tunable parameters match it. I don't know about OpenVoice, for example, but the state the presenter arrived at shows that we're pretty close to a DIY conversational robot, which is pretty cool.
@@levieux1137 By native libs, you mean the system tts speech on say Windows and macOS?
@@JohnSmith762A11B Not necessarily that, but I'm speaking about the underlying components used here. In fact, if you look, this is essentially Python code built as a wrapper on top of other parts that already run natively. The llama.cpp server, for example, is apparently used here. Once wrapped into layers and layers, it becomes heavy to transport contents from one layer to another (particularly when passing via files, but even memcpy is expensive). It might even be that some elements are re-loaded from scratch and re-initialized after each sentence. The Python script here appears to be mostly a wrapper around all these components, working like a shell script: recording input from the microphone to a file, sending it to OpenVoice, sending that output to a file, loading another component with that file, and so on. This is just like a shell script working with files and heavy initialization at every step. Dropping that layer and directly using the native APIs of the various libs and components would be way more efficient. And it's very possible that past a point the author will discover that Python is not needed at all, which could suddenly offer more possibilities for lighter embedded processing.
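Even without going fully native, some of those file round trips can be trimmed from inside Python: Whisper's transcribe() accepts a NumPy float32 array directly, so the microphone buffer never has to touch disk before transcription. A small sketch (sounddevice and the "base" model are assumptions; the original talk.py may be wired differently):

```python
import sounddevice as sd
import whisper

model = whisper.load_model("base")

def listen_and_transcribe(seconds=5, sample_rate=16000):
    audio = sd.rec(int(seconds * sample_rate), samplerate=sample_rate,
                   channels=1, dtype="float32")
    sd.wait()
    # Hand the in-memory buffer straight to Whisper: no temp WAV, no re-read.
    return model.transcribe(audio.flatten(), fp16=False)["text"]
```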
I've been trying to figure out how to do this. Great job. I want to support your work and get this up and running for myself, but is YouTube membership the only option?
What's the GPU you are using for it?
wow! I would love to have such an assistant to practice languages. The “python hub” code, do you plan to share it at some point?
LOL , love the video bro. "gimme a second while i hack this shyt"
I am still using Tortoise but Open Voice seems to be promising! 😊 Thanks for this video!! 🎉🎉🎉
is there an updated version of this?
Hi, I'd like to know the computer specs required to run your speech-to-speech system. I'm quite interested, but first I need to know if my computer can handle it. Thanks.
Hey Kris - that is awesome. I like it very much. Great that you do this open source stuff. Very cool 😎.
Interesting, this is similar to a small project I made for myself. But instead of a chatbot conversation, the Whisper output is fed into SAM (yes, the funny robot voice) and sent to an audio output. Basically it makes SAM say whatever I say with a slight delay. I'm chopping the speech into small segments so it can start transcribing while I keep speaking, which introduces occasional weirdness, but I'm fine with that.
no api = pure genius
I have the uncensored model, the same one, and when I ask it to list curse words it says it can't do that. ???
Lmao that’s annoying
I know about this for more than a year now and it still blows my mind. wtf
Fascinating. Any chance you could upgrade it so that specific voices could be used and a recording made automatically? Could make for some interesting Biden v Trump debates.
Can you do the exact same thing for a non-coder? My LM Studio also doesn't look like yours; is it the updates?
Very cool, low-latency voice. Thanks for sharing; I watch all your videos and look forward to the next one.
I wonder if you could directly convert embeddings to speech to skip text inference
Do you have plans for voice changer for video games ?
Is there any way to give the Python process 90% of system resources so it would be faster?
So if you disconnect your computer from the Internet, will it still work?
Yes, this code package is not calling out to APIs (which is why the latency is low), so it doesn't need an internet connection. The downside is that it cannot access info outside of its core dataset, so no current events or anything like that.
This is great. But personally I think a speech recognition with push to talk or push to toggle talk is most useful.
I am a member and I don't see your GitHub repo for this project. Can you please share it with me?
Just earned yourself a sub sir!
How can I try this on my computer? I couldn't find talk.py in the GitHub code.
It's his own code, and you need to become a member and wait for an invite to the GitHub community.
Once again, the requirements won't install, kicking off a couple of hours of digging through versions before anything might work. I wish it were a little more standard to clearly state the Python version and the package manager being used. Neither conda nor venv would work for me. Beyond that, the project is very interesting, as most are...
Hilarious and amazing. I will try to make something like this. I'm new to this AI stuff, so this will be interesting.
Good stuff.
It will be very interesting to see this in a web application
what's the GPU requirement to achieve real time response?
thank you
Is it also possible to adjust to different languages?
Hi Kris, where is the GitHub code for this one? I could not locate it in the member GitHub.
How can I do this with the Apple M1? This is soooo awesome, I need to figure it out!
That's wonderful. I wish I had the knowledge to implement that on my LLMs in LM Studio.
Thanks for sharing this mate! I wonder if the code is available somewhere? If so, could you please provide a link? Thanks
This was more comedy show than tech, lol. Such hilarious responses from Johnny.
Hindi and Telugu language support..?
Do you have a video showing how to do this step by step? I was going to become a paid member, but I couldn't see a how-to video in your paid channel.
Where is the code available?
I want to try it on my local machine.
Thanks, good project. Can Whisper translate my Spanish to English (and back to Spanish) directly with little change to the code? And do I need to change something in the TTS as well? Thanks!
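For what it's worth, Whisper's built-in "translate" task only goes from other languages into English, so Spanish speech to English text is a one-argument change; going back from English to Spanish would need another step (for example, asking the LLM to answer in Spanish, plus a Spanish-capable TTS voice). A small sketch, with the filename purely illustrative:

```python
import whisper

model = whisper.load_model("base")

# Spanish audio in, English text out:
english_text = model.transcribe("pregunta_es.wav", task="translate")["text"]

# Spanish audio in, Spanish text out (plain transcription):
spanish_text = model.transcribe("pregunta_es.wav", language="es")["text"]
```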
The predefined answer timing is what makes it not feel like a real conversation. It should give its answers at random timings, the way a human thinks of something and only then answers. Randomizing the timing would create more realistic conversations.
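A tiny sketch of that idea: insert a short, variable "thinking" pause before playback instead of answering on a fixed schedule (the delay range is an arbitrary assumption):

```python
import random
import time

def humanized_delay(min_s=0.3, max_s=1.5):
    # Pretend to think for a moment before the TTS reply starts playing.
    time.sleep(random.uniform(min_s, max_s))
```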
Kinda feels like something the "rabbit R1" does
with the whole fast speech to speech thing
Nice! I was about to create something like this for myself but I still couldn't use OpenVoice because I keep failing to run it on my venv instead of conda.
How do you even install OpenVoice?
How did you actually set it up?
I'll buy a patron membership if you set up Rasa with this model. Since she lacks IQ and structure, I would recommend Rasa, plus using sales techniques to make her sound more logical. By that I mean spin.
Would love to see some realtime animations to go with the voice, could be a face, but also can be minimalistic (like the R1 rabbit).
You need a second GPU for this. Let's say you run Stable Diffusion on it. Displaying a robot face with emotions would be nice.
Try Amica AI. It has a VRM 3D/VTuber character and multiple options for the voice and the LLM backend.
@@leucome does it work locally in real time?
@@wurstelei1356 Again, I think a minimalistic animation would also do the trick, or prerendering the images once and using them in the appropriate sequence in real time.
@@fire17102 Yes, it can work in real time locally as long as the GPU is fast and has enough VRAM to run the AI + voice. It can also connect to an online service if required. I uploaded a video where I play Minecraft and talk to the AI at the same time, with all the components running on a single GPU.
I run oobabooga with Silero plus Whisper, but those take forever to make voice from text, especially Silero.
Where can I find the whole GitHub repo?
Can the LLM handle being told in a system prompt that it will be taking in sentences in small chunks, say cut up into 2-second audio chunks per transcript? Can the Mistral model do that? If so, you might even be able to get it to 'butt in' to your prompt. Now that's low latency!
No, it can't be told that, but it isn't necessary.
Just feed it the chunk, and if the user speaks again before it manages to reply, restart and feed it more.
I'd like to get a copy of the script please, this one is really cool! thanks for sharing this.
This is awesome, but the voice could use some fine-tuning to sound more realistic.
I really like it! Is this already on GitHub for members (I could not find it)?
Is this a model you download so you can talk with the AI? Can you play ro? Does it speak Spanish?
Why do you lock up open source???
looks interesting but where is this community link you mentioned? :)
does openvoice perform better than whisper's TTS?
Why LM Studio over OogaBooga? What are the pros/cons of them? I have been using Ooga, but wondering why one might switch.
I want to make a whispered-speech-to-normal-voice system, can anyone help me?
Surely the response time is a function of what rig you are doing this on - an RTX 4080 as you have is no doubt a major contributor here, and I would guess you have a beast of a CPU and high speed memory on a newer motherboard.
I think that to decrease latency further, you need to make it speak before the AI finishes its reply.
Unfortunately there is no obvious way to feed it a partial prompt, but waiting until it finishes generating the reply takes way too long.
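Streaming does make the first half possible: stream tokens from the LLM and hand each completed sentence to the TTS as soon as it ends, instead of waiting for the whole answer. A sketch assuming the same OpenAI-compatible local endpoint as in the earlier sketch and a hypothetical speak() TTS helper:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

def stream_and_speak(user_text, speak):
    stream = client.chat.completions.create(
        model="local-model",
        messages=[{"role": "user", "content": user_text}],
        stream=True,
    )
    sentence = ""
    for event in stream:
        sentence += event.choices[0].delta.content or ""
        if sentence.rstrip().endswith((".", "!", "?")):
            speak(sentence.strip())    # TTS starts while generation continues
            sentence = ""
    if sentence.strip():
        speak(sentence.strip())        # flush whatever is left at the end
```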
Hey, cool… any way to run this self-hosted for an online speech-to-speech setup? I want to drop this into a chatbot project… what membership level do I need to access the code? Thanks.
How much RAM do you need to run Mistral 7B locally? Using GPU or CPU?
I am trying to clone a voice from a reference audio file, but despite following the standard process, the output doesn't seem to change according to the reference. When I change the reference audio to a different file, there's no noticeable change in the voice characteristics of the output. The script successfully extracts the tone color embeddings, but the conversion process doesn't seem to reflect these in the final output. I'm using the demo reference audio provided by OpenVoice (male voice), but the output synthesized speech remains in a female voice, typical of the base speaker model. I've double-checked the script, model checkpoints, and audio file paths, but the issue persists. If anyone has encountered a similar problem or has suggestions on what might be going wrong, I would greatly appreciate your insights. Thank you in advance!
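One quick way to narrow that down is to check whether the extraction step is actually producing different embeddings for different reference clips; if the two vectors below come out nearly identical, the problem is upstream of the converter. get_embedding() is a placeholder for whatever extractor call your script uses (e.g. OpenVoice's tone-colour extractor):

```python
import numpy as np

def embeddings_differ(ref_a, ref_b, get_embedding, tol=1e-3):
    ea = np.asarray(get_embedding(ref_a), dtype=np.float32).ravel()
    eb = np.asarray(get_embedding(ref_b), dtype=np.float32).ravel()
    cos = float(np.dot(ea, eb) / (np.linalg.norm(ea) * np.linalg.norm(eb) + 1e-8))
    print(f"cosine similarity between reference embeddings: {cos:.4f}")
    return cos < 1.0 - tol   # ~1.0 means the references are not being distinguished
```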
Same issue. Did you figure it out?
@@UaintDoxinme I did eventually end up fixing this but I'm sorry I don't remember the details. It's been too long.
@@mastershake2782 np dude
Would this work on the cloud? If so, how?
that's some Bethesda level of conversation
Hi, I don't have talk.py; is there another way of running it that I'm missing?
It's his own code; you need to become a member of the channel through a subscription and wait for the invite to the GitHub community.
Great tutorial but I wish you could share gists or share your code
I'd love to see a video of two AIs conversing with one another.
Thanks. Is the Whisper API free?
it's open source
Thanks. There are those who go to war and become heroes, and those who don't but still write programs.
What GPU are you running?
4080 RTX!
Really interesting. A lot of it I can't understand because I don't know coding, but speech-to-speech could be a big thing within a few years.
I swear on my mother’s grave lol… this AI is hilarious! 😂😂😂
Good job. Interesting video
How can I install this on my PC? I am a member of the channel
did you get the gh invite?
@@AllAboutAI yes, thanks
@@AllAboutAI I am a member of the channel too; how do I get the GH invite?
This was great 😃👍
Would be funny if you had this in games - like the people on the streets of GTA having convos fueled by something like this. Maybe it's already happening though, I'm not in the know. Awesomesauce!
How is a system that goes out to OpenAI "local"????
OpenAI's Whisper runs locally.
It would be interesting to make a real-time translator. I think it could be very useful. The language barrier would end.
Meta did it already; they created a speech-to-speech translation model.
Just to paraphrase your models: “Dude ! Are you actually grabbing the gorram scrollbars to scroll down an effing window !? What is this? 1996 ? Ever heard of a mouse wheel? You know it’s even emulated by double drag on track pads, right?” 🤘
wow very cool! Thanks
Too bad you take open source and make it closed.
Explain?
@@mblend27 You take openly available code and ask people to become a member to receive the code of what you demo using that open-source code. The whole idea of open source is that everyone contributes without putting it behind walls.
You can in several ways.
You take open source and make something with that and put it behind a wall.
@@mblend27 You make someone pay to access something on GitHub that you built from open-source components.
Has anybody got this working and has code or a link they can share, please?
I would like to see how a chat room full of different models would problem solve... ChatGPT + Claude + * 7B + Grok + Bard... all in a room, trying to decide what you should have for lunch
OMG, I just noticed I've watched a gazillion videos of yours.
Why haven't I subscribed, though?
I swear I thought I had done it before?
Something's not adding up here...
Running on a 4080 🤣 makes sense, the damn thing is very fast.
So where's the source code mate?
Or is this just a hook for your newsletter marketing and crap website?
Just a hook, the code is not shared.
❤❤❤🎉 nice
I don't see Dan.mp3.