Now this is what I like to see! Finally an open-source AI voice assistant that's *actually* under ongoing development instead of those 'one-off' projects for educational purposes.
This is getting really close to a modular platform where people can create and customize their own AI assistants. I would love to recreate JARVIS.
Wish you all the best with this project; I hope it gets more contributions.
Adding vision and Voice Activity Detection (VAD) would take it to the next level.
Adding support for streaming STT and TTS would be great as well!
It's on the TODO list.
Would have been great if you could do this totally on-device with open-source tools and models.
Even greater if it could run on-device on a smartphone, such as a OnePlus 11 with 16 GB of RAM.
You can, you just need 16 GB. I made one with Whisper, Bark and Llama 3.1. The Bark example code is god awful and slow, though; to speed it up like 10x you need to batch it. You can get the mic input via a web browser, and Whisper you can run on a CPU. So the layout is: LLM/Bark on GPU, Whisper on CPU.
@@acidlaek I have done something similar. Yosist on GitHub.
Running the transcription and audio generation is very computationally heavy.
@@pliniocastro1546 Nah, Whisper isn't that bad. You can run that on a crappy laptop. Suno's Bark, on the other hand… a GPU is required, period.
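For anyone who wants to try that layout, here is a minimal sketch: Whisper (STT) on CPU, Bark (TTS) on GPU, the LLM call left as a placeholder. It assumes the openai-whisper and suno-ai/bark packages; the model size, file name and llm_fn are illustrative, not anything from this repo:
```python
# Minimal sketch of the layout above: Whisper (STT) on CPU, Bark (TTS) on GPU.
# Assumes openai-whisper (needs ffmpeg on PATH) and suno-ai/bark.
import whisper
from bark import preload_models, generate_audio

stt = whisper.load_model("base", device="cpu")   # Whisper is fine on CPU
preload_models()                                 # Bark uses the GPU when one is available

def voice_turn(wav_path, llm_fn):
    text = stt.transcribe(wav_path)["text"]      # speech -> text (CPU)
    answer = llm_fn(text)                        # your LLM of choice (GPU)
    return generate_audio(answer)                # text -> 24 kHz audio array (GPU)
```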
Thanks for the repo! Great work. I ran it and it worked right out of the box! It does need VAD or something, though. It often hears me say “Thank you” when I have said nothing at all. I have had the same problem before and just gave up and went to a push-to-talk button. I designated the “`” key and it worked well. Wasn't a hassle at all to use PTT with that key on my desktop. A perfect VAD would be nice though!
Thank you, I am working on a solution. There are some thresholds that you can play with in the current implementation, but I will bring more robust VAD to it soon.
What is VAD?
VAD stands for Voice Activity Detection. In STT (Speech-to-Text) projects, VAD is crucial for identifying segments of audio where speech is present versus silence or background noise. STT models are bad about hallucinating noise into words. It's a real pain.
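For reference, a hedged sketch of gating recordings with Silero VAD before they ever reach the STT model, which is the usual fix for those hallucinated “Thank you” transcripts. It assumes torch and the snakers4/silero-vad hub model; the file name is a placeholder:
```python
# Only transcribe clips that actually contain speech.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("recording.wav", sampling_rate=16000)   # placeholder file name
speech = get_speech_timestamps(wav, model, sampling_rate=16000)
if speech:
    print(f"{len(speech)} speech segment(s) found - send to Whisper")
else:
    print("No speech detected - skip transcription entirely")
```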
It does not work right out of the box now; there are lots of errors in it.
wake word implementation would be cool.
great idea.
Great video and updates. Can you please give a guide on adding a voice visualizer along with the TTS? It would be nice to have a visual when the voice speaks.
that's actually a good idea. Will look into it.
@@engineerprompt Wrote you an email on it :)
I don't think it's actually a Gemini Live or OpenAI Advanced Voice Mode alternative. The point of OpenAI AVM is that it takes audio in and puts audio out, so the model can hear sounds, tone, interruptions, emotions and other things, and can output different intonations, voices and sounds to the user, all with just one model.
Amazing material. Would love it if you did a video walking through the code base. Cheers
Brilliant work well done!
Can CoquiTTS 2 be supported? It's open source and offline.
Could a Japanese-to-Japanese agent be created, so it could be used as a language learning tool?
Cool, how would you add a keyword trigger with the Groq API? Wouldn't listening constantly cost you a lot just to detect a keyword?
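The usual workaround is to keep the wake-word detection fully local and only call the paid API once it fires. A hedged sketch with openWakeWord and sounddevice (not part of this project; frame size and threshold follow openWakeWord's 16 kHz convention and may need adjusting):
```python
# Detect the wake word locally; only stream audio to the cloud STT after it fires.
# You may need to run openwakeword.utils.download_models() once to fetch the
# bundled pre-trained models.
import numpy as np
import sounddevice as sd
from openwakeword.model import Model

oww = Model()                            # loads the default wake-word models
SAMPLE_RATE, FRAME = 16000, 1280         # 80 ms of 16-bit mono audio

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16") as stream:
    while True:
        frame, _ = stream.read(FRAME)
        scores = oww.predict(np.squeeze(frame))    # {wakeword_name: score}
        if any(s > 0.5 for s in scores.values()):
            print("Wake word detected - start recording and call the cloud STT")
            break
```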
Awesome project, brother, and useful resources, thank you for sharing it. By the way, please upload a tutorial soon on switching between different voices.
great! your content is very useful and well done! thank you
thank you!
Congrats on the new library, already starred. I'll try to contribute.
thank you, would love that.
I'm curious: since the voice of the AI has this weird "stern/robotic, almost kind of human but isn't" timbre, is there any way to change its timbre/pitch, kind of like in OpenAI's ZVoice Demo?
How are you handling the silence detection? Any VAD models like Silero? Waiting for your offline setup demo as well.
At the moment, it's looking for voice above a certain threshold. Looking to add VAD models soon.
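For reference, "voice above a certain threshold" usually boils down to a simple RMS energy check like the sketch below; the cutoff value is an arbitrary placeholder, not the project's actual number:
```python
# Treat a frame as "speech" only when its RMS energy clears a fixed cutoff.
# Real cutoffs depend entirely on mic gain, which is why a learned VAD is
# the better long-term fix.
import numpy as np

def is_speech(frame: np.ndarray, threshold: float = 500.0) -> bool:
    rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
    return rms > threshold
```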
-> excellent stuff!
Great video! Indeed the latency is low.
Can we get an all local version video?
Yup, working on it.
Great video, awesome project and useful resources, thank you for sharing. I'll try it in my personal projects.
great, glad it's useful.
Awesome work indeed. I ran the project but I am having an error, which is as follows:
Failed to record audio: [WinError 2] The system cannot find the file specified.
How can I fix this issue? I am running it on Windows.
I don't know, but this is not working. I tried Deepgram, OpenAI and Groq, but none of them work. It always gives an authorization failure or something of that sort. I configured the config file accordingly and the .env file with keys, but it seems something is still missing.
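One quick sanity check, assuming the project loads keys from .env via python-dotenv; the variable names below are guesses, so match them to the project's README:
```python
# Print which API keys are actually visible to the process.
import os
from dotenv import load_dotenv

load_dotenv()   # reads .env from the current working directory
for key in ("OPENAI_API_KEY", "GROQ_API_KEY", "DEEPGRAM_API_KEY"):
    print(key, "set" if os.getenv(key) else "MISSING")
```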
great video, may I know the specs of your AI server?
It does not have memory when you start it again after ending a session; I'm going to try to get that working.
Would be great if you could feed it knowledge such as PDFs or FAQs, please.
Looks interesting! Is it possible to use it in multi-user mode (2 or more users at once), or is there a plan to add that?
Can you explain a little more? Do you mean two people talking to the assistant at the same time, with the assistant tracking each user individually?
@@engineerprompt Yes.
We're always just talking about what we could additionally do. I haven't found any video that covers in-case routing:
- RAG in one case,
- retrieval from the web in another case,
- a mixture of both in another case,
- n in case of m,
- else a direct reply.
All the solutions cover exactly one use case, and this is a mess. Only the combination of all of them is useful. How could we address this?
That's coming. I plan to add function calling to show a proof of concept for routing. Stay tuned.
@@engineerprompt, great. Thanks for your reply! 🤗
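For anyone curious, a minimal sketch of that kind of routing, with a trivial keyword check standing in for the planned function/tool call; every name below is an illustrative placeholder, not part of the project:
```python
# Each route gets its own handler; a classifier picks which one runs.
from typing import Callable

def answer_from_rag(q: str) -> str:      # stand-in for a vector-store lookup
    return f"[RAG] closest chunk for: {q}"

def answer_from_web(q: str) -> str:      # stand-in for a web-search tool
    return f"[WEB] top result for: {q}"

def answer_direct(q: str) -> str:        # stand-in for a plain LLM reply
    return f"[LLM] direct answer to: {q}"

def classify_route(q: str) -> str:       # in practice: an LLM tool/function call
    if "today" in q or "latest" in q:
        return "web"
    if "docs" in q or "manual" in q:
        return "rag"
    return "direct"

ROUTES: dict[str, Callable[[str], str]] = {
    "rag": answer_from_rag,
    "web": answer_from_web,
    "direct": answer_direct,
}

def route_and_answer(question: str) -> str:
    return ROUTES.get(classify_route(question), answer_direct)(question)

print(route_and_answer("What's the latest on the release?"))
```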
Brother, MeloTTS is not working properly. I am running it with Docker; after I finished building it, when I start the run command it shows the following: Unable to find image 'melotts:latest' locally. Don't know what to do, please help me with this issue.
Awesome project. You never cease to amaze us! Thanks
thank you :)
Thanks!
Reminds me of Neurosama
Amazing
good one
Nice
Title should be "(voice for English only)".
First!