Now this is what I like to see! Finally an open-source AI voice assistant that's *actually* under ongoing development instead of those 'one-off' projects for educational purposes.
This is getting really close to a modular platform where people can create and customize their own AI assistants. I would love to recreate JARVIS.
Wish you all the best with this project; I hope it gets more contributions.
Adding vision and Voice Activity Detection (VAD) would take it to the next level.
Adding support for streaming STT and TTS would be great as well!
It's on the TODO list.
Would have been great if you could do this totally on-device with open-source tools and models.
Even greater if it could run on-device on a smartphone, such as a OnePlus 11 with 16 GB of RAM.
You can, you just need 16 GB. I made one with Whisper, Bark and Llama 3.1. The Bark example code is god awful and slow, though; to speed it up like 10x you need to batch it. You can get the mic input via a web browser, and Whisper you can run on a CPU. So the layout is: LLM/Bark on GPU, Whisper on CPU.
@@acidlaek I have done something similar. Yosist on GitHub.
Running the transcription and audio generation is very computationally heavy.
@@pliniocastro1546 Nah, Whisper isn't that bad. You can run that on a crappy laptop. Suno's Bark, on the other hand… a GPU is required, period.
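For anyone who wants to try that layout, here is a minimal sketch: Whisper (STT) on CPU, Bark (TTS) on GPU, the LLM call left as a placeholder. It assumes the openai-whisper and suno-ai/bark packages; the model size, file name and llm_fn are illustrative, not anything from this repo:
```python
# Minimal sketch of the layout above: Whisper (STT) on CPU, Bark (TTS) on GPU.
# Assumes openai-whisper (needs ffmpeg on PATH) and suno-ai/bark.
import whisper
from bark import preload_models, generate_audio

stt = whisper.load_model("base", device="cpu")   # Whisper is fine on CPU
preload_models()                                 # Bark uses the GPU when one is available

def voice_turn(wav_path, llm_fn):
    text = stt.transcribe(wav_path)["text"]      # speech -> text (CPU)
    answer = llm_fn(text)                        # your LLM of choice (GPU)
    return generate_audio(answer)                # text -> 24 kHz audio array (GPU)
```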
Thanks for the repo! Great work. I ran it and it worked right out of the box! It does need VAD or something, though. It often hears me say “Thank you” when I have said nothing at all. I have had the same problem before and just gave up and went to a push-to-talk button. I designated the “`” key and it worked well. Wasn't a hassle at all to use PTT with that key on my desktop. A perfect VAD would be nice though!
Thank you, I am working on a solution. There are some thresholds that you can play with in the current implementation, but I will bring more robust VAD to it soon.
What is VAD?
VAD stands for Voice Activity Detection. In STT (Speech-to-Text) projects, VAD is crucial for identifying segments of audio where speech is present versus silence or background noise. STT models are bad about hallucinating noise into words. It's a real pain.
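For reference, a hedged sketch of gating recordings with Silero VAD before they ever reach the STT model, which is the usual fix for those hallucinated “Thank you” transcripts. It assumes torch and the snakers4/silero-vad hub model; the file name is a placeholder:
```python
# Only transcribe clips that actually contain speech.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("recording.wav", sampling_rate=16000)   # placeholder file name
speech = get_speech_timestamps(wav, model, sampling_rate=16000)
if speech:
    print(f"{len(speech)} speech segment(s) found - send to Whisper")
else:
    print("No speech detected - skip transcription entirely")
```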
It does not work right out of the box now; there are lots of errors in it.
wake word implementation would be cool.
great idea.
Great video and updates. Can you please give a guide on adding a voice visualizer along with the TTS? It would be nice to have a visual when the voice speaks.
that's actually a good idea. Will look into it.
@@engineerprompt Wrote you an email on it :)
I don't think it's actually a Gemini Live or OpenAI Advanced Voice Mode alternative. The point of OpenAI AVM is that it takes audio in and puts audio out, so the model can hear sounds, tone, interruptions, emotions and other things, and can output different intonations, voices and sounds to the user, all with just one model.
Amazing material. Would love it if you did a video walking through the code base. Cheers
Brilliant work well done!
Can CoquiTTS 2 be supported? It's open source and offline.
Could a Japanese-to-Japanese agent be created, so it could be used as a language learning tool?
Cool, how would you add a keyword trigger with the Groq API? Wouldn't listening constantly cost you a lot just to detect a keyword?
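The usual workaround is to keep the wake-word detection fully local and only call the paid API once it fires. A hedged sketch with openWakeWord and sounddevice (not part of this project; frame size and threshold follow openWakeWord's 16 kHz convention and may need adjusting):
```python
# Detect the wake word locally; only stream audio to the cloud STT after it fires.
# You may need to run openwakeword.utils.download_models() once to fetch the
# bundled pre-trained models.
import numpy as np
import sounddevice as sd
from openwakeword.model import Model

oww = Model()                            # loads the default wake-word models
SAMPLE_RATE, FRAME = 16000, 1280         # 80 ms of 16-bit mono audio

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16") as stream:
    while True:
        frame, _ = stream.read(FRAME)
        scores = oww.predict(np.squeeze(frame))    # {wakeword_name: score}
        if any(s > 0.5 for s in scores.values()):
            print("Wake word detected - start recording and call the cloud STT")
            break
```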
Awesome project, brother, and useful resources, thank you for sharing it. By the way, please upload a tutorial soon on switching between different voices.
great! your content is very useful and well done! thank you
thank you!
Congrats on the new library, already starred. I'll try to contribute.
thank you, would love that.
I'm curious: since the voice of the AI has this weird "stern/robotic, almost kind of human but isn't" timbre, is there any way to change its timbre/pitch, kind of like in OpenAI's ZVoice Demo?
How are you handling the silence detection? Any VAD models like Silero? Waiting for your offline setup demo as well.
At the moment, it's looking for voice above a certain threshold. Looking to add VAD models soon.
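For reference, "voice above a certain threshold" usually boils down to a simple RMS energy check like the sketch below; the cutoff value is an arbitrary placeholder, not the project's actual number:
```python
# Treat a frame as "speech" only when its RMS energy clears a fixed cutoff.
# Real cutoffs depend entirely on mic gain, which is why a learned VAD is
# the better long-term fix.
import numpy as np

def is_speech(frame: np.ndarray, threshold: float = 500.0) -> bool:
    rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
    return rms > threshold
```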
-> excellent stuff!
Great video! Indeed the latency is low.
Can we get an all local version video?
Yup, working on it.
Great video, awesome project and useful resources, thank you for sharing. I'll try it in my personal projects.
great, glad it's useful.
Awesome work indeed. I ran the project but I am having an error, which is as follows:
Failed to record audio: [WinError 2] The system cannot find the file specified.
How can I fix this issue? I am running it on Windows.
I don't know, but this is not working. I tried Deepgram, OpenAI and Groq, but none of them work. It always gives an authorization failure or something of that sort. I configured the config file accordingly and the .env file with keys, but it seems something is still missing.
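One quick sanity check, assuming the project loads keys from .env via python-dotenv; the variable names below are guesses, so match them to the project's README:
```python
# Print which API keys are actually visible to the process.
import os
from dotenv import load_dotenv

load_dotenv()   # reads .env from the current working directory
for key in ("OPENAI_API_KEY", "GROQ_API_KEY", "DEEPGRAM_API_KEY"):
    print(key, "set" if os.getenv(key) else "MISSING")
```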
great video, may I know the specs of your AI server?
It does not have memory when you start it again after ending a session; I'm going to try to get that working.
Would be great if you could feed it knowledge such as PDFs or FAQs, please.
Looks interesting! Is it possible to use it in multi-user mode (2 or more users at once), or is there a plan to add that?
Can you explain a little more? Do you mean two people talking to the assistant at the same time, with the assistant tracking each user individually?
@@engineerprompt Yes.
We're always just talking about what we could additionally do. I haven't found any video that covers in-case routing:
- RAG in one case,
- retrieval from the web in another case,
- a mixture of both in another case,
- n in case of m,
- else a direct reply.
All the solutions cover exactly one use case, and this is a mess. Only the combination of all of them is useful. How could we address this?
That's coming. I plan to add function calling to show a proof of concept for routing. Stay tuned.
@@engineerprompt, great. Thanks for your reply! 🤗
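For anyone curious, a minimal sketch of that kind of routing, with a trivial keyword check standing in for the planned function/tool call; every name below is an illustrative placeholder, not part of the project:
```python
# Each route gets its own handler; a classifier picks which one runs.
from typing import Callable

def answer_from_rag(q: str) -> str:      # stand-in for a vector-store lookup
    return f"[RAG] closest chunk for: {q}"

def answer_from_web(q: str) -> str:      # stand-in for a web-search tool
    return f"[WEB] top result for: {q}"

def answer_direct(q: str) -> str:        # stand-in for a plain LLM reply
    return f"[LLM] direct answer to: {q}"

def classify_route(q: str) -> str:       # in practice: an LLM tool/function call
    if "today" in q or "latest" in q:
        return "web"
    if "docs" in q or "manual" in q:
        return "rag"
    return "direct"

ROUTES: dict[str, Callable[[str], str]] = {
    "rag": answer_from_rag,
    "web": answer_from_web,
    "direct": answer_direct,
}

def route_and_answer(question: str) -> str:
    return ROUTES.get(classify_route(question), answer_direct)(question)

print(route_and_answer("What's the latest on the release?"))
```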
Brother, MeloTTS is not working properly. I am running it with Docker; after I finished building it, when I start the run command it shows the following: Unable to find image 'melotts:latest' locally. Don't know what to do, please help me with this issue.
Awesome project. You never cease to amaze us! Thanks
thank you :)
Thanks!
Reminds me of Neurosama
Amazing
good one
Nice
Title should be "(voice for English only)".
First!