This was on my to-do list for a bigger project. I'm highly confident I should be able to plug this into my application. You rock, and I think you have saved me a lot of time.
Very interesting! Eager for the next video about local LLM integration. Subscribed!
I want to implement this using Verbi: a system that:
● Engages in real-time voice conversations with customers, mimicking the style and knowledge of a product seller
● Understands and responds to complex product queries with low latency
Any tips?
@@CryptoMaN_Rahul Is it regarding Grid?
The work you are doing is fantastic. Thank you for sharing!
This is what I'm looking for. Nice work.
This video really helped me with my thesis.
Glad it was helpful!
Adding vision and a Voice Activity Detector (VAD) would take it to the next level.
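For reference, a minimal sketch of what VAD gating could look like with the open-source webrtcvad package (the frame size and sample rate are requirements of that library; everything else here is illustrative, not part of Verbi):

```python
# A minimal VAD sketch using the open-source webrtcvad package: only frames
# classified as speech would be forwarded to the STT step. webrtcvad requires
# 16-bit mono PCM at 8/16/32/48 kHz, in 10/20/30 ms frames.
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0 (least strict) to 3 (most strict)
SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # samples per frame * 2 bytes

def speech_frames(pcm: bytes):
    """Yield only the frames webrtcvad classifies as speech."""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[i:i + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame
```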
@PromptEngg. How will you handle it when a user speaks a language other than English? Do I need to translate the STT output text before passing it to the LLM, or is there a way to handle this in the prompt itself?
I am stuck on this point.
I'd appreciate your advice on this. Thanks 😊
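One hedged sketch of an answer, assuming a Whisper-family STT model: Whisper reports the language it detected, so rather than adding a translation step you can tell the LLM which language to reply in via the prompt. The model name, file name, and prompt wording below are illustrative:

```python
# A minimal sketch of one way to handle non-English speech, assuming the
# open-source openai-whisper package; the model and prompt are illustrative.
import whisper

model = whisper.load_model("large-v3")  # "large-v3" requires a recent release

# Whisper transcribes in the spoken language and reports what it detected,
# so a separate translation step is not strictly required.
result = model.transcribe("user_audio.wav")
text, lang = result["text"], result["language"]

# Let the prompt handle the rest: most multilingual LLMs can reply in-language.
system_prompt = (
    f"You are a helpful voice assistant. The user is speaking '{lang}'. "
    f"Always answer in the same language as the user."
)
print(system_prompt, text)
```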
Brilliant! Thanks a lot for the content; you are a true master.
Your work is fantastic. Can you make a video showing us how to install it locally, step by step, please? Thanks.
There is a problem: when restarting the code, it sometimes gives an error about the API. Is there any way to fix that?
Can you make it so that Verbi has a wake word, so we can keep it running all the time and call it whenever we want, without it saying random things while I am talking to a friend next to me? It would only reply or act if I say "Verbi, what tasks do I have" instead of "What tasks do I have".
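A minimal sketch of how transcript-level wake-word gating could work (an illustration, not how Verbi is implemented; production systems typically use a dedicated wake-word engine such as Picovoice Porcupine that listens to the audio before STT runs):

```python
# A minimal sketch of transcript-level wake-word gating: ignore every
# utterance unless it starts with the wake word.
WAKE_WORD = "verbi"

def gate_on_wake_word(transcript: str) -> str | None:
    """Return the command that follows the wake word, or None to stay silent."""
    words = transcript.strip().lower().split()
    if words and words[0].strip(",.!?") == WAKE_WORD:
        return transcript.strip()[len(words[0]):].lstrip(" ,")
    return None

assert gate_on_wake_word("Verbi, what tasks do I have") == "what tasks do I have"
assert gate_on_wake_word("What tasks do I have") is None
```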
Well, my suggestions: make it start most or all Windows applications via voice commands; have it click and move the mouse; have it create new business emails, replies, etc.; make it able to run Q&A over PDF, Word, Excel, and other files; add RAG, memory, tree-of-thoughts, and so on, plus multiple agents; have it write code and execute scripts; and voilà.
I made the same thing with Ollama and a local TTS.
Funnily enough, I am also working on a similar project, but a more complicated AI with control over VS Code, and I believe my stack matches most of yours, apart from Grok. I have just watched 12 seconds of your video and am eager to see the TTS part, because I am struggling with quality in that department. The rest is almost done; some tweaking here and there remains.
Amazing videos, as always. What software do you use to record your screen that way?
I tried many times but failed due to an audio error. Can you help me check?
Now, finally, something really USEFUL from AI. Congratulations 🎉❤
How do you implement sessions?
Can you help me check the issues on GitHub? I need to demo this to my customer.
I do have an idea for a project, but I am not a developer, so I am relying on an LLM to generate the code. My idea is to have a bot inside MS Teams meetings that can do a presentation and interact with other attendees if they ask questions. For example, if the bot is presenting and an attendee asks a question, it would pause the presentation to answer. If the answer is satisfactory, the bot would resume the presentation. Only relevant topics should be answered; otherwise, the question would be listed as an action item. Is this feasible?
You are a developer
@@Devishetty001 What a great, relevant answer, LMAO.
I have this idea working locally using vLLM, XTTS v2, Whisper large-v3, and the Brave Search API.
What is the difference between this and Verbi?
It's Verbi with additional functionality.
@@engineerprompt What 'additional functionality' are you referring to?
Can I add it to Verbi when testing? If so, how?
Brilliant
Phenomenal
That voice is SO creepy!
It's actually a copy of the original JARVIS voice, lol.
Didn't JARVIS go evil in the Marvel movies? Maybe that is what you are picking up on?
@@Create-The-Imaginable Nah, he won't go evil; the evil AI was Ultron, not JARVIS.
@@im-notai I never saw that movie! I need to check it out! Was it good from an AI philosophical perspective?
@@Create-The-Imaginable you should 👍
These TTS API fees are pricey. We need on-device TTS, ideally fully open source. In my current demo, TTS is something like 90% of the voice-to-voice stack's per-minute fees. We need the price to drop by an order of magnitude.
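For what it's worth, a hedged sketch of one fully open-source, on-device option, using the Coqui TTS package's XTTS v2 model (the model string and file paths are assumptions; after the one-time model download, synthesis runs locally with no per-minute fee):

```python
# A minimal on-device TTS sketch using the open-source Coqui TTS package
# (pip install TTS). Model name and file paths are illustrative; the first
# run downloads the model, after which synthesis is entirely local.
import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# XTTS v2 clones a voice from a short reference clip of the target speaker.
tts.tts_to_file(
    text="Hello! This synthesis ran entirely on my own machine.",
    speaker_wav="reference_voice.wav",  # assumed local sample of the voice
    language="en",
    file_path="reply.wav",
)
```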
Since I am a human being, I am forced to click checkboxes and buttons and fill out forms, both on web pages and in fat (desktop) apps, all day, every day. This frustrates me because it is the type of brainless work that a machine should be doing. Also, because I am a human, I don't use APIs to do this; I use UIs. And I know for sure that there are no APIs for much of what I do with UIs. If someone else would make an AI that operates UIs (actually SEES them, not using AppleTalk or Puppeteer or similar APIs, which don't actually SEE anything and often don't detect controls that I can see, e.g. an image map with an on-click event), I would gladly pay for it. But since I can find no AI projects that make even a reasonable attempt at this (they all use APIs), I am having to build one for myself, just so I can use it.
Just clicking a checkbox is easy to automate. But clicking a captcha image, that's difficult.
@@HariPrezadu I wish the tedious work I must do consisted only of clicking checkboxes. But I am also forced to type on the keyboard quite a bit, tab and click on fields, operate pull-down menus, and endure lots of other forms of UI torture. Oh, what I would give if I could only get an AI to do this for me. Fortunately, I am well versed in the construction of AI-driven software, so maybe my suffering will soon be over. Or maybe I will die trying to make such an AI; but that would also end my UI suffering, so there is hope.
I have done a project using Selenium and BERT: the user can ask it to go to Google, open a website, fill in a form, etc. It was an amazing project that I did for one of my clients.
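For readers curious what such a project looks like, here is a minimal sketch of the Selenium half (the BERT intent-parsing step is omitted; the site, field name, and query are illustrative):

```python
# A minimal sketch of the browser-automation side of such a project: a parsed
# voice command like "search for verbi" could drive these Selenium calls.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()  # requires a local Chrome/chromedriver install
driver.get("https://www.google.com")

# Find the search form's text field by name and submit a query.
box = driver.find_element(By.NAME, "q")
box.send_keys("verbi voice assistant")
box.send_keys(Keys.RETURN)

print(driver.title)
driver.quit()
```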
Just look around; this exists through control over the browser. If the UI is browser-based, you can do this and much more. I forget the names, but there are more than three such tools; they are just new. This was a big problem for agents: solving captchas and behaving as a human would.
@@zeusconquers I am frustrated because I HAVE examined the half-hearted attempts at operating UIs. The browser-based solutions are very unsatisfying because they can see only controls that are well marked and easily distinguished as controls by the HTML/JS/CSS. An old-fashioned image map with a JS on-click handler that looks at the x,y coordinate of the click, for example, would be nearly impossible to operate with this kind of solution. The non-browser-based solutions suffer from the same problem because they use OS APIs to query for buttons, text fields, and other controls. If the app doesn't go out of its way to publish these controls the right way, the OS doesn't know about them. Bottom line: everything I have examined for operating UIs can't SEE anything; they are all using APIs or sort-of APIs.
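For the record, a rough sketch of the pixel-level approach being described, using the pyautogui library: it screenshots what is actually rendered and clicks raw screen coordinates. The vision step that maps a screenshot to a coordinate is a hypothetical placeholder, not an existing API:

```python
# A minimal sketch of screen-level (pixel) UI automation with pyautogui:
# it captures the rendered screen and clicks raw coordinates, so it works
# even on an image map with a JS on-click handler. The vision step that
# turns a screenshot into a coordinate is left as a hypothetical function.
import pyautogui

def locate_control(screenshot, description: str) -> tuple[int, int]:
    """Hypothetical: ask a vision model where `description` is on screen."""
    raise NotImplementedError("plug in your vision model here")

shot = pyautogui.screenshot()            # sees exactly what the user sees
x, y = locate_control(shot, "the Submit button")
pyautogui.click(x, y)                    # raw click; no DOM or OS API needed
pyautogui.write("hello", interval=0.05)  # typing works the same way
```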
👍