featuring: Blender Bottle, overheating camera, & VapiAI (vapi.ai/ ) - play at 1.75x - 2x speed for convenience.
Will your channel be dedicated to AI and other topics pertaining to technology or something else? I like the way you are covering this
@@justcallmebrian793 it will be tech-focused with other random vids - basically a reflection of my current interests (which may shift & change, but since I'm an engineer, likely tech)
This is great, thank you for your thorough explanation!
sure! I couldn't really find any candid long-form demo videos (30min - 1hr+), so just wanted to make one.
ok: this is brilliant on many levels. It's simply an example of human-AI interaction using a general LLM w/ a more humane conversational AI. This is a great demo. Love the natural setup and the attempt to make it flow as a real conversation with a person would - only in this case, a human Wiki.
thanks haha
Appreciate the technical intro on this. Thinking of using Vapi in my stack for a consumer facing product i'm working on.
cool cool
The Jarvis-like presentation at 35:00 was pretty cool !!!
Thoughts on the impact of this tech compared to the internet: I think people forget how long it took to build out and integrate internet technology. It all started with very slow phone modems and email addresses almost no one had… no online banking, no internet games, no text messaging, no YouTube, no Netflix. It took about a decade for the internet to start becoming useful. In that context, it will take a while for these new technologies to penetrate and start impacting jobs. Conversational AI as a whole seems less than a year old… heck, less than 6 months old.
Right - well, I was just making filler conversation to keep the convo moving. But yes, it will take time: entrepreneurs will have to create the ventures, investors will have to fund them, markets will have to adopt them, programmers will have to program them, etc, etc. Good point.
@@bephrem nice video though… even the clumsy parts illustrate the current nascent reality of this technology
Overall, I thought it was a great video. Even when the AI cuts you off, it sounds like a natural conversation. I don't like a long pause when someone's finished talking, because it sounds like they or you got disconnected. I have a small business that sends outbound calls.
Do you charge to set this up for others?
There's a good bit of no-code solutions you can go with to avoid consultant fees. Feel free to dm me on Twitter.
I'm also working on a site w/ guides to common voice workflows + recommended platforms so you can build on your own.
This tech could create a brilliant customer service phone line
indeed
Awesome video man
thx haha - very random vid
The AI never let you finish your thoughts if you paused for a second, which forced you to keep talking with no break, just so she didn't cut in on you. It needs some work to be able to tell when we are done with a thought versus just pausing to gather our words.
Right - I found myself speaking without break at times to keep things smooth, but I did find that it handled pausing (maybe off-camera) as well. Once you're aware of its limitations you do find yourself trying to "help it out", which in turn leads to unnatural conversation patterns, etc
I'd probably tell it to not respond until I say OVER or some other word, but then it's a little awkward
@@runepanix1 That's a good idea.
Chat GPT has the same problem when you use the voice option. That's why I hold down the button so it has to wait until I'm done.
@@runepanix1 well you’d want it to just know not to interject, having a stop/completion word ruins the point of a freeform conversation interface
Dude chill. It’s new. This is the worst it’s gonna be. Relax….🥴🤦🏾♂️
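The end-of-turn problem this thread describes can be sketched in a few lines. This is a hypothetical, naive silence-based detector (all names and thresholds here are illustrative, not taken from any of the tools mentioned): it ends the user's turn only after a stretch of consecutive low-energy frames, which makes the trade-off concrete - a short threshold cuts people off mid-thought, a long one feels like the line dropped.

```python
# Naive end-of-turn detector: end the turn after `min_silence_ms` of
# consecutive low-energy audio frames. Thresholds are illustrative only.

def is_end_of_turn(frame_energies, energy_threshold=0.02,
                   frame_ms=20, min_silence_ms=700):
    """Return True if the trailing silence exceeds min_silence_ms."""
    trailing_silent = 0
    for energy in reversed(frame_energies):
        if energy < energy_threshold:
            trailing_silent += frame_ms
        else:
            break
    return trailing_silent >= min_silence_ms

# A brief mid-sentence pause (20 frames * 20 ms = 400 ms) should NOT end the turn...
print(is_end_of_turn([0.5] * 10 + [0.01] * 20))  # False

# ...but 800 ms of trailing silence should.
print(is_end_of_turn([0.5] * 10 + [0.01] * 40))  # True
```

Real systems layer semantic cues (did the sentence sound finished?) on top of pure silence duration, which is exactly the part the commenters feel is missing.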
I enjoy using “old” text-based models like Claude. As tedious as that interface is… paradoxically, it does allow space to better compose thoughts… like the difference in content between a letter and a text message.
this is great training data
Damn my algo knows me. Hold my beer. Got something to show you all soon. But yeah. Impressive!
excited to see it!
Can you try this with Sinderin and VoiceOS?
yes, I set up a playground where you can try different services: vocalized.dev/playground - will add Sindarin soon
@@bephrem That's an awesome project! “VoiceOS io” and “milis ai” could be on the roadmap too. I'd love to compare all of them side by side.
You can do this locally with open-source tools, and it's way easier to get more performant (low-latency) responses and manage interruptions and other advanced stuff that you simply can't do with some remote API.
I mentioned this at 38:26 - could you go more into how you deploy your solution & the latency there? I've seen many low-latency local demos, but once you deploy this and have international usage, machine positioning becomes important. On the latter idea, I do wonder if these 50-100ms-sensitive edge cases can be solved w/ a remote solution; it really pushes optimization all the way down to the physical link layer (which may be the hard limit here - you cannot change the highway the bits are shipped along).
@@bephrem Most of it is that managing your own local LLMs and speech stack lets you pipeline data to minimize round-trip issues. Most of the existing AI cloud APIs are set up to handle bulk workloads and struggle with streaming.
With low-level access, we can start TTS'ing the local LLM's output almost as soon as text tokens are dumped out, because we don't have to worry about bundling everything up, shipping it to a server, and sitting around waiting for the wav file to download. Right now we just have a bunch of basic sockets on one workstation, but we're planning to use ZeroMQ or something similar to get networked stuff going.
A lot of the stuff I am seeing is smoke and mirrors to hide latency issues. Like the AI agents saying "got it", etc. - these seem like pre-scripted responses to trick your mind into thinking it has started talking.
The local one we got running doesn't need to do that and it's still fast enough to manage conversations.
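A minimal sketch of the pipelining idea described above, with stand-in functions for the local LLM and TTS engine (`fake_llm_stream` and `synthesize` are made up for illustration, not any real API): buffered tokens are flushed to TTS at each clause boundary, so the first chunk of audio can start playing while the rest of the response is still generating.

```python
# Stream LLM tokens into TTS clause-by-clause instead of waiting for the
# full response and downloading one big wav. Both functions below are
# stand-ins for a real local LLM and TTS engine.

def fake_llm_stream():
    # Pretend token stream from a local LLM.
    for tok in ["Sure", ",", " the", " weather", " is", " nice", " today", "."]:
        yield tok

def synthesize(text):
    # Stand-in for a local TTS call; returns "audio" for the clause.
    return f"<audio:{text.strip()}>"

def speak_streaming(token_stream, boundary_chars=",.;?!"):
    """Flush buffered tokens to TTS at each clause boundary."""
    buffer, audio_chunks = "", []
    for tok in token_stream:
        buffer += tok
        if tok and tok[-1] in boundary_chars:
            audio_chunks.append(synthesize(buffer))
            buffer = ""
    if buffer:  # flush any trailing partial clause
        audio_chunks.append(synthesize(buffer))
    return audio_chunks

print(speak_streaming(fake_llm_stream()))
# The first chunk ("Sure,") can play while later tokens are still generating.
```

In a deployed version each `synthesize` call would itself be streamed over a socket rather than collected in a list, but the buffering logic is the same.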
@@ultimape I have a pretty big audience in the real estate investing space looking to bring conversational AI to them, any chance we could chat? ☺️
@@bephrem A hybrid solution might be viable. Having a smaller, faster agent facilitate talking with a more heavyweight one in the cloud is something we are looking at.
Like, we can't really get Grok to run locally, but having a Mixtral agent say something like "fun idea, let me think about that for a moment" while behind the scenes sending off a more complex query to a cloud AI when it needs to seems practical.
Ideally they use a shared RAG or something so it doesn't end up having amnesia.
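The hybrid routing idea above can be roughed out like this, with toy stand-ins for the local model, the cloud model, and the router (none of these are real APIs - a real router would be a classifier, and the two paths would share retrieved context as mentioned): easy queries get answered locally, hard ones return a filler line immediately while the cloud call runs.

```python
# Toy hybrid dispatch: a fast local model handles easy queries; hard ones
# get an instant filler line while a heavier cloud model is consulted.
# All three functions are illustrative stand-ins, not real model APIs.

def local_model(query):
    return f"local-answer({query})"

def cloud_model(query):
    return f"cloud-answer({query})"

def needs_heavy_reasoning(query):
    # Toy heuristic standing in for a real router/classifier.
    return len(query.split()) > 8 or "explain" in query.lower()

def respond(query):
    """Return (filler_line_or_None, answer)."""
    if needs_heavy_reasoning(query):
        filler = "Fun idea, let me think about that for a moment."
        return filler, cloud_model(query)  # filler plays while cloud call runs
    return None, local_model(query)

print(respond("hi there"))
print(respond("explain how transformers handle long context windows"))
```

The design question the thread raises is real: the router adds a failure mode (misclassified queries) and the two models need shared memory, so the complexity only pays off if cloud round-trip latency is otherwise unacceptable.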
@@ultimape Right - I've considered the hybrid approach. It opens up a lot of complexity in picking what does what, & as you mention sync'ing state - makes me question if that complexity is necessary, but it seems the latency necessitates it? Not sure, but good ideas.
I wonder if AI can move beyond sounding like PR-speak. It speaks a whole bunch of words that don’t mean much.
it soon will
Should've done it with Pi Ai
good idea 💯
@@bephrem Its conversational ability is the best I've come across so far
PI ai doesn't have an API.
Interestingly, the presence of a nubile young voice… ultimately just illustrates the inherently hollow nature of the interaction by unrealistically raising anticipated expectations… like an artificial sex doll… the sexier it is only increases the sense of unfulfilled potential
interesting, indeed
All I hear is non-answers.
Right - the LLM sort of parrots back what was said. My goal was more to just have a long sample video for developers to see latency, user interjection handling, agent interruption prevention, backchanneling, etc. I knew I wouldn't have a very intellectually novel or stimulating conversation.
I see. How did it do. Seemed ok or adequate to me but I am curious what your data shows.
@@thekittenfreakify Adequate enough for most commercial use-cases. Not really there yet for fluent conversation (both mechanically w/ latency + LLM knowing to say the "human" thing).
My kingdom for a speech interface that properly knows when I'm talking and when I'm done talking. ChatGPT's voice interface in the app has some of the same issues, but it will give much more substantial replies - without trying to be cute.
thanks for sharing!
Not as good as chatting with the latest version of GPT.
The LLM in this video was ChatGPT 3.5
I haven't found any voice chat as sophisticated as talking to GPT-4 or 3.5 for free, which I believe uses their Whisper AI model to generate the realistic voices, but they only have five voice options currently.
i dont mean to be a negative nancy but.. with this many fails, this video should not have been posted.. or you could have just cut them.. i dig your style but it's infuriating that it stops and starts so many times. like i feel like rage quitting the video. no disrespect to you. thank you for the content. i subscribed and look forward to the next video with a different camera!
I think the purpose of the video is exactly that, to show both the innovation and limitations of conversational AI
Right - so the main camera was a Sony A7IV, which is known for its overheating issues. I didn't know there was a specific heat shut-off setting you had to flip to high tolerance (it defaults to a low tolerance), since it was my 2nd time using the camera & I had never used it to film for longer than a minute. This was filmed on a Friday & edited all Saturday; I thought it better to release something finished than to never find the time to try again. The demo also times out within ~15 minutes (I think), so there's a cap there as well.
For the purposes of the video, my aim was achieved within ~15 minutes. The LLM won't recall that far back (I imagine), so the most relevant thing was showing the request-handling performance without holding back. This video doesn't have to be watched all the way through.
And right - the style is to have the video totally uncut & candid. It was just either I don't reshoot & release this or I release it as we see it today!
And thanks!
@@vladonutueu Right - I think @frinkfronk9198 's point is more about the technical difficulties than the video contents, but that's addressed above.
Talk to me when the personality comes through. Worst date ever.
The lack of personality comes from the LLM (ChatGPT 3.5 in this case), not Vapi's orchestration of services.
@@bephrem It would be really interesting to do a paid version of this using GPT-4 - I'm hearing the simplistic "PR-speakishness" of ChatGPT 3.5 in most of her flat, emotionless answers; the difference between 3.5 and 4 is glaring in this regard. The latency is super impressive, though.
Worst podcast guest ever
for sure
@@bephrem still super interesting though 👌🏻
6:59 both you and AI fucked up.
what happened? heh
@@bephrem the lady misunderstood your question, then you missed her point, and the conversation changed direction man xD
@@__J____ff ah I see
I think that was just hallucination; it made sense to keep the convo going, although it could've been redirected.