Thanks for the feature! Super excited to keep building here.
For the best experience with the API, I recommend using `stream=True` to get the first audio back super fast. Audio will come back in chunks. We'll add more info about how to use this to our docs.
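Since the audio arrives in chunks, the client needs to collect or play them as they stream in. A minimal sketch in Python — the endpoint URL, payload shape, and chunk size below are illustrative assumptions, not the documented Cartesia API:

```python
import io

def collect_audio(chunks):
    """Accumulate raw audio chunks into one buffer, skipping any
    empty keep-alive chunks the stream may emit."""
    buf = io.BytesIO()
    for chunk in chunks:
        if chunk:
            buf.write(chunk)
    return buf.getvalue()

# Hypothetical usage with the `requests` library:
# resp = requests.post("https://api.example.com/tts/bytes",
#                      json={"text": "Hello!", "stream": True},
#                      stream=True)
# audio = collect_audio(resp.iter_content(chunk_size=4096))
```

In practice you would hand each chunk to an audio player as it arrives rather than waiting for the whole buffer — that early playback is where the latency win comes from.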
Thanks for pointing it out. I do feel the docs need more work; I'm going to explore it further. Thanks for putting it together.
Has anyone done a demo of a single Cartesia voice outputting something podcast-length, say 20 to 30 minutes? The human quality on short text is stunning, but I worry that over longer text it will fall into a repetitive cadence. The fact that voices are cloned from just a 20-second sample reinforces my concern.
Have you tested that?
Interesting point, I will do a test and report back. It will be a fun experiment.
The more models I use, the less I want to pay for the APIs.
Yeah, we really need this stuff to be free and open source. The only real limiting factor is the affordability of the GPU(s) needed to run it locally. There's stuff out there, but local open-source audio models are sadly behind text and image models; maybe that'll change soon.
@14supersonic Actually, you can run Stable Diffusion locally on a mid-range 12 GB consumer GPU for image generation, and audio models as well. Quantized LLMs are also very good for simpler tasks like summarization.
@Zale370 I know; that's why I said audio-based AI models are behind text- and image-based solutions. When you compare something like local Llama 3 or SD3 to local audio models, there's no audio modality comparable to them yet in terms of local usage.
Indeed, there are no optimal and swift text-to-speech (TTS) solutions for local LLM inference. I personally believe this is not solely due to GPU memory constraints but also driven by security considerations.
Yeah, I'm a security consultant, and the risks inherent in this are just insane. I won't ever say open source should slow down, but I appreciate the time we're getting to communicate what's coming.
Amazingly, the EU AI legislation classifies voice-cloning AI as lower risk. I don't think they've ever gotten a phone call from their doctor asking them to stop a particular medication, or from their wife saying she's being held hostage and the captors are demanding all their money. It gets darker from there.
I'm always so impressed by models like this. But where are all the open-source solutions on this topic? Research is crazy!
To get instant feedback you MUST use the websocket, not HTTP POST. Also use streaming playback, so new data coming down the websocket goes straight into playback; then you'll get your 153 ms. I can share the code with you, I just don't know how to do that here.
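That websocket-plus-streaming-playback pattern can be sketched roughly like this — a hedged example rather than the commenter's actual code; the websocket URL and the playback callback are hypothetical:

```python
import asyncio

async def stream_to_player(chunks, play):
    """Pipe audio chunks from an async iterator (e.g. a websocket
    connection) into a playback callback as each chunk arrives,
    instead of buffering the full response. Returns the time to
    first chunk in seconds, i.e. the perceived latency."""
    loop = asyncio.get_running_loop()
    start = loop.time()
    first_latency = None
    async for chunk in chunks:
        if first_latency is None:
            first_latency = loop.time() - start
        play(chunk)  # e.g. write into a PyAudio output stream
    return first_latency

# Hypothetical usage with the `websockets` library:
# async with websockets.connect("wss://api.example.com/tts/ws") as ws:
#     await ws.send('{"text": "Hello!"}')
#     ttfb = await stream_to_player(ws, audio_out.write)
```

The point is that playback starts on the first chunk, so the number you perceive is time-to-first-chunk, not total generation time.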
thanks for pointing this out. Would love to look at the code. You can email me: engineerprompt at gmail or reach out on discord :)
What software do you use for those super smooth zooms?
It's called Screen Studio. It's Mac-only.
Hello. Thank you for your efforts and the very good training. It is very difficult for us to set up and use the API. Can you tell us what we should do for free usage? Thank you.
If you wanted to plug this into a chatbot, the pricing does not add up. I've done some number crunching, and it won't even get you far with a basic smallish customer doing, say, 1,000-3,000 chats a month, which isn't a lot. Most engines price in an audio sequence every 15 s or 1 min, and more good engines are emerging. For our low-end customers we usually see 3 to 5 concurrent sessions anyway, and that's the smallest tier. We've currently done hundreds of millions of chats, and hundreds of millions of live chats too, so we're getting into the billions. The market is competitive: some of the new Google Studio voices are comparable, and Deepgram too. Sure, these are nice voices, but for a streaming API that is cost-effective and competitive, sorry, but no, unless the pricing model radically improves. It's early days, so hopefully there will be new models, new options, and a realization. I suggest you take, say, 5,000, 10,000, 30,000, and 100,000 chats, work out the average transcript size on the bot side, and average out the characters. You will see my point!
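That suggested back-of-envelope calculation is easy to sketch. The $5 per 100k characters figure comes from the pricing discussed in this thread, and linear scaling is an assumption (real overage pricing may differ):

```python
def monthly_tts_chars(chats_per_month, avg_bot_chars_per_chat):
    """Total characters the bot side would send to TTS per month."""
    return chats_per_month * avg_bot_chars_per_chat

def monthly_cost_usd(chars, usd_per_100k=5.0):
    """Cost assuming pricing stays linear at the $5 / 100k-character tier."""
    return chars / 100_000 * usd_per_100k

# e.g. 3,000 chats/month with ~400-character bot replies:
# monthly_cost_usd(monthly_tts_chars(3000, 400))  ->  60.0 USD/month for TTS alone
```

Run the same numbers at 30,000 or 100,000 chats and the monthly TTS bill quickly dwarfs what a small customer pays for the chatbot itself.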
that's a valid argument. Hopefully they will be able to reduce their price as they scale.
The free plan is 10,000 characters, while the lowest tier at $5 per month gets you 100,000 characters per month. I re-read that: it's in characters, not words. Am I dreaming? So one letter is one character, right? Is that correct? Isn't that super expensive?
It's cheap compared to the other paid voice service (ElevenLabs), which gives only 30k characters for $5; the same 100k characters costs over $20 on ElevenLabs, 4x more expensive. But yeah, compared to other AI services where you pay once and get almost unlimited usage, like Infermatic for text AI, it's expensive.
And with characters, they can count every space as a character.
Yup, that's characters only. On average, 1,000 characters is about 1 minute of audio, IIRC, so the free tier is 10 minutes of audio. For the same price ($5), the starter pack of ElevenLabs is only 30,000 characters per month, so only half an hour.
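Using that ~1,000 characters per minute rule of thumb (a rough average from this thread, not an official figure), the quotas convert to audio minutes like this:

```python
def quota_minutes(chars_per_month, chars_per_minute=1000):
    """Rough audio minutes a character quota buys, using the
    ~1,000 characters/minute rule of thumb."""
    return chars_per_month / chars_per_minute

# quota_minutes(10_000)   -> 10.0   (free tier, ~10 min)
# quota_minutes(100_000)  -> 100.0  ($5 tier, ~1h40m)
# quota_minutes(30_000)   -> 30.0   (ElevenLabs starter, ~half an hour)
```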
@BackTiVi I'll stay with Coqui.
Nobody would pay for services when we can do it locally on our own PC.
I was working on a project where I need to use my local language but am having issues with the Coqui AI TTS library. Any other alternative that would be helpful and easy to use? Thank you.
Try MeloTTS
Thank you, I will try it
thanks
Appreciate your efforts, but why the heck would you need an API call to get the ID of the voice you want to use, or other seemingly static parameters? Also, the API latency is terrible compared to their playground. Either you're still doing something unnecessary or their infrastructure is poor, which defeats the purpose of their supposedly low latency. Further, the text-to-speech input should be chunked into sentences and streamed to the TTS service instead of waiting for the full response. That's OK for one- or two-sentence responses, but if latency increases linearly then it's no good. Is there endpointing? Interruption?
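Sentence-level chunking like that is easy to prototype. A naive sketch — the regex is a rough heuristic, not a proper sentence tokenizer, and the TTS call in the comment is hypothetical:

```python
import re

def sentence_chunks(text):
    """Split text on sentence-ending punctuation so each sentence can
    be sent to the TTS service as soon as the LLM emits it, instead
    of waiting for the full response."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

# for sentence in sentence_chunks(llm_response):
#     tts_stream(sentence)  # hypothetical call; fire each sentence immediately
```

With this, time-to-first-audio is bounded by the first sentence rather than the whole LLM response.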
thanks for the feedback @avi7278.
1. You can get the voice_id straight from the playground! We'll have support very soon for passing that in directly.
2. For the best experience with the API, I recommend using `stream=True` to get the first audio back super fast 🚀. Audio will come back in chunks. We'll add more info about this to our docs.
3. You can definitely send text chunks over the wire; we'll have more native support for text streaming soon.
Is it faster than Deepgram?
Yes, on the playground. The Cartesia team recommends streaming. I am going to test that and report.
I'm all in. Better price than ElevenLabs.
Liking without watching coz I know this is gonna be amazing
Oh boy, it's three times more expensive than Google's premium voices and only includes English. Skipped.
Gosh, I thought it was close to what GPT-4o is capable of in terms of speed; big disappointment when you say "the API is slow"...
Seems like it's much faster if you enable streaming. I will create a follow-up video. This has potential.
They do not sound good at all....
Hey Prompt Engineer,
If you don't mind, could I also be a contributor to your project? I have some wonderful features that could help you make your Verbi AI even better and a perfect voice assistant 🥹
It's a request to add me to the group. I wouldn't disappoint you 😼
Yes, would love contributions. Please open a PR. We have a dedicated channel on the discord server. Feel free to join the discussion there.
Thanks, but I'm Brazilian and didn't find Portuguese in it.
At the moment, it's English only.
@@engineerprompt Hi, ok. Thank you for your attention and answer
Meh, I thought it was a better local TTS... oh well.
This... incredible... awesome, NICE WORK !!
No thanks for advertising.
Reminds me of ElevenLabs' early days. I think they use stream mode in their playground, measuring the time it takes to generate the first audio segment. That's why it seems very fast. What do you think?
That's exactly how they're doing it. Their cofounder pointed it out and suggested enabling streaming via the API as well. On the Discord, a contributor to project-verbi said it's possible to get about 200-400 ms with streaming. I might redo this again.
@engineerprompt
Thank you so much and I can't wait for your next exciting video.
They still have natural cadence issues, which is a hard problem to solve.
Yes, I think this is just the alpha version, so hopefully it will get better over time.
Multi-language?
No
Coming soon 🚁
I'm interested in open source only... can't finish watching. Thumbs down, sorry.