Why Cartesia-AI's Voice Tech is a Game-Changer You Can't Ignore!

  • Published: 28 Sep 2024

Comments • 60

  • @arjundesai2715
    @arjundesai2715 3 months ago +4

    Thanks for the feature! Super excited to keep building here.
    For the best experience with the API, I recommend using `stream=True` to get the first audio back very quickly. Audio will come back in chunks; we'll add more about how to use this to our docs. (A minimal streaming sketch follows this thread.)

    • @engineerprompt
      @engineerprompt  3 months ago

      Thanks for pointing it out. I do feel the docs need more work; I am going to explore it further. Thanks for putting it together.
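
A minimal sketch of the chunked-streaming pattern mentioned above. The endpoint URL, headers, payload fields, and output format below are placeholders rather than the exact Cartesia API; the official docs have the real schema.

```python
# Minimal sketch: request TTS audio as a stream and handle it chunk by chunk.
import requests

API_KEY = "YOUR_API_KEY"             # placeholder
URL = "https://api.example.com/tts"  # placeholder endpoint

payload = {
    "text": "Hello there, streaming lets you hear audio almost immediately.",
    "voice_id": "some-voice-id",     # placeholder
}

with requests.post(
    URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    stream=True,                     # ask requests not to buffer the whole body
    timeout=30,
) as resp:
    resp.raise_for_status()
    with open("out.raw", "wb") as f:
        # iter_content yields audio bytes as they arrive, so the first chunk
        # can be written (or played) long before the full clip is generated.
        for chunk in resp.iter_content(chunk_size=4096):
            if chunk:
                f.write(chunk)
```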

  • @keithprice3369
    @keithprice3369 3 months ago +5

    Has anyone done a demo of a single Cartesia voice outputting something podcast-length, say 20 to 30 minutes? The human quality on short text is stunning, but I worry that over longer text it will fall into a repetitive cadence. The fact that voices are cloned from just a 20-second sample reinforces my concern.
    Have you tested that?

    • @engineerprompt
      @engineerprompt  3 months ago +1

      Interesting point, I will do a test and report back. It will be a fun experiment.

  • @Zale370
    @Zale370 3 months ago +24

    The more models I use, the less I want to pay for the APIs.

    • @14supersonic
      @14supersonic 3 months ago +4

      Yeah, we really need this stuff free and open source. The only real limiting factor is the affordability of the GPU(s) needed to run it locally. There's stuff out there, but local open-source audio is sadly behind text- and image-based models; maybe that'll change soon.

    • @Zale370
      @Zale370 3 months ago +1

      @14supersonic Actually, you can run Stable Diffusion locally on a mid-range 12 GB consumer GPU for image generation, and audio models as well; quantized LLM models are also very good for simpler tasks like summarization.

    • @14supersonic
      @14supersonic 3 months ago +2

      @Zale370 I know, that's why I said audio-based AI models are behind text- and image-based solutions. When you compare something like local Llama 3 or SD3 to local audio models, there's no audio modality comparable to them yet in terms of local usage.

    • @Alex29196
      @Alex29196 3 months ago

      Indeed, there are no optimal and swift text-to-speech (TTS) solutions for local LLM inference. I personally believe this is not solely due to GPU memory constraints but also driven by security considerations.

    • @ts757arse
      @ts757arse 3 months ago +2

      Yeah, I'm a security consultant and the risks inherent in this are just insane. I won't ever say open source should slow down, but I appreciate the time we are getting to communicate what's coming.
      Amazingly, the EU AI legislation classifies voice-cloning AI as lower risk. I don't think they've ever gotten a phone call from their doctor asking them to stop a particular medication, or from their wife saying she's being held hostage and the captors are demanding all the money. It gets darker from there.

  • @MeinDeutschkurs
    @MeinDeutschkurs 3 months ago +6

    I'm always so impressed by models like this. But where are all the open-source solutions on this topic? The research is crazy!

  • @imdb6942
    @imdb6942 3 months ago +3

    To get instant feedback you MUST use the WebSocket, not the HTTP POST. Also use streaming playback so new data coming down the WebSocket is fed straight into the audio output; then you'll see your 153 ms. I can share the code with you, I just don't know how to do that here. (A sketch of this pattern follows this thread.)

    • @engineerprompt
      @engineerprompt  3 months ago +2

      Thanks for pointing this out. Would love to look at the code. You can email me: engineerprompt at gmail, or reach out on Discord :)
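
A minimal sketch of the WebSocket-plus-streaming-playback pattern described in the comment above. The URI, message format, and audio parameters (raw 16-bit PCM at 22.05 kHz) are assumptions for illustration, not the provider's actual protocol.

```python
# Minimal sketch: send text over a WebSocket and play audio chunks as they
# arrive, instead of waiting for the whole clip.
import asyncio
import json

import sounddevice as sd   # pip install sounddevice
import websockets          # pip install websockets

URI = "wss://api.example.com/tts?api_key=YOUR_API_KEY"  # placeholder

async def speak(text: str) -> None:
    async with websockets.connect(URI) as ws:
        # Request format is illustrative only.
        await ws.send(json.dumps({"text": text, "voice_id": "some-voice-id"}))

        # Assume raw 16-bit PCM mono at 22.05 kHz; adjust to what the API returns.
        with sd.RawOutputStream(samplerate=22050, channels=1, dtype="int16") as out:
            async for message in ws:
                if isinstance(message, bytes):
                    out.write(message)   # play each chunk the moment it lands
                else:
                    break                # e.g. a JSON "done" control message

asyncio.run(speak("Streaming playback keeps perceived latency very low."))
```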

  • @GAllium14
    @GAllium14 3 months ago +1

    What software do you use for those super smooth zooms?

    • @engineerprompt
      @engineerprompt  3 months ago

      It's called Screen Studio. It's only for Mac.

  • @mohsenghafari7652
    @mohsenghafari7652 3 months ago

    Hello. Thank you for your efforts and the very good tutorial. It is very difficult for us to get API access and use it. Can you tell us what we should do if we want to use it for free? Thank you.

  • @MrKarlyboy
    @MrKarlyboy 3 months ago +2

    If you wanted to plug this into a chatbot, the pricing does not add up. I've done some number crunching, and it won't get you far even with a basic, smallish customer doing, say, 1,000-3,000 chats a month, which isn't a lot. Most engines price audio per 15 seconds or per minute. More good engines are emerging. For our low-end customers we usually see 3 to 5 concurrent sessions anyway, and that's the smallest tier. We've done hundreds of millions of chats, and hundreds of millions of live chats too, so we're getting into the billions. The market is competitive. Some of the new Google Studio voices are comparable, Deepgram too. Sure, these are nice voices, but for a streaming API at this cost in a competitive market, sorry, but no, unless the pricing model radically improves. It's early days, so hopefully there will be new models, new options, and a realization. Take, say, 5,000, 10,000, 30,000, and 100,000 chats, work out the average transcript size on the bot side, and average out the characters; you will see my point. (A back-of-the-envelope version of this math follows this thread.)

    • @engineerprompt
      @engineerprompt  3 months ago +3

      That's a valid argument. Hopefully they will be able to reduce their price as they scale.
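
A back-of-the-envelope sketch of the cost math suggested above. The per-character price reflects the $5 per 100,000 characters tier mentioned elsewhere in the comments; the average transcript size is an assumed figure you should replace with your own numbers.

```python
# Rough monthly TTS cost estimate for a chatbot, at various chat volumes.
PRICE_PER_CHAR = 5 / 100_000        # $5 per 100k characters on the entry tier
AVG_BOT_CHARS_PER_CHAT = 600        # assumed average spoken bot text per chat

for chats_per_month in (1_000, 3_000, 5_000, 10_000, 30_000, 100_000):
    chars = chats_per_month * AVG_BOT_CHARS_PER_CHAT
    cost = chars * PRICE_PER_CHAR
    print(f"{chats_per_month:>7} chats/month -> {chars:>11,} chars -> ${cost:,.2f}")
```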

  • @gsagum
    @gsagum 3 months ago +5

    The free plan is "10,000" characters, while the lowest tier at $5 per month gets you "100,000 characters per month". I re-read that: it's in "characters", not "words". Am I dreaming? So one letter is one character, right? Is that correct? Isn't that super expensive?

    • @3750gustavo
      @3750gustavo 3 months ago

      It's cheap compared to the other paid voice service (ElevenLabs), which gives only 30k characters for 5 dollars; the same 100k characters cost over 20 dollars on ElevenLabs, 4x more expensive. But yeah, compared to other AI services where you pay once and get almost unlimited usage, like Infermatic for text AI, it's expensive.

    • @simongus
      @simongus 3 months ago +1

      And with characters, they can count every space as a character.

    • @BackTiVi
      @BackTiVi 3 months ago +1

      Yup, that's only characters. On average, 1,000 characters is about 1 minute of audio IIRC, so the free tier is about 10 minutes of audio. For the same price ($5), the ElevenLabs starter pack is only 30,000 characters per month, so only half an hour. (A quick per-minute comparison follows this thread.)

    • @ronilevarez901
      @ronilevarez901 3 months ago

      @BackTiVi I'll stay with Coqui.
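
A quick sanity check of the characters-to-minutes-to-dollars comparison above, using the rough 1,000-characters-per-minute rule of thumb from the same comment; the plan figures are the ones quoted in this thread.

```python
# Convert monthly character quotas into minutes of audio and price per minute.
CHARS_PER_MIN = 1_000   # rough rule of thumb: ~1,000 characters per minute

plans = {
    "Cartesia $5 tier": (5.00, 100_000),     # (price in $, characters per month)
    "ElevenLabs $5 starter": (5.00, 30_000),
}

for name, (price, chars) in plans.items():
    minutes = chars / CHARS_PER_MIN
    print(f"{name}: ~{minutes:.0f} min/month, ~${price / minutes:.3f} per minute")
```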

  • @大支爺
    @大支爺 3 months ago +1

    Nobody would pay for services when we can do it locally on our own PC.

  • @Cedric_0
    @Cedric_0 3 months ago

    I was working on a project where I need to use my local language, but I'm having issues with the Coqui AI TTS library. Any other alternative that would be helpful and easy to use? Thank you.

  • @mohsenghafari7652
    @mohsenghafari7652 3 months ago

    thanks

  • @avi7278
    @avi7278 3 months ago

    Appreciate your efforts, but why the heck would you need an API call to get the ID of the voice you want to use, or other seemingly static parameters? Also, the API latency is terrible compared to their playground. Either you're still doing something unnecessary or their infrastructure is poor, which defeats the purpose of their supposedly low latency. Further, the text-to-speech input should be chunked into sentences and streamed to the TTS service instead of waiting for the full response. This is OK for one- or two-sentence responses, but if latency increases linearly then it's no good. Is there endpointing? Interruption handling?

    • @arjundesai2715
      @arjundesai2715 3 months ago +1

      Thanks for the feedback @avi7278.
      1. You can get the voice_id straight from the playground! We'll have support very soon for passing that in directly.
      2. For the best experience with the API, I recommend using `stream=True` to get the first audio back very quickly 🚀. Audio will come back in chunks; we'll add more about this to our docs.
      3. You can definitely send text chunks over the wire; we'll have more native support for text streaming soon. (A sentence-chunking sketch follows below.)
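
A minimal sketch of point 3 above: chunking a streamed LLM response into sentences and handing each one to TTS as soon as it is complete. The `synthesize` function is a hypothetical stand-in for whatever TTS call (HTTP or WebSocket) is actually used.

```python
# Split streamed LLM tokens into sentences and send each to TTS immediately,
# rather than waiting for the full response.
import re
from typing import Iterable, Iterator

def sentences(token_stream: Iterable[str]) -> Iterator[str]:
    """Accumulate streamed tokens and yield complete sentences."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Split on sentence-ending punctuation followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", buffer)
        # Everything except the last fragment is a finished sentence.
        for done in parts[:-1]:
            if done.strip():
                yield done.strip()
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()

def synthesize(sentence: str) -> None:
    print(f"TTS <- {sentence}")   # placeholder for the real TTS request

# Example: tokens as they might arrive from a streaming LLM response.
fake_tokens = ["Hello", " there.", " This is", " sentence two!", " And three."]
for s in sentences(fake_tokens):
    synthesize(s)
```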

  • @Beetgrape
    @Beetgrape 3 months ago

    Is it faster than Deepgram?

    • @engineerprompt
      @engineerprompt  3 months ago

      Yes, on the playground. The Cartesia team recommends streaming; I am going to test that and report back.

  • @aifortune
    @aifortune 3 months ago

    I'm all in. Better pricing than ElevenLabs.

  • @GameofLifeChannel
    @GameofLifeChannel 3 months ago

    Liking before watching because I know this is going to be amazing.

  • @michalbiros6221
    @michalbiros6221 3 months ago +1

    Oh boy, it's three times more expensive than Google's premium voices and only includes English. Skipped.

  • @olivierv1993
    @olivierv1993 3 months ago

    Gosh, I thought it was close to what GPT-4o is capable of in terms of speed; big disappointment when you say "the API is slow"...

    • @engineerprompt
      @engineerprompt  3 months ago +1

      It seems like if you enable streaming it's much faster. I will create a follow-up video. This has potential.

  • @tx3851
    @tx3851 3 months ago +1

    They do not sound good at all....

  • @DevsDoCode
    @DevsDoCode 3 months ago +1

    Hey Prompt Engineer,
    If you don't mind, could I also be a contributor to your project? I have some wonderful features that could help make your Verbi AI an even better, more complete voice assistant 🥹
    It's a request to add me to the group. I won't disappoint you 😼

    • @engineerprompt
      @engineerprompt  3 months ago

      Yes, I would love contributions. Please open a PR. We have a dedicated channel on the Discord server; feel free to join the discussion there.

  • @canaldetestes4517
    @canaldetestes4517 3 months ago +3

    Thanks, but I'm Brazilian and didn't find Portuguese in it.

    • @engineerprompt
      @engineerprompt  3 months ago +2

      At the moment, it's English only.

    • @canaldetestes4517
      @canaldetestes4517 3 months ago +1

      @engineerprompt Hi, OK. Thank you for your attention and your answer.

  • @drgutman
    @drgutman 3 months ago +1

    Meh, I thought it was a better local TTS... oh well.

  • @P-G-77
    @P-G-77 3 months ago +1

    This... incredible... awesome. NICE WORK!!

  • @大支爺
    @大支爺 3 months ago

    No thanks for advertising.

  • @unclecode
    @unclecode 3 months ago +1

    Reminds me of ElevenLabs' early days. I think they use streaming mode in their playground and measure the time it takes to generate the first audio segment. That's why it seems very fast. What do you think? (A simple time-to-first-chunk measurement follows this thread.)

    • @engineerprompt
      @engineerprompt  3 months ago

      That's exactly how they are doing it. Their co-founder pointed it out and suggested enabling streaming via the API as well. On the Discord, a contributor to project-verbi said it's possible to get about 200-400 ms with streaming. I might redo this.

    • @unclecode
      @unclecode 3 months ago

      @engineerprompt
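
A simple way to measure the time-to-first-audio-chunk figure discussed above over a streaming request; the endpoint and payload are placeholders, not the actual API.

```python
# Time how long it takes for the first audio bytes of a streamed TTS response
# to arrive, which is the number playgrounds typically showcase.
import time
import requests

URL = "https://api.example.com/tts"   # placeholder
payload = {"text": "Measure how quickly the first audio bytes arrive."}

start = time.perf_counter()
with requests.post(URL, json=payload, stream=True, timeout=30) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=4096):
        if chunk:
            first_chunk_ms = (time.perf_counter() - start) * 1000
            print(f"time to first audio chunk: {first_chunk_ms:.0f} ms")
            break
```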

  • @adriantang5811
    @adriantang5811 3 months ago

    Thank you so much and I can't wait for your next exciting video.

  • @KCM25NJL
    @KCM25NJL 3 months ago

    They still have natural cadence issues, which is a hard problem to solve.

    • @engineerprompt
      @engineerprompt  3 months ago

      Yes, I think this is just the alpha version, so hopefully it will get better over time.

  • @tribuzeus
    @tribuzeus 3 months ago

    Multi-language?

  • @ScottzPlaylists
    @ScottzPlaylists 3 months ago

    I'm interested in open source only... can't finish watching. Thumbs down, sorry.