This is really cool. Being able to create localised versions of your content would open you up to so many more audiences. Imagine a time when YouTube allows you to upload a video with multiple audio tracks, each linked to a different localised voice track.
I can legit see this happening within the next 3-5 years, EASILY. The tech is already almost here, so it's pretty exciting to see, especially since it wouldn't need to be real time.
MrBeast already has an audio track in my language (Indonesian) on each of his videos... which, I bet, costs a whole lot of money. BUT with this, everyone would be able to do that without worrying much about money... haha, what a time to be alive.
(pardon my grammar)
Thanks for the video. Can you run the code on Windows? How can I access the demo code?
So I know German, and while the first translation into German was kinda meh (it lacked expression), the Elden Ring audio was actually not bad. I could really see this being used as a quick way to translate dubs of video games or movies. Thanks for the video, keep it up!
Honestly, I could see this too! Dubbing a movie or game in another language could become quite a bit easier as this tech gets better.
The French was completely broken, but overall impressive.
Yes, bad translation and bad voice. I don't recommend using it to spam French people... or rather, yes, it should be used, since the "uncanny valley" would trigger suspicion :D
Hi! Is there any way to make it "real" realtime? I mean, without having to upload a .wav file or record each time: feed it an audio input continuously and produce a continuous audio output with the translation, even if there is a 3-10s delay.
Thanks for the demo. How hard would it be to make an app that translates "on the fly"? I support a local church and the service is in Spanish, but we have many guests who require live translation to English. It would be cool if we could have an app that "listens" to the sermon and then automatically speaks it in English or any other language. Do you have any ideas?
Speaking another language actually makes you sound different. When I speak Spanish, I have a lower tone compared to when I speak English. Also, non-native speakers will still sound different despite being fluent in another language. And that doesn't even take into account dialects, speech impediments, tonal differences, pitch, etc.
Hey Jarods, would it be possible to record how you set up that whole thing (Visual Studio) up to the moment we can open Gradio - 2:26? Personally, that would help me do similar stuff independently, and a live example with M4T would be great for learning on a real project.
That's really cool. Have you tried the seamless-streaming one? I wonder how it does in real time.
German speaker: approved :) It's just not how you'd normally say it in everyday language.
Thanks for posting... I set this up on a Mac M2 and it's horribly slow, so it's good to know it can actually be faster than the input length if you have a legit GPU.
Thanks for this video. Is this model available locally? Can you give a tutorial on how to install it on Windows (I have a 4090 GPU)? Thanks. I see a near future (within months) where we could have real-time speech-to-speech translation for live streaming, where multiple audiences can enjoy the "Zeitgeist" experience. And this will make life easier for channels that want to broadcast in other foreign markets without going the boring text-to-speech route. Looks awesome. Stability AI's SDXL Turbo, which I use a lot, opened the floodgates for turbo-fying AI. We're going to see it in LLMs and TTS.
It is all local, yup! Installation is more complicated than normal because you have to deal with Windows Subsystem for Linux (WSL), and that opens up a huge can of worms.
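For anyone curious, once it's installed (inside WSL in my case), local speech-to-speech inference looks roughly like this. This is a minimal sketch from memory of the seamless_communication README, so the model names, import path, and predict arguments are assumptions and may differ between releases:

```python
import torch
import torchaudio
# Import path and class name are assumptions from memory of the repo README;
# check the version you installed, as these have moved between releases.
from seamless_communication.inference import Translator

# Load the multitask model and vocoder onto the GPU (fp16 keeps VRAM usage down).
translator = Translator(
    "seamlessM4T_v2_large",
    "vocoder_v2",
    torch.device("cuda:0"),
    dtype=torch.float16,
)

# Speech-to-speech translation (S2ST): an English .wav in, Spanish speech out.
text_output, speech_output = translator.predict(
    input="my_clip.wav",   # hypothetical input file
    task_str="S2ST",
    tgt_lang="spa",
)

# Write the translated audio to disk.
torchaudio.save(
    "my_clip_spa.wav",
    speech_output.audio_wavs[0][0].to(torch.float32).cpu(),
    sample_rate=speech_output.sample_rate,
)
```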
In a few years, we could have portable real-time translators. They'll be widely used once the hardware limitations are gone.
This is really interesting. I haven't been keeping up with this at all apart from lurking on your videos. I've only known OpenAI and DeepL to be pretty reliable for translations, and I guess Facebook's SeamlessM4T is another one on that list, with the benefit of being run locally with no API costs. Are there any other known models for translation as well?
Not sure; Whisper and now Seamless for local stuff, but Seamless is still kinda in a research phase.
Ought to get a Linux distro running with the 4090. Great videos/info here. Keep them coming~
Holding off until WSL2 no longer works for me 😂
The French sounded really human-like, except for 'M40'.
The French sounded pretty good.
holy shitttt, great video too, ty o7
If only RVC would add this as some sort of add-on, it would be so handy, so we don't have to switch between the two!
Thank you for the video. I am currently living in Osaka, Japan, and I am very interested in instant translation with AI models. However, what I understand by "instant translation" is not: "I say a sentence - the model translates it after a few seconds and I can hear it - I say another sentence - the model translates it after a few seconds and I can hear it..." What I understand by instant translation is: "You are talking in Japanese and, while you are talking in Japanese (with a delay of a few seconds), I hear your speech in Spanish. No matter how long the speech is. Maybe the Japanese speech is 10 minutes long, and I can begin to hear it in Spanish after 5 seconds, and it will end 5 seconds after the Japanese finishes." Basically, it is like having an interpreter by your side.
Do you think it is possible to implement such a thing with this model?
Answering in "general": with text, yes, it is possible to do this, since you can continually transcribe and translate the audio as it's being spoken. With text-to-speech, it's a bit harder and slower: you need to wait for the speaker to finish their sentence before you translate it and have a TTS engine speak it, so you'd always be at least a sentence behind. A short sentence like この人、怪しいと思います might translate to "I think this person is suspicious," but if you had this technology translate word for word, it'd probably come out as "this person, suspicious I think."
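To make that concrete, here's a rough sketch of the sentence-buffered loop I mean. The transcribe, translate, and speak helpers are hypothetical placeholders standing in for an ASR model, a translation model, and a TTS engine, not a real API:

```python
def live_interpreter(audio_chunks, transcribe, translate, speak, tgt_lang="spa"):
    """Sentence-buffered interpretation: translate only once a sentence is
    complete, so the output always lags the speaker by at least one sentence.

    transcribe / translate / speak are hypothetical placeholders wrapping an
    ASR model, a translation model, and a TTS engine respectively.
    """
    sentence_ends = ("。", ".", "?", "!", "？", "！")
    buffer = ""
    for chunk in audio_chunks:          # short audio buffers from a microphone
        buffer += transcribe(chunk)     # incremental speech-to-text
        while True:
            cuts = [buffer.find(p) for p in sentence_ends if p in buffer]
            if not cuts:
                break                   # no complete sentence yet, keep listening
            cut = min(cuts) + 1         # earliest sentence boundary
            sentence, buffer = buffer[:cut], buffer[cut:]
            speak(translate(sentence, tgt_lang))  # speak the translated sentence
```

Cutting only at sentence boundaries is what keeps the word order right, and it's also why the delay can never drop below one sentence.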
@@Jarods_Journey There is a company called "Interprefy". I think that's the idea of what I'm trying to achieve.
I hope there will soon be a tool like this working in real time, so we can talk naturally with people from other countries and finally break the language barrier that has kept us apart for years.
4:51 Could you use this for English to English with a trained model? And would it be better than So-Vits-SVC?
Ahhh, I don't think it's there yet, unfortunately. But I haven't tried it, tbh.
Jarod, how can I use this tool without the 10-second limitation? I want to use it for longer files; is that possible?
I speak Spanish and it sounds very good.
Please create a tutorial on how to install it locally.
Can these models work on a GTX 1650 4GB, or are they VRAM hungry?
It should fit; the models are less than 3GB.
Ok, I believe you. The first one wasn't super good, but the one translated from Japanese sounded fine. Unfortunately, the model has a bit of trouble with articles and prepositions.
Try Swedish next time!!!!
As the saying goes, this is the worst the tech will ever be :). Can't wait till they add more voices to the expressive model.
@@Jarods_Journey Yes, it's a great way to look at it.
Only 9 seconds works.
It is cool, yeah, but the translation to Spanish is not that good lol.
Hello
"Am I the only one who sees this video in 360p quality?"
I have 1080p 🎉
When I started watching, it was at 360p too. Sometimes in the first minutes right after a video is uploaded, not all qualities are available. Check if you can change the quality in the settings.
From the new upload, too early for 1080! :)
@@Jarods_Journey Notification Squad!
Great video! Just for reference, the model is quite interesting; as a native Spanish speaker, I'd say the translation was decent, a bit robotic, but the pronunciation of Godzilla was quite weird. You have a great channel.