You deserve more subs. Well done and I dig the unique presentation.
Big thanks :)
@@FiveBelowFiveUK Thank you for the work. I used to do open source development, and the endless amount of thankless work and barbs you take just for helping can be tiring. So I respect anyone who jumps into the fire to move the community forward.
Excellent research and improvement. The future of AI video is bright, running locally!
Yes - it's a crazy advancement in my opinion! Agreed :)
Amazing study on this amazing video model! Thank you, brother
The pleasure is all mine - thanks for watching!
I like 🎉 the passion and precision of useful data
Good looking Brother 👌
🤝
Great work mate! 👏👏
Thank you! Cheers!
Great stuff 👍
Big Thanks !!
A wonderful deep dive which is very rare to see! Is video2video possible with Mochi?
Until there is a VAE encoder, all we can do is use IPAdapter or Vision-type nodes to get tokens from an image and then prompt with them. This is only an approximation, not true img2video - but I don't see why a future update couldn't add it in time; I think the model is less than a week old. I'll be covering it for sure! ~
And thanks so much :)
@@FiveBelowFiveUK Great! The most relieving thing is that there's no technical reason the model itself would not bend to that purpose. This model is already so good that it's only a question of time before open source AI video generation is crazy epic!
Most models seem to fail at motion, and video2video is a huge help with that. For example, I have been creating music videos mostly using Runway, but with ComfyUI we will get more control instead of playing an expensive lottery.
In my opinion, to have a genuinely usable video tool for music videos and movies, instead of animated portraits and talking heads, these inputs would be needed:
1. Driving video for motion
2. Reference image for background
3. Reference image for character
4. Text prompt for controlling background action, camera movement and such
Keep on keeping on!
Thanks! Good work. Which workflow gives the best result after your testing?
For maximum quality, the latent sideloading workflows in V6 are almost perfect - certainly peak as far as this model is concerned - and you can also double the steps. Some people use 200 steps, but I found 50-100 was enough. 49 frames was enough for my use case, but I know some want longer clips; V6 Fast is a good example for shorter videos. However, the reason it required so much power to decode the latent files (on Runpod) was that I did not use VAE tiling - which sounds insane, because tiling is how this was even able to run locally in the first place - but I wanted to show that the quality always suffered from ghosting no matter what tiling setting you used. It's the tile seams showing, you see.
Amazing. BTW, how do you create your video avatar? That's awesome
What do you mean? It's just me :)
I used to use After Effects and rotoscoping, but now it's all in a ComfyUI workflow. If you watch the first video on the channel, the secret is hiding in plain sight: Depth + OpenPose, and a wireless mouse :)
34:32 Where do I put those flash attention files?
Because it's so technical I have not even covered it yet, but the short story is that you put the files in the ComfyUI folder (where the startup .bat files for ComfyUI are) and run "pip install filename.whl" - but this can break things, so again, I hope to return to it in a future update.
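For anyone who wants to try it anyway before I cover it properly, here is a rough sketch of what that looks like on the portable Windows build - the wheel filename is a placeholder and the python_embeded path is an assumption based on the standard portable layout, so match both to your own install and Python/torch/CUDA versions:

:: run from the ComfyUI folder that holds the startup .bat files (portable build assumed)
:: the wheel name is a placeholder - use the exact file you downloaded for your Python/torch/CUDA build
.\python_embeded\python.exe -m pip install flash_attn-<matching-version>.whl
:: quick sanity check that the wheel actually matched your environment
.\python_embeded\python.exe -c "import flash_attn; print('ok')"

If that import fails, the wheel did not match your setup - which is exactly the "this can break things" part, so keep a backup of your environment.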
Nice tutorial mate :) I got it running on Runpod, but the results had biiiig ghosting, not usable, and the i2v did not look at my image at all. Also the Mochi model is not stored on the network storage, so each time it starts it loads it again :(. Also, when I just use the decode side, the tiled VAE is not selected and then I get an OOM on an A40 :/ Cheers, Janosch
Thanks for the feedback - it's really helpful for everyone to see what results you get!
I must admit that the Mochi loader nodes are pulling in the models every time, but I think I can improve that by updating my Runpod provisioning script! Expect an update on that front soon ;)
Regarding the decoder OOMs with no tiling - I think this was a 100GB VRAM model (!), so we are still squeezing it in even on 48GB. I wanted to offer the "full fat" option for people who use Runpod as their primary platform.
I only decode 2-second clips without VAE tiling; video VAE decoding takes an insane amount of VRAM because of all the frames.
@@FiveBelowFiveUK No result yet mate hahah, I think my Runpod had a headache. Also I was not aware that there is an i2v for Mochi? Perhaps that's why it's not working as intended? But keep up the great work :)
I think people who recommend cloud/subscription services that cost money in any way don't understand why most people are interested in generating locally.
The whole idea of generating locally is that it's free - no additional payments required beyond the one they've already made for the PC.
I agree to some degree, but cloud services like MiniMax also generate much faster than any high-end PC can. Putting 5 in a queue and generating 2 at a time, sometimes within a few minutes, simply isn't happening locally. I choose to use both cloud and local for now.
It's not only about money. It's about keeping your data as private and safe as possible (assuming that's even remotely possible with AI technology).
Sure, but renting a 48GB VRAM card on Runpod is cheaper than my electric bill, so if it's a matter of money, some people might wanna take that into account.
@@quercus3290 I understand a lot of the reasoning and calculations behind it, but I always struggle when I ask myself 'why'.
Some people probably make models, videos and images as a hobby - I make images for fun myself; it literally replaced gaming for me. Though I can't help but notice that there are also a lot of people who start with generative AI thinking they're gonna make some cash, or fame, or both - sinking money into expensive, time-constrained services.
@@TheGalacticIndian What's privacy anyway?
I've not watched yet but can you do img2video with this?
AFAIK there is no VAE encoder, so all you can do is "Vision to Image", which will approximate an image using a complex description/captions. However, this is only an update away if the author decides to add it - I was even tempted to try writing one myself. It's still early days, so I decided to cover a few other things before coming back to it - hope that helps!
My favorite stick figure is back!
This guy is brilliant
Big Thanks everyone :)
How much VRAM does this tool need?
Depending on the setup - I showed Q8_0 quantized with CLIP FP16 on the CPU, so that would be ~20GB, running on a 4090/3090.
However, there are many quantized setups, and people in the community are running under 16GB. I cannot confirm it, but it is possible to squeeze down to 12GB if you offload CLIP to the CPU, although that required over 32GB of system RAM - there are so many optimization options these days that it's hard to test them all.
For the safest bet, and with the best quality locally, 20GB is where I stake the signpost on this one.
Setting up the CUDA toolkit and Visual Studio is a pain in the ass
Agreed - that is why I have not even covered adding flash attention. To be honest, it's the difference between 20 minutes and 10 minutes; I can wait :)
@@FiveBelowFiveUK Are there any alternatives that you would recommend?
All nodes in your workflows are giving float errors, how can I solve this?
Sounds like you need to update PyTorch to at least 2.5.0 - use the .bat in the ComfyUI update folder that updates with dependencies.
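For reference, on the portable Windows build that usually means either running the "update with dependencies" .bat in the update folder, or doing it by hand - the python_embeded path and the cu124 index below are assumptions, so match them to your own layout and CUDA build:

:: run from the folder with the startup .bat files (portable build assumed)
:: cu124 is only an example index - pick the one that matches your CUDA version
.\python_embeded\python.exe -m pip install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
:: confirm the version afterwards - 2.5.0 or newer is what the fix above asks for
.\python_embeded\python.exe -c "import torch; print(torch.__version__)"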
@@FiveBelowFiveUK OK, thanks, I'll update. Do you have any optimized workflow to run on an RTX 3060 with 12GB VRAM?
Does anybody know what this means: "LayerUtility: PurgeVRAM"? I can't use Install Missing Nodes on it - I've restarted and searched.
github.com/chflame163/ComfyUI_LayerStyle -- this should be what you need; I think it's in the ComfyUI Manager. It unloads VRAM to help fit the models onto your GPU - I place those nodes in the workflow to help with the crazy load these video models use. Full details are in the links in the description!
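If the Manager search doesn't turn it up, here is a manual install sketch - the custom_nodes path is assumed from the standard ComfyUI layout, and the repo is the one linked above:

:: from your ComfyUI custom_nodes folder
git clone https://github.com/chflame163/ComfyUI_LayerStyle.git
:: restart ComfyUI afterwards; if it complains about missing Python packages,
:: install the repo's requirements file (if it ships one) with the same Python that runs ComfyUI

After a restart, the LayerUtility: PurgeVRAM node should show up in the node search.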
Feel free to ask more questions if you have them :)
Can I do it on my 3090?
Yes, there are people in my Discord using a 3090 with this model - you would use the V6 Fast settings or V5 (Q4/Q8 + T5 FP8 Scaled). I will be making new versions to support lower VRAM this week; I had some other things to cover first :) I explain all the different setups in the articles if you can't wait :)
@FiveBelowFiveUK Thanks broski
Change the MIKE!
Watts the problem
Yeah, I keep on looking away from the mic haha. It's a new mic, you see - I'm still getting used to it being on the desk and not on a boom arm.
hahaha :)
Does this work with an AMD GPU?
I would not like to guess, as I do not have an AMD GPU - all I can say is try it and see?
If it doesn't, let us know, because it will help others :)
🙂🙂
I don't mean to complain, because you're providing free info, but that filter on the audio is very distracting. I would prefer no frills and to just hear you clearly as you are. Filters might still be cool, but the one you're using now changes the EQ too much - the low end drops out completely, and it sounds like the speaker cable is halfway unplugged.
Valid criticism is valid - I think it was an Adobe audio preset going funky after changing to a new mic, combined with the new mic also having a crazy cut-off. Hoping it's solved in newer videos - thanks for the feedback; it lets me know to fix it!