Another excellent video! I wish you success with your channel!
Thank you very much!
you have very interesting videos, subscribed
Thanks and welcome!
Do you think it is possible to program deep learning models for a GPU using assembly language for parallel calculations? Also, why aren't companies like Google or Meta programming all deep learning algorithms in assembly to gain more computational power? Thank you, inspiring video!
A while back I searched for NVIDIA assembly and I couldn't find any documentation (maybe I wasn't looking hard enough, or maybe things have changed now). Anyway, assembly is tied to a specific device and may change between families of devices. Thus, at least for NVIDIA, you are stuck with the nvcc compiler. Another option would be to use shaders (I remember seeing some projects about that), but that's still a high-level language. Of course, I can't speak for companies and their reasons for doing things. I think, however, that it doesn't matter that much for a big company, as they can use thousands of CPUs/GPUs/TPUs.
@@ComputingMongoose Indeed, I had forgotten about that detail: the hardware dependency of assembly code, which is a major obstacle to device compatibility and could render all the code obsolete as soon as a new generation comes out. Thank you for your response!
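For reference, a minimal sketch of the usual path being described: you write the kernel in CUDA C, and the toolchain (nvcc or NVRTC) compiles it down to device-specific PTX/SASS, so you never hand-write the GPU assembly yourself. The CuPy RawKernel wrapper below is just one convenient way to show this from Python; the kernel name, sizes, and launch configuration are illustrative choices, not taken from the video.

```python
import cupy as cp

# CUDA C source: this is roughly the lowest level at which you normally program the GPU;
# CuPy/NVRTC compiles it to PTX, and the driver lowers that to device-specific SASS.
add_src = r'''
extern "C" __global__
void vec_add(const float* x, const float* y, float* out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        out[i] = x[i] + y[i];   // one thread per element, executed in parallel
    }
}
'''

vec_add = cp.RawKernel(add_src, 'vec_add')

n = 1 << 20
x = cp.random.rand(n, dtype=cp.float32)
y = cp.random.rand(n, dtype=cp.float32)
out = cp.empty_like(x)

threads = 256
blocks = (n + threads - 1) // threads
vec_add((blocks,), (threads,), (x, y, out, cp.int32(n)))  # launch: grid, block, args

assert cp.allclose(out, x + y)
```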
Great information, bro! But that blue theme hurts a lot of people's eyes :'l
Sorry about that! Thanks for letting me know. It's actually the default setting in Midnight Commander and I'm quite used to it. I will try to change it to some other colors (maybe a black background?). But there will be a couple more videos released with this blue background, as I have filmed them already and they only need editing, thumbnails, and YouTube stuff.
Get used to it.
This is how computers were for most of history.
@@RichardLofty Indeed! And I still enjoy these basic interfaces, yet I do understand that for younger people it may seem weird.
Haha, that's the default for Midnight Commander.
As PyTorch itself states, the priority is usability first and speed second, and it is not only a wrapper over a C++ library, so I suppose part of the work is done in Python. I also wonder if it is possible to set some "silent" flag, because if this progress reporting is done in Python it also eats a lot of time (on such a small model), even when writing to a file.
Can you compare it with Candle?
I think indeed this is due mostly to the Python part. I was, however, expecting a bit more to be done in C++. I am not familiar with Candle, but I looked at it and it does seem like a good thing to try. For the moment I'm focusing on adding more complexity to my network. I'll release some more videos with new additions. Afterwards, more experiments will follow.
@@ComputingMongoose Another obvious option is TensorFlow, but I have no idea how much work is needed to use it... 😅
I'm waiting for more assembly adventures.
@@AK-vx4dy There are indeed many frameworks out there... Anyway, more assembly stuff will be coming soon.
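On the "silent" flag idea from earlier in this thread: a minimal sketch of how the Python-side overhead could be reduced by logging only every N epochs instead of every epoch. The model, epoch count, and the log_every parameter here are illustrative assumptions, not the exact setup used in the video's comparison.

```python
import torch
import torch.nn as nn

# Hypothetical small model, similar in spirit to the one compared in the video.
model = nn.Sequential(nn.Linear(2, 4), nn.Sigmoid(), nn.Linear(4, 1))
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.rand(64, 2)
y = (x.sum(dim=1, keepdim=True) > 1).float()

log_every = 1000  # set to 0 to stay completely "silent"
for epoch in range(10_000):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    # Only touch Python I/O occasionally; per-epoch printing or file writes
    # can dominate wall-clock time on a model this small.
    if log_every and epoch % log_every == 0:
        print(f"epoch {epoch}: loss {loss.item():.6f}")
```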
You MUST train a network on XOR. It's nonlinear, and is a better benchmark.
If you're thinking about the final result (in terms of end loss, accuracy, or another metric), then I agree. But from a time perspective, it doesn't really matter, since the number of epochs is fixed and thus the same number of operations is performed. I do intend to extend the network, though, and add multiple layers and solve some more complex problems.
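For reference, a minimal sketch of what an XOR benchmark could look like in PyTorch: the four XOR points are not linearly separable, so a hidden layer with a nonlinearity is required. The layer sizes, optimizer, and epoch count are arbitrary choices for illustration, not taken from the video.

```python
import torch
import torch.nn as nn

# The four XOR points: not linearly separable, so a single linear layer cannot fit them.
x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

model = nn.Sequential(nn.Linear(2, 4), nn.Tanh(), nn.Linear(4, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()
opt = torch.optim.Adam(model.parameters(), lr=0.05)

for epoch in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

with torch.no_grad():
    print(model(x).round())  # expected: 0, 1, 1, 0
```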
So, how do you scale it to run GGUF models? I'm able to run Llama 3 405B on a used server motherboard with 12 RAM slots (just $150), but the speed is horrendous with current tools (less than 1 token/sec on a 22-core/44-thread Xeon CPU at the best Q8 quality). Although Llama models are boring because of censorship, it's the biggest open model today.
I am working towards a network with multiple layers and with some data loading. But I don't think I will implement loading of PyTorch or similar models, since this is quite tedious in pure assembly.
@@ComputingMongoose So, can it be used for training models? All current open models are based on llama.cpp, which, as I remember, Stanford made from Facebook's "leaked" (not really) model; kind of the whole architecture foundation rests on that.
@@fontenbleau I am not working towards Llama architectures, but if I continue working on it, it will likely become able to train some more advanced models. But again, I am not working specifically towards Llama or other particular architectures. I am also not targeting NLP applications specifically (NNs can be used for images, voice, sensor data, etc., apart from NLP). However, it would make for an interesting application to be able to perform some text analysis with the assembly language network. I will have to think about what the easiest application to implement would be.