NVIDIA Jetson Orin Nano SUPER Unleashed: Build an AI Super Cluster

  • Published: 24 Jan 2025

Comments • 117

  • @rawhideslide
    @rawhideslide 14 days ago +17

    Home Assistant here I come!
    I have been hesitating to replace my cloud-based home automation (built around the Alexa virtual assistant) due to concerns about local performance for both Whisper STT and a trivial LLM. A 2B parameter model looks like it is much more intelligent and should be able to pass the "wife test". Wish me luck; I anticipate that it will take me a year to return from this journey, or should I say rabbit hole!

    • @vasiovasio
      @vasiovasio 14 days ago +2

      Good luck!
      I will put the video URL in a Google Calendar notification for a year from now to ask you how it went.
      This affordable Nvidia hardware is interesting to me too, but I have zero experience running LLMs locally.

    • @AndreiFinskiGPlus
      @AndreiFinskiGPlus 14 days ago +1

      Good luck! My main problem so far is the absence of streaming API support for the Home Assistant integration, so you have to synchronously wait for the HTTP response to a chat message, which still makes it really slow even at 21 tps for the 3B Llama 3.2 (see the sketch after this thread).

    • @amihartz
      @amihartz 7 days ago

      @@AndreiFinskiGPlus my biggest problem is that they all sell for like freaking $500 because nvidia makes like 12 of them for the whole world market
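
A minimal sketch of the streaming approach @AndreiFinskiGPlus is asking about above, assuming a local Ollama server and its documented /api/generate endpoint; the model tag and prompt are just examples, and a Home Assistant integration would forward each chunk instead of printing it.

```python
import json

import requests

# Stream tokens from a local Ollama server instead of blocking on the full reply.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2:3b", "prompt": "Turn on the kitchen lights.", "stream": True},
    stream=True,
    timeout=120,
)
for line in resp.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    # Each chunk carries the next piece of generated text; forward it as soon as it arrives.
    print(chunk.get("response", ""), end="", flush=True)
    if chunk.get("done"):
        break
```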

  • @chipcode5538
    @chipcode5538 15 days ago +2

    Thanks Gary, this is a nice solution to run larger models on lower-cost hardware.

  • @kryptonic010
    @kryptonic010 15 days ago +2

    As always, good stuff Gary!

  • @rickytomatoes
    @rickytomatoes 12 days ago +1

    This was an amazing demo and tutorial - thanks for posting!

  • @nathanbanks2354
    @nathanbanks2354 10 days ago

    This is really neat. I never thought of making a cluster, and it seems like your cluster really is running almost twice as fast as the single node. I tried running all these models using ollama in Arch Linux on my laptop with an old P5000 GPU with 16GB of RAM. I ended up getting 40 tokens per second on the first two models, but only 5.6 tokens per second on gemma2-9b-q8 (which I created based on the gguf file). This means I was running around 2x as fast whenever you used just one Orin Nano, but around the same speed when you had 2 Orin Nanos calculating. It's good to see that ollama is smart enough to saturate both machines in the cluster. (Unless my GPU is just bad at 8-bit quantization or the Jetson is bad at 4-bit quantization.)
    I don't know how ollama does this. With a simple implementation that puts the first layers on machine A and the last layers on machine B, it wouldn't be possible for both machines to process the info at the same time for the same query. Maybe it manages to split each layer in half without increasing network traffic too much.
    Perhaps you could run nvidia-smi or a similar tool on both workers to see if the processor is at 100% as it's generating. When I tried renting a machine with 4x 4090 GPUs, I couldn't get them all up to 100% usage running Llama 70b or Mixtral 8x22b, but I suppose Mixtral is designed to avoid saturating all the GPUs.
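
Following the nvidia-smi suggestion above, a minimal polling sketch; it assumes nvidia-smi is available on the worker (on the Jetson boards themselves the integrated GPU is reported by tegrastats rather than nvidia-smi, so the command would need swapping there).

```python
import subprocess
import time

def gpu_utilization_percent() -> int:
    """Read instantaneous utilization for the first GPU via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip().splitlines()[0])

# Sample once a second while the cluster is generating tokens.
for _ in range(60):
    print(f"GPU utilization: {gpu_utilization_percent()}%")
    time.sleep(1)
```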

  • @themojoman
    @themojoman 15 days ago +29

    I wonder how the cluster performance of the two Jetson Orin Nanos (2 × $249) compares to the performance of the base M4 Mac mini for $599 when running the same models.

    • @andikunar7183
      @andikunar7183 15 days ago +4

      Like Gary said, it's much slower because of the network. In my view, one Jetson Orin Nano Super has stand-alone token-generation performance comparable to an 8GB M2 Mac mini. And you have the additional cost of an M.2 drive and case. BUT it has CUDA and can run the NVIDIA software stack, even paired with Apple silicon Macs (like their Project DIGITS will). The Super with the reduced price is not yet available in Europe. And contrary to Gary, I'm not totally sure if it's really the same hardware or just specially tested/selected modules from their production (I heard about some people having issues with the faster speed). So I'm waiting for mine.

    • @flashxcate
      @flashxcate 15 days ago

      @@andikunar7183 It is available in Europe through official resellers. I bought mine through RS Components yesterday and it should arrive by the end of the month.

    • @OrientalStories
      @OrientalStories 15 days ago +5

      nvidia is even more scummy than apple

    • @GaryExplains
      @GaryExplains  15 days ago +7

      @andikunar7183 It is exactly the same hardware. Nvidia confirmed this to me and my old board is running perfectly at the new speeds.

    • @andikunar7183
      @andikunar7183 14 days ago

      @@GaryExplains thanks a lot for the clarification.

  • @RaysNewLife
    @RaysNewLife 13 days ago +4

    I was considering going this route until Project DIGITS was announced.

  • @artemplatinoff1185
    @artemplatinoff1185 15 days ago

    I was waiting for this video since the Orin came out, thank you.

    • @artemplatinoff1185
      @artemplatinoff1185 15 days ago

      The specs say the Type-C port is data only, but can it work as a USB slave? If so, you could run a Type-C cable from srv1 to srv2 and bridge them using the g_ether gadget driver.
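
If the port does enumerate as a USB gadget network interface (g_ether, or the USB device mode that ships with L4T, both expose a usb0-style interface), the bridge is just a point-to-point link with static addresses. A hypothetical sketch; the interface name and addresses are assumptions.

```python
import subprocess

def setup_usb_link(local_ip: str, iface: str = "usb0") -> None:
    """Give the USB gadget interface a static address and bring it up."""
    subprocess.run(["sudo", "ip", "addr", "add", f"{local_ip}/24", "dev", iface], check=True)
    subprocess.run(["sudo", "ip", "link", "set", iface, "up"], check=True)

# Run setup_usb_link("10.0.0.1") on srv1 and setup_usb_link("10.0.0.2") on srv2,
# and the two boards can then reach each other over the USB cable.
```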

  • @Waveshare_Ruan
    @Waveshare_Ruan 14 days ago

    Cool! Gary got it so quickly, I'm truly jealous! 😄

  • @andikunar7183
    @andikunar7183 15 days ago +2

    Wow, another perfect video, thanks!!!!

  • @forthepeople1664
    @forthepeople1664 12 days ago

    Good video, nice setup, then technical dive.

  • @rtos
    @rtos 15 days ago +3

    When an AI model is running, is it possible to run other Linux applications in the background? This is usually not possible for a model running off the CPU, which gets loaded close to 100%. However, on this board the main data crunching should be handled by the GPU, leaving the CPU free for other tasks?

    • @georgehooper429
      @georgehooper429 15 days ago +3

      I purchased one of these Orin Nanos when they were first released at the cheaper price. I've been running llama3.2:3b, and when running ollama (during a query) it consumes 100% of the GPU, but almost no CPU is being used. As the query starts and stops you see a small blip on one of the cores, but not much else. On my Nano I'm running ollama, whisper and piper, so the GPU is used for both whisper and piper. So depending on your additional services, as long as it all fits in the shared 8GB of memory it should support more stuff (see the sketch after this thread for a rough memory check).

    • @rtos
      @rtos 15 days ago

      @georgehooper429 Thanks for the clarification!
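
A rough way to sanity-check the "as long as it fits in the shared 8GB" point above: on these boards the CPU and GPU share the same memory, so summing the model files you want resident against total system RAM gives a first approximation (ignoring KV-cache and runtime overhead). The file paths are hypothetical.

```python
import os

MODEL_FILES = [
    "/models/llama-3.2-3b-q4.gguf",        # hypothetical paths to the resident models
    "/models/whisper-small.bin",
    "/models/piper-en_US-voice.onnx",
]

def total_model_gb(paths) -> float:
    """Approximate memory footprint as the sum of the on-disk model sizes."""
    return sum(os.path.getsize(p) for p in paths) / 1e9

with open("/proc/meminfo") as f:
    mem_total_gb = int(f.readline().split()[1]) / 1e6  # MemTotal is reported in kB

print(f"Models need roughly {total_model_gb(MODEL_FILES):.1f} GB "
      f"of the {mem_total_gb:.1f} GB shared by the CPU and GPU.")
```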

  • @BrianThomas
    @BrianThomas 13 days ago +2

    For running LLaMA and similar LLMs, a small cluster of up to 4 Orin Nanos may offer a balance between performance and cost. Beyond this, the ROI decreases, and investing in a more powerful single unit like the Jetson Xavier NX or higher-end GPUs would be more cost-effective and efficient.

  • @TheGrizz485
    @TheGrizz485 15 days ago +5

    These are basically binned-down Switch 2 APUs. Sell them as devkits instead of throwing them away.

    • @GaryExplains
      @GaryExplains  15 days ago +2

      But these have been on sale for over a year and the Switch 2 hasn't even been announced yet, yet you claim Nvidia has been selling binned versions of the processor for over a year. Really?

    • @GaryExplains
      @GaryExplains  15 days ago +4

      Also you seem to misunderstand the role of the Orin modules.

    • @TheGrizz485
      @TheGrizz485 15 days ago

      @@GaryExplains Yes. I think the T239 SoC has been around for 2 years. The full non-disabled chip (in the Nintendo Switch 2) has 1,536 CUDA cores.

    • @kettusnuhveli
      @kettusnuhveli 15 days ago +2

      @@GaryExplains Gary, I know it sounds crazy but he is not far off the mark. The original Switch uses the Tegra X1 chip (GM20B GPU), the exact same one that was used in the Jetson TX1, and by the time the original Switch came out that chip was already a few years old. Hardware development takes time, so you're never gonna see brand spanking new components in devices like that.

    • @GaryExplains
      @GaryExplains  15 days ago +3

      @kettusnuhveli @TheGrizz485 The problem with your reasoning is that you think the processor exists because of the Switch 2, and that NVIDIA is then looking for other uses for the chips. That is nonsense. It is the other way around. NVIDIA has the processors for its own use (like the Orin modules, etc.) and then if it can get a contract on top for something like the Switch 2, great. The logistics of making those chips are huge and expensive. Obviously NVIDIA will make that as efficient as possible. But to claim they are just binned chips from the Switch 2 is simply not true.

  • @vasiovasio
    @vasiovasio 14 days ago +2

    Great video, Gary!
    After five months or so, when it is available, please make a deep review of Project DIGITS and compare the results with this tiny machine! ☺️☺️☺️

    • @px1690
      @px1690 14 days ago

      One simple answer: RAM. Having 128GB available for running models, LoRAs, conditioners, etc., all in one pipeline without the long load times between the various steps, is why DIGITS exists. You can't get that with these Jetson thingies without serious hardware tampering.

    • @vasiovasio
      @vasiovasio 13 days ago

      @@px1690 Thank you for your answer!
      I researched the topic over the last few weeks - locally, something like, let's say, Llama-3.3-70B requires around 100GB of RAM to run.
      The Jetson Orin really seems to be the test kit from NVIDIA in a mega-affordable format for all of us who want to try, taste, and experiment with how things generally work, but DIGITS, or even two of them stacked, will be the really useful working hardware with enough power.
      And a polite reminder - I asked Llama-3.3-70B yesterday for the size of its training data, and it says it is 1.5TB... this is like a big encyclopedia with some data on almost every aspect of life, yet it is such a small dataset.
      The next big bottleneck will be the absence of quality, already-validated training data.
      One thing is for sure - things move at super fast speed!

  • @azmyin
    @azmyin 15 days ago +1

    Hi Mr. Gary, so is it possible to “upgrade” an older Orin Nano to “Super” just with a software update?

  • @johnhoffmann1565
    @johnhoffmann1565 11 days ago

    Great work! Thank you.

  • @paulturner5769
    @paulturner5769 15 days ago +3

    Every time I come to a Gary Explains video I have to turn up the volume substantially.
    Please Gary, normalise your volume level!

    • @GaryExplains
      @GaryExplains  15 days ago

      I do.

    • @xxportalxx.
      @xxportalxx. 14 days ago

      Yeah I just had the same experience with this vid, went from the last vid sounding like shouting, to this one sounding like a whisper.

    • @paulturner5769
      @paulturner5769 13 days ago

      @@GaryExplains I have no doubt of your technical competence, and I assure you this is not a problem with only your channel.
      I expect the problem occurs post-production, in the upload phase. Do you log on to the actual internet and use a straight PC to check that volume levels are as you intended?
      It is clearly a big problem judging from the number of videos addressing it - ruclips.net/user/results?search_query=why+are+some+youtube+channels+so+quiet
      (I love your content and presentation by the way, so in the meantime I will just have to keep twiddling the volume on my desktop.)

  • @ronp5615
    @ronp5615 14 days ago

    Possible to upgrade the VRAM on each without a CNC or other complicated process?

  • @savasirez
    @savasirez 14 days ago

    Great video Gary, thanks. Two questions: do you think using a Type-C to 10GbE Ethernet adapter to form a cluster would help with the AI performance, and do you think we can install different Linux distros (e.g. RHEL AI) as the OS to make use of different AI models?

    • @GaryExplains
      @GaryExplains  14 days ago +1

      I am not sure how much it would improve, but I think there would be some improvement for sure. Running the Llama 3.2 model (3GB) over two nodes gives 11 tokens per second, down from 21 on a single node, so I guess the Ethernet is part of the bottleneck. Obviously the 10GbE adapter would need to be Linux compatible, with the driver working on Arm (not x86). As for other OSes, I think that is harder. What other AI models does RHEL AI offer that llama.cpp or ollama don't?
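
For anyone trying to reproduce the two-node numbers above: the cluster in the video is built on llama.cpp's RPC backend, where each Jetson runs an RPC worker and the head node points at all of them. A minimal sketch; the binary names, flags and addresses are assumptions that may differ between llama.cpp versions, and (as @briancase6180 notes further down) one of the workers can also act as the head node.

```python
import subprocess

WORKERS = ["192.168.1.101:50052", "192.168.1.102:50052"]  # hypothetical Jetson addresses

def start_worker(port: int = 50052) -> None:
    """Run on each Jetson: expose its GPU to the cluster over TCP."""
    subprocess.run(["rpc-server", "--host", "0.0.0.0", "--port", str(port)], check=True)

def run_prompt(model_path: str, prompt: str) -> None:
    """Run on the head node: split the model's layers across the RPC workers."""
    subprocess.run([
        "llama-cli",
        "-m", model_path,
        "--rpc", ",".join(WORKERS),
        "-ngl", "99",        # offload as many layers as possible to the (remote) GPUs
        "-p", prompt,
    ], check=True)
```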

  • @spiceMonkey007
    @spiceMonkey007 9 days ago

    Great work, Gary! Are these devices connected through ethernet cables or wifi?

  • @nhtna4706
    @nhtna4706 14 days ago

    What do you think of the new DIGITS product? You should do a review on that and how it differs from this, and the kind of capabilities one can run, at both a personal level and a small-company level.

    • @GaryExplains
      @GaryExplains  14 days ago +2

      The video before this one was about DIGITS.

  • @brianpearson2532
    @brianpearson2532 15 days ago +2

    Did NVIDIA just put a bigger fan on the heat sink to allow for the higher clock speed on the “new” boards?

    • @GaryExplains
      @GaryExplains  15 days ago +3

      I don't think so, the heatsink and the fan look the same size to me.

  • @Johnassu
    @Johnassu 14 days ago

    So, what I'm really curious about is if the die size actually got smaller.
    The old Orin Nano used a half-disabled, underclocked version of the huge 448 mm² GA10B from the $1600 Orin AGX, basically die harvesting.
    That expensive original chip made the Nano cost $500.
    But now it's super cheap at $249. My theory is that Nvidia made a new, smaller chipset instead of the GA10B, probably for the Switch 2. They’re likely using the lower-yield chips from mass producing that in the Orin Nano Super.
    The fact that only the Nano is cheaper kinda proves it. So, I really wanna know the die size of the Orin Nano Super under that cooler.

  • @gsestream
    @gsestream 2 days ago

    Try a micro-FPGA gate logic network instead. It's fixed logic but FPGA, for any workload. It's optimized for the workload.

  • @AhmadQ.81
    @AhmadQ.81 15 days ago +3

    I think the AMD AI Max with 128 GB of RAM would be a better solution for running LLMs, and I expect to see it inside small NUC-style PCs soon, similar in size to a Mac mini 😅🎉

  • @warrensanders5744
    @warrensanders5744 13 days ago

    I'm beginning to feel that these LLMs are "topic specific"? How can I explore this concept?

    • @GaryExplains
      @GaryExplains  13 days ago

      What do you mean exactly and how would you like to explore the idea?

  • @angelmesa9530
    @angelmesa9530 10 days ago

    Thanks for the Spanish track option.

  • @MichaelSkinner-e9j
    @MichaelSkinner-e9j 15 days ago +1

    How does the CPU compare with AMD, Intel, Apple, and Qualcomm?
    I'm curious how this and the full version compare to, say, an 8600G/8700G or something similar (performance metrics).
    How does this compare to AMD's Zen 4c cores? And isn't the price similar?

    • @andikunar7183
      @andikunar7183 15 days ago

      You cannot directly pair an NVIDIA GPU with Apple or Snapdragon X silicon, so the comparison is not that interesting. CUDA is a big advantage over Apple's MPS/MLX. As for the token-generation speed Gary showed, it's similar to an M2 - but TG is mostly determined by memory bandwidth (which is nearly identical to the M2 and a bit slower than a base M4 or Snapdragon X). It's not available yet as the cheaper "Super" in Europe, so I'm waiting for mine. It will be interesting to see how it performs with fp4/fp8 quants or in containers (which Apple's MPS doesn't support).

    • @MichaelSkinner-e9j
      @MichaelSkinner-e9j 15 days ago

      @@andikunar7183 What I mean is the amount of compute/AI you get. The reason I compare it TO them is cost: this is $249, which does compare to AMD's 8600G, or even Apple's cheaper offerings.

    • @andikunar7183
      @andikunar7183 15 days ago

      @@MichaelSkinner-e9j Compute/AI comparisons are much more complex than single numbers / TOPS, so I can't really help. It depends on your use-case. LLM token-generation is largely determined by RAM bandwidth - e.g. a Snapdragon X Elite (using only its CPUs and Q4_0 quantization) is as fast as an M2 Mac and this kit running on their GPUs. But prompt-processing on the Snapdragon X is slower because llama.cpp/ollama etc. don't yet support the Snapdragon X's GPU. I still think CUDA is the decisive element for this board. Its 8GB of total CPU+GPU RAM is tiny for LLMs.

    • @MichaelSkinner-e9j
      @MichaelSkinner-e9j 15 days ago

      @ I'm talking about the overall compute you get for the money. Look at the price for each one, and their performance across all metrics, not just that.
      Have you ever heard of Phoronix? A long time ago, he used to have a blog that compared x86 and RISC CPUs on various operating systems and various metrics: the CPU, memory bandwidth, GPU, compression, and so on. Adding AI is just another metric.
      Technically, all CPU systems can do AI. Some do it better than others. That's just another metric on their test suite.
      The performance gulf between x86 and ARM has gradually shrunk. Apple's M4 is a perfect example, besides AMD's own refresh when they added an NPU.
      Usually, the real difference between them all is dedicated hardware for whatever task. They all do the same thing though.
      All of them can run LLMs. Just some can do it better than others.
      Compute is compute. The difference is adding specialized hardware for specialized tasks, like Intel did eons ago for QuickSync.

    • @MichaelSkinner-e9j
      @MichaelSkinner-e9j 15 days ago

      ⁠@@andikunar7183 I disagree as far as comparing chips for AI.
      Any developer can develop AI for any hardware stack. It's just that tensor cores or NPUs are specially designed for that, like people had special hardware for decoding.
      All of these chips have CPUs (serial processors), RAM to load/store, GPUs for parallel processing, special decode hardware, and now they are starting to add special hardware tailored for AI tasks like smoothing in video/photo editing and making dictation in language scripts/models easier.
      They all have similar baseline stuff; it's just that each manufacturer goes about it differently in how they lay it out at the chip level and how many registers/how wide they go down the CPU line, besides their secret sauce.
      You can compare them all. You can run all of them through the same suite of tests. That's what I was asking.

  • @RPhaF
    @RPhaF 15 days ago

    Why not use Exo? It's specially made to distribute inference over a cluster of small machines (laptops, Raspberry Pis, phones, etc.) over Wi-Fi.

  • @ray-charc3131
    @ray-charc3131 14 days ago

    Did it originally have 2GB of RAM, or has that changed to 8GB?

    • @GaryExplains
      @GaryExplains  14 days ago

      It has 8GB, although there is a 4GB version, but not for the development kit. This is a Jetson Orin Nano, not a Jetson Nano.

  • @ivlis32
    @ivlis32 14 days ago +3

    What's the point of releasing the hardware and then not being able to distribute it? It is sold out everywhere and scammers on Amazon resell this board for $550 (who the hell needs it at that price?).

  • @maneeshs3876
    @maneeshs3876 10 days ago

    Nice video

  • @mikesxoom
    @mikesxoom 15 days ago

    Impossible to set up. I have been trying for days and all I get is a black cursor. Very frustrating.

    • @GaryExplains
      @GaryExplains  15 days ago

      Obviously it works as I show in the video. Does llama.cpp work on a single node?

  • @galihpa
    @galihpa 15 days ago

    Good video but the audio needs to be louder

  • @PetersonChevy
    @PetersonChevy 12 days ago

    Would love to see you cover Nvidia's new DIGITS, the $3,000 mini PC for LLMs, when it comes out.

  • @expensivetechnology9963
    @expensivetechnology9963 14 days ago

    #GaryExplains May I buy you a coffee? I've been struggling with calculating how much VRAM an LLM requires based on its parameters. ChatGPT-4.0 gave me a formula that defaulted to FP16, and so I was convinced that whatever the parameter count (e.g. 100B, 405B), these models required roughly double that number in GB of VRAM. Then NVIDIA claimed their DIGITS system with 128GB of VRAM could run 200B parameter LLMs, and when I asked ChatGPT it said they must be employing quantization with FP8, and it did the math - but there still wasn't enough VRAM - yet I knew NVIDIA didn't lie or make a mistake. And finally in your video I saw that running machine learning inference workloads at 4-bit precision is an option (it never occurred to me), which drops the VRAM required by half again. I have a lot of questions about the future of MLI. Are you accepting new friend applications right now?

    • @GaryExplains
      @GaryExplains  14 days ago +1

      Yes, the quantization is important. In a nutshell, the amount of RAM needed is about the same as the model file size. Look at ollama.com/library, click on a model, and see how big the file is (rough arithmetic in the sketch after this thread).

    • @expensivetechnology9963
      @expensivetechnology9963 14 days ago

      @ I've been in IT for 30 years. We've all heard the AI hype reaching a crescendo. I mean, come on! Now my clients may subscribe to premium API calls (e.g. 1,000 Brave Search API calls @ $0.03 or something). IT consultants need a plan to manage this madness. I'm fascinated by MLI workloads. I'm going to need your email address. Too many questions. You're smarter than ChatGPT today.
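
The rule of thumb Gary gives above (RAM needed ≈ model file size) works because a GGUF file is essentially the quantized weights, and the back-of-the-envelope version is parameters × bytes per weight. A rough sketch of that arithmetic, ignoring KV-cache and runtime overhead:

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Memory for the weights alone: parameter count times bytes per weight."""
    return params_billions * 1e9 * (bits_per_weight / 8) / 1e9

for bits in (16, 8, 4):
    print(f"70B model at {bits}-bit: ~{weight_memory_gb(70, bits):.0f} GB")
# -> ~140 GB (FP16), ~70 GB (FP8/Q8), ~35 GB (Q4): the quantization level decides
#    whether a model fits in 8 GB here or in 128 GB on something like DIGITS.
```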

  • @wtftolate3782
    @wtftolate3782 15 days ago +1

    This is currently running $515 USD on Amazon!

    • @GaryExplains
      @GaryExplains  15 days ago +1

      Don't buy it from Amazon, use one of the official distributors.

    • @123ftw1
      @123ftw1 12 days ago

      @@GaryExplains If they are in stock one day. They have 3 official distributors in the UK; 2 of them are out of stock and the 3rd is...out of stock, but at least I can backorder it for £244.80. I'm hopeful they'll complete my order in 10 days.

  • @briancase6180
    @briancase6180 13 days ago

    But, can't a single machine be both the master and an RPC server? Of course it can, so your first split example could have used two Orin Nano Supers and skipped the third, master machine (your laptop in the example). And, you ran the 9B model in Q8, not Q4, so the comparison is not exactly apples-to-apples.

  • @meme-eg8uw
    @meme-eg8uw 9 days ago

    Why didn't we say anything about the motherboard being a 7mm increase?

  • @m3mee2010
    @m3mee2010 12 days ago

    Question: why do I need this? For $20 per month I get the massive performance of ChatGPT-4... seems like a major waste of time and money...

    • @GaryExplains
      @GaryExplains  12 days ago +1

      If you are thinking of this as a chatgpt replacement then you don't need it, it isn't for you.

  • @Grrr2048
    @Grrr2048 15 days ago

    You da man!

  • @robertneale6986
    @robertneale6986 6 hours ago

    Would a USB-C network be faster?

  • @Adam-fl9uc
    @Adam-fl9uc 8 days ago

    The problem is that 4 of them are still slower than the Ryzen AI mini PCs for roughly the same price.

    • @GaryExplains
      @GaryExplains  8 days ago

      That is an interesting observation. Can you point me to the LLM token speeds for the Ryzen AI mini PC? I haven't seen those numbers.

  • @lp3037
    @lp3037 5 days ago

    The biggest problem is that it is out of stock, and I don't want to pay 2-3 times the original price on Amazon.
    By the time I can get my hands on one, it will be 2026!

  • @PankajDoharey
    @PankajDoharey 14 days ago

    It has more CUDA cores.

  • @amihartz
    @amihartz 7 days ago

    These look cool, but why are they so expensive? It says $250 but I only see them selling for $400+. Does not seem worth it.

  • @TheSmiths-0121
    @TheSmiths-0121 2 days ago

    The small memory is the one reason I don't want these. Swappable RAM so we could do at least 64GB would be nice, but it is Nvidia, so that's expected.

  • @Larimuss
    @Larimuss 13 days ago

    A 2GB LLM is gonna be dumb af. The only use case is training or LoRA models for specific cases; you're still gonna need an API for decent chat etc., which you can do on a $10 board 😂 Just a fancy toy with some small use cases.

    • @GaryExplains
      @GaryExplains  13 days ago

      2GB? Did you watch the rest of the video or did you stop at the Llama 3.2 demo? It can easily run a 7B parameter LLM plus other models like PeopleNet etc. What $10 board can run a 7B parameter LLM?

    • @Larimuss
      @Larimuss 13 days ago

      @ Oh fair enough, no I stopped at the 2GB model 😅 Considering it's supposed to be for AI dev stuff, I just expected Nvidia to give more for $250 USD. 8GB chips are not expensive.

    • @GaryExplains
      @GaryExplains  12 days ago

      Technically it is for AI Edge stuff, not general AI dev stuff.

  • @gu9838
    @gu9838 3 days ago

    too bad it seems to be sold out everywhere lol

  • @Walker956
    @Walker956 1 day ago

    Of course it's called Super.

  • @KartikeyPatel-f8e
    @KartikeyPatel-f8e 3 days ago

    We'll information 😅

  • @charlesscholton5252
    @charlesscholton5252 7 days ago

    Project Digits
    ..

  • @MrBobWareham
    @MrBobWareham 3 days ago

    Way too much money for an SBC

    • @GaryExplains
      @GaryExplains  3 days ago

      In fairness it isn't an SBC in the traditional sense, it is more an edge AI device.

  • @OrientalStories
    @OrientalStories 15 days ago

    Here you go, that will be $1k per 1GB of VRAM.