Build Your Own GPU Accelerated Supercomputer - NVIDIA Jetson Cluster

  • Published: 27 Dec 2024

Comments • 225

  • @xenoaltrax485
    @xenoaltrax485 4 years ago +181

    Fun fact: with 128 CUDA cores in a Nano, how many cores actually perform the square root operations in the program? Answer: zero. Yep, with the Nano being based on Nvidia's Maxwell architecture, not one of those 128 cores is capable of computing a square root directly. Instead the Nano's single Maxwell SM (streaming multiprocessor) comes with 32 SFUs (special function units) which are used to compute the square root. But even quirkier, these SFUs only know how to compute the reciprocal square root, as well as the regular reciprocal operation. So to get a square root the SFU will actually execute two instructions: a reciprocal square root, followed by a reciprocal. Strange but true! But actually documented in Nvidia's "CUDA C Programming Guide" in the section on "Performance Guidelines: Maximize Instruction Throughput".
    Ah yes, the joys of having a day job as a CUDA programmer. You get to be gobsmacked every day by the weird ways you need to go about trying to optimize your programs to scrimp and save on every precious clock cycle :P
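    For readers wondering why two operations are enough, the identity behind the two-instruction sequence described above is simply (for $x > 0$):

        \sqrt{x} \;=\; \frac{1}{1/\sqrt{x}} \;=\; \mathrm{rcp}\bigl(\mathrm{rsqrt}(x)\bigr)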

    • @pluralcloud1756
      @pluralcloud1756 4 years ago +4

      I like your depth of thought. Can you point us to some info so we can learn the important tech and understand why and how you determined what you stated? Thanks for the comment. Background: I bought into Nvidia CUDA many years ago for video post-processing and could never really take advantage of it, but now I want to for AI/ML solutions for IoT.

    • @xenoaltrax485
      @xenoaltrax485 4 years ago +7

      @@pluralcloud1756 The info can be found in Nvidia's "CUDA C Programming Guide", here's a direct link to the pertinent section on arithmetic instruction throughput: docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions

    • @ProDigit80
      @ProDigit80 4 years ago +1

      Not every program needs to calculate a square root, and you're incorrect in that statement too.
      Nvidia's CUDA cores are stream processors; they do 16- or 32-bit FLOPs.
      The actual full double-precision (64-bit) cores calculate a square root very precisely; but even 32-bit floating point has, what, a 23-bit mantissa? That's good enough to calculate a square root quite accurately!
      Anyway, it's way more accurate than your handheld calculator!

    • @xenoaltrax485
      @xenoaltrax485 4 years ago +18

      @@ProDigit80 The CUDA cores do not calculate the square root directly. This is easy to verify: make a simple kernel which calculates a square root, e.g. "__global__ void sqrtkern(float fi, float *fo) { *fo = sqrt(fi); }". Then use NVCC with the "-cubin" option to generate a CUBIN file. Then use CUOBJDUMP on this CUBIN file with the "-sass" option to generate the SASS file which contains the actual low-level assembly instructions for the GPU. Check the SASS file and you will see an instruction "MUFU.RSQ", which is a multi-function unit instruction to calculate the reciprocal square root and is issued to the SFUs. So from the assembly you can clearly see that the kernel is using the SFUs to compute reciprocal square roots rather than using the CUDA cores.
      If you want to avoid using the SFUs and want to solely use the CUDA cores then you have to write your own square root function, meaning do not use the "sqrt()" built-in function in your code.
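      For anyone who wants to try the check described above, here is a minimal, untested sketch; the sm_53 architecture flag (the Nano's Maxwell GPU) and the file names are just illustrative:

          // sqrtkern.cu -- square-root kernel plus a tiny host-side test.
          // Inspect the generated machine code with roughly:
          //   nvcc -arch=sm_53 -cubin -o sqrtkern.cubin sqrtkern.cu
          //   cuobjdump -sass sqrtkern.cubin      (look for MUFU.RSQ)
          #include <cstdio>
          #include <cuda_runtime.h>

          __global__ void sqrtkern(float fi, float *fo) {
              // sqrtf() is expected to be lowered to a reciprocal square root
              // on the SFU (MUFU.RSQ) followed by a reciprocal/refinement step.
              *fo = sqrtf(fi);
          }

          int main() {
              float h_out = 0.0f, *d_out = nullptr;
              cudaMalloc((void **)&d_out, sizeof(float));
              sqrtkern<<<1, 1>>>(2.0f, d_out);
              cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
              printf("sqrt(2) ~= %f\n", h_out);
              cudaFree(d_out);
              return 0;
          }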

    • @0dyss3us51
      @0dyss3us51 4 years ago +1

      Lol ty for that fun to know

  • @d.barnette2687
    @d.barnette2687 4 years ago +33

    Greetings from near Albuquerque, New Mexico, USA. Thanks for all you do to bring various computing concepts, hardware, and software to your viewers. I want to leave a few comments about this video on Build Your Own GPU Accelerated Supercomputer.
    When you take your square root problem and divide it into smaller and smaller but more numerous parts, that is called 'strong scaling' of a numerical problem. This implies that the problem size on each compute node becomes smaller and smaller. Eventually, if the problem continues to be broken up into smaller and smaller pieces, the communication time from compute node to compute node imposed by the message passing interface (MPI) becomes dominant over the compute time on each node. When this happens, the efficiency of parallel computing can be really low. My point here is that your video shows that if you double the compute nodes you halve the compute time. That scaling will happen at first but cannot continue ad infinitum.
    Another approach to parallel computing is to take a small problem of a fixed size on one compute node, then keep adding the same-size problem (but expanding the compute domain) on other compute nodes, all working on the same but now bigger problem. This is called 'weak scaling.' And as one might guess, the performance and efficiency curves for strong and weak scaling are quite different (the standard formulas for both are sketched just after this comment).
    As you know, but perhaps some viewers do not, running Nvidia GPUs requires knowing the CUDA programming language, which requires a non-trivial effort. This language is entirely different from programming languages such as Python, Fortran, or C++. This is why Intel chose to use more x86 co-processors in their Core i9 boards instead of GPUs, so that programmers could stay with their familiar programming languages. AMD took the same approach with their Threadripper boards. Software development time is much reduced without having to learn CUDA to program the extra compute nodes. Implementing CUDA on top of typical programming languages can significantly extend the time between the start of a software development effort and when the software actually executes properly on a given platform.
    In a nutshell, the plus side of all this is that GPUs are super fast for numerical computing. GPUs are hands-down faster than any X86 processor. Downside is the difficulty in programming a problem to make proper use of the GPUs.
    One more comment. For viewers interested in parallel computing, I highly recommend OPENMPI as the Message Passing Interface version to use as it is open source, actively developed, and easy to implement.
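    For reference, the two regimes described above are usually summarized by Amdahl's law (strong scaling, fixed total problem size) and Gustafson's law (weak scaling, fixed problem size per node), where $p$ is the parallelizable fraction of the work and $N$ is the number of nodes:

        S_{\mathrm{strong}}(N) = \frac{1}{(1-p) + p/N} \qquad\text{(Amdahl)}

        S_{\mathrm{weak}}(N) = (1-p) + pN \qquad\text{(Gustafson)}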

    • @sanmansabane2899
      @sanmansabane2899 4 years ago

      Great comment, very well explained. There's one thing I'd like your opinion on: how do you view OpenACC for parallel computing? The learning curve vs. the performance gain?

    • @d.barnette2687
      @d.barnette2687 4 years ago +4

      @@sanmansabane2899 OpenACC is geared for directive-based parallel computing, much like OpenMP. OpenACC targets GPUs for accelerated computing, whereas OpenMP uses multi-core CPUs. I have more experience with OpenMP, so my comments will pertain more to OpenMP than OpenACC. In general, the idea is to take a serial code, add a few beginning and ending [directives in Fortran : pragmas in C or C++], which usually amount to just a few lines of code, and let the directive-based compiler figure out the best way to parallelize the code in between. Do-loops and for-loops are prime candidates for this approach. Because the compiler does the heavy lifting, the parallel efficiency you get out of this approach is heavily compiler dependent; some compilers do much better at parallel efficiency using directives than others. Also, in my experience, if the code between the directives is written "poorly," the execution time can actually increase rather than decrease. Not good. Note that OpenACC and OpenMP create multiple threads on a node, but they do not communicate across nodes; OpenMPI does that. So the most efficient approach can be to use OpenMPI (which requires rewriting a lot of code to get right) for intra- or inter-process communication, and to include directives in the code to launch threads on the cores using OpenMP or threads on the GPU using OpenACC. Note that there has been a successful push to include GPU protocols in OpenMP. This MAY mean OpenACC is falling in popularity. For example, if you wiki 'OpenACC', you'll find that the decrease in OpenACC's popularity is probably why, on April 3, 2019, John Levesque (the director of the Cray Supercomputing Center of Excellence) announced that Cray is ending support for OpenACC. I've met John before; he is a very well-respected and knowledgeable man. I'm sure his decision to drop OpenACC was made with much forethought and much hindsight. This may be a good reason to go with OpenMP -- it will definitely be around for a while.
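      To make the directive idea concrete, here is a minimal sketch in C (illustrative only: the function names are made up, and compiler flags such as -fopenmp or -acc vary by toolchain):

          #include <math.h>

          /* The same serial loop, parallelized two ways via directives. */

          void sqrt_all_omp(const float *in, float *out, int n) {
              /* OpenMP: spread the iterations across CPU threads/cores */
              #pragma omp parallel for
              for (int i = 0; i < n; i++)
                  out[i] = sqrtf(in[i]);
          }

          void sqrt_all_acc(const float *in, float *out, int n) {
              /* OpenACC: offload the same loop to an accelerator (e.g. a GPU) */
              #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
              for (int i = 0; i < n; i++)
                  out[i] = sqrtf(in[i]);
          }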

    • @elorrambasdo5233
      @elorrambasdo5233 3 years ago +1

      I took a college class where CUDA was the language (Massively Parallel Computing was the course) and it absolutely requires a 'non-trivial' effort.

  • @visiongt3944
    @visiongt3944 4 years ago +95

    "We just take square roots. We're simple folks here."
    **builds a supercomputer cluster with GPU acceleration 😎**

    • @GaryExplains
      @GaryExplains 4 years ago +16

      😂

    • @giornikitop5373
      @giornikitop5373 4 years ago

      I would hardly call this a supercomputer, no offense. Although it's nice for testing and experimentation, has low power draw and all that, in reality, given the overall cost, a mid-tier Nvidia GPU will crush it. Still good for having some fun.

    • @CircuitReborn
      @CircuitReborn 4 years ago +4

      @@giornikitop5373 gives me a great idea how to build a unique gaming rig though...just using better GPUs.

    • @Jkirk3279
      @Jkirk3279 3 years ago

      @@CircuitReborn
      Has anybody solved tag teaming graphics cards?
      I remember in the 90’s you could plug six cards into a Mac IIFX to render Photoshop jobs.

  • @OperationsAndSmoothProductions
    @OperationsAndSmoothProductions 4 years ago +92

    Man's greatest achievement was working out how to do math faster than his mind would let him ! ! !

    • @dam1917
      @dam1917 3 years ago +5

      I prefer the quote: "Teaching sand to think was a mistake."

    • @OperationsAndSmoothProductions
      @OperationsAndSmoothProductions 3 years ago +5

      @@dam1917 It has a grain of Truth.

    • @danwe6297
      @danwe6297 3 years ago +3

      I think the greatest achievement was realizing math is not only about counting numbers together... and not only about finding a faster way to do it either.

    • @OperationsAndSmoothProductions
      @OperationsAndSmoothProductions 3 years ago +1

      @@danwe6297 Also: Geometry, Mathematics and Music are all ways to express the same thing ! ! !

    • @joescomics7652
      @joescomics7652 3 years ago +1

      Can it play steam games?

  • @f4d4x
    @f4d4x 4 years ago +6

    I really would like to build one of these. I followed an HPC course at uni and it fascinated me; being able to build a CUDA cluster for like 250€ is awesome!

  • @Flankymanga
    @Flankymanga 4 years ago +21

    9:53 OK, so if I understand correctly: time will return the number of seconds the program has run, mpiexec is the utility responsible for cluster management, and ./simpleMPI refers to a local binary which is then distributed and run across the cluster? 12:03 Also, by the Xavier GPU being more powerful you mean the number of cores it has, right? Also, I would like to see a video from Professor Gary on Amdahl's law :)

    • @GaryExplains
      @GaryExplains 4 years ago +5

      Yes.
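      For context, the pattern behind the time mpiexec ./simpleMPI run discussed above looks roughly like the following minimal MPI sketch in C. This is not the actual simpleMPI source (in the video the per-node square roots are offloaded to each board's GPU); the sizes and names are made up for illustration:

          #include <math.h>
          #include <stdio.h>
          #include <stdlib.h>
          #include <mpi.h>

          int main(int argc, char **argv) {
              MPI_Init(&argc, &argv);
              int rank, size;
              MPI_Comm_rank(MPI_COMM_WORLD, &rank);
              MPI_Comm_size(MPI_COMM_WORLD, &size);

              const int per_rank = 1 << 20;              /* work items per rank */
              float *all = NULL, *mine = malloc(per_rank * sizeof(float));
              if (rank == 0) {                           /* root rank creates the input data */
                  all = malloc((size_t)size * per_rank * sizeof(float));
                  for (int i = 0; i < size * per_rank; i++) all[i] = (float)i;
              }
              MPI_Scatter(all, per_rank, MPI_FLOAT, mine, per_rank, MPI_FLOAT, 0, MPI_COMM_WORLD);

              for (int i = 0; i < per_rank; i++)         /* each node's share of the work */
                  mine[i] = sqrtf(mine[i]);

              MPI_Gather(mine, per_rank, MPI_FLOAT, all, per_rank, MPI_FLOAT, 0, MPI_COMM_WORLD);
              if (rank == 0) printf("done on %d ranks\n", size);
              free(mine); free(all);
              MPI_Finalize();
              return 0;
          }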

    • @HitAndMissLab
      @HitAndMissLab 4 years ago

      @@GaryExplains Another vote for a video about Amdahl's law. This is so sexy :-)

  • @kovlabs
    @kovlabs 2 days ago +1

    What’s the hardware rack you are using?

  • @rmt3589
    @rmt3589 3 years ago +1

    Is there a program that can calculate the speed of each one, and set up a unique weight automatically?

  • @JuanReyes-uc6mc
    @JuanReyes-uc6mc 2 years ago +1

    You are the first one whose explanation I understand.

  • @notgabby604
    @notgabby604 4 years ago

    Fast Transform fixed-filter-bank neural nets don't need that much compute. Moving the training data around is the main problem. The total system DRAM bandwidth is the main factor. Clusters of cheap compute boards could be a better deal than an expensive GPU. For training you can use Continuous Gray Code Optimization. Each device has the full neural model and part of the training set. Each device is sent the same short list of sparse mutations and returns the cost for its part of the training data. The costs are summed and the same accept or reject mutations message is sent to each device.

  • @naeemulhoque1777
    @naeemulhoque1777 8 days ago

    Can you make a video on a new Jetson Nano cluster and run an LLM on it?

  • @abfig78
    @abfig78 3 years ago +2

    Hi. Can you give me a little information, please? My question is basic. I will be using the Mate and three Nanos: one as the master and two as worker nodes for now, until I can get my last worker. My question is, should I remove the desktop environment from the worker nodes to free up RAM and processor usage? Using three Nanos, it seems I should just install the main OS on all the SD cards from the carrier that they come with, then install them into the Mate. Is that right? I have found loads of information on running clusters and neat stuff, but the basic setup of the Mate is what I am missing. I would just think the desktop environment on the worker nodes would be wasteful.
    I would love a video on the build from the start to running. Just the basics of the beginning: setup of the SD cards and any really important information needed.
    Regards,
    Adam

  • @rogerbruce2896
    @rogerbruce2896 4 days ago

    How would this cluster stack up against the latest Bitmain Antminer?

  • @stephenfgdl
    @stephenfgdl 6 months ago

    So what practical use can I put it to in my lab, apart from crunching numbers?

  • @dfbess
    @dfbess 4 years ago +13

    Gary, can you make the GPUs and CPUs work together? And by the way, that was awesome.

  • @urieldeveaud
    @urieldeveaud 4 days ago

    Hi, can this be applied to the latest release of the Jetson Nano Super (8GB)?

  • @krazykillar4794
    @krazykillar4794 3 years ago

    You're a very good teacher, because I'm a noob and I understood everything and learned a lot. I went from not knowing what a Jetson Nano was to learning about parallel computing and building supercomputers.
    Thank you 👍

  • @Knighteem
    @Knighteem 3 years ago +1

    Just curious, is it possible to mix supercomputer clusters together?
    Like using a Raspberry Pi cluster for CPU and an Nvidia Jetson cluster for GPU computation. Is crosstalk available?

  • @naturepi
    @naturepi 4 years ago +9

    Very cool! You forgot to mention it takes about ~18W of power? Gary, can you please explain exactly how the Xavier NX unit can be used for video encoding? I know it runs Ubuntu Linux, so my question is, can it be booted directly off an SSD and used as a regular desktop PC, running one of the open-source editors such as Kdenlive, which, by the way, supports parallel video rendering?

    • @SoussiAfif
      @SoussiAfif 4 years ago

      I wouldn't recommend it. AFAIK it doesn't support booting from SSD, but you can connect one over USB 3. It's ARM64, and many libraries and software packages are not available for it, and the overall user experience and fluidity of the OS is not the best. For the price of the NX you can build a mini Ryzen PC with much better performance and 4 real x86 cores at high clocks.

    • @naturepi
      @naturepi 4 years ago

      @@SoussiAfif So the issue is with the software. I know a 6-core ARM CPU is not enough, but look at the demo with CUDA cores supporting multi-stream video codecs with face-recognition AI... so why can't this API be used to simply read XML instructions and run FFMPEG commands in parallel? To answer your point: today a Ryzen rig makes sense to build with PCIe 4.0 support and the latest upcoming CPUs with shared L3 cache, which will cost 3x to start with... most people will do that...

    • @tianjohan4633
      @tianjohan4633 4 years ago

      @@SoussiAfif I believe there is a hack for booting from the SSD already. Also, I am sure it has been said officially that the Xavier NX will be able to boot from SSD as standard in the near future.

    • @tianjohan4633
      @tianjohan4633 4 years ago +1

      I think you could; after all, as you said, it runs Ubuntu, so why not? What I struggle to see is why you would render video via a cluster. Let's say 4 x Xavier or even more in a cluster; well, you can get a rather powerful x86 machine for that kind of money. But for the learning experience and the fun of it, just try it and see what you think. Hey, that could be a video if you ever thought of starting your own channel.

    • @naturepi
      @naturepi 4 years ago +1

      @@tianjohan4633 I heard that too, but none of the reviewers can show that... they all just repeat that demo showcase...

  • @audiblevideo
    @audiblevideo 4 years ago +10

    Could you make this into a render farm? That is separate from the question as to whether that would be a good idea or even efficient.

    • @DionV
      @DionV 4 years ago +3

      This is exactly why I came to this video. Can this be an After Effects or Davinci Resolve render farm?

    • @audiblevideo
      @audiblevideo 4 years ago +4

      @@DionV I looked at some Blender forum posts... yes it can, but it's not very efficient at all.

    • @Blessed_dna
      @Blessed_dna 3 years ago +1

      @@audiblevideo can you share the links to those posts please?

  • @georgelza
    @georgelza 2 days ago

    Any chance you have a video about clustering the new Jetson Nano Supers?
    G

    • @GaryExplains
      @GaryExplains 2 days ago

      Clustering in what sense? This video will work on the Jetson Orin Nano. But, sorry to be pedantic, the Jetson Orin Nano isn't new. Nvidia didn't release any new hardware, only new software. I did a review of the Jetson Orin Nano about a year ago. I will cover the Super dev kit (and the new software) in the new year.

    • @georgelza
      @georgelza 2 days ago

      @@GaryExplains ruclips.net/video/S9L2WGf1KrM/видео.html

    • @GaryExplains
      @GaryExplains 2 days ago

      Yes, and? Trust me, that is just excellent marketing from Nvidia. The board Jensen is holding is the 1-year-old Jetson Orin Nano. I have both boards here, they are identical, and I have confirmed this with Nvidia. Kudos to Nvidia for tricking loads of YouTubers into saying it was new when it wasn't.

  • @eotto1980
    @eotto1980 2 years ago

    Can this mine Monero in the same fashion you demonstrated?
    Can you add a GPU to this setup?

  • @sudhirkumarpal5925
    @sudhirkumarpal5925 3 years ago

    Hello, kindly provide a detailed video on how to make a cluster supercomputer using 4 Nvidia Jetson NX boards, with detailed wiring and programming commands. Thank you.

  • @b71717
    @b71717 1 year ago

    Hi @GaryExplains - fantastic video. Thank you for sharing your knowledge with the community.
    I have a quick question. Given that the Jetson Nano used in this video is discontinued, what Jetson module would you recommend instead? Could this work with 4 Jetson Orin Nano modules (and would the Dev Kit be needed or could we just go with the module)? Thanks!

  • @bennyholgersson4686
    @bennyholgersson4686 4 years ago +2

    How about setting it up for mining Monero? How to do that?

    • @TheSc8rpion
      @TheSc8rpion 3 years ago

      That was my thinking too, I have 19 1gb cards, can they work together as 1 card and then mine with them as 1?

  • @-7-man
    @-7-man 5 days ago

    Can 4 Jetsons run a 70B model?

  • @TheSc8rpion
    @TheSc8rpion 3 years ago

    I have 19 1gb cards, can they work together as 1 card and then mine with them as 1?

  • @nikkorocksalot5254
    @nikkorocksalot5254 4 years ago +5

    If I had $500, would it make more sense to make a cluster for Blender rendering, or get a 3070?

    • @darkfoxfurre
      @darkfoxfurre 4 years ago +2

      Probably better spent on the 3070, because 100% of the cost is going to the GPU. But if you bought Jetsons or other single-board computers, a lot of that cost would go to parts other than the GPU (CPU, connectors, cables, etc.). Just mathematically, you'd be spending more money on GPU silicon with a 3070 than with a $500 cluster.

    • @PiotrG1337
      @PiotrG1337 4 years ago +2

      Not even close to worth it for rendering, price-wise. A Jetson Nano has 128 CUDA cores and costs roughly 100 euros. A GeForce 1070 has 1920 CUDA cores and you can pick up a used one for around 100 euros. Do the math.

    • @stephenvillagonzalo9967
      @stephenvillagonzalo9967 3 years ago +1

      3070. GPUs are very cost-effective.

  • @hanobbiee
    @hanobbiee 7 days ago

    Can you try Stable Diffusion on it?

  • @Graeme_Lastname
    @Graeme_Lastname 4 years ago +3

    Will you be porting Doom to it?

  • @Oneill_from_Ireland
    @Oneill_from_Ireland 3 years ago

    Can you show us an Ampere Altra, two 64-core Threadrippers, and then Nanos clustered? Price, price per core, total wattage, and then speed? We're all looking forward to it.

  • @malinyamato
    @malinyamato 2 years ago

    Isn't it more cost-efficient to compute on a graphics card than on Jetson stuff? I believe the Jetson is optimized for running with high requirements on low power consumption, being light and small, and not at all for raw computational power. Where the Jetson may be useful is driving robots that move around, but not for the stuff I do, like machine learning. CUDA cores, as far as I know, cannot handle floats, which is a requirement to compute SQRT. You use the tensor and CUDA cores for linear algebra and matrix transformations on ints.

  • @SFoX-On-Air
    @SFoX-On-Air 2 years ago

    As someone who just stumbled over this video and has no idea what use you have for this, let me ask a question.
    By IT standards I have a rather old graphics card in my main system. It is a "Zotac GeForce GTX 970 AMP! Extreme Core Edition".
    This card comes with 1664 CUDA cores. Is there anything that these little computers can do that my old graphics card cannot?
    I really don't understand why someone would spend 400 bucks on 4 mini computers, which also need a specific configuration and lots of knowledge,
    to have less CUDA power than a 6-year-old graphics card that you can buy for half the price.
    Am I wrong?

  • @leonardog27
    @leonardog27 3 years ago

    Do you think it's possible to run OpenAI Jukebox using this kind of Jetson Nano cluster arrangement? Thank you

  • @calvint3419
    @calvint3419 4 years ago

    Thanks for the video. Is the cluster able to run Apache Spark?

  • @christopherZisa
    @christopherZisa 1 month ago

    I will watch and study all your videos. I want to do more than just study; there's something I'd like to create, if possible. I'll try to reach out when I'm finished studying all your videos. Would it be possible to ask a few questions, just to gain some knowledge? Great video. I don't know much about this, but I understood you. There's a lot to it. I need help with my project.

  • @JoelJosephReji
    @JoelJosephReji 4 years ago +5

    Yes, a video on Amdahl's law, please!

  • @1MarkKeller
    @1MarkKeller 4 years ago +9

    *GARY!!!*
    *GOOD MORNING PROFESSOR!*
    *GOOD MORNING FELLOW CLASSMATES!*
    Stay safe out there everyone!

  • @markharder3676
    @markharder3676 3 years ago

    I was just about to post my brilliant formula for calculating the number of (identical) processors above which the compute time of the cluster starts increasing because of latency (the time taken to communicate results over the interconnects). Fortunately, I hesitated to post it when I realized that the overall latency of the cluster might not be additive. In other words, total latency time might not equal N * L(1), where N is the number of processors and L(1) is the latency of one processor alone. Is there some simple formula that scales latency as a function of the number of processors in a cluster? I suppose that number might vary wildly depending on the topology and hardware of the interconnects, but I have no idea, really.
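    For what it's worth, a common first-order model (not specific to this cluster; the communication term depends heavily on the interconnect topology and hardware) is:

        T(N) \;\approx\; T_{\mathrm{serial}} \;+\; \frac{T_{\mathrm{parallel}}}{N} \;+\; T_{\mathrm{comm}}(N)

    where $T_{\mathrm{comm}}(N)$ grows with $N$ (roughly linearly for naive all-to-one communication, more like $\log N$ for tree-based reductions), so the total run time has a minimum at some finite $N$, beyond which adding processors makes the run slower rather than faster.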

  • @Fanchiotti
    @Fanchiotti 4 years ago +1

    Great video. One question, can you elaborate a little bit more on the github about commSize? I just didn't know how to set it as an argument. Thanks again for the video.

  • @yelectric1893
    @yelectric1893 2 years ago

    So fascinating. Wow . Thank you all. And the producer.

  • @crsereda
    @crsereda 4 years ago

    Doesn't having nodes with varying specifications affect the overall performance of the cluster?

  • @slawomirgontarek4213
    @slawomirgontarek4213 3 years ago

    Hello, have you encountered anything like crypto mining with the Nvidia Jetson series of computers? It would most likely be about mining Monero.

  • @selfscience
    @selfscience 3 days ago

    Where is the build?

  • @Oneill_from_Ireland
    @Oneill_from_Ireland 3 years ago

    Also, why not use a bunch of single boards meshed with fibre instead? Perhaps a redone version using Gen-Z or something close? I'd love to see your project results.

  • @KipIngram
    @KipIngram 1 year ago

    It would be great if you did videos covering all the details of setting up such a cluster, for a Linux-based environment. What software, how to cable it all up, etc. etc. etc.

    • @GaryExplains
      @GaryExplains 1 year ago

      What areas exactly need expanding? I thought the video, along with the documentation I created, should be sufficient.

    • @KipIngram
      @KipIngram 1 year ago

      @@GaryExplains I'm sorry - I've only watched the videos. I will look into your linked documentation. Thanks for the reply!

  • @jyvben1520
    @jyvben1520 4 years ago +3

    I recommend Terminator (if available) for multiple terminal windows.

    • @edwardstables5153
      @edwardstables5153 4 years ago

      Tmux is even better. Just have to learn the weird keybindings.

  • @one-shotrailgun8713
    @one-shotrailgun8713 4 years ago +1

    I have a question: should I get the 4GB Nano or the 2GB Nano for the cluster?
    Does RAM matter?
    I don't plan to buy them, I just want to know.
    Also, can you run Android on the board?
    I'd love to see a video about that!

    • @SoussiAfif
      @SoussiAfif 4 years ago

      You can't run Android on it; the drivers and firmware are not available. For a GPU/CPU cluster (i.e. calculations) the 2GB will do fine. If you're planning to do web hosting or run Docker/Kubernetes, the 4GB of RAM will be a very useful addition.

  • @NickJustWill
    @NickJustWill 3 years ago

    So would there be a way to get these GPUs to encode a video using NVENC?

    • @juandenz2008
      @juandenz2008 3 years ago

      Almost certainly you could encode a video in parallel, if you have the right software to do that.

  • @pcflyer12001
    @pcflyer12001 4 years ago

    Can you use these clusters, since they share GPUs, to make a small-form SBC gaming computer? For, like, AAA titles that one single board couldn't normally run?

    • @GaryExplains
      @GaryExplains 4 years ago +1

      No. The latency over the network is much too slow.

  • @muhammadattaurahman4684
    @muhammadattaurahman4684 3 years ago

    Is more RAM (4GB) better for deep learning?

  • @rivertam1921
    @rivertam1921 4 years ago

    What about SOLID using classes conditionally, or webpages/forms event driven dynamically with the controls triggering classes dynamically for a task or sets of tasks programmatically or sequential task activated?

    • @juandenz2008
      @juandenz2008 3 years ago

      Probably can't be easily run in parallel. Some stuff can be parallelized, some can't. In your example the webserver will be handling many requests in parallel as would the database, but each user interaction and the relevant objects would be a single thread.

  • @eliasdetrois
    @eliasdetrois 4 years ago

    Great job Gary! So fun to watch

  • @UkiDLucas
    @UkiDLucas 4 years ago

    Hello, I went down that path a few years ago with a cluster of Raspberry Pis; the problem I kept running into was out-of-RAM errors with Apache Spark. In the end, I prefer to use a multithreaded CPU and a good GPU. Recently I picked up a USB Google Coral TPU (4 TOPS) which I wanted to run in parallel, but because I need "regression" models (not image recognition) I am not sure if I will succeed.

    • @Adroit1911
      @Adroit1911 2 years ago

      I'm curious as to how a cluster of APUs, like a Ryzen would run...🤔

  • @louielouie684
    @louielouie684 3 years ago

    Could a supercomputer be built on a network of clusters of the Nano that are remote from one another?

    • @GaryExplains
      @GaryExplains 3 years ago +1

      Yes, as long as your program isn't latency sensitive. So if each node is given a task that takes hours to run and then reports its results, that would be OK. But if the tasks are short and there is lots of IO then the performance will fall dramatically. There are also security implications of opening up those nodes for access across the internet.

  • @johnhoffmann1565
    @johnhoffmann1565 7 days ago

    nice work!

  • @sumnemox
    @sumnemox 4 years ago

    Does anyone know whether this can run the ARM version of the Folding@Home software and whether it will utilise the GPU? (I currently have it running on a Pi 3B+ and a Pi4.)

    • @stephenvillagonzalo9967
      @stephenvillagonzalo9967 3 years ago

      No Raspberry Pi supports CUDA. The RPi 3 has partial OpenCL (a CUDA alternative) support, but there is none on the RPi 4.

  • @hpgramani
    @hpgramani 4 years ago

    How is MPI execution different from MapReduce techniques like Hadoop?

    • @GaryExplains
      @GaryExplains 4 years ago

      MPI also has reduce etc functions.
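      To illustrate that point, here is a minimal, hypothetical MPI reduce in C: each rank contributes a partial result and rank 0 receives the combined sum, much like the reduce step in MapReduce.

          #include <stdio.h>
          #include <mpi.h>

          int main(int argc, char **argv) {
              MPI_Init(&argc, &argv);
              int rank;
              MPI_Comm_rank(MPI_COMM_WORLD, &rank);

              double partial = (double)rank;   /* stand-in for each rank's local result */
              double total = 0.0;
              MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

              if (rank == 0) printf("sum of ranks = %f\n", total);
              MPI_Finalize();
              return 0;
          }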

    • @hpgramani
      @hpgramani 4 years ago

      @@GaryExplains oh so it seems both are similar 👍🏼

  • @LarryPanozzo
    @LarryPanozzo 3 years ago

    How much does it cost?

  • @Jkirk3279
    @Jkirk3279 3 years ago

    Hey, Gary.
    Do you think it would be possible to build a super GPU that could take any generic job from, say, a video game, split the job across a thousand smaller chips, and respond fast enough?
    I read about a pioneer in your field who used 55,000 cheap video game chips to do topological maps in an eye blink.
    The challenge being a) impersonating a recognized video card that the host computer will accept and b) lightspeed lag.
    As I recall, he used fiber optics.

    • @GaryExplains
      @GaryExplains 3 years ago

      Latency is the issue, hence the researcher's use of fiber optics etc. Remember a typical graphics card uses a PCIe bus to its full extent. Hard to replicate that kind of speed over a distributed network.

    • @Jkirk3279
      @Jkirk3279 3 years ago

      @@GaryExplains
      I recall the original Power PC RISC 6000 used five chips linked by a 128 bit bus.
      And there’s no LAW that you have to use a standard enclosure.
      Make a filing cabinet into a PC case, put in a standard motherboard and snap three two foot by 1 foot high PCI cards into each drawer.
      I know even the length of the copper traces matters, but you could use fiber optics to distribute the jobs.
      That’s 5% faster than lightspeed through copper, right?
      Put as many cores on those cards as you can.
      Maybe, with a Herculean effort, the timing issues could be resolved and you could run video games on it.
      Plus, the idea scales well for solving conventional parallel processing jobs.

  • @RUFU58
    @RUFU58 4 years ago

    You can get the Nvidia Elroy as well which is even smaller.

  • @miladini1
    @miladini1 4 years ago +1

    Hey Gary, thanks for this video. Awesome!

  • @markharder3676
    @markharder3676 3 years ago

    Gary, great video!! I've been thinking of playing around with one of the Nano 2GB boards and you have given me some courage to try. Did you have to write your own CUDA code to execute the SQRT function?

  • @palaniappanrm6277
    @palaniappanrm6277 4 years ago

    It would be really helpful if you could do a video on how to write some basic code that utilizes the GPU (Nvidia as well as Radeon) in popular programming languages such as C or Java. I couldn't find proper resources, and even when I do find them, they are hard to understand.

  • @thepropaganda1066
    @thepropaganda1066 4 years ago +3

    Couldn't you use this kind of setup for an open AI deep learning machine?

  • @brogonbtw
    @brogonbtw 4 years ago

    I get Xbox vibes from the stacking motherboards.

  • @mysticalsoulqc
    @mysticalsoulqc 4 years ago +2

    Great video, thanks. It would be great to compare these results with a regular PC.

  • @ThomasGodart
    @ThomasGodart 3 years ago

    Congrats! Very well done 👍

  • @abfig78
    @abfig78 3 years ago

    Gary, hello sir, and thanks for the awesome video. I will jump right into it. I followed the GitHub instructions and everything works great. Well, almost. I can see the four Nanos I have as my nodes all crank up. The CPUs go up as the data is received, but only one GPU at a time will run the program. When I edit the clusterfile I can remove all of them and just add one at a time, and each will individually run the program, each working fine. When I add all four it does show 16 cores, and like I said, with jtop I can see them all try to go, but only one will go at a time, never all four. The instructions were simple to follow and it went great. Just wondering if you have any ideas, or if there is some data I can share that would assist you with finding out what I am doing wrong.
    Thank you for the awesome video!!

    • @abfig78
      @abfig78 3 years ago

      I just watched your video again and I see that in the clusterfile you have the IP and :1 or :3 in your examples. I did not do that part. I will try that when I get home today and see if that is my problem.
      Thank you

  • @ufohunter3688
    @ufohunter3688 4 years ago +4

    Nvidia has some kick-ass courses on AI and machine learning too. Some are even free!
    I'm too lazy to develop anything. I wait until someone else does all the hard work, then I become the consumer.

    • @GaryExplains
      @GaryExplains 4 years ago +1

      Indeed it does. I demo one of the course modules during my review of the Jetson Nano 2GB.

    • @ufohunter3688
      @ufohunter3688 4 years ago +1

      @@GaryExplains Indeed. I heard it from you first.
      If I didn't have you, I'd be back in the dark ages like the rest of humanity.

    • @GaryExplains
      @GaryExplains 4 years ago

      @@ufohunter3688 😂

  • @drewwilson45
    @drewwilson45 4 years ago

    Would four Intel or AMD motherboards work in this manner?

    • @GaryExplains
      @GaryExplains 4 years ago

      If they had NVIDIA graphics cards then yes. If not, then they would work similarly to the Raspberry Pi supercomputer I show in my other video.

  • @gohome2828
    @gohome2828 4 years ago +1

    Nice one sir

  • @Adroit1911
    @Adroit1911 2 years ago

    APUs?? 🤔

  • @RobertLugg
    @RobertLugg 3 years ago

    The code needed to do this is most interesting. Can you make it available?

    • @GaryExplains
      @GaryExplains 3 years ago +1

      All the details are in my GitHub repo: github.com/garyexplains/examples/blob/master/how_to_build_nvidia_jetson_gpu_cluster.md

    • @RobertLugg
      @RobertLugg 3 years ago

      @@GaryExplains Very nice. Thank you.

  • @jonjohnson2844
    @jonjohnson2844 4 years ago +1

    Just linking a few computers together: is that really a 'supercomputer'?

    • @stephenvillagonzalo9967
      @stephenvillagonzalo9967 3 years ago +1

      From the demo he did, yes. It did decrease the compute time, from 28 sec to 5 sec. Isn't that super?

  • @thctech5822
    @thctech5822 2 years ago

    Imagine technology in 50 years, sooner or later we'll all be on the internet

  • @eythoreinarsson6260
    @eythoreinarsson6260 4 years ago

    Interesting. You can probably get access to a full-size computer cluster/supercomputer (preferably sponsored by Nvidia) and do a video on some serious number crunching (e.g. 3rd roots) using MPI + CUDA + a scheduler. Thank you for the video :)

  • @BrickTamlandOfficial
    @BrickTamlandOfficial 4 years ago +2

    I wonder how this would compare in speed to AI-upscaling video with a regular GPU like a GTX 1080.

    • @ekstrapolatoraproksymujacy412
      @ekstrapolatoraproksymujacy412 4 years ago

      It looks cool and may have some educational value, but otherwise it is useless. A single GTX 980 Ti (the same OLD Maxwell architecture, so apples to apples) has 2816 CUDA cores; a Jetson Nano has 128 and is clocked lower, so you need more than 22 Jetsons to match the theoretical computing power of an old GTX 980 Ti, which is worth 200 USD on eBay. And it will only work for special cases where it is not critical for the algorithm that the link between nodes is low-bandwidth and high-latency.

  • @dtumas
    @dtumas 2 years ago

    Can I buy this?

    • @GaryExplains
      @GaryExplains 2 years ago

      You can buy all the individual components or you could buy a Jetson Mate, I have a review of it here: ruclips.net/video/nWzcEUj0OHc/видео.html

  • @earlofmeme
    @earlofmeme 4 years ago

    But can it run "Crysis"?

  • @georgeogrady7299
    @georgeogrady7299 3 years ago

    Change bios

  • @justindressler5992
    @justindressler5992 4 days ago

    These devices are edge devices; I'm not sure they are good for high-performance compute. I think a midrange GPU has 10,000 CUDA cores, and the top-end 5090 will have 24,000 cores, probably clocked faster too.

  • @ian9895
    @ian9895 4 years ago

    What types of operations are a GPU core and a CPU core optimized for? For instance, in what situations could a 4-core CPU outperform a 128-core GPU of comparable price?
    I think it is very interesting, especially if the operations are simple enough (like monitoring IO voltages for temperature sensors and the like); you could reduce downtime by segmenting the process over 128 cores instead of 4.
    This is interesting for square roots, but where would the downfall be?
    All in all, this makes me want to get into developing a small supercomputer as you have shown, for the experience.
    Thanks!

  • @deeppatel3454
    @deeppatel3454 4 years ago +3

    Also, please upload a video on training YOLOv4 on a cluster of Nvidia Xavier NX boards. The video is very impressive; I was waiting for this type of video.

  • @j.jwhitty5861
    @j.jwhitty5861 4 years ago

    Two words 'Pretty Cool'

  • @aliawwad1693
    @aliawwad1693 4 years ago

    That was Great, Thanks

  • @GermAndroidE
    @GermAndroidE 3 years ago

    And could it run VMware ESX as the Raspberry Pi does? I think it's a new deal in computing... It would be great to mount an operating system on VMware ESX clustered SBC machines... You'd only need a motherboard to join all this properly and to add a real GPU to it all... RAM, an extra GPU, and clustered mini computers could do it better than just one beast.

  • @rmt3589
    @rmt3589 3 years ago

    That moment when your program is Embarrassingly Parallel, and you like it just the way it is.

  • @darnell8897
    @darnell8897 4 years ago

    Perhaps a foolish question, but as the Raspberry Pi (for instance) also contains a GPU, could this same thing be done with it via OpenGL?

    • @forcegk
      @forcegk 4 years ago

      You probably could with OpenCL or similar

  • @russelldicken9930
    @russelldicken9930 4 years ago

    I’d like to see it handle a jupyterlab server!

  • @tomer073
    @tomer073 4 years ago

    yes yes yes pls make a video about Amdahl's Law

  • @therecyclingguy256
    @therecyclingguy256 1 day ago

    here for the build specs XD

  • @Dretnep
    @Dretnep 4 years ago

    Not possible to fry a potato whilst it's being peeled and chopped? Is this a challenge? Give me a space suit with knife and peeler attachments and prepare to be amazed :)

  • @JuanReyes-uc6mc
    @JuanReyes-uc6mc 2 years ago

    Dude, I want to build a personal supercomputer. I'm giving myself 2 years to complete it, and I know nothing about computers. I'm just super interested. My goal is the world's smallest and most inexpensive supercomputer. I want to see Latin America progressing.

  • @avinesh6030
    @avinesh6030 3 years ago

    Can I build this type of computer and mine crypto on it?

    • @TheSc8rpion
      @TheSc8rpion 3 years ago

      That was my thinking too, I have 19 1gb cards, can they work together as 1 card and then mine with them as 1?

  • @KW-jj9uy
    @KW-jj9uy 1 year ago +1

    Practically, we would run as much as we can on one multi-GPU machine, then we move on to multi-node.

  • @pirateman1966
    @pirateman1966 4 years ago

    You'll need a central controller to distribute the load among thousands, if not millions, of nodes to mimic a brain in real time.
    What do they call this master controller? The "Soul" of the system?

  • @SevenDeMagnus
    @SevenDeMagnus 4 years ago

    So cool.

  • @ProDigit80
    @ProDigit80 4 years ago

    Just get yourself a single RTX 3070 or 3080 in your PC and you'll have more than enough of a supercomputer at home (not even going for 2 or 3x 3090s in a system).

    • @GaryExplains
      @GaryExplains 4 years ago +1

      Yes, of course, but that isn't the point, is it? You don't use MPI with just one card in your desktop.

    • @ProDigit80
      @ProDigit80 4 years ago

      @@GaryExplains But something like a 3090 with 10k+ cores surely crunches way more data than even 100 of these little boards.

    • @GaryExplains
      @GaryExplains 4 years ago

      @@ProDigit80 Absolutely. But again, that isn't the point.

  • @yukieagle115
    @yukieagle115 3 years ago +1

    Now I can play Minecraft.