FASTER Ray Tracing with Multithreading // Ray Tracing series

  • Published: 19 Oct 2024

Comments • 146

  • @TheCherno
    @TheCherno  1 year ago +12

    Thank you all for watching! If you want to contribute to the optimization discussion, check out the GitHub issue here ► github.com/TheCherno/RayTracing/issues/6
    Also check out Brilliant to learn all the math you need for this series! Get started for free, and hurry: the first 200 people get 20% off an annual premium subscription ► brilliant.org/TheCherno/

    • @Theawesomeking4444
      @Theawesomeking4444 1 year ago +1

      Can you please do a Morton Z order in C++ tutorial next? I feel that would be nice to learn considering graphics use it a lot.

    • @ivanivenskii6942
      @ivanivenskii6942 1 year ago

      Hello, do you know Russian?

    • @ivanivenskii6942
      @ivanivenskii6942 1 year ago +1

      @@dav1dsm1th Glory to the heroes

    • @nathans_codes
      @nathans_codes 1 year ago

      Can you take a look at the issues and PRs on the Walnut repo?
      It has some serious problems right now.

  • @blackbriarmead1966
    @blackbriarmead1966 1 year ago +34

    This video seems made for me. I was on a huge time crunch, so I had to implement a ray tracer with reflections, BVH, etc. in about 36 hours total. It took a lot of coffee but I got it done. It's reasonably performant, but I rendered a similar scene using Cycles in Blender and it is simply so much faster. What takes Blender seconds takes me minutes, even with multithreading, and I don't have "fancy" features such as texture mapping running yet.

    • @blackbriarmead1966
      @blackbriarmead1966 1 year ago

      The way I'm currently doing it is with a library called CTPL, into which I push all of my future operations. I give each thread an n×n block, just like Blender, and as the tasks complete CTPL deals with joining the threads, starting new ones, and all of that. I have them all write to the same framebuffer, which I display on the screen so you can keep track of the render's progress.
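
      A minimal sketch of that tile-based scheme, using only the C++ standard library rather than CTPL: the image is cut into fixed-size tiles, a small pool of workers pulls tiles off a shared atomic counter, and each worker writes its pixels into the shared framebuffer. `ShadePixel`, the image size, and the tile size are placeholders, not code from the series.

      ```cpp
      #include <algorithm>
      #include <atomic>
      #include <cstddef>
      #include <cstdint>
      #include <thread>
      #include <vector>

      constexpr int kWidth = 1280, kHeight = 720, kTileSize = 32;

      uint32_t ShadePixel(int x, int y); // assumed to exist: traces one pixel, returns a packed colour

      void RenderTiled(std::vector<uint32_t>& framebuffer)
      {
          struct Tile { int x0, y0, x1, y1; };
          std::vector<Tile> tiles;
          for (int y = 0; y < kHeight; y += kTileSize)
              for (int x = 0; x < kWidth; x += kTileSize)
                  tiles.push_back({ x, y, std::min(x + kTileSize, kWidth), std::min(y + kTileSize, kHeight) });

          std::atomic<std::size_t> nextTile{ 0 }; // shared work index: workers claim the next unclaimed tile
          auto worker = [&] {
              for (std::size_t i = nextTile++; i < tiles.size(); i = nextTile++)
                  for (int y = tiles[i].y0; y < tiles[i].y1; y++)
                      for (int x = tiles[i].x0; x < tiles[i].x1; x++)
                          framebuffer[y * kWidth + x] = ShadePixel(x, y); // tiles are disjoint, so no locking
          };

          unsigned threadCount = std::max(1u, std::thread::hardware_concurrency());
          std::vector<std::thread> pool;
          for (unsigned i = 0; i < threadCount; i++)
              pool.emplace_back(worker);
          for (auto& t : pool)
              t.join();
      }
      ```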

    • @blackbriarmead1966
      @blackbriarmead1966 1 year ago +3

      Update: I minimized the size of the bounding boxes in the BVH by using the surface area heuristic, which made it 50% faster.

    • @Fragtex_CN
      @Fragtex_CN 1 year ago

      Hey bro. If it's possible, may I have a link to your repository to learn something from it?

    • @Alkanen
      @Alkanen 1 year ago

      @@blackbriarmead1966 Simply picking the two objects that create the bounding box with the smallest surface area to combine?
      Do you loop through all your objects to find the absolute smallest surface area, or do you take a more stochastic approach, sampling the objects and picking the smallest area from the objects in the sample to speed up BVH creation?

    • @blackbriarmead1966
      @blackbriarmead1966 1 year ago +1

      @@Alkanen The way I do it currently is that I sort the objects by their centroids along the x, y, or z axis, depending on the depth of the bounding box in the tree. I create two bounding boxes, one starting at the triangle with the smallest value and the other starting at the triangle with the largest value, and I add triangles smaller-to-bigger and bigger-to-smaller respectively. I store the surface area of all of these potential bounding boxes, and I choose the pair of bounding boxes that minimizes the surface area heuristic. The surface area heuristic is the surface area of the bounding box times the number of children it has; the lower this heuristic, the more optimized the BVH. So you choose the candidates that minimize it, or choose not to split the parent if none of the candidates are better than the parent itself. I use axis-aligned bounding boxes, which allow for faster intersection calculations than some other methods.
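
      A rough sketch of that sweep-style SAH evaluation, assuming the primitive bounds are already sorted by centroid along the chosen axis; the `AABB` type and the function names here are illustrative, not from the series code.

      ```cpp
      #include <algorithm>
      #include <cstddef>
      #include <vector>

      struct AABB {
          float mn[3] = {  1e30f,  1e30f,  1e30f };
          float mx[3] = { -1e30f, -1e30f, -1e30f };
          void Expand(const AABB& o) {
              for (int k = 0; k < 3; k++) {
                  mn[k] = std::min(mn[k], o.mn[k]);
                  mx[k] = std::max(mx[k], o.mx[k]);
              }
          }
          float SurfaceArea() const {
              const float dx = mx[0] - mn[0], dy = mx[1] - mn[1], dz = mx[2] - mn[2];
              return 2.0f * (dx * dy + dy * dz + dz * dx);
          }
      };

      // primBounds must already be sorted by centroid along the split axis.
      // Returns the index of the first primitive of the right child, or 0 if
      // no split beats leaving the parent as a leaf (cost = SA(parent) * count).
      std::size_t FindBestSAHSplit(const std::vector<AABB>& primBounds, const AABB& parent)
      {
          const std::size_t n = primBounds.size();
          if (n < 2) return 0;

          // Sweep from the right: rightArea[i] = surface area of the bounds of primitives [i, n)
          std::vector<float> rightArea(n);
          AABB right;
          for (std::size_t i = n; i-- > 0;) {
              right.Expand(primBounds[i]);
              rightArea[i] = right.SurfaceArea();
          }

          // Sweep from the left, evaluating cost = SA(left) * countLeft + SA(right) * countRight.
          float bestCost = parent.SurfaceArea() * static_cast<float>(n);
          std::size_t bestSplit = 0;
          AABB left;
          for (std::size_t i = 0; i + 1 < n; i++) {
              left.Expand(primBounds[i]);
              const float cost = left.SurfaceArea() * static_cast<float>(i + 1)
                               + rightArea[i + 1]  * static_cast<float>(n - i - 1);
              if (cost < bestCost) { bestCost = cost; bestSplit = i + 1; }
          }
          return bestSplit;
      }
      ```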

  • @FabricioSTH
    @FabricioSTH 1 year ago +14

    Maybe a Matt Parker phenomenon erupts from the internets and we get a 40,832,277,770% improvement. Or maybe not, because we are not starting with Python.

    • @matthewparker9276
      @matthewparker9276 1 year ago

      Probably not that much. It's not like the baseline was 1 month to render a frame.

  • @marcotroster8247
    @marcotroster8247 1 year ago +9

    Superior performance can also be achieved with techniques other than multithreading. In fact, threading can actually be slower when the synchronization effort outweighs the performance gains (see Amdahl's law).
    First, notice that CPUs already have parallelism built into their instruction pipeline. Fetching / decoding / executing / writing back results can be performed in parallel for successive instructions if they don't depend on each other's results. Rearranging instructions in assembly can bring crazy gains (but with C/C++ we usually don't dig that deep).
    Second, there are dedicated SIMD instruction sets on modern CPUs that can perform the same operation on multiple inputs (256- / 512-bit-wide registers) at once to increase data throughput (e.g. 8 or 16 float ops at a time).
    Third, avoiding allocation can save lots of compute, too. Preprocessing data only once, up front, is very nice, and having smaller stack frames to set up and tear down also matters. Using some static, rewritable scratch memory that's owned by one thread can really help performance by keeping stack frames small (at the cost of non-thread-safe code).
    And last, the different CPU cache layers have access latencies that are orders of magnitude lower than main memory. So fitting all the working data in a faster cache and constantly reusing it will skyrocket performance; CPUs have great latency once the data is loaded into a register. Small and simple is fast.
    Maybe this inspires some devs here to write faster programs. Cheers, have fun optimizing 🤓👨🏻‍💻🏎️
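
    On the SIMD point, a tiny illustration of the mechanics with AVX intrinsics: one instruction operates on eight floats at a time. This is just a generic example of 256-bit lanes (requires an AVX-capable CPU and, e.g., -mavx), not code from the video.

    ```cpp
    #include <immintrin.h> // AVX intrinsics

    // out[i] = a[i] * b[i] + c[i], processed eight floats per iteration.
    // count is assumed to be a multiple of 8 to keep the sketch short.
    void MulAdd8(const float* a, const float* b, const float* c, float* out, int count)
    {
        for (int i = 0; i < count; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);   // load 8 floats
            __m256 vb = _mm256_loadu_ps(b + i);
            __m256 vc = _mm256_loadu_ps(c + i);
            __m256 r  = _mm256_add_ps(_mm256_mul_ps(va, vb), vc); // 8 multiplies and 8 adds at once
            _mm256_storeu_ps(out + i, r);         // store 8 results
        }
    }
    ```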

  • @jfgh900
    @jfgh900 1 year ago +8

    I really appreciate this! I've always wondered how multithreading is implemented, but I always got stuck on the syntax. Are there any plans to show how to set up rendering on a graphics card?

  • @ChrisM541
    @ChrisM541 1 year ago +2

    Excellent challenge, cheers Cherno! Loving this series.
    There's a lot of optimisation possible here - 2x faster (around 60ms/16.6fps to 30ms/33.3fps) is some way below what we'd expect from fully independent worker units (check: are they? include worker timer and look for normal/abnormal timing distribution), all this assuming maximum threads isn't set to 2, of course ;)
    I'd also be checking the thread allocation process (hint: another, more 'direct' way?), and making sure the work is 100% optimally split up, and 100% optimally allocated to the maximum threads returned from hardware_concurrency() (though historically not 100% guaranteed to work (return 0), don't know if it's now fixed...been a while for me).
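
    On the hardware_concurrency() caveat: the standard still allows it to return 0 when the value can't be determined, so a small guard like this is cheap insurance (the fallback value here is an arbitrary choice):

    ```cpp
    #include <thread>

    unsigned WorkerCount()
    {
        unsigned n = std::thread::hardware_concurrency(); // may legally return 0
        return n != 0 ? n : 4; // arbitrary fallback when no hint is available
    }
    ```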

    • @zvxcvxcz
      @zvxcvxcz 1 year ago +1

      Might be from using so many more threads than there are cores. Probably should really restrict to the number of threads the hardware can actually use and have a proper thread queue.

  • @bishboria
    @bishboria 1 year ago +10

    In my own version of this, I initially tried grouping a chunk of rows per thread and got good improvements. But then I noticed that certain blocks would take longer to run if there was a lot going on in that part of the image, so you'd have one thread working alone when all the others were finished. I ended up using a thread pool and allocating each thread in the pool to work on one pixel; once that pixel was calculated, the thread would go back into the pool and pick up the next pixel to work on. This worked very well and keeps the CPU maxed out until there are fewer pixels left to calculate than cores available to work on them.
    I'd love to change the code to work on the GPU, and I did try for a while to get Metal to work but just couldn't work it out…
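
    A compact sketch of that "each worker grabs the next pixel" idea, using an atomic pixel index rather than a full thread-pool library; `ShadePixel` and the image dimensions are placeholders. In practice, claiming a small batch of pixels per grab reduces contention on the counter.

    ```cpp
    #include <atomic>
    #include <cstdint>
    #include <thread>
    #include <vector>

    uint32_t ShadePixel(int x, int y); // assumed to exist: traces one pixel

    void RenderPerPixel(std::vector<uint32_t>& framebuffer, int width, int height)
    {
        const int pixelCount = width * height;
        std::atomic<int> nextPixel{ 0 };

        auto worker = [&] {
            // Each worker repeatedly claims the next unprocessed pixel, so no core
            // goes idle until fewer pixels remain than there are workers.
            for (int i = nextPixel++; i < pixelCount; i = nextPixel++)
                framebuffer[i] = ShadePixel(i % width, i / width);
        };

        unsigned threadCount = std::thread::hardware_concurrency();
        if (threadCount == 0) threadCount = 4;
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < threadCount; t++) pool.emplace_back(worker);
        for (auto& t : pool) t.join();
    }
    ```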

    • @bunpasi
      @bunpasi 1 year ago +1

      Good point. Have you tried interleaving the rows? So if you have 8 hyperthreads, each thread skips ahead 7 rows. The work will probably be divided more evenly.

    • @bishboria
      @bishboria 1 year ago

      @@bunpasi I think you'd still have a similar problem as with chunking: one thread will be running the final row when all the others have finished and are now idle. The whole CPU won't be maxed out.

    • @bunpasi
      @bunpasi 1 year ago

      @@bishboria We can simplify the problem by using an image with 3 regions. The 2 upper regions are primarily sky and take 1 ms each to process, whereas the bottom section has a lot of objects and takes 7 ms. With 1 thread, the image takes 9 ms. Now look at 3 threads: ideally it will take 9/3 = 3 ms.
      Scenario 1:
      We use chunks. Threads 1 and 2 will be done in 1 ms, but thread 3 will take 7 ms. In total it will take 7 ms.
      Scenario 2:
      We interleave rows. All threads handle a third of each section: 1/3 + 1/3 + 7/3 = 3 ms. And yes, one thread might lag a few rows behind, but with a height of 1080 px this is orders of magnitude less. Even if one thread is 10 rows behind, that only adds 10 × 7 / (1080 / 3) ≈ 0.2 ms.
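
      A sketch of that row-interleaving scheme: thread t renders rows t, t + N, t + 2N, ..., so cheap and expensive regions end up spread across all workers. `ShadeRow` and the height are placeholders.

      ```cpp
      #include <thread>
      #include <vector>

      void ShadeRow(int y); // assumed to exist: renders one scanline into the framebuffer

      void RenderInterleaved(int height)
      {
          unsigned threadCount = std::thread::hardware_concurrency();
          if (threadCount == 0) threadCount = 4;

          std::vector<std::thread> workers;
          for (unsigned t = 0; t < threadCount; t++) {
              workers.emplace_back([t, threadCount, height] {
                  // Thread t takes every threadCount-th row, starting at row t.
                  for (int y = static_cast<int>(t); y < height; y += static_cast<int>(threadCount))
                      ShadeRow(y);
              });
          }
          for (auto& w : workers) w.join();
      }
      ```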

    • @bishboria
      @bishboria 1 year ago

      @@bunpasi Yes, I understood you originally. If you prefer to do it that way, go ahead. For now, while I still need to work out how to convert to GPU-based computation, I prefer the thread pool, as I want as much of the CPU maxed out as I can for as long as possible.

    • @bunpasi
      @bunpasi 1 year ago

      @@bishboria Because in a game engine there are a lot more things you might want to do simultaneously, a thread pool (with an event queue) might indeed be the best solution. Good luck!

  • @manuntn08
    @manuntn08 1 year ago +2

    Thank you very much for the effort you put into this video. I've learnt a lot from your tips.
    Could you please make some videos about how to optimize cases where the computation for one pixel depends on the surrounding pixels (for example, convolution or Gaussian filtering)?
    Once again, thank you and have a nice year!

  • @theonetribble5867
    @theonetribble5867 1 year ago +5

    Hey, thanks for the series. The first video kick-started my learning process about path tracing. In my opinion the series was a little slow and I was eager to outpace it, so I wrote a Vulkan path tracer in Rust and learned most things by doing them. Now I'm writing my Bachelor's thesis on differentiable path tracing. By the way, Mitsuba 3 is a great tool for learning about path tracing as well, especially if you don't want to deal with C++. Anyway, thanks for the inspiration.

    • @edu_rinaldi
      @edu_rinaldi 1 year ago

      Any suggested sources for learning the Vulkan ray tracing extension (and maybe also Vulkan in general)? Thanks in advance :)

    • @Pedro-jj7gp
      @Pedro-jj7gp 1 year ago

      I'm also interested in hearing about resources to learn Vulkan and path tracing. I might even try and learn Rust while I'm at it! :)

    • @theonetribble5867
      @theonetribble5867 1 year ago

      @@edu_rinaldi Hi, I replied to @Pedro. I hope you get the notification.

    • @theonetribble5867
      @theonetribble5867 1 year ago +2

      @@Pedro-jj7gp Hi, sorry for taking so long to reply. It seems that YouTube doesn't allow me to paste links but didn't warn me (if you can't find the resources, contact me directly if that's possible on YT). There are some resources I used to learn Vulkan, though I still don't quite understand it (I used screen-13, a Vulkan abstraction layer in Rust). First of all there is the Vulkan Tutorial, which helped a lot.
      I can also recommend the Vulkan lecture series from "Computer Graphics at TU Wien".
      Specifically for ray tracing, there are some blog entries from the Khronos Group explaining the high-level layout. For more detail there is a tutorial from NVIDIA that uses the KHR extension (note: there are, I think, two extensions for Vulkan ray tracing, the KHR one and one from NVIDIA; the KHR extension also works on AMD GPUs). If you want to learn more about path tracing in general, there is also the Rendering lecture from Computer Graphics at TU Wien (that's where I learned the most about path tracing). In general, if you want to learn about such topics, I recommend looking at university lectures; many European universities put their lectures online, and MIT also has material under "OpenCourseWare". I can also highly recommend the paper by Eric Veach if you want more mathematical background, but it's very long and I mostly use it for reference.

    • @edu_rinaldi
      @edu_rinaldi 1 year ago

      @@theonetribble5867 Thank you so much! ❤️

  • @srisairayapudi6074
    @srisairayapudi6074 1 year ago +1

    YO BELATED HAPPY BDAY MAN! Wish I came sooner, would have wished you on the day :( HAVE A GOOD ONE EVERY DAY

  • @Alkanen
    @Alkanen 1 year ago +4

    Messing around trying to optimise the code a bit, I noticed that your implementation of Random::InUnitSphere() is wrong. It's biased towards values in the directions of the corners of the unit box surrounding the sphere (because it draws a sample from the unit box and then normalizes that sample to fit on the surface of a sphere).

  • @Kazyek
    @Kazyek 1 year ago +3

    Isn't `std::execution::par` still enforcing sequenced execution within each thread, which is not required here? I believe simply switching to `std::execution::par_unseq` would be an instant speedup.
    But ultimately, thread creation has overhead, and creating exactly as many threads as there are logical cores and distributing the work among them would be faster.
    Then again, not all threads would have the same amount of work, since some pixels take longer than others, so to fully saturate all threads for the whole frame it would be better to use a work-stealing thread pool.
    However, exactly N threads (N = number of logical cores) might still be faster even if not perfectly balanced, if you give them distinct tiles with thread-local data for better cache locality...
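
    For reference, the suggested switch is a one-token change to the call shown in the video (member and function names approximated); it assumes the per-row work takes no locks, since `par_unseq` also permits interleaving iterations on the same thread:

    ```cpp
    #include <algorithm>
    #include <cstdint>
    #include <execution>
    #include <vector>

    void RenderRows(const std::vector<uint32_t>& rowIndices)
    {
        std::for_each(std::execution::par_unseq, rowIndices.begin(), rowIndices.end(),
            [](uint32_t y)
            {
                // per-row ray tracing work goes here; it must not lock or otherwise
                // rely on sequenced execution, which par_unseq gives up
                (void)y;
            });
    }
    ```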

  • @Theawesomeking4444
    @Theawesomeking4444 9 months ago +1

    4:10 That's actually wrong: GPUs don't have thousands of cores; what they have are bigger SIMD widths, usually 64-256, and CPUs also have SIMD widths of 8-16, so you can actually turn your CPU into a GPU of sorts if you're willing to vectorize or use intrinsics.

  • @nathans_codes
    @nathans_codes 1 year ago +1

    Can you take a look at the issues and PRs on the Walnut repo?
    It has some serious problems right now.

  • @ezpzgamez
    @ezpzgamez 1 year ago

    I have been following along with this series while writing in Rust instead of C++ to see how the two compare. Until this episode, everything on the Rust side has been matching the C++ performance, if not doing somewhat better. (For comparison with the laptop: my desktop PC with an i9-9900K gets about 15 ms where the laptop gets about 60 ms single-threaded.)
    One thing Rust suffers from here is mutating simple structures from multiple threads. A mutex or RwLock is required to do what the multithreading asks for, unless you allocate temporary buffers (one each for the image data and the accumulation data). In an unsafe context it would be a lot easier, but unfortunately Rust still lacks some pieces here, including some unsafe ones; SyncUnsafeCell has yet to be stabilized.
    So from here on out I guess I'll stick with single-threaded and see how the performance goes. I'd rather do that than clone two large vectors on every iteration. Just my two cents from outside of C++ :)

  • @jeofthevirtuoussand
    @jeofthevirtuoussand 1 year ago +2

    I am not a programmer nor a developer, but I am actually curious.
    Would it be possible to say to the hardware:
    "Hey, can you run ray tracing in parallel on 3 cores, but only use 60% of those cores and assign the remaining 40% to enemy AI calculations?"

  • @peezieforestem5078
    @peezieforestem5078 1 year ago +1

    Would you please do more episodes on various methods of multithreading? The C++17-exclusive approach is nice, but I'd like to know the most broadly applicable method, a method that works for C, the most optimal method, etc.

    • @Alkanen
      @Alkanen 1 year ago +3

      I suspect the most widely supported variant might be using pthreads. It's originally Unix (well, POSIX), but there are Windows-compatible implementations available if you google for a couple of minutes, and then you'll have code that works on all POSIX-compatible systems, which is pretty nice. And it's in C.
      Not too bad to work with either, if I remember correctly, but it's been a few decades (Jesus, I'm getting old) since I wrote my wrapper around it, so I might be misremembering :)
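
      A bare-bones pthreads sketch of the same idea (fixed row ranges per thread, error handling omitted, compile with -pthread); the row loop body is a placeholder:

      ```cpp
      #include <pthread.h>

      struct RowRange { int begin, end; };

      void* RenderRows(void* arg)
      {
          const RowRange* range = static_cast<RowRange*>(arg);
          for (int y = range->begin; y < range->end; y++) {
              // trace row y here
          }
          return nullptr;
      }

      int main()
      {
          const int height = 720, threadCount = 8;
          pthread_t threads[threadCount];
          RowRange  ranges[threadCount];

          for (int i = 0; i < threadCount; i++) {
              ranges[i] = { i * height / threadCount, (i + 1) * height / threadCount };
              pthread_create(&threads[i], nullptr, RenderRows, &ranges[i]);
          }
          for (int i = 0; i < threadCount; i++)
              pthread_join(threads[i], nullptr);
          return 0;
      }
      ```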

    • @peezieforestem5078
      @peezieforestem5078 1 year ago +1

      @@Alkanen Thank you, mate!

  • @1ups_15
    @1ups_15 5 months ago +1

    Hello, thank you for your video, it looks very useful. However, I have a problem: I've noticed that my ray tracer doesn't gain any performance from applying your changes; it even gets slightly worse, and when I look at my processor usage in htop, only one of my cores is being used. I am using Linux and compiling with g++ through CMake; are there some flags I could use to actually make it multithreaded?

  • @jumponblocker
    @jumponblocker 1 year ago

    I actually had an assignment where we made a ray tracer recently. Kind of funny that I also used std::for_each, which I had not heard of before. The only difference was that I just looped over one vector containing each pixel index rather than using an inner and outer loop.

  • @thebasicmaterialsproject1892
    @thebasicmaterialsproject1892 1 year ago

    Go on, The Cherno, still killing it.

  • @lithium
    @lithium 1 year ago +1

    std::iota is the "fancy function" you were avoiding for generating sequences, FYI ;)
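
    That is, the two index vectors from the video can be filled without a hand-written loop (the function and variable names here are just for illustration):

    ```cpp
    #include <cstdint>
    #include <numeric>
    #include <vector>

    void FillIterators(std::vector<uint32_t>& horizontal, std::vector<uint32_t>& vertical,
                       uint32_t width, uint32_t height)
    {
        horizontal.resize(width);
        vertical.resize(height);
        std::iota(horizontal.begin(), horizontal.end(), 0u); // 0, 1, ..., width  - 1
        std::iota(vertical.begin(),   vertical.end(),   0u); // 0, 1, ..., height - 1
    }
    ```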

  • @alessandrocaviola1575
    @alessandrocaviola1575 1 year ago +2

    On my ray tracer I got almost perfect scaling in performance: 4x the speed the moment I multithreaded it on a 4-core CPU, so there is definitely room for improvement there.

    • @Theodorlei1
      @Theodorlei1 1 year ago

      Yeah, he got a 2.5x speedup on an 8-core machine for a parallel problem; at least 8x should be possible for him.

  • @sshawarma
    @sshawarma 1 year ago +1

    Awesome video as always!
    Why was the program not running 8x faster? The only thing I can think of is an I/O bottleneck.

    • @psychoinferno4227
      @psychoinferno4227 1 year ago +1

      Run a profiler and you'll find a different answer. If you want to spoil the fun, see the responses in the GitHub discussion.

  • @Kaldrax
    @Kaldrax 1 year ago +8

    Interesting, I didn't know about this one. I attended a lecture called High Performance Computing last semester in which we did similar things, starting with OpenMPI, then threads, and in the end OpenMP. I absolutely cannot recommend OpenMPI, since it's a total nightmare. OpenMP, on the other hand, would simplify this code: you don't need the iterators, and I believe you can just write #pragma omp parallel for collapse(2) above the nested loops and it will achieve the same performance. 🙂
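
    The OpenMP version of the video's nested loops would look roughly like this; `ShadePixel` and the buffer are placeholders, `schedule(dynamic)` is an optional extra for uneven per-pixel cost, and collapse(2) needs OpenMP 3.0+ (so -fopenmp on GCC/Clang; MSVC's default OpenMP 2.0 mode doesn't support it):

    ```cpp
    #include <cstdint>
    #include <vector>

    uint32_t ShadePixel(int x, int y); // assumed to exist: traces one pixel

    void RenderOpenMP(std::vector<uint32_t>& framebuffer, int width, int height)
    {
        // collapse(2) fuses the two loops into a single parallel iteration space.
        #pragma omp parallel for collapse(2) schedule(dynamic)
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                framebuffer[y * width + x] = ShadePixel(x, y);
    }
    ```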

    • @unknownunknown6531
      @unknownunknown6531 1 year ago +3

      OpenMPI does not address the same problem; it is used to distribute a task across multiple computers (a cluster) rather than just one, hence the additional complexity :). OpenMP is indeed the tool to use in this case!

    • @psychoinferno4227
      @psychoinferno4227 1 year ago

      As an exercise, you should run a profiler and understand why it's only 2x faster on an 8-core machine.

    • @peezieforestem5078
      @peezieforestem5078 1 year ago

      I did some testing with OpenMP and my code got slower... I'm not sure why this happens; I made sure to parallelize the independent loops.

    • @zvxcvxcz
      @zvxcvxcz 1 year ago +1

      Yup, iterators are gross, OpenMP is way nicer (suck it C++ committee).

    • @zvxcvxcz
      @zvxcvxcz 1 year ago

      @@peezieforestem5078 Slower than what was done in the video or slower than the code was before? You shouldn't really do even what he did in the video. In either case, creating way more threads than you actually have the hardware for can cause a lot of contention and cache misses and actually slow things down sometimes. He has 8x the hardware threads and was only getting like 2x the performance... not exactly ideal. What you should really do is create just 8-16 threads when you have 8 physical cores and have a thread queue so they pick up a new task each time they finish a pixel until there are no pixels left.

  • @eduardoassis2826
    @eduardoassis2826 1 year ago

    Hey, how do you draw over your current window during explanations? I've been curious for a long time and can't help but ask :).

  • @ovi1326
    @ovi1326 1 year ago +1

    Allocating a vector of numbers going from 0 to width and height made me very sad, although I get that this is for the sake of simplicity.
    For anyone interested, though, here are some tips.
    A more proper way to go about this would be to either implement a custom range iterator (look up legacy iterators on cppreference) or use std::ranges::iota_view, which is roughly equivalent to Python's `range()` or Rust's `x..y` thingy.
    You can also just avoid the parallel for_each and instead split the work across multiple threads by giving each one responsibility for an equally divided range of scanlines. This is pretty straightforward to implement and should yield good enough performance.
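
    On the first suggestion: C++20's iota_view gives the Python-range style without allocating anything, shown here in a plain loop. Whether the parallel std::for_each overloads accept its iterators depends on the standard library, since those overloads formally require forward iterators.

    ```cpp
    #include <ranges>

    void RenderRows(int height)
    {
        // Lazily generates 0, 1, ..., height - 1; nothing is stored in a vector.
        for (int y : std::views::iota(0, height)) {
            // per-row work goes here
            (void)y;
        }
    }
    ```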

    • @zvxcvxcz
      @zvxcvxcz 1 year ago

      Not just "good enough," but better because there will likely be less cache contention and less thread creation overhead.

    • @ovi1326
      @ovi1326 1 year ago

      @@zvxcvxcz I meant that there are better methods than simply splitting the work by rows; e.g. someone in the comments mentioned using a thread pool to saturate the CPU, which sounds kinda cool.

  • @CreativeOven
    @CreativeOven 1 year ago +1

    Dude, make a chapter someday showing how you program in C++ to get to your level (idea), because some of us are still in the Stone Age with C++.

  • @anime_erotika585
    @anime_erotika585 11 months ago

    7:07 I want multithreading, on my desk, by tomorrow!

  • @gustavbw
    @gustavbw 1 year ago

    Wouldn't allocating the threads on every std::for_each() be highly inefficient compared to pre-allocating the pool when the program starts?

    • @dmitrysapelnikov
      @dmitrysapelnikov 1 year ago

      In fact, the C++ runtime uses an internal thread pool for the parallel for_each(). But AFAIK there is no way for the user to explicitly control this pool.

  • @thomasavino3450
    @thomasavino3450 1 year ago

    What theme/color scheme are you using? (The default Visual Assist one is not like this.)

  • @ivansanz4029
    @ivansanz4029 1 year ago

    If, instead of having each thread do a row, you make them do a column, the performance is even better, as the "sky" is very cheap to process and the really complex part (the "ground") is distributed better across the threads.

    • @ZeroUm_
      @ZeroUm_ 1 year ago

      It probably won't do much: if 20% of the scene is sky, with 1080 lines you still have 216 sky lines to divide among a much smaller number of threads. With 8 threads, that's still 27 rows each, enough to saturate them equally.

    • @ivansanz4029
      @ivansanz4029 1 year ago

      @@ZeroUm_ Yeah, I was thinking ahead to when he will use the GPU cores :D

  • @CP-sr6ml
    @CP-sr6ml 11 months ago

    Don't get me wrong, your content is great, but... why are we bothering with multithreading if we could just move to the GPU? I don't understand why you keep building and even optimizing like this on the CPU side now. Won't that just make it harder / more work to move to the GPU?

  • @ng.h9315
    @ng.h9315 1 year ago

    Wonderful courses 👌, but please continue the "Create a Game Engine in C++" series and add a 3D game development option with builds for Android, iOS, ...
    Please teach us how to create a game engine like Unreal Engine 😀.
    I'm waiting for your answer......
    Thanks for everything, Cherno ♥️

  • @larryfulkerson4505
    @larryfulkerson4505 24 days ago

    I like to write code by the principle of least astonishment.

  • @helmuthpetelin4613
    @helmuthpetelin4613 1 year ago

    Hey, do you have plans to show how to push the ray tracing onto the GPU?

  • @MorebitsUK
    @MorebitsUK 1 year ago +1

    Nice!! Always good content, Cherno. Any idea how to use IntStream in Java to parallelize stuff?
    FYI, I'm using `map`, not `forEach`.

    • @wuangg
      @wuangg 1 year ago

      Use IntStream.parallel() to get a parallel IntStream, then call forEach() to perform an action on each element of the stream in parallel; it will use all available processors to do the job.
      For example:
      IntStream stream = IntStream.range(1, 10); // a sequential ordered IntStream over 1..9 (the upper bound is exclusive)
      stream.parallel().forEach(i -> {
          // do stuff with element 'i' here
      }); // the action runs on the elements across multiple threads
      This is equivalent to C++ std::for_each with the parallel execution policy, which is what's shown in this video.

    • @MorebitsUK
      @MorebitsUK 1 year ago

      @@wuangg Thanks for the reply, but I'm using map, not forEach.
      I just need to return something from the map. Something like this (mapToObj instead of map, since the lambdas return strings; it needs java.util.stream.Collectors):
      String[] results = IntStream.range(0, imageHeight).parallel().mapToObj(i -> // y value; range's upper bound is already exclusive
          IntStream.range(0, imageWidth).mapToObj(j -> { // x value
              Vec3 pixelColour = new Vec3(0, 0, 0);
              float u = (i + Utils.randomFloat(0.0f, 1.0f)) / (float) (imageWidth - 1);
              float v = (j + Utils.randomFloat(0.0f, 1.0f)) / (float) (imageHeight - 1);
              final Ray rayP = camera.getRay(u, v);
              pixelColour.addEquals(rayColor(rayP, finalWorld, maxDepth));
              return PPM.vectorToRGB(pixelColour, 1); // one pixel as text
          }).collect(Collectors.joining(System.lineSeparator())) // join the pixels of one row
      ).toArray(String[]::new);

  • @stinkybeam
    @stinkybeam 1 year ago

    I know nothing about programming and coding; watching this video reminds me of high school math class. I think I understand, but actually I don't.

  • @vasile2321
    @vasile2321 1 year ago

    What RTX card do you have in your PC? Thx

  • @CreativeOven
    @CreativeOven 1 year ago

    Comment 10, 10 out of 10 :D. How is Hazel? I see it is not all about drawing those OpenGL 3D lines for those vertices, right? :P

  • @ChaoticFlounder
    @ChaoticFlounder 1 year ago

    How difficult would it be to implement the ray tracing calculations on your CPU's integrated graphics?

    • @zvxcvxcz
      @zvxcvxcz 1 year ago

      "It depends," is the unfortunate answer there. It depends just what types you're using, what the driver for that GPU exposes and if it supports the necessary extensions, etc... Maybe you can drop it on there with CUDA or OpenCL or maybe you can even wrangle the regular display part of the driver into giving you what you need with OpenGL or DirectX, etc... Often laptop manufacturers have not been great about switching these GPUs (sometimes if you're primarily on the discrete care, the integrated one can be almost totally deactivated, or vice versa). Sometimes that is seen as a plus, since it dealt with battery concerns.

  • @luigidabro
    @luigidabro 1 year ago +1

    Where are the triangles?

  • @kelvinpoetra
    @kelvinpoetra 1 year ago

    Hello Cherno, I want to ask how to make graphics software, and software such as Microsoft Word. Are the basic stages of making software all the same?

  • @andrewporter1868
    @andrewporter1868 1 year ago

    Multithreading is also a mistake. It's a failure to defer parallel computing to the programmer. Instead of providing an asynchronous master-slave universal scheduler system, and on top of that the ability to do cheap software scheduling by providing a simple custom scheduler that can use the exact same code (it's asynchronous, so you just insert the scheduler code at some point in the future on one of your existing execution pathways), we got this pile of garbage that requires us to add all this overhead by synchronizing everything. It's a massive headache where you can't just write parallel code: you have to think about synchronization too, and if you think too hard, you get a synchronization bug that you spend the afternoon fixing instead of fixing your actual code, the part that's supposed to be the design you're implementing, not a standard library feature that's missing from every language and imposed on us by all major operating systems.

  • @rckeet
    @rckeet 1 year ago +1

    oh yesssssss!!😎

  • @hymen0callis
    @hymen0callis 10 months ago

    Unfortunately, std::for_each() is not very "efficient". Apparently you got a speedup of only about 2x, while I (using the exact same parallelization scheme) got a speedup of 5.5x (and I only have 8 logical cores) by using PPL's Concurrency::parallel_for() instead. It's not portable code, but if it's almost 3 times faster, I'll go with Microsoft's PPL.
    Edit: I just watched the next video, where you fixed your global RNG. In my code the RNG was already thread_local, which explains the much higher speedup in my example. So I guess std::for_each() isn't that slow after all.
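
    For anyone curious, the PPL call being referred to looks roughly like this (MSVC-only; `ShadeRow` is a placeholder):

    ```cpp
    #include <ppl.h> // Microsoft's Parallel Patterns Library (MSVC only)

    void ShadeRow(int y); // assumed to exist: renders one scanline

    void RenderPPL(int height)
    {
        // parallel_for partitions [0, height) across a runtime-managed thread pool.
        concurrency::parallel_for(0, height, [](int y) {
            ShadeRow(y);
        });
    }
    ```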

  • @JATmatic
    @JATmatic 1 year ago

    I made it much faster than the MT version here by fixing the wonky Walnut::Random code and removing branches from the Renderer::TraceRay() loop.
    The render runs in about ~11 ms on an 8-core Ryzen 2700.

  • @gabrieldesimone4644
    @gabrieldesimone4644 1 year ago

    Hey there, I'm not familiar with C# or game-making stuff, but I was wondering: that code is running on CPU cores, so how do you make it use GPU cores instead?

    • @Alkanen
      @Alkanen 1 year ago +1

      That's coming in a future episode

    • @zvxcvxcz
      @zvxcvxcz 1 year ago

      Three main options: 1) wrangle your GPU into doing it by sort of telling it that it is doing normal math for output, using OpenGL/DirectX/etc.; 2) use OpenCL; 3) use CUDA.

  • @stephenkamenar
    @stephenkamenar 1 year ago

    GPUs don't have 8,000 cores. They have very wide instructions, like SIMD but on massive amounts of data at the same time.
    Same difference, though.

  • @IshanChaudharii
    @IshanChaudharii 1 year ago

    Oh my goodness finally!!!! ❤️🥲🎉

  • @AnalogFoundry
    @AnalogFoundry 1 year ago +2

    I wish the team at Striking Distance Studios would take notes and improve ray-tracing performance in their game called The Callisto Protocol. At the moment their CPU utilization with RT is abysmal.

    • @zvxcvxcz
      @zvxcvxcz 1 year ago

      Are they not doing their raytracing on the GPU though? Recent GPUs have hardware accelerated raytracing. I'm not at all familiar with the game or what they've done other than that it is supposed to be like a AAA title? I would expect any AAA to be using the GPU features on this (whether or not they should be).

    • @AnalogFoundry
      @AnalogFoundry 1 year ago

      @@zvxcvxcz - They are doing RT on the GPU using the dedicated RT cores on AMD and NVIDIA hardware, but building the BVH and related work is handled on the CPU, so RT can be very taxing on the CPU as well. The problem with The Callisto Protocol is that it uses very little of the CPU (i.e. it's not well multithreaded) even with the latest and greatest multi-core CPUs, which causes huge FPS issues.

  • @Alkanen
    @Alkanen 1 year ago

    Wohoo!

  • @ricbattaglia6976
    @ricbattaglia6976 1 year ago

    Isn't a GPU render faster? Thanks

  • @TheApsiiik
    @TheApsiiik 1 year ago

    It's been 2 months.. where is next episode!!!1

  • @steellung
    @steellung 1 year ago

    Does anyone know which software he uses for drawing on the screen on the fly?

    • @rastaarmando7058
      @rastaarmando7058 1 year ago +1

      It looks very similar to gInk.

    • @steellung
      @steellung 1 year ago

      @@rastaarmando7058 cool, didn't know this one. Thanks

    • @erikrl2
      @erikrl2 1 year ago +1

      He uses ZoomIt

  • @fanisdeli
    @fanisdeli 1 year ago +1

    Complete assumption because I'm too lazy to look it up: I would think std::for_each is smarter than creating a thread for every single item in your range, since creating the threads would be much slower than actually running on one thread. My assumption is that it creates a few threads, depending on your hardware, and reuses them: when one iteration is done, the same thread is used for a future iteration. That would also explain why using a nested std::for_each made no difference in performance.

    • @zvxcvxcz
      @zvxcvxcz 1 year ago

      You think it would only be getting 2x rather than roughly 8x if it were being smart? I think the nested for_each makes no difference because the single loop is already so bad for resource contention (several thousand threads on 8 hardware cores...) and thread creation that it doesn't get any worse than that.

    • @fanisdeli
      @fanisdeli 1 year ago

      @@zvxcvxcz I don't think it could possibly be creating millions (1920*1080) of threads, 60+ times a second.
      Also, in programming there's no such thing as "it can't get worse" lol. If it were a thread per iteration, then without nesting you'd have 1920 threads; with nesting you'd have over 2 million. So yeah, that would be WAY worse for sure. Like "freeze the entire OS and blue screen" type of stuff.

  • @mackerel987
    @mackerel987 1 year ago

    Hey guys. Does anyone get the "no instance of overloaded function "std::for_each" matches the argument list" error? AFAIK we only need to include the <execution> header for it to work. Am I missing something?

    • @simonmaracine4721
      @simonmaracine4721 1 year ago +3

      Make sure you compile with the C++17 flag (or newer) and that your compiler supports C++17.

    • @mackerel987
      @mackerel987 1 year ago

      @@simonmaracine4721 exactly what was wrong. thank you.

  • @MrMirbat
    @MrMirbat 1 year ago

    Thanks for sharing your knowledge. Can you do a tutorial on how to make casino games like slot machines (Book of Ra), Texas hold'em poker, or roulette? Thanks in advance.

  • @Notsorandomnumbers
    @Notsorandomnumbers 1 year ago

    Anyone know of a channel similar to this but one degree more amateur-friendly? I find myself having difficulty keeping up at points.

  • @closingtheloop2593
    @closingtheloop2593 1 year ago

    Why aren't you doing this in CUDA? Or in an OpenGL fragment shader?

  • @mr.mirror1213
    @mr.mirror1213 1 year ago

    lesss gooo

  • @Jkauppa
    @Jkauppa 1 year ago

    Multicore AVX-512 on the CPU.

    • @Jkauppa
      @Jkauppa 1 year ago

      screen space dynamic baking surface light map caching

    • @Jkauppa
      @Jkauppa 1 year ago

      Update the dynamically baked surface light map only when needed (when it's new or when an update is required), e.g. every 4th frame: at 240 fps, update the lighting at only 60 fps.

    • @Jkauppa
      @Jkauppa 1 year ago

      pseudo-coding is a must, so that you are not tied to a language

    • @Jkauppa
      @Jkauppa 1 year ago

      Focus on the programming pipeline or on the pseudo-algorithm methods.

    • @Jkauppa
      @Jkauppa 1 year ago

      language specifics are so 80's :)

  • @nenomius1148
    @nenomius1148 1 year ago

    8:50 running around updating two vectors on each window resize is much simpler than that stinky over-engineered std::views::iota from "modern" C++

    • @ovi1326
      @ovi1326 1 year ago

      yeah but like think of the cache friendliness of accessing a buffer of memory just to get the next consecutive number

    • @nenomius1148
      @nenomius1148 1 year ago

      @@ovi1326 Yeah, reading consecutive numbers from memory is much more cache-friendly than generating them on CPU in registers

  • @zvxcvxcz
    @zvxcvxcz 1 year ago

    Iterators are gross... I would rather use OpenMP.

  • @anlcangulkaya6244
    @anlcangulkaya6244 1 year ago +9

    #pragma omp parallel for

    • @psychoinferno4227
      @psychoinferno4227 1 year ago

      The performance was nearly identical to the for_each with a parallel execution policy.

    • @peezieforestem5078
      @peezieforestem5078 1 year ago

      Hey, I tried OpenMP once and my code got slower. I'm not sure why, do you have any ideas?

    • @zvxcvxcz
      @zvxcvxcz 1 year ago

      @@psychoinferno4227 Yes, but with OpenMP you don't need those silly ranges; that's the advantage there. I would expect the performance to be about the same as in the video if it's done the same way. Creating thousands of threads on a machine with 8 physical cores is begging for 1) overhead from thread creation and 2) resource contention as all those threads fight to get their task executed, so expect an increase in cache misses. The proper way to do it is to create a properly sized thread pool (somewhere between 8 and 16, most likely, if you have 8 hardware cores) and have a task queue where each pixel's processing is a task; each thread picks up a new task when it finishes, until there are no tasks left. I would expect something like a 5x-7.8x improvement rather than 2x. I might be wrong, but that's my naive expectation without knowing too many details about the ray tracing algorithm itself. Offhand I don't think we're memory-bottlenecked in terms of throughput here, just perhaps by cache misses as the threads swap.

    • @zvxcvxcz
      @zvxcvxcz 1 year ago

      I use a sort of implied threading in Bash too, with the same model. You can't have an ancient Bash though, because they didn't add the ability to wait for any task to finish until 4.something. So now you can start 8 commands while hundreds more wait, and each time one finishes the next starts; it's pretty sweet. Prior to that Bash you could only wait for all tasks to finish, or you had to know the exact task you were waiting for (and of course you can't know ahead of time what order they will finish in, in most cases).

  • @irfanjames6551
    @irfanjames6551 1 year ago

    Thanks a lot
    I was really waiting for the optimisations especially
    M
    u
    l
    t
    i
    -
    t
    h
    r
    e
    a
    d
    i
    n
    g.