Brute Force Processing

  • Published: 26 Oct 2024

Comments • 569

  • @wessmall7957
    @wessmall7957 4 years ago +378

    Arm-chair fractal geometry expert here. Was expecting to be bored, but was actually thrilled to see how your optimizations would perform. However, I'm disappointed you didn't go on a 45-minute tangent about the existential implications of the fractal nature of life, the universe, and everything.

    • @daedreaming6267
      @daedreaming6267 4 years ago +30

      Well you see... What you forgot to take into account is that the fractal nature of things? Really boils down into one thing. And that thing? Is 42.

    • @SpaghettiToaster
      @SpaghettiToaster 4 years ago +4

      @@daedreaming6267 What he actually forgot to take into account is that there is no such thing as the fractal nature of things, because things are not fractal in nature.

    • @Capris4EveR
      @Capris4EveR 4 years ago

      @@SpaghettiToaster In my point of view everything in nature is parallel (mirrored to infinity). We can't find the start since it's infinite, but we will be able to predict the future and change the course of our evolution. In my mind a higher being developed us for God knows what purpose, and in parallel with that, someone else created them too; when our turn comes we will create others, if we didn't already (animals).

    • @Kenjuudo
      @Kenjuudo 3 years ago

      @@emjizone I'm pretty sure that if you were to iterate through every single color component of every single pixel in every single frame in every single video, for every single video resolution as they are stored on the Google servers, you'd pretty much find the number 42 in all of them. And for the off chance a video's video stream doesn't contain that number, there are also the audio streams to sift through... :)

    • @austinwoodall5423
      @austinwoodall5423 2 years ago

      Fractals are so named for their FRACTionAL dimensionality. For instance, the Mandelbrot set is neither 1D nor 2D but 1.72-dimensional. Fractals are characterized by their infinite perimeter but 0 area (or other applicable measure).

  • @dartstone238
    @dartstone238 4 years ago +419

    The SIMD conversion would be super interesting, I think. I do like the low-level part of things.

    • @Agatio17x
      @Agatio17x 4 years ago +5

      I would love to watch a SIMD video too!

    • @ElGuAcA11
      @ElGuAcA11 4 years ago +2

      Me too, I'd love to see it!

    • @byrontheobald6663
      @byrontheobald6663 4 years ago +2

      From my own testing with other problems (N-body) you don't actually need to do much.
      If you look here: gcc.gnu.org/projects/tree-ssa/vectorization.html
      You'll see that GCC will actually emit SIMD instructions automatically if you compile past -O2; just make sure your code is in a vector/array or otherwise contiguous in memory. I'd imagine MSVC will do something very similar.
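
      A minimal sketch (hypothetical names) of the kind of loop GCC's auto-vectorizer handles once optimization is turned up (-O3, or -O2 with -ftree-vectorize): a simple countable loop over contiguous data. Compiling with -fopt-info-vec reports which loops were vectorized.

          #include <cstddef>
          #include <vector>

          // Contiguous storage and independent iterations: a textbook
          // auto-vectorization candidate.
          void AddArrays(std::vector<float>& out,
                         const std::vector<float>& a,
                         const std::vector<float>& b) {
              for (std::size_t i = 0; i < out.size(); ++i)
                  out[i] = a[i] + b[i];
          }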

    • @mileswilliams527
      @mileswilliams527 4 years ago +2

      Yes PLEASE!!!
      I'm at the point in my coding journey where I need to start getting the most performance I can, and I've started looking at parallelism, so this would be very interesting.

    • @ScottFCote
      @ScottFCote 4 years ago +2

      I also would very much like to see this.

  • @dragan38765
    @dragan38765 4 years ago +174

    More SIMD/AVX videos? Yes, please. I'd love to learn more about those; I only know the basics, sadly. I pray to the compiler gods that they have mercy upon my soul and vectorize it automatically.

    • @douggale5962
      @douggale5962 4 years ago +10

      In most cases, they can't auto-vectorize, usually because the compiler can't be certain your pointers aren't aliasing: it has to assume your stores may be changing the values of subsequent loads, so it is afraid to use SIMD, since that could change the results. You can help a lot with "restrict" pointers (__restrict in g++). The compiler can be absolutely sure there is no aliasing when you say they are restrict. Marking the signature's pointers restrict means that the function cannot be called with those pointers referring to the same memory. Applying that restriction sets the compiler free to assume that its stores won't be feeding back into loads, allowing vectorization.
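
      A minimal sketch of that hint, with hypothetical names (g++ and clang spell it __restrict; MSVC accepts __restrict as well):

          // With __restrict the compiler may assume dst and src never
          // overlap, so stores to dst cannot feed later loads from src
          // and the loop becomes a safe SIMD candidate.
          void Scale(float* __restrict dst, const float* __restrict src,
                     float k, int n) {
              for (int i = 0; i < n; ++i)
                  dst[i] = src[i] * k;
          }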

    • @Bravo-oo9vd
      @Bravo-oo9vd 4 years ago +1

      Instead of praying, you can profile and inspect the compiler output. It's hard to speed up your code when you don't know where the bottlenecks are, and you can help the compiler with autovectorising if you provide it with some hints. Writing code with intrinsics directly is not very maintainable; a better thing to do would be to look into parallel processing libraries that manage this complexity for you.
      Also, by looking at the output assembly you're able to confirm that other optimisations are working, such as instruction-level parallelism.

    • @GregoryTheGr8ster
      @GregoryTheGr8ster 4 years ago +4

      I hate to have to be the one to bring you the news, but there are no compiler gods. I SWEAR THAT I AM NOT MAKING THIS UP!

  • @daklabs
    @daklabs 4 years ago +251

    The AVX sounds interesting, I'd like to see more on it!

    • @gideonmaxmerling204
      @gideonmaxmerling204 4 years ago +1

      There is a YouTuber named "What's a Creel" that has some AVX-in-asm vids and also some VCL vids; VCL is a vector library for C++.

  • @glitched_code
    @glitched_code 4 years ago +85

    Very interesting, I'm rooting for the SIMD conversion video to come out :)

    • @javidx9
      @javidx9  4 years ago +19

      thanks and noted!

  • @Omnituens
    @Omnituens 4 years ago +48

    I'm in for a dedicated video on the SIMD conversion.

  • @DiamondShocked
    @DiamondShocked 4 years ago +50

    This is exactly the type of computation that GPUs are designed for, and you would likely see a huge performance improvement there.
    In addition, while this approach generalizes fractal computation somewhat well, there are many techniques to be explored to optimize specifically the rendering of the Mandelbrot set. For example, there is no need for the computation to run to the max iteration count for pixels which are clearly in the middle region and would iterate forever if allowed.
    There are also ways to scale the floating-point values based on the zoom so that precision is not a limitation. This would allow you to zoom indefinitely.
    Next, it does not make sense that 32 threads should improve performance over 8 threads on your machine, since your 4-core processor can only effectively execute instructions from 8 threads simultaneously (two instruction streams per CPU core).
    I would also say avoid busy waiting for threads in the thread pool. Simply deschedule these threads if they do not have work to do, and make them runnable when they should run. It is perhaps possible that this is handled somehow by the OS/compiler (if a loop is seen to jump to the previous instruction and the condition is not met, the processor might dynamically see that no useful progress can be made and yield the thread to another?), although that's unlikely. While it is a rule of thumb to avoid busy waiting at all costs on a single-core machine, there are reasons why busy waiting may be an appropriate solution in a multi-core system: conditions can change while a thread is busy waiting due to the behavior of other threads, and the system call to make the thread runnable can be avoided. So the tradeoffs between these options should be weighed. Also, your solution to thread synchronization (an atomic counter) could be replaced more elegantly with a barrier or another synchronization primitive; see the sketch below.
    An interesting experiment for the approach you have taken would be to disable the compiler's optimization flags, so that the effect of simple code optimizations such as loop unrolling or strength reduction can actually be measured.
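
    A minimal sketch of the barrier suggestion above, assuming C++20's std::barrier; kThreads, kFrames and ComputeStrip are hypothetical stand-ins for the video's per-frame worker logic:

        #include <barrier>
        #include <cstdio>
        #include <thread>
        #include <vector>

        constexpr int kThreads = 8;
        constexpr int kFrames  = 3;

        void ComputeStrip(int id, int frame) {
            // ... per-thread fractal work for this frame ...
            std::printf("thread %d finished frame %d\n", id, frame);
        }

        int main() {
            // The completion function runs once per phase, after every
            // worker arrives, replacing a hand-rolled atomic countdown.
            std::barrier sync(kThreads, []() noexcept {
                std::puts("frame complete");
            });

            std::vector<std::jthread> pool;
            for (int id = 0; id < kThreads; ++id)
                pool.emplace_back([&sync, id] {
                    for (int frame = 0; frame < kFrames; ++frame) {
                        ComputeStrip(id, frame);
                        sync.arrive_and_wait();  // block until the frame is done
                    }
                });
        }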

    • @AusSkiller
      @AusSkiller 4 years ago +16

      With the way he was doing things it absolutely makes sense that 32 threads would improve performance over 8 threads. What is probably tripping you up is that the number of threads isn't the reason it's faster (more threads alone doesn't really help, as you correctly concluded); rather, it's faster because he divides the area up based on the number of threads, which reduces the time it takes to process the most computationally expensive section, and with only 32 sections that is likely still a major bottleneck. With just 8 threads and 8 sections, whenever a thread completes its section the hardware thread sits idle until the others are all complete, but with 32 threads and 32 sections, whenever a section is completed another software thread can utilise the freed-up hardware thread, and the maximum time to complete a section drops to a quarter of what it would take with 8 sections, allowing it to be up to nearly 4 times faster in the worst-case scenario for 8 threads. However, it is entirely dependent on the complexity of the sections and how well balanced they are: if all 8 sections are equally complex then there will be no performance benefit to running 32 threads/sections, which may even be marginally slower in that case.
      Basically, the more sections there are the better it can load balance and keep the CPU fully utilised, at least up to the point where the overhead of more threads starts negating the better utilisation.

    • @luisendymion9080
      @luisendymion9080 4 years ago +2

      AusSkiller is right. You're speaking of a better but far more complex implementation. Javi's goal is to use a minimalist implementation that still makes good (although not optimal) use of the cores. For his algorithm it makes sense to use more threads, because it reduces the overall time to join them.

    • @DiamondShocked
      @DiamondShocked 4 years ago +1

      @@AusSkiller I see your point. The load balancing is the key issue here.
      For each thread to have somewhat equal work, each pixel should have an estimated cost, and the total cost of the pixels for each thread should be similar.
      At least that's one way you could do it, off the top of my head.
      Also, the pixels for each thread should be chosen with memory access locality in mind.

    • @Splatpope
      @Splatpope 4 years ago

      Yep, got 60 FPS all the way to the float precision limit in my OpenGL implementation of the Julia set.

    • @Splatpope
      @Splatpope 4 years ago

      Can you tell me how to overcome float precision?

  • @geotale95
    @geotale95 4 years ago +164

    I'm a simple man. I see brute forcing and the Mandelbrot set, and I click on the video.

    • @kepeb1
      @kepeb1 4 years ago +9

      I'm a simpleton, I see overused formulaic cliche comments, and I join in.

    • @geotale95
      @geotale95 4 years ago +1

      @@kepeb1 Yeah, sorry, I didn't see any other comments and wanted to comment quickly, and that's the best I could do :p

    • @luisgeniole369
      @luisgeniole369 4 years ago +2

      I think the Mandelbrot Set and fractals in general are so engaging because they're the closest thing we have to visualizing eternal cosmic inflation, and that pokes at a very powerful memory of our collective consciousness. Also the pretty colors.

    • @RogerBarraud
      @RogerBarraud 3 years ago

      One does not simply brute-force into Mandelbrot...

  • @atrumluminarium
    @atrumluminarium 4 years ago +18

    Yes please I would be interested in seeing more about the intrinsics and performance advantages. I've been programming for years and had no idea that was even a thing you can do in C/C++

  • @sanyasanders9580
    @sanyasanders9580 3 years ago +1

    Wow! It's incredible! Thanks for this video.

  • @antipainK
    @antipainK 4 years ago +29

    Could you make a video about dynamically resizing variables (ones that double their length when they run out of precision)?
    That would of course require defining all the calculations and would make the program slower, but would enable us to zoom "indefinitely".

  • @ScramblerUSA
    @ScramblerUSA 4 years ago +32

    30:02 - should be "bit 255".

  • @cocorinow
    @cocorinow 4 years ago +5

    Javid, you angel! Just as I'm starting my master's thesis, having to write a program that will have to crunch through a lot of complex numbers, you post this. Thanks!

  • @jordanclarke7283
    @jordanclarke7283 4 years ago +5

    I wish to register interest in the SIMD code. Also I wish to acknowledge that the part where you talked about running in release mode versus debug mode was for my benefit after the (stupid) question I asked in your last video regarding the speed of decal rendering. Keep up the good work! Awesome channel! 👍🏻

  • @taraxacum2633
    @taraxacum2633 4 years ago +63

    Maybe real-time ray tracing??? 😄 I am trying to do it rn but I'd love your explanation and tricks!

    • @javidx9
      @javidx9  4 years ago +42

      It's a popular request for sure!

  • @GregoryTheGr8ster
    @GregoryTheGr8ster 4 years ago

    This is a totally riveting video! Not only do you dive deep into the Mandelbrot fractal, to the point where a mighty 64-bit double is inadequate, but you also dive into unearthing the incredible power of the Intel CPUs. I completely fell in love with SSE and AVX when I finally decided to learn how to make use of them, but I have not yet ventured into threads and multi-cores. I know that I need to. It's on my bucket list. No, actually it is much more imminent than that. I will get into multi-core programming soon. This is a *must*.

  • @antipainK
    @antipainK 4 years ago +18

    I would love to see how to do the conversion to SIMD! :)
    Btw, my first C++ project was a Mandelbrot Set generator and I made most of the mistakes that you pinpointed here. Now I'm gonna rewrite it after final exams. :D

  • @nawdawg4300
    @nawdawg4300 3 years ago

    I watched a Numberphile video about the Mandelbrot set (and the key fact that any value past a radius of 2 explodes to infinity) and I managed to write the code for a multithreaded implementation (not thread pooled) without any external resources. Seeing the Mandelbrot pop up on the screen for the first time was an amazing feeling haha. Anyways, now that I've stumbled across your video I can't wait to learn more about and implement the thread pool/vector co-processing (I just got my BS CS degree and am disappointed I didn't get to learn more about this stuff). This is definitely one of the better videos I've come across when it comes to Mandelbrot implementation, so thanks for that!

  • @marcodebruin5370
    @marcodebruin5370 2 years ago +2

    One way to avoid the "waiting for the slowest thread" problem, and to keep every thread busy till the end (and thus shorten the total time needed), is to have each worker calculate the iterations for one pixel, then ask for the next pixel not yet started, feeding pixels to the threads until you've run out of pixels to do. All threads stay busy until the last few pixels.
    That should still give a significant improvement (especially on screen areas that are unbalanced).
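
    A minimal sketch of that hand-out scheme: a shared atomic counter serves pixel indices until none remain, so fast threads naturally absorb the slack of slow ones. Sizes and the kernel below are hypothetical; in practice you would hand out small batches of pixels rather than single ones to keep counter contention down.

        #include <atomic>
        #include <complex>
        #include <thread>
        #include <vector>

        constexpr int kWidth = 640, kHeight = 360, kMaxIter = 256;

        std::atomic<int> nextPixel{0};
        std::vector<int> iterations(kWidth * kHeight);

        // Stand-in Mandelbrot kernel: escape-time count for one pixel.
        int IteratePixel(int x, int y) {
            std::complex<double> c(-2.0 + 3.0 * x / kWidth,
                                   -1.0 + 2.0 * y / kHeight);
            std::complex<double> z = 0;
            int n = 0;
            while (std::norm(z) <= 4.0 && n < kMaxIter) { z = z * z + c; ++n; }
            return n;
        }

        void Worker() {
            for (;;) {
                int p = nextPixel.fetch_add(1, std::memory_order_relaxed);
                if (p >= kWidth * kHeight) return;  // no pixels left
                iterations[p] = IteratePixel(p % kWidth, p / kWidth);
            }
        }

        int main() {
            std::vector<std::thread> pool;
            for (unsigned i = 0; i < std::thread::hardware_concurrency(); ++i)
                pool.emplace_back(Worker);
            for (auto& t : pool) t.join();
        }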

  • @unknownuser8044
    @unknownuser8044 4 years ago +77

    And now, Implemented on the GPU? :)

    • @javidx9
      @javidx9  4 years ago +54

      Well that is indeed the next phase I suppose, though I reckon there is a lot to squeeze out of the CPU yet!

    • @Gelikafkal
      @Gelikafkal 4 years ago +1

      @@javidx9 I think the communication between CPU and GPU over the PCI bus will eat up most of the performance that a GPU implementation would give you, as long as you don't have a way to hand a device pointer directly to the graphics library/engine.

    • @mariokart6309
      @mariokart6309 3 years ago +1

      @@Gelikafkal GPU compute is very much viable for anything that can take advantage of mass parallelisation, otherwise why would that functionality even exist?

    • @lucasgasparino6141
      @lucasgasparino6141 3 years ago +1

      To anyone interested: the CUDA C book has a bare bones implementation of the Julia/Mandelbrot sets, written in C. It's very intuitive to use, and the difference is GIGANTIC if your GPU is reasonably modern.

    • @TenshiSeraph7
      @TenshiSeraph7 3 years ago

      Yup, I reimplemented the Mandelbrot set using recursive CUDA (following a tutorial), although it was not a live-updating viewer.
      Indeed the transfer of data between host and device (GPU) is usually costly, but it should be possible to move fractal data directly to the frame buffer without transferring everything back to the CPU.
      There is also an optimization for parallel implementations of the Mandelbrot set which takes advantage of the fact that all points in the set are connected. You can divide the set into different rectangles which are computed by different threads. First you compute only the pixels on the border of each rectangle. If none of the border pixels contain any part of the fractal, then you can skip the calculation of the pixels inside the rectangle (assuming you are not zoomed out too far).
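
      A rough sketch of one common variant of that trick (sometimes called Mariani-Silver subdivision): because the set is connected, a rectangle whose entire border produces a single iteration count can have its interior flooded with that count. IteratePixel, img and width are hypothetical placeholders.

          #include <vector>

          int IteratePixel(int x, int y);  // assumed escape-time kernel

          // Returns true if the border of [x0,x1) x [y0,y1) was uniform
          // and the rectangle was filled without iterating its interior.
          bool FillIfUniformBorder(int x0, int y0, int x1, int y1,
                                   std::vector<int>& img, int width) {
              int ref = IteratePixel(x0, y0);
              for (int x = x0; x < x1; ++x)
                  if (IteratePixel(x, y0) != ref ||
                      IteratePixel(x, y1 - 1) != ref) return false;
              for (int y = y0; y < y1; ++y)
                  if (IteratePixel(x0, y) != ref ||
                      IteratePixel(x1 - 1, y) != ref) return false;
              for (int y = y0; y < y1; ++y)        // uniform border: flood fill
                  for (int x = x0; x < x1; ++x)
                      img[y * width + x] = ref;
              return true;
          }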

  • @rhydermike
    @rhydermike 4 years ago

    Highly enjoyable video there. I first came across the Mandelbrot set on my BBC Micro, as there was a viewer included on a Disc User Magazine cover disc. It wasn't *quite* as fast as your routine in execution.
    You've made me want to break off from my regular programming project to render the Mandelbrot set, something I have never done. Just one for the nerdy bucket list. Great stuff, and thanks for making it.

  • @ghollisjr
    @ghollisjr 3 years ago +1

    I had a lot of fun making a Mandelbrot fractal Web app with WebGL by putting the fractal calculations in a fragment shader. Highly recommend it as a learning project.

  • @undersquire
    @undersquire 4 years ago +28

    Luv ur vids, very informational :) I've learned a lot.
    I would love to see a series on possibly making your own basic coding language in C++.

    • @javidx9
      @javidx9  4 years ago +15

      It's certainly an interesting idea! If I find a need for such a language then it may make a good series

  • @christianantfeld3827
    @christianantfeld3827 4 years ago +1

    I'm shocked by how smart you and the other commenters are. Absolutely incredible if anyone else can follow along.

  • @gilleswalther5964
    @gilleswalther5964 4 years ago +2

    Great performance! Really impressive what you can get out of the CPU.

    • @DFPercush
      @DFPercush 4 years ago +1

      +1 just for the profile pic. Very apropos.

  • @Kenjuudo
    @Kenjuudo 3 years ago +1

    Awesome video. It seems I've struck gold by finding your channel. I can already tell I will never regret subscribing to you!
    Don't ever change your format! You hear???

  • @Zartymil
    @Zartymil 4 years ago +1

    Such a fun video! Didn't even feel like it was almost an hour long. Thanks for the content! Subscribed :)

  • @battosaijenkins946
    @battosaijenkins946 2 years ago

    Amazing. I was using intrinsics to see if I could outperform the Quake 3 square root dilemma a while back, and this brought back wonderful memories. I can't wait to try this Mandelbrot concept using OpenGL! Thank you~ 👏👏👏

  • @ogr2
    @ogr2 4 years ago +2

    Awesome video man. Actually I was watching another video about the "Mandelbrot Conjunction" and I found it quite interesting to analyze.

    • @javidx9
      @javidx9  4 years ago

      Cheers Oscar, they just never get boring - I really find Mandelbulbs quite interesting

  • @rachelmaxwell4936
    @rachelmaxwell4936 4 years ago +6

    I enjoy and appreciate all of your content. I would really like a video on intrinsics.

  • @DestroManiak
    @DestroManiak 3 years ago

    This is fascinating, and immensely helpful for scientific calculation coders.

  • @davidbundgaard
    @davidbundgaard 4 years ago

    It's such a joy to see you program this fractal generator, and I would love to see more about AVX, GPU calculations on this problem, and threads. Very nice walkthrough. Keep it up as always.

  • @oscill8ocelot
    @oscill8ocelot 4 years ago +4

    Consider this my shout-out for more vector processing instruction videos :3

  • @Hacker-at-Large
    @Hacker-at-Large 4 years ago +13

    I’d also love to see this extended into using the GPU.

  • @Endede
    @Endede 3 years ago

    In addition to the great walkthrough of optimizations, and the fantastic explanations of successes and failures of optimizations, I immensely enjoyed your near-giddiness when playing with your Mandelbrot Set Explorer. Clearly, you love coding, but I feel like in this video, it was extra-apparent and frankly infectious.

  • @portlyoldman
    @portlyoldman 4 years ago +1

    Blimey. Really enjoyed that 😀 I'm not even a C++ programmer, only comfortable with C#, but I learned a lot from this. Also enjoyed the style and presentation. Thanks. Subbed 😁

    • @javidx9
      @javidx9  4 years ago

      Thank you Jim, much appreciated!

  • @morphx666
    @morphx666 4 years ago +1

    Absolutely! I'd love to see a video explaining the conversion to SIMD data types.

  • @zikomo8913
    @zikomo8913 4 years ago

    I just started learning programming and I was dreaming of making a Mandelbrot fractal plotter and then improving it bit by bit (going lower and lower towards the hardware) to optimize it.
    And now you did a video on it, amazing!

  • @chrismingay6005
    @chrismingay6005 4 years ago +1

    Great video, thank you. I really liked the pacing (as always), and while not a C++ programmer in any capacity I learned quite a lot as well.

  • @Gordonfreems
    @Gordonfreems 4 years ago

    The value of this video is incredible, thank you

  • @CrystalMusicProductions
    @CrystalMusicProductions 4 years ago +3

    Wow this video and in general your channel are a great motivation to keep programming and exploring the field :)

    • @javidx9
      @javidx9  4 years ago +1

      Thank you Crystal Music!

  • @mattbland2380
    @mattbland2380 4 years ago

    Loved this. I remember entering Basic programs to plot the Mandelbrot set back on 8 bit computers and tweaking them myself without understanding them properly and getting wonderfully strange results.

  • @Collinoeight
    @Collinoeight 4 years ago

    Definitely interested in seeing that conversion. This is one of the best CS videos I've seen in quite a while.

  • @muckiSG
    @muckiSG 4 years ago

    For everybody interested in the Mandelbrot set and not deep in the subject, I would suggest searching for "numberphile mandelbrot". Really nice videos ... have fun!

  • @ryanlyle9201
    @ryanlyle9201 4 years ago +2

    "quite advance for this channel". My god, he's holding back his full power.

  • @ComposingGloves
    @ComposingGloves 4 years ago

    If you had kept reading the wiki you would be my micro-processor professor who seems to think reading slides counts as teaching. Thanks for teaching! Your channel is amazing!

  • @jsflood
    @jsflood 4 years ago +2

    Amazing as always.
    This is great! Really interesting. I would love to see a video explaining the vector co-processing (immintrin/SIMD/AVX) further. Thank you. Great work :-)

  • @cameraforchristmas
    @cameraforchristmas 4 years ago

    What a terrific find! I'm working on a project based on the Mandelbrot set. I fell in love with it when it was an assignment in college many years ago. What is crazy is that I came across your channel not for the Mandelbrot part but when trying to find a good C++ development environment for the Raspberry Pi. I chose wxWidgets, Code::Blocks and MinGW. So, I found the channel vid on wxWidgets setup. Now here I've run into you twice on the same project. How cool is that?
    I've already compiled my Mandelbrot starting point in C#, my favorite environment.
    My end point is a Raspberry Pi and a 7" touchscreen in a permanent case. It should allow the user to zoom in (and out). I want it to behave as a purpose-built, single-function appliance. No desktop, no multiple windows, no choosing the app to run, just plug it into the wall.
    For extra credit, I thought I'd see if I could output the display on the HDMI connector at the same time so someone can enjoy it on a monitor or the living room TV. This would be the Mandelbrot-set-top-box. Haha.
    So, thanks twice for the great content!
    Dave
    San Francisco

  • @SamClayMold
    @SamClayMold 4 years ago +1

    Hey javidx9, love this video and your channel. Commenting here to say I am definitely interested in that video about the 256-bit registers and parallelism.

  • @patrickc1659
    @patrickc1659 4 years ago

    Really enjoyed this, thanks. I'm all for hearing about how the AVX-256 works.

  • @anakimluke
    @anakimluke 4 years ago +2

    THE INTRO SOUND is finally better!! thanks! hahah :)

  • @BuxleyHall
    @BuxleyHall 4 years ago +2

    Thanks for the video! Add my name to those interested in learning more about AVX and SIMD. On a different topic, but still related to rendering the Mandelbrot set, I'd love to see how to get past the resolution limit of 64-bit doubles so one could explore even deeper into the set. Thanks again!

  • @cumulonimbuscapillatus9312
    @cumulonimbuscapillatus9312 4 years ago +1

    The accidental video recording lag near the end though lmao. Great video as always!

    • @javidx9
      @javidx9  4 years ago +3

      lol, hey it's an interesting and very real side effect, so I thought it had value and kept it in :D

  • @duality4y
    @duality4y 4 years ago

    Thanks for this! It answers a lot of questions I had regarding SSE and SIMD.

  • @photonymous
    @photonymous 4 years ago

    One of the first optimizations I like to try is to take advantage of "superscalarness", i.e. "pipelining". The goal here is to keep your ALU pipelines fed (no bubbles). Achieving it can be as simple as unrolling for loops (processing 4 pixels in one step of the for loop, for example). It can sometimes be shocking how much can be gained from this simple technique. To get the most out of it you must be mindful of "dependency chains", which prevent concurrent execution of code. There are cases where it can be faster to do some redundant calculations if it increases concurrency.
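
    A minimal illustration of breaking a dependency chain by unrolling, using a plain array sum with hypothetical names: four independent accumulators let several additions be in flight at once, where a single accumulator would serialize them. (Note this changes the floating-point rounding order, which is one reason compilers won't do it on their own without flags like -ffast-math.)

        #include <cstddef>
        #include <vector>

        double PipelinedSum(const std::vector<double>& v) {
            double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
            std::size_t i = 0, n = v.size();
            for (; i + 4 <= n; i += 4) {
                s0 += v[i];      // four separate chains: the adds overlap
                s1 += v[i + 1];
                s2 += v[i + 2];
                s3 += v[i + 3];
            }
            for (; i < n; ++i) s0 += v[i];  // leftover tail
            return (s0 + s1) + (s2 + s3);
        }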

  • @andersonklein3587
    @andersonklein3587 4 years ago

    I absolutely loved this AVX implementation!
    I might just try to implement that in a game engine I'm building, for parallel processing inside the thread pool.
    Thank you so much for your amazing video on optimization!
    PS: I'm actually ashamed that I hadn't used .wait in my code for the pool. I literally had my threads polling constantly and wasting cycles. lol

  • @matthewpalermo4957
    @matthewpalermo4957 4 years ago +14

    Loved the demonstration of how good the compiler optimization is! Question: is there an abs_squared method for std::complex values? That could be used instead of your implementation and still avoid the square root.

    • @johngreen1060
      @johngreen1060 4 years ago +4

      That would be a dot product of the complex number with itself, or multiplication of the number by its complex conjugate.

    • @fredg8328
      @fredg8328 4 years ago +1

      Compilers are not always good at optimizing. They are smart enough to do small optimizations, but generally you can do better with higher-level ones. And there are a lot of differences between compilers. The Visual Studio compiler is known to be one of the worst at optimizing. But that's the same story as whether you should use the STL or not: Visual Studio follows the standards more tightly and has to add a lot of checks to the generated code.

    • @wtk300
      @wtk300 4 years ago +2

      std::norm(std::complex)
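
      For illustration, a minimal escape-time loop using std::norm, which returns the squared magnitude without taking a square root, so the test |z| > 2 becomes norm(z) > 4:

          #include <complex>

          int Iterations(std::complex<double> c, int maxIter) {
              std::complex<double> z = 0;
              int n = 0;
              while (std::norm(z) <= 4.0 && n < maxIter) {
                  z = z * z + c;  // z -> z^2 + c
                  ++n;
              }
              return n;
          }

      Whether this matches the hand-rolled version's speed depends on the compiler's std::complex code generation, so it's worth checking the assembly.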

  • @kris030
    @kris030 4 years ago

    Please do more of this; optimization is my jam.

  • @maybenexttime37
    @maybenexttime37 4 years ago +1

    Great video! An improvement would be to divide the screen into many thin rows instead. Rows are better than columns for two reasons: 1) better cache locality for threads when they're working on a particular job; 2) you can increase the number of rows arbitrarily, whereas the number of columns has to be tuned so that you're as close to a multiple of 8 pixels per column as possible (otherwise you give up potential AVX benefits). This is important because in order to reduce bubbling you want to increase the number of jobs so that the workload is more balanced among threads. Of course, to do that you need to decrease the size of each row/column, maybe even down to a single pixel row per job, because at the given width (1280) thread signaling overhead will likely be negligible even then.

  • @dhiahassen9414
    @dhiahassen9414 4 years ago

    Best fractal programming video on YouTube.

  • @bloodypommelstudios7144
    @bloodypommelstudios7144 4 years ago

    A few ideas:
    Every time you scroll you're recalculating the whole frame. If you saved the iteration data and scrolled 2 rows of pixels, you'd only need to actually calculate the extra 2 rows which have become visible. Hell, you could precalculate a border after the screen-space pixels have been calculated, in anticipation of the user scrolling.
    There's a few ways certain pixels could be predicted. You get bands of given iteration values and never get lower values inside higher values. Therefore if you hit the highest iteration value you could find the outline of this shape and simply fill it in. You could then work your way out from that shape to the 2nd-highest iteration value and repeat the process. This would probably be insanely complicated, but in certain circumstances it could be a colossal speed boost.
    Mirroring. If the screen space passes through zero on the Y axis you could simply mirror the pixels above the line for a nearly 2x speed boost.
    Scale x2. If you had a button to double the zoom then you've already calculated 25% of the pixels if it's lined up correctly. The same is true zooming out, though it would need calculating slightly differently.
    Between these I'd imagine it could hit an effortless 4K 60fps.

  • @wesleythomas6858
    @wesleythomas6858 4 years ago

    Awesome video. Great touching on the different techniques available, never even knew about SIMD! 👍

  • @Saxie81
    @Saxie81 4 years ago +4

    @32:54 I think it would be neat to look at the conversion process!

    • @javidx9
      @javidx9  4 years ago +3

      Yeah, me too! Noted!

  • @stickfigure31
    @stickfigure31 5 months ago

    @24:36 I'm sure in most cases using the standard libraries is the preferred way, but tracing this back down to the square root function cutting performance in half did give me an appreciation for why id Software made their own faster (but less accurate) square root algorithm for the Quake engine.

  • @kyleeames8229
    @kyleeames8229 4 years ago

    I too would be interested in the SIMD conversion video. I'm always fascinated by low level stuff and would happily watch the video.

  • @dioneto6855
    @dioneto6855 4 years ago

    Funny, because when you started using threads I wondered what would happen if you went to a region of the fractal that's stressful for the processor while recording.
    And thanks, you killed my curiosity.

  • @DavidWeiss2
    @DavidWeiss2 4 years ago

    Please do more like this. I learned a lot!

  • @ratchet1freak
    @ratchet1freak 4 years ago

    About the condition variable wait: you should wrap the unique_lock and the wait in a block to avoid holding that lock for too long (it doesn't really matter in this case because each worker has its own mutex), and you should guard against spurious wakeups.
    You could also try other primitives for launching the threads, like a semaphore or a barrier. The barrier is the superior method to do this kind of waiting, as it can actually wake up multiple threads at a time (which you can't with condition variables, even with broadcast, due to thundering-herd-related issues), and the completion function can both signal completion to the main thread and then wait for the next task.
    (edit) Also, if you divided tasks by row instead of column you might get some gains because of cache prefetching and false-sharing avoidance.
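
    A minimal sketch of that scoping-plus-predicate pattern; mtx, cv and hasWork are hypothetical names for the worker's signalling state:

        #include <condition_variable>
        #include <mutex>

        std::mutex mtx;
        std::condition_variable cv;
        bool hasWork = false;

        void WorkerLoop() {
            for (;;) {
                {   // hold the lock only while checking/clearing the flag
                    std::unique_lock<std::mutex> lock(mtx);
                    // The predicate form re-checks the flag on every
                    // wakeup, so spurious wakeups are handled for free.
                    cv.wait(lock, [] { return hasWork; });
                    hasWork = false;
                }   // lock released here, before the heavy work
                // ... per-frame fractal work would go here ...
            }
        }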

  • @stephenbradshaw6727
    @stephenbradshaw6727 4 years ago

    A splendid video! Thank you very much. It might be interesting to insert a generator function to be called in place of the Mandelbrot-specific code, with arguments that will create different kinds of fractals for exploration. I am certainly interested in seeing more specifics w.r.t. implementing the AVX extensions.
    Amusingly, the Mandelbrot Set is based on iterating quadratic polynomials, similar to those which describe population growth and are being used to gain some insights into our current global predicament.

  • @blinded6502
    @blinded6502 4 years ago

    Just imagine these fractals as continents in a solar-system-sized ocean.
    Literally breathtaking.

  • @oblivionronin
      @oblivionronin 4 years ago +1

      Awesome long video, love it! Fractals are indeed hypnotizing! Also, yes, I definitely want to see that SIMD optimisation.
      I was thinking, a couple of things we could do to optimise the display of this would be:
      1. Cache the world position, zoom level and iteration count; if those haven't changed we can just re-render the frame that's already baked in.
      2. Once those are calculated, use that data (time for each thread to finish) to guess where and how you should divide up your thread grid. Maybe some areas (like that big half-screen with very little calculation) should be handled by only one thread, and that one sliver of fractal should be calculated by the other 31 threads. (I.e., you keep the finish time for each thread, and an average over them all. On each frame or iteration, if a thread's finish time is shorter than the average it should take some screen space from other threads (vertically or horizontally); if it's longer it should give some up. That way, threads dynamically exchange their screen space depending on their own finish time and the average.)
      Sorry for the long comment, just had to put that out there! :P Cheers!

    • @antipainK
      @antipainK 4 years ago

      Yeah, dividing the screen among threads by bitmasks should do the trick. Didn't think of it. Thanks, gonna implement it in my own Mandelbrot generator :)

  • @sabriath
    @sabriath 4 years ago

    You can divide the screen into a grid that is 4x the number of threads you plan on using in the pool (or more). As each thread is "start"ed, it reads a position from the list based on the mux variable, which is decremented at the time of reading... and when the thread is done, it can get another one and another, etc., only stopping when the mux is below zero, at which point it decrements the "threads done" mux. So if a particular set of grid cells is fast, those threads can help finish off other parts. For a more advanced approach, you can have the same thread alter the grid coordinates at that list position when it finishes, ready for the next iteration, before advancing... the controller code literally only has to reset the grid mux to the number of cells in the list and the "threads done" mux to the number of threads, then present the screen and repeat when the "threads done" mux hits zero (testing a mux against zero is faster than comparing to a non-zero value).
    So in a 1280x760 window with 32 worker threads, I would suggest a list of 608 grid cells of 40x40 pixels each. That averages out to each thread working on 19 cells per round.

  • @postvideo97
    @postvideo97 4 years ago

    Very interesting that brute force allows a two-to-three-order-of-magnitude acceleration; personal computers have gotten so fast it's incredible. The usual method of accelerating Mandelbrot set rendering was to use a series approximation: render a single pixel at full precision (the anchor) and use small residual numbers for all other pixels, thus shaving off many orders of magnitude in time.
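
    A rough sketch of that anchor/residual idea (perturbation): iterate one reference orbit Z at the anchor point C, then for a nearby pixel c = C + dc track only the small offset d, using z = Z + d and the identity z^2 + c = (Z^2 + C) + (2*Z*d + d^2 + dc). Plain doubles stand in here for the high-precision anchor arithmetic a real deep-zoom renderer would use:

        #include <complex>
        #include <cstddef>
        #include <vector>

        using cplx = std::complex<double>;

        // Reference orbit at the anchor point C (computed once).
        std::vector<cplx> ReferenceOrbit(cplx C, int maxIter) {
            std::vector<cplx> orbit;
            cplx Z = 0;
            for (int n = 0; n < maxIter && std::norm(Z) <= 4.0; ++n) {
                orbit.push_back(Z);
                Z = Z * Z + C;
            }
            return orbit;
        }

        // Escape count for c = C + dc, iterating only the residual d.
        int PerturbedIterations(const std::vector<cplx>& orbit, cplx dc) {
            cplx d = 0;
            for (std::size_t n = 0; n < orbit.size(); ++n) {
                if (std::norm(orbit[n] + d) > 4.0)
                    return static_cast<int>(n);        // escaped
                d = 2.0 * orbit[n] * d + d * d + dc;   // delta recurrence
            }
            return static_cast<int>(orbit.size());     // did not escape
        }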

  • @johndoe1909
    @johndoe1909 4 years ago

    Back in the day, around '93, I was working with massive parallelisation on multiple CPUs. I remember there were big differences depending on how you iterated over the axes. This was related to the way the MMU swapped memory in and out (row- or column-oriented). Those were the days!

  • @Cr4zy4pple
    @Cr4zy4pple 4 years ago

    Normal people pay top money for this kind of knowledge, not to mention the unfathomable amount of work put into scripting and compiling these vids. It's baffling how clean, simple yet detailed you can craft these tutorials. My IQ went up exponentially since I started watching you. Never stop making these vids; you'll go down in history as a legend. Also, please drop a link where I can donate single amounts for poor wankers like myself.

  • @alles_muss_anders_werden
    @alles_muss_anders_werden 3 years ago

    I have done the fractal calculation on a 64-bit Linux system that has 2 processors with 6 compute cores each. I included the libomp library and it performed very well.

  • @benrex7775
    @benrex7775 4 years ago

    This was really interesting. Thank you for making the video.

  • @sachinambetkar3637
    @sachinambetkar3637 4 years ago +4

    Thank you so much, sir ... I always follow your tutorials... they have helped me a lot ... 🙂

    • @javidx9
      @javidx9  4 years ago

      Glad to hear that Sachin!

  • @lohphat
    @lohphat 1 year ago

    Also note the set is symmetrical about the real axis (y = 0): points with negative imaginary components match their corresponding positive points. So another optimization is to calculate only the positive imaginary half, then reflect it into the negative half when that portion of the set is visible.
    Also, points which have exceeded max iterations are assumed to be inside the set and are usually colored black -- have a go at that too, it makes for better plots.
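
    A minimal sketch of that reflection, with hypothetical img/width names: when the view straddles the real axis, a finished row of iteration counts at +imag can be copied straight into its mirror row at -imag instead of being re-iterated:

        #include <algorithm>
        #include <vector>

        // Copy one computed row onto its mirror row below the axis.
        void MirrorRow(std::vector<int>& img, int width,
                       int srcRow, int dstRow) {
            std::copy_n(img.begin() + srcRow * width, width,
                        img.begin() + dstRow * width);
        }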

  • @rednike500
    @rednike500 4 years ago

    It would be nice to see a video on parallelism and AVX, but your videos are always super interesting. Please continue to do as you like in your videos; you are an amazing programmer.

  • @1...........123-t9b
    @1...........123-t9b 4 years ago

    Hi David, the vids are awesome and I loved them! You ought to be earning more subs!

  • @TheMrNaatan
    @TheMrNaatan 4 years ago

    Oh yes, we would like to see this video on SIMD/AVX! :D

  • @rustycherkas8229
    @rustycherkas8229 3 years ago

    Processor envy!
    I explored Mandelbrot on my '286 (without a coprocessor). Filling a single EGA screen, at higher zooms, could take hours! Those were the days...
    Yours is running plenty fast for us consumers, but, seeking to 'optimise' my slow version, I recall using the log() of the zoom (somehow) to 'dynamically' adjust the iteration threshold... It was all long, long ago... Probably should have been "inverse square" functionality, but log() performed well enough... I'm no mathematician...

  • @lifeisgameplayit
    @lifeisgameplayit 4 years ago

    Big thumbs up here. I find this channel's content rather elegant, with some funky twist to it. ;}

  • @tomay3000
    @tomay3000 4 years ago

    Man, this is awesome: the math performance calculations. I didn't know about that header file before.
    Yes, I do want to see more videos about it and how you did the conversion.
    TIA.

  • @axelBr1
    @axelBr1 2 years ago

    Seeing your code optimisation (at about the 20-minute mark) not working was one of the best parts of the video. In the past I would have done the same things that you did, moving variable declarations outside the loop (though I would probably have missed the calculation optimisation), but doing that troubles me, as it follows bad programming practice: it separates the point of declaration from the point of use and increases the variable's scope. I've been hoping that the compiler would do lots of magic to allow me to write clear code that is still optimised.

  • @carsont1635
    @carsont1635 4 years ago

    Thank you for the excellent video! I have a course next semester specifically regarding parallel processing, and this was a great introduction to some of the techniques involved. And I would love to see the video mentioned about how to use a CPU's vector processor!

  • @Padeir0
    @Padeir0 4 years ago

    That was a very enjoyable video! And I'm really hoping for a follow-up video working around the precision issues :D

  • @s-sugoi835
    @s-sugoi835 4 years ago

    Subbed since 10k, and your videos still motivate me to not give up on C++ even though I mainly program nowadays in Python 'cause of university.
    C++ sure is beautiful 😊.

  • @fernandomartins5451
    @fernandomartins5451 4 years ago

    Awesome video! I'd love to see more about SIMD and vectorization.

  • @seanvinsick
    @seanvinsick 4 years ago +1

    Very interested in the SIMD conversion. Also would love to see more multithreading.

  • @colinbrown7947
    @colinbrown7947 4 years ago

    Very nice video! I would love to see you implement more Mandelbrot-specific optimizations, perhaps even looking into perturbation or series approximation.

  • @marshallsweatherhiking1820
    @marshallsweatherhiking1820 1 year ago

    I made one of these with Java about 10 years ago. I'd love to try and port it to C++ with wxWidgets and add these threading optimizations. There's a fairly major optimization you can use for drawing the interior: you can test for cycles so you don't have to go through all 1000 or however many iterations for every point in the interior. Because Java was quite slow, I also had the rendering thread compute a low-resolution image first, plot that to the screen, then fill in more and more detail with time. This made exploration seem responsive while panning/zooming. Interestingly, that's now unnecessary, as you're getting smooth panning even at full resolution. I also added conditional sub-pixel anti-aliasing (activated based on the local contrast between adjacent pixels) to make certain highly detailed areas appear less noisy. Without the limitations of Java I could explore unlimited-precision methods to get deep zooms.
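
    A minimal sketch of that interior cycle test (Brent-style periodicity checking; the snapshot window is a hypothetical tuning choice): if the orbit exactly revisits a saved value, it can never escape, so we bail out without burning the full iteration budget:

        #include <complex>

        int IterateWithCycleCheck(std::complex<double> c, int maxIter) {
            std::complex<double> z = 0, saved = 0;
            int window = 8;       // snapshot interval, doubled as we go
            int sinceSaved = 0;
            for (int n = 0; n < maxIter; ++n) {
                if (std::norm(z) > 4.0) return n;  // escaped
                z = z * z + c;
                if (z == saved) return maxIter;    // orbit repeats: interior
                if (++sinceSaved == window) {
                    saved = z;                     // refresh the snapshot
                    sinceSaved = 0;
                    window *= 2;
                }
            }
            return maxIter;
        }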

  • @gyokzoli
    @gyokzoli 3 years ago

    Excellent video, thank you!

  • @skilz8098
    @skilz8098 4 years ago

    I found this to be an excellent video and very enjoyable. As for the possibility of you making a video strictly based on the intrinsics library that is based on the CPU's extensions, I would truly be interested in it. I've always used Intel CPUs and I'm decently familiar with the x86 architecture, but never got to work with that part of the CPU's instruction set.
    Something else to consider...
    One possible optimization could be to calculate each iteration of the Mandelbrot set as a recursive Function Template. The compiler can precalculate the values for you. Then you can cache them repetitively for each loop, each iteration.
    Another technique would be similar to the concepts of how rendering pipelines work with a back buffer and a swap chain. You would write to a texture image that is the back buffer first, calculate the next iteration of values, and write it to the next texture to be rendered. Then have the swap chain present the image to the screen as a single texture image mapped to a quad that is the size of the screen using basic vertex and fragment(pixel) shaders. After that have a compute shader working in its own thread to calculate the next iteration based on the input of where the zoom took place within the image and the zoom amount and zoom level.
    This will then create the appropriate texture for the next iteration and feed it into the swap chain's back buffer. This ought to make it possible to have nearly an infinite amount of zooming levels with minimal overhead, providing the precision of a 64-bit unsigned integer to represent which zoom level we are on. This zoom level variable would, of course, end up being incremented or decremented accordingly depending on the direction of the next zoom provided by the user.
    Another thing to consider is that we do have double precision floating point variables to work with, but they shouldn't be needed. It should be possible to work with basic floating-point types, however, we have to know what level of scale we are at since this algorithm will end up generating a unique set of floating-point values for each iteration and zoom level. There may be a small amount of overhead as we would have to keep an extra floating-point value around to keep the state of the previous zoom level so that if the user does zoom out, we can easily reference the previous state to use that for the next set of calculations.
    It's almost the same thing you've already done in this video, but it would add a few different sets of features: template metaprogramming to precalculate values for you, and the usage of a graphics pipeline involving not just the basic shaders, the swap chain, and the back buffer, but also compute shaders, giving us the ability to perform both pre- and post-process calculations of data for the next iteration to be rendered.

  • @RogueShadowTCN
    @RogueShadowTCN 4 years ago

    This may be the most C++ video in a while.

  • @TimJSwan
    @TimJSwan 4 years ago +26

    For millennia, every kid:
    "meh, math seems like boring black and whit..."
    BAM
    Mandelbrot entered the chat.

  • @GrayBlood1331
    @GrayBlood1331 4 years ago +1

    I come to this channel whenever I want to feel stupid.

  • @hblomqvist
    @hblomqvist 4 years ago

    SIMD optimization is always interesting, as is concurrency. Another thing I would like to try in the future, when I can afford it, is to install an FPGA card in my computer and load cores depending on what I wish to accelerate, and use that core for calculations. That would be a fun project in many aspects.

  • @mendonisstudios
    @mendonisstudios 4 years ago

    Great video. Thanks for the insights!