Optimizing Pseudo 3D Rendering // Code Review

The Cherno

32K views

Published: Nov 27, 2024

Comments • 134

@TheCherno 28 days ago ⁺¹⁹
Thanks for watching! Did you follow along with the exercise and try to find issues yourself? What did you find? 👇
Also, don't forget you can try everything Brilliant has to offer, free, for a full 30 days: visit brilliant.org/TheCherno. You'll also get 20% off an annual premium subscription.
@vlauderlauders741 28 days ago ⁺⁴⁷⁵
I sign the petition to see this game running on GPU
@marco_martin 28 days ago ⁺¹
Me too
@joey244 28 days ago ⁺¹
me too
@ArcShahi 28 days ago ⁺¹
Yes, that would be interesting...
@балаж98 28 days ago ⁺¹
+1
@michaelp_c 28 days ago ⁺¹
me five!
@marco_martin 28 days ago ⁺¹⁵⁸
I would ABSOLUTELY like it if you could run that code on the GPU
@DesyncX 28 days ago ⁺⁹⁴
Yes please, make this run on the GPU; maybe even increase the resolution to full screen and compare the results.
@xxdeadmonkxx 28 days ago ⁺⁴³
Recommended settings for that game:
CPU: Intel Core i3 8100
GPU: Yes
@ABaumstumpf 28 days ago ⁺⁷²
Some of your assumptions are completely wrong - like with the sin/cos values:
Recalculating those values would be significantly slower, but the compiler can see that they are identical and will not redo the work all the time. On the other hand, explicitly storing those intermediate values has no chance of being a cold memory read, as they would only be used right after being calculated.
With the memory access in different parts of the array: nah, that really isn't a problem. Going backwards for the sky is likely far worse - but without an actual isolated benchmark there is no way of saying what is going on.
And what really is slow is setting every single pixel with SDL_RenderDrawPoint - this is extremely slow. The function does renderer setup, checks, allocations and a lot more for every single pixel. Using your own pixel buffer and then sending the whole thing at once will be much, much faster.
@xeridea 28 days ago ⁺¹
If they were locally cached, the values would not be cold. I was thinking it would be cached once before rendering, then fetched each frame. They may end up being prefetched though, so they would still be in cache.
Can CPUs detect reverse loop offsets and prefetch?
I was thinking the same thing about drawing pixels. There is going to be a massive overhead from drawing pixels individually. I am surprised he didn't mention that. Perhaps I will compare.
@ABaumstumpf 28 days ago ⁺²
@xeridea "Can CPUs detect reverse loop offsets and prefetch?"
Can? Pretty sure - yes. But as there are other things going on, it is still better to avoid that.
I have seen instances where the branch predictor managed to get better-than-chance performance on data that was basically random, and memory prefetch for lists.
@streamdx 28 days ago ⁺³
Yep! Cherno missed the elephant in the room this time! )
@jlewwis1995 28 days ago ⁺²
If the values are stored on the stack they would basically never be in cold memory, right? Because the CPU is accessing the stack all the time when you call functions, push function arguments to the stack, write to a stack-allocated buffer, etc. So the area of memory that contains the stack would be in the CPU cache most of the time, wouldn't it, since it's being used constantly?
@bobjones304 28 days ago ⁺²
Doesn't the compiler just inline them?
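For readers following the pixel-buffer suggestion in this thread, here is a minimal sketch of what "use your own buffer and send it at once" could look like. The `Framebuffer` type, the `fill` helper and the packed 0xAARRGGBB layout are assumptions made for illustration, not code from the video. The SDL2 upload is left as a comment so the sketch compiles without SDL installed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU-side framebuffer; the name and layout are assumptions.
struct Framebuffer {
    int width;
    int height;
    std::vector<std::uint32_t> pixels; // one packed 0xAARRGGBB value per pixel

    Framebuffer(int w, int h)
        : width(w), height(h), pixels(static_cast<std::size_t>(w) * h, 0u) {}

    // A plain memory store: no renderer state changes, no per-pixel API call.
    void set(int x, int y, std::uint32_t argb) {
        pixels[static_cast<std::size_t>(y) * width + x] = argb;
    }

    // Bytes per row, the "pitch" that an upload call needs.
    int pitch() const { return width * static_cast<int>(sizeof(std::uint32_t)); }
};

// Fill every pixel row by row (y outer, x inner) so writes are sequential.
inline void fill(Framebuffer& fb, std::uint32_t argb) {
    for (int y = 0; y < fb.height; ++y)
        for (int x = 0; x < fb.width; ++x)
            fb.set(x, y, argb);
}

// Once per frame, with SDL2 available, the whole buffer could be uploaded at once:
//   SDL_UpdateTexture(texture, nullptr, fb.pixels.data(), fb.pitch());
//   SDL_RenderCopy(renderer, texture, nullptr, nullptr);
//   SDL_RenderPresent(renderer);
// `texture` and `renderer` are assumed to exist, with the texture created as
// SDL_PIXELFORMAT_ARGB8888 and SDL_TEXTUREACCESS_STREAMING.
```

Whether this beats SDL_RenderDrawPoint by the margin claimed above would still need an isolated benchmark on the actual project.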
@stendaneel4879 28 days ago ⁺⁴⁵
9:34 Make the raytracing series run in a shader, it would be really cool to see how you would implement it. Maybe another cool video idea would be compute shaders with Vulkan, or a Vulkan series in general, kind of like the OpenGL series.
@mertcanzafer5446 28 days ago ⁺³
+1
@emomaxd2462 28 days ago ⁺⁴
or maybe running ray tracing with CUDA, which gives a lot more low-level control
@pyropoops139 24 days ago
@emomaxd2462 OpenCL would probably be a better bet, no? But I think an OpenGL shader would be the best bet as it is the closest to practical application in game engines
@Jellow2202 28 days ago ⁺²⁴
lerp is available in the standard library since C++20 as std::lerp in <cmath>
@hassaniq0777 26 days ago
whatfffff 💀😭😭😭
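A small sketch of the std::lerp tip above. The `blend` wrapper is a hypothetical name; the fallback branch is only there so the snippet also builds when the compiler is not in C++20 mode.

```cpp
#include <cassert>
#include <cmath>

// Blend between two values: t = 0 gives a, t = 1 gives b.
inline float blend(float a, float b, float t) {
#if defined(__cpp_lib_interpolate)
    return std::lerp(a, b, t);   // C++20 standard library version from <cmath>
#else
    return a + t * (b - a);      // hand-written fallback for older standards
#endif
}
```

For example, `blend(0.0f, 10.0f, 0.5f)` evaluates to 5.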
@oskardeeream1846 28 days ago ⁺²¹
You should do a collaboration video with One Lone Coder :D that would be awesome.
@RequiDev 28 days ago ⁺³
15:30 While you're right in most cases that caching comes with the cost of memory and of reading from that memory, in this specific case it's just a constant that not only never changes, it can be computed at compile time and will very, very likely just live directly inside the instruction as an immediate operand. Compilers are very smart.
@OlxinosEtenn 28 days ago ⁺³
17:40 - 20:20
It's not *that* bad since it's almost sequential.
Also, 19:25 suggests that reading an array in reverse is always bad, which is wrong (it might not be what was meant, but it's very easy to interpret it like that).
I made a small program to illustrate that but YouTube dislikes comments with links and ate it (so I'm reposting my comment without that, I hope I'm not being a bother). The takeaway was that:
- going through an array sequentially or in reverse (sequentially but backwards) doesn't noticeably change performance
- reversing the order of rows (like in the video) or columns causes a small performance hit (about +5% time spent on my machine with x=y=10000 and a loop body consisting of a single addition), possibly not noticeable if the loop body does as much work as the one shown in the video
- iterating over x in the outer loop and y in the inner loop, however, causes a massive performance hit (about +900% time spent, same context as above); that's the main thing to avoid if possible
- random access is even worse (about +1500% time spent, same context as above)
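The commenter's program was not posted, so the following is only a sketch of the traversal orders being compared, assuming a row-major grid (the `Grid` type and function names are made up here). All three functions compute the same sum; they differ only in the order in which memory is touched, which is what the timings quoted above are about.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major grid: element (x, y) lives at index y * width + x.
struct Grid {
    std::size_t width;
    std::size_t height;
    std::vector<std::uint32_t> data;
};

// y outer, x inner: consecutive iterations touch adjacent addresses.
inline std::uint64_t sum_rows_first(const Grid& g) {
    std::uint64_t total = 0;
    for (std::size_t y = 0; y < g.height; ++y)
        for (std::size_t x = 0; x < g.width; ++x)
            total += g.data[y * g.width + x];
    return total;
}

// Rows visited bottom to top, but still sequential inside each row.
inline std::uint64_t sum_rows_reversed(const Grid& g) {
    std::uint64_t total = 0;
    for (std::size_t y = g.height; y-- > 0;)
        for (std::size_t x = 0; x < g.width; ++x)
            total += g.data[y * g.width + x];
    return total;
}

// x outer, y inner: every step jumps a whole row ahead in memory.
inline std::uint64_t sum_columns_first(const Grid& g) {
    std::uint64_t total = 0;
    for (std::size_t x = 0; x < g.width; ++x)
        for (std::size_t y = 0; y < g.height; ++y)
            total += g.data[y * g.width + x];
    return total;
}
```

Timing each function on a large grid (the comment used 10000 x 10000) is how the difference would show up; the percentages above are the commenter's own measurements.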
@oliverdowning1543 28 days ago ⁺²
I actually did this exact code as a GLSL fragment shader for a project a few months ago. It's quite fun as a project.
@mohamedyusuf4777 27 days ago ⁺¹
I love this code review series. Keep up the good work.
@Steven-tw7iz 28 days ago ⁺⁵
I would love to see you convert this loop to use SSE/AVX intrinsics to really start to use the power of modern CPUs; not enough people really know or understand that stuff
@Waffle4569 28 days ago ⁺⁶
6:12 When std::chrono is such a cumbersome namespace that you need to make a wrapper around it.
@mjthebest7294 28 days ago ⁺²
just like pretty much the entire standard library
@mr.anderson5077 28 days ago ⁺⁴
Please teach multithreading for scenarios like this, and offloading work onto multiple CPU cores
@juanmacias5922 28 days ago
This was really cool, I hope you continue the code review series. :D
@matsv201 28 days ago ⁺²
I would say for caching you would want a mix of functions.
The issue is that a typical modern CPU does about 4 instructions per cycle, but every instruction takes anywhere between about 4 and 15 cycles to complete.
If you feed the result from one instruction that takes a lot of cycles into another one, it has to wait for it to catch up.
The scheduler often does a good job of this for short sequences, but if you do a loop, that may not be possible.
So inside the loop you would want a good mix of cache accesses, floats and other instructions. The more of a mix you have, the faster it will execute.
Of course in this case, if you want it done quickly, you would really want to use SIMD.
It's also worth saying that the L1 cache is typically fairly small, but it's instant. Typically 32 kB of L1 cache. If you do something like 256-bit SIMD you really would want no more than 100 of them in cache at any one time, preferably quite a bit fewer.
I would speculate that a reasonable approach would be to set up a calculation for a block of SIMD and run it for 20-30 sets at a time, then rework them, and during the rework set up the next block, allowing it to draw from memory while it's calculating the old work
@TheMaginor 28 days ago ⁺⁴
You would probably speed things up a lot using SSE on the inner x loop too (can be combined with threading). The compiler may not be able to do that on its own since it can't know if the rows are aligned or have lengths that are multiples of 4. The texture lookups could not be vectorized, but the math could. You could probably even vectorize the RGBA unpacking (it is just bit shifting).
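To make the "it is just bit shifting" remark concrete, here is a scalar sketch of RGBA unpacking and packing. The 0xRRGGBBAA channel order and the `Rgba`, `unpack` and `pack` names are assumptions; the project's real pixel format may differ. Because each pixel is handled independently with shifts and masks, this is the kind of work a SIMD version could apply to several pixels at once.

```cpp
#include <cassert>
#include <cstdint>

struct Rgba {
    std::uint8_t r, g, b, a;
};

// Assumed packing: 0xRRGGBBAA (red in the top byte, alpha in the bottom byte).
inline Rgba unpack(std::uint32_t p) {
    return Rgba{
        static_cast<std::uint8_t>((p >> 24) & 0xFFu),
        static_cast<std::uint8_t>((p >> 16) & 0xFFu),
        static_cast<std::uint8_t>((p >> 8) & 0xFFu),
        static_cast<std::uint8_t>(p & 0xFFu)};
}

// Inverse of unpack: shift each channel back into place and OR them together.
inline std::uint32_t pack(Rgba c) {
    return (static_cast<std::uint32_t>(c.r) << 24) |
           (static_cast<std::uint32_t>(c.g) << 16) |
           (static_cast<std::uint32_t>(c.b) << 8) |
           static_cast<std::uint32_t>(c.a);
}
```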
@someidiot6359 28 days ago ⁺³
Do you think you could make a video explaining the cache and how to optimize for it?
@user-sl6gn1ss8p 28 days ago ⁺⁴
Makes entire screen white
"Well, yes, looks much better"
@mspeir 28 days ago ⁺¹
Heck yeah! I'd love to see that rewritten on a GPU. In fact, I was thinking of how to rewrite it as a shader as you were going through it.
@SuperCamelFunTime 28 days ago ⁺¹
Please do make a video on leveraging the GPU as much as possible. It would be great if you could go into particle emitter calculations on the GPU as well.
@arcoute9108 28 days ago ⁺¹
Mode 7 rendering is cool and was first implemented with hardware in the SNES
@ciCCapROSTi 27 days ago
+1 for the GPU video, especially if you can make it simple in your usual style. I last used a GPU when OpenGL was still properly pipelining, no shaders, so that's like 2 decades out of date knowledge.
@mjthebest7294 28 days ago ⁺¹⁴
Raytracing series comeback when?
@mr.anderson5077 28 days ago ⁺¹
Hey @TheCherno
I request you to please make a tutorial on running the exit screen rendering on GPU.
I would love to see you deploy this workload on an iGPU if you're on Intel or AMD, or use a dGPU like Nvidia. Maybe we can dive into some CUDA programming too if needed in future, or raw C++ is fine for now.
Also include this idea to utilise the same with your ray tracing examples.
Netizens please hit the like button below if you feel the same.
@ZackLivestone 28 days ago
I want to see more like this in the future
@AlexSmolyankin 27 days ago
Really cool video. Now I'm waiting for the video about bringing that code to the GPU.
@xeridea 28 days ago ⁺²
Caching can be good if it is only done in code that is looped a lot, and kept small enough to stay within the CPU cache, unless the operations are really expensive, in which case a larger cache would be fine.
AFAIK, looping backwards may not necessarily be horrible because prefetchers can detect offsets and fetch accordingly, but forward is still likely better.
I would say a big slowdown is calling a function to draw each pixel. You could just save everything to a buffer, then do the one draw.
@lptimey 28 days ago ⁺²
15:50 Given that some of these don't ever change, why even cache them if you could precalculate them at compile time with constexpr, I think
@Kazyek 27 days ago
15:52 The cost of "caching" mainly depends on where you put it and how you retrieve it. A HashMap, for example, while being the most awesome data structure ever, involves quite a bit of math to retrieve a value from a given key. In THIS specific case though, since the variable doesn't depend on anything else at the moment, you'd probably simply keep it in the same struct alongside the fWorldA and fFoVHalf that you're already accessing, so it would be in a very similar place in memory, with no expensive math to retrieve, and the relative cost of a trigonometry function on the sum of two variables in a struct is definitely higher than retrieving a single variable in that same struct.
@anon_y_mousse 20 days ago
One tiny hint: if you have to specially handle an iteration because of an initial zero value, it's better to have that code before the loop and then start the loop at one. It'd be nice if the compiler would always recognize what's happening and do that for you, but it's also significantly clearer if you do it yourself.
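The hint above (often called loop peeling) can be sketched as follows. The example computes differences between neighbouring elements, where the first element has no predecessor; it is an illustration chosen here, not the loop from the video, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Version with the special case tested on every iteration.
inline std::vector<int> deltas_branchy(const std::vector<int>& v) {
    std::vector<int> out(v.size());
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (i == 0)
            out[i] = v[i];            // first element: nothing to subtract
        else
            out[i] = v[i] - v[i - 1];
    }
    return out;
}

// Peeled version: handle index 0 before the loop, then start the loop at 1.
inline std::vector<int> deltas_peeled(const std::vector<int>& v) {
    std::vector<int> out(v.size());
    if (v.empty())
        return out;
    out[0] = v[0];
    for (std::size_t i = 1; i < v.size(); ++i)
        out[i] = v[i] - v[i - 1];
    return out;
}
```

Both versions return the same result; the peeled one states the special case once, which is the readability point the comment makes.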
@christianlett 28 days ago ⁺¹
From what I could see in the video (I've not got the source code so can't be sure), the values that feed the trig functions could all be constexpr. In C++26, the trig functions will also be constexpr. But the first thing I saw was that the sin and cos calculations were each repeated 4 times, so I'd start there. Good observations re memory caching and memory access in the inner loop.
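A sketch of the "start with the repeated sin and cos" idea. Only the names fWorldA and fFoVHalf come from the comments on this page; the `Frustum` struct, `make_frustum` and the other parameters are assumptions for illustration and are not the project's code. An optimizing compiler may already merge identical calls, as argued elsewhere in this thread, so this mainly makes the reuse explicit.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical container for the four corner points of a view frustum.
struct Frustum {
    float farX1, farY1, farX2, farY2;
    float nearX1, nearY1, nearX2, nearY2;
};

inline Frustum make_frustum(float fWorldX, float fWorldY, float fWorldA,
                            float fFoVHalf, float fNear, float fFar) {
    // Each trig value is computed once and reused for the near and far points.
    const float cosL = std::cos(fWorldA - fFoVHalf);
    const float sinL = std::sin(fWorldA - fFoVHalf);
    const float cosR = std::cos(fWorldA + fFoVHalf);
    const float sinR = std::sin(fWorldA + fFoVHalf);

    return Frustum{
        fWorldX + cosL * fFar,  fWorldY + sinL * fFar,
        fWorldX + cosR * fFar,  fWorldY + sinR * fFar,
        fWorldX + cosL * fNear, fWorldY + sinL * fNear,
        fWorldX + cosR * fNear, fWorldY + sinR * fNear};
}
```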
@SueDoeNym-b4d 19 days ago
19:39 The CPU fetches memory as fixed lines. It basically divides the whole address range into fixed lines of (usually) 64 bytes. When a particular address is accessed, its whole line will be fetched, some of which could be behind it.
Suddenly looping backwards may result in some waste, as a line may have been loaded going forward that doesn't get fully utilised, but the difference would be imperceptible.
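The line arithmetic described above can be sketched like this, assuming 64-byte lines (typical but not guaranteed), 4-byte elements, and a buffer that starts on a line boundary. The helper names are made up. It counts how many distinct lines a traversal touches, which comes out the same forwards and backwards.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

constexpr std::size_t kLineBytes = 64; // assumed cache line size

// Which cache line a byte offset (from a line-aligned base) falls into.
inline std::size_t line_of(std::size_t byteOffset) {
    return byteOffset / kLineBytes;
}

// Count the distinct lines touched when visiting `count` 4-byte elements,
// either front to back or back to front.
inline std::size_t lines_touched(std::size_t count, bool forward) {
    std::set<std::size_t> lines;
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t idx = forward ? i : count - 1 - i;
        lines.insert(line_of(idx * 4));
    }
    return lines.size();
}
```

This only models which lines are needed, not how well a hardware prefetcher predicts them, so it says nothing about actual timings.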
@Xoduz85 27 days ago
Yes yes yes, please make a video on how to take this code and transform it into a GPU version :D
Keep on making these awesome videos, they're great!!
@empireempire3545 28 days ago ⁺¹
9:40 DO IT, there is always a need for GPU coding tutorials!
@MWPSBCID 28 days ago ⁺³
I would like to see you do this on the GPU
@bronkolie 24 days ago
creating this exact look on the GPU would be really interesting
@zeusdeux 28 days ago ⁺³
Just here upvoting all “let’s get this on the GPU” comments
@anonanon6596 28 days ago ⁺⁴
I would actually love to see you review code by javidx9 himself.
Like his pixel game engine or any other project he has shown in his videos.
@theairaccumulator7144 28 days ago ⁺⁶
when i wrote a raytracer in js caching everything made it like 350x faster LMFAO (don't ask why i was writing a raytracer in js)
@m.raflyyanuar9886 27 days ago
Why were you writing a raytracer in js
@hassaniq0777 26 days ago
why
@shadow_blader192 14 hours ago
Why did I write a raytracer in Python?
@SoederHouse 27 days ago
Now we just need Cherno and olc to collaborate on a lightweight engine and the world would be a little more perfect.
@an1n-dya 28 days ago ⁺²
Could we please have the return of the ray tracing series? 🙏🙏
@frankreeser4400 28 days ago ⁺¹
Good video! Thanks. Please make the GPU (imported original code) video! :)
@FarisEdits 28 days ago
I LOVE THIS KIND OF VIDEO
@LoopSkaify 27 days ago
YES I WANT THAT!
@Lunastela64 23 days ago
I'd love to see a video where you make something like this run on the GPU please :)
@MatrixHound_Dungeon 28 days ago ⁺¹
Yes, make a video on how to run this on a GPU. Thanks!
@Rob_III 27 days ago
Although you split ground and sky rendering into two loops, you didn't change the sky accessing memory "backwards"; I think that would've made a big(ger) change in performance than just splitting the two into separate loops.
@debsarkar4893 28 days ago ⁺²
9:45 I would absolutely love to see how u take code like this and run it on a GPU
@awesomeguy11000 28 days ago ⁺¹
I'm surprised you didn't make an rgba32 backbuffer and format the loaded textures as rgba32; then you could avoid all function calling overhead and copy pixels directly from the source to the destination buffer. SDL uses the GPU behind the scenes, so setting up render state and issuing a draw call for each pixel has a much higher overhead than performing all of the work on the CPU and flushing the buffer at the end.
@TheJonatanMr 28 days ago ⁺³
Petition to continue the ray tracing
@linavabai8470 28 days ago ⁺⁴
Yes, GPU it, please
@denravonska 28 days ago
It would be interesting to see if making all those constants const or locking them to an anonymous namespace will make a difference.
@furyzenblade3558 28 days ago ⁺¹
9:37 Yess
@ralph_d_youtuber8298 28 days ago
I feel like if u still wanted the cosf to be there for easy reading, then put it in a const scope, so it can just be calculated at compile time. Why cache when the compiler can inline the results for u 😊.
@BadMemoryAccess 28 days ago
lmao, last week I wrote a Cacher class which held cached values and the relevant recomputing functions ... granted, my cached computations were actually costly, not just arithmetic operations and trig (there was noticeable lag without caching)
@whoshotdk 25 days ago
I’m more interested in how you’d multi-thread the rendering of this than in seeing it run on a GPU, which I think would just be a ton of boilerplate code and a fragment shader that looks very similar to the existing code (I could be wrong!). I guess multi-threading would introduce concepts like synchronicity? I.e., how do you avoid tearing effects if different cores are spitting out pixels at different rates? Mostly guessing here, I’m new to C++ myself and am strictly a single-thread guy right now.
@gsestream 28 days ago
there is the maf, and there is the code. math is interesting, code is boring maintenance. or just do post-processing filtering like FXAA, or TAA, or MSAA, or just supersampling anti-aliasing from a higher resolution down.
@Redeam. 24 days ago
When new "The Cherno's Adventures in Minecraft"?
@Doogle41 23 days ago
These magic scribbles are well and good but where do I get a cool hoodie like Cherno's?
@ringo2715 28 days ago
Just for clarity, javidx9 has migrated development from the console game engine to the pixel game engine, which does use the GPU.
@brendandower9021 27 days ago
Yes please.
@adamagrest8215 28 days ago ⁺¹
Baited my comment :) Please show us running this on GPU
@ThunderSphun 28 days ago ⁺¹
please convert this code to a shader, i have been waiting for something like this since i finished the raytrace series
@GEKKOGAMES_RETRO 28 days ago
yes me too 🤟🏻
@saurabhmehta7681 28 days ago ⁺¹
It's called Skyward Scammer because you scam Gonzo, the guy running after you, and fly into the sky after taking his money. You're welcome
@shawnbucholtz8082 28 days ago ⁺¹
13:55 ........ me too.
@Diamonddrake 28 days ago
Does this mean looping through an array backwards is a cache miss party?
@Jurasebastian 27 days ago
Is the following idea good? I have a 1 MB block of memory containing many fragments that I want to access many times. My idea is to copy all of those fragments into one smaller buffer that will fit into the CPU cache and then do the calculations on that memory. It will probably be faster even if I add an extra allocation at the beginning and end, assuming I access the memory something like a million times?
@mr.anderson5077 28 days ago ⁺¹
Anybody know what tool Cherno uses to draw stuff on the screen? Kindly drop a comment
@cacticrown 22 days ago
do you only review C++ code?
@luizgarciaaa 28 days ago
Pleeeeeaze do it!!!
@bobjones304 28 days ago
If it is the same calculation, why not just inline it?
@axjb2428 27 days ago
I would love to see you port that code to the GPU =)
@ralphyreece4687 28 days ago
My code is a mess of AI for stuff like SDL and structures, a ton of copy and pasting, and hundreds of if statements.
@jouniosmala9921 28 days ago ⁺²
My memories of pseudo 3D are mostly of a system with an 8 MHz CPU, plus some memory of pseudo 3D on a 1 MHz CPU. The performance of this thing is horrendous when you consider that.
@R_eal-G_rude 28 days ago
Cool
@mirabilis 27 days ago
Super Mario Kart!
@adrien8768 24 days ago
Let's go GPU video
@timmygilbert4102 28 days ago
gpu gpu GPU GPU ❤
@Markus-fw4px 28 days ago
speed to 0.75, then it's watchable 😅
@ericisconfused 28 days ago
GPU! GPU!
@casdf7 27 days ago
I want this game running at 1000 fps on a gpu
@codinghuman9954 27 days ago ⁺¹
MOAR GPU VIDS PLS!!!!!!!!!!!!!!!
@NintendoJimmy 28 days ago
GPU!
@ShivamKumarPal-nc3nx 24 days ago
cpu to gpu code
@dj10schannel 28 days ago
🧐
@Tyler-z8r 28 days ago
Absolutely port this code to run on discrete graphics hardware! I have no idea how to do that!
@andrewdunbar828 28 days ago
This is a pseudo comment
@defini7 28 days ago
That code was originally designed for rendering in the Windows Command Prompt, so it was not expected to be run on a GPU
@pfqniet 28 days ago
13:30 I feel incredibly called out LMAO - I was exactly the same way 10 years ago and now I am "Future Me" and have to deal with "Past Me" being all clever and stuff. Help...
@kingofbattleonline 27 days ago
Really liked it, more videos about optimization with this engine, please.
@KeithKazamaFlick 28 days ago
javidx9 & ChiliTomatoNoodle