are stack based vms really slower?

  • Published: 12 Sep 2024

Comments • 65

  • @joseph199627
    @joseph199627 1 year ago +67

    Very interesting, I imagine it took a lot of head scratching to really understand what was going on with those experiments. Great video

    • @leddoo
      @leddoo  1 year ago +15

      thanks!
      yeah, took me about a week to figure out what was going on :D
      initially, i thought it was just about the number of memory accesses, but things were just not adding up.
      the experiments from the video are the ones that convinced me it was about dependencies. but yeah, there were more :D

  • @SaHaRaSquad
    @SaHaRaSquad 1 year ago +13

    I once wrote a small stack-based vm but had two bits left over in the encoded instructions, so I decided to just have two independent stacks; the last two bits determined which stack would be read for arguments and which would receive the result.
    I didn't do much with it, so I'm not sure how this affected performance, but I think it could allow quite a few optimizations.
    Also I saw a talk about a very small chip running Forth code, and it used a stack as well as two registers, so a bit of extra space for temporary values really pays off, I guess.
    Oh, and because I saw your lisp-based syntax: it is surprisingly easy to write a simple Lisp-to-Lua transpiler. Lua lets you use most things as expressions and also supports tail-call optimization, so you can basically map Lisp expressions 1:1 to equivalent Lua code.
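
    A minimal Python sketch of that Lisp-to-Lua mapping, assuming a tiny subset (integers, symbols, arithmetic, if, and calls); `parse` and `emit` are made-up helper names, not a real transpiler:

    def parse(src):
        # split "(+ 1 (* x 2))" into tokens, then build nested lists
        tokens = src.replace("(", " ( ").replace(")", " ) ").split()
        def read(pos):
            if tokens[pos] == "(":
                items, pos = [], pos + 1
                while tokens[pos] != ")":
                    item, pos = read(pos)
                    items.append(item)
                return items, pos + 1
            tok = tokens[pos]
            return (int(tok) if tok.lstrip("-").isdigit() else tok), pos + 1
        return read(0)[0]

    def emit(expr):
        if isinstance(expr, int): return str(expr)
        if isinstance(expr, str): return expr
        head, args = expr[0], expr[1:]
        if head in ("+", "-", "*"):
            return "(" + f" {head} ".join(emit(a) for a in args) + ")"
        if head == "if":
            # Lua has no if-expression; and/or works when the
            # then-branch cannot be false/nil
            c, t, e = args
            return f"({emit(c)} and {emit(t)} or {emit(e)})"
        return f"{emit(head)}({', '.join(emit(a) for a in args)})"  # call

    print(emit(parse("(+ 1 (* x 2))")))  # -> (1 + (x * 2))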

  • @isotoxal
    @isotoxal 1 year ago +19

    You are now in the list of my favourite youtubers

    • @leddoo
      @leddoo  1 year ago +5

      damn, big words!
      thanks man 😎

  • @JoeHinkle11
    @JoeHinkle11 1 year ago +19

    I haven't seen many channels focus on programming language development like yours. Amazing job! Also, I love your animations. Really makes it clear what's happening when you explain concepts

  • @randomstuffgamess
    @randomstuffgamess 1 year ago +16

    Awesome video! I really like the 'modern cpus go brr (if you let them)' section. And the animations really helped with clarity. How did you make them? And your explanations are really on point.

    • @leddoo
      @leddoo  1 year ago +4

      thanks! the stack & cpu were animated using motion canvas (i think i left links to the components in the description, but they're not updated to the latest version of motion canvas)

  • @devtadeo
    @devtadeo 1 year ago +87

    Me wondering why he uses a knife to point around

    • @pp_up
      @pp_up 3 months ago +1

      It's because... it's _pointy_

  • @OMFGxIamxAxNinja
    @OMFGxIamxAxNinja 1 year ago +8

    Wow, this just popped up in my feed, but I really enjoyed this video! You made it really interesting!

  • @JoStro_
    @JoStro_ 1 year ago +4

    fascinating, i've never considered this aspect of performance before.

  • @JH-pe3ro
    @JH-pe3ro 4 months ago

    There's a strong path-dependency to how we've ended up at "registers are faster", I think (and I was aware of that before the video, though it's a great explanation of why); register architectures are intuitive when computers are viewed in terms of arithmetic processing, and they map well to local variables. Therefore we have hardware that engages with that, provides a lot of enumerated registers, and asks languages to utilize all of them, and that has consequences for VM design, since the VM's goal is to remap a program description to efficient hardware utilization: if it looks stack-y, then the JIT rewrites it to register-y code.
    On the other hand, a CPU architecture can avoid this and do something like the Minimal 64x4, a non-pipelined hobby computer using discrete logic: it achieves very good performance at a relatively low clock rate by reducing the register set and optimizing the instruction set around zero page memory (a construct which exists on the 6502 as well, but there it supplements the registers), thus algorithm descriptions are smaller and the total sequential work is smaller than on other 8-bits, despite a lower "MIPS" rating. I believe the distinction between VMs would be relatively less important on the Min 64x4 because the stack VM's instruction set would also benefit from the ability to code random access efficiently.
    I've started to think that local variables are worth questioning, though. They serve a certain "black-boxing" purpose of presenting short routines as a matter of loading up the registers in an initial state, doing the computation, and then writing that back, using the stack as a protocol between that routine and the rest of the program. But the usual alternative to a stack protocol is a buffer, and buffers have certain benefits in terms of coordinating resource use. If the buffer were implemented like a ring buffer and the computer optimized around that, you'd get ideas like the Mill computer's "belt", an interesting concept, albeit one that sadly still hasn't shipped anything.
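
    A toy Python sketch of the belt idea: results land at the front of a fixed-size queue, and operands are named by position instead of register (an illustration only, not the Mill's actual semantics):

    from collections import deque

    belt = deque([0] * 8, maxlen=8)    # new results push the oldest one off

    def drop(value):
        belt.appendleft(value)

    def b(i):                          # read the operand i positions back
        return belt[i]

    drop(3); drop(4)                   # two loads
    drop(b(0) + b(1))                  # add the two most recent values
    print(b(0))                        # -> 7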

  • @j-r-hill
    @j-r-hill 1 year ago +4

    Strongly recommend looking into work done by Anton Ertl if you're interested in stack machine compiler optimizations

  • @thomasmewily4012
    @thomasmewily4012 1 year ago +4

    This is exactly the channel I needed: clean, concise, and a really good comparison and explanation of Stack VM vs Register VM, thank you!
    Just one question: when you have a stack VM, it is like having an infinite number of registers. But when you are using a register VM, you usually get a limited number of registers.
    What happens if you use more variables than the number of available registers in a register VM?
    And how does the VM handle that, or even your computer if you compile it instead of interpreting it?

    • @leddoo
      @leddoo  1 year ago +4

      the register vm in my scripting language is currently limited to 255 registers (one byte per operand).
      so the compiler would raise an error (currently just panics, lol).
      i may add instructions to use more than 255 registers later, not sure.
      python has a special instruction for when there are more than 255 local variables (docs.python.org/3/library/dis.html#opcode-EXTENDED_ARG )
      in general, compilers tend to have all kinds of limits that you never notice during regular use (like max number of nested blocks, etc).
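
      For reference, a quick way to see EXTENDED_ARG in action (CPython-specific; opcode details vary by version):

      import dis

      # 300 locals: slot indices past 255 don't fit in the one-byte operand,
      # so CPython prefixes the affected instructions with EXTENDED_ARG
      src = "def many():\n" + "".join(f"    v{i} = {i}\n" for i in range(300)) + "    return v0\n"
      ns = {}
      exec(src, ns)
      print(any(i.opname == "EXTENDED_ARG" for i in dis.get_instructions(ns["many"])))  # True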

    • @thomasmewily4012
      @thomasmewily4012 1 year ago +1

      @@leddoo Yeah I was thinking about tricky/edge cases
      Thanks for the link btw ^^

    • @oblivion_2852
      @oblivion_2852 1 year ago +2

      Funny enough, I came across the max number of arguments in a Java function call because I was transpiling from a language that had arbitrary-length argument lists. It was a really dumb concat function that had 700 arguments, but that day I learnt that Java only supports 254 arguments (since the class itself is always passed as an argument)

    • @thomasmewily4012
      @thomasmewily4012 1 year ago

      @oblivion8459
      Wow, I wonder in which context you managed to have a concat function with more than 700 arguments? (Maybe the function call was written by another program)
      Looks like you reached some cool hidden technical details too. From memory, in C it is at least 127 arguments.

    • @thomasmewily4012
      @thomasmewily4012 1 year ago

      @oblivion8459 The translated function could also take a single list or something iterable that contains the variadic arguments, but I guess it's better not to handle that case and see where the weird code is

  • @shilangyu
    @shilangyu 1 year ago +5

    Subscribed. High quality content and touching exactly the topics I love.

  • @AK-vx4dy
    @AK-vx4dy 1 year ago +4

    Nicely explained, good work !! Keep explaining :)

  • @luna010
    @luna010 1 year ago +2

    great video. I'd be curious to see how your scripting language compares to VFX Forth (which is the fastest Forth compiler/interpreter afaik).

  • @kevinrobben1999
    @kevinrobben1999 1 year ago +2

    Learned so much from your videos! Thank you, subbed

  • @Diamonddrake
    @Diamonddrake 1 year ago +1

    Very well explained!

  • @Richard_Knusprig
    @Richard_Knusprig 1 year ago +3

    I really like your editing style and how you ... aaahm ..., anyway.

    • @leddoo
      @leddoo  1 year ago +1

      asufutimaehaehfutbscueme, anyway

  • @dimitar.bogdanov
    @dimitar.bogdanov 1 year ago +2

    Really interesting!

  • @flareflo362
    @flareflo362 1 year ago +3

    What are the various profiling tools used here? They look really neat!

    • @leddoo
      @leddoo  1 year ago +2

      intel's vtune (on windows, i'm using the msvc integration)
      and apple's instruments (for the macos stuff)

  • @evanfreethy2574
    @evanfreethy2574 1 year ago

    This is my favorite type of video 🎉

  • @treelibrarian7618
    @treelibrarian7618 1 year ago

    pretty sure the reason the "smart" version is slower is the rotation - since that would have to read and write every entry in the stack memory to shuffle them down by one. better would probably be:
    Load { src: a },
    Load { src: b },
    Add,
    Store { dst: a },
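
    For illustration, a toy list-backed stack in Python showing why the rotation is O(n) while the load/add/store sequence touches a fixed number of slots (instruction names here are made up, not the video's actual VM):

    stack = list(range(8))

    def rotate_down(s):
        s.insert(0, s.pop())               # shifts every entry by one

    def load(s, src):  s.append(s[src])    # copies one slot to the top
    def store(s, dst): s[dst] = s.pop()    # pops the top into one slot
    def add(s):        s.append(s.pop() + s.pop())

    load(stack, 0); load(stack, 1); add(stack); store(stack, 0)  # four O(1) ops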

  • @aredrih6723
    @aredrih6723 1 year ago +2

    I'm more familiar with the JVM, and in these stack machines the locals (which include arguments) and the stack are kept separate.
    The stack starts out empty and you have to load values from the locals.
    A bit as if there was a shelf alongside the stack, with instructions to push the value of a local and instructions to save the current top value to a local.
    Basically, these stack machines _have_ registers, they just have 2 operations allowed on them.
    I'm not sure how python does it differently or if I'm misinterpreting the visualisation.
    So does python really load all the arguments at the bottom of the stack?

    • @leddoo
      @leddoo  1 year ago +1

      yes, python puts the locals at the bottom of the stack frame and loads them to the top for operations. (i can recommend messing around with python's `dis` module to see how python implements language constructs in bytecode!)
      from what i've read on wikipedia, the stack in the JVM is more of an abstraction though. the jit turns it into regular native register code. similar to how WASM is stack based, but wasm runtimes convert the code to native register code before execution.
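
      For example, a quick `dis` session along those lines (output summarized as comments; exact opcode names vary across CPython versions):

      import dis

      def f(a, b):
          c = a + b
          return c

      dis.dis(f)
      # LOAD_FAST a, LOAD_FAST b   -> copy the locals to the top of the stack
      # BINARY_OP (+)              -> called BINARY_ADD on older CPython
      # STORE_FAST c               -> pop the result back into a local slot
      # LOAD_FAST c, RETURN_VALUE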

    • @aredrih6723
      @aredrih6723 1 year ago

      @@leddoo Hm, I did a quick test by making an identity function:
      def ident(x): return x
      It compiled to "load_fast" and "return_value".
      I used the command `python3 -m compileall` to get it compiled, removed the "load_fast" from the compiled function (couldn't find out how to load arbitrary bytecode), and added a nop (bytecode 9; one byte for the op and one for the arg) after the return to keep alignment (no clue how the format works, I just searched for bytes matching the disassembly and replaced them).
      Then I tried to load the module. It worked.
      Then I tried to call the function. It SIGSEGVs.
      I'll need to dig deeper to find out whether my "unconventional" edits are the cause, or whether the function needs the "load_fast" to put a value on the stack.
      Or I might be completely misunderstanding what you mean by "stack frame". Maybe what you call "stack frame" is what Java calls a "slot" and wasm a "local".

    • @leddoo
      @leddoo  1 year ago

      @@aredrih6723 yeah, load_fast loads the value to the top, return_value returns that value.
      maybe it crashes, cause python does not expect the parameters to be popped?
      by stack frame i mean the area of the stack, reserved for the function invocation. (en.wikipedia.org/wiki/Call_stack#Structure )
      you know, actually i'm not sure whether parameters are on the stack in python (that's what i do in my VM). so maybe it crashes, cause return_value tries to pop from an empty stack?

    • @aredrih6723
      @aredrih6723 1 year ago +1

      @@leddoo So, I tried digging a bit and dis.code_info seems to confirm the existence of locals (which would be registers only supporting writes to and reads from the stack).
      I also see python encodes the (I assume maximum) size of the stack and the number of locals.
      (also, "def f(a, b): return 0" is marked as having 2 locals and a stack of 1)
      Now, storing the arguments at the bottom of the stack is a possible implementation, but then you'd either need to copy elements from an absolute index, or every load would depend on the current size of the stack.
      Still, it seems to allow for some neatly optimized code so far, so I'm curious how the vm will mature.
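
      Those counts are also visible directly on the code object (CPython-specific attributes):

      def f(a, b): return 0

      print(f.__code__.co_nlocals)    # 2 -> a and b live in local slots
      print(f.__code__.co_stacksize)  # 1 -> only the constant 0 touches the stack
      print(f.__code__.co_varnames)   # ('a', 'b')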

    • @leddoo
      @leddoo  1 year ago

      @@aredrih6723 in my vm, the call instruction currently takes a variable number of register operands & copies those into the newly created stack frame (so params are the bottom `n` registers upon entry, but the callee is free to modify them).
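
      A hypothetical Python sketch of that calling convention (`Frame`, `num_regs` and `op_call` are made-up names, not the actual VM's API):

      class Frame:
          def __init__(self, num_regs):
              self.regs = [None] * num_regs

      def op_call(vm, callee, arg_regs):
          # copy the caller's argument registers into the bottom of the new
          # frame; the callee may then overwrite them like any other register
          frame = Frame(callee.num_regs)
          for dst, src in enumerate(arg_regs):
              frame.regs[dst] = vm.frame.regs[src]
          vm.frames.append(frame)
          vm.frame = frame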

  • @blinking_dodo
    @blinking_dodo 1 year ago +2

    This is interesting...

  • @cemgecgel4284
    @cemgecgel4284 1 year ago

    Great video!

  • @otesunki
    @otesunki 1 year ago +1

    this makes me curious:
    is it faster to push a dummy value like 0, then add the top two elements and copy over the dummy value?

    • @leddoo
      @leddoo  1 year ago +2

      hmm, not sure what you mean 🤔
      my takeaway from that video was: if two "instructions" access the same memory location, they can interfere. (a "read after read" is fine, but "read after write" causes delays)
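
      A toy way to see that point: model each instruction as ready once its inputs have finished (unit latency, unlimited issue width); the critical path, not the instruction count, sets the time. `cycles` is a made-up helper:

      def cycles(deps):
          # deps[i] lists the instructions that instruction i reads from
          finish = {}
          for i, ds in enumerate(deps):
              start = max([finish[d] for d in ds], default=0)
              finish[i] = start + 1
          return max(finish.values())

      # ((a+b)+c)+d, a dependent chain   vs   (a+b)+(c+d), a tree
      print(cycles([[], [0], [1]]))     # 3 cycles: each add waits on the last
      print(cycles([[], [], [0, 1]]))   # 2 cycles: first two adds run together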

  • @u9vata
    @u9vata 1 year ago

    I understood this before the vid, being a high-perf coding guy... But I am still not totally sure how well an optimizer can optimize stack-based code to avoid these. I used to like Factor, for example, and although I never used it for high-perf code like I did with C++, they claim to have a good stack-based optimizer...
    I mean, for example, a stack-based language optimizer could spot that your load + dup pattern could be exchanged for two separate loads - I see no reason why the optimizer could not have this rule, and that is just one example.
    Generally informative video though. I did not know about the tool that shows that "green / red" thing for pipelining. What is its name? Ah never mind... it's vtune. I prefer perf and such tools generally.
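
    That load + dup rule is easy to express as a peephole pass; a toy Python version (made-up instruction tuples, not Factor's actual IR):

    def peephole(code):
        out = []
        for op in code:
            if op[0] == "dup" and out and out[-1][0] == "load":
                out.append(out[-1])   # load + dup -> two identical loads,
            else:                     # which breaks the read-after-write chain
                out.append(op)
        return out

    print(peephole([("load", 0), ("dup",), ("add",)]))
    # [('load', 0), ('load', 0), ('add',)]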

  • @OhadLutzky
    @OhadLutzky 1 year ago +1

    That is super cool, and seems highly relevant to Java as well, which people like to at least pretend can be performant. (JVM is stack-based, Dalvik is register-based)

    • @leddoo
      @leddoo  1 year ago +2

      well kinda :D
      wasm is also stack based, but close to native in perf.
      the question is whether the *interpreter* is stack based.
      java is typically jitted, so it runs natively using the physical registers.

  • @bernatrosello4375
    @bernatrosello4375 1 year ago

    I don't think the "parallel" vs sequential argument is necessarily right. Even if you're running both versions with more/fewer variables, you have register usage hazards, which are going to slow you down significantly on a CPU with a segmented (pipelined) data path.

    • @leddoo
      @leddoo  1 year ago +1

      hmm, which part of the video are you referring to? and which registers - cpu regs or vm regs?

  • @nahblue
    @nahblue 1 year ago

    Nice! Does anyone have a link for the pipeline diagram tool?

    • @leddoo
      @leddoo  1 year ago

      link in the description :P
      framework is motion canvas.
      script is not ported to latest version (of mc) however.

  • @hfspace
    @hfspace 1 year ago

    on the other hand, these smart instructions take up less space in the cpu even if they are dependent, since they only occupy one "execution" slot. So in principle you have more capacity to execute other, completely independent instructions. This seems to be a trade-off here, or am i missing something?

    • @hfspace
      @hfspace 1 year ago

      i guess you need the product of occupied execution slots and the time needed to execute in order to compare different instructions in absolute terms.

  • @elarajade7893
    @elarajade7893 1 year ago +1

    First ✌

  • @nexovec
    @nexovec 1 year ago

    I'm pretty sure not padding every stack frame with 200 bytes helped in making your interpreter faster than python 😂

  • @themcchuck8400
    @themcchuck8400 1 year ago

    You really need to work on your audio levels. They're all over the place. Keep the microphone a steady distance from your mouth for the entire session, and keep your mouth pointed at the microphone.

  • @HXMCPP
    @HXMCPP 1 year ago

    what do you use to benchmark it? 6:44

    • @leddoo
      @leddoo  10 months ago

      i use vtune on windows (the msvc extension) and instruments for macos.

    • @HXMCPP
      @HXMCPP 10 months ago

      does it work with amd? @@leddoo

    • @leddoo
      @leddoo  10 months ago

      @@HXMCPP i don't think so. amd has "amd uprof", but i'm not familiar with it.

  • @kfsone
    @kfsone 1 year ago

    For some reason people look at dis output and always miss the elephant in the room.
    2 0 LOAD_FAST 0 (a)
    2 LOAD_FAST 1 (b)
    4 BINARY_ADD
    6 LOAD_FAST 2 (c)
    8 BINARY_ADD
    10 LOAD_FAST 3 (d)
    12 LOAD_FAST 4 (e)
    14 BINARY_ADD
    16 BINARY_ADD
    18 STORE_FAST 5 (r)
    You're looking at the center column and not the far right.
    2 0 LOAD_FAST 0 (this_couldnt_possibly_be_slower)
    2 LOAD_FAST 1 (could_it)
    4 BINARY_ADD
    6 LOAD_FAST 2 (just_changing_the_variable_names)
    8 BINARY_ADD
    10 LOAD_FAST 3 (couldnt_make_it_slower)
    12 LOAD_FAST 4 (right)
    14 BINARY_ADD
    16 BINARY_ADD
    18 STORE_FAST 5 (i_mean_theyre_just_hash_lookups_arent_they)
    def f(a, b, c, d, e): r = (a + b) + c + (d + e)
    %timeit f(7, 14, 21, 27, 351)
    from os import *
    from sys import *
    from collections import *
    from itertools import *
    from functools import *
    def well_what_about_this(this_couldnt_possibly_be_slower, could_it, just_changing_the_variable_names, couldnt_make_it_slower, right, i_mean_theyre_just_hash_lookups_arent_they):
        i_mean_theyre_just_hash_lookups_arent_they = (this_couldnt_possibly_be_slower + could_it) + just_changing_the_variable_names + (couldnt_make_it_slower + right)
    %timeit well_what_about_this(7, 14, 21, 27, 351)
    (bonus points if you figure out why the imports are there)

    • @kfsone
      @kfsone 1 year ago

      In my testing, "42 + 27 + 351" takes 25ns. Calling "adder(42, 27, 351)" takes 75ns. Changing adder to "adder(first, second, third)" and the body to just "result = (first + second) + third" adds 1.1-1.6ns to each call.
      Well, 1.2ns on top of 76? No. 50ns of that is the overhead of PyCall; the actual instructions are only 25 + 1.2ns, so those few extra letters in the names accounted for 5% of the cost of that one line of 2 additions.
      And *that* is just for top-level local parameters.

    • @kfsone
      @kfsone 1 year ago

      ruclips.net/video/vobGqYSvHCU/видео.html

    • @leddoo
      @leddoo  1 year ago +2

      i actually have no idea what you’re talking about 🤔
      the length of the variable names doesn’t matter at runtime. they’re turned into indices. the last column is just for the programmer.
      you seem to make some other point about call overhead, which is something that can matter, but in this video, the function call overhead was insignificant. and there wasn’t any python in this vid.
      i’m just very confused.

    • @0LoneTech
      @0LoneTech 1 year ago

      Those indices are the second-rightmost column (outside the parentheses), and they're the reason the accesses are labeled "fast". The imports certainly increase the odds of hash collisions, but that only affects the function lookup. Long names take (usually insignificantly) more time to parse, hash and intern (all during compile time), but once interned they can be matched by id.
      You might want to rerun those timeit tests a handful of times; in my tests the differences are entirely noise and swing either way.