As a person from the other end (Exchange), I find the differences quite interesting. We target to have an end to end latency of around 30us with a very low amount of jitter since we also care a little about throughput and use a fault tolerable distributed system. We often make performance sacrifices to keep the code base easy to understand for most people.
currently working as an intern at an exchange and I'd agree, the differences are really interesting. Especially when looking at it from the high performance point of view
I wonder if the original presentation is available? it's not as good as having the presenter talk through the "missing" slides however it's better than nothing.
The entire presentation is about why you need to learn assembly, machine language and your hardware inside out when going for the fastest execution possible. Mid and high-level languages don't have any of the tools you need. You need to create them specifically for each platform if you want to exploit every single cycle to your advantage. If you change something / upgrade the hardware, then you're 100% going to need to rewrite part of the code or all of it to make it optimized again. Mid-level languages are still insanely fast though, no question about that. The question is rather whether you can afford to place 2nd when you aimed to be 1st.
thanks for the eye-opening talk. If you turn off all other CPU cores, where does your OS kernel run on? would there be interference from OS if your thread and OS share the single core?
Hey Carl, great talk. During the talk you mention having to delete a whole lot of slides; I wonder whether it is possible for you to share those slides as well? I think folks will be interested in those as well.
Hi Ton, thanks, excellent point. I'm working on another (very similar talk now to be presented at pacificplusplus.com), once those slides are ready I'll post a link here. It's mainly just material on branch prediction, but will still be useful.
How does the flat unordered map not have cache misses? I thought it'd be a cache miss whenever the data has to be dereferenced from a pointer? Is the idea to allocate the data such that the data follows the map in memory so that the l2 and l3 caches hold it?
According to my understanding, it saves cache misses on traversing nodes in case of collisions. Still, it will have cache miss on the accessing `value` after we found correct node. With list-based node store, every jump to the next node is a cache miss. But, for example, if we have 3 items in a the bucket and you look up for the third, with linked list-based approach, you going to have two cache misses (because of traversing nodes in a bucket) and with flat map you'll have just one (when dereferencing data).
Late but speculative hardware pre-fetching. Guess the presumed destination for a function and pre-fetch the area before the instruction is even executed so it arrives early. Also if you cache coherency is good you can keep a very large portion of the map in _at least_ L3 cache.
Fighting the operating system isn't too hard to be honest. It's the hardware that causes me the most problems. IncludeOS is cool, but it has a long way to go before the tooling for it is good enough to run commercial HFT software (think about debuggers, profilers/instrumentation, monitoring support, etc). I do see potential in includeOS, but won't be an early adopter of it!
14:38 "HandleError(errorFlags);" should only tell another thread to handle the error. The current thread should only perform the hot-path and non of the actual error handling. This way, the current thread's cache is hot-path-optimized. So we now have a hot-path-thread. Your thoughts?
22:45 I want to do my calculations for an entry point in separate threads man. I'm coding pattern detection, if I'm monitoring 20 stocks I need my threads. Thoughts?
Indeed there is logging in there. But I will say a couple of things: 1.) We keep logging to an absolute minimum, and 2.) A lot of work goes into keeping the logging code as low-touch as possible, i.e. making sure that the logging is done from a separate thread, not evaluating format strings in the hotpath, and a few other tricks. I actually spoke about logging in this talk a year ago if it helps: ruclips.net/video/ulOLGX3HNCI/видео.htmlm14s
@@carlcook8194 that's seem more like tracing pre-determined things and for it to be fast it better fit in a cache line with some timestamp even at cost of precision or maybe even relative. This store can actually be even marked as non temporal for it not to eat cache (excluding WC). Then once in a while it may be picked by another thread or well memory mapped FIFO. Anyways it's nice hearing you things from you Carl, you do have that ability to go through hard stuff without unnecessary details which if one is interested, will have to go anyways. Good job
Indeed! But often you need to move faster than FPGA development, and/or hit limitations about how many FPGAs you can have per exchange. So far I've been fine not needing real time operating systems, but it's certainly worth thinking about.
Interesting talk. Do you actually measure ( aka. use performance counters ) cache hits / TLB hits when optimizing code, or do you optimize code based on instinct and experience? Modern cpu's often have HW prefetchers for instructions. Strive for long instruction streams (aka no jumping in Program Counter) to fully utilize this feature. There are many other optimizations to be done based on the microarchitecture of the processor. Do you guys have experts from Intel helping out?
Hi, indeed I do take a look at cache and TLB metrics, but really it's all about measuring end-to-end time, first in the lab, and then in production. So if I see that the results in the lab are slow (or even worse, production), one of the first things I'll look at is the output of linux perf. In terms of coding/optimizing based on experience, I have found that for both my code and other people's code, making educated guesses doesn't usually work. It's all about measurement, refinement, and then measuring again. For your next question, I don't talk to Intel engineers directly, although I have done a few training courses which gave me some useful insight (including learning that the micro-opcode inside the Intel chips is propitiatory, not documented, and not guaranteed to stay the same over different releases of CPUs). So it's a bit hard to specialize in a single Intel CPU. What I will say is that for when we really need ultra low latency, then FPGAs and other custom hardware is often most suitable. What helps is that for when you are competing against other software-based trading systems, then knowing to a reasonable level about how to make optimizations is often good enough. You can spend a lot of time with one chipset, bios, CPU, network card, operating system, and get a software setup very, very fast, but that's a lot of effort. Which in some cases that's fine and justifiable, but in other cases isn't worth the effort, i.e. it's sometimes better to be the first to find new market opportunities and get something relatively fast up and running in software, and be the first to have a successful trading strategy from this. This is why I think being good at writing fast C++ is always a useful skill to have.
Hi, I was wondering about the added insights of using vtune instead of perf. Seems there's a lot more pcm to get there, but didn't have time so far. Skylake-SP made such a great arch change, gotta check everything again ! Very nice talk !
Ha ha, my wife is a linguist and she was impressed by your observation! Yeah, I lived in the southern part of NZ for a three years where there is a Scottish population, and then I also lived in The Netherlands which influenced my strange accent.
Hey Carl. Great Talk ! How do you lock L3 cache for one core ( or a set of cores ) without actually turning off the CPUs ? I tried searching but in vain. Can you provide some pointers for that ? Thanks!
Hi, here is a link to a basic paper about cache access technology: www.intel.com/content/www/us/en/communications/cache-allocation-technology-white-paper.html. Hope that helps!
Hey, great talk. I'm curious if there is any viability to using non intel parts for this type of single core workload. In particular I was curious if it is viable to use for instance POWER architecture CPUs here. The very fast 5HGz+ cores which are available there and the very large caches might be interesting. This seems like one of the few still existant markets where these parts might make sense for non legacy work. I originally had a more interesting reply but the RUclips autoplay function deleted it, the bane of good videos one actually wants to watch to the end I guess.
Interesting comment! I've never looked at other architectures/CPUs, but I am sure some will probably give a decent advantage, once you get familiar enough with them. For example, ARM gives less guarantees about order of reads/writes from multiple threads, which probably means it is faster in some multi-threaded cases (as just one example).
At slide 19, how can we call manager->MainLoop()? Does it mean, that we have a "virutal void MainLoop() = 0; " at IOrderManager? Then we still have virtual funcs+templates instead of just virtual funcs :)
Exactly. And there also might be a memory leak there. Factory() function is implicitly casting OrderManager to it's base class. IOrderManager. If there are some member objects of OrderManager which allocated RAM from heap their destructor won't be called. Also I'm not sure that unique_ptr can be assigned a value from unique_ptr with different template.
There isn't a memory leak as long as the base defines a virtual destructor and OrderManager's destructor works to begin with. Also, that assignment of unique_ptr's is a valid conversion: en.cppreference.com/w/cpp/memory/unique_ptr/unique_ptr
"There isn't a memory leak as long as the base defines a virtual destructor". I don't see that anywhere in the example. Also, the derived class' destructor needs to call destructor of base class. Yes, you're right, unique_ptr assignment works, but there's an implicit cast from OrderManager to it's base class. There are too many "hidden" things here. I know that this is just an example but, in my opinion, an example for a conference like this one must be much better. I also don't like seeing some other stuff, like function definitions with empty brackets - "void SendOrder ()". Is it so hard to put "void" in it? A lot of work has been done on C++ standard to increase strict type enforcement. After all, C and C++ are supposed to be strongly typed languages. C programmers and even standard broke this a lot with huge amount of software exchanging "void *" pointers all around the code. And now C++ programmers are breaking this with a lot of implicit conversions and class to base casts. In this example they could have just used a pointer to sender function and switch it to different one when needed.
The source for IOrderManager isn't given in the example, it is simply implied that it has a virtual destructor (if not, you will get a leak like you mention). Also, dtor's of derived classes call the dtor's of their parents implicitly, reversing the order of construction, so this example works as intended assuming he didn't mess up IOrderManager. See this: stackoverflow.com/questions/677620/do-i-need-to-explicitly-call-the-base-virtual-destructor
Indeed. And being able to move very quickly to new markets and strategies because you are running fairly standard hardware and operating systems which are relatively easy to program against and install. However, every trading company will no doubt also be running their ultra low latency systems on custom hardware and no real operating system to speak of.
Sure thing. Check out the open onload API documentation for SolarFlare cards (support.solarflare.com/index.php/component/cognidox/?file=SF-104474-CD-23_Onload_User_Guide.pdf&task=download&format=raw&id=361), in particular this section: 8.15 ONLOAD_MSG_WARM
Been playing with sg14::fixed_point lately and it's very good! But hey, which size of fixed points is usual in fintech? What "distribution of bits"? Is it fixed or varied distributions for different uses? Also, at 13:10, are you initialising the error flag somewhere else? Would it be faster with, say, int64_t errorFlags{NO_ERROR}; ? Thanks again for the talk.
QUIC and gRPC are network protocols used for communication. This means stock exchange must support it and honestly I don't think any exchange would support gRPC since that's not ideal for good latency.
He says intel wouldn't be interested in making CPUs for this market, but this is literally the market that pays for all the international undersea fiber optic cables just to reduce latency slightly. You pay intel enough, they;ll be interested.
people who write "benchmarks" sometimes baffle me it's like they don't really care about measuring anything just to have the cpu pegged at "100%" or something and then spit out numbers with lots of significant digits
Any thoughts on Rust? It seems to me an ideal language for the requirements - predictable, consistent high performance, and high productivity due to modern, high level features that still don't introduce high latency behaviour (by avoiding memory allocations and memory churn). It's like the "low latency techniques" happen to be the "best practices" of Rust development, whereas the "best practices" of C++ is not the same as the low latency techniques of C++ dev.
Yes, this is exactly the point. Rust does look cool, but C++ has so many 1000s of man-hours built into the language, the tools, etc, it's going to take one heck of a language to convince people to give up on C++. The same is true for includeOS. Another great idea, fresh thinking, and should in theory make a lot of the tricks that I need to pull obsolete. But it's still too early to start using this product commercially.
Except that best C++ practices are largely variant across different domains, and usually extreme low latency design is considered "best practices" at places where Cpp is still used heavily today.
As a leftist I don't disagree with you, but he's obviously not the bourgeois capitalist, just a developer, and this talk has some valuable information.
Well that's the idea - but statistically it doesn't work that way: HFT is just too fast. The orders they provide - though they dominate numerically - don't provide better prices for the counterparty. Also they don't trade illiquid contracts. Paid market-makers have to be pulled in for that. Not blaming them BTW: they're just using technology effectively to make money. I can't see it as unethical either. But there should be more regulation - contact your democratic representative.
@@richardhickling1905 Isn't it entire idea that curtain HFT-s can make money by just squeezing in into the inefficient market? I.e. someone wants to sell curtain instrument for $100, but someone else wants to buy that instrument for $90. HFT is going to see this inefficiency and offer to buy for $92 and sell for $98 for example, thus saving $2 for a buyer, earning $2 for the seller and earning $6 itself.
High-frequency trading is in the high end of the spectrum of the soul-sucking experience (Speculation -> Day Trading -> High Frequency Trading). It probably feels like Satan himself is both sucking your soul and your dick at the same time, or, to use a good old trick of obfuscation through abstract parlance and legalese, it probably feels like adding liquidity to the market.
I've learned an immense amount from CppCon's uploads, this video included. Great talk, Mr. Cook.
Says in the description he holds a PhD, so Dr. Cook 😄
Eye-opening talk. That trick with replacing if's with templates was simple and neat.
we have if constexpr now which is basically the same thing
"I did a runthrough this morning, threw away about half the slides."
I felt that one. There's always too much to say and not enough time to say it.
One of those jobs where finding a trick like this makes the company your salary back in earnings
Been looking for a talk like this for 10 years. Thanks!
Glad it was helpful!
Great talk, easy enough to understand. Thank you for taking the time.
This was fascinating, gave me a lot to think about.
As a person from the other end (Exchange), I find the differences quite interesting. We target to have an end to end latency of around 30us with a very low amount of jitter since we also care a little about throughput and use a fault tolerable distributed system. We often make performance sacrifices to keep the code base easy to understand for most people.
Do you use Linux/C++ there, or what else? Also, if you don't mind telling what is the exchange in question I'd appreciate :)
@@LuizCarlosSilveira Yes, that's generally the combination used. It's Millennium Exchange - used by the London Stock Exchange and others.
An Exchange does not need to be too fast. When it slows, it slows equally.
currently working as an intern at an exchange and I'd agree, the differences are really interesting. Especially when looking at it from the high performance point of view
This is great .. love C++
This is a great talk. Very different from the others, but still interesting, easily approachable, valuable info.
Glad you enjoyed it!
I wonder if the original presentation is available? it's not as good as having the presenter talk through the "missing" slides however it's better than nothing.
The entire presentation is about why you need to learn assembly, machine language and your hardware inside out when going for the fastest execution possible. Mid and high-level languages don't have any of the tools you need. You need to create them specifically for each platform if you want to exploit every single cycle to your advantage. If you change something / upgrade the hardware, then you're 100% going to need to rewrite part of the code or all of it to make it optimized again.
Mid-level languages are still insanely fast though, no question about that. The question is rather whether you can afford to place 2nd when you aimed to be 1st.
This is a great talk, good content and well presented
Excellent practical talk
really want to know more about the measurement setup, how do you actually tap the line, how to actually do the configurations?
Very good talk. Thanks for sharing, Carl.
thanks for the eye-opening talk. If you turn off all other CPU cores, where does your OS kernel run on? would there be interference from OS if your thread and OS share the single core?
That's why it's important not to generate kernel calls - in that case your kernel doesn't have to do anything.
Hey Carl, great talk.
During the talk you mention having to delete a whole lot of slides; I wonder whether it is possible for you to share those slides as well? I think folks will be interested in those as well.
Hi Ton, thanks, excellent point. I'm working on another (very similar talk now to be presented at pacificplusplus.com), once those slides are ready I'll post a link here. It's mainly just material on branch prediction, but will still be useful.
@@carlcook8194 link please? It's been 4 years haha
Not sure you're supposed to mention the star gate. Mr SG14. Not SG1, I get it. You just want some recognition for your hard work.
How does the flat unordered map not have cache misses? I thought it'd be a cache miss whenever the data has to be dereferenced from a pointer? Is the idea to allocate the data such that the data follows the map in memory so that the l2 and l3 caches hold it?
According to my understanding, it saves cache misses on traversing nodes in case of collisions. Still, it will have cache miss on the accessing `value` after we found correct node. With list-based node store, every jump to the next node is a cache miss. But, for example, if we have 3 items in a the bucket and you look up for the third, with linked list-based approach, you going to have two cache misses (because of traversing nodes in a bucket) and with flat map you'll have just one (when dereferencing data).
Late but speculative hardware pre-fetching. Guess the presumed destination for a function and pre-fetch the area before the instruction is even executed so it arrives early.
Also if you cache coherency is good you can keep a very large portion of the map in _at least_ L3 cache.
Very cool talk
Very good talk. Thank you.
58:40 I do not think I follow this, what is the condition he's pointing that would block all mutexes?
Awesome talk!
Really great understandable insights!
Why fight with your operating system when your program can be operating system itself thanks to solutions like includeOS?
Fighting the operating system isn't too hard to be honest. It's the hardware that causes me the most problems. IncludeOS is cool, but it has a long way to go before the tooling for it is good enough to run commercial HFT software (think about debuggers, profilers/instrumentation, monitoring support, etc). I do see potential in includeOS, but won't be an early adopter of it!
Isn't includeOS DOS with extra steps?
Could anyone reproduce the problem shown at 44:47 ? I couldn’t with older GCC compilers..
Top tier talk
14:38 "HandleError(errorFlags);" should only tell another thread to handle the error. The current thread should only perform the hot-path and non of the actual error handling. This way, the current thread's cache is hot-path-optimized. So we now have a hot-path-thread. Your thoughts?
great talk
great talk!
good job
22:45 I want to do my calculations for an entry point in separate threads man. I'm coding pattern detection, if I'm monitoring 20 stocks I need my threads. Thoughts?
how do you log the trading systems actions? that should be overhead
Indeed there is logging in there. But I will say a couple of things: 1.) We keep logging to an absolute minimum, and 2.) A lot of work goes into keeping the logging code as low-touch as possible, i.e. making sure that the logging is done from a separate thread, not evaluating format strings in the hotpath, and a few other tricks. I actually spoke about logging in this talk a year ago if it helps: ruclips.net/video/ulOLGX3HNCI/видео.htmlm14s
@@carlcook8194 that's seem more like tracing pre-determined things and for it to be fast it better fit in a cache line with some timestamp even at cost of precision or maybe even relative. This store can actually be even marked as non temporal for it not to eat cache (excluding WC). Then once in a while it may be picked by another thread or well memory mapped FIFO. Anyways it's nice hearing you things from you Carl, you do have that ability to go through hard stuff without unnecessary details which if one is interested, will have to go anyways. Good job
That should be an RT OS or a specialized FPGA sitting right there on that network card!
Indeed! But often you need to move faster than FPGA development, and/or hit limitations about how many FPGAs you can have per exchange. So far I've been fine not needing real time operating systems, but it's certainly worth thinking about.
Does this form of trading generate any value (for the world) or is it just parasitic?
Market makers provide liquidity.
Hi Carl, Was just trying to see how one can measure the number of collisions happening while accessing STL unordered map?
Interesting talk. Do you actually measure ( aka. use performance counters ) cache hits / TLB hits when optimizing code, or do you optimize code based on instinct and experience?
Modern cpu's often have HW prefetchers for instructions. Strive for long instruction streams (aka no jumping in Program Counter) to fully utilize this feature. There are many other optimizations to be done based on the microarchitecture of the processor. Do you guys have experts from Intel helping out?
Hi, indeed I do take a look at cache and TLB metrics, but really it's all about measuring end-to-end time, first in the lab, and then in production. So if I see that the results in the lab are slow (or even worse, production), one of the first things I'll look at is the output of linux perf. In terms of coding/optimizing based on experience, I have found that for both my code and other people's code, making educated guesses doesn't usually work. It's all about measurement, refinement, and then measuring again. For your next question, I don't talk to Intel engineers directly, although I have done a few training courses which gave me some useful insight (including learning that the micro-opcode inside the Intel chips is propitiatory, not documented, and not guaranteed to stay the same over different releases of CPUs). So it's a bit hard to specialize in a single Intel CPU. What I will say is that for when we really need ultra low latency, then FPGAs and other custom hardware is often most suitable. What helps is that for when you are competing against other software-based trading systems, then knowing to a reasonable level about how to make optimizations is often good enough. You can spend a lot of time with one chipset, bios, CPU, network card, operating system, and get a software setup very, very fast, but that's a lot of effort. Which in some cases that's fine and justifiable, but in other cases isn't worth the effort, i.e. it's sometimes better to be the first to find new market opportunities and get something relatively fast up and running in software, and be the first to have a successful trading strategy from this. This is why I think being good at writing fast C++ is always a useful skill to have.
good question; but.. a trading system is not just calc heavy; it's also about decisions, and that means branching.
Hi, I was wondering about the added insights of using vtune instead of perf. Seems there's a lot more pcm to get there, but didn't have time so far. Skylake-SP made such a great arch change, gotta check everything again !
Very nice talk !
Absolutely great talk, but my guy had a 140 puls the whole way through lol
2:40 They are selling std::future and std::option. That's good to know.
He's speaking a rhotic New Zealand accent... Interesting
Ha ha, my wife is a linguist and she was impressed by your observation! Yeah, I lived in the southern part of NZ for a three years where there is a Scottish population, and then I also lived in The Netherlands which influenced my strange accent.
@@carlcook8194 John oliver has done his part quite well to edify the global population about New zealand accent :)
Yes map is slower then unordered_map since map it trying to sort as well. I found this out when dealing with millions of records.
It depends.
How can you create an exception without throwing it?
i don't even do anything in C++ idk how I got here
Hey Carl. Great Talk !
How do you lock L3 cache for one core ( or a set of cores ) without actually turning off the CPUs ? I tried searching but in vain. Can you provide some pointers for that ? Thanks!
Hi, here is a link to a basic paper about cache access technology: www.intel.com/content/www/us/en/communications/cache-allocation-technology-white-paper.html. Hope that helps!
Can be done with CAT, pmqos has a decent CLI to get that going.
great talk thanks
22:10 completely hilariuos slide xDDDDD
Hey, great talk.
I'm curious if there is any viability to using non intel parts for this type of single core workload. In particular I was curious if it is viable to use for instance POWER architecture CPUs here. The very fast 5HGz+ cores which are available there and the very large caches might be interesting.
This seems like one of the few still existant markets where these parts might make sense for non legacy work.
I originally had a more interesting reply but the RUclips autoplay function deleted it, the bane of good videos one actually wants to watch to the end I guess.
Interesting comment! I've never looked at other architectures/CPUs, but I am sure some will probably give a decent advantage, once you get familiar enough with them. For example, ARM gives less guarantees about order of reads/writes from multiple threads, which probably means it is faster in some multi-threaded cases (as just one example).
At slide 19, how can we call manager->MainLoop()? Does it mean, that we have a "virutal void MainLoop() = 0; " at IOrderManager? Then we still have virtual funcs+templates instead of just virtual funcs :)
Exactly.
And there also might be a memory leak there. Factory() function is implicitly casting OrderManager to it's base class. IOrderManager.
If there are some member objects of OrderManager which allocated RAM from heap their destructor won't be called.
Also I'm not sure that unique_ptr can be assigned a value from unique_ptr with different template.
There isn't a memory leak as long as the base defines a virtual destructor and OrderManager's destructor works to begin with. Also, that assignment of unique_ptr's is a valid conversion: en.cppreference.com/w/cpp/memory/unique_ptr/unique_ptr
"There isn't a memory leak as long as the base defines a virtual destructor". I don't see that anywhere in the example. Also, the derived class' destructor needs to call destructor of base class. Yes, you're right, unique_ptr assignment works, but there's an implicit cast from OrderManager to it's base class.
There are too many "hidden" things here. I know that this is just an example but, in my opinion, an example for a conference like this one must be much better.
I also don't like seeing some other stuff, like function definitions with empty brackets - "void SendOrder ()". Is it so hard to put "void" in it?
A lot of work has been done on C++ standard to increase strict type enforcement. After all, C and C++ are supposed to be strongly typed languages. C programmers and even standard broke this a lot with huge amount of software exchanging "void *" pointers all around the code. And now C++ programmers are breaking this with a lot of implicit conversions and class to base casts.
In this example they could have just used a pointer to sender function and switch it to different one when needed.
The source for IOrderManager isn't given in the example, it is simply implied that it has a virtual destructor (if not, you will get a leak like you mention).
Also, dtor's of derived classes call the dtor's of their parents implicitly, reversing the order of construction, so this example works as intended assuming he didn't mess up IOrderManager. See this: stackoverflow.com/questions/677620/do-i-need-to-explicitly-call-the-base-virtual-destructor
Yes, you're right, my mistake.
Subtitle: How to trick the wrong operating system running on the wrong hardware into usually being good enough.
Indeed. And being able to move very quickly to new markets and strategies because you are running fairly standard hardware and operating systems which are relatively easy to program against and install. However, every trading company will no doubt also be running their ultra low latency systems on custom hardware and no real operating system to speak of.
Good job! Could you please provide a link about the warming up of SolarFlare adapters?
Sure thing. Check out the open onload API documentation for SolarFlare cards (support.solarflare.com/index.php/component/cognidox/?file=SF-104474-CD-23_Onload_User_Guide.pdf&task=download&format=raw&id=361), in particular this section: 8.15 ONLOAD_MSG_WARM
Hope the use of float is for brevity :-P
It's funny that no one commented on this at the time. Indeed, float would never be used in practice... most likely fixed point integer.
Hey Carl, I really do enjoy the fact that you're here in the comments. I also really enjoyed the talk! Greetings!
Been playing with sg14::fixed_point lately and it's very good! But hey, which size of fixed points is usual in fintech? What "distribution of bits"? Is it fixed or varied distributions for different uses?
Also, at 13:10, are you initialising the error flag somewhere else? Would it be faster with, say, int64_t errorFlags{NO_ERROR}; ?
Thanks again for the talk.
41:24 Why would you say something so controversial, yet so brave
Dude's ripped.
Given the job he admits to having - sneaking a profit by cutting in line (queue) - the rippitude may come in handy.
might as well use Fortran instead of fighting with the compiler so much
35:08 👍
Here is another version of this talk: ruclips.net/video/BxfT9fiUsZ4/видео.html
Why not just use C? Also, I'm curious to know how QUIC and gRPC would fit into your plans to reduce latency
QUIC and gRPC are network protocols used for communication. This means stock exchange must support it and honestly I don't think any exchange would support gRPC since that's not ideal for good latency.
Why not use C instead of C++ when speed is so important?
Because C is not faster than C++ and lacks a lot of abstractions that C++ can get without costs. Think templates vs Macros.
C need not be faster than C++. And since comp time programming hit Cpp I've seen it beat C quite a no. of times.
Thank you for the talk! 33:28 😂
He says intel wouldn't be interested in making CPUs for this market, but this is literally the market that pays for all the international undersea fiber optic cables just to reduce latency slightly. You pay intel enough, they;ll be interested.
Probably too much for them...
Is there even any infrastructure left to produce single core CPUs?
Zero cost abstractions == perpetual motion machine.
people who write "benchmarks" sometimes baffle me
it's like they don't really care about measuring anything
just to have the cpu pegged at "100%" or something and then spit out numbers with lots of significant digits
Any thoughts on Rust? It seems to me an ideal language for the requirements - predictable, consistent high performance, and high productivity due to modern, high level features that still don't introduce high latency behaviour (by avoiding memory allocations and memory churn). It's like the "low latency techniques" happen to be the "best practices" of Rust development, whereas the "best practices" of C++ is not the same as the low latency techniques of C++ dev.
Yes, this is exactly the point. Rust does look cool, but C++ has so many 1000s of man-hours built into the language, the tools, etc, it's going to take one heck of a language to convince people to give up on C++. The same is true for includeOS. Another great idea, fresh thinking, and should in theory make a lot of the tricks that I need to pull obsolete. But it's still too early to start using this product commercially.
Except that best C++ practices are largely variant across different domains, and usually extreme low latency design is considered "best practices" at places where Cpp is still used heavily today.
Rust looks way worse than cpp when you want to go nano …
Let Carl Cook
Now if this isn’t an expert way of saying “ Functional Programming in C++”
16:39 He still needs one virtual function call to Mainloop.
But that only run once, which is not in hotpath.
4:09
faster by a picosecond ?
ie
faster by 1E-12 second
the granularity of ethernet packets is larger than that
let him cook
ok but which random bot do i play russian roulette with on github and download to use as my first bot?
This one: github.com/askmike/gekko
I guess the trading CPUs are all 7700k? Why bother with Xeon there, which is there to have many cores?
ECC memory access.
ark.intel.com/products/120499/Intel-Xeon-Platinum-8156-Processor-16_5M-Cache-3_60-GHz
Yuge L3 cache
Good talk, but a lttle too brief... leaving me a lot of questions...
32:20 dragons lie here.. lol
tfw store didnt have any white shirts so you had to give the talk wearing a pink one
52:14 tfw the guy who bought the last white shirt you wanted used your profiling server as a build machine
Ha ha yes, I asked them to cut this calendar screen-grab from the video, but oh well...
I would be pissed if some tab did that to me during a presentation. You handled it well. Great talk btw, very informative.
C ++ is only for males-alphas system.
Do this in Java and you will lose millions of dollars.='D
"high frequency trading systems" .. should be illegal.
Total unethical industry he is working in. This should be forbidden, thats even more important then gun control.
As a leftist I don't disagree with you, but he's obviously not the bourgeois capitalist, just a developer, and this talk has some valuable information.
it's the other way around, trying to provide liquidity causes low latency trading.
Well that's the idea - but statistically it doesn't work that way: HFT is just too fast. The orders they provide - though they dominate numerically - don't provide better prices for the counterparty. Also they don't trade illiquid contracts. Paid market-makers have to be pulled in for that.
Not blaming them BTW: they're just using technology effectively to make money. I can't see it as unethical either. But there should be more regulation - contact your democratic representative.
@@richardhickling1905 Isn't it entire idea that curtain HFT-s can make money by just squeezing in into the inefficient market? I.e. someone wants to sell curtain instrument for $100, but someone else wants to buy that instrument for $90. HFT is going to see this inefficiency and offer to buy for $92 and sell for $98 for example, thus saving $2 for a buyer, earning $2 for the seller and earning $6 itself.
High-frequency trading is in the high end of the spectrum of the soul-sucking experience (Speculation -> Day Trading -> High Frequency Trading). It probably feels like Satan himself is both sucking your soul and your dick at the same time, or, to use a good old trick of obfuscation through abstract parlance and legalese, it probably feels like adding liquidity to the market.