- 89 videos
- 71,136 views
Performance Summit
United States
Joined Dec 18, 2019
Performance Summit events serve as a place for software performance enthusiasts and practitioners to meet and discuss challenges, research and possible solutions around delivering delightful and efficient software solutions.
You can find additional event resources in github.com/ttsugriy/performance-summit
Panel discussion with Ivica Bogosavljevic, Nadav Rotem, Sergey Slotin and Taras Tsugrii
In this session hosted by Taras Tsugrii, speakers from both days of Performance Summit 2022 come together to explore what's next in performance improvement, and to answer any lingering questions raised by the talks.
211 views
Videos
Faster ETL Pipeline with Bodo by Ahmad Khadem
187 views · 2 years ago
Faster ETL Pipeline with Bodo: Compiler-Based Parallel Computing for Big Data Analytics. The growing scale and complexity of data and ML workloads demand enormous compute power. Bodo is a new compute platform that aims to improve the efficiency and performance of ETL pipelines through automated parallelization of native Python and SQL workloads. Bodo's resource efficiency and extreme performance ...
Running a Datacenter Performance Optimization Campaign by Nadav Rotem
1K views · 2 years ago
In recent years, the number of software services that are running on datacenter and cloud computing platforms has increased rapidly. Datacenters are expensive to operate and represent a significant portion of the world's energy consumption. Efforts to improve the efficiency of software that runs on datacenters can greatly reduce the energy use and costs of operations and improve the responsiven...
The Art of SIMD Programming by Sergey Slotin
12K views · 2 years ago
Modern hardware is highly parallel, but not only in terms of multiprocessing. There are many other forms of parallelism that, if used correctly, can greatly boost program efficiency - and without requiring more CPU cores. One such type of parallelism actively adopted by CPUs is "Single Instruction, Multiple Data" (SIMD): a class of instructions that can perform the same operation on a block of ...
Bytecode Rewrite Optimizations by Shiqi Cao
610 views · 2 years ago
ByteDance in-house Redex passes: A handful of Redex passes have been developed at ByteDance in addition to the existing Redex passes. We'd like to share some of them: RenameClass, StringBuilderOutliner, KtDataClass. These three passes saved about ~1M, ??%. *Exploring ART Optimization Complement* ART optimization is triggered during both jit and dex2aot; however, both run with restricted computa...
Performance Laws by Taras Tsugrii
800 views · 2 years ago
Most modern performance work is centered around technologies and tools. Following Jeff Bezos' advice, this talk, instead of trying to predict the future, covers some of the oldest mathematical principles and ways they've shaped modern software. Parallel and distributed computing, caching and proxy computations are just a few examples of leveraging associativity, idempotence, Kleisli category an...
Where Have All the Cycles Gone? by Sean Parent
2.7K views · 2 years ago
Personal computers and devices are unbelievably fast but often struggle to keep up with mundane tasks. Why hasn't software managed to scale in performance with hardware? This talk looks at some of the reasons, the characteristics of fast systems, the limits of human perception, and the need to rethink how software is authored to build efficient systems. skillsmatter.com/skillscasts/17836-where-...
Why do Programs Get Slower with Time? by Ivica Bogosavljevic
539 views · 2 years ago
When it comes to software performance, ideally we want two things: 1. The performance of our program doesn’t change as the features are added. 2. The runtime of the program grows linearly to the data set size. But this often fails to happen. As new features are added or the dataset grows, the software starts running much slower than one would expect. In this talk we will explore what are the fa...
Performance Summit - September 2021 - Day 2 Panel discussion
152 views · 3 years ago
Join Luca Chiabrera, Paul McKenney, Daniele Salvatore Albano, Moritz Beller, and our host Ionut Zolti for a thought-provoking conversation about what's next in the Computing Performance world!
Tailoring Google Dataproc to Reduce Spark Execution Time and Cut the Bill - Luca Chiabrera
178 views · 3 years ago
Dataproc is a fully managed and highly scalable Google service that facilitates quickly deploying clusters and executing Spark applications. Nevertheless, the user remains responsible for sizing the Dataproc infrastructure and defining the Spark application execution parameters. These activities have a relevant impact on both the execution time of the Spark applications and the cost of the Data...
The Price of Fast and Robust Concurrent Software - Paul McKenney
404 views · 3 years ago
Although there is no magic wand that can wish away all bugs in concurrent software, there is a wealth of tools, techniques, and experience that permit moving towards that goal. This talk takes on the viewpoints of semantics, software engineering, installed base, software stack, and finally natural selection in order to name the price of fast and robust concurrent software. Performance Summit Se...
cachegrand - A Super Scalar Caching Platform - Daniele Salvatore Albano
302 views · 3 years ago
cachegrand is a super-scalar caching platform, aiming initially to be a Redis drop-in replacement. To achieve high performance, a lot of components have been written from scratch with performance in mind; among other things, the internal parallel hashtable, built only to store 64-bit numbers, uses a number of different techniques to maximize throughput: data-oriented design to take a...
Learning to Predict Performance Regressions in Production - Moritz Beller
262 views · 3 years ago
Imagine what we could do if we could predict performance regressions at code authorship time, before they actually impact production. In this talk, I will set the stage of the performance prediction landscape, clarify why predicting performance regressions of a piece of code is difficult, go over the many things we learned as part of the SuperPerforator initiative at Facebook, and share ou...
Performance Summit - September 2021 - Day 1 Panel discussion
59 views · 3 years ago
Join Stefano Doni, Ivica Bogosavljevic, Yaniv Sabo, Simone Casagranda, Raman Pandey, Nian Sun and our host Ionut Zolti for a thought-provoking conversation about what's next in the Computing Performance world!
Connected Vehicle Platform - Raman Pandey
295 views · 3 years ago
The connected vehicle platform is an important performance-sensitive platform, where the role of the performance engineer is to assess the IoT devices that are connected to the cloud solution and communicate via telemetry. Performance Summit September 2021
Using ML to Automatically Optimize Kubernetes for Cost Efficiency and Reliability - Stefano Doni
324 views · 3 years ago
Performance Optimization Techniques for Mobile Devices: An Overview - Ivica Bogosavljevic
337 views · 3 years ago
Task scheduling optimizations in Mobile Application - Nian Sun
208 views · 3 years ago
Shipping 2021 user experiences on 2011 phones - Simone Casagranda & Yaniv Sabo
135 views · 3 years ago
London Perf Summit - Panel Discussion
254 views · 3 years ago
We Should Become Good at Optimizing our Code - Denis Bakhvalov
1.2K views · 3 years ago
Understanding and Measuring CPU throttling in containerized environments - Francesco Fabbrizio
3.3K views · 3 years ago
How AI optimization will debunk 4 long-standing Java tuning myths - Stefano Doni
502 views · 3 years ago
Virtual machine warmup blows hot and cold - Laurence Tratt
474 views · 3 years ago
Debugging performance issues with Java Flight Recorder - Alexander Kachur
1K views · 3 years ago
On access control model of Linux native performance monitoring - Alexei Budankov
110 views · 3 years ago
Query-based compiler architectures - Olle Fredriksson
848 views · 3 years ago
nanoBench: A Low-Overhead Tool for Running Microbenchmarks on x86 Systems - Andreas Abel
391 views · 3 years ago
Automatically avoid the top performance patterns in distributed systems - Andreas Grabner
271 views · 3 years ago
Awesome! Watched this right after Dave Chiluk's conference talk; this video perfectly complements it. It really helps paint a clear picture of what CPU throttling is and the key takeaways from it.
Thanks! Glad I could help.
I don't understand why, in the masking intro slide (at 22:42), the author says that the following has no branches: for (int i = 0; i < N; i++) s += (a[i] < 50 ? a[i] : 0); That's a ternary operation, which branches between left and right expressions. What am I missing?
That will execute both options and disregard (mask) incorrect ones. Counterintuitively this yields a massive speedup! No branches.
@@az09letters92 Is that because the compiler can prove that executing both sides has no side-effects? Because if the left or right were expressions that could have side effects, then it would be a short-circuiting branch, correct?
Hard to understand English and unpleasantly small text...
thanks so much for the tutorial
I personally use 9 performance principles - some of which are the same as the 5 outlined in this video. Some of the differences seem to be "CPU to data distance" and "data size" ... I cover the principles I use in my video "My 9 + 1 core performance optimization principles" : ruclips.net/video/ULlFWomaPVw/видео.html
This is cringe
Somebody should have taken a minute to optimize that terrible audio compression.
Adobe should optimise their software to have less source code in general based on the results of this. Write slimmer code and the performance issue goes away.
Awesome 🇷🇺
The "Closure" part of the optimization is basically how Forth works: a code pointer and a piece of data for each "word".
Thanks for the video. Is there any new material on the topic of "closure generation" interpreters? Very hard to find material that is not bytecode-based.
Thanks, much appreciated. Especially the examples in C. Is this directly compatible with Cython?
The intrinsics, I mean.
I don't understand why these architecture-specific instructions are not recognized directly by GCC at -O3.
They are, when you give the -march= argument; otherwise the compiler doesn't know which instruction sets are allowed and will fall back to a default (usually x86-64 without AVX).
Does anybody who really cares about latency want to use RPC and streaming? If I can tolerate 18ms, I don't see why I can't tolerate 100ms.
well, spending $1m vs $10m on compute makes a good case for preferring one over the other ;)
Great video. Thank you very much for your enlightening example and insightful explanation!
Great talk. Thank you Sean.
great talk
This was amazingly useful! Loved it. Thanks for the great work!
Amazing talk, full of knowledge.
50:00 Sean might be referring to “Eytzinger binary search”
21:45 ruclips.net/p/PLGvfHSgImk4Y1thqJLpcSscMTwgN-XM9l
31:43 Great talk. More about optimization remarks and viewing them can be found in Ofek’s talks such as ruclips.net/video/6HbyacS5eZQ/видео.html
thank you for the kind words and sharing this excellent talk!
This guy is seriously a god.
What happened to Concord.io?
Sold it to Akamai.
Audio is too low. The (section) flags are neat. As it's a performance presentation, it would have been neat to include a running tally as you go through the optimizations. You do cover it in the conclusion, of course. Good job!
Thank you Laurence, very interesting. I have a trading app written in Java that takes a good few hours to speed up. This has given me the incentive to find out exactly when JIT compilation is happening.
this was brilliant. thank you very much. next stop ScyllaDB
I recognise Suchakra's slide at 4:00 :)
The link in the desc is broken.
Fixed! Thank you for raising this!
"Developers love Kafka api" - are you serious? After a year working with Kafka, I know some configuration choices and design problems I still shudder at
Oh hell yes. Completely agree.
Lossy compression here. What people want is their existing apps to go faster with no code changes. The partitioning scheme of an unordered collection with totally ordered sub-collections is pretty handy as a model. What you point out is the heavyweight nature of partitions, which is true, but the mental model is helpful.
He is missing baseline JITs: very fast JIT compilers. They are about 10x faster than the fastest interpreter.
As he mentioned in the intro, the whole point of "fast CHEAP interpreters" is that you wouldn't need JIT compilation which is "not cheap" because special domain knowledge on assembly/hardware is necessary for JIT/inline threading.
@@SimGunther I think rurban's point is that there are JITs that are quite simple to make. Look at the Kaleidescope tutorial for LLVM. You don't actually need to know anything about platform specific ASM.
@@hardknockscoc It's one thing to use someone else's JIT library so you don't need "platform specific knowledge" about the assembly you're creating, but it's a different story rolling your own JIT library, which is what I assumed in the original comment. You could also be a LISP-kinda person if you want to write a whole interpreter in SBCL/scheme running your whole program tree with C shared libraries for performance intensive stuff. That is technically a way of going about JIT without "platform specific knowledge" unless performance is the only concern for the interpreter.
@@hardknockscoc Even when using LLVM IR, it's a huge hassle for high-level languages. For instance, implementing a language like Scheme (or any language with call/cc) is straightforward in a virtual machine or interpreter, but it's a major headache on the machine stack. Simply translating it won't yield effective results. Another issue is compile time. Usually, we use scripting languages not for compute-bound program, but to assist the host language in configuring some data. Imagine your build system script just contains some build logic and file paths, but the script's compile time is longer than its execution time. To address this, sufficiently mature JIT compilers are typically multi-tiered, but implementing this is also a significant challenge.
Benchmarking these approaches provides very valuable insight for anyone considering them! Thanks for sharing all these learnings. 😎
Why do I never see a throughput benchmark for this? XD
Excellent sneak peek of query-based compilers.
The discussion in the last 10 minutes is full of insights! Thanks!
Good talk. 👍
Any tutorial on this?
vectorized.io has extensive documentation for Redpanda - vectorized.io/docs.
Here is the link to the GitHub project page: github.com/Genivia/ugrep
He stated that intra-datacenter network latencies between machines are nowadays on par with inter-NUMA latencies. That's just categorically untrue: 100s of nanoseconds vs 100s of microseconds, i.e. 3 orders of magnitude.
I mentioned it for a 'contended' resource. I've measured it in the milliseconds.
But to add: you *can* in fact do a network call in low double-digit microseconds. It's easy to measure; just hook up 2 servers back to back with an SFP link or something and DPDK and you'll see. All of this is quite easily verifiable, really.
Fantastic talk! Thanks for the wisdom!
Great job guys, and thanks Nitin for sharing this discussion and your vision. I really liked the concept and topic, which fits an app such as LinkedIn. Nevertheless, there were some missing points, like:
- Evaluation of the prognostic models used (accuracy of prediction concepts expressed by error analysis)
- Comparison of ML (XGBoost) vs DL (DNN) or offline NQS
- Type & quality of the NN architectures used for the predictive models (info about the best model and its hyperparameters)
- A feature importance mechanism was used for feature engineering only for XGBoost, since in deep learning it's done by the NN itself.
- Is it possible to access the dataset you used, like open-source?
Hey Mehryar. These are some good questions. Thanks for posting them. We are working on a detailed LinkedIn engineering blog focused on the machine learning aspects of this solution. We haven't open-sourced this data yet, but we have it on our radar.