Ronald S. Bultje - Low-level wizardry in dav1d

  • Published: Feb 1, 2024
  • In a world of datacenter virtualization and high-level languages, today we take a peek behind the curtain and watch how the Wizard of Oz does his magic: let’s dive into the world of low-level optimizations in dav1d - VideoLAN’s software AV1 decoder.
    First, we will dig into dav1d's AVX512 optimizations for Intel’s most recent CPUs: Ice Lake and Tiger Lake. Historically, video encoders and decoders have had trouble optimizing for Intel's AVX512 instruction set. The wider vectors should in theory improve performance, but prior to Ice Lake, the associated clock frequency penalties and the lack of interesting new instructions meant that most applications saw little gain. Ice Lake brought significant reductions in those frequency penalties, and new video codecs (like AV1) use bigger block sizes: ideal conditions to leverage AVX512’s wider vectors. More importantly, by MacGyvering the new cross-lane shuffle, multiply-accumulate and cryptographic instructions supported in Ice Lake’s AVX512 subset, we've reached speedups of up to 3x in our new AVX512 functions compared to their AVX2 counterparts. Overall, we see a 10% speedup in a fully optimized AVX512 decoder vs. using AVX2 instructions on the same machine.
    Second, we describe our redesigned threading model. Current (tile- or frame-) threading models need many resources (threads, memory) to saturate a limited number of cores, scale poorly across cores of different throughput (like big.LITTLE) and depend on bitstream features that negatively affect compression. Our new threading model scales regardless of the specific features present in the bitstream, requires limited system resources and is ideal for the big.LITTLE core combinations that are popular in today’s latest mobile devices. With this design, dav1d can keep all your cores busier than a barista during a Demuxed break, on ARM as well as x86. 720p real-time AV1 software playback on the large majority of Android devices out in the wild is now a reality.
    This talk was presented at Demuxed '23, a conference for video nerds in San Francisco featuring amazing talks like this one.
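
To illustrate why a fixed split of work scales poorly across cores of different throughput, here is a small scheduling simulation. It is only a sketch under stated assumptions, not dav1d's actual threading model, which the abstract does not describe in detail. The core speeds, the task count and both function names are hypothetical. It compares handing each core an equal share of tasks up front against letting each core pull the next task from a shared pool as soon as it is free.

```python
import heapq


def static_makespan(n_tasks, speeds, cost=1.0):
    """Finish time when tasks are split evenly across cores up front."""
    base, extra = divmod(n_tasks, len(speeds))
    finish_times = []
    for i, speed in enumerate(speeds):
        share = base + (1 if i < extra else 0)  # spread any remainder
        finish_times.append(share * cost / speed)
    # The slowest core determines when the whole job is done.
    return max(finish_times)


def dynamic_makespan(n_tasks, speeds, cost=1.0):
    """Finish time when each free core pulls the next task from a shared pool."""
    # Heap of (time at which the core becomes free, core index).
    heap = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(heap)
    makespan = 0.0
    for _ in range(n_tasks):
        free_at, i = heapq.heappop(heap)   # earliest available core
        done = free_at + cost / speeds[i]  # faster cores finish sooner
        makespan = max(makespan, done)
        heapq.heappush(heap, (done, i))
    return makespan


if __name__ == "__main__":
    # Hypothetical big.LITTLE layout: two fast cores, two slow cores.
    speeds = [2.0, 2.0, 1.0, 1.0]
    print(static_makespan(12, speeds))   # 3.0
    print(dynamic_makespan(12, speeds))  # 2.0
```

In this toy setup the even split waits on the slow cores (3.0 time units), while the shared pool lets the fast cores take on more tasks and finishes in 2.0 units, which matches the ideal of 12 tasks divided by a combined throughput of 6.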

Comments • 2

  • @TheoneandonlyRAH 2 months ago

    congrats!!

  • @BobHannent 4 months ago +5

    Here's a question: Is the dependence on assembly an indication that compilers are insufficient, or that higher-level languages can't describe things in a way that can be compiled efficiently? If it's the compiler, what would it take for a higher-level compiler to approach the efficiency of assembly?