Speech Recognition with Next-Generation Kaldi (K2, Lhotse, Icefall)
Published: 9 Feb 2025
Authors: Sanjeev Khudanpur, Daniel Povey, Piotr Żelasko
Category: Tutorials
Abstract: This tutorial introduces k2, the cutting-edge successor to the Kaldi speech processing toolkit, which consists of several Python-centric modules for building speech recognition systems, along with its enabling counterparts, Lhotse and Icefall. Participants will learn how to perform swift data manipulation with Lhotse; how to build and leverage auto-differentiable weighted finite-state transducers with k2; and how these two can be combined to create PyTorch-based, state-of-the-art hybrid ASR system recipes from Snowfall, the precursor to Icefall.
Dr. Daniel Povey is an expert in ASR, best known as the lead author of the Kaldi toolkit and for popularizing discriminative training (now known as "sequence training") in the form of MMI and MPE. He has held research positions at IBM, Microsoft, and Johns Hopkins University, and is now Chief Speech Scientist at Xiaomi Corporation in Beijing, China.
Dr. Piotr Żelasko is an expert in ASR and spoken language understanding, with extensive experience in developing practical, scalable ASR solutions for industrial use. He has worked with successful speech processing start-ups: Techmo (Poland) and IntelligentWire (USA, acquired by Avaya). At present, he is a research scientist at Johns Hopkins University.
Prof. Sanjeev Khudanpur has 25+ years of experience working on almost all aspects of human language technology, including ASR, machine translation, and information retrieval. He has led a number of research projects funded by NSF, DARPA, IARPA, and industry sponsors, and has published extensively. He has trained more than 40 PhD and Master's students to use Kaldi for their dissertation work.
For more details and a PDF version of the paper, visit:
tutorial03
Why does the HMM model P(A|W) rather than P(W|A), as is done with DNNs in speech recognition, where A is the speech signal and W is a sequence of words or phones?
May I please get a link to the PDF resources?