Stanford CS25: V1 I Audio Research: Transformers for Applications in Audio, Speech, Music

Transformers have touched many fields of research, and audio and music are no different. This talk presents three of my papers as case studies on how we can combine the power of Transformers with representation learning, signal processing, and clustering. In the first part, we discuss how we were able to beat the wildly popular WaveNet architecture, proposed by Google DeepMind, at raw audio synthesis, and how we overcame the quadratic constraint of Transformers by conditioning on the context itself. In the second part, we present a version of Audio Transformers for large-scale audio understanding, inspired by ViT and operating on raw waveforms. It combines powerful ideas from traditional signal processing, namely wavelets applied to intermediate Transformer embeddings, to produce state-of-the-art results. Investigating the front end to see why these models do so well, we show that they learn an auditory filter bank, adapting the time-frequency representation to the task, which makes machine listening really cool. Finally, the third part discusses the power of operating on latent codes: language modeling on continuous audio signals using discrete tokens. We describe how simple unsupervised tasks can give us results competitive with end-to-end supervision. We also give an overview of recent trends in the field and papers by Google, OpenAI, etc. on the current "fashion". This work was done in collaboration with Prof. Chris Chafe, Prof. Jonathan Berger, and Prof. Julius Smith, all at the Center for Computer Research in Music and Acoustics at Stanford University. Thanks to Stanford's Human-Centered AI Institute for supporting this work through a generous Google Cloud computing grant.
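To make the WaveNet comparison concrete: both models are trained as next-sample classifiers over 256 mu-law states and scored by top-5 accuracy (see the 21:02 chapter below). The following is a minimal sketch under those assumptions, not code from the papers; `model` is a placeholder for any autoregressive predictor.

import numpy as np

# Sketch of the next-sample evaluation setup described above: mu-law
# quantize audio into 256 states, predict the next state, score with
# top-5 accuracy. Not the authors' code; `model` is hypothetical.
def mu_law_encode(x, mu=255):
    # Map waveform samples in [-1, 1] to integer states in [0, mu].
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu).astype(np.int64)

def top5_accuracy(logits, targets):
    # logits: (T, 256) scores per time step; targets: (T,) true states.
    top5 = np.argsort(logits, axis=-1)[:, -5:]  # five best states per step
    return float(np.mean([t in row for t, row in zip(targets, top5)]))

# Usage (hypothetical):
#   states = mu_law_encode(waveform)      # (T+1,) integers in [0, 255]
#   logits = model(states[:-1])           # WaveNet or Transformer
#   print(top5_accuracy(logits, states[1:]))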

Prateek Verma is currently a research assistant working with Prof. Anshul Kundaje in the Departments of Computer Science and Genetics. He works on modeling genomic sequences using machine learning, tackling long sequences, and developing techniques to understand them. He also splits his time working on audio research at Stanford's Center for Computer Research in Music and Acoustics, with Prof. Chris Chafe, Prof. Jonathan Berger, and Prof. Julius Smith. He received his Master's degree from Stanford, and before that, he was at IIT Bombay. He loves biking, hiking, and playing sports.

View the entire CS25 Transformers United playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rNiJRchCzutFw5ItR_Z27CM

0:00 Introduction
0:06 Transformers for Music and Audio: Language Modelling to Understanding to Synthesis
1:35 The Transformer Revolution
5:02 Models getting bigger ...
7:43 What are spectrograms
14:30 Raw Audio Synthesis: Difficulty, Classical FM Synthesis, Karplus-Strong
17:14 Baseline: Classic WaveNet
20:04 Improving the Transformer Baseline: the major bottleneck of Transformers
21:02 Results & Unconditioned Setup: evaluation criterion comparing WaveNet and Transformers on next-sample prediction, with top-5 accuracy out of 256 possible states as the error metric; this setup is application-agnostic and suits the training setup
22:11 A Framework for Generative and Contrastive Learning of Audio Representations
22:38 Acoustic Scene Understanding
24:34 Recipe of doing
26:00 Turbocharging the best of two worlds: vector quantization, a powerful and under-utilized algorithm; combining VQ with auto-encoders and Transformers (sketched after the timestamps below)
33:24 Turbocharging the best of two worlds: learning clusters from vector quantization; using long-term dependency learning with that cluster-based representation under a Markovian assumption; the better we become at prediction, the better the summarization is
37:06 Audio Transformers: Transformer Architectures for Large Scale Audio Understanding - Adieu Convolutions (Stanford University, March 2021)
38:45 Wavelets on Transformer Embeddings
41:20 Methodology + Results
44:04 What does it learn -- the front end
47:18 Final Thoughts
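As a loose sketch of the vector-quantization idea in the 26:00 and 33:24 chapters and the latent-code language modeling from part three: quantize continuous frame embeddings against a codebook, then treat the resulting token ids as a standard next-token prediction problem. This is my illustration under stated assumptions, not the talk's implementation; `encode`, `codebook`, and `transformer_lm` are hypothetical names.

import numpy as np

# Sketch (assumptions, not the talk's code) of language modeling on
# audio via discrete tokens: vector-quantize frame embeddings against
# a codebook, then feed the token ids to any autoregressive Transformer.
def quantize(embeddings, codebook):
    # embeddings: (T, D) continuous frames; codebook: (K, D) learned codes.
    dists = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)  # (T,) discrete token ids in [0, K)

# Hypothetical pipeline: `encode` is any learned frame encoder and
# `codebook` comes from k-means or a VQ-VAE:
#   tokens = quantize(encode(waveform), codebook)   # continuous -> discrete
#   loss = transformer_lm(tokens[:-1], tokens[1:])  # next-token prediction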

#Stanford #StanfordOnline #AudioResearch #Music #AI #ArtificialIntelligence

Stanford Online
