[Deep Learning Training (Advanced)] Sequence Data Modeling (RNN / LSTM / Transformer), Part 13: "Multimodal Models and Using Pre-trained Language Models"

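[Deep Learning Training (Advanced)] ( https://www.youtube.com/playlist?list=PLbtqZvaoOVPA-keirzqx2wzpujxE-fzyt ) is a series of training videos covering a broad range of advanced topics in deep learning and machine learning. The Neural Network Console channel ( https://www.youtube.com/c/NeuralNetworkConsole/ ) also offers explanatory videos on more foundational deep learning topics, so please take a look there as well.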

This video is Part 13 of "Sequence Data Modeling". It focuses on multimodal models (Vision and Language models), which acquire knowledge that spans both the image and language domains.

[Slide 3, BERTology] A Primer in BERTology: What we know about how BERT works
https://arxiv.org/abs/2002.12327

[Slide 3, Transformer-XL] Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
https://arxiv.org/abs/1901.02860

[Slide 3, XLNet] XLNet: Generalized Autoregressive Pretraining for Language Understanding
https://arxiv.org/abs/1906.08237

[Slide 3, Reformer] Reformer: The Efficient Transformer
https://arxiv.org/abs/2001.04451

[Slide 3, Big Bird] Big Bird: Transformers for Longer Sequences
https://arxiv.org/abs/2007.14062

[Slide 3, KnowBert] Knowledge Enhanced Contextual Word Representations
https://arxiv.org/abs/1909.04164

[Slide 3, LUKE] LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention
https://arxiv.org/abs/2010.01057

[Slide 3, XLM] Cross-lingual Language Model Pretraining
https://arxiv.org/abs/1901.07291

[Slide 3, UNITER] UNITER: UNiversal Image-TExt Representation Learning
https://arxiv.org/abs/1909.11740

[Slide 6] Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
https://arxiv.org/abs/1612.00837

[Slide 7] Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision
https://arxiv.org/abs/2103.04037

[Slide 9] UNITER: UNiversal Image-TExt Representation Learning
https://arxiv.org/abs/1909.11740

[Slide 9, Faster R-CNN] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
https://arxiv.org/abs/1506.01497

[Slide 9, BERT] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
https://arxiv.org/abs/1810.04805

[Slide 9, COCO] Microsoft COCO: Common Objects in Context
https://arxiv.org/abs/1405.0312

[Slide 9, Visual Genome] Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
https://arxiv.org/abs/1602.07332

[Slide 9, Conceptual Captions] Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
https://aclanthology.org/P18-1238/

[Slide 9, SBU Captions] Im2Text: Describing Images Using 1 Million Captioned Photographs
https://papers.nips.cc/paper/2011/hash/5dd9db5e033da9c6fb5ba83c7a7ebea9-Abstract.html

[Slide 11] DALL·E: Creating Images from Text
https://openai.com/blog/dall-e/

[Slide 11] Zero-Shot Text-to-Image Generation
https://arxiv.org/abs/2102.12092

[Slide 12] Learning Transferable Visual Models From Natural Language Supervision
https://arxiv.org/abs/2103.00020

[Slide 15, Q8BERT] Q8BERT: Quantized 8Bit BERT
https://arxiv.org/abs/1910.06188

[Slide 15, ALBERT] ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
https://arxiv.org/abs/1909.11942

[Slide 15, DynaBERT] DynaBERT: Dynamic BERT with Adaptive Width and Depth
https://arxiv.org/abs/2004.04037

[Slide 17] BERT
https://github.com/google-research/bert

[Slide 17] GPT-2
https://github.com/openai/gpt-2

[Slide 17] T5: Text-To-Text Transfer Transformer
https://github.com/google-research/text-to-text-transfer-transformer

[Slide 17] Transformers
https://github.com/huggingface/transformers

[Reference] Trends in Pre-trained Language Models (slides in Japanese)
https://speakerdeck.com/kyoun/survey-of-pretrained-language-models-f6319c84-a3bc-42ed-b7b9-05e2588b12c7

[Reference] Trends in Pre-training Language Models (slides in Japanese)
https://speakerdeck.com/kyoun/survey-of-pretrained-language-models

--
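We have launched a video channel ( https://www.youtube.com/c/nnabla ) that shares information about Neural Network Libraries ( https://nnabla.org/ and https://github.com/sony/nnabla/ ), the open-source deep learning framework provided by Sony. In addition to tutorials and tips for Neural Network Libraries, the channel publishes technical content on the latest deep learning developments, including lectures and introductions to recent papers. Please subscribe and support the channel!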

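Neural Network Console ( https://dl.sony.com/ ), an intuitive GUI-based deep learning development environment also provided by Sony, runs a very popular YouTube channel ( https://www.youtube.com/c/NeuralNetworkConsole/ ) with many deep learning technical courses and tool tutorials. Please subscribe to and support that channel as well.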
