Open Pre-Trained Transformer Language Models (OPT): What does it take to train GPT-3?
Andrew Yates (Assistant Professor at the University of Amsterdam) and Sergi Castella i Sapé discuss the recent paper "Open Pre-trained Transformer (OPT) Language Models" from Meta AI (formerly Facebook). In this replication work, Meta developed and trained a 175-billion-parameter Transformer very similar to OpenAI's GPT-3, documenting the process in detail to share their findings with the community. The code, pretrained weights, and logbook are available in their GitHub repository (links below).
Links
❓Feedback Form: https://scastella.typeform.com/to/rg7a5GfJ
📄 OPT paper: https://arxiv.org/abs/2205.01068
👾 Code: https://github.com/facebookresearch/metaseq
📒 Logbook: https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf
✍️ OPT Official Blog Post: https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/
OpenAI Embeddings API: https://openai.com/blog/introducing-text-and-code-embeddings/
Nils Reimers' critique of OpenAI Embeddings API: https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9
Timestamps:
00:00 Introduction and housekeeping: new feedback form, ACL conference highlights
02:42 The convergence between NLP and Neural IR techniques
06:43 Open Pre-trained Transformer: motivation and scope, reproducing GPT-3 and open-sourcing
08:16 Basics of OPT: architecture, pre-training objective, teacher forcing, tokenizer, training data
13:40 Preliminary experiments findings: hyperparameters, training stability, spikiness
20:08 Problems that appear at scale when training with 992 GPUs
23:01 Using temperature to check whether GPUs are working
25:00 Training the largest model: what to do when the loss explodes? (which happens quite often)
29:15 When they switched away from AdamW to SGD
32:00 Results: successful but not quite GPT-3 level. Toxicity?
35:45 Replicability of Large Language Models research. Was GPT-3 replicable? What difference does it make?
37:25 What makes a paper replicable?
40:33 Directions in which large Language Models are applied to Information Retrieval
45:15 Final thoughts and takeaways