[2002.05645] Training Large Neural Networks with Constant Memory using a New Execution Algorithm

[1812.04948] A Style-Based Generator Architecture for Generative Adversarial Networks

[2107.06917] A Field Guide to Federated Optimization

[1511.05641] Net2Net: Accelerating Learning via Knowledge Transfer

[2104.03113] Scaling Scaling Laws with Board Games

[2105.12806] A Universal Law of Robustness via Isoperimetry

[2006.10621] On the Predictability of Pruning Across Scales

[2106.05237] Knowledge distillation: A good teacher is patient and consistent

[2009.06807] The Radicalization Risks of GPT-3 and Advanced Neural Language Models

[1712.02950] CycleGAN, a Master of Steganography

[2010.03660] Fast Stencil-Code Computation on a Wafer-Scale Processor

[2205.05131] UL2: Unifying Language Learning Paradigms

[2203.15556] Training Compute-Optimal Large Language Models

[2203.03466] Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

[2108.07686] Untitled Document

[2110.05457] Legged Robots that Keep on Learning: Fine-Tuning Locomotion Policies in the Real World

[1910.07113] Solving Rubik’s Cube with a Robot Hand

[1901.08652] Learning Agile and Dynamic Motor Skills for Legged Robots

[2202.06009] Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam

[1608.05343] Decoupled Neural Interfaces using Synthetic Gradients

[1809.02942] Cellular automata as convolutional neural networks

[2102.02579] Regenerating Soft Robots through Neural Cellular Automata

[2103.08737] Growing 3D Artefacts and Functional Machines with Neural Cellular Automata

[2105.07299] Texture Generation with Neural Cellular Automata

[2201.12360] Variational Neural Cellular Automata

[2111.13545] μNCA: Texture Generation with Ultra-Compact Neural Cellular Automata

[1904.11455] Ray Interference: a Source of Plateaus in Deep Reinforcement Learning

[1609.09106] HyperNetworks

[1511.09249] On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models

[1911.08265] Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

[2103.10948] The Shape of Learning Curves: a Review

[2106.10207] Distributed Deep Learning In Open Collaborations

[2004.08366] DynamicEmbedding: Extending TensorFlow for Colossal-Scale Applications

[1812.06162] An Empirical Model of Large-Batch Training

[1802.08864] One Big Net For Everything

[2102.01293] Scaling Laws for Transfer

[1906.01820] Risks from Learned Optimization in Advanced Machine Learning Systems

[2205.06175] A Generalist Agent

[2005.14165] Language Models are Few-Shot Learners

[2106.08254] BEiT: BERT Pre-Training of Image Transformers

[2112.09332] WebGPT: Browser-assisted question-answering with human feedback

[2202.08137] A data-driven approach for learning to control computers

[2112.03178] Player of Games

[2107.12808] Open-Ended Learning Leads to Generally Capable Agents

[2105.12196] From Motor Control to Team Play in Simulated Humanoid Football

[2009.01719] Grounded Language Learning Fast and Slow

[2012.05672] Imitating Interactive Intelligence

[2110.15349] Learning to Ground Multi-Agent Communication with Autoencoders

[2110.08176] Collaborating with Humans without Human Data

[2201.01816] Hidden Agenda: a Social Deduction Game with Diverse Learned Equilibria

[2103.04000] Off-Belief Learning

[2104.07219] Multitasking Inhibits Semantic Drift

[2004.02967] Evolving Normalization-Activation Layers

[2007.03898] NVAE: A Deep Hierarchical Variational Autoencoder

[2011.10650] Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images

[1512.03385] Deep Residual Learning for Image Recognition

[2108.05818] PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Dynamic Memory Management

[2203.12533] Pathways: Asynchronous Distributed Dataflow for ML

[2105.04663] GSPMD: General and Scalable Parallelization for ML Computation Graphs

[2107.06925] Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines

[2104.04473] Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

[2102.07988] TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models

[2104.04657] Meta-Learning Bidirectional Update Rules

[2012.14905] Meta Learning Backpropagation And Improving It

[2003.03384] AutoML-Zero: Evolving Machine Learning Algorithms From Scratch

[1706.03762] Attention Is All You Need

[1610.06258] Using Fast Weights to Attend to the Recent Past
