Benchmarks

CoLA (Corpus of Linguistic Acceptability)
  acceptable sentence:
    gj04 1 They drank the pub dry.
  unacceptable sentence:
    gj04 0 * They drank the pub.
  marginally unacceptable sentences (still labeled 0):
    cj99 0 ?* I can well imagine with a hatchet Mary destroying the Jeep.
    cj99 0 ?? I can well imagine the more him eating, the fatter him getting.
    gj04 0 *? Bill floated into the cave for hours.

Stanford Sentiment Treebank
  the original SST scores fall into the bins [0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]
  for very negative, negative, neutral, positive, very positive, respectively.
  The GLUE version (SST-2) binarizes these to 0 (negative) / 1 (positive), which is why the actual data
  only contains the labels 0 and 1 (see the loading sketch at the end of this section).
  examples (phrase, label):
    weak and  0
    skip this dreck ,  0
    generates  1
    , though many of the actors throw off a spark or two when they first appear , they can't generate enough heat in this cold vacuum of a comedy to start a reaction .  0
    in memory  1
    respectable new one  1
    yet this grating showcase  0
    hawaiian shirt  1

Microsoft Research Paraphrase Corpus

Semantic Textual Similarity Benchmark
  similarity score ranging from 0 (no meaning overlap) to 5 (meaning equivalence),
  annotated by human judges without any formal expertise in linguistics.
  examples (sentence pair, score):
    A man is playing a guitar. / A girl is playing a guitar.  2.800 (gender)
    A woman is eating something. / A woman is eating meat.  3.000 (hyponymy)
    The boy fell off his bike. / A boy falls off his bike.  4.800 (tense + definiteness; why does the time of an event make less of a difference?)
    A woman is writing. / A woman is swimming.  0.500
    A man pours oil into a pot. / A man pours wine in a pot.  3.200 (the object changes rather than the action; compare with the previous pair)

Quora Question Pairs
  How do people join ISIS? / Why do people join ISIS?  1 (is_duplicate)

MultiNLI Matched
  https://cims.nyu.edu/~sbowman/multinli/
  textual entailment with a three-way label; "433k sentence pairs"
  premise: Your gift is appreciated by each and every student who will benefit from your generosity.
  hypothesis: Hundreds of students will benefit from your generosity.
  label: neutral (the other labels are contradiction and entailment)

MultiNLI Mismatched

Question NLI

Recognizing Textual Entailment
  premise: A man is due in court later charged with the murder 26 years ago of a teenager whose case was the first to be featured on BBC One's Crimewatch. Colette Aram, 16, was walking to her boyfriend's house in Keyworth, Nottinghamshire, on 30 October 1983 when she disappeared. Her body was later found in a field close to her home. Paul Stewart Hutchinson, 50, has been charged with murder and is due before Nottingham magistrates later.
  hypothesis: Paul Stewart Hutchinson is accused of having stabbed a girl.
  label: not_entailment (the premise never states how the victim was killed)

Winograd NLI
  https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html
  pairs of sentences to disambiguate by changing only one word
  example:
    I. The trophy would not fit in the brown suitcase because it was too big (small). What was too big (small)?
       Answer 0: the trophy   Answer 1: the suitcase
  ~830 lines of data in total
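The GLUE versions of the tasks above can be pulled with the Hugging Face `datasets` library, which also makes the label formats explicit (binary 0/1 for CoLA and SST-2, a real-valued score for STS-B). A minimal sketch, assuming `datasets` is installed and using the standard GLUE config names:

```python
# Minimal sketch: peek at a few GLUE tasks via the Hugging Face `datasets` library.
# Assumes `pip install datasets`; "cola", "sst2", "stsb" are the standard GLUE config names.
from datasets import load_dataset

# CoLA: single sentences with a binary acceptability label (1 = acceptable, 0 = unacceptable).
cola = load_dataset("glue", "cola", split="train")
print(cola[0])

# SST-2: sentences/phrases with a binarized sentiment label (0/1),
# not the five fine-grained SST bins listed above.
sst2 = load_dataset("glue", "sst2", split="train")
print(sst2[0])

# STS-B: sentence pairs with a real-valued similarity score between 0 and 5.
stsb = load_dataset("glue", "stsb", split="train")
print(stsb[0])
```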
-----------------------------------------------------
Popular Language Models

BERT
- versions (e.g. different casing versions, model sizes):
  bert-base-cased, bert-base-uncased, bert-large-cased, bert-large-uncased
- hyperparameters of the different versions (number of layers, heads, dimensions):
  base: 12 hidden layers, hidden size 768, 12 attention heads, 110M parameters
  large: 24 hidden layers, hidden size 1024, 16 attention heads, 340M parameters
  (these figures can be read from each checkpoint's config; see the sketch after this model list)
- pretraining objectives:
  masked language modeling (MLM), next sentence prediction (NSP)
- training data:
  BookCorpus (unpublished books) + English Wikipedia

RoBERTa (https://arxiv.org/abs/1907.11692)
- versions (e.g. different casing versions, model sizes):
  roberta-base, roberta-large
- hyperparameters of the different versions (number of layers, heads, dimensions):
  base: 12 layers, 12 heads, hidden size 768
  large: 24 layers, 16 heads, hidden size 1024
  vocabulary size: 50,000 (byte-level BPE)
  base training: 1024 V100 GPUs for 500K steps with a batch size of 8K and a sequence length of 512;
    Adam with learning rate 6e-4, β1 = 0.9, β2 = 0.98, ε = 1e-6, weight decay 0.01,
    learning rate warmup for 24,000 steps and linear decay of the learning rate afterwards
  large training: 1024 V100 GPUs for 500K steps with a batch size of 8K and a sequence length of 512;
    Adam with learning rate 4e-4, β1 = 0.9, β2 = 0.98, ε = 1e-6, weight decay 0.01,
    learning rate warmup for 30,000 steps and linear decay of the learning rate afterwards
- pretraining objectives:
  masked language modeling only (no NSP), with dynamic masking; byte-pair encoding tokenization
- training data:
  BookCorpus, English Wikipedia, CC-News, OpenWebText, Stories (160 GB of text)

DistilBERT (https://arxiv.org/abs/1910.01108)
- versions (e.g. different casing versions, model sizes):
  distilbert-base-uncased, distilbert-base-multilingual-cased
- hyperparameters of the different versions (number of layers, heads, dimensions):
  distilbert-base-uncased: 6 layers, hidden size 768, 12 heads, 66M parameters
  distilbert-base-multilingual-cased: 6 layers, hidden size 768, 12 heads, 134M parameters
- pretraining objectives:
  distillation loss, masked language modeling (MLM), cosine embedding loss
- training data:
  English Wikipedia, BookCorpus

XLNet
- versions (e.g. different casing versions, model sizes):
  xlnet-base-cased, xlnet-large-cased
- hyperparameters of the different versions (number of layers, heads, dimensions):
  base: 12 hidden layers, hidden size 768, 12 attention heads, 110M parameters
  large: 24 hidden layers, hidden size 1024, 16 attention heads, 340M parameters
- pretraining objective:
  permutation language modeling
  (downstream tasks: question answering, natural language inference, sentiment analysis, document ranking)
- training data:
  BookCorpus (unpublished books) + English Wikipedia

multilingual BERT
- versions (e.g. different casing versions, model sizes):
  bert-base-multilingual-cased (104 languages), bert-base-multilingual-uncased (102 languages),
  distilbert-base-multilingual-cased
- hyperparameters of the different versions (number of layers, heads, dimensions):
  base: 12 layers, dimension 768, 12 heads, 110M/177M parameters
  distil: 6 layers, dimension 768, 12 heads, 134M parameters
- pretraining objectives:
  masked language modeling (MLM), next sentence prediction (NSP)
- training data:
  the entire Wikipedia dump for each language

XLM-RoBERTa
- versions (e.g. different casing versions, model sizes):
  xlm-roberta-base, xlm-roberta-large
- hyperparameters of the different versions (number of layers, heads, dimensions):
  xlm-roberta-base: 12 layers, 12 attention heads, 3072 dimensions in the feed-forward layer, ~270M parameters
  xlm-roberta-large: 24 layers, 16 attention heads, 4096 dimensions in the feed-forward layer, ~550M parameters
  vocabulary size: ~250,000 (SentencePiece vocabulary shared across the 100 languages)
- pretraining objective:
  masked language modeling (MLM);
  intended to be fine-tuned on tasks such as sequence classification, token classification, question answering
- training data:
  2.5TB of filtered CommonCrawl data covering 100 languages
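The layer/head/dimension figures quoted above can be checked directly against each checkpoint's configuration, and the masked-language-modeling objective shared by most of these models is easy to try out with a fill-mask pipeline. A minimal sketch, assuming the Hugging Face `transformers` library is installed and using the checkpoint names from this list:

```python
# Minimal sketch (assumes `pip install transformers`): read the architecture
# hyperparameters from each checkpoint's config, then run a quick fill-mask
# query to see the MLM pretraining objective in action.
from transformers import AutoConfig, pipeline

checkpoints = [
    "bert-base-uncased",
    "roberta-large",
    "distilbert-base-uncased",
    "xlnet-base-cased",
    "xlm-roberta-base",
]

for name in checkpoints:
    cfg = AutoConfig.from_pretrained(name)
    # Some configs use model-specific attribute names (e.g. DistilBERT's n_layers),
    # so fall back defensively instead of assuming a single naming scheme.
    layers = getattr(cfg, "num_hidden_layers", getattr(cfg, "n_layers", None))
    heads = getattr(cfg, "num_attention_heads", getattr(cfg, "n_heads", None))
    hidden = getattr(cfg, "hidden_size", getattr(cfg, "dim", None))
    print(f"{name}: layers={layers}, heads={heads}, hidden={hidden}, vocab={cfg.vocab_size}")

# MLM in action: the model predicts the masked token (RoBERTa-style models use <mask>).
fill_mask = pipeline("fill-mask", model="xlm-roberta-base")
for pred in fill_mask("The capital of France is <mask>.")[:3]:
    print(round(pred["score"], 3), pred["token_str"])
```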
GPT (https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
  a semi-supervised approach for language understanding tasks using a combination of
  unsupervised pre-training and supervised fine-tuning
- versions and hyperparameters (number of layers, heads, dimensions):
  a 12-layer decoder-only Transformer with masked self-attention heads
  (768-dimensional states and 12 attention heads);
  fine-tuning reuses the hyperparameter settings from unsupervised pre-training
- pretraining objective:
  causal (left-to-right) language modeling;
  fine-tuning/evaluation tasks: natural language inference, question answering, sentence similarity, classification
- training data:
  the training procedure consists of two stages, unsupervised pre-training and supervised fine-tuning;
  pre-training used the BooksCorpus dataset, which contains over 7,000 unique unpublished books from a variety of genres

GPT-2 (http://www.persagen.com/files/misc/radford2019language.pdf)
- versions and hyperparameters (number of layers, heads, dimensions):
  gpt2 (12 layers, 768 dimensions, ~124M parameters), gpt2-medium (24 layers, 1024 dimensions, ~355M),
  gpt2-large (36 layers, 1280 dimensions, ~774M), gpt2-xl (48 layers, 1600 dimensions, ~1.5B)
- pretraining objective:
  causal language modeling (text generation); see the generation sketch at the end of this list
- training data:
  WebText (~40GB), webpages linked on Reddit with 3 or more upvotes

GPT-3
  also trained on (some) German data; German-language test of the model:
  https://www.golem.de/news/textgenerator-gpt-3-auf-deutsch-getestet-genau-wahrscheinlich-sie-sind-wie-die-ameisen-2111-160468.html
- versions (model sizes):
  https://www.springboard.com/blog/data-science/machine-learning-gpt-3-open-ai/
  GPT-3 Small: 12 layers
  GPT-3 Medium: 24 layers
  GPT-3 Large: 24 layers
  GPT-3 XL: 24 layers
  GPT-3 2.7B: 32 layers
  GPT-3 6.7B: 32 layers
  GPT-3 13B: 40 layers
  GPT-3 175B: 96 layers
- hyperparameters of the different versions (number of layers, heads, dimensions):
  GPT-3 175B: 175B parameters, 96 attention layers, 3.2M batch size (in tokens)
- pretraining objective:
  causal (autoregressive) language modeling;
  used for question answering, human-like text generation, generation of code, poems, stories, etc.
- training data (per Wikipedia), weighted pre-training mix:
  60% from a filtered version of Common Crawl
  22% from WebText2
  8% from Books1
  8% from Books2
  3% from Wikipedia
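The pretraining objective behind GPT, GPT-2 and GPT-3 is plain left-to-right next-token prediction, so using the model mostly means sampling continuations of a prompt. A minimal sketch with the openly available gpt2 checkpoint (assuming the Hugging Face `transformers` library; GPT-3 itself is only reachable via OpenAI's API, not as downloadable weights):

```python
# Minimal sketch (assumes `pip install transformers`): text generation with GPT-2,
# i.e. sampling continuations from a causally (left-to-right) trained language model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

for prompt in ["The GLUE benchmark is", "Once upon a time"]:
    outputs = generator(prompt, max_new_tokens=20, do_sample=True, num_return_sequences=2)
    for out in outputs:
        print(out["generated_text"])
        print("---")
```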