Benchmarks
 
 
CoLA
acceptable sentence:
gj04        1                They drank the pub dry.
unacceptable sentence:
gj04        0        *        They drank the pub.
slightly unacceptable sentence(?):
cj99        0        ?*        I can well imagine with a hatchet Mary destroying the Jeep.
cj99        0        ??        I can well imagine the more him eating, the fatter him getting.
gj04        0        *?        Bill floated into the cave for hours.
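 
To poke at these examples programmatically, a minimal sketch (assuming the Hugging Face `datasets` library and the GLUE "cola" configuration; column names follow the GLUE release):

# Sketch: load CoLA via the `datasets` library (pip install datasets).
# Label 1 = acceptable, 0 = unacceptable; the * / ?? annotations above come
# from the original sources and are not part of the released columns.
from datasets import load_dataset

cola = load_dataset("glue", "cola")
for example in cola["train"].select(range(3)):
    print(example["label"], example["sentence"])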
 
 
Stanford Sentiment Treebank
[0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]
for very negative, negative, neutral, positive, very positive, respectively.
 
the GLUE release (SST-2) keeps only binary labels, 0 or 1; the continuous scores are binned (see the sketch after the examples)
 
weak and 0
skip this dreck , 0
generates 1
, though many of the actors throw off a spark or two when they first appear , they can't generate enough heat in this cold vacuum of a comedy to start a reaction . 0
in memory 1
respectable new one 1
yet this grating showcase 0
hawaiian shirt 1
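 
The continuous treebank scores and the binary GLUE labels can be reconciled with a small sketch (the bin edges are the ones listed above; dropping neutral items for the binary task is an assumption about how SST-2 is built):

# Sketch: map an original SST score in [0, 1] onto the five bins above,
# and onto a binary SST-2-style label.
def five_class(score: float) -> str:
    bins = [(0.2, "very negative"), (0.4, "negative"), (0.6, "neutral"),
            (0.8, "positive"), (1.0, "very positive")]
    for upper, name in bins:
        if score <= upper:
            return name
    return "very positive"

def binary(score: float) -> int:
    # the GLUE version (SST-2) keeps only 0/1; neutral sentences are dropped
    return 1 if score > 0.5 else 0

print(five_class(0.83), binary(0.83))  # -> very positive 1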
 
Microsoft Research Paraphrase Corpus
 
 
 
 
 
Semantic Textual Similarity Benchmark
similarity score ranging from 0 for no meaning overlap to 5 for meaning equivalence
annotated by: human judges without any formal expertise in linguistics.
 
A man is playing a guitar.        A girl is playing a guitar.        2.800 (gender)
A woman is eating something.        A woman is eating meat.        3.000 (hyponymy)
The boy fell off his bike.        A boy falls off his bike.        4.800 (tense + definiteness)
*why does the time of an event make less difference?
A woman is writing.
A woman is swimming.
.500
 
A man pours oil into a pot.
A man pours wine in a pot.
3.200
 
difference: action vs. object
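 
STS-B is scored with Pearson/Spearman correlation between predicted and gold similarity. A minimal sketch (assuming `datasets` and `scipy`; the word-overlap baseline is only an illustration, not a real model):

# Sketch: crude word-overlap predictions, scored with Pearson correlation.
from datasets import load_dataset
from scipy.stats import pearsonr

stsb = load_dataset("glue", "stsb", split="validation")

def overlap_score(s1: str, s2: str) -> float:
    # Jaccard word overlap mapped onto the 0-5 similarity scale
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return 5.0 * len(a & b) / len(a | b)

preds = [overlap_score(ex["sentence1"], ex["sentence2"]) for ex in stsb]
r, _ = pearsonr(preds, stsb["label"])
print("Pearson r:", r)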
 
Quora Question Pairs
How do people join ISIS?
Why do people join ISIS?
1 (is_duplicate)
 
 
 
 
MultiNLI Matched
semantic entailment with labels; "433k sentence pairs"
premise - Your gift is appreciated by each and every student who will benefit from your generosity.
label - neutral (other labels: contradiction, entailment)
hypothesis - Hundreds of students will benefit from your generosity.
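 
NLI examples are sentence pairs, so models encode premise and hypothesis jointly. A minimal sketch (assuming `datasets` and `transformers`, with bert-base-uncased as an arbitrary tokenizer):

# Sketch: encode one MNLI premise/hypothesis pair as a single input.
from datasets import load_dataset
from transformers import AutoTokenizer

mnli = load_dataset("glue", "mnli", split="validation_matched")
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

ex = mnli[0]
enc = tok(ex["premise"], ex["hypothesis"], truncation=True)
print(mnli.features["label"].names)   # ['entailment', 'neutral', 'contradiction']
print(tok.decode(enc["input_ids"]))   # [CLS] premise [SEP] hypothesis [SEP]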
 
 
MultiNLI Mismatched
 
 
 
Question NLI
 
 
 
Recognizing Textual Entailment
A man is due in court later charged with the murder 26 years ago of a teenager whose case was the first to be featured on BBC One's Crimewatch. Colette Aram, 16, was walking to her boyfriend's house in Keyworth, Nottinghamshire, on 30 October 1983 when she disappeared. Her body was later found in a field close to her home. Paul Stewart Hutchinson, 50, has been charged with murder and is due before Nottingham magistrates later.
 
Paul Stewart Hutchinson is accused of having stabbed a girl.
 
not_entailment
?
 
Winograd NLI
pairs of sentences that differ in only one word; the task is to resolve which noun an ambiguous pronoun refers to
example:
I. The trophy would not fit in the brown suitcase because it was too big (small). What was too big (small)?
Answer 0: the trophy
Answer 1: the suitcase
~830 lines of data in total
 
 
-----------------------------------------------------
Popular Language Models
 
BERT
- versions - e.g. different casing versions, model sizes: 
bert-base-cased
bert-base-uncased
bert-large-cased
bert-large-uncased
    - with hyperparameters of the different versions (number of layers, heads, dimensions)
    base
        hidden_layers 12
        hidden_size 768
        attention_heads 12
        110M parameters
    large
        hidden_layers 24
        attention_heads 16
        hidden_size 1024
        340M parameters
- pretraining objectives MLM, NSP
- training data:
BookCorpus (unpublished books) + English Wikipedia
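 
The layer/head/size numbers above can be read straight from the published configs. A minimal sketch (assuming the `transformers` library and Hub access):

# Sketch: print layers, heads and hidden size for two BERT checkpoints.
from transformers import AutoConfig

for name in ["bert-base-uncased", "bert-large-uncased"]:
    cfg = AutoConfig.from_pretrained(name)
    print(name, cfg.num_hidden_layers, cfg.num_attention_heads, cfg.hidden_size)
# bert-base-uncased 12 12 768
# bert-large-uncased 24 16 1024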
 
RoBERTa
- versions - e.g. different casing versions, model sizes: 
roberta-base, roberta-large
    - with hyperparameters of the different versions (number of layers, heads, dimensions)
Base: layers:12, heads: 12, hidden_size=768
Large: layers: 24 heads: 16 hidden_size=1024
Base training:
Vocabulary size: 50,000
1024 V100 GPUs for 500K steps with a batch size of 8K and a sequence length of 512. The optimizer used is Adam with a learning rate of 6e-4, β1 = 0.9, β2 = 0.98 and ε = 1e-6, a weight decay of 0.01, learning rate warmup for 24,000 steps and linear decay of the learning rate after.
 Large training:
The model was trained on 1024 V100 GPUs for 500K steps with a batch size of 8K and a sequence length of 512. The optimizer used is Adam with a learning rate of 4e-4, β1 = 0.9, β2 = 0.98 and ε = 1e-6, a weight decay of 0.01, learning rate warmup for 30,000 steps and linear decay of the learning rate after.
- pretraining objectives
Masked-language-modeling
Byte-pair Encoding tokenization
Dynamic masking (the MLM mask is re-sampled every time a batch is built; see the sketch below)
- training data:
bookcorpus, English wikipedia, CC-News, OpenWebText, Stories
(160 GB of text)
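 
A minimal sketch of the dynamic-masking behaviour (assuming `transformers`; this uses the generic MLM data collator, not RoBERTa's original training code):

# Sketch: the collator re-samples <mask> positions on every call.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tok = AutoTokenizer.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(tok, mlm=True, mlm_probability=0.15)

enc = tok("RoBERTa re-masks tokens every time a batch is built.", return_tensors="pt")
batch1 = collator([{"input_ids": enc["input_ids"][0]}])
batch2 = collator([{"input_ids": enc["input_ids"][0]}])
print(batch1["input_ids"])  # the masked positions usually differ
print(batch2["input_ids"])  # between the two calls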
 
 
DistilBERT
- versions - e.g. different casing versions, model sizes: 
distilbert-base-uncased
distilbert-base-multilingual-cased
    - with hyperparameters of the different versions (number of layers, heads, dimensions)
    distilbert-base-uncased:
    6-layer, 768-hidden, 12-heads, 66M parameters
    distilbert-base-multilingual-cased:
    6-layer, 768-hidden, 12-heads, 134M parameters
- pretraining objectives
Distillation loss, Masked language modeling (MLM), Cosine embedding loss (see the loss sketch below)
- training data:
English wikipedia, bookcorpus
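 
The three training signals listed above can be sketched in a few lines of PyTorch (shapes, temperature and the equal weighting are illustrative assumptions, not the paper's exact settings):

# Sketch: DistilBERT-style triple loss at one masked position.
import torch
import torch.nn.functional as F

T = 2.0                                   # softmax temperature (assumed value)
student_logits = torch.randn(1, 30522)    # placeholder student vocabulary logits
teacher_logits = torch.randn(1, 30522)    # placeholder teacher vocabulary logits
gold_token = torch.tensor([42])           # true id of the masked token
student_h = torch.randn(1, 768)           # student hidden state
teacher_h = torch.randn(1, 768)           # teacher hidden state

# distillation loss: KL between temperature-softened distributions
loss_distil = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                       F.softmax(teacher_logits / T, dim=-1),
                       reduction="batchmean") * T * T
# ordinary MLM cross-entropy against the gold token
loss_mlm = F.cross_entropy(student_logits, gold_token)
# cosine loss aligning student and teacher hidden states
loss_cos = F.cosine_embedding_loss(student_h, teacher_h, torch.ones(1))

loss = loss_distil + loss_mlm + loss_cos  # equal weights, for illustration
print(loss.item())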
 
XLNet
- versions - e.g. different casing versions, model sizes: 
xlnet-base-cased
xlnet-large-cased
    - with hyperparameters of the different versions (number of layers, heads, dimensions)
    base
        hidden_layers 12
        hidden_size 768
        attention_heads 12
        110M parameters
    large
        hidden_layers 24
        attention_heads 16
        hidden_size 1024
        340M parameters
- pretraining objectives
    permutation language modeling (see the permutation-mask sketch below)
    intended downstream tasks: question answering, natural language inference, sentiment analysis, document ranking
- training data:
BookCorpus (unpublished books) + English Wikipedia; the released models additionally used Giga5, ClueWeb 2012-B and Common Crawl
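 
The permutation objective can be pictured as an attention mask built from a sampled factorization order: each position may only look at positions that come earlier in that order. A conceptual sketch (not XLNet's actual two-stream implementation):

# Sketch: build a permutation-LM attention mask for a 5-token sequence.
import torch

seq_len = 5
perm = torch.randperm(seq_len)               # sampled factorization order
rank = torch.empty(seq_len, dtype=torch.long)
rank[perm] = torch.arange(seq_len)           # position -> its step in the order

# mask[i, j] is True when position i may attend to position j
mask = rank.unsqueeze(1) > rank.unsqueeze(0)
print(perm)
print(mask)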
 
multilingual BERT
- versions - e.g. different casing versions, model sizes: 
bert-base-multilingual-cased, 104 languages
bert-base-multilingual-uncased, 102 languages
distilbert-base-multilingual-cased
    - with hyperparameters of the different versions (number of layers, heads, dimensions)
base:
12 layers
768 dimension
12 heads
~168M (uncased) / ~179M (cased) parameters
distil:
6 layers
768 dimension
12 heads
134M parameters
- pretraining objectives
Masked Language Modeling
Next Sentence Prediction
- training data:
the entire Wikipedia dump for each language
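 
All languages share one WordPiece vocabulary, so a single tokenizer handles every input language. A minimal sketch (assuming `transformers`; the German sentence is just an example):

# Sketch: the same mBERT tokenizer splits English and German text.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tok.tokenize("The trophy would not fit in the suitcase."))
print(tok.tokenize("Der Pokal passte nicht in den Koffer."))
print("vocab size:", tok.vocab_size)   # ~119k shared WordPiece tokens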
 
XLM-RoBERTa
- versions - e.g. different casing versions, model sizes: 
    - with hyperparameters of the different versions (number of layers, heads, dimensions)
xlm-roberta-base
    - 12 layers
    - 12 attention heads
    - 768 hidden size, 3072 dimensions in the ff layer
    - ~270M parameters
xlm-roberta-large
    - 24 layers
    - 16 attention heads
    - 1024 hidden size, 4096 dimensions in the ff layer
    - ~550M parameters
vocabulary size: 250,000 (shared SentencePiece vocabulary)
- pretraining objectives
MLM
supposed to be fine-tuned on tasks such as
    - sequence classification
    - token classification
    - question answering
- training data:
2.5TB of filtered CommonCrawl data containing 100 languages.
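 
A minimal fine-tuning setup for one of the tasks above, sequence classification (assuming `transformers`; the 3-way label count is an assumption for an NLI-style task):

# Sketch: xlm-roberta-base with a freshly initialized classification head.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3)
print(model.config.vocab_size)   # 250002: the large shared SentencePiece vocab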
 
 
 
GPT (GPT-1)
a semi-supervised approach for language understanding tasks using a
combination of unsupervised pre-training and supervised fine-tuning
- versions - e.g. different casing versions, model sizes: 
openai-gpt (a single released size)
    - with hyperparameters of the different versions (number of layers, heads, dimensions)
    a 12-layer decoder-only Transformer
    masked self-attention heads (768-dimensional states and 12 attention heads)
    fine-tuning reuses the hyperparameter settings from unsupervised pre-training
- pretraining objectives
causal (left-to-right) language modeling
Fine-tuning tasks:
Natural language inference
Question answering
Sentence similarity
Classification
- training data:
The training procedure consists of two stages: unsupervised pretraining and supervised fine-tuning 
the BooksCorpus dataset was used for pre-training; it contains over 7,000 unique unpublished books from a variety of genres.
 
GPT-2
- versions - e.g. different casing versions, model sizes: 
gpt2, gpt2-medium, gpt2-large, gpt2-xl
    - with hyperparameters of the different versions (number of layers, heads, dimensions)
    gpt2: 12 layers, 12 heads, 768 hidden, 124M parameters
    gpt2-medium: 24 layers, 16 heads, 1024 hidden, 355M parameters
    gpt2-large: 36 layers, 20 heads, 1280 hidden, 774M parameters
    gpt2-xl: 48 layers, 25 heads, 1600 hidden, 1.5B parameters
- pretraining objectives: causal language modeling (text generation; see the sketch below)
- training data: WebText (~40GB), webpages linked on reddit with 3 or more upvotes
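 
Because the pretraining objective is causal language modeling, the checkpoint generates text out of the box. A minimal sketch (assuming `transformers`; prompt and sampling settings are illustrative):

# Sketch: sample a continuation from GPT-2.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("The GLUE benchmark consists of", max_length=30,
                do_sample=True, num_return_sequences=1)
print(out[0]["generated_text"])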
 
 
GPT-3 
 
also trained on German data
 
- versions - e.g. different casing versions, model sizes: 
 
 
GPT-3 Small 12 layers
GPT-3 Medium 24 layers
GPT-3 Large 24 layers
GPT-3 XL 24 layers
GPT-3 2.7B 32 layers
GPT-3 6.7B 32 layers
GPT-3 13B 40 layers
GPT-3 175B 96 layers
 
    - with hyperparameters of the different versions (number of layers, heads, dimensions)
    GPT-3 175B: 175B parameters, 96 layers, 96 attention heads, 12288 hidden size, 3.2M batch size
- pretraining objectives
causal language modeling (next-token prediction)
Tasks (via few-shot prompting):
Question answering
Human-like text generation
Code, poem, and story generation
etc.
- training data: (Wiki)
60% of the weighted pre-training mix from a filtered version of Common Crawl
22% from WebText2
 8% from Books1
 8% from Books2
 3% from Wikipedia
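 
The percentages above are mixture weights, i.e. how often each source is drawn during training rather than its share of raw tokens. A minimal sketch (source names are just labels; the weights are the rounded values from the list above):

# Sketch: draw training sources according to the GPT-3 mixture weights.
import random

sources = ["Common Crawl (filtered)", "WebText2", "Books1", "Books2", "Wikipedia"]
weights = [0.60, 0.22, 0.08, 0.08, 0.03]   # rounded; random.choices normalizes them
print(random.choices(sources, weights=weights, k=10))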