Benchmarks
 
 
CoLA
acceptable sentence:
gj04        1                They drank the pub dry.
unacceptable sentence:
gj04        0        *        They drank the pub.
slightly unacceptable sentence(?):
cj99        0        ?*        I can well imagine with a hatchet Mary destroying the Jeep.
cj99        0        ??        I can well imagine the more him eating, the fatter him getting.
gj04        0        *?        Bill floated into the cave for hours.
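 
To poke at these examples programmatically, a minimal sketch (assuming the Hugging Face `datasets` library and the GLUE "cola" configuration; column names follow the GLUE release):

# Sketch: load CoLA via the `datasets` library (pip install datasets).
# Label 1 = acceptable, 0 = unacceptable; the * / ?? annotations above come
# from the original sources and are not part of the released columns.
from datasets import load_dataset

cola = load_dataset("glue", "cola")
for example in cola["train"].select(range(3)):
    print(example["label"], example["sentence"])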
 
 
Stanford Sentiment Treebank
[0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]
for very negative, negative, neutral, positive, very positive, respectively.
 
the GLUE release (SST-2) keeps only binary labels, 0 or 1; the continuous scores are binned (see the sketch after the examples)
 
weak and 0
skip this dreck , 0
generates 1
, though many of the actors throw off a spark or two when they first appear , they can't generate enough heat in this cold vacuum of a comedy to start a reaction . 0
in memory 1
respectable new one 1
yet this grating showcase 0
hawaiian shirt 1
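 
The continuous treebank scores and the binary GLUE labels can be reconciled with a small sketch (the bin edges are the ones listed above; dropping neutral items for the binary task is an assumption about how SST-2 is built):

# Sketch: map an original SST score in [0, 1] onto the five bins above,
# and onto a binary SST-2-style label.
def five_class(score: float) -> str:
    bins = [(0.2, "very negative"), (0.4, "negative"), (0.6, "neutral"),
            (0.8, "positive"), (1.0, "very positive")]
    for upper, name in bins:
        if score <= upper:
            return name
    return "very positive"

def binary(score: float) -> int:
    # the GLUE version (SST-2) keeps only 0/1; neutral sentences are dropped
    return 1 if score > 0.5 else 0

print(five_class(0.83), binary(0.83))  # -> very positive 1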
 
Microsoft Research Paraphrase Corpus
 
 
 
 
 
Semantic Textual Similarity Benchmark
similarity score ranging from 0 for no meaning overlap to 5 for meaning equivalence
annotated by: human judges without any formal expertise in linguistics.
 
A man is playing a guitar.        A girl is playing a guitar.        2.800 (gender)
A woman is eating something.        A woman is eating meat.        3.000 (hyponymy)
The boy fell off his bike.        A boy falls off his bike.        4.800 (tense + definiteness)
*why does the time of an event make less difference?
A woman is writing.
A woman is swimming.
.500
 
A man pours oil into a pot.
A man pours wine in a pot.
3.200
 
difference: action vs. object
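 
STS-B is scored with Pearson/Spearman correlation between predicted and gold similarity. A minimal sketch (assuming `datasets` and `scipy`; the word-overlap baseline is only an illustration, not a real model):

# Sketch: crude word-overlap predictions, scored with Pearson correlation.
from datasets import load_dataset
from scipy.stats import pearsonr

stsb = load_dataset("glue", "stsb", split="validation")

def overlap_score(s1: str, s2: str) -> float:
    # Jaccard word overlap mapped onto the 0-5 similarity scale
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return 5.0 * len(a & b) / len(a | b)

preds = [overlap_score(ex["sentence1"], ex["sentence2"]) for ex in stsb]
r, _ = pearsonr(preds, stsb["label"])
print("Pearson r:", r)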
 
Quora Question Pairs
How do people join ISIS?
Why do people join ISIS?
1 (is_duplicate)
 
 
 
 
MultiNLI Matched
semantic entailment with labels; "433k sentence pairs"
premise - Your gift is appreciated by each and every student who will benefit from your generosity.
label - neutral (other labels: contradiction, entailment)
hypothesis - Hundreds of students will benefit from your generosity.
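 
NLI examples are sentence pairs, so models encode premise and hypothesis jointly. A minimal sketch (assuming `datasets` and `transformers`, with bert-base-uncased as an arbitrary tokenizer):

# Sketch: encode one MNLI premise/hypothesis pair as a single input.
from datasets import load_dataset
from transformers import AutoTokenizer

mnli = load_dataset("glue", "mnli", split="validation_matched")
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

ex = mnli[0]
enc = tok(ex["premise"], ex["hypothesis"], truncation=True)
print(mnli.features["label"].names)   # ['entailment', 'neutral', 'contradiction']
print(tok.decode(enc["input_ids"]))   # [CLS] premise [SEP] hypothesis [SEP]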
 
 
MultiNLI Mismatched
 
 
 
Question NLI
 
 
 
Recognizing Textual Entailment
A man is due in court later charged with the murder 26 years ago of a teenager whose case was the first to be featured on BBC One's Crimewatch. Colette Aram, 16, was walking to her boyfriend's house in Keyworth, Nottinghamshire, on 30 October 1983 when she disappeared. Her body was later found in a field close to her home. Paul Stewart Hutchinson, 50, has been charged with murder and is due before Nottingham magistrates later.
 
Paul Stewart Hutchinson is accused of having stabbed a girl.
 
not_entailment
?
 
Winograd NLI
pairs of sentences that differ in only one word; the task is to resolve which noun an ambiguous pronoun refers to
example:
I. The trophy would not fit in the brown suitcase because it was too big (small). What was too big (small)?
Answer 0: the trophy
Answer 1: the suitcase
~830 lines of data in total
 
 
-----------------------------------------------------
Popular Language Models
 
BERT
- versions - e.g. different casing versions, model sizes: 
bert-base-cased
bert-base-uncased
bert-large-cased
bert-large-uncased
    - with hyperparameters of the different versions (number of layers, heads, dimensions)
    base
        hidden_layers 12
        hidden_size 768
        attention_heads 12
        110M parameters
    large
        hidden_layers 24
        attention_heads 16
        hidden_size 1024
        340M parameters
- pretraining objectives MLM, NSP
- training data:
BookCorpus (unpublished books) + English Wikipedia
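 
The layer/head/size numbers above can be read straight from the published configs. A minimal sketch (assuming the `transformers` library and Hub access):

# Sketch: print layers, heads and hidden size for two BERT checkpoints.
from transformers import AutoConfig

for name in ["bert-base-uncased", "bert-large-uncased"]:
    cfg = AutoConfig.from_pretrained(name)
    print(name, cfg.num_hidden_layers, cfg.num_attention_heads, cfg.hidden_size)
# bert-base-uncased 12 12 768
# bert-large-uncased 24 16 1024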
 
RoBERTa
- versions - e.g. different casing versions, model sizes: 
roberta-base, roberta-large
    - with hyperparameters of the different versions (number of layers, heads, dimensions)
Base: layers:12, heads: 12, hidden_size=768
Large: layers: 24 heads: 16 hidden_size=1024
Base training:
Vocabulary size: 50,000
1024 V100 GPUs for 500K steps with a batch size of 8K and a sequence length of 512. The optimizer used is Adam with a learning rate of 6e-4, β1 = 0.9, β2 = 0.98 and ε = 1e-6, a weight decay of 0.01, learning rate warmup for 24,000 steps and linear decay of the learning rate after.
 Large training:
The model was trained on 1024 V100 GPUs for 500K steps with a batch size of 8K and a sequence length of 512. The optimizer used is Adam with a learning rate of 4e-4, β1 = 0.9, β2 = 0.98 and ε = 1e-6, a weight decay of 0.01, learning rate warmup for 30,000 steps and linear decay of the learning rate after.
- pretraining objectives
Masked-language-modeling
Byte-pair Encoding tokenization
Dynamic masking (the MLM mask is re-sampled every time a batch is built; see the sketch below)
- training data:
bookcorpus, English wikipedia, CC-News, OpenWebText, Stories
(160 GB of text)
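 
A minimal sketch of the dynamic-masking behaviour (assuming `transformers`; this uses the generic MLM data collator, not RoBERTa's original training code):

# Sketch: the collator re-samples <mask> positions on every call.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tok = AutoTokenizer.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(tok, mlm=True, mlm_probability=0.15)

enc = tok("RoBERTa re-masks tokens every time a batch is built.", return_tensors="pt")
batch1 = collator([{"input_ids": enc["input_ids"][0]}])
batch2 = collator([{"input_ids": enc["input_ids"][0]}])
print(batch1["input_ids"])  # the masked positions usually differ
print(batch2["input_ids"])  # between the two calls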
 
 
DistilBERT
- versions - e.g. different casing versions, model sizes: 
distilbert-base-uncased
distilbert-base-multilingual-cased
    - with hyperparameters of the different versions (number of layers, heads, dimensions)
    distilbert-base-uncased:
    6-layer, 768-hidden, 12-heads, 66M parameters
    distilbert-base-multilingual-cased:
    6-layer, 768-hidden, 12-heads, 134M parameters
- pretraining objectives
Distillation loss, Masked language modeling (MLM), Cosine embedding loss (see the loss sketch below)
- training data:
English wikipedia, bookcorpus
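 
The three training signals listed above can be sketched in a few lines of PyTorch (shapes, temperature and the equal weighting are illustrative assumptions, not the paper's exact settings):

# Sketch: DistilBERT-style triple loss at one masked position.
import torch
import torch.nn.functional as F

T = 2.0                                   # softmax temperature (assumed value)
student_logits = torch.randn(1, 30522)    # placeholder student vocabulary logits
teacher_logits = torch.randn(1, 30522)    # placeholder teacher vocabulary logits
gold_token = torch.tensor([42])           # true id of the masked token
student_h = torch.randn(1, 768)           # student hidden state
teacher_h = torch.randn(1, 768)           # teacher hidden state

# distillation loss: KL between temperature-softened distributions
loss_distil = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                       F.softmax(teacher_logits / T, dim=-1),
                       reduction="batchmean") * T * T
# ordinary MLM cross-entropy against the gold token
loss_mlm = F.cross_entropy(student_logits, gold_token)
# cosine loss aligning student and teacher hidden states
loss_cos = F.cosine_embedding_loss(student_h, teacher_h, torch.ones(1))

loss = loss_distil + loss_mlm + loss_cos  # equal weights, for illustration
print(loss.item())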
 
XLNet
- versions - e.g. different casing versions, model sizes: 
xlnet-base-cased
xlnet-large-cased
    - with hyperparameters of the different versions (number of layers, heads, dimensions)
    base
        hidden_layers 12
        hidden_size 768
        attention_heads 12
        110M parameters
    large
        hidden_layers 24
        attention_heads 16
        hidden_size 1024
        340M parameters
- pretraining objectives
    permutation language modeling (see the permutation-mask sketch below)
    intended downstream tasks: question answering, natural language inference, sentiment analysis, document ranking
- training data:
BookCorpus (unpublished books) + English Wikipedia; the released models additionally used Giga5, ClueWeb 2012-B and Common Crawl
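 
The permutation objective can be pictured as an attention mask built from a sampled factorization order: each position may only look at positions that come earlier in that order. A conceptual sketch (not XLNet's actual two-stream implementation):

# Sketch: build a permutation-LM attention mask for a 5-token sequence.
import torch

seq_len = 5
perm = torch.randperm(seq_len)               # sampled factorization order
rank = torch.empty(seq_len, dtype=torch.long)
rank[perm] = torch.arange(seq_len)           # position -> its step in the order

# mask[i, j] is True when position i may attend to position j
mask = rank.unsqueeze(1) > rank.unsqueeze(0)
print(perm)
print(mask)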
 
multilingual BERT
- versions - e.g. different casing versions, model sizes: 
bert-base-multilingual-cased, 104 languages
bert-base-multilingual-uncased, 102 languages
distilbert-base-multilingual-cased
    - with hyperparameters of the different versions (number of layers, heads, dimensions)
base:
12 layers
768 dimension
12 heads
~168M (uncased) / ~179M (cased) parameters
distil:
6 layers
768 dimension
12 heads
134M parameters
- pretraining objectives
Masked Language Modeling
Next Sentence Prediction
- training data:
the entire Wikipedia dump for each language
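 
All languages share one WordPiece vocabulary, so a single tokenizer handles every input language. A minimal sketch (assuming `transformers`; the German sentence is just an example):

# Sketch: the same mBERT tokenizer splits English and German text.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tok.tokenize("The trophy would not fit in the suitcase."))
print(tok.tokenize("Der Pokal passte nicht in den Koffer."))
print("vocab size:", tok.vocab_size)   # ~119k shared WordPiece tokens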
 
XLM-RoBERTa
- versions - e.g. different casing versions, model sizes: 
    - with hyperparameters of the different versions (number of layers, heads, dimensions)
xlm-roberta-base
    - 12 layers
    - 12 attention heads
    - 768 hidden size, 3072 dimensions in the ff layer
    - ~270M parameters
xlm-roberta-large
    - 24 layers
    - 16 attention heads
    - 1024 hidden size, 4096 dimensions in the ff layer
    - ~550M parameters
vocabulary size: 250,000 (shared SentencePiece vocabulary)
- pretraining objectives
MLM
supposed to be fine-tuned on tasks such as
    - sequence classification
    - token classification
    - question answering
- training data:
2.5TB of filtered CommonCrawl data containing 100 languages.
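 
A minimal fine-tuning setup for one of the tasks above, sequence classification (assuming `transformers`; the 3-way label count is an assumption for an NLI-style task):

# Sketch: xlm-roberta-base with a freshly initialized classification head.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3)
print(model.config.vocab_size)   # 250002: the large shared SentencePiece vocab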
 
 
 
GPT (GPT-1)
a semi-supervised approach for language understanding tasks using a
combination of unsupervised pre-training and supervised fine-tuning
- versions - e.g. different casing versions, model sizes: 
openai-gpt (a single released size)
    - with hyperparameters of the different versions (number of layers, heads, dimensions)
    a 12-layer decoder-only Transformer
    masked self-attention heads (768-dimensional states and 12 attention heads)
    fine-tuning reuses the hyperparameter settings from unsupervised pre-training
- pretraining objectives
causal (left-to-right) language modeling
Fine-tuning tasks:
Natural language inference
Question answering
Sentence similarity
Classification
- training data:
The training procedure consists of two stages: unsupervised pretraining and supervised fine-tuning 
the BooksCorpus dataset was used for pre-training; it contains over 7,000 unique unpublished books from a variety of genres.
 
GPT-2
- versions - e.g. different casing versions, model sizes: 
gpt2, gpt2-medium, gpt2-large, gpt2-xl
    - with hyperparameters of the different versions (number of layers, heads, dimensions)
    gpt2: 12 layers, 12 heads, 768 hidden, 124M parameters
    gpt2-medium: 24 layers, 16 heads, 1024 hidden, 355M parameters
    gpt2-large: 36 layers, 20 heads, 1280 hidden, 774M parameters
    gpt2-xl: 48 layers, 25 heads, 1600 hidden, 1.5B parameters
- pretraining objectives: causal language modeling (text generation; see the sketch below)
- training data: WebText (~40GB), webpages linked on reddit with 3 or more upvotes
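 
Because the pretraining objective is causal language modeling, the checkpoint generates text out of the box. A minimal sketch (assuming `transformers`; prompt and sampling settings are illustrative):

# Sketch: sample a continuation from GPT-2.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("The GLUE benchmark consists of", max_length=30,
                do_sample=True, num_return_sequences=1)
print(out[0]["generated_text"])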
 
 
GPT-3 
 
also trained on German data
 
- versions - e.g. different casing versions, model sizes: 
 
 
GPT-3 Small 12 layers
GPT-3 Medium 24 layers
GPT-3 Large 24 layers
GPT-3 XL 24 layers
GPT-3 2.7B 32 layers
GPT-3 6.7B 32 layers
GPT-3 13B 40 layers
GPT-3 175B 96 layers
 
    - with hyperparameters of the different versions (number of layers, heads, dimensions)
    GPT-3 175B: 175B parameters, 96 layers, 96 attention heads, 12288 hidden size, 3.2M batch size
- pretraining objectives
causal language modeling (next-token prediction)
Tasks (via few-shot prompting):
Question answering
Human-like text generation
Code, poem, and story generation
etc.
- training data: (Wiki)
60% of the weighted pre-training mix from a filtered version of Common Crawl
22% from WebText2
 8% from Books1
 8% from Books2
 3% from Wikipedia
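 
The percentages above are mixture weights, i.e. how often each source is drawn during training rather than its share of raw tokens. A minimal sketch (source names are just labels; the weights are the rounded values from the list above):

# Sketch: draw training sources according to the GPT-3 mixture weights.
import random

sources = ["Common Crawl (filtered)", "WebText2", "Books1", "Books2", "Wikipedia"]
weights = [0.60, 0.22, 0.08, 0.08, 0.03]   # rounded; random.choices normalizes them
print(random.choices(sources, weights=weights, k=10))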