Just Split, Dropout and Pay Attention

a recipe for low-resource NMT

[photo: procession to the Church of St. Rocco and St. Francis of Paola]

Recent research and our own experiments have shown that it is possible to create neural machine translators that achieve relatively high BLEU scores with small datasets of parallel text.

The trick is to train a smaller model for the smaller dataset.

Training a large model on a small dataset is comparable to estimating a regression model with a large number of parameters on a dataset with few observations: it leaves too few degrees of freedom. The model over-fits the training data and does not make good predictions.

Reducing the vocabulary with subword-splitting, training a smaller network and setting a high dropout parameter all reduce over-fitting. Self-attentional neural networks also reduce over-fitting because, compared to recurrent and convolutional networks, they are less complex: they directly model the relationships between words in a pair of sentences.

This combination of splitting, dropout and self-attention achieved a BLEU score of 25.1 on English-to-Sicilian translation and 29.1 on Sicilian-to-English with only 16,945 lines of parallel training data containing 266,514 Sicilian words and 269,153 English words.

And because the networks were small, each model took just under six hours to train on CPU.
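For readers who want to score their own output, corpus-level BLEU can be computed with a tool like sacrebleu. The sketch below is only illustrative – the file names are assumptions, not our actual test files.

```python
# illustrative BLEU scoring sketch with sacrebleu; file names are hypothetical
import sacrebleu

# one detokenized sentence per line
with open("test.hyp.en") as f:
    hypotheses = [line.strip() for line in f]
with open("test.ref.en") as f:
    references = [line.strip() for line in f]

# corpus_bleu takes the hypotheses and a list of reference streams
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")
```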

Our approach combines the best practices for low-resource NMT developed by Sennrich and Zhang (2019) with the self-attentional Transformer model developed by Vaswani et al. (2017).

For training, we used the Sockeye toolkit by Hieber et al. (2017) running on a server with four 2.40 GHz virtual CPUs.

In their best practices for low-resource NMT, Sennrich and Zhang suggest the byte-pair encoding (i.e. subword-splitting) developed by Sennrich, Haddow and Birch (2016), a smaller neural network with fewer layers, smaller batch sizes and larger dropout parameters.

Using those best practices in the "BiDeep RNN" architecture proposed by Miceli Barone et al. (2017), they achieved a BLEU score of 16.6 on German-to-English translation with only 100,000 words of parallel training data.

Their largest improvements in translation quality came from the application of a byte-pair encoding (i.e. subword-splitting) that reduced the vocabulary from 14,000 words to 2,000 words. Their most successful training runs also used high dropout parameters.
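Byte-pair encoding starts from a character-level segmentation of the training text and repeatedly merges the most frequent pair of adjacent symbols until it reaches the desired number of merge operations. The toy sketch below illustrates the learning step; it is not the tooling we actually used, just the algorithm in miniature.

```python
# toy sketch of byte-pair encoding (Sennrich, Haddow and Birch, 2016):
# repeatedly merge the most frequent pair of adjacent symbols
import collections
import re

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite the vocabulary with the chosen pair fused into one symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# word frequencies, with words spelled as space-separated characters plus an end marker
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

num_merges = 10   # our models used a few thousand merge operations
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print("merged:", best)
```

Rare words get split into smaller, more frequent pieces, so the network sees a vocabulary of a few thousand subwords instead of tens of thousands of word forms.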

During training, dropout randomly shuts off a percentage of units (by setting them to zero), which effectively prevents the units from adapting to each other. Each unit therefore becomes more independent of the others because the model is trained as if it had a smaller number of units, thus reducing over-fitting (Srivastava et al., 2014).
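A minimal sketch of the idea, using the "inverted" form of dropout found in most modern toolkits (the scaling keeps the expected activation unchanged, so nothing special is needed at test time):

```python
# illustrative sketch of inverted dropout on a layer's activations
import numpy as np

def dropout(activations, rate, training=True, rng=None):
    """Zero a random fraction `rate` of units and rescale the survivors."""
    if not training or rate == 0.0:
        return activations
    rng = rng or np.random.default_rng()
    keep = (rng.random(activations.shape) >= rate).astype(activations.dtype)
    return activations * keep / (1.0 - rate)

x = np.ones((2, 8))
print(dropout(x, rate=0.5))   # roughly half of the units are zeroed on each pass
```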

BLEU scores

dataset                   subwords                             En-Sc             Sc-En
20                        2,000                                11.4              12.9
21                        2,000                                12.9              13.3
23                        3,000                                19.6              19.5
24                        3,000                                19.6              21.5
25                        3,000                                21.1              21.2
27                        3,000                                22.4              24.1
28                        3,000                                22.5              25.2
29                        3,000                                24.6              27.0
30                        3,000                                25.1              29.1
30 +back                  5,000                                27.7
30 Books +back            Sc: 5,000  En: 7,500  It: 5,000      19.7 / 35.1*      26.2 / 34.6*
33 homework Books +back   Sc: 5,000  En: 7,500  It: 5,000      35.0*             36.8*
                                                               (It-Sc: 36.5†)    (Sc-It: 30.9†)

* larger model,  † M2M model

datasets

dataset                   lines        Sc words      En words
20                         7,721       121,136       121,892
21                         8,660       146,370       146,437
23                        12,095       171,278       175,174
24                        13,060       178,714       183,736
25                        13,392       185,540       190,538
27                        13,839       190,072       195,372
28                        14,494       196,911       202,652
29                        16,591       258,730       261,474
30                        16,945       266,514       269,153

30 +back
  30                      16,829       261,421       264,242
  +back                   +3,251       +92,141

30 Books +back
  30                      16,891       262,582       266,740
  Books                   32,804                     929,043
  +back                   +3,250       +92,146

33 homework Books +back
  33                      12,357       237,456       236,568
  hw Sc-En                 4,660        30,244        35,173
  hw Sc-It                 4,660        30,244
  hw En-It                 4,660                      35,173
  Books                   28,982                     836,757
  +bk Sc→It               +3,250
  +bk En/It→Sc            +3,250       +92,146

model sizes

                    defaults    ours    larger    M2M
layers                     6       3         4      4
embedding size           512     256       384    512
model size               512     256       384    512
attention heads            8       4         6      8
feed forward            2048    1024      1536   2048
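To make those hyperparameters concrete, the sketch below instantiates a generic PyTorch Transformer with the dimensions from the "ours" column. It is only an illustration of the model's size – our models were trained with Sockeye, whose configuration interface differs, and the dropout rate shown is an assumption.

```python
# illustrative only: a generic PyTorch Transformer with the dimensions from the
# "ours" column (3 layers, 256-dim embeddings and model, 4 heads, 1024 feed-forward).
# Our models were trained with Sockeye; this is not its API.
import torch.nn as nn

model = nn.Transformer(
    d_model=256,             # model size (and embedding size)
    nhead=4,                 # attention heads
    num_encoder_layers=3,    # layers
    num_decoder_layers=3,
    dim_feedforward=1024,    # feed-forward hidden size
    dropout=0.3,             # assumed: a high dropout rate, per Sennrich and Zhang (2019)
)

num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params:,} parameters")   # small enough to train in hours on CPU
```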

Subword-splitting and high dropout parameters helped us achieve better than expected results with a small dataset, but it was the Transformer model that pushed our BLEU scores into the double digits.

Compared to recurrent neural networks, the self-attention layers in the Transformer model learn the dependencies between words in a sequence more easily because they are less complex.

Recurrent networks read words sequentially and employ a gating mechanism to identify relationships between separated words in a sequence. By contrast, self-attention examines the links between all the words in the paired sequences and directly models those relationships. It's a simpler approach.
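In code, a single attention head boils down to a few matrix products. The sketch below shows scaled dot-product self-attention for one sentence and one head (real models use multiple heads, masking and learned projections, but the core computation is this simple):

```python
# minimal sketch of scaled dot-product self-attention (Vaswani et al., 2017)
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Every position attends directly to every other position,
    so distant words are linked without recurrence or gating."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise similarities
    weights = softmax(scores, axis=-1)        # one attention distribution per word
    return weights @ V                        # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
X = rng.normal(size=(seq_len, d_model))                   # one embedded sentence
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)                # (5, 16)
```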

Combining these three features – subword-splitting, dropout and self-attention – yields a trained model that makes relatively good predictions. And as described on the multilingual translation page, adding Italian-English data should improve translation quality even more.

In an initial experiment, we added the Italian-English subset of Farkas' Books to our dataset and trained two translators – one from Sicilian and Italian into English and the other from English into Sicilian and Italian.
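One common way to train a single model into more than one target language is to prepend a target-language token to every source sentence, in the style of Google's multilingual system (Johnson et al., 2017). The sketch below illustrates that preprocessing step; the token format and example sentences are assumptions for illustration, not a description of our actual pipeline.

```python
# illustrative preprocessing for a one-to-many model (English into Sicilian and Italian):
# prepend a target-language token to each source line, in the style of
# Johnson et al. (2017).  The token format here is an assumption, not our exact setup.
def tag_source(lines, target_lang):
    return [f"<2{target_lang}> {line.strip()}" for line in lines]

en_for_sc = ["We trained the model on parallel text."]
en_for_it = ["The network has three layers."]

mixed_source = tag_source(en_for_sc, "sc") + tag_source(en_for_it, "it")
for line in mixed_source:
    print(line)
# the paired target file simply concatenates the Sicilian and Italian reference translations
```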

As the table above shows, adding that data while holding model size constant reduced translation quality, an effect consistent with the findings of Arivazhagan et al. (2019), who show that training a larger model can improve translation quality across the board.

So to push our BLEU scores into the thirties, we trained a larger model. And, as we'll discuss on the multilingual translation page, we also trained another model that can translate between English, Sicilian and Italian.

So come to Napizia and explore all six translation directions with our Tradutturi Sicilianu!

Copyright © 2002-2024 Eryk Wdowiak