Sicilian Translator

To show our progress in the development of a machine translator, we put an experimental Sicilian translator online.

It now translates simple sentences fairly well. For example:

Those are some of its successes. But the translator is still in the developmental stage, so you will still find many things that it does not translate well.

Please focus on our success.

This experiment has shown that we will succeed in our goal of creating a good translator for the Sicilian language once we have assembled enough parallel text (i.e. pairs of translated sentences). That is a very time-consuming task, so please be patient.

We have developed the methods necessary to create a neural machine translator for the Sicilian language. We just have to assemble some more parallel text.

During training, a neural machine translator "learns" through a process of trial and error. First, it predicts a translation. Then it compares its prediction to the correct translation and adjusts the model parameters in the direction that most reduces the errors.
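The predict, compare and adjust cycle described above can be sketched with a toy example. This is only an illustration, not the translator's actual training code: a single made-up parameter `w` stands in for the millions of parameters of a real model, numbers stand in for sentences, and the function name, learning rate and data are all assumptions chosen for the example.

```python
# Toy illustration of "predict, compare, adjust" (gradient descent).
# One parameter w stands in for the parameters of a real neural translator,
# and a number stands in for a translated sentence. All values are made up.

def train(pairs, learning_rate=0.05, epochs=200):
    w = 0.0  # initial guess for the model parameter
    for _ in range(epochs):
        for x, target in pairs:
            prediction = w * x             # 1. predict
            error = prediction - target    # 2. compare with the correct answer
            gradient = 2 * error * x       # slope of the squared error with respect to w
            w -= learning_rate * gradient  # 3. step in the direction that reduces the error
    return w

pairs = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # made-up data where target = 3 * x
w = train(pairs)
print(round(w, 3))
```

The early predictions are badly wrong, and each correction moves `w` closer to the value that fits the data. With only a handful of examples a real model would not have enough errors to learn from.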

In other words, it needs to make a lot of mistakes before it begins translating properly. And currently, our dataset is not large enough for it to make, and learn from, enough mistakes to become a good translator.

Our dataset only has 12,095 pairs of translated sentences to learn from. For comparison, machine translation models are usually trained on several million pairs.

We are assembling the dataset from issues of Arba Sicula and from Arthur Dieli's translations of Pitrè's Fables and Sicilian proverbs. They have been very helpful to us and we thank them for their support and encouragement.

We look forward to putting a good quality translator online soon. In the meantime, we hope that you will be amused when it does not translate correctly.

translation quality

The unexpectedly bizarre translations that frequently appear are normal at this stage. The translator will keep producing gibberish until we assemble more parallel text. For example, Koehn and Knowles (2017) used varying amounts of parallel text to train several English-to-Spanish models. Below is a table from their paper:

Translation Quality Improves with More Parallel Text

The left column shows the fraction of the 386 million words provided by the ACL 2013 workshop that was used to train each model. At low amounts of parallel text, the model produces fluent sentences that are completely unrelated to the source sentence. But as the amount of parallel text increases, the translation becomes perfect.

And a recent paper by Sennrich and Zhang (2019) suggests that the method of subword splitting will enable us to create a good translator with a few hundred thousand words (i.e. far fewer than the millions that Koehn and Knowles needed two years earlier). Subword splitting usually helps the translator find good translations for words that appeared rarely, or not at all, in the training data.

For example, the word jatta (cat) appeared ten times in the training data, while its variant gatta only appeared once. Nonetheless, the translator still translates gatta correctly because the words are split into: j@@ atta and g@@ atta.

The disadvantage is that another rare word like cravatta (tie), which appears three times in the dataset, is split into: c@@ r@@ av@@ atta. And the splitting causes the word to be incorrectly translated as cat.

But once the dataset contains more sentences with the word cravatta, the machine will learn to translate it as: tie.
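To make the splitting concrete, here is a minimal sketch of how a list of merge rules can turn a word into subword units. The four merge rules below are hypothetical and were chosen only to reproduce the splits shown above. In a real system the rules are learned from the training data, and this simplified version omits details such as end-of-word markers.

```python
def apply_merges(word, merges):
    """Split a word into subword units by applying merge rules in priority order."""
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        candidates = [p for p in pairs if p in rank]
        if not candidates:
            break  # no known merge applies any more
        best = min(candidates, key=rank.get)  # highest-priority merge present
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return "@@ ".join(symbols)  # "@@" marks a unit that continues into the next one

# Hypothetical merge rules (assumption), listed from highest to lowest priority.
merges = [("t", "t"), ("tt", "a"), ("a", "tta"), ("a", "v")]

for word in ["jatta", "gatta", "cravatta"]:
    print(word, "->", apply_merges(word, merges))
```

Because all three words end in the shared unit atta, the model can reuse what it learned about jatta when it sees gatta. The same sharing is what misleads it on cravatta.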

And just as the translation input is a sequence of subword units, the translation output is also a sequence of subword units. Usually, merging the subword units will result in a fluent sentence, but occasionally the translator will "make up a word."
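Merging the output units back into words is a simple text substitution. The sketch below assumes the "@@ " continuation marker used in the examples above. The split frag@@ gant is a hypothetical illustration, since the units that actually produced that word are not known.

```python
import re

def merge_subwords(text):
    """Join subword units by removing the '@@ ' continuation markers."""
    return re.sub(r"(@@ )|(@@ ?$)", "", text)

print(merge_subwords("j@@ atta"))           # units from the examples above
print(merge_subwords("c@@ r@@ av@@ atta"))
print(merge_subwords("frag@@ gant"))        # hypothetical split of a made-up word
```

The merge step has no notion of whether the result is a real word. It simply glues together whatever units the model produced, which is how a made-up word can appear in the output.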

For example, one puzzled user asked what a fraggant is. It's my new word for any annoyingly incorrect combination of subword units.

Because we have less parallel training data, we have to use more subword splitting, so our model produces more fraggants. Reducing the vocabulary with subword splitting makes it possible to train a translator with only a few thousand lines of parallel text. But it also returns a lot of fraggants.

Once we have assembled more parallel text, we will need less subword splitting to train a good model, yielding better translation quality and fewer fraggants.

translation domain

I wish that this machine could translate my research into Sicilian. But this machine was not trained on economic literature. It was trained on Sicilian literature. So it's not going to translate my Robinson Crusoe model into Sicilian. At best, it might translate the Robinson Crusoe novel into Sicilian.

And in practice, the sentences that it will translate best are sentences similar to the ones it was trained on.

To cover the core language and grammar, our dataset includes exercises and examples from Gaetano Cipolla's grammar, Mparamu lu sicilianu. To include dialogue and everyday speech, our dataset includes 34 of Arthur Dieli's translations of Giuseppe Pitrè's Folk Tales. And to cover Sicilian culture, literature and history, our dataset includes prose from 14 issues of Arba Sicula.

Sentences similar to the ones found in those sources are the sentences that this machine will translate best. For a good discussion of the domain challenges in machine translation, see the paper by Koehn and Knowles (2017).

To expand our translator's domain, we will need sentences from other domains. One possible source is Wikipedia. If we translated English Wikipedia articles into Sicilian, we could expand Sicilian Wikipedia and expand the domain of our translator. We would be happy to assist in such efforts.

In the meantime, we'll produce a good translator for the domain of Sicilian culture, literature and history.

how to use the translator

Just type the sentence that you want to translate into the input box, select the appropriate direction (i.e. either "Sicilian-English" or "English-Sicilian") and press the "translate" button.

For best results when translating from Sicilian to English, use the standard Sicilian forms. For example, use dici (not rici), use bedda (not bella), etc. And do not use apostrophes in the place of the elided i. For example, use mparamu (not 'mparamu), use nzignamunni (not 'nzignamunni), etc.

You do not need any special keyboard. A standard American or Italian keyboard should work fine because – with the exception of è and sì – you do not need to use accents at all.

If you're using an American keyboard, you can type the word è as e'. And you can type the word sì as si'.

Or if you're using an Italian keyboard, just type as you normally would. The translator will automatically perform the appropriate conversions to any accented letter that you type.
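As a rough illustration of what such conversions might look like, here is a minimal sketch. It is not the translator's actual preprocessing code. It assumes that the conversion amounts to turning e' and si' into the accented words è and sì (the second word is inferred by analogy with the first) and dropping the accents from every other word. The function names are made up for the example, and only lowercase e' and si' are handled.

```python
import re
import unicodedata

KEEP_ACCENT = {"è", "sì"}  # assumption: the only two words that keep their accent

def strip_accents(word):
    """Remove accents from a word unless it is one of the two accented words."""
    if word.lower() in KEEP_ACCENT:
        return word
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize_input(text):
    text = unicodedata.normalize("NFC", text)
    # American keyboard: e' and si' typed as whole words become è and sì
    text = re.sub(r"\be'(?!\w)", "è", text)
    text = re.sub(r"\bsi'(?!\w)", "sì", text)
    # Italian keyboard: accents on all other words are dropped
    return re.sub(r"\w+", lambda m: strip_accents(m.group()), text)

print(normalize_input("e' bedda"))
print(normalize_input("si', Pitrè"))
```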

frequently asked questions

Why doesn't it translate properly?

The current translator was trained on a dataset with 12,095 pairs of translated sentences. Machine translation models are usually trained on several million pairs. We think we can create a good translator with far fewer, but assembling a larger dataset takes time. Please be patient.

This is an experimental product designed to test the methodologies that we will use once we have a larger dataset. It is not ready for serious translation yet.

Will it ever translate properly?

Yes! Once we assemble a large enough dataset, the translation quality will be very good.

Do I need a special keyboard to type Sicilian letters?

No. You can type Sicilian words without using any accents at all. So if you have a standard American keyboard, go ahead and use it. The only two words which require an accent are è and sì, which you can type as e' and si' respectively. (In other words, when typing those two words, just add an apostrophe to the end.)

Or if you have an Italian keyboard, go ahead and use it. The translator will automatically perform the appropriate conversions to any accented letter that you type.

How did you create this translator?

With neural machine translation, a form of artificial intelligence which "learns" how to translate by examining thousands of sentences that humans translated. For the details and source code, please see the machine translation page.

Copyright © 2018-2020 Eryk Wdowiak