Sicilian Translator

To show our progress in the development of a machine translator, we put an experimental Sicilian translator online.

It now translates simple sentences fairly well. For example:

Those are some of its successes. But the translator is still in the developmental stage, so you will still find many things that it does not translate well.

Please focus on our success.

This experiment has shown that we will succeed in our goal of creating a good translator for the Sicilian language once we have assembled enough parallel text (i.e. pairs of translated sentences). That is a very time-consuming task, so please be patient.

We have developed the methods necessary to create a neural machine translator for the Sicilian language. We just have to assemble some more parallel text.

During training, a neural machine translator "learns" through a process of trial and error. First, it predicts a translation. Then it compares its prediction to the correct translation and adjusts the model parameters in the direction that most reduces the errors.
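The predict, compare and adjust loop can be illustrated with a toy example. This is only a minimal sketch, not our translator: it assumes a model with a single parameter w and a squared-error measure, and the names train, pairs and learning_rate are hypothetical. A real neural machine translator repeats the same idea over millions of parameters.

```python
def train(pairs, learning_rate=0.1, steps=200):
    """Fit y = w * x by repeatedly predicting, comparing and adjusting w."""
    w = 0.0  # start from an uninformed guess
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w:
        # the average of 2 * (prediction - target) * x over all pairs.
        gradient = sum(2 * (w * x - y) * x for x, y in pairs) / len(pairs)
        # Move w in the direction that most reduces the error.
        w -= learning_rate * gradient
    return w


# Toy "dataset" in which the correct answer is always twice the input.
pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
weight = train(pairs)
print(round(weight, 3))
```

Each pass through the loop is one round of "mistakes": the early predictions are badly wrong, and the corrections are what move the parameter toward a useful value.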

In other words, it needs to make a lot of mistakes before it begins translating properly. And currently, our dataset is not large enough for it to make the number of mistakes needed to yield a good translator.

Our dataset only has 20,016 pairs of Sicilian-English sentences and 10,649 Sicilian-Italian pairs to learn from. For comparison, machine translation models are usually trained on several million pairs.

We are assembling the dataset from issues of Arba Sicula and from Arthur Dieli's translations of Sicilian proverbs, poetry and Pitrè's Fables. They have been very helpful to us and we thank them for their support and encouragement.

We look forward to putting a good quality translator online soon. In the meantime, we hope that you will be amused when it does not translate correctly.

Translation Quality

The unexpectedly bizarre translations that frequently appear are normal at this stage. The translator will keep producing gibberish until we assemble more parallel text. For example, Koehn and Knowles (2017) used varying amounts of parallel text to train several English-to-Spanish models. Below is a table from their paper:

Translation Quality Improves with More Parallel Text

The left column shows the fraction used of the 386 million words provided by the ACL 2013 workshop. At low amounts of parallel text, the model produces fluent sentences that are completely unrelated to the source sentence. But as the amount of parallel text increases, the translation becomes perfect.

And a recent paper by Sennrich and Zhang (2019) suggests that the method of subword splitting will enable us to create a good translator with a few hundred thousand words (i.e. far less than the millions that Koehn and Knowles needed two years earlier). Subword splitting usually helps the translator find good translations for words that did not appear in the training data or appeared rarely in the training data.

For example, the word jatta (cat) appeared six times in the training data, while its variant gatta only appeared once. Nonetheless, the translator still translates gatta correctly because the words are split into: j@@ atta and g@@ atta.

The disadvantage is that another rare word like cravatta (tie), which appears ten times in the dataset, is split into: crav@@ atta. In previous versions of this translator, that splitting caused the word to be incorrectly translated as cat.

And just as the translation input is a sequence of subword units, the translation output is also a sequence of subword units. Usually, merging the subword units will result in a fluent sentence, but occasionally the translator will "make up a word."

For example, one puzzled user asked what a fraggant is. It's my new word for any annoyingly incorrect combination of subword units.
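To make the merging step concrete, here is a minimal sketch. It assumes the common convention, visible in the examples above, that a trailing @@ marks a unit that continues into the next one. The function name merge_subwords is hypothetical and this is not necessarily the code that the translator runs.

```python
def merge_subwords(units):
    """Join subword units into words by removing the '@@ ' continuation marks."""
    return " ".join(units).replace("@@ ", "")


# The two spellings of "cat" share the unit "atta" ...
print(merge_subwords(["j@@", "atta"]))     # jatta
print(merge_subwords(["g@@", "atta"]))     # gatta
# ... and so does the unrelated word for "tie".
print(merge_subwords(["crav@@", "atta"]))  # cravatta
```

Because the merge is purely mechanical, any sequence of units that the model emits becomes a "word", whether or not that word exists.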

Because we have less parallel training data, we have to use more subword splitting, so our model produces more fraggants. Reducing the vocabulary with subword splitting makes it possible to train a translator with only a few thousand lines of parallel text. But it also returns a lot of fraggants.

Once we have assembled more parallel text, we will need less subword splitting to train a good model, yielding better translation quality and fewer fraggants.

Translation Domain

I wish that this machine could translate my research into Sicilian. But this machine was not trained on economic literature. It was trained on Sicilian literature. So it's not going to translate my Robinson Crusoe model into Sicilian. At best, it might translate the Robinson Crusoe novel into Sicilian.

In general, the sentences that it will translate best are sentences similar to the ones it was trained on.

To cover the core language and grammar, our dataset includes exercises and examples from the textbooks Mparamu lu sicilianu (Cipolla, 2013) and Introduction to Sicilian Grammar (Bonner, 2001). To include dialogue and everyday speech, our dataset includes 34 of Arthur Dieli's translations of Giuseppe Pitrè's Folk Tales. And to cover Sicilian culture, literature and history, our dataset includes prose from 24 issues of Arba Sicula.

To augment our dataset, David Massaro contributed his collection of Bible translations and Marco Scalabrino contributed his translations of American songs.

Finally, to enable multilingual translation and to give our model more examples to learn from, we also included Italian-English text from Farkas' Books, from the Edinburgh Bible corpus and from Facebook's WikiMatrix in our dataset. All three are available from the OPUS project.

Sentences similar to the ones found in those sources are the sentences that this machine will translate best. For a good discussion of the domain challenges in machine translation, see the paper by Koehn and Knowles (2017).

To expand our translator's domain, we will need sentences from other domains. One possible source is Wikipedia. If we translated English Wikipedia articles into Sicilian, we could expand Sicilian Wikipedia and expand the domain of our translator. We would be happy to assist in such efforts.

And we'll continue to collect Sicilian language text because we want to develop a good translator for the domain of Sicilian culture, literature and history.

How to Use the Translator

Just type the sentence that you want to translate into the input box, select the appropriate direction (i.e. either "Sicilian-English" or "English-Sicilian") and press the "translate" button.

For best results when translating from Sicilian to English, use the standard Sicilian forms below. For example, use dici (not rici), use bedda (not bella), etc. And do not use apostrophes in the place of the elided i. For example, use mparamu (not 'mparamu), use nzignamunni (not 'nzignamunni), etc.

You do not need any special keyboard. A standard American or Italian keyboard should work fine because, with the exception of è and sì, you do not need to use accents at all.

If you're using an American keyboard, you can type the word è as e'. And you can type the word sì as si'.

Or if you're using an Italian keyboard, just type as you normally would. The translator will automatically perform the appropriate conversions to any accented letter that you type.
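As an illustration of the kind of conversion involved, the sketch below turns the typed forms e' and si' into their accented forms. It is an assumption about how such a step could be written, not the translator's actual code, and the names restore_accents and TYPED_TO_ACCENTED are hypothetical.

```python
import re

# Assumption: only the two words named in the text are converted.
TYPED_TO_ACCENTED = {"e'": "è", "si'": "sì"}

# Match e' or si' only when they stand alone as whole words,
# so that other words ending in an apostrophe are left untouched.
_TYPED_WORD = re.compile(r"(?<!\w)(?:e|si)'(?!\w)")


def restore_accents(text):
    """Replace the typed forms e' and si' with their accented forms."""
    return _TYPED_WORD.sub(lambda m: TYPED_TO_ACCENTED[m.group(0)], text)


print(restore_accents("iddu e' beddu"))
print(restore_accents("tu si' bedda"))
```

The whole-word check matters because the standard forms described here also use a final apostrophe on other words, such as cu' and du', and those should pass through unchanged.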

Standard Sicilian

The Sicilian language presented here does not represent any particular dialect. It presents the language that the neural network learned from translated sentence pairs. For lack of a better word, I call it: standard Sicilian.

Through selection and editing, those Sicilian sentences roughly reflect the standards that Prof. Cipolla developed in Mparamu lu sicilianu. Developing a high-quality corpus of Sicilian text requires a standard, so I have tried to implement Prof. Cipolla's standards because he has established a high level of quality in his translations.

And given the nature of the translation task, I augmented his standards with the following differences:

  1. Italian-style H on aviri verbs:  haiu, hai, havi, avemu, aviti, hannu
  2. strict use of L on articles and object pronouns:  lu, la, li
  3. strict use of apostrophe and circumflex:  cu' = cui,  cû = cu lu
  4. strict use of apostrophe and circumflex:  du' = dui,  dû = di lu
  5. CI sufficiently denotes ÇI for words like:  çiuri
  6. maintain the R when infinitive is followed by an object pronoun:  Pozzu farlu.
  7. double II only where necessary:  la farmacìa → li farmacìi,  but:  la stòria → li stòri

The first four differences sharply distinguish important words. In theory, a neural network doesn't need such distinctions because it will learn a set of rules to distinguish the different contexts. In practice, the rule that the neural network often learns is to translate a word, so it's helpful to distinguish words.

The first four differences also allow us to write rules that convert the translator's output from a literary form to a spoken form:  Vaiu a la scola → Vaiu â scola.  Hai a parrari sicilianu → Hâ parrari sicilianu.  Another set of rules allows the translator's input to handle both the literary and spoken forms:  Vaiu â scola chî libbra = Vaiu a la scola cu li libbra.  Hê parrari cû prufissuri = Haiu a parrari cu lu prufissuri.
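The input-side rules can be sketched as a lookup table of contractions. The table below contains only the pairs mentioned on this page, so it is far from complete, and the names UNCONTRACT and uncontract are hypothetical. It is a sketch under those assumptions, not the rule set that the translator actually uses.

```python
import re
import unicodedata

# Spoken form -> literary form, taken from the examples on this page.
UNCONTRACT = {
    "â": "a la",
    "chî": "cu li",
    "cû": "cu lu",
    "dû": "di lu",
    "hâ": "hai a",
    "hê": "haiu a",
}
# Store keys in composed (NFC) form so they match normalized input.
UNCONTRACT = {unicodedata.normalize("NFC", k): v for k, v in UNCONTRACT.items()}

# Match any contraction as a whole word, in either case.
_CONTRACTION = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in UNCONTRACT) + r")\b",
    flags=re.IGNORECASE,
)


def uncontract(text):
    """Expand spoken contractions into their literary forms."""
    def expand(match):
        word = match.group(0)
        literary = UNCONTRACT[word.lower()]
        # Keep a leading capital, as in "Hê" -> "Haiu a".
        return literary.capitalize() if word[0].isupper() else literary

    return _CONTRACTION.sub(expand, unicodedata.normalize("NFC", text))


print(uncontract("Vaiu â scola chî libbra"))
print(uncontract("Hê parrari cû prufissuri"))
```

Going in the other direction, from literary to spoken, would need context-sensitive rules, so it is not attempted in this sketch.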

The fifth difference, ÇI→CI, helps create an ASCII representation of the language. Because we have less data, it's helpful to reduce what we have to the minimum viable representation. Specifically, prior to translation, the machine first uncontracts (ex.:  mappa dû munnu → mappa di lu munnu), then it strips any remaining diacritics (ex.:  çiuri → ciuri, farmacìa → farmacia) and converts to lower case.
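The three preprocessing steps can be sketched with Python's standard unicodedata module. This is an illustrative reconstruction under stated assumptions: the one-entry CONTRACTIONS table and the function name to_ascii_form are hypothetical, and words are split on whitespace only, so punctuation attached to a word is not handled.

```python
import unicodedata

# Assumption: a one-entry table, using the example given in the text.
CONTRACTIONS = {unicodedata.normalize("NFC", "dû"): "di lu"}


def to_ascii_form(text):
    """Uncontract, strip remaining diacritics and convert to lower case."""
    text = unicodedata.normalize("NFC", text)
    # Step 1: expand contractions, word by word.
    words = [CONTRACTIONS.get(word.lower(), word) for word in text.split()]
    # Step 2: decompose each letter into base letter + combining marks,
    # then drop the combining marks (accents, circumflexes, cedillas).
    decomposed = unicodedata.normalize("NFD", " ".join(words))
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Step 3: lower case.
    return stripped.lower()


print(to_ascii_form("mappa dû munnu"))
print(to_ascii_form("Çiuri"))
print(to_ascii_form("farmacìa"))
```

The order matters: uncontracting has to come first, because once the circumflex is stripped, dû can no longer be told apart from du.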

The final two differences are stylistic differences. In hindsight, I should have consulted Prof. Cipolla on these points. I didn't. So the Sicilian language presented here reflects these stylistic differences.

Frequently Asked Questions

Why doesn't it translate properly?

The current translator was trained on a dataset with only 20,016 pairs of Sicilian-English sentences. Machine translation models are usually trained on several million pairs. We think we can create a good translator with less, but developing that larger dataset takes time. Please be patient.

This is an experimental product designed to test the methodologies that we will use once we have a larger dataset. It is not ready for serious translation yet.

Will it ever translate properly?

Yes! Once we assemble a large enough dataset, the translation quality will be very good.

Do I need a special keyboard to type Sicilian letters?

No. You can type Sicilian words without using any accents at all. So if you have a standard American keyboard, go ahead and use it. The only two words that require an accent are è and sì, which you can type as e' and si' respectively. (In other words, when typing those two words, just add an apostrophe to the end.)

Or if you have an Italian keyboard, go ahead and use it. The translator will automatically perform the appropriate conversions to any accented letter that you type.

I come from Suttasupra, province of Foraditesta. Can you create a translator for the dialect of my hometown?

When you have 20,016 pairs of Suttasuprisi-English sentences, we'll talk.

How did you create this translator?

With neural machine translation, a form of artificial intelligence which "learns" how to translate by examining thousands of sentences that humans translated.

Please see my Sicilian NLP pages for a complete explanation. And please come back "behind the curtain" at the Darreri lu Sipariu page, where you can see how the translator works.

How can I help?

You can help in many ways. You can create more examples for our translator to learn from. You can help develop our dictionary. Or, if you have technical expertise, you can help write code for the project. It's up to you.

Read the Next Steps page and write to me at: eryk@napizia.com. And we'll find places where you can make a difference. I look forward to working with you.

Copyright © 2018-2024 Eryk Wdowiak