How The Bible Helps Improve Machine Translation

AI-driven language translation tools are progressing quickly, helped by the huge volumes of text available for training algorithms.  Such text is not always easy to come by for older languages, however, and tools often not only produce poor literal translations but also misinterpret the style of the original.  A recent study suggests that researchers might benefit from using the Bible.

The Bible contains around 31,000 verses that have been translated into pretty much every language on earth, creating an enormous dataset of aligned parallel text that has so far gone largely untapped in the creation of language translation tools.  Working with this material, the researchers produced around 1.5 million unique pairings of source and target verses from 34 distinct versions of the English-language Bible.

The Bible is also an ideal data source because each version is thoroughly indexed and uses book, chapter and verse numbers consistently.  Because the same reference points to the same passage in every version, this predictable structure removes the risk of misalignment.
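To make the alignment idea concrete, here is a minimal sketch of how verse pairs could be built by matching on book, chapter and verse.  It is an illustration only, not the researchers' code: the version names, the placeholder sentences and the aligned_pairs helper are all assumptions made up for this example.

```python
from itertools import permutations

# Hypothetical toy data: each version maps (book, chapter, verse) to its text.
# The version names and sentences are placeholders, not real quotations.
versions = {
    "formal": {
        ("Book", 1, 1): "He did depart unto the city.",
        ("Book", 1, 2): "And they were sore afraid.",
        ("Book", 1, 3): "Thus it came to pass.",
    },
    "simple": {
        ("Book", 1, 1): "He went to the town.",
        ("Book", 1, 2): "And they were very scared.",
        # ("Book", 1, 3) is missing here, so it produces no pair.
    },
}


def aligned_pairs(versions):
    """Return (source_version, target_version, reference, source_text, target_text) tuples."""
    pairs = []
    # Every ordered pair of versions acts once as source and once as target.
    for src, tgt in permutations(versions, 2):
        # Only references present in both versions can be aligned.
        shared = versions[src].keys() & versions[tgt].keys()
        for ref in sorted(shared):
            pairs.append((src, tgt, ref, versions[src][ref], versions[tgt][ref]))
    return pairs


pairs = aligned_pairs(versions)
for src, tgt, ref, source_text, target_text in pairs:
    print(src, "->", tgt, ref, "|", source_text, "|", target_text)
```

Because the reference itself is the join key, no sentence-alignment step is needed; verses that a version lacks are simply skipped.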

Defining style

The researchers defined style by referencing sentence length, how word choices could define simplicity and formality, and the use of passive and active voices.
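As a rough illustration of two of these signals, the sketch below computes average sentence length and average word length for a passage.  The style_features function and its regular expressions are assumptions chosen for simplicity, not the measures used in the study, and passive-voice detection is left out because it generally needs a syntactic parser.

```python
import re


def style_features(text):
    """Return crude style measurements for a passage of English text."""
    # Split on runs of sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Treat runs of letters and apostrophes as words.
    words = re.findall(r"[A-Za-z']+", text)
    return {
        # Words per sentence: a rough proxy for sentence length.
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        # Letters per word: a rough proxy for simplicity of word choice.
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
    }


print(style_features("The dog ran. The cat sat down."))
```

Comparing these numbers for the same verse in two versions gives a first, very approximate sense of how far apart their styles are.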

“Different wording may convey different levels of politeness or familiarity with the reader, display different cultural information about the writer, be easier to understand for certain populations,” the authors say.

The 34 versions used by the team ranged from the King James Version to the Bible in Basic English.  The texts were fed to two AI algorithms: the first used statistical machine translation, whilst the second used a neural network framework known as Seq2Seq.

Whilst the team used the Bible for this project, they believe that similar progress could be made with any text that has been widely translated.  This could help develop technologies that can eventually translate any written text accurately for different audiences.  We’re still in the early stages of machine translation, but this work is another indicator of the progress being made.
