How Good Is Machine Translation?

One of the lesser-known, yet very cool, features of Google Docs is its ability to provide a pretty decent translation of any text you enter into it.  The functionality highlights the progress that has been made in machine translation in recent years.  Indeed, work earlier this year suggested that machine translation is now on a par with humans.

Whilst this is certainly very cool and those results garnered a lot of publicity, it shouldn’t be taken to mean that human translators are heading for the scrap heap just yet. A recent study published by the University of Zurich highlights some of the reasons why.

The study argues that the work conducted earlier this year failed to take proper account of the way we read full documents.  When the machines are tested against this more realistic benchmark, they continue to fall short.

Testing the machine

Machine translation is currently assessed on two distinct measures: adequacy and fluency.  The adequacy metric is usually determined by a human translator, who reads both the original text and the machine translation and judges how well the translation expresses the meaning of the original.  The fluency metric uses monolingual readers, who are given only the translated text and asked to rate how well it is expressed in their native tongue.
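
To make the two metrics a little more concrete, here is a minimal Python sketch of how such ratings might be collected and averaged per system.  The 0–100 scale, the record layout and the simple averaging are illustrative assumptions, not the protocol used in either study.

```python
from statistics import mean

# Hypothetical ratings from judges. Adequacy is scored by bilingual judges
# comparing source and translation; fluency is scored by monolingual judges
# reading the translation alone. The 0-100 scale and record layout are
# assumptions for illustration only.
ratings = [
    {"system": "human",   "metric": "adequacy", "score": 92},
    {"system": "human",   "metric": "fluency",  "score": 95},
    {"system": "machine", "metric": "adequacy", "score": 88},
    {"system": "machine", "metric": "fluency",  "score": 90},
    # ... one record per judge, per segment, per system in a real campaign
]

def average_scores(records):
    """Average the scores for each (system, metric) pair."""
    buckets = {}
    for r in records:
        buckets.setdefault((r["system"], r["metric"]), []).append(r["score"])
    return {key: mean(scores) for key, scores in buckets.items()}

print(average_scores(ratings))
# e.g. {('human', 'adequacy'): 92, ('human', 'fluency'): 95, ...}
```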

Now, most people working in the field think this is a decent approach, but there is still variation in how it's applied.  For instance, many studies to date have only applied it at the sentence level, when of course we humans tend to read texts in full.

The Swiss team have therefore come up with a more detailed protocol that allows man and machine to be compared at the document level.  They put their new system to the test on around 100 articles that had been written in Chinese and then translated into English by both man and machine.  The human judges were asked to rate the adequacy and fluency of the translations at both the sentence and the document level.

As the results from earlier this year showed, man and machine were pretty comparable when the judges only looked at the output sentence by sentence.  When they looked at things across the entire document, however, the humans came out clearly on top.
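
To see why the unit of evaluation matters, here is a minimal sketch, with entirely made-up scores on an assumed 0–100 scale, of how the gap between the two systems might be computed when the same kind of ratings are gathered for isolated sentences versus whole documents.  Nothing here comes from the study itself; it simply illustrates the shape of the comparison.

```python
from statistics import mean

# Hypothetical adequacy ratings for the same articles, collected twice:
# once with judges shown isolated sentences, once with judges shown the
# whole document. All numbers are invented for illustration.
sentence_level = {"human": [90, 85, 88], "machine": [89, 85, 87]}
document_level = {"human": [91, 87, 89], "machine": [80, 78, 82]}

def gap(ratings):
    """Average human score minus average machine score."""
    return mean(ratings["human"]) - mean(ratings["machine"])

print("sentence-level gap:", round(gap(sentence_level), 1))  # small gap
print("document-level gap:", round(gap(document_level), 1))  # clear human lead
```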

“Human raters assessing adequacy and fluency show a stronger preference for human over machine translation when evaluating documents as compared to isolated sentences,” the researchers explain.

They believe this is largely because only when we assess a text as a whole do we uncover errors such as an ambiguously translated word or a lack of textual cohesion.  These generally pass us by in individual sentences, but they stand out when the sentences are read together.

There has undoubtedly been tremendous progress in the field, but as those advances emerge, the way we assess the technology must evolve too, so that it remains rigorous enough to reflect real-world use.

“As machine translation quality improves, translations will become harder to discriminate in terms of quality, and it may be time to shift towards document-level evaluation, which gives raters more context to understand the original text and its translation, and also exposes translation errors related to discourse phenomena which remain invisible in a sentence-level evaluation,” the researchers conclude.
