Recently I wrote about a study that tested the potential for automated speech writing. It uses a rule-based approach developed after researching thousands of successful and unsuccessful speeches over the years.
Meanwhile, advances are also being made in speech analytics, with a Google Glass-like device giving speakers live feedback on everything from their pitch to their cadence.
Whilst this may sound as though we're rapidly hurtling towards a time when robots will be able to deliver speeches, the actual voice of the machine is still something researchers are struggling to crack.
Making machines sound human
It’s a problem that IBM attempted to tackle when they developed a voice for Watson. Early attempts didn’t sound particularly human, but they weren’t quite as foreboding as HAL either.
Since those pioneering days, voices have been added to a range of computerized platforms, from your GPS device to the Siri-like personal assistant on your mobile phone.
Such voices are also increasingly deployed in robotic assistants for the home, the factory and various medical environments.
Developments in this area revolve around what are known as ‘conversational agents’: programs that can both understand natural language and respond in kind.
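To make that a little more concrete, here is a very rough sketch in Python of the understand-then-respond loop at the heart of a conversational agent. The keywords and replies are invented purely for illustration; a real system would put speech recognition in front of this loop and speech synthesis behind it.

```python
# A minimal sketch of the "understand, then respond" loop behind a
# conversational agent. The intent rules and replies below are invented
# for illustration only.

RULES = [
    ("hello", "Hello! How can I help you today?"),
    ("weather", "I'm afraid I can't see outside, but I hope it's sunny."),
    ("bye", "Goodbye!"),
]


def handle(utterance: str) -> str:
    """Map a user utterance to a canned reply via simple keyword matching."""
    text = utterance.lower()
    for keyword, reply in RULES:
        if keyword in text:
            return reply
    return "Sorry, I didn't understand that."


if __name__ == "__main__":
    for line in ["Hello there", "What's the weather like?", "Bye for now"]:
        print(f"user:  {line}")
        print(f"agent: {handle(line)}")
```

Real agents replace the keyword matching with far more sophisticated language understanding, but the overall shape of the loop is much the same.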
Alas, the field is still a long way from producing a voice that is indistinguishable from a human’s. We’re nowhere near a machine being able to pass an audible Turing test, for instance.
There is also the issue of the ‘uncanny valley’, which describes our revulsion at things that are broadly similar to us, but still noticeably not us. So the closer robots get to seeming human without quite getting there, the more turned off we are.
Mixing and matching
At the moment, most synthesized speech is generated from a huge database of recorded words and smaller fragments of speech that can then be stitched together into something sensible-sounding.
That database consists of recordings of humans speaking those words, but even then there are distinct challenges in capturing the inflection used to convey emotion and context in particular circumstances. Simply having one recording of each word is therefore not enough, and that is before accents, dialects and slang are taken into account.
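As a toy illustration of that mix-and-match approach, here is a short Python sketch that treats synthesis as pulling clips from a word database and concatenating them. The “recordings” are faked with sine tones so the script runs without any audio files; a real system would store many human recordings per unit and pick whichever one best fits the surrounding context.

```python
# Toy concatenative synthesis: each "word" is a pre-recorded clip pulled
# from a database, and the clips are simply stitched together with short
# pauses. Real recordings are faked here with sine tones.

import wave

import numpy as np

SAMPLE_RATE = 16000


def fake_recording(freq_hz: float, seconds: float = 0.3) -> np.ndarray:
    """Stand-in for a recorded word: a short sine tone."""
    t = np.linspace(0.0, seconds, int(SAMPLE_RATE * seconds), endpoint=False)
    return 0.3 * np.sin(2 * np.pi * freq_hz * t)


# The "database" of recorded units, keyed by word.
DATABASE = {
    "hello": fake_recording(220.0),
    "world": fake_recording(330.0),
}


def synthesize(words: list) -> np.ndarray:
    """Concatenate the stored clips, with a short pause between words."""
    pause = np.zeros(int(0.1 * SAMPLE_RATE))
    clips = []
    for word in words:
        clips.append(DATABASE[word])  # a real system would choose among many takes here
        clips.append(pause)
    return np.concatenate(clips)


if __name__ == "__main__":
    audio = synthesize(["hello", "world"])
    samples = (audio * 32767).astype(np.int16)
    with wave.open("output.wav", "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(SAMPLE_RATE)
        f.writeframes(samples.tobytes())
```

Even this toy version shows the limitation: with only one clip per word, every sentence comes out with exactly the same flat delivery.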
This remains a challenge that the industry has not managed to overcome, so whilst synthesized speech is largely functional, it is still some way from really reflecting our own speech.
Whether we really want synthesized speech to become lifelike, however, is another matter. At the start of this post I mentioned a project designed to automate speech writing, and it doesn’t seem a stretch to think that once both the content and the delivery are automated, there could be some serious implications.
For instance, the Israeli tech company Imperson is believed to be considering a foray into politics, with politicians deploying an avatar developed by the company to represent them online.
So you could have a digital Donald Trump cut loose on Twitter, talking in the same way as the Donald does in real life.
There is already evidence that organizations are using bots online to engage with stakeholders, but giving those bots the ability to talk fluently and convincingly would hand them a whole new level of power.
Interesting times.
Very interesting and something I hadn't given a great deal of thought to. I suppose barking out single sentences, like on your SatNav, is relatively easy, but having a conversation is much, much harder.