AI Experiments

Speak in any language

Humans is building a blockchain-based ecosystem, a medium in which AI NFTs can evolve and be put to use in the world of business.


The Idea

As speech is one of our main means of communication, the AIs we developed empower people to speak in any language by generating a unique synthetic voice that acts as an extension of their real voice in the digital space. The technology enables full translation while preserving the speaker's original voice. This way, everyone can generate their own avatar and make it speak in multiple languages.

Each language has its own facial expressions and lip movements, which the system reproduces automatically when a language is chosen. The technology therefore not only generates a digital avatar that can talk in multiple languages but also adapts the lip-sync to fit the spoken language, rendering an avatar that can be mistaken for a real person. In this initial use case, users provide text input, which is converted into audio spoken in any language by the digital avatar.
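The use case described above can be sketched as a three-stage pipeline: translate the text, synthesize it in the user's voice, then drive the avatar's lip-sync. The sketch below is purely illustrative; every function name and parameter is a hypothetical placeholder standing in for the real models, not an actual API.

```python
# Illustrative sketch of the text-to-avatar flow; all names are
# hypothetical placeholders, not a real API.

def translate(text: str, target_lang: str) -> str:
    """Placeholder: translate input text into the target language."""
    # A real system would call a machine-translation model here.
    return f"[{target_lang}] {text}"

def synthesize_voice(text: str, speaker_id: str) -> list:
    """Placeholder: produce audio in the user's cloned voice."""
    # A real system would run cross-language voice conversion here.
    return [0.0] * len(text)  # dummy waveform, one sample per character

def animate_avatar(audio: list, lang: str) -> dict:
    """Placeholder: derive language-specific lip-sync frames from audio."""
    return {"lang": lang, "frames": len(audio)}

def speak(text: str, speaker_id: str, target_lang: str) -> dict:
    translated = translate(text, target_lang)
    audio = synthesize_voice(translated, speaker_id)
    return animate_avatar(audio, target_lang)

result = speak("Hello, world", speaker_id="user-42", target_lang="es")
```

The point of the structure is the separation of concerns: translation, voice synthesis, and lip animation are independent stages, so any language can be plugged into the same avatar.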


The Tech

Human voices have a wide range of natural variations and fluctuations that make them very challenging for a machine to reproduce. To tackle this, the cross-language voice conversion we developed generates speech in other languages from samples of a speaker's native language. A model trained on a mix of languages disentangles the information in the user's data and generates the mel-spectrograms for the output; a vocoder then converts the result to speech. The algorithm can generate phonemes that are not found in the source dataset, replicating the user's voice as it would sound from a native speaker.
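The paragraph above describes a disentangle-then-generate architecture: separate the speaker's identity from the linguistic content, combine the two into mel-spectrogram frames, and hand those to a vocoder. The sketch below uses toy NumPy operations as stand-ins for the trained networks; the shapes, the phoneme table, and every function here are assumptions made for illustration, not the actual models.

```python
import numpy as np

# Toy sketch of the conversion pipeline: disentangle speaker identity
# from linguistic content, generate a mel-spectrogram, then vocode it
# back to a waveform. All ops are illustrative stand-ins.

rng = np.random.default_rng(0)

def encode_speaker(reference_audio: np.ndarray) -> np.ndarray:
    """Stand-in speaker encoder: summarise the voice as a fixed vector."""
    return np.tanh(reference_audio.reshape(-1, 64).mean(axis=0))

def encode_content(phonemes: list) -> np.ndarray:
    """Stand-in content encoder: one embedding per phoneme."""
    table = rng.standard_normal((128, 64))  # hypothetical phoneme table
    return table[phonemes]

def decode_mel(content: np.ndarray, speaker: np.ndarray) -> np.ndarray:
    """Combine content frames with the speaker vector into mel frames."""
    return content + speaker  # broadcast: (T, 64) + (64,)

def vocode(mel: np.ndarray, hop: int = 256) -> np.ndarray:
    """Stand-in vocoder: expand each mel frame into `hop` audio samples."""
    return np.repeat(mel.mean(axis=1), hop)

speaker = encode_speaker(rng.standard_normal(64 * 100))
mel = decode_mel(encode_content([5, 17, 42]), speaker)   # 3 phonemes
audio = vocode(mel)
```

The design choice the sketch illustrates is why unseen phonemes are reachable: because the content encoding is separate from the speaker vector, any phoneme representation can be combined with the user's voice, even one absent from that user's recordings.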


To map the newly generated language onto the avatar, the lip animation starts from sample videos of the speaker pronouncing different words and pieces of text. A neural network then detects patterns, such as what the lips look like when the speaker says different vowels and consonants, as well as the subtle micro-expressions the face makes when speaking.

This process is known as mapping: the neural network learns to associate different phonemes with the corresponding lip movements. A mapping process is also performed on the avatar to ensure that the mouth movement stays in sync with the text provided, in any language. More precisely, the neural network maps the image, outlining key features such as the eyes, nose, and ears, with a particular focus on the lips.
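The core idea of phoneme-to-lip mapping can be shown with a minimal lookup: phonemes that share a mouth shape collapse into one viseme, and the viseme sequence is what the avatar renders. The groupings below are a common textbook simplification written by hand for illustration, not the project's learned mapping.

```python
# Minimal sketch of phoneme-to-viseme mapping. The groupings are a
# hand-written simplification, not a learned model: phonemes sharing
# a mouth shape map to the same viseme (lip pose).

PHONEME_TO_VISEME = {
    # bilabials: lips pressed together
    "p": "closed", "b": "closed", "m": "closed",
    # labiodentals: lower lip against upper teeth
    "f": "teeth-lip", "v": "teeth-lip",
    # rounded vowels
    "o": "rounded", "u": "rounded",
    # open and spread vowels
    "a": "open", "e": "spread", "i": "spread",
}

def lip_frames(phonemes: list) -> list:
    """Map a phoneme sequence to the viseme sequence the avatar renders."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]

frames = lip_frames(["m", "a", "p"])  # the word "map"
```

Because the mapping is keyed on phonemes rather than on any one language's spelling, the same viseme inventory can drive the avatar's mouth for every language the voice model produces, which is what keeps the lip-sync language-aware.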

In AI we trust

A year spent in artificial intelligence is enough to make one believe in God.