
Core technology

We have developed a technology based on speech recognition that compares a learner's audio recording with recordings of native British or American English speakers. The result is a score between 0 and 100 indicating how close the learner's pronunciation is to a native speaker's. Below are a few examples, with speakers taken from the Voxforge corpus and their Ispikit assessment scores.





In addition, Ispikit detects and measures other aspects of learners' pronunciation. Depending on the platform it runs on and on your application's needs, here are the aspects captured by Ispikit:

Speech recognition
The Ispikit library includes a speech recognizer that can detect what the user said from among a set of possible inputs.
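Ispikit's own recognition interface is not documented here. As a rough, hypothetical illustration of the idea of matching an utterance against a set of possible inputs, a recognized transcript can be compared to each candidate phrase by string similarity:

```python
import difflib

def best_match(recognized, candidates):
    """Return the candidate phrase most similar to the recognized text.

    Illustrative only: a real recognizer scores audio against each phrase,
    not text against text.
    """
    scores = {c: difflib.SequenceMatcher(None, recognized.lower(), c.lower()).ratio()
              for c in candidates}
    return max(scores, key=scores.get)

print(best_match("i'd like a cup of tee",
                 ["I'd like a cup of tea", "I'd like a cup of coffee"]))
# → I'd like a cup of tea
```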
Overall nativeness score
As described above, Ispikit gives an accurate pronunciation score between 0 and 100. A native speaker reading a sentence correctly receives the maximum score, while mistakes or a strong accent lower the score accordingly.
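How Ispikit computes its score internally is not public. Purely to show what a 0–100 sentence score could look like, here is a minimal sketch that aggregates hypothetical per-word scores into one clamped sentence score:

```python
def overall_score(word_scores):
    """Aggregate hypothetical per-word scores (each 0-100) into one
    sentence-level score, clamped to the 0-100 range.

    Illustrative only; this is not Ispikit's scoring method.
    """
    if not word_scores:
        return 0
    avg = sum(word_scores) / len(word_scores)
    return round(min(100.0, max(0.0, avg)))

print(overall_score([95, 88, 70]))  # → 84
```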
Word-level feedback
Mispronounced words are caught and flagged by Ispikit. Your application can make use of these flags to indicate which words or sounds learners need to practice further.
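If your application receives per-word scores (the exact form of Ispikit's flags is not specified here), selecting the words to practice can be as simple as a threshold check. A hypothetical sketch:

```python
def flag_words(words, scores, threshold=60):
    """Return the words whose hypothetical pronunciation score falls
    below the threshold; these are candidates for further practice."""
    return [w for w, s in zip(words, scores) if s < threshold]

print(flag_words(["the", "squirrel", "ran"], [90, 45, 88]))  # → ['squirrel']
```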
Missing words
If a learner skips words while reading, Ispikit detects and indicates this.
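Conceptually, skipped words can be found by aligning the expected prompt against the recognized word sequence. A minimal sketch using sequence alignment (not Ispikit's implementation):

```python
import difflib

def missing_words(expected, spoken):
    """Words present in the prompt but absent from the aligned
    recognition result. Illustrative only."""
    sm = difflib.SequenceMatcher(None, expected, spoken)
    missing = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "delete":  # words in `expected` with no counterpart
            missing.extend(expected[i1:i2])
    return missing

print(missing_words(["the", "quick", "brown", "fox"],
                    ["the", "brown", "fox"]))  # → ['quick']
```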
Intonation
The pitch contour of the learner's voice is measured in real time and sent to the application. It can be plotted, compared with a model intonation, or analyzed with your own technique.
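One simple way an application could compare a learner's contour with a model one (an assumption about how you might use the data, not part of Ispikit) is Pearson correlation, which matches the shape of the contour while ignoring absolute pitch:

```python
def contour_similarity(learner, model):
    """Pearson correlation between two equal-length pitch contours (Hz).

    Returns 1.0 for identical shapes (e.g. both rising), -1.0 for
    opposite shapes. Assumes neither contour is perfectly flat.
    """
    n = len(learner)
    assert n == len(model) and n > 1
    ml = sum(learner) / n
    mm = sum(model) / n
    cov = sum((a - ml) * (b - mm) for a, b in zip(learner, model))
    sd_l = sum((a - ml) ** 2 for a in learner) ** 0.5
    sd_m = sum((b - mm) ** 2 for b in model) ** 0.5
    return cov / (sd_l * sd_m)

rising = [100, 110, 120, 130]
print(round(contour_similarity(rising, [200, 210, 220, 230]), 2))  # → 1.0
```

A lower-pitched voice with the same rising pattern still scores 1.0, which is usually what you want when judging intonation rather than voice register.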
Speech tempo
Speech tempo is also measured and can be used to give learners feedback on their fluency.
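Whatever form Ispikit reports tempo in, a common way to present it to learners is words per minute, computed from the word count and the utterance duration:

```python
def words_per_minute(word_count, duration_seconds):
    """Speech tempo expressed as words per minute."""
    return word_count / duration_seconds * 60.0

print(words_per_minute(12, 6.0))  # → 120.0
```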
Waveform
Waveform data are sent to the application, which can use them, for instance, to provide a fancy user interface. Raw waveform data have no obvious meaning for language learning.
Audio volume
The volume of the audio input is measured in real time. This can be used to show learners that the application is listening, or to provide a warning if the input level is too low or too high.
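As a sketch of what such a level check might look like on the application side (the thresholds below are arbitrary assumptions, and this is not Ispikit's API), volume can be derived as the RMS of normalized samples:

```python
def rms_level(samples):
    """Root-mean-square level of audio samples normalized to [-1, 1]."""
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

def level_warning(rms, low=0.01, high=0.9):
    """Return a warning string for out-of-range input, else None.
    The `low`/`high` thresholds are illustrative assumptions."""
    if rms < low:
        return "input too quiet"
    if rms > high:
        return "input too loud"
    return None

print(rms_level([0.5, -0.5]))       # → 0.5
print(level_warning(0.005))         # → input too quiet
```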
Background noise
The level of background noise is measured and can be used to suggest that learners change their audio settings or hardware, or move to a quieter environment.
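One common way to estimate a noise floor (shown here as a generic sketch, not Ispikit's method) is to average the quietest fraction of frame levels, since speech pauses expose the ambient noise:

```python
def noise_floor(frame_rms, fraction=0.1):
    """Estimate background noise as the mean RMS of the quietest frames.

    `fraction` (an illustrative default) controls how many of the
    lowest-energy frames are averaged.
    """
    quiet = sorted(frame_rms)[:max(1, int(len(frame_rms) * fraction))]
    return sum(quiet) / len(quiet)

levels = [0.02, 0.5, 0.03, 0.6, 0.02, 0.55, 0.02, 0.5, 0.03, 0.6]
print(noise_floor(levels))  # → 0.02
```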
End-pointing
Ispikit detects when learners reach the end of a sentence and notifies the application so that it can stop recording automatically. You don't need VAD (Voice Activity Detection), and learners don't need to remember to press a Stop button.
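To give a feel for what end-pointing does (this sketch is a naive energy-based version, far simpler than what a recognizer-driven end-pointer like Ispikit's can do), recording can stop once a run of trailing frames falls below a silence level:

```python
def end_of_speech(frame_rms, silence=0.02, trailing_frames=5):
    """True once the last `trailing_frames` frames are all below the
    `silence` level. Both thresholds are illustrative assumptions."""
    if len(frame_rms) < trailing_frames:
        return False
    return all(r < silence for r in frame_rms[-trailing_frames:])

frames = [0.3, 0.4, 0.3, 0.01, 0.01, 0.01, 0.01, 0.01]
print(end_of_speech(frames))  # → True
```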