Apple lays out the details behind Siri’s speech generation

The Register:

At an academic speech tech conference today (Thursday), Apple researchers will present some of the classic building blocks behind the voice generation of the Siri assistant.

The conference is this week’s Interspeech 2017, in Stockholm.

From the paper Apple presented:

There are two mainstream techniques for industry production development, namely waveform concatenation (i.e. unit selection) and statistical parametric speech synthesis (SPSS). Given a sequence of text input, unit selection directly assembles waveform segments to produce synthetic speech, while SPSS predicts synthetic speech from trained acoustic models. Unit selection typically produces more natural-sounding speech than SPSS, provided the database used has sufficient high quality audio material.

This is a bit jargony, but think of unit selection as having access to a bunch of pre-recorded speech components, and assembling them to make syllables and, from those, words.
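To make that concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the unit names, the tiny inventory, and the integers standing in for audio samples. A real system stores recorded audio and has many candidate recordings per unit to choose from.

```python
# Toy inventory: each unit name maps to one pre-recorded "waveform" snippet.
# Names and sample values are invented for illustration only.
INVENTORY = {
    "HH": [1, 2],
    "EH": [3, 4, 5],
    "L": [6],
    "OW": [7, 8],
}


def synthesize(units):
    """Concatenate the stored snippet for each requested unit, in order."""
    waveform = []
    for unit in units:
        waveform.extend(INVENTORY[unit])
    return waveform


# Assemble a word from its component units.
hello = synthesize(["HH", "EH", "L", "OW"])
```

The interesting part, which this sketch skips entirely, is deciding which recording of each unit to use when there are many to pick from.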

On Apple’s approach with Siri:

Most recently, much work has centered on using a statistical model to predict acoustic and prosodic parameters for synthesis and then using these predictions to set the costs in a unit selection system – this is known as hybrid unit selection.

Again, a bit jargony, but Apple’s approach (as I read it, and I’m no expert) is the unit selection approach described above, backed by a learning system that helps make smart decisions about which underlying sounds to use in constructing specific voicings.
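Here is a rough Python sketch of how I picture that selection step. This is not Apple's code, and the details are my own assumptions: each recorded unit is reduced to a hypothetical (start pitch, end pitch) pair, the "statistical model" is replaced by a plain list of predicted pitch values, and `select_units`, `target_cost`, and `join_cost` are names I made up. The idea is to pick one recording per slot so that each unit stays close to the prediction (target cost) and neighboring units join smoothly (join cost), using a standard dynamic programming (Viterbi-style) search.

```python
def select_units(targets, candidates, join_weight=1.0):
    """Pick one candidate per position, minimizing target cost plus join cost.

    targets: predicted pitch per position (stand-in for a model's output).
    candidates: per position, a list of (start_pitch, end_pitch) units.
    Returns (chosen candidate index per position, total cost).
    """
    def target_cost(unit, predicted):
        # How far the unit's average pitch is from the prediction.
        start, end = unit
        return abs((start + end) / 2 - predicted)

    def join_cost(prev, unit):
        # How big the pitch jump is at the seam between two units.
        return abs(prev[1] - unit[0])

    # best[i][j] = (cheapest cost of a path ending in candidate j, backpointer)
    best = [[(target_cost(u, targets[0]), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_weight * join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(u, targets[i]), back))
        best.append(row)

    # Start from the cheapest final candidate and follow backpointers.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(j)
        j = best[i][j][1]
    path.reverse()
    return path, total


targets = [100, 120]
candidates = [
    [(100, 100), (90, 110)],
    [(120, 120), (110, 130)],
]
path, total = select_units(targets, candidates)
```

In this toy case all four recordings match their predicted pitch equally well on average, so the join cost decides: the search picks the second candidate in each slot, since one ends at exactly the pitch where the other begins.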

Couple things here.

First, this is fascinating stuff. I studied voice synthesis in college and have always wanted to know more about how Siri really works. I’ve watched Siri change voicings in the move from iOS 10 to iOS 11 and wondered why Apple moved from what sounded like a specific person (Susan Bennett) to a more generic, generated voice. I love all the detail in this paper. I can only imagine that the folks at Samsung are poring over the paper right now, competitive juices flowing.

The second thing is Apple’s decision to publicly open up about such critical technology.

Richmond said Apple’s paper is not breaking new ground and “there are no big surprises” but its strategy of releasing a “modest” number of papers could “work as advertising” for talent by hinting at higher quality research and collaboration happening behind “closed doors”.

Sounds about right. The paper might go far enough to interest the folks Apple wants to hire, but without giving away the Siri secret sauce.