Who Do You Think I Am?

I asked a computer to learn the sound of human speech. What happened?

Mind. Blown.

(Listen above for the full story, or read below for a summary…)

It’s not often I hear audio that blows my mind.

Today I’ve received audio that will form the basis of a new piece of music. It was made by neural synthesis. By its nature, it’s unique and (to me at least) precious and extraordinary.

It was created using PRiSM SampleRNN, a computer-assisted compositional tool that generates new audio by ‘learning’ the characteristics of an existing corpus of sound or music.

I’m fortunate to be part of the Unsupervised project at the Royal Northern College of Music’s PRiSM research centre.

PRiSM takes a lead in interdisciplinary and reflexive research between the creative arts and the sciences with a view to making a real contribution to society, to developing new digital technology and creative practice, and to addressing fundamental questions about what it means to be human and creative today.

Creating audio by neural synthesis is quite a long process. It’s unpredictable. Consequently, it’s really exciting. You never know what you’re going to get at the end of it.

To begin the process, I made three hours’ worth of small WAV files containing samples of human speech - tiny fragments of the kind of spontaneous verbal sounds humans make.

That’s a LOT of editing.

In this three-hour dataset of audio there are small clips of speaking, breathing, laughing, shouting, whispering - all in multiple pitches and tones.
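For the curious, the slicing itself can be done in a few lines of Python. Here’s a generic sketch (the filename and the eight-second clip length are illustrative choices of mine, not the actual PRiSM SampleRNN workflow):

```python
# A generic sketch of slicing one long recording into short clips, assuming
# a source file called speech_session.wav (a hypothetical name). The actual
# PRiSM SampleRNN tooling has its own preparation steps; this just shows the idea.
import os
import soundfile as sf

os.makedirs("dataset", exist_ok=True)
audio, sample_rate = sf.read("speech_session.wav")
clip_len = 8 * sample_rate  # eight seconds per clip - an illustrative choice

for i, start in enumerate(range(0, len(audio) - clip_len + 1, clip_len)):
    sf.write(f"dataset/clip_{i:04d}.wav", audio[start:start + clip_len], sample_rate)
```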

Next, the audio files are fed into the computer. Once that’s complete, the machine learning can start; the audio is ‘learned’ by the algorithm.

This takes the computer (a super-computer, no less) about two weeks. This part of the process was overseen by the team at the University of Manchester.
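What does ‘learning’ actually mean here? Roughly this: the network is shown the audio one sample at a time and trained to guess the next sample, millions of times over. Here’s a toy sketch of the idea in Python - emphatically not the real PRiSM SampleRNN code, just a bare-bones stand-in for the concept:

```python
# A toy stand-in for the idea behind SampleRNN-style training: a recurrent
# network learns to predict the next audio sample from the ones before it.
import torch
import torch.nn as nn

class TinySampleModel(nn.Module):
    def __init__(self, hidden=128, levels=256):
        super().__init__()
        self.rnn = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, levels)  # score each quantised audio level

    def forward(self, x):
        out, _ = self.rnn(x)
        return self.head(out)

model = TinySampleModel()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Stand-in batch of 8-bit quantised audio, shape (batch, time).
waveform = torch.randint(0, 256, (4, 1024))
inputs = (waveform[:, :-1].float() / 127.5 - 1.0).unsqueeze(-1)  # scale to [-1, 1]
targets = waveform[:, 1:]  # the 'answer' at each step is simply the next sample

optimiser.zero_grad()
logits = model(inputs)
loss = loss_fn(logits.reshape(-1, 256), targets.reshape(-1))
loss.backward()
optimiser.step()
```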

If the algorithm is tweaked even slightly, the computer-generated audio output can be significantly changed.
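One concrete example of such a tweak is the sampling ‘temperature’ used during generation: turn it down and the model plays it safe; turn it up and things get wilder. A quick, hypothetical illustration (the parameter names here are my own; the real tools may expose this differently):

```python
# How one small knob changes the output: sampling 'temperature'.
import numpy as np

def sample_next(logits, temperature=1.0):
    """Draw the next quantised audio sample from the model's output scores."""
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

logits = np.random.randn(256)  # stand-in for a trained model's output
print(sample_next(logits, temperature=0.5))   # cautious: sticks to safe choices
print(sample_next(logits, temperature=1.5))   # adventurous: noisier, stranger
```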

The sounds the computer has delivered contain riches and many surprises.

It’s strange, other-worldly, oddly human but also not human. To anyone fascinated by sound, it’s bloody awesome.

When I was editing the original files for the dataset, I wanted to make it as challenging as possible for the algorithm to make assumptions about the audio; I wanted to make it difficult for the computer to ‘learn.’

Creating the initial dataset is an important part of the creative and compositional process: what you include changes the flavour of the dish, so to speak.

I edited my original audio across words and pitches; I included and excluded silences; I separated out audio centred on stable pitches and volumes, then mixed it all together. I made folders of audio containing specific tones, pitches and durations so we could alter the algorithm accordingly to change the output.

Effectively, I wanted the computer to have to work hard to give me its best attempt at ‘learning human.’

The results have blown me away.

I’ve got so many ideas as to how I’m going to use it, layer it, deconstruct it, reassemble it. I can’t wait to make a start.

The final, yet-to-be-composed piece of music will incorporate live speech and instrumentation. At its core will be the machine-learned audio the computer has generated.

The idea behind this piece has a philosophical concept at its heart.

More about that once I’ve written it…!

The music will be presented (perhaps performed) at an event in the summer.

I’m so grateful to Dr Sam Salem, Dr Christopher Melen, PRiSM, the University of Manchester, the University of Oxford and the RNCM for their help and support with this project.




If you like what I do, you can buy me a coffee! :)

