The vocoder is one of those mysterious technologies that’s far more widely used than understood. Here I explain what it is, how it works, and why you should care.
Casual music listeners know the vocoder best as a way to make the robot voice effect that Daft Punk uses all the time.
Here’s Huston Singletary demonstrating the vocoder in Ableton Live.
You may be surprised to learn that you use a vocoder every time you talk on your cell phone. Also, the vocoder gave rise to Auto-Tune, which, love it or hate it, is the defining sound of contemporary popular music. Let’s dive in!
To understand the vocoder, first you have to know a bit about how sound works. Jack Schaedler made this delightful interactive that will get you started. I wrote my own basic explanation of what sound is and how you digitize it. The takeaway is this: as things in the world vibrate, they make the air pressure fluctuate. Your ear is a very precise tool for measuring momentary fluctuations in air pressure, and it’s able to decode these changes as sound. You can also use microphones to convert air pressure fluctuations into electrical current fluctuations, which you can then transmit, amplify, record, and so on.
In the 1930s, Bell Labs began researching ways to transmit the human voice using less bandwidth. You can store a sound as a series of numbers by taking regular readings of the voltage coming off a microphone. The problem is that you need a whole lot of numbers to capture the sound accurately. The standard for compact discs calls for 44,100 readings per second, and each reading takes up two bytes of memory. That adds up to more than five megabytes of data per minute of audio, which is far more than you can devote to a phone call even in 2017, much less with 1930s technology.
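To see where that figure comes from, here is the arithmetic as a quick sketch (mono audio at the CD numbers above; stereo doubles it):

```python
# Back-of-the-envelope data rate for uncompressed, CD-quality mono audio.
SAMPLE_RATE = 44_100          # readings (samples) per second
BYTES_PER_SAMPLE = 2          # each reading is a 16-bit number
SECONDS_PER_MINUTE = 60

bytes_per_minute = SAMPLE_RATE * BYTES_PER_SAMPLE * SECONDS_PER_MINUTE
print(f"{bytes_per_minute:,} bytes ≈ {bytes_per_minute / 1_000_000:.1f} MB per minute")
# 5,292,000 bytes ≈ 5.3 MB per minute (double that for stereo)
```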
Fortunately, you can use math to make the problem more tractable. A sound wave is an oscillating signal, and Jean-Baptiste Joseph Fourier gave us some super useful ways to analyze and express such signals mathematically.
Fourier realized that you can express any periodic waveform, no matter how complicated, as the sum of a bunch of simple sine waves. That’s helpful, because sine waves are easy to express and manipulate mathematically. Breaking down a waveform into its simple sine wave components is called a Fourier transform. Here’s an example, a square wave broken down into a sum of sines:
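Here is that example as a little code sketch (my own illustration; the fundamental frequency and the number of harmonics are arbitrary choices), summing odd sine harmonics into an approximate square wave:

```python
import numpy as np

sr = 44_100                        # sample rate in Hz
f0 = 220.0                         # fundamental frequency in Hz (arbitrary)
t = np.arange(sr) / sr             # one second of time points

# A square wave's Fourier series contains only odd harmonics, each weighted by 4 / (pi * k).
square = np.zeros_like(t)
for k in range(1, 20, 2):          # odd harmonics: 1, 3, 5, ..., 19
    square += (4 / (np.pi * k)) * np.sin(2 * np.pi * k * f0 * t)

# The more harmonics you add, the closer this sum gets to a true square wave.
```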
You can also represent sine waves as the path swept out by a clock hand going around and around. The sci-fi-sounding term for one of these clock hands is a phasor.
If you chain a bunch of phasors together, tip to tail, they can draw any crazy waveform you want. This is a difficult idea to express verbally, but it makes more sense if you play with Jack Schaedler’s cool interactive. Click the image below.
“Phasor magnitudes” is the daunting math term for the sizes of the clock hands. If you make a list of the phasor magnitudes (along with the phases, which say where each hand starts), you get a very nice and compact numerical expression of your super complicated waveform. This is vastly more efficient and technologically tractable than trying to store a billion individual voltage readings.
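In code, that list of magnitudes is just the absolute values of a Fourier transform. A minimal sketch using NumPy, with a made-up test signal of two sine waves:

```python
import numpy as np

sr = 44_100
t = np.arange(sr) / sr
# A "complicated" waveform: two sine waves added together (any signal would do).
signal = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)

spectrum = np.fft.rfft(signal)               # the complex phasors
magnitudes = np.abs(spectrum)                # the length of each clock hand
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)

# The two biggest clock hands sit at the two frequencies we put in:
top_two = np.argsort(magnitudes)[-2:]
print(freqs[top_two])                        # [660. 220.]
```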
Click the image below to do a Fourier transform of your voice using the Chrome Music Lab.
This plot of the Fourier transform is called a spectrogram. Time goes from left to right. The vertical axis represents frequency, what musicians call pitch. Think of the lower frequencies as phasors spinning around more slowly, and the higher frequencies as phasors spinning around faster. The colors show amplitude, also known as loudness. Warmer colors mean the phasors are bigger, and cooler colors mean the phasors are smaller.
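If you want to compute a spectrogram of a recording outside the browser, SciPy will do the windowed Fourier transforms for you. A rough sketch, assuming a mono recording whose file name (“voice.wav”) is just a placeholder:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

sr, audio = wavfile.read("voice.wav")        # placeholder name for a mono recording
f, t, power = spectrogram(audio, fs=sr)      # frequency bins, time frames, energy per bin

plt.pcolormesh(t, f, 10 * np.log10(power + 1e-12))  # decibels read better than raw power
plt.xlabel("Time (s)")                       # time runs left to right
plt.ylabel("Frequency (Hz)")                 # low pitches at the bottom, high at the top
plt.show()
```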
Now, at last, you’re ready to understand what the vocoder is and how it works. The earliest version was developed by Homer Dudley, a research physicist at Bell Laboratories in New Jersey. The name is a contraction of “voice encoder.” Here’s a vocoder built for Kraftwerk in the 1970s:
To encode speech, the vocoder measures how much energy there is within each of a set of frequency bands, and stores the readings as a list of numbers. The more frequency bands you measure, and the narrower they are, the more accurate your encoding is going to be. This is intriguingly similar to the way your ear detects sound: your inner ear contains a row of little hairs, each of which vibrates most sensitively within a particular frequency band. By detecting the vibrations of each hair, your ear vocodes the pressure waves hitting your eardrums.
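Here is roughly what that band-by-band measurement looks like in code. This is a sketch, not Dudley’s circuit: it uses SciPy’s Butterworth filters, and the number of bands, their edges, and the envelope smoothing are all arbitrary choices.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def band_energies(speech, sr, edges):
    """Split a signal into frequency bands and follow each band's energy over time."""
    smooth = butter(2, 50, btype="lowpass", fs=sr, output="sos")   # envelope smoother
    envelopes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        band = sosfilt(sos, speech)
        # Rectify, then low-pass filter, to get a slowly varying "how loud is this band" reading.
        envelopes.append(sosfilt(smooth, np.abs(band)))
    return np.array(envelopes)        # one row of readings per frequency band

sr = 16_000
edges = np.geomspace(100, 4000, num=9)                 # 8 bands (real vocoders vary)
speech = np.random.default_rng(0).standard_normal(sr)  # stand-in for a real recording
print(band_energies(speech, sr, edges).shape)          # (8, 16000)
```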
Before you can play speech back from the vocoder, you need a synthesizer that can produce a sound with a lot of different frequencies in it. (This is called the “carrier.”) White noise works well for this purpose, since it includes all the frequencies. The vocoder filters the carrier band by band, boosting or cutting each frequency region according to the readings it took, and you get an intelligible facsimile of the original speech. The key thing to understand here is that the vocoder is not recording speech and playing it back; it’s synthesizing speech from scratch.
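Putting the analysis and playback halves together, here is a toy channel vocoder along the same lines (again a sketch of the general idea, not any particular hardware or plugin): it measures a modulator’s band energies and uses them to shape a white-noise carrier, band by band.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def channel_vocoder(modulator, sr, n_bands=16, f_lo=100.0, f_hi=4000.0):
    """Shape a white-noise carrier with the band-by-band energy of the modulator."""
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)
    carrier = np.random.default_rng(0).standard_normal(len(modulator))  # white noise
    smooth = butter(2, 50, btype="lowpass", fs=sr, output="sos")
    out = np.zeros(len(modulator))
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        envelope = sosfilt(smooth, np.abs(sosfilt(band, modulator)))    # the stored "readings"
        out += sosfilt(band, carrier) * envelope                        # reimpose them on the carrier
    return out / np.max(np.abs(out))   # normalize so the result doesn't clip

sr = 16_000
voice = np.random.default_rng(1).standard_normal(sr)   # stand-in for a real voice recording
robot = channel_vocoder(voice, sr)
# Swap the noise carrier for a synth waveform and you get the musical vocoder effect below.
```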
Cell phones don’t record your voice and transmit the raw audio. That would take way too much bandwidth. Instead, they send compact vocoder-style readings. When you listen to someone’s voice on your phone, the phone synthesizes it by generating a carrier signal and filtering it according to the readings it’s receiving.
Musicians very quickly realized that if you used musical sounds instead of noise as the basis for vocoder synthesis, you could make a lot of weird and interesting things happen.
Here’s Herbie Hancock demonstrating the vocoder. The sound is being produced by the synth he’s playing. The synth’s sound is filtered based on readings of his voice’s frequency content.
Herbie isn’t much of a singer, but he’s one of history’s great piano and synth players. You can see why he liked the idea of being able to “sing” using his keyboard chops.
A lot of people think that Auto-Tune is a vocoder. That’s sort of true. Auto-Tune is based on the phase vocoder, which is a computer algorithm rather than a physical “thing.” The phase vocoder chops a signal up into short, overlapping chunks called windows, then does a Fourier transform on each window. Think of it as a vocoder that can change its settings every couple of milliseconds.
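The windowing step can be sketched in a few lines. This is a bare-bones short-time Fourier transform with arbitrary window and hop sizes; a real phase vocoder also keeps track of phase from one window to the next so it can cleanly resynthesize audio.

```python
import numpy as np

def stft(signal, window_size=1024, hop=256):
    """Chop the signal into short overlapping windows and Fourier-transform each one."""
    window = np.hanning(window_size)
    frames = []
    for start in range(0, len(signal) - window_size, hop):
        frames.append(np.fft.rfft(signal[start:start + window_size] * window))
    return np.array(frames)            # one row of complex phasors per window

sr = 44_100
t = np.arange(sr) / sr
frames = stft(np.sin(2 * np.pi * 440 * t))
print(frames.shape)                    # (169, 513): 169 overlapping windows, 513 frequency bins each
# At a 256-sample hop, each window starts about 6 ms after the previous one,
# so the analysis gets to "change its settings" every few milliseconds.
```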
Enter Andy Hildebrand, a former oil industry engineer whose work with the phase vocoder inadvertently transformed the sound of popular music.
Hildebrand used the Fourier transform to help Exxon figure out where oil might be, via a technique called reflection seismology. You create a big sound wave in the ground, often by blowing up a bunch of dynamite. Then you measure the sound waves that get reflected back to the surface. By analyzing the sound waves, you can deduce what kinds of rock they passed through and bounced off of.
After leaving the oil industry, Hildebrand began thinking about different ways to use his signal processing expertise for musical purposes. The pop music industry had long wanted a way to correct a singer’s pitch automatically, since doing it by hand in the studio was a labor-intensive and expensive process. Hildebrand figured out how to do very fast and computationally efficient phase vocoding, enabling a computer to measure the pitch of a note and resynthesize it sharper or flatter in real time. Thus was born Auto-Tune.
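The core move, measure the pitch and nudge it to the nearest note, can be sketched like this. It is not Hildebrand’s actual algorithm: it leans on librosa’s YIN pitch tracker and pitch shifter, treats the whole file as one sustained note, and runs offline rather than in real time. The file name is a placeholder.

```python
import numpy as np
import librosa

y, sr = librosa.load("vocal_take.wav", sr=None)      # placeholder name for a vocal recording

# 1. Measure the pitch (YIN is a standard pitch-detection algorithm).
f0 = librosa.yin(y, fmin=80, fmax=800, sr=sr)
measured_hz = np.median(f0)                          # pretend the take is one sustained note

# 2. Find the nearest note on the equal-tempered scale.
midi = librosa.hz_to_midi(measured_hz)
nearest_note = np.round(midi)

# 3. Resynthesize the audio sharper or flatter by the difference.
corrected = librosa.effects.pitch_shift(y, sr=sr, n_steps=float(nearest_note - midi))
```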
Auto-Tune was meant to be an invisible, behind-the-scenes tool. It has a bunch of parameters for adjusting the amount and speed of the pitch correction, so that you can fix wrong notes without changing the timbre of the singer’s voice too much. But in 1998, while working on a Cher album, the producers Mark Taylor and Brian Rawling discovered something: if they turned the correction speed all the way up, Auto-Tune snapped Cher’s notes into place instantly and perfectly, making her voice sound blocky and robotic. Listen to the words “But I can’t break through” to hear the Cher Effect in action.
Eventually, other producers figured out how to do the Cher Effect, and that led to the sound you hear every time you turn on the radio. If you want to try Auto-Tune yourself, it’s available in simplified form in a browser-based music app called Soundtrap.
Once you can change the pitch of your voice, there’s no reason why you can’t make copies of it and change their pitch as well, thus creating effortless artificial harmony. Hear Kanye West as a robotic choir in this song by Chance The Rapper:
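The harmony trick amounts to pitch-shifting copies of the same take and mixing them back in. A sketch, with intervals chosen arbitrarily and librosa’s pitch shifter standing in for whatever tools the producers actually use:

```python
import librosa

y, sr = librosa.load("lead_vocal.wav", sr=None)      # placeholder name for a lead vocal

# Copies of the lead, shifted up a major third (4 semitones) and a perfect fifth (7),
# mixed back in with the original: instant three-part robot choir.
third = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)
fifth = librosa.effects.pitch_shift(y, sr=sr, n_steps=7)
choir = (y + third + fifth) / 3.0
```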
Here you can compare a recording of David Bowie’s voice to vocoded, Auto-Tuned and automatically harmonized versions:
Like the robotic voice effects that preceded it, Auto-Tune expresses the alienation and disembodiment of technology. This makes a lot of my fellow musicians angry. But clearly, it’s speaking to the mainstream pop audience, and why not? Our lived reality is so technologically mediated and alienated, why shouldn’t our music be too?