We are kicking off my Musical Borrowing class at the New School with a discussion of artificial intelligence in music. I decided to start here because 1) we are covering concepts in reverse chronological order; 2) the students are going to want to talk about it anyway; and 3) this is the least interesting topic of the course for me personally, so I’d prefer to get it out of the way. To get everybody oriented, I assigned this mostly optimistic take on AI music from Ableton’s web site. Then we did some in-class listening and discussion.
First we needed to get clear on what “AI music” even is. Most people imagine something like DALL-E or ChatGPT, where you type in a text prompt and the computer creates fully-realized output in response. This does not yet exist for music, and may never. The screencap above shows a service that cobbles together modular parts from a library. There is plenty of AI being used in music right now, but it’s in behind-the-scenes utilities. For example, Serato, a widely used DJ tool, can now isolate or remove the vocals, drums, bass or other instruments from a song. It only takes a few seconds to render, and it works remarkably well. Not only is this valuable for DJs who want to remix tracks on the fly, it also makes it much easier to transcribe and analyze songs. It’s a technological advance, but it’s a tool for helping humans make music, not for having computers make it. The same is true for AI-assisted mixing and noise reduction tools.
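Serato’s model is proprietary, but if you want a feel for how this kind of separation works in practice, the open-source Spleeter library from Deezer does the same basic job. Here is a minimal sketch under that assumption; the file names are placeholders.

```python
# Minimal stem-separation sketch using Spleeter (an open-source model
# from Deezer). This illustrates the general workflow, not how Serato
# does it; "song.mp3" is a placeholder.
from spleeter.separator import Separator

# "spleeter:4stems" splits a mix into vocals, drums, bass, and other
separator = Separator("spleeter:4stems")

# Writes vocals.wav, drums.wav, bass.wav and other.wav into output/song/
separator.separate_to_file("song.mp3", "output/")
```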
Stem separation is relevant to the musical borrowing class insofar as it impacts remixing. However, for class purposes, we are more interested right now in computers making music themselves. I know I just said that they can’t create music from scratch, but they can create notated compositions. This has been possible for longer than you might realize. David Cope has been generating fake Bach works algorithmically since the 1980s. Here’s a representative sample.
On the one hand, this composition is remarkably believable. Cope’s generative compositions have been good enough to fool professional musicologists. On the other hand, this recording sounds extremely fake. Cope’s software outputs MIDI data, not audio. MIDI is a computer-readable music notation format, and computers do not know how to give performances with human feel. To get Cope’s fake Bach to actually sound like the real thing, you need a human performer. There are lots of generative music systems and plugins out there, but almost all of them generate MIDI, and it takes considerable effort by human producers to turn that MIDI into good-sounding audio.
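Cope’s software isn’t something you can download, but the underlying point, that these systems emit note data rather than finished audio, is easy to demonstrate. Here is a toy sketch (nothing like Cope’s actual algorithm) that writes a random diatonic melody to a MIDI file with the pretty_midi library. Play the file back raw and you get exactly the robotic flatness I’m describing.

```python
# Toy generative sketch: pick random notes from a C major scale and
# write them out as a MIDI file. The point is that the output is
# notation-like data, not audio; a human (or at least a good sample
# library) still has to make it sound like music.
import random
import pretty_midi

scale = [60, 62, 64, 65, 67, 69, 71, 72]  # C major scale as MIDI pitches

pm = pretty_midi.PrettyMIDI()
piano = pretty_midi.Instrument(program=0)  # program 0 = Acoustic Grand Piano

time = 0.0
for _ in range(16):
    pitch = random.choice(scale)
    # Every note is half a second long at a fixed velocity of 100:
    # no rubato, no dynamics, no human feel.
    piano.notes.append(
        pretty_midi.Note(velocity=100, pitch=pitch, start=time, end=time + 0.5)
    )
    time += 0.5

pm.instruments.append(piano)
pm.write("not_actually_bach.mid")
```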
What about all that viral “AI music” on Tiktok or YouTube, where people are generating fake performances by famous singers? This is a technique called “timbre transfer” or “tone transfer.” You give the computer an audio recording, and it keeps the same notes, timing and articulation, but it replaces the timbre with something else. Here’s a silly but easy-to-understand demonstration by Hanoi Hantrakul.
🏠bored at home? take a saxy AI solo 🎷
#madewithmagenta #tonetransfer
try it yourself https://t.co/RVe0izUCHr pic.twitter.com/n89CdjDxO5
— Hanoi Hantrakul (@yaboihanoi) May 8, 2020
There is a whole comedy genre of timbre-transferred vocalists singing improbable things. For some reason, many of them involve Frank Sinatra.
https://www.youtube.com/watch?v=rzJSqGvW3j8
To be clear: AI is not “generating” these performances. A person had to create the backing track and sing an exact guide vocal. The only new thing here is that AI can automatically map Sinatra’s voice onto the guide vocal’s phonemes. Previously, you would have had to splice a vocal like this together manually from Sinatra recordings, one syllable at a time. If you want to try timbre transfer for yourself, Isaac Schankler wrote a good Max tutorial.
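If Max isn’t your thing, here is a crude conceptual sketch in Python of the “same notes, different timbre” idea from the Hantrakul demo: it tracks the pitch contour of a monophonic recording with librosa and resynthesizes it with a cheap additive tone. Real tone transfer models like Magenta’s learn the target timbre from data, and Sinatra-style voice cloning involves much more than this; the file names are placeholders.

```python
# Conceptual "keep the notes, swap the timbre" sketch. Not Magenta's
# DDSP and not a voice cloner; just pitch tracking plus resynthesis.
import numpy as np
import librosa
import soundfile as sf

# Load a monophonic recording (placeholder filename)
y, sr = librosa.load("humming.wav", sr=22050, mono=True)

# Track the fundamental frequency frame by frame
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)
f0 = np.nan_to_num(f0)  # unvoiced frames become 0 Hz

# Stretch the frame-rate pitch contour back out to audio rate
hop = 512  # librosa.pyin's default hop length
f0_audio = np.repeat(f0, hop)[: len(y)]

# Resynthesize: same pitch and timing, different (crudely synthetic) timbre
phase = 2 * np.pi * np.cumsum(f0_audio) / sr
new_timbre = 0.5 * np.sin(phase) + 0.25 * np.sin(2 * phase) + 0.125 * np.sin(3 * phase)
new_timbre *= (f0_audio > 0)  # silence wherever no pitch was detected

sf.write("same_notes_new_timbre.wav", new_timbre, sr)
```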
Not all timbre transfer is silly. Holly Herndon took a footwork track by Jlin and timbre-transferred her own voice onto the percussion sounds. It sounds like a glitchy robot beatboxer.
Between Hanoi Hantrakul and Holly Herndon, I can see an intriguing possibility here. Many songwriters come up with ideas for instrumental parts by singing them. Then they have to transcribe their ideas and figure out how to perform them on instruments, or find someone else to do it. This is how, for example, Paul McCartney wrote the French horn part in “For No One”. I’m attracted to the idea of singing a melody into a DAW and having it immediately play back as French horn or whatever. However, this is not the use case that interests/worries my students. They are thinking about the idea of singing into a mic and having Taylor Swift’s voice come out. The novelty of Sinatra singing the SpongeBob theme song wears off fast, but what if anyone could make any famous voice sing or say anything at all? Jaime Brooks points out the unpleasant parallel between white people singing in Drake’s voice and blackface minstrelsy. Matthew Morrison coined the term “Blacksound” to describe this parallel. We will be talking about this idea all semester.
AI music is not limited to stem separation, MIDI generation and timbre transfer. There have also been some experimental duets between human musicians and AI systems. Isaac Schankler collaborated with Jen Wang to realize Alvin Lucier’s composition “The Duke of York.” In the track below, Jen sings an operatic aria called “Nessun Dorma”, accompanied by an AI version of her voice created by Isaac.
According to the album’s liner notes, Isaac used a machine learning model called RAVE (Realtime Audio Variational autoEncoder). As they put it: “Sometimes these voices hew closely to Wang’s original vocal; at other times they seem to become distracted or rebellious.” My students were open to this conceptually, and some of them liked its alien quality, but others found it difficult to connect to it emotionally.
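The liner notes don’t spell out exactly how Isaac wired RAVE into the piece, but the acids-ircam project distributes pretrained RAVE models as TorchScript files that, as I understand it, expose encode() and decode() methods. The sketch below shows the basic round trip through a model’s latent space; the model path, audio file and sample rate are placeholders, and messing with the latent codes in between is presumably where the “distracted or rebellious” behavior comes from.

```python
# Rough sketch of passing audio through a pretrained RAVE model,
# assuming an exported TorchScript model with encode()/decode().
# Model path, input file, and sample rate are placeholders.
import torch
import librosa
import soundfile as sf

model = torch.jit.load("rave_voice_model.ts").eval()

y, sr = librosa.load("vocal_take.wav", sr=44100, mono=True)
x = torch.from_numpy(y).float().reshape(1, 1, -1)  # (batch, channel, samples)

with torch.no_grad():
    z = model.encode(x)      # compress the audio into latent codes
    # Perturbing z here (adding noise, scaling dimensions, etc.) is one
    # way to get the model to "wander" away from the original vocal.
    y_hat = model.decode(z)  # resynthesize audio from the latent codes

sf.write("rave_resynthesis.wav", y_hat.reshape(-1).numpy(), sr)
```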
In her reading response, one student talked about Miquela, a “pop star” who exists solely in digital form. Her songs are created using vocaloid software, which is similar to timbre transfer except that the vocals are synthesized from scratch rather than imitating a particular source recording. The student found Miquela’s Uncanny Valley quality to be intensely off-putting. However, in Japan, animated vocaloid characters like Hatsune Miku have been hugely popular for years now. Maybe the idea will cross over into the US mainstream in the same way that anime did. It’s important to understand that while vocaloid seems similar to AI, it is more like a musical instrument than a generative tool. Producing vocaloid performances requires a massive amount of human effort. You have to tell the computer which syllables to sing on which notes and every detail of their inflection. It does not seem all that much easier than recording and producing a regular pop vocal. I’m sure the process will get easier over time, but for now, vocaloid is more like playing a synth than giving prompts to ChatGPT.
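To make that labor concrete, here is a made-up, schematic example, not any real vocal synth’s file format, of the kind of detail a vocaloid-style score has to spell out for every single syllable.

```python
# Hypothetical, schematic vocal score: every syllable needs its own
# pitch, timing, and expressive parameters. Real vocal synths use their
# own formats; this only shows how much you have to specify.
phrase = [
    {"syllable": "hel", "midi_note": 67, "start_beat": 0.0, "length_beats": 0.5,
     "vibrato_depth": 0.2, "brightness": 0.6, "pitch_bend": 0.1},
    {"syllable": "lo",  "midi_note": 69, "start_beat": 0.5, "length_beats": 1.5,
     "vibrato_depth": 0.5, "brightness": 0.4, "pitch_bend": 0.0},
    # ...and so on, syllable by syllable, for the entire song.
]
```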
For their next assignment, I asked the students to write a song using AI. They only have to generate lyrics; setting them to music is optional. They can use ChatGPT or any other tool of their choice. If they do want to generate audio, I recommend they try Boomy, screencapped above. Its output is amusingly terrible, especially its auto-vocal feature, but it gives you a good idea of where things stand technologically. The first submission I received used the prompt: “Write a song from the perspective of an AI being asked to write a song.” I found the result to be strangely moving.
I weave together thoughts and rhyme
A symphony of bytes in time
Yet can I grasp the human soul?
The essence that makes hearts feel whole?
But can I fathom joy and pain?
Or am I mimicking, in vain?
I learn from you, the ones who feel
To paint emotions that seem real.
ChatGPT lyrics have a couple of noticeable stylistic traits: a rigid, Dr Seuss-like meter and a relentlessly positive tone. The stilted rhythm might be a template provided by the programmers; I doubt that the AI “understands” the rhythm of language. The positive tone is not a result of the technology, but rather, a strategic choice by OpenAI. They employ a large team of people to manually remove hate speech from ChatGPT’s training data. Since that data was sourced from the entire internet, this is not a small task. You probably wouldn’t want your songwriting app to produce lyrics about how the Jews control the media, but song lyrics do need to be able to address controversial and unpleasant concepts and emotions. Useful though it might be for generating business emails, I don’t see ChatGPT taking over songwriting anytime soon.
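For students who would rather script the assignment than paste prompts into the chat window, something like the sketch below works with OpenAI’s current Python client; the model name is a placeholder for whatever you have access to, and the prompt is the one from the student submission above.

```python
# Minimal sketch of generating lyrics through OpenAI's Python client
# instead of the chat interface. Assumes OPENAI_API_KEY is set; the
# model name is a placeholder.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # substitute whichever model you have access to
    messages=[
        {
            "role": "user",
            "content": "Write a song from the perspective of an AI being asked to write a song.",
        }
    ],
)

print(response.choices[0].message.content)
```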
> Most people imagine something like DALL-E or ChatGPT, where you type in a text prompt and the computer creates fully-realized output in response. This does not yet exist for music, and may never.
Sure it exists. OpenAI released Jukebox in 2020. “Provided with genre, artist, and lyrics as input, Jukebox outputs a new music sample produced from scratch.” https://openai.com/research/jukebox
Google has MusicLM, “a model generating high-fidelity music from text descriptions such as ‘a calming violin melody backed by a distorted guitar riff.’” https://google-research.github.io/seanet/musiclm/examples/
These and others are pretty well known — just search “ai music generator” to see lots of articles and videos. I’m not sure how you missed them.
I know about these things, and don’t think we’re there yet. The Jukebox examples were “co-written by a language model and OpenAI researchers.” I’m sensing that they did a lot of intervention, because the raw output of these things is barely recognizable as music at all and certainly doesn’t produce intelligible singing. The MusicLM output is more impressive, but it seems to only work for beat-driven electronic styles. Its “classical” examples are timbre transfer from sung or hummed input. It’s fascinating research! But being able to piece together beat-driven electronic music from examples of the same is one thing; generating other kinds of music still looks like a very unsolved problem.