Is AI-generated music the sound of the future?

A look at how AI is going to change the way music is created and how we listen to it.

Hey Hitchhikers,

First, a huge welcome to the 250 new subscribers who hitched a ride with us this week! Second, I’ve decided to stick with this new format of doing deep dives every week instead of highlights and see how it goes. The main reason is that it’s easier to spot trends and provide insights when my posts are thematic than when I’m covering three highlights from the week in AI that may not all be connected. I hope you find the format more valuable too. If you do just want highlights, there are plenty of great AI newsletters that have you covered! With that out of the way, let’s jump into this week’s topic…

Many are predicting that the entertainment industry will experience a creative renaissance in the next few years, with Generative AI at the heart of this transformation. From music production to movie making, AI enables artists to push the boundaries of creativity and redefine what's possible. In this week's post, I'm exploring how AI will impact the music industry and what it means for the future of our listening experience.

But before I jump into this post, if you enjoy The Hitchhiker’s Guide to AI, please share this newsletter with your friends so we can reach 5,000 subscribers by June. We have just 1,750 to go!


🎧 Is AI-generated music the sound of the future?

The music industry is on the cusp of a transformative shift thanks to artificial intelligence. As generative AI models become increasingly capable of creating realistic music and audio, how we think about music, ownership, and value is being challenged.

I’ve been tracking advancements in generative audio and music since my first newsletter in January, where I talked about Harmonai and Riffusion, the most promising generative music projects at the time. Then, in February, multiple generative audio products were announced, including Google’s MusicLM. I covered these extensively in my post titled Is Google still the leader in AI? Here’s a reminder of these state-of-the-art generative audio projects:

  • MusicLM from Google is a text-to-music model that can generate full-length songs, musical instrument sounds, and soundscapes based on a story.
  • Make-An-Audio from ByteDance AI Lab is an audio diffusion model that can do both text-to-audio and video-to-audio generation.
  • Noise2Music generates 30-second music clips from text prompts, again using diffusion, this time from an anonymous research team!
  • Moûsai is another diffusion-based text-to-music generator, built on long-context latent diffusion¹.
  • SingSong is another AI music project from Google, which can generate rhythmically and instrumentally complementary accompaniment for vocals.
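
Most of these systems share a common shape: encode the text prompt into an embedding, then condition a generative model on it. For the diffusion-based ones, generation is a loop that iteratively denoises random noise into audio (or an audio latent). Here’s a deliberately oversimplified sketch of that loop; both networks are toy placeholders, not any real project’s API.

```python
import numpy as np

# Both networks below are placeholders: in a real system they are large
# trained models, not these toy functions.

def encode_text(prompt: str) -> np.ndarray:
    """Map a prompt to a conditioning embedding (a trained text encoder in practice)."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(512)

def predict_noise(x: np.ndarray, step: int, cond: np.ndarray) -> np.ndarray:
    """A trained denoising network would estimate the noise present in x."""
    return np.zeros_like(x)  # placeholder

def generate_audio_latent(prompt: str, steps: int = 50, dim: int = 2048) -> np.ndarray:
    cond = encode_text(prompt)
    x = np.random.standard_normal(dim)       # start from pure noise
    for step in reversed(range(steps)):      # iteratively denoise, guided by the text
        noise_estimate = predict_noise(x, step, cond)
        x = x - noise_estimate / steps       # heavily simplified update rule
    return x  # a real system decodes this latent into a waveform

latent = generate_audio_latent("lo-fi hip hop beat with rain in the background")
```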

In the post, I also predicted that we are likely going to see a lot of progress in generative music over the next few months:

The fact that five advancements in music generation were published in a single week is a sign that we will see a lot of progress in this space over the next few months. I’ll be watching the space closely to see how it evolves, and especially how audio quality improves. MusicLM, which has the highest audio quality of the group, is still at about half CD quality.

Since February, though, there has been little in the way of notable advances in generative music, with ChatGPT and GPT-4 stealing all the limelight in AI. But then, last week, a track went viral on TikTok because the voices of Drake and The Weeknd on it had been synthesized with AI. Universal Music was quick to try to clamp down on the track, created by anonymous TikTok user Ghostwriter977, but you can still find it posted on YouTube.

What’s notable about this example of AI-generated music is that it wasn’t the music itself that was generated but instead the voices.

This brings to light a few key questions: Where is the value in music? How will artists monetize their work in an age of AI-generated music? In his latest article on Stratechery, Ben Thompson suggests that value is shifting from the music to the recording artist's name, image, and likeness (NIL).

“The relative contribution of value, though, continues to tip away from a record itself towards the recording artist’s NIL — which is precisely why Drake could be such a big winner from AI: imagine a future where Drake licenses his voice, and gets royalties or the rights to songs from anyone who uses it.”

The idea of Drake licensing his voice for AI-generated content may have sounded far-fetched a few years ago, like some Black Mirror-style dystopian future. And yet, here we are today, with the latest AI models making this a genuine possibility. While there hasn’t been much progress in generated music over the last few months, there have been many advancements in voice synthesis, with models becoming nearly indistinguishable from the real thing. Take this video created by Ammaar Reshi, where he used ElevenLabs to synthesize Steve Jobs narrating the first page of Make Something Wonderful, a book of product insights in Jobs’s own words.
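
For a sense of how accessible this has become: ElevenLabs exposes its text-to-speech through a simple REST API. Here’s a rough sketch of a call as the API looked at the time of writing; the voice ID, sample text, and output handling are all placeholders, so check the current docs before relying on any of it.

```python
import requests

API_KEY = "your-elevenlabs-api-key"   # from your account settings
VOICE_ID = "your-voice-id"            # placeholder: a stock or cloned voice

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={"text": "Make something wonderful, and put it out there."},
)
response.raise_for_status()

# The API returns raw audio bytes (MP3 by default).
with open("narration.mp3", "wb") as f:
    f.write(response.content)
```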

Then, just a few days ago, another audio synthesis model, Bark, was released that can generate both speech and music.

Combining voice synthesis with the generative music models I discussed above, you now have all the ingredients to start making AI-generated music with prompting alone. This dramatically lowers the barrier to entry for music creation and production, making it accessible to literally anyone.
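
Bark is open source, so you can already try this yourself. A minimal sketch, following the usage published in the project’s README (the scipy dependency is just for writing the WAV file, and the music-note markers are Bark’s cue to sing rather than speak):

```python
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()  # downloads and caches Bark's model weights on first run

# Music-note markers nudge Bark toward singing the text rather than speaking it.
audio_array = generate_audio("♪ In the jungle, the mighty jungle, the lion sleeps tonight ♪")

write_wav("bark_song.wav", SAMPLE_RATE, audio_array)
```

The same prompt-in, audio-out interface applies whether the output is speech, singing, or instrumentation, which is exactly what makes prompting such a low barrier to entry.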

🎛️ The Next Spotify: Personalized Generative Music

Where do we go next once AI-generated music is ubiquitous? In the next few years, we will likely see the rise of wholly personalized, generative music platforms. Imagine a service that continuously generates music tailored to your preferences, like a DJ that never stops playing. Instead of relying on a fixed catalog of songs, such a platform would use AI models to create music on the fly based on your mood, activity, and taste.
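
To make that concrete: such a platform is, at its core, an endless prompt-building loop that turns listener context into a text prompt, generates the next segment, and crossfades into it. A hypothetical sketch, where generate_clip stands in for whichever text-to-music model the platform plugs in, and the context fields are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class ListenerContext:
    mood: str               # e.g., inferred from skips, likes, time of day
    activity: str           # e.g., from device signals: "running", "working"
    favorite_genres: list

def generate_clip(prompt: str) -> bytes:
    """Stand-in for whichever text-to-music model the platform plugs in."""
    return f"<~30s of audio for: {prompt}>".encode()  # placeholder bytes

def endless_dj(listener: ListenerContext):
    """Yield one generated segment after another, forever."""
    while True:
        prompt = (
            f"{', '.join(listener.favorite_genres)} track, "
            f"{listener.mood} mood, good for {listener.activity}"
        )
        yield generate_clip(prompt)  # the player crossfades into each new clip

dj = endless_dj(ListenerContext("focused", "working", ["lo-fi", "ambient"]))
print(next(dj))  # <~30s of audio for: lo-fi, ambient track, focused mood, good for working>
```

In practice, the platform would generate a few segments ahead of playback and feed skip and replay signals back into the prompt.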

While existing streaming platforms like Spotify have the catalog to explore this concept, their agreements with record labels may limit their ability to embrace AI-generated music fully. This creates an opportunity for new entrants to disrupt the industry with generative music platforms.

Artists, too, stand to benefit from this shift. They could license their music to these platforms and receive royalties whenever their likeness is used in AI-generated tracks. Mapping generated content back to an artist’s likeness could be done by analyzing the generation prompt, or by applying machine-learning-based labeling to the track after it’s created.
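
A first pass at that attribution could look like the sketch below: scan the generation prompt against a registry of licensed artists, and fall back to an (entirely hypothetical) voice classifier when no prompt is available. Every name, share, and function here is illustrative:

```python
from typing import Optional

# Illustrative registry: artist name -> royalty share for tracks using their likeness.
LICENSED_ARTISTS = {"drake": 0.5, "the weeknd": 0.5}

def classify_voice(audio: bytes) -> Optional[str]:
    """Stand-in for an ML model that labels whose likeness a track uses."""
    return None  # placeholder

def attribute_royalties(prompt: Optional[str], audio: bytes) -> dict:
    """Map a generated track back to licensed artists for payouts."""
    matches = {}
    if prompt:  # cheapest signal: scan the generation prompt itself
        for name, share in LICENSED_ARTISTS.items():
            if name in prompt.lower():
                matches[name] = share
    if not matches:  # otherwise, label the audio after creation
        label = classify_voice(audio)
        if label in LICENSED_ARTISTS:
            matches[label] = LICENSED_ARTISTS[label]
    return matches

print(attribute_royalties("a drake-style verse over a mellow trap beat", b""))
# -> {'drake': 0.5}
```

Real systems would need something far more robust than substring matching, but the two signals (prompt analysis and post-hoc labeling) are the ones that matter.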

As AI continues to shape the music industry, we may witness the rise of new business models centered around personalized, AI-generated music. Artists will be able to capitalize on their name, image, and likeness (NIL), and listeners will enjoy a unique and ever-evolving music experience.

💃🏽 An AI-powered Musical Renaissance

The impact of this shift extends beyond just the music we listen to—it will redefine how we experience music as a form of expression and connection. It has the potential to democratize music creation by enabling anyone with a creative vision to produce music, regardless of their technical expertise. We could see the rise of new genres, sounds, and collaborations that blend human creativity with AI-generated artistry. In many ways, the traditional concept of an "album" may give way to a dynamic, ever-changing soundscape uniquely tailored to each listener.

At the same time, this transformation comes with challenges. Questions of copyright, ownership, and authenticity will be at the forefront of discussions as we navigate the integration of AI into the music industry. How do we ensure fair compensation for artists while fostering innovation and creativity? How do we balance AI-generated content with the artistry and craft humans bring to music? We don’t have answers to these questions yet, but I think we’re going to have to work them out very soon. In the meantime, we may be in for a tumultuous period in music creation and rights ownership, similar to the days of Napster and The Pirate Bay, before Spotify and other streaming platforms legitimized online music streaming.

Music is a universal language that transcends boundaries and brings people together. It is a reflection of our emotions, our culture, and our humanity. As AI becomes a more integral part of the music industry, it has the potential to amplify our ability to create and share music in ways that were previously unimaginable.

That’s all for this week. Until next time, Hitchhikers!


P.S. Don't forget to share this newsletter with your friends and help us reach our goal of 5,000 subscribers by June. We're almost there, and every share counts!


And if you haven’t subscribed yet, don’t forget to do that too!



  1. Long-context latent diffusion is commonly used to generate video: the model is designed to produce output by taking both short-term and long-term context into account. It uses a latent representation, a compact internal encoding of the content, to capture both kinds of information, and then iteratively refines that latent over time to produce the final result.

    Long-context latent diffusion lets the model generate more coherent and stable output than models that only consider short-term context. The long-term context provides a stronger constraint on the generation process, helping to ensure the output stays consistent and coherent over time.
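
    In code terms, the “long context” amounts to conditioning each newly generated chunk on a compact summary of everything generated so far. A toy sketch with placeholder models, just to show the shape of the loop:

```python
import numpy as np

def denoise_chunk(noise: np.ndarray, context: np.ndarray) -> np.ndarray:
    """Stand-in for a diffusion model that denoises one chunk,
    conditioned on a summary of everything generated so far."""
    return noise * 0.1 + context.mean()  # placeholder

def summarize(chunks: list) -> np.ndarray:
    """Stand-in for the compact long-term latent representation."""
    return np.mean(chunks, axis=0) if chunks else np.zeros(64)

def generate_long_sequence(num_chunks: int, chunk_dim: int = 64) -> np.ndarray:
    chunks = []
    for _ in range(num_chunks):
        context = summarize(chunks)  # long-term context constrains the next chunk
        chunk = denoise_chunk(np.random.standard_normal(chunk_dim), context)
        chunks.append(chunk)
    return np.concatenate(chunks)

sequence = generate_long_sequence(num_chunks=8)  # eight coherent chunks, end to end
```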