Agostinelli et al. at Google Research:
We introduce MusicLM, a model generating high-fidelity music from text descriptions such as “a calming violin melody backed by a distorted guitar riff”. MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes.
Definitely listen to these demos — especially the AI-generated voices singing and rapping in artificial languages. Amazing.
Riffusion is still my favorite text-to-music demo, mostly because of the unbelievable way it works. I am, of course, excited to see more development here, though. The output from MusicLM is clearly better than Riffusion's; I just wish there were a public demo I could try out.