Generating Music using an LSTM Neural Network

David C Exiga
18 min read · May 7, 2021

By: Austin Blanchard, David Exiga, Kris Killinger, Neil Narvekar, Dat Nguyen, and Sofia Valdez

Link to code and output: https://drive.google.com/drive/folders/1vs9P9L9-lOsrOifh_ptExL7DlCKOfThG?usp=sharing

Introduction

This pop music generation project was built by Austin Blanchard, David Exiga, Kris Killinger, Neil Narvekar, Dat Nguyen, and Sofia Valdez. Generate your own pop music by visiting our drive linked above.

For our project, we built our model on top of an existing LSTM model written by Sigurður Skúli. This blog will mainly focus on our improvements to his model and explore how changing different parameters in our model leads to different outputs. For a good introduction to how LSTMs work and how the baseline model generates music, please check out Skúli's own article, which served as a great resource for us.

https://towardsdatascience.com/how-to-generate-music-using-a-lstm-neural-network-in-keras-68786834d4c5

Motivation

The rise of artificial intelligence has increased computational creativity, aiding humans in tasks such as architectural design, art, novel writing, music generation, and much more [reference]. Our team, composed of music enthusiasts, was most interested in the musical capabilities of artificial intelligence. While scouring the web for music generation models, we noticed that most models were rudimentary: they generated only classical music and tended to continuously repeat the same note. Thus, we decided to take on the challenge of creating a more advanced pop music generation algorithm.

Music Terminology

Before beginning our project, we first had to familiarize ourselves with music terminology. In music, a note is a symbolic representation of a musical sound comprising duration, pitch, and dynamics. Duration is a given note's length (e.g., quarter note, eighth note), pitch is a given note's frequency, where a discrete interval corresponds to a musical note name (e.g., A4 ≈ 440 Hz), and dynamics is a given note's loudness. When three or more notes are played simultaneously, the combination of the notes is called a chord. Additionally, an interval of silence between notes or chords is called a rest.

Software

In this project, we used the MIDI file format to provide a standardized way for music sequences to be saved, transported, and opened in other systems. The Music21 Python toolkit allows for easy reading, writing, and manipulation of notes in the MIDI format. Additionally, the glob, pickle, and Keras libraries were used for finding files, saving parsed data, and building the LSTM, respectively.

Training Data

Our training data consisted of 100 pop songs in the .midi format. All files are in the same key and at the same tempo, and in this format each note has a pitch, duration, and volume. Examples of training songs include Shake It Off, Dancing Queen, and Die Young.

Example of a training .midi file: (Dancing Queen)

Training Times

One thing we wanted to touch on before looking at the various models we trained is how expensive these models were to train. With just a fairly simple LSTM network, training times on Google Colab were originally about 30 minutes per epoch for 200 epochs, meaning it would take 4 days to train just one model. Because of this, many members of our group opted to subscribe to Google Colab Pro, which significantly increased the speed at which models were trained. By setting the hardware accelerator in Google Colab to GPU and using Colab Pro, each epoch took somewhere between 30 seconds and 5 minutes to run depending on the machine. This sped up our training time to about 2 hours per model on certain machines.

However, 2 hours of training time per model was still significant for us since we had a hard deadline of May 7th. Due to these unforeseen long training times, we weren't able to tune as many parameters as we would have hoped, or to the precision we had achieved previously (for example, with GridSearchCV in the Kaggle competition). Instead, we made large changes to our parameters to see which direction we should head in.

Algorithm Explanation

Our algorithm for training the network and generating music from it closely followed Skúli's article, so a more in-depth look at the code can be found there. For our purposes, we will discuss the algorithm at a higher level in two parts: training and generation.

The training of our model starts with parsing our dataset of MIDI files. We simply loop through each file in the dataset and use Music21 to extract the notes and chords from each piece. At the end of this step, we have an ordered list of all the notes and chords found in our dataset.
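As a rough sketch of this parsing step (assuming the baseline's dot-separated chord encoding; the function and file names here are ours):

```python
import glob
import pickle
from music21 import converter, note, chord

def parse_dataset(midi_dir="pop_midis"):
    """Loop over every MIDI file and collect its notes and chords in order."""
    notes = []
    for file in glob.glob(f"{midi_dir}/*.mid"):
        midi = converter.parse(file)
        # Flatten the score into a single ordered stream of note/chord elements
        for element in midi.flat.notes:
            if isinstance(element, note.Note):
                notes.append(str(element.pitch))  # e.g. 'E4'
            elif isinstance(element, chord.Chord):
                # Baseline-style chord encoding: pitch classes joined by dots
                notes.append('.'.join(str(n) for n in element.normalOrder))
    # Cache the parsed list so we don't have to re-parse on every run
    with open("notes.pkl", "wb") as f:
        pickle.dump(notes, f)
    return notes
```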

We then split this list into multiple independent and dependent variables that can be fed into our neural network. To illustrate this, say the last step produced the list A, B, C, D, E, F. We set the size of each independent variable, or sequence length, to 2, so we can break this list into length-2 independent variables, each with a corresponding dependent next-note variable. For this example, X1 would be A, B and y1 would be C. We then keep going through the list, so X2 would be D, E, and y2 would be F. This way, we have data that can be fed to our neural network for it to predict the next note.
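A minimal sketch of this windowing step (note that in the baseline code the window actually slides one note at a time, so consecutive sequences overlap; the non-overlapping example above is just for illustration):

```python
import numpy as np
from keras.utils import to_categorical

def prepare_sequences(notes, sequence_length=10):
    """Turn the ordered note list into (X, y) pairs for the network."""
    pitchnames = sorted(set(notes))
    note_to_int = {n: i for i, n in enumerate(pitchnames)}
    n_vocab = len(pitchnames)

    network_input, network_output = [], []
    for i in range(len(notes) - sequence_length):
        seq_in = notes[i:i + sequence_length]   # previous notes (X)
        seq_out = notes[i + sequence_length]    # the note to predict (y)
        network_input.append([note_to_int[n] for n in seq_in])
        network_output.append(note_to_int[seq_out])

    # Reshape to (samples, timesteps, features) and scale inputs to 0-1
    X = np.reshape(network_input, (len(network_input), sequence_length, 1))
    X = X / float(n_vocab)
    y = to_categorical(network_output, num_classes=n_vocab)
    return X, y, n_vocab, pitchnames
```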

For the final part of the training step, we define a neural network using Keras and train our model on the created dataset. The actual structure of the neural network was left unchanged from the one found in Skúli's article.
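For reference, a stacked LSTM in the spirit of the baseline looks roughly like the following (layer sizes here are illustrative; see Skúli's article for the exact architecture we reused):

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Activation

def build_model(sequence_length, n_vocab):
    """Stacked LSTM that outputs a probability distribution over note classes."""
    model = Sequential([
        LSTM(512, input_shape=(sequence_length, 1), return_sequences=True),
        Dropout(0.3),
        LSTM(512, return_sequences=True),
        Dropout(0.3),
        LSTM(512),
        Dense(256),
        Dropout(0.3),
        Dense(n_vocab),
        Activation('softmax'),  # one probability per possible next note
    ])
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
    return model

# model = build_model(sequence_length=10, n_vocab=n_vocab)
# model.fit(X, y, epochs=200, batch_size=64)
```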

Now that we have a model that takes a list of notes of some sequence length and can generate the next note, we can move on to generating music. To do this, we first need a seed of notes of length equal to the sequence length to serve as the first input to our network; the seed can be random or drawn from our original dataset. For instance, for sequence length 2, say our seed was A, C. We use our neural network to predict the most probable next note; let's say it predicts F. We add this new note as the first note of our generated song. We also append it to the end of our seed while removing the seed's first note, so our next seed will be C, F. We repeat this some number of times to generate a song.
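A sketch of this generation loop (here the seed is kept as raw integer-encoded notes and scaled only when fed to the network, which sidesteps the scaling issues discussed in the bug-fix section below; how we actually pick the note is covered later):

```python
import numpy as np

def generate_song(model, seed, n_vocab, int_to_note, n_notes=200):
    """Generate n_notes by repeatedly predicting and sliding the seed window."""
    pattern = list(seed)   # seed: a list of integer-encoded notes
    output = []
    for _ in range(n_notes):
        # Shape the current seed for the network and scale it to 0-1
        x = np.reshape(pattern, (1, len(pattern), 1)) / float(n_vocab)
        prediction = model.predict(x, verbose=0)[0]

        index = int(np.argmax(prediction))   # or sample -- see the later sections
        output.append(int_to_note[index])

        # Slide the window: append the new note, drop the oldest one
        pattern.append(index)
        pattern = pattern[1:]
    return output
```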

In this project we played around with various parts of this baseline, most notably what types of notes are parsed from the dataset, the sequence length, and how we generate the next note. In the next section, we look at the output of Skúli's baseline model, which uses sequence length 100.

Base Model

To begin our project, we downloaded a base model that generates classical music from Skúli's GitHub account: https://github.com/Skuldur/Classical-Piano-Composer. The base model was trained for 50 epochs with a sequence length of 100, and by listening to its output, we could tell that Skúli's algorithm does not implement rests and does not vary pitch, duration, or volume.

Bug Fixes to Skúli’s base model

While we are grateful to Skúli for providing us a base model to work with, there are some glaring issues with his design, especially in his implementation of next-note generation. His generate_notes function uses a prediction_input variable, which is the seed of notes used in each loop iteration to generate the next note. He scales these values to between 0 and 1 by dividing by the total number of classes, which he calls n_vocab. However, we found this division to be unnecessary, since his prepare_sequences function already divides by n_vocab, meaning the values should not be scaled again in generate_notes. We fixed this by simply commenting out the division in generate_notes.

There is another easy to spot error when Skúli adds the next note to be used for the seed in the next loop.

We can see that after getting the index of the next note, he adds the correct result to the variable that holds the final song, called prediction_output. However, he appends the index to the next seed rather than the actual note. To fix this, we converted the resulting note back to its scaled 0-1 representation and appended that to the pattern variable (the seed for the next loop) instead.
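Putting both fixes together, one prediction step looks roughly like this. This is a reconstruction rather than Skúli's literal code, but the variable names (pattern, n_vocab, prediction_output) follow his article:

```python
import numpy as np

def predict_one_step(model, pattern, n_vocab, int_to_note, prediction_output):
    """One generation step with both fixes applied."""
    # Fix 1: `pattern` was already scaled to 0-1 in prepare_sequences,
    # so we do NOT divide by n_vocab a second time here.
    prediction_input = np.reshape(pattern, (1, len(pattern), 1))
    prediction = model.predict(prediction_input, verbose=0)[0]

    index = int(np.argmax(prediction))
    prediction_output.append(int_to_note[index])   # readable note for the final song

    # Fix 2: feed the *scaled* note value back into the seed, not the raw index.
    pattern.append(index / float(n_vocab))
    return pattern[1:]
```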

Improvements to Skúli’s base model

Even after fixing these code mistakes, there are many aspects of the original model that cause the output music to sound unpleasant. Firstly, rests are not implemented. His model ignores rests during encoding, and as a result the model outputs a continuous stream of notes. Secondly, his model interprets all chords as if they were at the same pitch, ignoring the octaves of the notes in a chord. This is a strange decision because his model encodes the octaves of individual notes but ignores them for chords. Thirdly, all the notes have the same duration and volume. Real music has notes of varying duration and volume, which makes it sound interesting. Lastly, Skúli's base model has a tendency to fall into a loop: it likes to repeat patterns of notes, or even the same note, over and over. This is a result of how the model decides its next note, which will be explained later.

Capturing chord pitches

The first fix we will discuss is correctly capturing chord pitches, because it is the easiest in terms of programming. Understanding why this matters requires some basic music theory. In music, there are 12 distinct tones or notes. They range from A to G (think the 7 white keys on a piano), with the sharps and flats (the 5 black keys) rounding out the 12 tones. Each note has a distinct base frequency that defines it as that note. For example, 'A' has a base frequency of 55 Hz. But that is not the only frequency you can play 'A' at. If you double the frequency to 110 Hz, that sound is still distinctly 'A', but at a higher pitch. Doubling the frequency in music terms means moving an octave up: you are playing the same note, but at a higher pitch. The same goes for halving the frequency and moving an octave down. This is important because if we limited ourselves to notes at their base frequencies, we would lose a lot of information in terms of pitch. That is essentially what Skúli is doing with his encoding of chords: he records only which notes make up the chord, not the octaves at which they are played.

We can fix that by changing the chord encoding from numbers to letters representing the tone, then appending the octave number after the tone letter. Drawing from the example in the figure above, the fixed model would go from interpreting both chords as '4,8,11' to interpreting them as 'C1,E1,G1' and 'C2,E2,G2', respectively.
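In Music21 terms, the change is roughly the following sketch, where element is a chord.Chord parsed from the file:

```python
from music21 import chord

def encode_chord(element: chord.Chord) -> str:
    """Encode a chord by note name plus octave instead of pitch class alone."""
    # Baseline encoding (octave-less pitch classes), e.g. '4.8.11':
    #   '.'.join(str(n) for n in element.normalOrder)
    # Our encoding keeps the octave of every note in the chord:
    return '.'.join(p.nameWithOctave for p in element.pitches)

# Example: encode_chord(chord.Chord(['C1', 'E1', 'G1'])) -> 'C1.E1.G1'
#          encode_chord(chord.Chord(['C2', 'E2', 'G2'])) -> 'C2.E2.G2'
```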

Changing Next Note Algorithm

As mentioned before, Skúli’s base model has a bad habit of falling into loops which do not sound pleasant. This is a result of how his model decides its next note, which we will dub the “argmax” implementation. When the model is deciding which note should come next, it generates a probability distribution for every single note and pitch it has seen during its training. The argmax implementation would simply take the maximum probability from this distribution and use its associated note and pitch to be the next note in the sequence. As an example, say our model can only predict 5 possible notes: A, B, C, D, E. A reasonable probability distribution generated by the model would be [0.6, 0.1, 0.1, 0.15, 0.05] (0.6 for A, 0.1 for B, and so on). Argmax would determine A should be the next note because 0.6 is the highest probability.

Problem with Argmax

The issue with argmax is its tendency to fall into a loop of predicting the same notes over and over, one which the model may never recover from. LSTM models in general are good at detecting patterns, which is the main advantage leveraged when using such models to predict the next word in a sentence or, in our case, the next note in a song. However, the drawback is that the model is sometimes too good at predicting patterns. For example, say during training the model picks up on the following note pattern: "A, A, A". In other words, it thinks that the more As it sees, the more likely the next note is to be A. Now, consider the following scenario with our quirky 'A' model, which uses the past 3 notes to predict from among only 3 possible notes: A, B, or C. In the first iteration it is given "A, B, C" and the corresponding probability distribution is [0.4, 0.3, 0.3]. Argmax says to predict A. In the second iteration it is given "B, C, A" and the corresponding probability distribution is [0.8, 0.1, 0.1]. Argmax again says to predict A. In the third iteration the model is given "C, A, A" and the corresponding probability distribution is [0.98, 0.01, 0.01]. Argmax once again says to predict A. Now, the model is stuck predicting A forever! Although this is a simple example, it is easy to see why the base model fell into the trap of producing unpleasant-sounding loops.

Probability Distribution Implementation

An alternative to the argmax algorithm is something we call the probability distribution implementation. Put simply, this implementation stays true to the probability distribution generated by the model. Instead of looking only at the maximum probability, as in the argmax implementation, it considers all the probabilities and predicts the next note according to the probability assigned to it by the distribution. For example, if the model has 3 notes (A, B, C) with a probability distribution of [0.4, 0.3, 0.3], the model will predict A 40% of the time, B 30% of the time, and C the remaining 30% of the time. With this implemented, we immediately saw more varied and interesting outputs produced by the model.
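Concretely, the switch from argmax to distribution sampling is essentially a one-line change (a sketch, where prediction is the softmax output for one step):

```python
import numpy as np

prediction = np.array([0.4, 0.3, 0.3])  # example softmax output for notes A, B, C

# Argmax implementation: always pick the single most likely class
index_argmax = int(np.argmax(prediction))                             # always A

# Probability distribution implementation: sample according to the distribution
index_sampled = int(np.random.choice(len(prediction), p=prediction))  # A ~40% of the time
```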

Top 7 as an alternative implementation

A drawback of the probability distribution implementation is that there is still a (small) chance that a very bad note or chord will be played. To illustrate, consider our simple 3-note model again. The probability distribution for A, B, and C comes out to [0.99, 0, 0.01], respectively. The low probabilities assigned to B and C reflect the fact that B and C should definitely not be the next note in the sequence: the model has trained on enough data to know that placing B or C next is highly unusual, because it sounds bad and thus was rarely played in an actual song (and so was rarely learned). Under the probability distribution implementation, there is still a very small chance that the model will select the "wrong" note; 1% of the time, it will place C as the next note and produce an awkward-sounding output.

As a fix to the probability distribution algorithm we derived the "top 7" algorithm. The top 7 algorithm works by allowing the model to select only from the top 7* most likely notes in its probability distribution. We accomplish this by taking the top 7 probabilities, normalizing them so they sum to 1, creating a modified probability distribution consisting of only these 7 normalized probabilities, and then sampling from it to determine the next note. This ensures only "compatible" notes will be played. For the rest of our models, we used this top 7 algorithm for next-note generation. You can hear in the music clip below that with top 7 implemented, the model delivers an output that is more consistent yet variable enough to still sound interesting.

*The number 7 was chosen rather arbitrarily, but the team experimented with different numbers and decided that 7 sounded the best.
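A sketch of the top 7 sampling we settled on; the renormalization step makes the kept probabilities sum to 1 before sampling:

```python
import numpy as np

def sample_top_k(prediction, k=7):
    """Sample the next-note index from only the k most likely classes."""
    prediction = np.asarray(prediction, dtype=float)
    top_indices = np.argsort(prediction)[-k:]     # indices of the k largest probabilities
    top_probs = prediction[top_indices]
    top_probs = top_probs / top_probs.sum()       # renormalize so they sum to 1
    return int(np.random.choice(top_indices, p=top_probs))

# With k=1 this reduces to argmax; with k equal to the full vocabulary size it
# is the plain probability distribution implementation.
```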

Rest Implementation

The next thing we tackled was implementing rests. To accomplish this, we simply needed to add another class to represent rests when encoding the MIDI file data into numerical input for the LSTM model, and to match that additional class on the output side. We modified the model so that it could convert its numerical output back into one of the three classes of elements for the output MIDI file: notes, chords, or rests. In real music, rests (like any musical note) vary in duration, but in our first implementation we chose to treat all rests as having the same duration for simplicity. You can hear rests (the absence of notes) at the beginning of the song found below.
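Encoding-wise, this just means handling one more element type when walking the parsed stream (a sketch using Music21's note.Rest; note that the stream has to be iterated with .notesAndRests rather than .notes so rests are not skipped):

```python
from music21 import note, chord

def encode_element(element):
    """Map a parsed Music21 element to one of our three token types."""
    if isinstance(element, note.Note):
        return element.pitch.nameWithOctave                         # e.g. 'E4'
    if isinstance(element, chord.Chord):
        return '.'.join(p.nameWithOctave for p in element.pitches)  # e.g. 'C1.E1.G1'
    if isinstance(element, note.Rest):
        return 'rest'   # a single class for all rests (durations come later)
    return None         # ignore anything else in the stream

# Usage: tokens = [t for el in midi.flat.notesAndRests if (t := encode_element(el))]
```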

Exploring a different dataset

We also tried training a model on a modified version of the 100 pop MIDI files that contained only the choruses. Each new file was a fraction of the length of the original, so training could be quick and specialized to the chorus. The same model was trained for 150 epochs with a sequence length of 20 and yielded a cross entropy loss of 0.11; however, the results were not significantly better than the models trained on the standard dataset. Additionally, the songs this model produced tended to be more repetitive, which was not the desired outcome.

Hyperparameter: Sequence Length

Sequence length is the term we use for the length of the input X used to predict the next note y. For instance, the figure below shows what a model with sequence length 5 might look like.

Neural Network Visualized

From https://towardsdatascience.com/how-to-generate-music-using-a-lstm-neural-network-in-keras-68786834d4c5

The sequence length here is 5 because that is the number of previous notes the model is given to predict the next note. As you might guess, changing this parameter can greatly vary the sound of the final output. We experimented with several sequence lengths before settling on sequence length 10 as our choice moving forward.

The base model actually had a sequence length of 100. After training for 200 epochs, the final model had a categorical cross entropy loss of 0.0758.

Our next model was trained with sequence length 200. After 200 epochs, the categorical cross entropy loss was 0.0960.

We also trained a model with sequence length 5. After 200 epochs, the categorical cross entropy loss was an extremely high 2.1154. We do not have an example output from this model, but the output sounded a bit off, as if many wrong notes were being played. The reason is that as the sequence length decreases, the model has less data with which to predict the next note. You can imagine that with a sequence length of 1, the model would have almost no idea what the next note should be. Because of this, a very low sequence length like 5 seemed to make our model worse.

Lastly, we trained a model with sequence length 10. After 200 epochs, the categorical cross entropy loss was 0.1130. Despite the low sequence length, the cross entropy loss was still low.

After listening to the outputs generated with the varying sequence lengths, we decided that sequence length 10 was what we would move forward with. We felt that this sequence length gave a lot of variation over the course of the song, as opposed to the higher sequence lengths. While it is more difficult to tell in these shorter clips, the higher sequence lengths produce songs that repeat the same phrases and similar notes over and over. With lower sequence lengths, the generated song is more likely to change suddenly, making the music feel more alive and varied.

Changing note lengths

So far, all our songs have had notes and rests of the exact same length. We thought that adding varying note lengths would greatly improve the quality of our generated songs. Each note, chord, and rest that we parsed from our dataset comes with a duration.quarterLength attribute that tells us how long it is, so we can use this to measure the length of every element in the dataset. We can visualize this distribution by plotting the frequency of each duration found.
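A sketch of how a plot like the one below can be produced (assuming the parsed Music21 elements are still available, not just their string tokens):

```python
from collections import Counter
import matplotlib.pyplot as plt

def plot_duration_frequencies(elements):
    """Count how often each quarterLength appears and plot the frequencies."""
    counts = Counter(float(el.duration.quarterLength) for el in elements)
    durations = sorted(counts)
    plt.bar(durations, [counts[d] for d in durations], width=0.1)
    plt.xlabel("Duration (quarter lengths)")
    plt.ylabel("Frequency")
    plt.title("Frequency of each duration")
    plt.show()
```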

Plotted Frequency of Each Duration

We can see that most of our durations lie on the shorter end. The largest bumps are at 0.5 and 1, with the next moderately sized bump at 1.5. While we could simply give each note its exact length, we were afraid of expanding the number of classes the model could predict to something too high. Because of this, we decided to simplify the durations into 3 main classes, which we called short (less than 0.75), medium (between 0.75 and 1.5), and long (greater than or equal to 1.5). By doing this, we limit the number of classes our model has to predict from while still allowing different note lengths.
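The bucketing itself is a small helper that maps each quarterLength onto one of the three classes, using the thresholds above (the exact token format, e.g. 'E4 medium', is our own illustration):

```python
def duration_bucket(quarter_length: float) -> str:
    """Collapse raw durations into three coarse classes."""
    if quarter_length < 0.75:
        return 'short'
    if quarter_length < 1.5:
        return 'medium'
    return 'long'   # 1.5 and above

# The bucket label is attached to the note token before training,
# e.g. 'E4 medium' or 'C1.E1.G1 short', which is what grows the number
# of classes from 150 to 314.
```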

For our 3 note length model, we used sequence length 10 as discussed before. After 200 epochs, the categorical cross entropy loss was 0.1829. The number of classes the model predicted from expanded from 150 to 314.

In this excerpt generated by the model, we can hear how varying the note, chord, and rest lengths increases the diversity of the sound.

Changing volumes

The last feature we tried adding to our model was volume. When parsing the notes using Music21, we noticed that each note had a volume.velocity attribute specifying the volume at which it would be played. After parsing all the notes in our dataset, we found that only 4 distinct volumes existed in the dataset.

Plotted Volume Frequency for the Four Volumes

These volumes are 31, 59, 81, and 102. Since there were only 4 different volumes, they were easy to add to our model: we simply attach the volume of the note to its token before appending it to our array of notes. For instance, what would previously have been read as 'A' is now read as 'A 31' if the note A had a volume of 31.
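A sketch of the volume handling, using Music21's volume.velocity and the 'A 31'-style tokens described above (our notes also carry an octave, so a real token looks more like 'A4 31'):

```python
from collections import Counter
from music21 import note

def note_token_with_volume(element: note.Note) -> str:
    """Append the note's MIDI velocity to its pitch token, e.g. 'A4 31'."""
    velocity = element.volume.velocity   # one of 31, 59, 81, 102 in our dataset
    return f"{element.pitch.nameWithOctave} {velocity}"

def count_velocities(elements):
    """Confirm how many distinct velocities appear in the dataset."""
    return Counter(el.volume.velocity for el in elements
                   if getattr(el, 'volume', None) is not None
                   and el.volume.velocity is not None)
```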

We trained a model with these additional 4 volumes for 200 epochs. This model doesn’t feature varying note lengths, but only adds 4 varying volumes. It uses rests, chords, notes, and has a sequence length of 10. The final categorical cross entropy loss of this model was 0.1615. This model expanded the number of classes the model could predict from 150 to 264.

Listening to the model, we can clearly hear how adding varying volumes to the notes adds a lot to the music’s quality. Since we felt varying the volumes helped enhance the quality of the songs generated, we kept it for the final model.

Final Model

After seeing the strengths and weaknesses of our previous models, we were finally ready to combine the best parts of each into one final model. This final model contained 4 varying note volumes, 3 varying note lengths (short, medium, and long), and a sequence length of 10. It expanded the number of classes the model could predict from 150 to 463, the most of any of our models. After training for 200 epochs, its final categorical cross entropy loss was 0.2839, which was also fairly high.

Loss and Number of Classes

We have been noting the losses for our various models throughout this article. As you may have noticed, the loss is lowest in our discussion of sequence lengths, but continues to increase as we add more features. The loss of our final model is actually the highest, at 0.2839. The reason for this increase is related to the number of classes the model predicts from, as mentioned in the note lengths and volumes sections.

Imagine a scenario where we have data containing 3 notes: A, B, and C. Given an input, our model has 3 classes to predict from. If we then add short, medium, and long note lengths, our classes become A short, A medium, A long, B short, and so on, so we would have 3 * 3 = 9 different classes. If we next add 4 volumes, our classes might be A short 31, A medium 31, and so on, so our model would have 3 * 3 * 4 = 36 classes to predict from.

By increasing the classes like this, we increase the complexity of the model needed to handle these classes, the data required, and the training time. However, since we weren’t able to increase our training time due to having to meet our deadline, we left the model the same and saw an increased loss as our model grew more complex.

Further Improvements

While we were happy with our final output, there are obviously many improvements that could still be made to our models. Firstly, our current implementation is just a continuous stream of notes. It may be possible to add some sort of musical structure, such as bars and measures, to what is input to the neural network.

Secondly, it would be really interesting to see how the output changes depending on the dataset. Since we wanted our outputs to be consistent for this project, and since each model is so expensive to train, we didn't end up experimenting with different datasets. However, it might be fun to use a dataset of your favorite songs to see if the output is more enjoyable.

Thirdly, adding multiple instruments playing at once would greatly enhance the quality of the sound. Currently, we have chords that play multiple notes at once, and these already help our model's sound. However, our input is currently just a sequence of piano notes, so adding multiple notes for each of those inputs may allow us to add more parts to our song.

As discussed before, the difficulty is that these changes would greatly increase the number of classes the model can predict from, which we couldn't afford given the training times it would cause. However, they would be worthwhile improvements given more time.

References and Further Readings

Generating Music Using an LSTM NN by Sigurður Skúli

https://towardsdatascience.com/how-to-generate-music-using-a-lstm-neural-network-in-keras-68786834d4c5

Deep Learning Techniques for Music Generation by Jean-Pierre Briot, et al.

Music21 documentation

https://web.mit.edu/music21/doc/
