Generating Music with Seq2Seq Models

Build a stacked LSTM encoder-decoder model with Keras for generating Music

Sequence-to-sequence (Seq2Seq), or encoder-decoder, models have proven extremely powerful for building translation engines and chatbots. In this article, I will be building an encoder-decoder model that can learn to generate music from a bunch of MIDI files.

I will be explaining the architecture of the model and the different components and steps involved, and sharing my results and future work.

I have structured the article to help someone new to deep learning get an intuitive understanding of how to build a deep-learning model in Keras.

The dataset I used for this exercise is a collection of MIDI files I downloaded off torrents. I am not sure of the legal implications of sharing this dataset publicly, and I don’t want to go to prison (again :p), so I will not be sharing it here. You can easily find MIDI files online if you want to reproduce the results.

Processing MIDI files

To simplify the input data, I extract only one track from each MIDI file: the track that (1) has the most notes and (2) is not a percussion instrument. Python-midi has functions to support this selection. This will (hopefully) pick out the melody track of the song.

Each track holds a list of notes, and each note has a start time, an end time, a pitch, and a velocity.
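The article uses python-midi for this; the snippet below just sketches the selection rule itself on a minimal in-memory representation (the `Note` and `Track` classes here are illustrative, not python-midi’s actual API):

```python
from dataclasses import dataclass

@dataclass
class Note:
    start: float      # start time in seconds
    end: float        # end time in seconds
    pitch: int        # MIDI pitch, 0-127
    velocity: int     # loudness, 0-127

@dataclass
class Track:
    notes: list
    is_percussion: bool = False

def pick_melody_track(tracks):
    """Pick the non-percussion track with the most notes."""
    pitched = [t for t in tracks if not t.is_percussion]
    return max(pitched, key=lambda t: len(t.notes))
```

With a drum track, a 30-note melody track, and a 10-note bass track, `pick_melody_track` returns the melody track.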

We need to encode these 4 features as a sequence of vectors to feed this information into our model. The ideal method would be to treat this as a multivariate series and design a many-to-many model, where the model takes a series of 4 features as input and gives 4 values as output. However, I want to try something much simpler for now, so I did some quick-and-dirty feature engineering to convert this multivariate series into a univariate one.

So we need to take 4 variables (start time, end time, pitch, velocity) and somehow mash them up into one value.

1. We take the difference between a note’s end time and start time to make a new variable, duration.

duration = end time of note − start time of note

2. We look at the distribution of duration values across all the MIDI files and split them into 4 bins.

3. We similarly split the velocity of the notes into bins.

4. Now we concatenate the pitch, the velocity bin number, and the duration bin number to form categorical variables (let’s call them pseudo-notes).

5. We then one-hot encode the pseudo-notes

Now we can represent each note as one of these pseudo-notes. Examining the dataset, we find that all the MIDI files together contain 580 distinct pseudo-notes, so any song/MIDI file can be represented as a sequence of them. I realize this is not the ideal encoding; some information is lost in the binning operation. However, for the purposes of a quick demo, it will do.
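The steps above can be sketched as follows. The quantile-based choice of bin edges and the underscore-joined token format are my assumptions; the article only says the values are split into 4 bins:

```python
import numpy as np

def make_bins(values, n_bins=4):
    """Bin edges at the quantiles of the observed distribution
    (binning scheme assumed; the article only says '4 bins')."""
    return np.quantile(values, np.linspace(0, 1, n_bins + 1))[1:-1]

def pseudo_note(pitch, duration, velocity, dur_edges, vel_edges):
    """Concatenate pitch with the duration and velocity bin numbers
    into one categorical token."""
    d_bin = int(np.digitize(duration, dur_edges))
    v_bin = int(np.digitize(velocity, vel_edges))
    return f"{pitch}_{d_bin}_{v_bin}"
```

Each distinct token then gets one slot in the one-hot encoding.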

The sequence-to-sequence model is an encoder followed by a decoder, both of which are essentially LSTMs. LSTMs (Long Short-Term Memory cells) are a special kind of recurrent neural network that can handle long-term dependencies. This is useful for training on music, as the model can learn patterns like rhythm and tempo.

The encoder takes an encoded sequence of data and outputs a vector, which the decoder then decodes into another sequence. Both the encoder and the decoder are composed of LSTM cells.

We won’t be discussing the inner workings of an LSTM, as I feel it isn’t required for understanding this model.

An LSTM cell takes 2 inputs and gives 1 or 2 outputs depending on how it’s configured. LSTMs are generally connected to each other, i.e. the output of one LSTM is fed into another LSTM.

An LSTM cell outputs a hidden state and a cell state. The hidden state is the output of the cell. The cell state is a compressed representation of the network’s memory up to that point; this memory gives the neural network the capacity to capture long-term dependencies. Both states are usually passed on to the next cell.
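In Keras (assuming TensorFlow 2.x), you can expose both states by setting `return_state=True`; for a single-output LSTM, the returned output is the final hidden state:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# An LSTM configured to return its final hidden and cell states.
inputs = keras.Input(shape=(None, 5))   # (batch, timesteps, features)
output, state_h, state_c = layers.LSTM(8, return_state=True)(inputs)
model = keras.Model(inputs, [output, state_h, state_c])

o, h, c = model.predict(np.random.rand(2, 4, 5), verbose=0)
# o is the cell's output, h the final hidden state (identical to o here),
# and c the final cell state.
```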

We will be using 2 configurations of the encoder-decoder model for our exercise: one for training, and the other for inference/sequence generation.

Training Model

The encoder encodes the sequence of input vectors (the encoded notes of the song). Its outputs (the hidden and cell states) are then fed into the decoder. The decoder takes 3 inputs: the 2 states from the encoder, and the encoded notes of the song offset by one timestep from the encoder input.

The decoder output is then fed into a softmax activation function and compared to the target data.

Inference Model

We will be using a slight modification of the above model for generating sequences, since the training model cannot be used directly for inference: during training we feed the desired output, offset by one timestep, as the decoder input, and that is not available at test time. We need to reconfigure the decoder to generate one output at a time and feed each output back in as the input for the next step.

Step 1: Prepare input

An LSTM only takes 3D input. Currently, our input is 2-dimensional: each song is an ordered list of pseudo-notes.

We need to one-hot encode these pseudo-notes. The final tensor will have dimensions: number of samples (nb) × sequence length (timesteps) × size of the pseudo-note one-hot encoding.
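A sketch of this preparation, assuming songs are already lists of integer pseudo-note IDs (the fixed-size, non-overlapping windowing is my assumption):

```python
import numpy as np

NUM_TOKENS = 580  # distinct pseudo-notes in the dataset

def to_model_input(songs, seq_len):
    """Turn integer-coded songs into a (samples, timesteps, NUM_TOKENS) tensor."""
    windows = []
    for song in songs:
        # Slice each song into non-overlapping windows of seq_len notes.
        for i in range(0, len(song) - seq_len + 1, seq_len):
            windows.append(song[i:i + seq_len])
    idx = np.array(windows)        # (nb, timesteps) integer IDs
    return np.eye(NUM_TOKENS)[idx] # one-hot -> (nb, timesteps, NUM_TOKENS)
```

For a 40-note song and `seq_len=10`, this yields a `(4, 10, 580)` array.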

Step 2: Build Training model

Let’s go deep. We are going to build a 2-layer stacked encoder and a stacked decoder.

Encoder: 2 LSTM layers of 1024 neurons each, with a dropout layer between them.

The dropout layer randomly zeroes a fraction of the activations passing through it. This regularizes the model, helping it generalize better and reducing the likelihood of overfitting.

In code:
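A minimal Keras sketch of this encoder (assuming TensorFlow 2.x); the dropout rate and the exact wiring are my assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_TOKENS = 580    # distinct pseudo-notes in the dataset
LATENT_DIM = 1024   # neurons per LSTM layer, as in the text

encoder_inputs = keras.Input(shape=(None, NUM_TOKENS))
x = layers.LSTM(LATENT_DIM, return_sequences=True)(encoder_inputs)
x = layers.Dropout(0.3)(x)  # rate not given in the article; 0.3 assumed
_, state_h, state_c = layers.LSTM(LATENT_DIM, return_state=True)(x)
encoder_states = [state_h, state_c]  # handed to the decoder as initial state
```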

Decoder: 2 LSTM layers with 1024 neurons each, followed by a dense layer with softmax activation. The softmax produces a probability distribution over the possible pseudo-notes, and we select the one with the highest probability as the predicted output.

In code:
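A matching sketch of the decoder (TensorFlow 2.x; the wiring is again my assumption). The initial states are plain `Input`s here so the snippet stands alone; in the full training model they come from the encoder:

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_TOKENS = 580
LATENT_DIM = 1024

decoder_inputs = keras.Input(shape=(None, NUM_TOKENS))
init_h = keras.Input(shape=(LATENT_DIM,))  # encoder hidden state
init_c = keras.Input(shape=(LATENT_DIM,))  # encoder cell state

decoder_lstm1 = layers.LSTM(LATENT_DIM, return_sequences=True)
decoder_lstm2 = layers.LSTM(LATENT_DIM, return_sequences=True,
                            return_state=True)
decoder_dense = layers.Dense(NUM_TOKENS, activation="softmax")

x = decoder_lstm1(decoder_inputs, initial_state=[init_h, init_c])
x, _, _ = decoder_lstm2(x)
decoder_outputs = decoder_dense(x)  # pseudo-note distribution per timestep
```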

Compile and fit:
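Putting it together and fitting on tiny random stand-in data (a sketch: I use 64 units instead of the article’s 1024 just to keep the demo light, and the 0.3 dropout rate is assumed):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

NUM_TOKENS = 580
LATENT_DIM = 64  # the article uses 1024; smaller here to keep the demo light

# Encoder: 2 stacked LSTMs with dropout in between.
enc_in = keras.Input(shape=(None, NUM_TOKENS))
x = layers.LSTM(LATENT_DIM, return_sequences=True)(enc_in)
x = layers.Dropout(0.3)(x)
_, h, c = layers.LSTM(LATENT_DIM, return_state=True)(x)

# Decoder: 2 stacked LSTMs seeded with the encoder's final states.
dec_in = keras.Input(shape=(None, NUM_TOKENS))
x = layers.LSTM(LATENT_DIM, return_sequences=True)(dec_in,
                                                   initial_state=[h, c])
x, _, _ = layers.LSTM(LATENT_DIM, return_sequences=True,
                      return_state=True)(x)
dec_out = layers.Dense(NUM_TOKENS, activation="softmax")(x)

model = keras.Model([enc_in, dec_in], dec_out)
model.compile(optimizer="adam", loss="categorical_crossentropy")

# Tiny random stand-in for the real one-hot data: the decoder input is the
# target sequence offset by one timestep (teacher forcing).
n, t = 4, 8
target = np.eye(NUM_TOKENS)[np.random.randint(NUM_TOKENS, size=(n, t))]
enc_x = np.eye(NUM_TOKENS)[np.random.randint(NUM_TOKENS, size=(n, t))]
dec_x = np.concatenate([np.zeros((n, 1, NUM_TOKENS)), target[:, :-1]], axis=1)
history = model.fit([enc_x, dec_x], target, epochs=1, verbose=0)
```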

Step 3: Building the inference model

The inference model’s encoder is the same as the training model’s.

For the decoder, the same layers from the training model’s decoder are reused, but they are configured differently.

Inference model code: putting it together
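A self-contained sketch of the inference configuration (TensorFlow 2.x). For brevity the decoder here is a single LSTM layer rather than the article’s stacked pair, the units are reduced to 64, and the layers are freshly initialized; in practice you would reuse the trained layers from the training model:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

NUM_TOKENS = 580
LATENT_DIM = 64  # smaller than the article's 1024, to keep the sketch light

# Shared layers (in practice these carry the trained weights).
enc_lstm1 = layers.LSTM(LATENT_DIM, return_sequences=True)
enc_lstm2 = layers.LSTM(LATENT_DIM, return_state=True)
dec_lstm = layers.LSTM(LATENT_DIM, return_sequences=True, return_state=True)
dec_dense = layers.Dense(NUM_TOKENS, activation="softmax")

# Inference encoder: sequence in, final states out.
enc_in = keras.Input(shape=(None, NUM_TOKENS))
_, h, c = enc_lstm2(enc_lstm1(enc_in))
encoder_model = keras.Model(enc_in, [h, c])

# Inference decoder: one step at a time, states fed back in explicitly.
dec_in = keras.Input(shape=(1, NUM_TOKENS))
in_h = keras.Input(shape=(LATENT_DIM,))
in_c = keras.Input(shape=(LATENT_DIM,))
x, out_h, out_c = dec_lstm(dec_in, initial_state=[in_h, in_c])
probs = dec_dense(x)
decoder_model = keras.Model([dec_in, in_h, in_c], [probs, out_h, out_c])

def generate(seed_seq, n_steps):
    """Greedy decoding: feed each predicted pseudo-note back in."""
    h, c = encoder_model.predict(seed_seq, verbose=0)
    token = np.zeros((1, 1, NUM_TOKENS))  # "start" input: all zeros (assumed)
    out = []
    for _ in range(n_steps):
        p, h, c = decoder_model.predict([token, h, c], verbose=0)
        best = int(p[0, -1].argmax())     # most probable pseudo-note
        out.append(best)
        token = np.zeros((1, 1, NUM_TOKENS))
        token[0, 0, best] = 1.0           # feed the prediction back in
    return out
```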

Step 4: Decoding and generating MIDI file

Alright… now the moment of truth. I generated MIDI files every 100 epochs of training.

The first file generated is absolute garbage. It just plays the same note repeatedly.

The second file is interesting. It has learnt to play a small sequence, but it plays it over and over again. It sounds nice at first, but quickly gets annoying.

But, don’t worry…after a few iterations, it starts generating some interesting tunes.

The melody gets more complex, and it manages to create a long, kinda sustained rhythm. Not bad, huh…

I have a lameass laptop, so the training process was painfully time-consuming and it hung several times. Still, the results are not bad, and we can see them getting better with every iteration.

Looking at the results, I am confident that we can get much better results with a deeper architecture and more training time.
