
Music Throughout the Decades: An Analysis

Sarah Flores & Premvanti Patel, UMD CMSC320, Fall 2020

Table of Contents:

  1. Introduction: What Are We Doing?
  2. Data Collection & Processing
    A. Downloading the Billboard Top 100 Data
    B. Preprocessing the DataFrames: Merging, Cleaning, and More!
  3. Exploratory Data Analysis
    A. Acoustic Features
    B. Sentiment Analysis
  4. Machine Learning: Predicting the Patterns
    A. Setup and Model Training
    B. Validation: Plotting Residuals
  5. Another Interpretation: WordClouds
    A. Visualizing Lyrics
    B. Word Clouds
  6. Insights and Conclusions

1. Introduction: What Are We Doing?

It's no secret that over the past few decades, people's taste in music has evolved greatly. With this project, we want to explore those changes quantitatively, and more deeply than just finding the most popular genre. There were two questions we had to answer first: how to find the most popular songs, and how exactly to put numbers to music.

After some searching, we found a dataset that would serve our purposes perfectly. It combines the Billboard Top 100 weekly singles from 1958 to 2019 with each single's corresponding acoustic features from Spotify.

The Billboard Top 100 needs no explanation, but Spotify's acoustic features do. In 2014, Spotify bought a small company named The Echo Nest, which had designed an algorithm that scores songs on a set of numerical acoustic features. These features include some obvious ones, such as key and time signature, but also more nuanced fields such as acousticness, danceability, and energy. More information about these features (including a full spec by Spotify) can be found here.

To challenge ourselves and deepen the analysis, we decided to analyze the sentiment of the lyrics of a few of the most popular songs from this time period. We'll then be able to cross-check the lyric sentiment against the acoustic features in the dataset and see whether there are patterns between the two that hold across time.

Here is a list of the Python packages we needed for this project.

  1. Numpy
  2. Pandas
  3. Datetime
  4. Matplotlib
  5. Seaborn
  6. LyricsGenius
  7. VADER
  8. Statsmodels
  9. WordCloud

2. Data Collection & Processing

2A. Downloading the Billboard Top 100 Data

As mentioned above, we are using a dataset of Billboard's Top 100 weekly songs compiled by Sean Miller. All of the libraries necessary to complete this tutorial are imported/installed below.

In [121]:
# general
import numpy as np
import pandas as pd
import datetime

# plotting
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# scraping lyrics 
#!pip install lyricsgenius
import lyricsgenius

# sentiment analysis
#!pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# machine learning & regression
import statsmodels.regression.linear_model as sm
from sklearn.linear_model import LinearRegression

# wordcloud libraries 
#!pip install wordcloud
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

import warnings
warnings.filterwarnings("ignore")

As with every good data science project, we'll start by reading our CSV files into DataFrames.

In [122]:
# read into dataframes
features_df = pd.read_csv('Hot 100 Audio Features.csv')
ranks_df = pd.read_csv('Hot Stuff.csv')

# make copies to manipulate
features = features_df.copy(deep=True)
ranks = ranks_df.copy(deep=True)

In the Ranks dataset shown below, there are 10 columns. They tell us how a song ranked in comparison to other songs that week. The attributes provided broadly include three types of information:

  1. Song information: the name and performer of the song
  2. Charts information: data about when and where the song was on the charts
  3. Dataset information: the billboard.com URL, and a SongID specific to the dataset.
In [123]:
ranks.head(3)
Out[123]:
url WeekID Week Position Song Performer SongID Instance Previous Week Position Peak Position Weeks on Chart
0 http://www.billboard.com/charts/hot-100/1958-0... 8/2/1958 1 Poor Little Fool Ricky Nelson Poor Little FoolRicky Nelson 1 NaN 1 1
1 http://www.billboard.com/charts/hot-100/1995-1... 12/2/1995 1 One Sweet Day Mariah Carey & Boyz II Men One Sweet DayMariah Carey & Boyz II Men 1 NaN 1 1
2 http://www.billboard.com/charts/hot-100/1997-1... 10/11/1997 1 Candle In The Wind 1997/Something About The Wa... Elton John Candle In The Wind 1997/Something About The Wa... 1 NaN 1 1

In the Features dataset shown below, we have 22 columns. The first 10 columns provide basic information for each song, like its title and performer, along with Spotify-specific details like track ID and popularity. The SongID, Song, and Performer columns are repeated from the Ranks dataset, which makes them important for connecting the Ranks and Features DataFrames. The remaining 12 columns describe the acoustic features of each song. Here's the link to the description of the features again.

In [124]:
features.head(3)
Out[124]:
SongID Performer Song spotify_genre spotify_track_id spotify_track_preview_url spotify_track_album spotify_track_explicit spotify_track_duration_ms spotify_track_popularity ... key loudness mode speechiness acousticness instrumentalness liveness valence tempo time_signature
0 AdictoTainy, Anuel AA & Ozuna Tainy, Anuel AA & Ozuna Adicto ['pop reggaeton'] 3jbT1Y5MoPwEIpZndDDwVq NaN Adicto (with Anuel AA & Ozuna) False 270740.0 91.0 ... 10.0 -4.803 0.0 0.0735 0.017 0.000016 0.179 0.623 80.002 4.0
1 The Ones That Didn't Make It Back HomeJustin M... Justin Moore The Ones That Didn't Make It Back Home ['arkansas country', 'contemporary country', '... NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 ShallowLady Gaga & Bradley Cooper Lady Gaga & Bradley Cooper Shallow ['dance pop', 'pop'] 2VxeLyX666F8uXCJ0dZF8B NaN A Star Is Born Soundtrack False 215733.0 88.0 ... 7.0 -6.362 1.0 0.0308 0.371 0.000000 0.231 0.323 95.799 4.0

3 rows × 22 columns

2B. Preprocessing the DataFrames: Merging, Cleaning, and More!

You might see this as the most boring part of data science, but it's a necessary evil and extremely important for the rest of the process. We'll start by merging the Ranks and Features DataFrames on the SongID column. We only want to analyze songs that are present in both DataFrames, so we will perform an inner join.

In [125]:
songs = pd.merge(ranks, features, on='SongID', how='inner')
songs.head(3)
Out[125]:
url WeekID Week Position Song_x Performer_x SongID Instance Previous Week Position Peak Position Weeks on Chart ... key loudness mode speechiness acousticness instrumentalness liveness valence tempo time_signature
0 https://www.billboard.com/charts/hot-100/2019-... 2/2/2019 1 7 Rings Ariana Grande 7 RingsAriana Grande 1 NaN 1 1 ... 1.0 -10.732 0.0 0.334 0.592 0.0 0.0881 0.327 140.048 4.0
1 https://www.billboard.com/charts/hot-100/2019-... 5/25/2019 11 7 Rings Ariana Grande 7 RingsAriana Grande 1 10.0 1 17 ... 1.0 -10.732 0.0 0.334 0.592 0.0 0.0881 0.327 140.048 4.0
2 https://www.billboard.com/charts/hot-100/2019-... 4/20/2019 4 7 Rings Ariana Grande 7 RingsAriana Grande 1 3.0 1 12 ... 1.0 -10.732 0.0 0.334 0.592 0.0 0.0881 0.327 140.048 4.0

3 rows × 31 columns

This table is simply massive and very clumsy, so we need to do a fair amount of dropping rows and columns. We'll drop any duplicate columns and columns not relevant to our analysis. Additionally, we only want to consider unique, complete rows with no NaN values.

In [126]:
# remove columns that we won't use 
cols_remove = ['url', 'Performer_y', 'Song_y', 'Weeks on Chart', 'Peak Position', \
               'Previous Week Position', 'Instance', 'spotify_track_id', \
               'spotify_track_preview_url', 'spotify_track_album', \
               'spotify_track_explicit', 'spotify_track_duration_ms', \
               'spotify_genre', 'spotify_track_popularity'
              ]
songs.drop(cols_remove, axis = 1, inplace=True)

# rename columns
songs.rename(columns={'Performer_x': 'Performer', 'Song_x': 'Song', 'Week Position': 'Rank'}, inplace=True)

# remove rows with duplicate SongIDs and rows containing NaN 
songs = songs.drop_duplicates(subset='SongID', keep='first')
songs = songs.dropna()

Now, we want to encode the WeekID column as Datetime objects, which lets us sort and subdivide our DataFrame more easily later on. For convenience of visualization, we also add a Year column in addition to the complete date.

In [127]:
# append new columns
dt = []
years = []
for row_index, row in songs.iterrows():
    temp = datetime.datetime.strptime(songs.at[row_index, 'WeekID'], '%m/%d/%Y')
    dt.append(temp)
    years.append(temp.year)
songs['Datetime'] = dt
songs['Year'] = years

# we can drop the weekID column now
songs.drop('WeekID', axis = 1, inplace=True)

# sort and reset
songs = songs.sort_values(by=['Datetime', 'Rank'])
songs.reset_index(drop=True, inplace=True)
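As an aside, the same conversion can be done without an explicit loop. A minimal vectorized sketch (it would replace the loop above, before WeekID is dropped):

# Vectorized equivalent of the loop above (a sketch):
# pd.to_datetime parses the m/d/Y strings in a single call.
songs['Datetime'] = pd.to_datetime(songs['WeekID'], format='%m/%d/%Y')
songs['Year'] = songs['Datetime'].dt.year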

The complete DataFrame after merging and preprocessing:

In [128]:
songs.head(3)
Out[128]:
Rank Song Performer SongID danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo time_signature Datetime Year
0 5 When Kalin Twins WhenKalin Twins 0.646 0.582 6.0 -12.679 1.0 0.0297 0.168 0.000005 0.976 0.963 96.490 4.0 1958-08-02 1958
1 7 Yakety Yak The Coasters Yakety YakThe Coasters 0.715 0.669 7.0 -9.491 1.0 0.1280 0.705 0.000732 0.044 0.976 120.789 4.0 1958-08-02 1958
2 8 My True Love Jack Scott My True LoveJack Scott 0.548 0.253 4.0 -11.387 1.0 0.0279 0.871 0.000099 0.138 0.238 68.184 3.0 1958-08-02 1958

3. Exploratory Data Analysis

3A: Acoustic Features

Before we start considering any one feature or a subset of our data, we need an idea of some of the trends in the data as a whole. Spotify's acoustic features are a perfect place to start.

We'll examine the following features for relationships to year: acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, valence, and tempo. Each year's data is calculated by taking the mean over all the songs for that year.

Note that key, mode, and time signature are excluded from the plots here. Key and mode are categorical variables, and time signature is essentially a ratio, so taking the mean of those simply doesn't make sense in this context.

For a reminder on what each of the acoustic features means, check out this page.

In [129]:
# find the average of each feature per year
df = songs.copy()

acoustic_features = ['danceability','energy', 'key', 'loudness',
                     'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness',
                     'valence', 'tempo', 'time_signature']

feature_averages = df.groupby(["Year"])[acoustic_features].mean()
feature_averages.reset_index(inplace=True)

feature_averages.head(3)
Out[129]:
Year danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo time_signature
0 1958 0.547516 0.480868 5.172093 -10.659019 0.865116 0.054884 0.689247 0.055913 0.200894 0.680112 113.556684 3.725581
1 1959 0.522090 0.474502 4.970027 -11.237687 0.847411 0.058096 0.668486 0.063040 0.187984 0.654508 117.056379 3.792916
2 1960 0.509847 0.447627 5.111380 -11.002462 0.864407 0.048358 0.671877 0.069121 0.210148 0.633494 117.113029 3.765133

Let's go ahead and plot these values, since the raw numbers don't mean much to the human eye.

In [130]:
# Generate a 3x3 grid of feature-vs-year plots
fig, axs = plt.subplots(3, 3, figsize=(15,10))

x = feature_averages['Year']

# features to plot, and a color for each subplot
plot_features = ['danceability', 'loudness', 'acousticness',
                 'energy', 'speechiness', 'instrumentalness',
                 'liveness', 'valence', 'tempo']
colors = ['tab:red', 'tab:orange', 'tab:green',
          'tab:blue', 'tab:purple', 'tab:brown',
          'tab:red', 'tab:orange', 'tab:green']

# plot each feature's yearly mean in its own subplot
for ax, feature, color in zip(axs.flat, plot_features, colors):
    ax.plot(x, feature_averages[feature], color)
    ax.set_title("{} in Songs Over Time".format(feature.capitalize()))
    ax.set(xlabel='Year', ylabel=feature.capitalize())

fig.tight_layout()

What do these plots tell us?

Analyzing the acoustic features of songs over time reveals some clear patterns. Some of the plots have a definite general trend: acousticness decreases while speechiness increases, likely due to the rise of electronic music and rap. Some have dramatic dips: songs around 2010 were significantly less danceable, and from 1990-2005, they were much slower. Together, the plots show that what makes music popular has definitely evolved over time.

3B. Sentiment Analysis

We've looked at acoustic features and their relationship to time, but we can consider another facet of the music: the lyrics! We'll turn to sentiment analysis for this. In brief, sentiment analysis is a Natural Language Processing technique that allows us to quantify the emotion found in text. More information can be found here. This would usually fall under machine learning rather than EDA, but with the package we're using (explained below), the analysis does not require training a model.

Preparation: Sampling Songs

Because this is a more involved analysis, we want to narrow our dataset to a more manageable size. Previously, we removed all duplicate songs from the table, so we're only left with unique ones. Out of these, we'd like to sample the most popular ones, which will have the best chart positions (i.e., the lowest Rank values) in the dataset.

We'll start by splitting the table roughly into decades, with 1958 and 1959 getting mixed in with the 1960s, and 2010-2019 being the last decade.

In [131]:
sixties = songs[songs.Year < 1970]
seventies = songs[(songs.Year >= 1970) & (songs.Year < 1980)]
eighties = songs[(songs.Year >= 1980) & (songs.Year < 1990)]
nineties = songs[(songs.Year >= 1990) & (songs.Year < 2000)]
thousands = songs[(songs.Year >= 2000) & (songs.Year < 2010)]
tens = songs[songs.Year >= 2010]

Then, we can sample the top 100 songs from each time period, giving us 600 in total.

In [132]:
songs_sampled = sixties.sort_values(by='Rank')[0:100]
songs_sampled = songs_sampled.append(seventies.sort_values(by='Rank')[0:100])
songs_sampled = songs_sampled.append(eighties.sort_values(by='Rank')[0:100])
songs_sampled = songs_sampled.append(nineties.sort_values(by='Rank')[0:100])
songs_sampled = songs_sampled.append(thousands.sort_values(by='Rank')[0:100])
songs_sampled = songs_sampled.append(tens.sort_values(by='Rank')[0:100])
songs_sampled.reset_index(inplace=True)
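As a side note, DataFrame.append is deprecated in newer versions of pandas; a roughly equivalent sketch with pd.concat would be:

# Roughly equivalent sampling using pd.concat instead of repeated append
decade_frames = [sixties, seventies, eighties, nineties, thousands, tens]
songs_sampled = pd.concat(
    [d.sort_values(by='Rank')[0:100] for d in decade_frames]
).reset_index()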

Let's plot the distribution of our sampled songs to ensure we have a good representative sample over time.

In [133]:
years = np.arange(1958, 2020)
year_counts = [0] * len(years)
for row_index, row in songs_sampled.iterrows():
    song_year = songs_sampled.at[row_index, 'Year']
    year_counts[song_year-1958] += 1

fig, ax = plt.subplots(figsize=(15, 10))
ax = sns.barplot(years, year_counts)

counter = 0
for label in ax.get_xticklabels():
    if counter % 10 != 0:
        label.set_visible(False)
    counter += 1

Looks like a good sample! Of course, it's not perfect, since there appear to be a few years that had quite a few more songs than others. This could be due to our dropping duplicate rows, if some spectacular songs stayed on the charts across multiple years. Either way, though, this works for what we want to do.

Getting Song Lyrics with LyricsGenius

To get the song lyrics we need, we could manually find each song on a website like Genius and scrape the lyrics... but we can be smarter about it. Genius has a free API (which requires project registration) for searching its song catalog, and a package by John W. Miller called LyricsGenius builds on it to fetch lyrics for us. This package lets us query the Genius API much more intuitively, making this task much easier.

Before using it for our purposes, let's take a look at how it works.

In [134]:
genius = lyricsgenius.Genius('developer-token')
genius.remove_section_headers = True

The above code sets up our LyricsGenius object with a developer token (which can be generated for a new project here) and tells it to return the plain text of the lyrics without section headers like [Chorus]. If we wanted to analyze only the choruses of our songs, we could leave that setting disabled so the headers remain and the choruses can be picked out.

To actually get the song lyrics, we can just pass in a song and artist name into the LyricsGenius object.

In [135]:
example = genius.search_song('Never Gonna Give You Up', 'Rick Astley')
Searching for "Never Gonna Give You Up" by Rick Astley...
Done.

This either returns None if the song isn't found, or a Song object from which we can get the lyrics:

In [136]:
print(example.lyrics[0:377])
We're no strangers to love
You know the rules and so do I
A full commitment's what I'm thinking of
You wouldn't get this from any other guy

I just wanna tell you how I'm feeling
Gotta make you understand

Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you

Perfect! Let's throw this in a function that we can use for our data later on.

In [137]:
def get_lyrics(artist, song):
    temp = genius.search_song(song, artist)
    if temp is not None:
        return temp.lyrics
    else:
        return None

Calculating Polarity Scores with VADER

There are many packages for sentiment analysis, but we'll be using VADER, which can be found here. As stated on its webpage, this tool is "specifically attuned to sentiments expressed in social media," which includes emojis, typed emoticons, and slang. The emojis aren't of much use to us, but the flexibility with slang is what made it particularly attractive for this project. VADER also differs from other tools in that it doesn't require training data: its sentiment lexicon is fully human-curated, which is why we consider this EDA rather than ML.

As with LyricsGenius, we will start by creating a SentimentIntensityAnalyzer. Then, we can call its polarity_scores() method to get our analyzed sentiment scores. VADER places emphasis on keeping the original text rather than tokenizing it, so that's what we'll do. Let's test this on the lyrics we scraped earlier.

In [138]:
analyzer = SentimentIntensityAnalyzer()
analyzer.polarity_scores(example.lyrics)
Out[138]:
{'neg': 0.152, 'neu': 0.816, 'pos': 0.032, 'compound': -0.9951}

We can see that VADER returns a dictionary of 4 values: 'neg', 'neu', 'pos', and 'compound'. The first three correspond to the negative, neutral, and positive scores, respectively. The compound score is computed from the summed valence of the individual words and then normalized to the range [-1, 1], representing "extremely negative" to "extremely positive". More information about exactly how the compound score is calculated can be found here.
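For intuition, the normalization step is roughly the following (a sketch based on VADER's published implementation, where score_sum stands for the adjusted sum of the word valences and alpha defaults to 15):

import math

def vader_normalize(score_sum, alpha=15):
    # Maps an unbounded sum of word valences into (-1, 1);
    # alpha approximates the largest magnitude we expect to see.
    return score_sum / math.sqrt(score_sum * score_sum + alpha)

print(vader_normalize(5.0))   # about 0.79: clearly positive text
print(vader_normalize(-2.0))  # about -0.46: mildly negative text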

Putting It Together!

Now, we can define a function that will first get the lyrics using LyricsGenius and then analyze them using VADER. We'll use it to score the songs in our songs_sampled DataFrame. Note that the function returns NaN values if the lyric query failed.

In [139]:
def get_sen_scores(artist, song):
    lyrics = get_lyrics(artist, song)
    if lyrics is not None:
        scores = analyzer.polarity_scores(lyrics)
        return scores['pos'], scores['neu'], scores['neg'], scores['compound']
    else:
        return np.nan, np.nan, np.nan, np.nan

Then, we compute a decade label for each of our sampled songs, which will make plotting easier later on:

In [140]:
years = songs_sampled['Year']
decades = []
for year in years:
    if year < 1970:
        decades.append('1958-1969')
    elif (year >= 1970 and year < 1980):
        decades.append('1970-1979')
    elif (year >= 1980 and year < 1990):
        decades.append('1980-1989')
    elif (year >= 1990 and year < 2000):
        decades.append('1990-1999')
    elif (year >= 2000 and year < 2010):
        decades.append('2000-2010')
    else:
        decades.append('2010-2019')

The calculation of the polarity scores is mundane, computationally heavy, and takes a lot of time, so for simplicity's sake, we'll run through what the output would look like for the top three rows of the dataframe. Please just trust that we ran all of it on our own time, and saved you the trouble of seeing all the output.

In [141]:
sample_pos = []
sample_neu = []
sample_neg = []
sample_comp = []

# calculate polarity scores 
for row_index, row in songs_sampled[0:3].iterrows():
    song = songs_sampled.at[row_index, 'Song']
    artist = songs_sampled.at[row_index, 'Performer']
    pos, neu, neg, comp = get_sen_scores(artist, song)
    sample_pos.append(pos)
    sample_neu.append(neu)
    sample_neg.append(neg)
    sample_comp.append(comp)
    
# create a dataframe of polarity scores
sample_polarity_vals = {'Pos_Score': sample_pos, 'Neu_Score': sample_neu, \
                        'Neg_Score': sample_neg, 'Comp_Score': sample_comp}
sample_polarity_scores = pd.DataFrame(data=sample_polarity_vals)
sample_polarity_scores.head(3)
Searching for "Smoke Gets In Your Eyes" by The Platters...
Done.
Searching for "Apache" by Jorgen Ingmann & His Guitar...
No results found for: 'Apache Jorgen Ingmann & His Guitar'
Searching for "Calcutta" by Lawrence Welk And His Orchestra...
Done.
Out[141]:
Pos_Score Neu_Score Neg_Score Comp_Score
0 0.283 0.599 0.118 0.9792
1 NaN NaN NaN NaN
2 0.134 0.755 0.112 0.9999

So we ran that for all 600 songs (battling name resolution and connection errors along the way) and saved the results to a CSV file called "Polarity Scores" for easy access.
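For reference, a minimal sketch of that full run, assuming the get_sen_scores() helper from above (the real run needed a few retries around connection errors):

# Sketch of the full 600-song run, executed offline and saved to CSV
all_scores = {'Pos_Score': [], 'Neu_Score': [], 'Neg_Score': [], 'Comp_Score': []}
for row_index, row in songs_sampled.iterrows():
    pos, neu, neg, comp = get_sen_scores(row['Performer'], row['Song'])
    all_scores['Pos_Score'].append(pos)
    all_scores['Neu_Score'].append(neu)
    all_scores['Neg_Score'].append(neg)
    all_scores['Comp_Score'].append(comp)
pd.DataFrame(all_scores).to_csv('Polarity Scores.csv', index=False)

With the CSV in hand, we'll read it back in and add "Year" and "Decade" columns, once again for ease of plotting.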

In [142]:
polarity_scores = pd.read_csv('Polarity Scores.csv')
polarity_scores['Decade'] = decades
polarity_scores['Year'] = songs_sampled['Year']
polarity_scores.reset_index(inplace=True)
polarity_scores.drop('index', axis=1, inplace=True)
polarity_scores.head(3)
Out[142]:
Pos_Score Neu_Score Neg_Score Comp_Score Decade Year
0 0.283 0.599 0.118 0.9792 1958-1969 1959
1 0.149 0.788 0.063 0.8934 1958-1969 1961
2 0.122 0.797 0.081 1.0000 1958-1969 1961

Given the nature of our data, a violin plot makes the most sense, since it lets us see how the compound scores are distributed within each decade, as well as how that distribution shifts from decade to decade.

In [143]:
fig, ax = plt.subplots(figsize=(15, 10))
ax = sns.violinplot(x='Decade', y='Comp_Score', data=polarity_scores, palette="husl")

At first glance, the plots look quite similar across the decades, all with high medians very close to 1. However, we can see that the proportion of negative scores increases over time, shown by the increasing thickness of the bottom half of the violins. To take a closer look at this, we can plot the mean over time and fit a regression line to it.

Let's first calculate the mean score for each year, and then plot!

In [144]:
polarity_types = ['Pos_Score', 'Neu_Score', 'Neg_Score', 'Comp_Score']
polarity_averages = polarity_scores.groupby(["Year"])[polarity_types].mean()
polarity_averages.reset_index(inplace=True)
polarity_averages.head(3)
Out[144]:
Year Pos_Score Neu_Score Neg_Score Comp_Score
0 1958 0.194667 0.724556 0.080889 0.558844
1 1959 0.215000 0.748571 0.036429 0.719686
2 1960 0.211400 0.770800 0.017800 0.986340
In [145]:
fig, ax = plt.subplots(figsize=(10, 7))

x = polarity_averages['Year']
y = polarity_averages['Comp_Score']

# Plot polarity score means per year
ax.plot(x, y,  'tab:blue')
ax.set_title("Mean Compound Score Per Year")
ax.set(ylabel='Compound Score')

# fit regression line
m, b = np.polyfit(x, y, 1)
ax.plot(x, (m*x + b), 'tab:red')
eq = ('comp_score = {}*year + {}').format(m.round(4), b.round(4))
print(eq)
comp_score = -0.0078*year + 16.0646

Looking at this plot, we can see that songs are generally becoming more negative over time, despite the variation from year to year.

Recall that the acoustic feature "valence" is a measure of how positive or negative the music sounds, so let's compare that plot to our plot above, which measures how positive or negative the lyrics are. Here, we reuse the valence plot from section 3A.

In [146]:
# Generate Plot
fig, ax = plt.subplots(figsize=(8.5, 6.5))

x = feature_averages['Year']

# Plot Valence
ax.plot(x, feature_averages['valence'], 'tab:orange')
ax.set_title("Valence in Songs Over Time")
ax.set(ylabel='Valence')

ax.set(xlabel='Year')

Interestingly, there seems to be a pattern here! It logically makes sense that songs that sound sad would also have more negative lyrics, but it looks like the data reflects this as well. If only there were a computational way to calculate any relationship between the two...

4. Machine Learning: Predicting the Patterns

Surprise, we do have a way to calculate any relationship there! Let's try to predict the compound sentiment analysis score based on two features: the year that the song was on the charts, and its valence, as determined by Spotify. Recall that we want to account for any relationship between those two variables, so we'll need to include an interaction term year * valence.
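To make the setup concrete, here's the form of the model we're about to fit (a sketch; b1, b2, and b3 are the coefficients statsmodels will estimate, and there is no intercept term because we'll pass the raw feature matrix straight to OLS below):

Comp_Score ≈ b1 * Year + b2 * Valence + b3 * (Year * Valence)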

However, we have to note... aside from the spec given by Spotify, we have no inside information on how the acoustic features were calculated. The spec says: "Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry)." We're working under the assumption that the lyrics were not involved in the calculation of valence, but we should keep in mind that sentiment analysis may have played a role.

4A: Setup and Model Training

At this point, you know how it goes: we have setup to do! This time, it's quite simple: we only need to calculate the interaction term of year and valence, in addition to building our feature matrix.

Also, note the last line in this cell. Based on the sentiment analysis functions in section 3B, if LyricsGenius was unable to acquire the lyrics for a given song, all of its scores will be NaN. We can't feed NaN values into the model, so we drop those rows.

In [147]:
# extract needed columns
years = songs_sampled['Year']
comp_score = polarity_scores['Comp_Score']
valence = songs_sampled['valence']
interaction = []

# calculate interaction term
for i in range(600):
    interaction.append(years[i] * valence[i])
    
# build DataFrame
regr_dict = { 'Year': years,
        'Valence': valence,
        'Interaction': interaction,
        'Comp_Score': comp_score
}
regr_data = pd.DataFrame(regr_dict)
regr_data = regr_data.dropna()
regr_data.head(3)
Out[147]:
Year Valence Interaction Comp_Score
0 1959 0.285 558.315 0.9792
1 1961 0.248 486.328 0.8934
2 1961 0.954 1870.794 1.0000

Now, we can split up our data into the feature matrix X (the predictors) and y (the column of values to predict). We'll use that to train a statsmodels linear regression model. The package makes this easy, but for more information, check out this tutorial! It covers both linear regression as a concept and how to implement it.

In [148]:
# separate data
X = regr_data[['Year', 'Valence', 'Interaction']]
y = regr_data['Comp_Score']

# train model
model = sm.OLS(y, X)
results = model.fit()
results.params
Out[148]:
Year            0.000248
Valence        18.993925
Interaction    -0.009533
dtype: float64

4B: Validation: Plotting Residuals

So we trained a model, but we still need to check its performance. Linear regression is a simple enough model that we didn't split our data into training and test sets, but we can still do some residual plotting. Here we compute each residual as the predicted value minus the measured value (the opposite sign convention is also common), which tells us how close each prediction was.

To plot our values, we need to once again calculate what decade each song was in. Due to the dropped NaN rows, our previous ones from the songs_sampled DataFrame are no longer accurate.

In [149]:
decades = []
for row_index, row in regr_data.iterrows():
    year = regr_data.at[row_index, 'Year']
    if year < 1970:
        decades.append('1958-1969')
    elif (year >= 1970 and year < 1980):
        decades.append('1970-1979')
    elif (year >= 1980 and year < 1990):
        decades.append('1980-1989')
    elif (year >= 1990 and year < 2000):
        decades.append('1990-1999')
    elif (year >= 2000 and year < 2010):
        decades.append('2000-2010')
    else:
        decades.append('2010-2019')
regr_data['Decade'] = decades
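Since this is the second time we've written the same bucketing, a small helper would avoid the duplication. A sketch (the decade_label name is ours):

# Optional refactor: one function for the decade bucketing used twice above
def decade_label(year):
    if year < 1970:
        return '1958-1969'
    start = (year // 10) * 10
    if start == 2000:
        return '2000-2010'   # matches the label used earlier
    return '{}-{}'.format(start, start + 9)

# usage: regr_data['Decade'] = regr_data['Year'].apply(decade_label)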

Now, let's calculate our residuals by using our trained model!

In [150]:
residuals = []
for row_index, row in regr_data.iterrows():
    y = regr_data.at[row_index, 'Year']
    v = regr_data.at[row_index, 'Valence']
    i = regr_data.at[row_index, 'Interaction'] 
    measured = regr_data.at[row_index, 'Comp_Score']
    predicted = results.predict([y, v, i])[0]
    residuals.append(predicted-measured)
regr_data['Residual'] = residuals
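As an aside, because X shares regr_data's rows and index, the same residuals can be computed in one vectorized step; a minimal sketch:

# Vectorized equivalent of the loop above (predicted minus measured)
regr_data['Residual'] = results.predict(X) - regr_data['Comp_Score']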

Finally, let's use another violin plot to examine the distribution of our residuals.

In [151]:
fig, ax = plt.subplots(figsize=(15, 10))
ax = sns.violinplot(x='Decade', y='Residual', data=regr_data, palette="husl")

We can see that the median residuals are quite close to 0. We would usually expect a more Gaussian distribution, but our sample size is small enough that this isn't necessarily the case. The residual plot looks almost like an upside-down mirror image of the compound score plot. If our model is a good fit, it makes sense that the large number of negative residuals in this plot corresponds to the large number of high compound scores in the plot before, and vice versa.

5. Another Interpretation: WordClouds

After some puzzling regression modeling... let's do some data visualization! We can stare at numbers all day, but an alternate way of seeing our data may help us interpret it differently.

5A. Visualizing Lyrics

Word clouds are a really cool way of visualizing text! All of the unique words in a text are placed in a bubble, with higher-frequency words taking up more space and lower-frequency words taking up less. We have shown in this tutorial that music has been changing over time, whether through its acoustic features, lyrics, or popularity. So far, we have primarily been visualizing numeric data, so let's take some time to visualize the actual song lyrics we scraped! To do this, we are going to use Python's wordcloud library, though there are also free online generators available.

We will start off by storing the lyrics of popular songs from each decade in variables. Here, we reuse our DataFrame of sampled songs, which already has the decades in chronological order. The code below for storing scraped lyrics is very similar to the code we used to calculate the polarity scores in section 3B: we went through the 600 rows of the sampled songs DataFrame in 100-row sections, where each 100 rows represents a decade. As with the sentiment analysis, we've included only a snippet of the code here to reduce redundancy and ran the full version separately.

In [152]:
# populate sample lyrics variable 
sample_lyrics = ""
for row_index, row in songs_sampled[0:3].iterrows():
    song = songs_sampled.at[row_index, 'Song']
    artist = songs_sampled.at[row_index, 'Performer']
    temp = get_lyrics(artist, song)
    if (temp is not None):
        sample_lyrics += temp     
Searching for "Smoke Gets In Your Eyes" by The Platters...
Done.
Searching for "Apache" by Jorgen Ingmann & His Guitar...
No results found for: 'Apache Jorgen Ingmann & His Guitar'
Searching for "Calcutta" by Lawrence Welk And His Orchestra...
Done.
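For completeness, here's a sketch of what the full per-decade collection might look like (run separately; it assumes the 100-rows-per-decade layout of songs_sampled, and the decade_lyrics name is ours):

# Sketch of the full lyric collection, one big string per decade
decade_labels = ['1958-1969', '1970-1979', '1980-1989',
                 '1990-1999', '2000-2010', '2010-2019']
decade_lyrics = {}
for i, label in enumerate(decade_labels):
    block = songs_sampled[i * 100:(i + 1) * 100]
    lyrics = ""
    for row_index, row in block.iterrows():
        temp = get_lyrics(row['Performer'], row['Song'])
        if temp is not None:
            lyrics += temp + "\n"
    decade_lyrics[label] = lyrics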

Now that we have all of our lyrics stored, we can start making our word clouds! We generated all of our clouds and saved them as PNGs so that we could examine them side by side (and avoid redundant code). Here is an example of how to make a cloud.

In [153]:
# Generate Word Cloud 
text = sample_lyrics

# Create and generate a word cloud image
wordcloud = WordCloud(max_words=100, background_color="white").generate(text)

# Display the generated image
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
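To save a cloud as a PNG (like we did for the side-by-side figure in the next section), the wordcloud package provides a to_file() method; the filename here is just an example:

# Save the generated cloud to a PNG for later comparison
wordcloud.to_file("sixties_cloud.png")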

5B. Word Clouds!

Finally, we can take a look at word clouds of popular song lyrics from each of the decades we initially examined!

wordclouds_2.png (word clouds of popular song lyrics from each decade)

Wow. Looking at the 2010s compared to the 1960s, there are definitely more negative words, which we'll leave for you to find yourself. Interestingly, the word "love" became smaller and less used in songs as the decades went on! Words like "know", "got", and "oh" are also unexpectedly common in songs across all of the decades. Although we made these for fun, we can see that our sentiment scores from section 3B match what we observe in the word clouds: while all of them have positive words, the number of negative words increases over time.

6. Insights and Conclusions

We've been on a long journey through this whole project, so if you made it to the end, congratulations! Let's recap everything we've done so far.

When we started this project, we wanted to explore how music has changed over time, and there were multiple ways to do this. We narrowed it down to Spotify's acoustic features for an analysis of the music, and sentiment analysis using LyricsGenius and VADER for the lyrics.

We found that over time, songs became increasingly negative, both in how they sound (according to Spotify's valence scores) and in their words (according to our sentiment analysis). Finally, we trained a model to see whether there was a correlation between the two scores, and we definitely found one. Songs with more positive lyrics tend to have happier-sounding music, while sad lyrics tend to be accompanied by music you could cry to.

For future work, it could be interesting to switch from only looking at music to including some other subjects. After all, there has to be some kind of reason why songs have been getting more negative as time goes on. So if you happen to need an idea for how to apply all the skills we've taught you, feel free to use that one. If you do, tell us about it, and we'll get back to you in 14-30 business days!

Happy coding!