Extracting Spotify data on your favourite artist via Python

Dec 30, 2018

Spotify is one of the most popular streaming platforms in the world. They also have an API for developers to utilise their huge database of music to build interesting applications and uncover insights into our listening habits.

In today’s post I will show how to pull song data from all the albums of a chosen artist. You can see the results of the data collection here.

Before getting started you need:

Spotify API permissions and credentials, which you can apply for here. Simply log in, go to your “dashboard”, select “create client id” and follow the instructions. Spotify is not too strict about granting permissions, so put anything you like when asked about commercial applications.
The spotipy Python module, installed (for example with pip install spotipy) and ready to import.

Once you have all that, you can dive into Spotify’s API via Python.

Set up your modules and variables

After importing your modules, you need to select your artist and set up your credentials which you should have at hand.

Then quickly run a search query to make sure everything works.

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials  # To access authorised Spotify data

client_id = "{spotify client id}"
client_secret = "{spotify secret id}"
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)  # Spotify object to access the API

name = "{Artist Name}"  # chosen artist
result = sp.search(name)  # search query
result['tracks']['items'][0]['artists']

You should see something like this:

[{'external_urls': {'spotify': 'https://open.spotify.com/artist/26VFTg2z8YR0cCuwLzESi2'},
  'href': 'https://api.spotify.com/v1/artists/26VFTg2z8YR0cCuwLzESi2',
  'id': '26VFTg2z8YR0cCuwLzESi2',
  'name': 'Halsey',
  'type': 'artist',
  'uri': 'spotify:artist:26VFTg2z8YR0cCuwLzESi2'}]

Now, if you’re happy with the first result, you can pull out all of the artist’s albums.
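
The first search hit is not guaranteed to be the artist you had in mind, so it can be worth checking the name before going further. Below is a minimal sketch of that check. The helper find_artist and the sample_result dictionary are my own illustrations (sample_result is a trimmed stand-in for what sp.search(name) returns, based on the output above), not part of spotipy.

```python
def find_artist(result, name):
    """Return the first artist entry in a track-search result whose name matches."""
    wanted = name.strip().lower()
    for track in result['tracks']['items']:
        for artist in track['artists']:
            if artist['name'].lower() == wanted:
                return artist
    return None  # no track in the result lists this artist


# Trimmed stand-in for a real search result, so the sketch runs without credentials.
sample_result = {
    'tracks': {
        'items': [
            {'artists': [{'id': '26VFTg2z8YR0cCuwLzESi2',
                          'name': 'Halsey',
                          'type': 'artist',
                          'uri': 'spotify:artist:26VFTg2z8YR0cCuwLzESi2'}]}
        ]
    }
}

artist = find_artist(sample_result, 'halsey')
print(artist['uri'] if artist else 'No matching artist found')
```

With a real client you would pass result and name from the set-up code instead of the sample data.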

Extract Spotify albums

We will store Spotify URIs (IDs essentially) and album names in separate lists for reference later.

# Extract the artist's URI
artist_uri = result['tracks']['items'][0]['artists'][0]['uri']

# Pull all of the artist's albums
sp_albums = sp.artist_albums(artist_uri, album_type='album')

# Store the albums' names and URIs in separate lists
album_names = []
album_uris = []
for item in sp_albums['items']:
    album_names.append(item['name'])
    album_uris.append(item['uri'])

album_names
album_uris
# Keep names and URIs in the same order to keep track of duplicate albums

album_uris should return something like this:

['spotify:album:2liBYuCYfW37CmNkub2BaH',
 'spotify:album:7AahTQHNRwEVIgyeAJtUJM', etc, etc]
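
One caveat: Spotify returns album lists in pages, so a single call may not cover an artist with a long discography. The sketch below shows one way to walk through every page. The helper collect_items, the fake pages and the made-up album URIs are assumptions for illustration only; they let the example run without credentials.

```python
def collect_items(first_page, fetch_next):
    """Gather 'items' from a paged response and from every page that follows it.

    fetch_next is any callable that takes a page and returns the next page,
    or None when there are no more pages.
    """
    items = []
    page = first_page
    while page is not None:
        items.extend(page['items'])
        page = fetch_next(page)
    return items


# Stand-in pages with made-up albums, keyed by the 'next' value that leads to them.
fake_pages = {
    None: {'items': [{'name': 'Album A', 'uri': 'spotify:album:a'}], 'next': 'page2'},
    'page2': {'items': [{'name': 'Album B', 'uri': 'spotify:album:b'}], 'next': None},
}


def fake_next(page):
    """Mimic a client that returns the next page, or None on the last page."""
    return fake_pages[page['next']] if page['next'] else None


all_albums = collect_items(fake_pages[None], fake_next)
album_names_demo = [album['name'] for album in all_albums]
album_uris_demo = [album['uri'] for album in all_albums]
print(album_names_demo)  # ['Album A', 'Album B']
```

With a real client, something like collect_items(sp.artist_albums(artist_uri, album_type='album'), sp.next) should behave the same way, since spotipy's next method returns the following page or None.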

Grab the songs from each album

Next, we loop through each album to extract the key data for each of its tracks.

def albumSongs(uri):
    album = uri  # assign album URI to a name
    spotify_albums[album] = {}  # create a dictionary for that specific album

    # Create keys with empty lists inside the nested dictionary for the album
    spotify_albums[album]['album'] = []
    spotify_albums[album]['track_number'] = []
    spotify_albums[album]['id'] = []
    spotify_albums[album]['name'] = []
    spotify_albums[album]['uri'] = []

    tracks = sp.album_tracks(album)  # pull data on album tracks
    for n in range(len(tracks['items'])):  # for each track
        spotify_albums[album]['album'].append(album_names[album_count])  # album name tracked via album_count
        spotify_albums[album]['track_number'].append(tracks['items'][n]['track_number'])
        spotify_albums[album]['id'].append(tracks['items'][n]['id'])
        spotify_albums[album]['name'].append(tracks['items'][n]['name'])
        spotify_albums[album]['uri'].append(tracks['items'][n]['uri'])

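Now you can apply the function to each album URI in the list to pull track data. You also need to create an empty dictionary called spotify_albums to store your Spotify album data, and a counter called album_count that the function uses to look up the current album’s name.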

spotify_albums = {}
album_count = 0
for i in album_uris:  # each album
    albumSongs(i)
    print("Songs from album " + str(album_names[album_count]) + " have been added to the spotify_albums dictionary")
    album_count += 1  # update the album count once all tracks have been added
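
albumSongs relies on the global variables spotify_albums, album_names and album_count being in step with each other. As an alternative, here is a sketch of the same bookkeeping written as a function that takes the album name and the track list as arguments and returns the columns. The name tracks_to_columns and the demo tracks are hypothetical.

```python
def tracks_to_columns(album_name, track_items):
    """Build the per-album dictionary of lists from a list of track dictionaries."""
    columns = {'album': [], 'track_number': [], 'id': [], 'name': [], 'uri': []}
    for track in track_items:
        columns['album'].append(album_name)
        for key in ('track_number', 'id', 'name', 'uri'):
            columns[key].append(track[key])
    return columns


# Made-up tracks shaped like the entries in sp.album_tracks(album)['items'].
demo_tracks = [
    {'track_number': 1, 'id': 'id1', 'name': 'First Song', 'uri': 'spotify:track:id1'},
    {'track_number': 2, 'id': 'id2', 'name': 'Second Song', 'uri': 'spotify:track:id2'},
]

demo_columns = tracks_to_columns('Demo Album', demo_tracks)
print(demo_columns['album'])  # ['Demo Album', 'Demo Album']
print(demo_columns['name'])   # ['First Song', 'Second Song']
```

In the real loop you could pair names and URIs with zip(album_names, album_uris) and store the result of each call under the album's URI, which removes the need for a separate counter.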

Grab audio features for each song

There’s probably a more space efficient way of doing this but I prioritise visibility and structure when writing code.

Here we add additional key-values to store the audio features of each album track and append the data into lists representing all the music tracks for that album.

def audio_features(album):
    #Add new key-values to store audio features
    spotify_albums[album]['acousticness'] = []
    spotify_albums[album]['danceability'] = []
    spotify_albums[album]['energy'] = []
    spotify_albums[album]['instrumentalness'] = []
    spotify_albums[album]['liveness'] = []
    spotify_albums[album]['loudness'] = []
    spotify_albums[album]['speechiness'] = []
    spotify_albums[album]['tempo'] = []
    spotify_albums[album]['valence'] = []
    spotify_albums[album]['popularity'] = []
    #create a track counter
    track_count = 0
    for track in spotify_albums[album]['uri']:
        #pull audio features per track
        features = sp.audio_features(track)
        
        #Append to relevant key-value
        spotify_albums[album]['acousticness'].append(features[0]['acousticness'])
        spotify_albums[album]['danceability'].append(features[0]['danceability'])
        spotify_albums[album]['energy'].append(features[0]['energy'])
        spotify_albums[album]['instrumentalness'].append(features[0]['instrumentalness'])
        spotify_albums[album]['liveness'].append(features[0]['liveness'])
        spotify_albums[album]['loudness'].append(features[0]['loudness'])
        spotify_albums[album]['speechiness'].append(features[0]['speechiness'])
        spotify_albums[album]['tempo'].append(features[0]['tempo'])
        spotify_albums[album]['valence'].append(features[0]['valence'])
        #popularity is stored elsewhere
        pop = sp.track(track)
        spotify_albums[album]['popularity'].append(pop['popularity'])
        track_count+=1
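
The function above makes one audio-features request per track. As far as I know, spotipy's audio_features also accepts a list of up to 100 tracks per call, so the number of requests can be cut down by batching. The sketch below only shows the batching step; chunked is a hypothetical helper and the URIs are made up.

```python
def chunked(items, size=100):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up URIs standing in for spotify_albums[album]['uri'].
demo_uris = ['spotify:track:demo{}'.format(n) for n in range(250)]
batch_sizes = [len(batch) for batch in chunked(demo_uris)]
print(batch_sizes)  # [100, 100, 50]
```

With a real client, each batch could be passed to sp.audio_features(batch), which should return one feature dictionary per track in the same order as the input.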

Loop through albums extracting the audio features

We need to add a random delay every few albums to avoid sending too many requests to Spotify’s API in a short time.
We will also set up print statements to track which album we are on, in case we encounter errors and want to know where in the data they happened.

import time
import numpy as np

sleep_min = 2
sleep_max = 5
start_time = time.time()
request_count = 0

for i in spotify_albums:
    audio_features(i)
    request_count += 1
    if request_count % 5 == 0:
        print(str(request_count) + " albums completed")
        time.sleep(np.random.uniform(sleep_min, sleep_max))
        print('Loop #: {}'.format(request_count))
        print('Elapsed Time: {} seconds'.format(time.time() - start_time))
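
The pause logic can also be pulled into a small helper, which makes it easy to try out without actually waiting. This is only a sketch: maybe_pause is a hypothetical name, and it uses the standard library's random module instead of numpy. The sleep argument exists so the demo can record the delays rather than sleep.

```python
import random
import time


def maybe_pause(count, every=5, low=2.0, high=5.0, sleep=time.sleep):
    """Sleep for a random duration after every `every`-th item; return the delay used."""
    if count % every != 0:
        return 0.0
    delay = random.uniform(low, high)
    sleep(delay)
    return delay


# Record the delays instead of sleeping, so the demo finishes instantly.
recorded = []
delays = [maybe_pause(n, sleep=recorded.append) for n in range(1, 11)]
print(len(recorded))  # 2 pauses: after the 5th and the 10th item
```

In the loop above, calling maybe_pause(request_count) would replace the explicit time.sleep call.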

Add data to a new dataframe

But first we will organise our data into a dictionary which can more easily be turned into a dataframe.

dic_df = {}
dic_df['album'] = []
dic_df['track_number'] = []
dic_df['id'] = []
dic_df['name'] = []
dic_df['uri'] = []
dic_df['acousticness'] = []
dic_df['danceability'] = []
dic_df['energy'] = []
dic_df['instrumentalness'] = []
dic_df['liveness'] = []
dic_df['loudness'] = []
dic_df['speechiness'] = []
dic_df['tempo'] = []
dic_df['valence'] = []
dic_df['popularity'] = []

for album in spotify_albums:
    for feature in spotify_albums[album]:
        dic_df[feature].extend(spotify_albums[album][feature])
        
len(dic_df['album'])

Now that we’ve organised our dictionary, we can convert it into a dataframe.

Convert into dataframe

Once we have all the data we can put it into a more familiar and easy-to-read format like a dataframe with rows and columns.

import pandas as pd

df = pd.DataFrame.from_dict(dic_df)
df

Spotify can list the same song more than once, so for each song name we keep only the most popular version and drop the rest.

Remove duplicates

print(len(df))
final_df = df.sort_values('popularity', ascending=False).drop_duplicates('name').sort_index()
print(len(final_df))
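
To make the rule explicit, here is the same idea in plain Python: for each song name keep the row with the highest popularity, then restore the original order. keep_most_popular and the sample rows are made up for illustration. On ties this sketch keeps the earliest row, which may not match what pandas picks.

```python
def keep_most_popular(rows):
    """For each song name keep the most popular row, preserving original order."""
    best = {}  # song name -> index of the best row seen so far
    for index, row in enumerate(rows):
        current = best.get(row['name'])
        if current is None or row['popularity'] > rows[current]['popularity']:
            best[row['name']] = index
    return [rows[i] for i in sorted(best.values())]


# Made-up rows: 'Song A' appears on two releases with different popularity.
demo_rows = [
    {'name': 'Song A', 'album': 'Album', 'popularity': 40},
    {'name': 'Song B', 'album': 'Album', 'popularity': 55},
    {'name': 'Song A', 'album': 'Album (Deluxe)', 'popularity': 62},
]

deduped = keep_most_popular(demo_rows)
print([(row['name'], row['album']) for row in deduped])
# [('Song B', 'Album'), ('Song A', 'Album (Deluxe)')]
```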

Save to CSV

final_df.to_csv("{a file location to store your csv}")

That’s how you can extract Spotify data on your favourite artist’s albums.