Spotify Project: Data Infrastructure Setup


Introduction

This project uses Spotify's Million Playlist Dataset.

The data set has roughly 70 million rows, too many to manage comfortably on our local machines, so we decided to use Google Cloud Platform (GCP) to manage the data. First, we uploaded the CSVs to a private GCP storage bucket; from there we loaded the data into BigQuery, GCP's distributed data warehouse, which lets us query the data however we like.
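For reference, the bucket-to-BigQuery step can be sketched roughly as below. This is only an illustration of the idea, not the exact commands we ran: the bucket, file, and table names are hypothetical placeholders, and it assumes each CSV has a header row and that schema autodetection is acceptable.

```python
def gcs_uri(bucket, blob_name):
    """Return the gs:// URI for an object in a Cloud Storage bucket."""
    return f"gs://{bucket}/{blob_name.lstrip('/')}"


def load_csv_into_bigquery(uri, table_id):
    """Load a CSV that already sits in Cloud Storage into a BigQuery table."""
    # Imported lazily so gcs_uri() works even without the package installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumption: the CSV has a header row
        autodetect=True,      # assumption: let BigQuery infer the schema
    )
    job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    job.result()  # block until the load job finishes
    return client.get_table(table_id).num_rows


# Hypothetical names, for illustration only.
example_uri = gcs_uri("my-private-bucket", "playlists/part-000.csv")
```

Calling load_csv_into_bigquery(example_uri, "my-project.my_dataset.playlists") would then run the load job, provided the credentials described in the setup steps are in place.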

GCP Interface

From here, we can retrieve, query, join, and manipulate the entire data set. Please follow the instructions below in this notebook to set up Jupyter so that it can query BigQuery directly.

Setup Instructions

The following setup should take just a few minutes.

1. Create a service account key.

Go to the spotted-d service account. Once you are on this page, find the “Actions” column, where a drop-down shown as three dots sits at the far right of the account's row. Click “Create Key”, which will download a key file to your local machine. Save it somewhere safe. :-)

If you plan to store it in this git project, make sure to put it in a folder that is git-ignored so that we don’t push it up to GitHub. I created a .gitignore file on this branch for this purpose. If you create a directory called “config” under the top-level directory of this repository and put your service key in there, it should be ignored automatically. Ask me if you want any help.
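If you want to double-check that the file you downloaded really is a service account key before wiring it up, a small standard-library check along these lines may help. The file name config/service-key.json is a hypothetical example; use whatever name your download has.

```python
import json
from pathlib import Path

# Fields that a Google service account key file contains.
REQUIRED_FIELDS = {"type", "project_id", "private_key", "client_email"}


def looks_like_service_account_key(path):
    """Return True if `path` is a JSON file shaped like a service account key."""
    try:
        data = json.loads(Path(path).read_text())
    except (OSError, ValueError):
        return False  # missing, unreadable, or not valid JSON
    return (
        isinstance(data, dict)
        and data.get("type") == "service_account"
        and REQUIRED_FIELDS.issubset(data)
    )


# Hypothetical location inside the git-ignored "config" directory.
print(looks_like_service_account_key("config/service-key.json"))
```

This only inspects the shape of the file; it does not contact Google or confirm that the key is still active.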

2. Set up implicit authentication with Google Cloud

If you are using a Mac, just run this on your command line, replacing [PATH] with the path to your key file:

export GOOGLE_APPLICATION_CREDENTIALS="[PATH]"

If you are using Windows (PowerShell):

$env:GOOGLE_APPLICATION_CREDENTIALS="[PATH]"

For more info, see the documentation on Python authentication.

It might be worth adding this export line to your shell profile (for example ~/.bash_profile or ~/.zshrc) so that it is set in every new terminal session.
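To confirm that the variable is visible to Python, you can run a quick check like the one below in a notebook cell or a Python shell. It is a minimal sketch using only the standard library; the status strings are just labels chosen for this example.

```python
import os
from pathlib import Path


def credentials_status(env=None):
    """Report whether GOOGLE_APPLICATION_CREDENTIALS points at a real file."""
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "unset"          # variable missing or empty
    if not Path(path).is_file():
        return "missing file"   # variable set, but nothing at that path
    return "ok"


print("GOOGLE_APPLICATION_CREDENTIALS:", credentials_status())
```

If this prints "unset" inside Jupyter even though your terminal shows the variable, Jupyter was probably started from a session where the variable had not been set; restart it from a terminal where the export has been run.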

3. Install the Google Cloud BigQuery Python package with pandas support.

Run this on your terminal:

pip install --upgrade "google-cloud-bigquery[pandas]"

For more on the above two steps, you can check out the GCP Jupyter documentation.

4. Insert this cell into your EDA notebook:

%load_ext google.cloud.bigquery
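Once the extension is loaded, a notebook cell that starts with %%bigquery df followed by a SQL statement stores the result in a DataFrame named df. If you would rather stay in plain Python, the same query can be sketched with the client library as below. The table name used here is a hypothetical placeholder, not the real table in our project, and preview_query is just a helper made up for this example.

```python
def preview_query(table, limit=10):
    """Build a SQL statement that previews the first rows of `table`."""
    limit = int(limit)
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return f"SELECT * FROM `{table}` LIMIT {limit}"


def run_query(sql):
    """Run `sql` in BigQuery and return the result as a pandas DataFrame."""
    # Imported lazily so preview_query() works without the package installed.
    from google.cloud import bigquery

    # The client picks up the key referenced by GOOGLE_APPLICATION_CREDENTIALS.
    client = bigquery.Client()
    return client.query(sql).to_dataframe()


# Hypothetical table name, for illustration only.
sql = preview_query("my-project.my_dataset.playlists", limit=5)
print(sql)
```

Calling run_query(sql) then returns the rows as a DataFrame, assuming the credentials from step 2 are set and the table exists.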
