Loading IMDB data into postgresql

IMDB (Internet Movie Database) makes the movie dataset available for free download at https://datasets.imdbws.com/. The documentation for this dataset can be found at https://www.imdb.com/interfaces/.

In this blog post, I show you how to load this data into a PostgreSql database. The steps are executed from an Ubuntu Linux workstation. I assume that you already have a postgresql database with about 50Gb of free space to upload this data into and you know the connection information.

We’ll use the s32imdbpy.py script that can be downloaded from github .

From the ubuntu workstation, install the following packages

sudo apt install python3-pip
sudo apt-get install libpq-dev

Now install the following python modules

pip3 install Psycopg2
pip3 install imdbpy

Create a new directory, to store the downloaded datafiles and download the datafiles from https://datasets.imdbws.com into this directory

mkdir dat
cd dat
wget https://datasets.imdbws.com/name.basics.tsv.gz
wget https://datasets.imdbws.com/title.akas.tsv.gz
wget https://datasets.imdbws.com/title.basics.tsv.gz
wget https://datasets.imdbws.com/title.crew.tsv.gz
wget https://datasets.imdbws.com/title.episode.tsv.gz
wget https://datasets.imdbws.com/title.principals.tsv.gz
wget https://datasets.imdbws.com/title.ratings.tsv.gz

Login to your postgresql database and create a new schema to hold the imdb tables (This step is optional. If you do not create this schema, then the tables and the corresponding data gets loaded into the public schema).

create schema imdb;

Before you run the script, lets edit the script and make one change, which will enable the script to load the data into the newly created imdb schema. This change will be made in line 183 in the file.

Change
engine = sqlalchemy.create_engine(db_uri, encoding='utf-8', echo=False)
To
engine = sqlalchemy.create_engine(db_uri, encoding='utf-8', echo=False,connect_args={'options': '-csearch_path={}'.format('imdb')})

Then Execute the script as shown below

python3 s32imdbpy.py /home/ubuntu/dat postgresql://username:password@dbhostname/dbname

Where /home/ubuntu/dat is the directory where the imdb files are downloaded into.

This will take some time to load (Close to an hour on a reasonably sized ubuntu workstation) and will consume about 15Gb of space in your postgresql database.

Leave a Reply

Your email address will not be published. Required fields are marked *