Introduction to the project
In this series of blog posts we will present an end-to-end database project using MySQL and the IMDb dataset. The code and images for this project are shared in the associated GitHub repository. The purpose of this project was originally to be pedagogical and now we will share what we learnt along the way. In this project we designed and built a MySQL database to store the Internet Movie Database (IMDb) dataset. We then explored and visualised the data. This project required a fair amount of work, so this project will be presented in four parts. In this first post we will introduce the dataset and summarise the most import aspects of the project.
The purpose of this project was to do the following:
-
Learn about and use the database management system MySQL.
-
Learn the essentials of database design, e.g., Entity-Relationship diagrams, logical schema, and database normalisation.
-
To perform ETL (Extract-Transform-Load) on the IMDb data using python.
-
Practice database querying by posing basic and more complex queries using MySQL directly and also by using python.
-
Visualise IMDb data using python.
The tangible steps we took in this project are:
-
Understand the data in the IMDb dataset.
-
Design a relational database and store the IMDb data in it.
-
Modelled the database using an Entity-Relationship (ER) and logical schema diagrams.
- Performed ETL, i.e., we extracted the data from the tab separated files and transformed it into normalised and restructured tables using python, which were then loaded into new files.
- Created a MySQL database.
- Loaded transformed data into the normalised database.
- Added primary and foreign key constraints.
-
Created database indexes.
-
Asked questions of the IMDb data, so as to practice simple and more complex SQL queries.
-
Performed further exploration and also visualisation of the data using python.
-
Throughout we adhered to SQL style conventions, e.g., please see SQL Style guide https://www.sqlstyle.guide/. In particular, underscores were used in attribute names rather than camel case, which is used in the IMDb data files.
Introduction to the IMDb dataset
The IMDb dataset consists of 7 compressed tab-separated-value (*.tsv) files, which are explained and available for download from here. The data is refreshed daily, although the data used in this project was obtained on 29/11/2019. Each of these gzipped tab-separated-values (TSV) formatted files in the UTF-8 character set. The first line in each file contains headers that describe what is in each column. A “\N” is used to denote that a particular field is missing or has a NULL value for that title or name. It should be noted that the data available for download from the IMDb website is not the full dataset, but it will suffice for our purposes. Later we will discuss some of the data issues. The available IMDb data files are as follows:
name.basics.tsv.gz
Contains the following information for names:
- nconst (string) - alphanumeric unique identifier of the name/person.
- primaryName (string)– name by which the person is most often credited.
- birthYear – in YYYY format.
- deathYear – in YYYY format if applicable, else “\N”.
- primaryProfession (array of strings) – the top-3 professions of the person.
- knownForTitles (array of tconsts) – titles the person is known for.
title.basics.tsv.gz
Contains the following information for titles:
- tconst (string) - alphanumeric unique identifier of the title.
- titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc).
- primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release.
- originalTitle (string) - original title, in the original language.
- isAdult (boolean) - 0: non-adult title; 1: adult title.
- startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year.
- endYear (YYYY) – TV Series end year. “\N” for all other title types.
- runtimeMinutes – primary runtime of the title, in minutes.
- genres (string array) – includes up to three genres associated with the title.
title.akas.tsv.gz
Contains the following information for titles:
- titleId (string) - a tconst which is an alphanumeric unique identifier of the title.
- ordering (integer) – a number to uniquely identify rows for a given titleId.
- title (string) – the localised title.
- region (string) - the region for this version of the title.
- language (string) - the language of the title.
-
types (array) - Enumerated set of attributes for this alternative title. One or more of the following: “alternative”, “dvd”, “festival”, “tv”, “video”, “working”, “original”, “imdbDisplay”. New values may be added in the future without warning. Please note that types is said to be an array. In the data we have this appears to not be true. There appears to be only one string for each pair of titleId and ordering values. Also, there are many NULL (\N) values in this field (~95%).
-
attributes (array) - Additional terms to describe this alternative title, not enumerated. Please note that attributes is said to be an array. In the data we have this appears to not be true. There appears to be only one string for each pair of titleId and ordering values. There are many NULL (\N) values in this field (~99%).
-
isOriginalTitle (boolean) – 0: not original title; 1: original title.
title.crew.tsv.gz
Contains the director and writer information for all the titles in IMDb. Fields include:
- tconst (string) - alphanumeric unique identifier of the title.
- directors (array of nconsts) - director(s) of the given title.
- writers (array of nconsts) – writer(s) of the given title.
title.episode.tsv.gz
Contains the tv episode information. Fields include:
- tconst (string) - alphanumeric identifier of episode.
- parentTconst (string) - alphanumeric identifier of the parent TV Series.
- seasonNumber (integer) – season number the episode belongs to.
- episodeNumber (integer) – episode number of the tconst in the TV series.
title.principals.tsv.gz
Contains the principal cast/crew for titles
- tconst (string) - alphanumeric unique identifier of the title.
- ordering (integer) – a number to uniquely identify rows for a given titleId.
- nconst (string) - alphanumeric unique identifier of the name/person.
- category (string) - the category of job that person was in.
- job (string) - the specific job title if applicable, else “\N”.
- characters (string) - the name of the character played if applicable, else “\N” (It is really “[role1,role2,….]” or “\N”).
title.ratings.tsv.gz
Contains the IMDb rating and votes information for titles
- tconst (string) - alphanumeric unique identifier of the title.
- averageRating – weighted average of all the individual user ratings.
- numVotes - number of votes the title has received.
IMDb dataset license details
Subsets of IMDb data are available for access to customers for personal and non-commercial use. You can hold local copies of this data, and it is subject to our terms and conditions. Please refer to the Non-Commercial Licensing and copyright/license and verify compliance.
A few WARNINGs about the IMDb dataset
The IMDb data base contains details of adult titles. If this is likely to offend you or others that see your work, then you can filter these titles out by using the field isAdult in title.basics.tsv.gz.
The IMDb dataset is a noisy dataset. In particular, there are some issues with missing data which affected how we added constraints to the database. These are discussed in the note IMDb_data_issues.md in the GitHub repository. Of course, missing data could be obtained by scraping the IMDb site but this was beyond the scope of the project.
Conclusion
In this first post we have introduced our end-to-end database project and the IMDb dataset. In our next post we will look at a database design to store this data and we will perform the necessary ETL tasks using python.
Further Reading
- IMDb dataset can be found here.