In our third post in this series we will finally build the MySQL database. MySQL is an open-source relational database management system (DBMS) and it is one of the most used DBMS in the world. We will begin by creating the database, then we will load the transformed IMDb data into it, add key constraints, and then finally add some indexing to the tables. This will all be done using SQL scripts. The SQL commands to build the database described in the previous post (illustrated by the ER and logical schema diagrams) is contained in the following 4 SQL scripts:
- imdb-create-tables.sql
- imdb-load-data.sql
- imdb-add-constraints.sql
- imdb-index-tables.sql
Steps to Build the MySQL IMDb Database
1. Open MySQL in a terminal:
$ mysql -u root -p --local-infile
Enter your password.
2. Create IMDb database in MySQL:
The code to create the database is contained in imdb-create-tables.sql and is shown below
-- Delete IMDb database if necessary
DROP DATABASE IF EXISTS IMDb;
-- Create IMDb database
CREATE DATABASE IMDb;
-- Use IMDb database
USE IMDb;
-- Character set
-- want to be able to distinguish text with accents
ALTER DATABASE IMDb CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;
-- Create tables only
CREATE TABLE Titles (
title_id VARCHAR(255) NOT NULL, -- not null bc PK
title_type VARCHAR(50),
primary_title TEXT, -- some are really long
original_title TEXT, -- some are really long
is_adult BOOLEAN,
start_year INTEGER,
end_year INTEGER,
runtime_minutes INTEGER
);
CREATE TABLE Title_ratings (
title_id VARCHAR(255) NOT NULL, -- not null bc PK
average_rating FLOAT,
num_votes INTEGER
);
CREATE TABLE Aliases (
title_id VARCHAR(255) NOT NULL, -- not null bc PK
ordering INTEGER NOT NULL, -- not null bc PK
title TEXT NOT NULL,
region CHAR(4),
language CHAR(4),
is_original_title BOOLEAN
);
CREATE TABLE Alias_types (
title_id VARCHAR(255) NOT NULL, -- not null bc PK
ordering INTEGER NOT NULL, -- not null bc PK
type VARCHAR(255) NOT NULL-- Only stored if not null
);
CREATE TABLE ALias_attributes (
title_id VARCHAR(255) NOT NULL, -- not null bc PK
ordering INTEGER NOT NULL, -- not null bc PK
attribute VARCHAR(255) NOT NULL -- only stored if not null
);
CREATE TABLE Episode_belongs_to (
episode_title_id VARCHAR(255) NOT NULL, -- not null bc PK
parent_tv_show_title_id VARCHAR(255) NOT NULL,
season_number INTEGER,
episode_number INTEGER
);
CREATE TABLE Title_genres (
title_id VARCHAR(255) NOT NULL, -- not null bc PK
genre VARCHAR(255) NOT NULL -- not null bc PK
);
-- Names and name is a reserved word in MySQL, so we add an underscore
CREATE TABLE Names_ (
name_id VARCHAR(255) NOT NULL, -- not null bc PK
name_ VARCHAR(255) NOT NULL, -- everybody has a name
birth_year SMALLINT,
death_year SMALLINT
);
CREATE TABLE Name_worked_as (
name_id VARCHAR(255) NOT NULL, -- not null bc PK
profession VARCHAR(255) NOT NULL -- not null bc PK
);
-- NOTE: All 3 must must be used as the primary key
-- role is a reserved word in MySQL, so we add an underscore
CREATE TABLE Had_role (
title_id VARCHAR(255) NOT NULL, -- not null bc PK
name_id VARCHAR(255) NOT NULL, -- not null bc PK
role_ TEXT NOT NULL -- not null bc PK
);
CREATE TABLE Known_for (
name_id VARCHAR(255) NOT NULL, -- not null bc PK
title_id VARCHAR(255) NOT NULL -- not null bc PK
);
CREATE TABLE Directors (
title_id VARCHAR(255) NOT NULL, -- not null bc PK
name_id VARCHAR(255) NOT NULL -- not null bc PK
);
CREATE TABLE Writers (
title_id VARCHAR(255) NOT NULL, -- not null bc PK
name_id VARCHAR(255) NOT NULL -- not null bc PK
);
CREATE TABLE Principals (
title_id VARCHAR(255) NOT NULL, -- not null bc PK
ordering TINYINT NOT NULL, -- not null bc PK
name_id VARCHAR(255) NOT NULL,
job_category VARCHAR(255),
job TEXT
);
To run this code in the MySQL terminal type the following, but of course you should change the path to where you have placed the files.
mysql> SOURCE /Users/lappy/Git_repos_mine/MySQL_IMDb_Project/imdb-create-tables.sql
As one can see the tables a created very quickly.
3. Load the data into the MySQL database
The code to load the transformed data into the database is contained in imdb-load-data.sql and is shown below. Of course you should change the paths to reflect where you have placed the files.
-- SHOW VARIABLES LIKE "local_infile";
SET GLOBAL local_infile = 1;
-- Load Aliases.tsv into Aliases table
LOAD DATA LOCAL INFILE '/Users/lappy/Git_repos_mine/MySQL_IMDb_Project/Aliases.tsv'
INTO TABLE Aliases
COLUMNS TERMINATED BY '\t'
IGNORE 1 LINES;
-- Load Alias_attributes.tsv into Alias_attributes table
LOAD DATA LOCAL INFILE '/Users/lappy/Git_repos_mine/MySQL_IMDb_Project/Alias_attributes.tsv'
INTO TABLE Alias_attributes
COLUMNS TERMINATED BY '\t'
IGNORE 1 LINES;
-- Load Alias_types.tsv into Alias_types table
LOAD DATA LOCAL INFILE '/Users/lappy/Git_repos_mine/MySQL_IMDb_Project/Alias_types.tsv'
INTO TABLE Alias_types
COLUMNS TERMINATED BY '\t'
IGNORE 1 LINES;
-- Load Directors.tsv into Directors table
LOAD DATA LOCAL INFILE '/Users/lappy/Git_repos_mine/MySQL_IMDb_Project/Directors.tsv'
INTO TABLE Directors
COLUMNS TERMINATED BY '\t'
IGNORE 1 LINES;
-- Load Writers.tsv into Writers table
LOAD DATA LOCAL INFILE '/Users/lappy/Git_repos_mine/MySQL_IMDb_Project/Writers.tsv'
INTO TABLE Writers
COLUMNS TERMINATED BY '\t'
IGNORE 1 LINES;
-- Load Episode_belongs_to.tsv into Episode_belongs_to table
LOAD DATA LOCAL INFILE '/Users/lappy/Git_repos_mine/MySQL_IMDb_Project/Episode_belongs_to.tsv'
INTO TABLE Episode_belongs_to
COLUMNS TERMINATED BY '\t'
IGNORE 1 LINES;
-- Load Names_.tsv into Names_ table
LOAD DATA LOCAL INFILE '/Users/lappy/Git_repos_mine/MySQL_IMDb_Project/Names_.tsv'
INTO TABLE Names_
COLUMNS TERMINATED BY '\t'
IGNORE 1 LINES;
-- Load Name_worked_as.tsv into Name_worked_as table
LOAD DATA LOCAL INFILE '/Users/lappy/Git_repos_mine/MySQL_IMDb_Project/Name_worked_as.tsv'
INTO TABLE Name_worked_as
COLUMNS TERMINATED BY '\t'
IGNORE 1 LINES;
-- Load Known_for.tsv into Known_for table
LOAD DATA LOCAL INFILE '/Users/lappy/Git_repos_mine/MySQL_IMDb_Project/Known_for.tsv'
INTO TABLE Known_for
COLUMNS TERMINATED BY '\t'
IGNORE 1 LINES;
-- Load Principals.tsv into Principals table
LOAD DATA LOCAL INFILE '/Users/lappy/Git_repos_mine/MySQL_IMDb_Project/Principals.tsv'
INTO TABLE Principals
COLUMNS TERMINATED BY '\t'
IGNORE 1 LINES;
-- Load Had_role.tsv into Had_role table
LOAD DATA LOCAL INFILE '/Users/lappy/Git_repos_mine/MySQL_IMDb_Project/Had_role.tsv'
INTO TABLE Had_role
COLUMNS TERMINATED BY '\t'
IGNORE 1 LINES;
-- Load Titles.tsv into Titles table
LOAD DATA LOCAL INFILE '/Users/lappy/Git_repos_mine/MySQL_IMDb_Project/Titles.tsv'
INTO TABLE Titles
COLUMNS TERMINATED BY '\t'
IGNORE 1 LINES;
-- Load Title_genres.tsv into Title_genres table
LOAD DATA LOCAL INFILE '/Users/lappy/Git_repos_mine/MySQL_IMDb_Project/Title_genres.tsv'
INTO TABLE Title_genres
COLUMNS TERMINATED BY '\t'
IGNORE 1 LINES;
-- Load Title_ratings.tsv into Title_ratings table
LOAD DATA LOCAL INFILE '/Users/lappy/Git_repos_mine/MySQL_IMDb_Project/Title_ratings.tsv'
INTO TABLE Title_ratings
COLUMNS TERMINATED BY '\t'
IGNORE 1 LINES;
To run this code in the MySQL terminal type the following, but of course you should change the path to where you have placed the files.
mysql> SOURCE /Users/lappy/Git_repos_mine/MySQL_IMDb_Project/imdb-load-data.sql
4. Add constraints to the IMDb database
The code to add the primary key and foreign key constraints to the database is contained in imdb-add-constraints.sql and is shown below.
-- Add constraints individually
ALTER TABLE Names_
ADD CONSTRAINT Names_pri_key PRIMARY KEY (name_id);
ALTER TABLE Titles
ADD CONSTRAINT Titles_pri_key PRIMARY KEY (title_id);
ALTER TABLE Aliases
ADD CONSTRAINT Aliases_pri_key PRIMARY KEY (title_id,ordering);
ALTER TABLE Alias_attributes
ADD CONSTRAINT Alias_attributes_pri_key PRIMARY KEY (title_id,ordering);
ALTER TABLE Alias_types
ADD CONSTRAINT Alias_types_pri_key PRIMARY KEY (title_id,ordering);
ALTER TABLE Directors
ADD CONSTRAINT Directors_pri_key PRIMARY KEY (title_id,name_id);
ALTER TABLE Directors
ADD CONSTRAINT Directors_title_id_fkey FOREIGN KEY (title_id) REFERENCES Titles(title_id);
ALTER TABLE Writers
ADD CONSTRAINT Writers_pri_key PRIMARY KEY (title_id,name_id);
ALTER TABLE Writers
ADD CONSTRAINT Writers_title_id_fkey FOREIGN KEY (title_id) REFERENCES Titles(title_id);
ALTER TABLE Episode_belongs_to
ADD CONSTRAINT Episode_belongs_to_pri_key PRIMARY KEY (episode_title_id);
ALTER TABLE Episode_belongs_to
ADD CONSTRAINT Episode_belongs_to_ep_title_id_fkey FOREIGN KEY (episode_title_id) REFERENCES Titles(title_id);
ALTER TABLE Name_worked_as
ADD CONSTRAINT Name_worked_as_pri_key PRIMARY KEY (name_id,profession);
ALTER TABLE Name_worked_as
ADD CONSTRAINT Name_worked_as_name_id_fkey FOREIGN KEY (name_id) REFERENCES Names_(name_id);
ALTER TABLE Known_for
ADD CONSTRAINT Known_for_pri_key PRIMARY KEY (name_id,title_id);
ALTER TABLE Known_for
ADD CONSTRAINT Known_for_name_id_fkey FOREIGN KEY (name_id) REFERENCES Names_(name_id);
ALTER TABLE Principals
ADD CONSTRAINT Principals_pri_key PRIMARY KEY (title_id,ordering);
-- role_ is TEXT, so we need to add indexing length (255)
ALTER TABLE Had_role
ADD CONSTRAINT Had_role_pri_key PRIMARY KEY (title_id,name_id,role_(255));
ALTER TABLE Title_genres
ADD CONSTRAINT Title_genres_pri_key PRIMARY KEY (title_id,genre);
ALTER TABLE Title_genres
ADD CONSTRAINT Title_genres_title_id_fkey FOREIGN KEY (title_id) REFERENCES Titles(title_id);
ALTER TABLE Title_ratings
ADD CONSTRAINT Title_ratings_pri_key PRIMARY KEY (title_id);
ALTER TABLE Title_ratings
ADD CONSTRAINT Title_ratings_title_id_fkey FOREIGN KEY (title_id) REFERENCES Titles(title_id);
-- Issues with missing data in title.basics.tsv.gz, name.basics.tsv.gz, ...
-- Disable foreign key check lock
SET foreign_key_checks = 0;
-- Aliases has titles that do not exist in Titles, i.e., there are entries in
-- IMDb's title.akas.tsv.gz that are not present in title.basics.tsv.gz. The same
-- issue arises when setting the foreign key for the Alias_attributes and
-- Alias_types tables.
ALTER TABLE Aliases
ADD CONSTRAINT Aliases_title_id_fkey FOREIGN KEY (title_id) REFERENCES Titles(title_id);
ALTER TABLE Alias_attributes
ADD CONSTRAINT Alias_attributes_title_id_fkey FOREIGN KEY (title_id) REFERENCES Titles(title_id);
ALTER TABLE Alias_types
ADD CONSTRAINT Alias_types_title_id_fkey FOREIGN KEY (title_id) REFERENCES Titles(title_id);
-- Ditto for Episode_belongs_to table.
ALTER TABLE Episode_belongs_to
ADD CONSTRAINT Episode_belongs_to_show_title_id_fkey FOREIGN KEY (parent_tv_show_title_id) REFERENCES Titles(title_id);
ALTER TABLE Known_for
ADD CONSTRAINT Known_for_title_id_fkey FOREIGN KEY (title_id) REFERENCES Titles(title_id);
ALTER TABLE Principals
ADD CONSTRAINT Principals_name_id_fkey FOREIGN KEY (name_id) REFERENCES Names_(name_id);
ALTER TABLE Principals
ADD CONSTRAINT Principals_title_id_fkey FOREIGN KEY (title_id) REFERENCES Titles(title_id);
ALTER TABLE Had_role
ADD CONSTRAINT Had_role_title_id_fkey FOREIGN KEY (title_id) REFERENCES Titles(title_id);
ALTER TABLE Had_role
ADD CONSTRAINT Had_role_name_id_fkey FOREIGN KEY (name_id) REFERENCES Names_(name_id);
ALTER TABLE Directors
ADD CONSTRAINT Directors_name_id_fkey FOREIGN KEY (name_id) REFERENCES Names_(name_id);
ALTER TABLE Writers
ADD CONSTRAINT Writers_name_id_fkey FOREIGN KEY (name_id) REFERENCES Names_(name_id);
-- Enable foreign key check lock
SET foreign_key_checks = 1;
It is important to note that Aliases has titles that do not exist in Titles, i.e., there are entries in IMDb’s title.akas.tsv.gz that are not present in title.basics.tsv.gz. This leads to issues with the constraints. The same issue arises for the Alias_attributes and Alias_types tables. This is why we have disabled and then re-enabled the foreign key checks.
To run this code in the MySQL terminal type the following, but of course you should change the path to where you have placed the files.
mysql> SOURCE /Users/lappy/Git_repos_mine/MySQL_IMDb_Project/imdb-add-constraints.sql
5. Add indexes to the IMDb database
The code to add indexes to the tables in the database is contained in imdb-index-tables.sql and is shown below.
-- Alias_attributes
CREATE INDEX Alias_attributes_index ON Alias_attributes(title_id);
-- Alias_types
CREATE INDEX Alias_types_index ON Alias_types(title_id);
-- Aliases
CREATE INDEX Aliases_index ON Aliases(title_id);
-- Directors
CREATE INDEX Directors_title_id_index ON Directors(title_id);
CREATE INDEX Directors_name_id_index ON Directors(name_id);
-- Episode_belongs_to
CREATE INDEX Episode_belongs_to_ep_title_id_index ON Episode_belongs_to(episode_title_id);
CREATE INDEX Episode_belongs_to_show_title_id_index ON Episode_belongs_to(parent_tv_show_title_id);
-- Had_role
CREATE INDEX Had_role_title_id_index ON Had_role(title_id);
CREATE INDEX Had_role_name_id_index ON Had_role(name_id);
-- Known_for
CREATE INDEX Known_for_index ON Known_for(name_id);
-- Name_worked_as
CREATE INDEX Name_worked_as_index ON Name_worked_as(profession);
-- Names_
CREATE INDEX Names_index ON Names_(name_id);
-- Principals
CREATE INDEX Principals_index ON Principals(title_id);
-- Title_genres
CREATE INDEX Title_genres_title_id_index ON Title_genres(title_id);
CREATE INDEX Title_genres_genre_index ON Title_genres(genre);
-- Title_ratings
CREATE INDEX Title_ratings_index ON Title_ratings(title_id);
-- Titles
CREATE INDEX Titles_index ON Titles(title_id);
-- Writers
CREATE INDEX Writers_title_id_index ON Writers(title_id);
CREATE INDEX Writers_name_id_index ON Writers(name_id);
To run this code in the MySQL terminal type the following, but of course you should change the path to where you have placed the files.
mysql> SOURCE /Users/lappy/Git_repos_mine/MySQL_IMDb_Project/imdb-index-tables.sql
Conclusion
In our third post we finally built and indexed the MySQL database to store the IMDb dataset. In our next post we will look at querying the database and also make a few visualisations the data. For further details on database concepts and SQL we refer the interested reader to references below.
Further Reading
-
R. Ramakrishnan and J. Gehrke, Database management systems, 3rd edition, Mc Graw-Hill. Companion site
-
Yu-San Lin, CMPSC431W: Database Management Systems fall 2015 video lectures, Penn State. Course site