Last update: 2018-06-03

The database schema used by the Java Movie Database


Introduction

The diagram rendered below shows how the tables are connected and also reveals how much data is actually stored in the tables. Originally this diagram is completely interactive and generated by the Java application SchemaSpy. Some of the connections shown in the diagram are implied as it was not possible to create so called foreign key constraints in the database.

The reason is that the values from one column in a child table (e.g. persons referenced by the imdbpersonid column in table imdb_tsv_title_crew) where not present in the parent table (imdbpersonid column in table imdb_tsv_name_basics). This actually indicates that there is a serious error in the database design used by the Internet Movie Database (IMDb) or they simply have an error in their data export code. The export code might have an exclude filter (for whatever reason) for persons that are written into the name.basics.tsv.gz file while the same exclude filter is not used when the title.crew.tsv.gz file is written.

Some general comments on NoSQL and Relational Databases (RDBMS)

I suspect that the IMDb changed their old (relational) database backend for new NoSQL backend over the last few years (my guess is from 2013 to 2017). People think it is smart to use NoSQL because it is more flexible as you don't need database experts designing your database model as this can be done by the software developers. Some think NoSQL is faster and easier to maintain. Some think we have BigData (a very large dataset - but what is very large?) so NoSQL must be used to be fast (MapReduce). Some may also think it is a smart choice because is was and still is hyped in the IT press.

Most of the assumptions are normally wrong, so 99.X percent still would be better of with a relational database system instead of a NoSQL solution. The normal assumption is that developers can change the structure faster with NoSQL because they change the model in the code. This is not wrong but it produces a lot of problems which would fall on your feet immediately if you would use a relational database model with very strict constraints. The constraints are important because they are your protection against bad data quality in your data storage. You can not put data into the database until it meets the constraints put in place. Most constraints would have to be implemented in software (it depends a little bit on which NoSQL solution you are using). Well most software developers are lazy (thinking things through and looking at all side plus documenting it), so it would take a huge effort to make sure that this is handled properly. So basically you only moved the problem from one person (DBA/database expert) can easily apply the toolset of the database to protect you to another person (the Software Developer) which as normally to code some stuff to handle it in software. You should know that the database expert is most likely doing a much better job preventing that crappy/bad data flows into your data storage.

Next thing is, once the data is in your NoSQL database (lets assume it is some document store for key=value where the value is some JSON string) you have some structural changes. This was the reason why you might have chosen the new NoSQL storage over the relational database world. In you relational database the database expert had to apply a new column with constraints to prevent incorrect data to be inserted and the software developer had to wait until this is completed, while in the NoSQL world the software developer could do it himself. Again the developer is responsible not only to add/remove the field, he also needs to make sure the structural changes are handled correctly (with if...else... lines in the code and other checks).
But what about the size of the data found in the IMDb dataset. Isn't that large meaning BigData? As for the IMDb dataset we know (even if there is more in the background) this is not even near to be called BigData. The title_principals file contains, when you unwrap the comma separated IDs to rows, only 26,561,000 rows (referring to the file from 2018-01-05). Basically this is a simple mapping file here which allows you to go from a person to one of more movies or the other way around for a movie to get the persons involved. The number of rows is not really an issue even if there should be a lot more in this case. Currently for each movie this is limited to a maximum of 10 entries, so we are missing a lot of data here.

Relational database systems have been updated in the last 40 years with a lot of features. PostgreSQL for example has a KEY=VALUE store called hstore since PostgreSQL v8.2 which has been released on 2006-12-05. Newer Versions since PostgreSQL v9.2 released on 2012-09-10 have a JSON data store (which is based on the old hstore). This JSON data store has been enhanced to fully use indexing with JSONB in PostgreSQL v9.4 on 2014-12-18. There is also XML and Array support, so basically you have the basic NoSQL functionally already at hand in your relational database including full text indexing. It's even possible to have multiple servers worldwide that share the same data using the integrated replication service (a master and multiple slaves). This will introduce some issues as it takes some time until all slaves are provided with the same data. Normally you can't wait until all slave server ping back to the master to tell him yes, we got all the data, now you can deliver it to the user. So you will use the fire and forget method and assume that all of your slaves will get the data from you. This can speed up the query side if you have a lot of queries or your users are all around the world, so you replicate the data to each continent.

I'll stop this here as you are most likely interested in the database model and not my personal thoughts about NoSQL and relational database systems. You might also disagree that I think PostgreSQL is the best database you can use. There are a few reasons why you still might want to go to IBM with DB2, Microsoft SQL Server or Oracle for let's say a all on one solution for your data warehouse with reporting, but you should choose wisely. For most features you have to pay a lot of money when you choose those well known companies while the features you require are free in PostgreSQL. There is a reason why for example the Elster system in Germany (electronic tax management system) migrated form Oracle to PostgreSQL in 2014/2015.

Details about the database schema used

The database schema shown is based upon the internal JMDB v1.40 pre4 build from early January 2018. The used IMDb LIST files are from 2017-12-22 and the TSV files are from 2018-01-05.

JMDB database schema with implied relationships (build using SchemaSpy)

Menu

We need help!

We're looking for volunteers creating new language files.

If you want to help, contact us!

» Contact...

Donate!

If you want to support the development you can » donate!

PayPal

Advertisement