Data Science for the Silver Screen!

Flatiron Data Science - Module 1 Project

Welcome! My name is Will Dougherty, and I’m an Online, Self-Paced student in the Flatiron School’s Data Science program. As a relative newcomer to both the data and the science parts of Data Science, it’s been challenging but exciting working through the first module of the course.

Thanks to Flatiron, the course and instructors, I feel great about what I’ve been able to accomplish and learn. I’m thrilled to bring everything together for this first project.

The Project

Imagine, if you will, that I’ve somehow been hired as a data analyst for Microsoft’s brand-new film production division. Never mind how that happened, they just need to know what kinds of films they should produce — and I’m their go-to guy!

Using a few different data sets, I’m going to look at some basic characteristics of films and what the expected profitability would be.

In preparing for this project, I thought a lot about what to focus on, and what a large company like Microsoft would need to know before embarking on the lucrative but expensive undertaking of making films.

For example, if you’re just concerned with profit, then things like review scores are pretty meaningless: it’s easy to think of examples of acclaimed films that no one saw, or terrible films that made loads of money.

With that in mind, I decided to focus on two parameters: production budget, and genre. These are the most easily controlled and prioritized in the film production process, and one would assume they are among the first considerations when a production company is forming their business plan.


Luckily, I was provided with some large and very helpful datasets. I decided to mainly use two of them: The-Numbers, for its large, clear set of budget and revenue data, and TheMovieDB, for its genre information.

  • From here on I’ll use TN and TMDB to refer to these sets.

As for determining ‘box office success’, I focused on a few different measures.

  • Return on Investment (ROI) — this is computed by dividing the revenue by the budget.
  • For example, if a $5 million film earns $10 million, then (10 million / 5 million) = 2.0 ROI.
  • A 1.0 score means it breaks even, and anything below 1.0 is a loss.
  • Net Revenue — this is simply the gross revenue minus the production budget. So a $10 million film that makes $12 million would have a net revenue of $2 million.
  • Worldwide vs. Domestic — I also looked at whether we should prioritize worldwide or domestic returns, as this presumably has a big impact on how the film is produced, localized, marketed, and distributed — as well as potential profitability.
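To make the two main measures concrete, here is a minimal pandas sketch. The film titles, column names, and dollar figures (in millions) are made up for illustration and are not rows from the TN data.

```python
import pandas as pd

# Toy figures in millions of dollars; invented for illustration only.
films = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "budget": [5.0, 10.0, 20.0],
    "worldwide_gross": [10.0, 12.0, 15.0],
})

# ROI as defined above: revenue divided by budget (1.0 means break-even).
films["roi"] = films["worldwide_gross"] / films["budget"]

# Net revenue: gross revenue minus production budget.
films["net_revenue"] = films["worldwide_gross"] - films["budget"]

print(films)
```

Film A doubles its budget (ROI 2.0), Film B earns a modest $2 million net, and Film C falls below 1.0 and loses money.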

I broke up my analysis into two parts, focusing on budget and genre. I used standard Python libraries (pandas, matplotlib, seaborn) to organize and visualize the data.

Photo by Jp Valery on Unsplash

Production Budget

Looking at the budget information itself from the TN dataset, I found that films fell into four clear categories, based on simple quartile distribution; that is, roughly 25% of films fall into each of these categories.

  • $ 0–5 Million
  • $ 5–20 Million
  • $ 20–50 Million
  • $ 50 Million+
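One way to assign films to these four categories is `pd.cut` with fixed bin edges. This is only a sketch: the budgets and the label strings below are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical production budgets in dollars.
budgets = pd.Series([1_500_000, 5_000_000, 35_000_000, 150_000_000], name="budget")

# Bin edges matching the four categories listed above.
bins = [0, 5_000_000, 20_000_000, 50_000_000, float("inf")]
labels = ["$0-5M", "$5-20M", "$20-50M", "$50M+"]

# right=False makes each bin include its lower edge, so a budget of
# exactly $5 million lands in the $5-20M category.
budget_category = pd.cut(budgets, bins=bins, labels=labels, right=False)

print(budget_category.tolist())
```

If you would rather let the data pick the cut points, `pd.qcut(budgets, q=4)` splits a column into quartiles directly.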

Plotting the distribution of their respective ROIs (in worldwide and domestic contexts), we see how they stack up:

Using a lovely Seaborn boxplot, we can clearly see the distribution. The colored boxes represent the middle 50% of each category, and the central line is the median. Note also the large number of outliers (the black diamonds above the boxes) — many were far beyond the graphs’ upper bounds, so a median/distribution-focused visualization seems appropriate here.

The purple line marks 1.0 ROI, or the ‘break-even’ point. Films above this line make more than they cost, and films below lose money.
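A plot along these lines can be sketched with Seaborn and Matplotlib. The ROI values, column names, and the y-axis cap of 5 are all assumptions for illustration; the original plotting code is not shown in this article.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up ROI values for illustration only; not taken from the TN data.
films = pd.DataFrame({
    "budget_category": ["$0-5M"] * 4 + ["$50M+"] * 4,
    "worldwide_roi": [0.2, 0.7, 1.1, 12.0, 0.9, 1.8, 2.4, 3.1],
})

fig, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(data=films, x="budget_category", y="worldwide_roi", ax=ax)

# Break-even reference line at ROI = 1.0.
ax.axhline(1.0, color="purple", linestyle="--", label="Break-even (ROI = 1.0)")

# Cap the y-axis so extreme outliers do not flatten the boxes.
ax.set_ylim(0, 5)
ax.set_xlabel("Production budget")
ax.set_ylabel("Worldwide ROI")
ax.legend()
```

Capping the axis hides the most extreme outliers from view without removing them from the data, which keeps the medians and boxes readable.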


  • We see pretty clearly that in domestic ROI, all categories do poorly, with over half of the films failing to break even.
  • Worldwide, however, the top three categories do much better, with the vast majority of the 50+ million category making a profit.
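The same comparison can be checked numerically rather than read off a plot. Here is a small sketch, with invented ROI values and assumed column names, that computes the median ROI and the share of films above break-even per category.

```python
import pandas as pd

# Invented ROI values; column names are assumptions for this sketch.
films = pd.DataFrame({
    "budget_category": ["$0-5M", "$0-5M", "$50M+", "$50M+"],
    "domestic_roi": [0.4, 1.5, 0.8, 0.9],
    "worldwide_roi": [0.6, 2.0, 1.8, 2.5],
})

summary = films.groupby("budget_category").agg(
    median_domestic_roi=("domestic_roi", "median"),
    median_worldwide_roi=("worldwide_roi", "median"),
    # Fraction of films in the category with worldwide ROI above 1.0.
    share_profitable_worldwide=("worldwide_roi", lambda s: (s > 1.0).mean()),
)

print(summary)
```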

That’s all well and good, but what about trends over time? And what about net revenue?

I also decided to plot these categories over the period 2000–2020, and focus on Worldwide ROI/net.

Here we see the mean ROI, with large outliers removed as before. The shaded areas are the 95% confidence interval.
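As a rough sketch of the "remove large outliers, then average per year" step: the article does not say which outlier rule was used, so the 1.5 × IQR cutoff below is an assumption, as are the column names and the toy values.

```python
import pandas as pd

# Toy data: one extreme ROI value (100.0) among otherwise ordinary films.
films = pd.DataFrame({
    "year": [2000, 2000, 2000, 2000, 2000, 2001, 2001],
    "worldwide_roi": [1.0, 2.0, 3.0, 2.0, 100.0, 1.0, 3.0],
})

# Assumed outlier rule: drop anything above Q3 + 1.5 * IQR.
q1 = films["worldwide_roi"].quantile(0.25)
q3 = films["worldwide_roi"].quantile(0.75)
upper_bound = q3 + 1.5 * (q3 - q1)
trimmed = films[films["worldwide_roi"] <= upper_bound]

# Mean ROI per year on the trimmed data.
mean_roi_by_year = trimmed.groupby("year")["worldwide_roi"].mean()

print(mean_roi_by_year)
```

For the plot itself, Seaborn's `lineplot` aggregates repeated x values with the mean and shades a 95% confidence interval by default, so passing the trimmed frame to it would produce a chart of this kind.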

And here is the mean net revenue over the same period:


  • In the first graph, all the categories are somewhat in the ballpark of each other — except the lowest (0–5 million) category, which is often averaging below break-even.
  • Looking at net revenue, however, we see that the $50+ million category is vastly superior, and is on a largely upward trajectory (before 2020, that is; Covid has obviously impacted the industry in a profound way).

Based on this, we can recommend that Microsoft focus on maximizing worldwide profitability (presumably through marketing and distribution). A budget of $5 million seems to be the threshold above which films are largely successful. And if Microsoft has the funds, they should invest as much as possible, as the $50+ million category has the highest ROI and net revenues.

Photo by Arlinda on Unsplash


Genre

To look at genre, I combined the budget/revenue data from above (from TN) with data from TMDB.

Each film is assigned a list of genre ID numbers that looks something like this:

[35, 18, 10749]

Since this isn’t very useful on its own, I made a simple request to the TMDB API, and turned the response into a nice Python dictionary of IDs and genre names, all as strings:

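import requests

# Request the genre list from the TMDB API
response = requests.get(f"{api_key}&language=en-US")

# Build a dictionary of genre IDs and names, all as strings
genre_ids = {}
for entry in response.json()['genres']:
    genre_ids[str(entry['id'])] = str(entry['name'])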
# Result:
{'28': 'Action',
'12': 'Adventure',
'16': 'Animation',
'35': 'Comedy',
'80': 'Crime',
'99': 'Documentary',
'18': 'Drama',
'10751': 'Family',
'14': 'Fantasy',
'36': 'History',
'27': 'Horror',
'10402': 'Music',
'9648': 'Mystery',
'10749': 'Romance',
'878': 'Science Fiction',
'10770': 'TV Movie',
'53': 'Thriller',
'10752': 'War',
'37': 'Western'}

That’s better!
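With a dictionary like this in hand, translating a film's ID list into names is a short lookup. In this sketch, the helper name `genre_names` is hypothetical, and only the three entries needed for the example list are included. It also assumes the ID list may arrive as a string (as it would after reading a CSV), which is why `ast.literal_eval` appears.

```python
import ast

# Subset of the ID-to-name dictionary shown above.
genre_ids = {"35": "Comedy", "18": "Drama", "10749": "Romance"}

def genre_names(ids):
    """Translate a list of TMDB genre IDs into genre names."""
    return [genre_ids[str(genre_id)] for genre_id in ids]

# If the list was stored as text, parse it back into a real list first.
raw_value = "[35, 18, 10749]"
parsed_ids = ast.literal_eval(raw_value)

names = genre_names(parsed_ids)
print(names)  # ['Comedy', 'Drama', 'Romance']
```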

Now, after merging the datasets on title and release date (to avoid duplicate matches when several films share the same name), we can look at the profitability of films by genre. Note that films often have multiple genres (‘action/adventure’, ‘animation/fantasy’, or the very rare ‘horror/family’).
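Here is a sketch of why the merge uses both keys, and of one way to handle multi-genre films. The titles, dates, ROI values, and column names are all invented for illustration; `explode` is one possible approach, not necessarily the one used for the charts in this article.

```python
import pandas as pd

# Two different hypothetical films share the title "Twice Told".
tn = pd.DataFrame({
    "title": ["Twice Told", "Twice Told", "One Off"],
    "release_date": pd.to_datetime(["2005-06-01", "2015-06-01", "2012-03-09"]),
    "worldwide_roi": [1.5, 3.0, 0.8],
})
tmdb = pd.DataFrame({
    "title": ["Twice Told", "Twice Told", "One Off"],
    "release_date": pd.to_datetime(["2005-06-01", "2015-06-01", "2012-03-09"]),
    "genres": [["Drama"], ["Horror", "Mystery"], ["Comedy", "Romance"]],
})

# Merging on title alone cross-matches the two "Twice Told" films.
title_only = tn.merge(tmdb, on="title")

# Merging on title and release date keeps one row per film.
merged = tn.merge(tmdb, on=["title", "release_date"], how="inner")

# One row per (film, genre) pair, so each film counts toward every genre it has.
by_genre = merged.explode("genres")

print(len(title_only), len(merged), len(by_genre))
```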

So looking at the median ROI (worldwide) of each genre (of films made since 2000), we see this:

How lovely! We can see that almost all of the genres are at least mostly profitable (sorry, ‘Western’) but there are some that are much more so.

The median ROIs for ‘Animation’, ‘Horror’, ‘Mystery’, ‘Family’, ‘Fantasy’, and ‘Adventure’ are all above 2.5, which is quite good. Any of these genres would be a safe pick to focus on for Microsoft.
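A ranking like this comes down to a filter, a `groupby`, and a median. The sketch below uses invented ROI values and assumed column names, with one row per film and genre pair.

```python
import pandas as pd

# Invented values; one row per (film, genre) pair.
by_genre = pd.DataFrame({
    "genre": ["Horror", "Horror", "Horror", "Western", "Western"],
    "year": [1995, 2005, 2010, 2003, 2012],
    "worldwide_roi": [9.0, 4.0, 2.0, 0.5, 1.0],
})

# Keep films made since 2000, then rank genres by median worldwide ROI.
recent = by_genre[by_genre["year"] >= 2000]
median_roi = (
    recent.groupby("genre")["worldwide_roi"]
    .median()
    .sort_values(ascending=False)
)

print(median_roi)
```

Using the median rather than the mean keeps a single runaway hit from dominating a genre's score.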

Finally, let’s look at these top genres, and see if there’s anything we can learn from their ROIs over time.

After first looking at their median ROIs for each year from 2000 to 2020, I hit a bit of a snag:

Table: median ROI by genre, each year 2000–2020, with lots of NaN (missing data)

We can see here that there are a lot of ‘NaN’ entries until 2010, and then again in 2019. This is because too little data could be merged between the datasets for those years.

Since the aim of this project is to look at recent box office data, I narrowed the time period to 2010–2018, which is still substantial and recent enough for my purposes.
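A year-by-genre table of medians, and the narrowing to 2010–2018, can be sketched with `pivot_table` and a label slice. The values and column names below are made up; the point is only to show where the NaN entries come from.

```python
import pandas as pd

# Invented values; note the gaps (no Animation in 2009, no Horror in 2018).
by_genre = pd.DataFrame({
    "year": [2009, 2010, 2010, 2018, 2019],
    "genre": ["Horror", "Horror", "Animation", "Animation", "Horror"],
    "worldwide_roi": [2.0, 3.0, 2.5, 3.5, 1.0],
})

# Rows are years, columns are genres; missing combinations become NaN.
table = by_genre.pivot_table(
    index="year", columns="genre", values="worldwide_roi", aggfunc="median"
)

# Label-based slicing on the year index includes both endpoints.
narrowed = table.loc[2010:2018]

print(table)
print(narrowed)
```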

So after narrowing our focus and plotting the data, we see this:

This might be one of those line plots that only a mother could love, but there are still a few things we can learn from it.

  • Since there aren’t any clear upward or downward trends among the genres, we can really only see the volatility of their respective ROIs.
  • Horror and Mystery seem to be the most risky, as they have high peaks and low valleys, with Horror dipping below break-even in 2015.
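Volatility can also be put into numbers instead of being judged by eye. As a sketch, the standard deviation and range of each genre's yearly median ROI give a rough measure. The figures below are invented to mimic the shape described above and are not the actual values behind the plot.

```python
import pandas as pd

# Invented yearly median ROIs, indexed by year, one column per genre.
yearly = pd.DataFrame(
    {"Horror": [4.0, 0.8, 5.0], "Animation": [2.8, 3.0, 3.2]},
    index=[2014, 2015, 2016],
)

# Spread of each genre's yearly medians: a larger std means a bumpier ride.
volatility = yearly.agg(["min", "max", "std"]).T

# Flag genres whose yearly median ever fell below break-even.
dipped_below_break_even = (yearly < 1.0).any()

print(volatility)
print(dipped_below_break_even)
```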

Overall Conclusions

As a first real foray into some relatively simple data analysis, I’m excited to be able to see some real results, and give some general recommendations to Microsoft’s (imaginary) film production division.

  • It’s very clear that worldwide profitability should be prioritized, as most domestic ROI’s are below 1.0 (break-even).
  • A budget of $5 million is the threshold above which worldwide ROI is above the break-even point; and the more money invested, the more of a return is generated.
  • We saw that the highest category ($50 million plus) is a clear winner in terms of net revenue over the period 2000–2020, and it is the only one with a clear upward trend.
  • The lowest category, $0–5 Million, is the clear loser in both measures; at best, it offers low risk and low reward.
  • Almost all genres have at least some profitability potential.
  • Westerns are the clear losers here, and History, War, and Music films aren’t much better.
  • When looking at the top genres, Animation, Horror, Mystery, etc. are most profitable, but some (Horror, Mystery) can be more volatile over time.

Photo by Javier Allegue Barros on Unsplash

Future Work

In doing this project, of course I had to settle on a few key questions, and focus on doing a reasonable amount of work in an appropriate timeframe.

As is often the case, lots of ideas came during the work itself, and especially after completing the project. If I had more time (and data), I’d like to look at the following.

  • Because of profitability of Worldwide revenue, investigate reasons for success vs. failure in global marketing and distribution
  • Investigate intersection of budget and genre
  • Investigate data on streaming platforms, because they exist outside of traditional ‘box office’ profit framework
  • Incorporate data from 2020/2021 and impact of Covid on industry
  • Investigate data on impact of films’ principals — Directors, Actors, Producers

Thank you for taking the time to check out my work. I hope this article has shown my progress in the Flatiron Data Science program, and has been enlightening to read. I had a lot of fun working through to this point, and this project has really solidified my understanding of the materials thus far.

Thanks for reading!

Data Science student / pianist / keyboardist