A Data Science Central Community
The Oscars ceremony last weekend was a blast. My friends had their theories as to which movie would win the Best Picture Award; I secretly prayed for "The Imitation Game" to make it. Alas, I was wrong, and I figured that if I want to make better guesses in the future, I should seriously learn a little more about the movie industry. Especially given how much I like movies (and who doesn't?).
So how do I learn more about the movie industry and figure out what it takes to get nominated for an Oscar?
I have to admit that I know nothing about the movie industry, so after spending a few hours searching on Google I came across a pretty cool movie website (http://www.the-numbers.com/market/) that publishes basic information about movies made in the last 20 years. I managed to obtain information on 11,330 movies produced between 1995 and 2015.
This dataset (let's call it Movie Dossier) consisted of the following fields:
Along with the Movie Dossier, I also found a separate database on Box Office Mojo (http://www.boxofficemojo.com/oscar/) listing movies nominated for the Oscar Best Picture Award. I joined the Movie Dossier and Mojo Oscar tables (on Movie Name) and voila - I knew whether a movie was nominated or not.
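For readers who want to try the same join, here is a minimal sketch in plain Python. The rows, the field name `movie_name`, and the helpers `normalize_title` and `flag_nominees` are all hypothetical and only illustrate the idea; the actual tables and tooling I used are not shown here.

```python
# Hypothetical sample rows standing in for the two tables.
movie_dossier = [
    {"movie_name": "Example Film A", "year": 2014},
    {"movie_name": "Example Film B ", "year": 2014},
    {"movie_name": "Example Film C", "year": 2014},
]
mojo_oscar = [
    {"movie_name": "example film a"},
    {"movie_name": "Example Film B"},
]


def normalize_title(title):
    """Lower-case and collapse whitespace so trivially different spellings match."""
    return " ".join(title.lower().split())


def flag_nominees(movies, nominees):
    """Left join on movie name: copy each movie row and add a boolean `nominated` field."""
    nominated_titles = {normalize_title(row["movie_name"]) for row in nominees}
    return [
        {**row, "nominated": normalize_title(row["movie_name"]) in nominated_titles}
        for row in movies
    ]


joined = flag_nominees(movie_dossier, mojo_oscar)
for row in joined:
    print(row["movie_name"].strip(), row["nominated"])
```

One caveat with joining on name alone: two different movies can share a title, so adding the release year to the join key would likely be safer.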
When I looked at the Movie Dossier dataset, I didn't know where to start. Does movie production change over time? If so, how? Does movie genre have any significance? How come some movies are so much more popular than others? And above all, how can this data help me understand what helped 8 movies beat 660 other competitors and get nominated for the Best Picture Oscar in 2015?
So here goes...
My Lesson #1
When you enter uncharted territory and lack domain knowledge in the subject you are about to analyze, take a pause... find data you think is relevant and play with it, much like you play with a new gadget when you are too lazy to read the manual. This will help you learn more about the subject and generate the hypotheses you are looking for.
I took my own lesson and started looking at basic metrics like:
While doing this simple analysis, I noticed that movie production has been following overall US economic trends with a little lag. Here is a visualization I put together.
First off, why do Drama movies bring in less money than Comedy films?! I guess people would rather laugh than be serious... But look, film producers don't seem to agree, since they keep making more dramas than comedies (1,960 comedy films vs. 3,541 drama movies have been produced since 1995, but comedies earned 31% more money than dramas). Adventure movies turned out to be the most efficient ones -- can you believe that 619 movies made almost $39B (i.e. $61M/movie)? Well, I guess they are the most expensive ones too.
Speaking of production budgets, Avatar proved to be the revenue champion in the Action genre, with $760M in gross earnings and $425M spent on production (who said 79% is a bad ROI?).
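The 79% figure can be reproduced from the two Avatar numbers quoted above. This small sketch assumes ROI means (gross - budget) / budget; that definition and the helper name `roi` are my assumptions for illustration.

```python
def roi(gross, budget):
    """Return on investment as a fraction: profit divided by the production budget."""
    return (gross - budget) / budget


# Figures quoted in the text: $760M gross earnings, $425M production budget.
avatar_roi = roi(760e6, 425e6)
print(f"Avatar ROI: {avatar_roi:.0%}")
```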
Curiously, the top Action, Thriller/Suspense and Adventure movies earned twice as much money as the Drama, Comedy/Romantic Comedy and Horror favorites. And to my biggest disappointment, Justin Bieber's concert show ranked #1 in the Concert/Performance genre. But let's keep moving...
The bottom chart displays the change in movie production volume since 1995. The bars represent the number of movies released each year, and the trend line shows average revenue per ticket sold. As an add-on, if you hover over any bar you will see:
It was a big revelation for me that, although movie production consistently followed the US economic trend with a little lag (US market activity dropped in 2008-2009, whereas the movie industry declined in 2010), film companies made a lot of money per film in 2010, the very year movie production went down by 40%. Average revenue per movie was $25M, the highest average seen in 20 years. But when I looked at how much money people spent on movie tickets that year, the picture cleared up a little. Turns out, people were paying $8.3 for a ticket compared to the $6/ticket historical average.
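The yearly metrics discussed above (number of movies, average revenue per movie, average ticket price) can be derived from per-movie rows roughly as follows. This is only a sketch: the field names `year`, `gross` and `tickets`, the helper `yearly_summary`, and the sample numbers are made up for illustration and are not taken from the actual dataset.

```python
from collections import defaultdict

# Hypothetical per-movie rows; real values would come from the Movie Dossier data.
movies = [
    {"year": 2009, "gross": 30_000_000, "tickets": 5_000_000},
    {"year": 2009, "gross": 6_000_000, "tickets": 1_000_000},
    {"year": 2010, "gross": 24_900_000, "tickets": 3_000_000},
]


def yearly_summary(rows):
    """Aggregate per-movie rows into per-year counts and averages."""
    totals = defaultdict(lambda: {"movies": 0, "gross": 0, "tickets": 0})
    for row in rows:
        year_totals = totals[row["year"]]
        year_totals["movies"] += 1
        year_totals["gross"] += row["gross"]
        year_totals["tickets"] += row["tickets"]
    return {
        year: {
            "movies": t["movies"],
            "avg_revenue_per_movie": t["gross"] / t["movies"],
            "avg_ticket_price": t["gross"] / t["tickets"],
        }
        for year, t in sorted(totals.items())
    }


summary = yearly_summary(movies)
for year, stats in summary.items():
    print(year, stats)
```

Looking at both averages side by side is what makes the 2010 effect visible: fewer movies with a higher ticket price can still push revenue per movie up.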
So in a matter of a few hours I saw that
I think I kicked my data around enough to generate initial hypotheses to answer my main question.
In my next post I will conduct a confirmatory analysis where I will test how well each factor predicts the likelihood of a movie being nominated for the Best Picture Award.
What do you think about the data? Could I use other sources to dive deeper into existing datasets?
Did the exploratory analysis make sense to you? How else could I have explored the data to better link it to the Oscar nominations topic?
Have you conducted a similar exploratory analysis before? How did you approach the problem?