The movie industry, which evolved from stage plays to cinema, is one of the industries that can influence the public. A good film leaves a positive impression on its audience, while a bad one is quickly forgotten, so filmmakers and producers strive to present the best film they can. But audience perception changes over time, and that affects a film's success. The pandemic opened many doors for upcoming filmmakers to try their luck on online platforms such as Netflix and Amazon Prime. At the same time, it let audiences watch movies in their own comfort and explore new genres from around the world. This is a challenge for traditional filmmakers, who now need to consider all of these factors when making a film.
It is hard to predict whether an upcoming movie will be successful, because success depends on the perception of the people who watch it. But if we analyse trends in the movie industry and find out which features films had when they did well, we can try to predict a movie's success.
Based on this assumption, let’s find the key factors that affect a movie's success using visual analytics and classify that success using a decision tree.
I took the dataset from Kaggle. Initially it consisted of 7668 rows and 15 columns. I first worked on outlier detection, using the empirical rule (which says that about 99.7% of normally distributed data lies within 3 standard deviations of the mean) to detect and remove outliers. After removing the outliers, the dataset had 7245 rows and 15 columns.
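The empirical-rule filter described here can be sketched in a few lines. This is only an illustration under assumptions: the function name `remove_outliers_3sigma`, the `runtime` column, and the toy rows are hypothetical and are not taken from the Kaggle dataset.

```python
from statistics import mean, pstdev


def remove_outliers_3sigma(rows, column, k=3.0):
    """Keep rows whose value in `column` lies within k standard deviations of the mean."""
    values = [row[column] for row in rows]
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing can be an outlier.
        return list(rows)
    return [row for row in rows if abs(row[column] - mu) <= k * sigma]


# Hypothetical toy data: 20 ordinary runtimes plus one extreme value.
movies = [{"name": f"movie_{i}", "runtime": 100 + (i % 5)} for i in range(20)]
movies.append({"name": "extreme", "runtime": 900})

cleaned = remove_outliers_3sigma(movies, "runtime")
print(len(movies), "->", len(cleaned))
```

In practice the same filter would be applied to each numeric column in turn, which is one way the row count could shrink as it did above.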
Next I focused on the null values. I calculated the percentage of missing data in each column and plotted the graph above. I then analysed each column that had missing data. For instance, the star column has a few missing values, but we cannot predict the star, so I dropped those rows; likewise, I analysed every other column and handled it as needed. Filling the budget column needed some attention and creativity, as this is one of the key features for the study. I could not just drop the column or fill the null values with the mean, so I needed a better approach. The approach I followed is based on assumptions: since we don’t know the real budgets, I used the gross column to estimate them.
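As a rough sketch of this step, the snippet below computes the percentage of missing values per column and drops rows with no star. The rows and the helper name `missing_percentage` are made up for illustration, and missing values are assumed to be represented as `None`.

```python
def missing_percentage(rows, columns):
    """Return {column: percentage of rows whose value in that column is None}."""
    total = len(rows)
    if total == 0:
        return {col: 0.0 for col in columns}
    return {
        col: 100.0 * sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }


# Hypothetical rows; None marks a missing value.
rows = [
    {"star": "A", "budget": 10_000_000, "gross": 25_000_000},
    {"star": None, "budget": None, "gross": 5_000_000},
    {"star": "B", "budget": None, "gross": 40_000_000},
    {"star": "C", "budget": 3_000_000, "gross": 1_000_000},
]

pct = missing_percentage(rows, ["star", "budget", "gross"])
print(pct)

# A missing star cannot be sensibly imputed, so those rows are dropped.
rows_with_star = [row for row in rows if row["star"] is not None]
```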
The plot above shows the relation between budget and gross: as the budget of a movie increases, the gross tends to increase too. I initially filled the null values in the budget column with 0 and then calculated the profits and losses based on the gross column. I analysed the budgets and gross for each genre, chose a threshold value, and used it to fill in the missing budget values for that genre. After filling all the null values and making sure none were left, I performed outlier detection again, as the assumed values could have created new outliers. After all the preprocessing, the dataset had 6238 rows and 15 columns.
Findings
After preparing the data, I focused on which features support the study. Of the 15 features, I considered the 8 that have a greater impact on a film's success. I then needed to know which movies brought profits and which brought losses to the producers, so I added two more columns for this purpose: one numerical and one categorical. I derived them from the difference between gross and budget: if the value is positive, the movie made a profit, and if it is negative, it made a loss. The categorical profit-or-loss column is the target column for the classification model that we are going to build.
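The two derived columns can be sketched as follows. The column names `profit` and `profit_or_loss` are hypothetical, and treating an exact break-even (difference of zero) as a loss is an assumption, since the text only covers the positive and negative cases.

```python
def add_profit_columns(movie):
    """Return a copy of `movie` with a numerical and a categorical profit column."""
    profit = movie["gross"] - movie["budget"]
    # Assumption: a difference of exactly zero is labelled as a loss.
    label = "profit" if profit > 0 else "loss"
    return {**movie, "profit": profit, "profit_or_loss": label}


# Hypothetical example rows.
hit = add_profit_columns({"name": "hit", "budget": 20_000_000, "gross": 55_000_000})
flop = add_profit_columns({"name": "flop", "budget": 30_000_000, "gross": 12_000_000})
print(hit["profit"], hit["profit_or_loss"])
print(flop["profit"], flop["profit_or_loss"])
```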
From the two figures above, we can see that the most produced genre is ‘Adventure’. Its share of releases stays almost constant every year from 1980 to 2020, and movies of this genre occupy the densest area of the plot.
It is hard to say definitively that runtime affects a movie's success, because many movies with longer runtimes performed very well; it depends on how the audience perceives the movie. The figure above shows how runtime changed from 1980 to 2020 and how profits were affected. We can see that movies with slightly longer runtimes made profits, except from 2002 to 2006 and from 2016 to 2018, when longer movies failed at the box office. From 1988 to 1997, however, movies longer than 105 minutes brought profits. Most movies keep their runtime between 100 and 110 minutes. We see this trend because the dataset consists mostly of Hollywood movies; for an Indian movie dataset, the average runtime would likely be around 120 minutes.
The question of how audiences respond to longer movies can be answered from the figure above: animation films made more profit than other genres. Adventure movies made moderate profits, although this is the most produced genre, and animation and adventure movies share almost the same budget for an average runtime of 105–110 minutes. But why do animation films have a higher success ratio than other genres? And if that is the case, why are producers and filmmakers not making more animated films? To answer this, we need to understand the cost and time involved in making these films. Animated films take longer to make because they include a lot of VFX, and that VFX should resemble the real world, so more care and tuning are required. That is one reason animation films take longer to produce. Next come adventure and action movies, which require high budgets and a long time to complete. Most animated films are U rated, which attracts family audiences to the theatre, whereas crime and adventure films are mostly PG or R rated; this is one more reason why those films make less profit.
Decision Tree
I used a decision tree for classification because it provides automatic feature selection and does not require standardisation of the features. I chose six columns: 5 features and 1 target, the categorical profit-or-loss column added earlier. I split the data into training and testing sets, trained the model on the training set, and ran the trained model on the unseen testing data to classify each movie as a profit or a loss. This classification model helps to estimate, for a future movie with these features, whether it is likely to bring profits or losses to the producer. It can also help both traditional and upcoming filmmakers to know which profitable features to consider when making a film, apart from the story and cast.
The confusion matrix above shows the TP, TN, FP and FN counts. The untuned model performed reasonably well, with an accuracy of 70%. At this stage I had not tuned any of the parameters and had taken a minimum leaf size of about 7. It was hard to find the best leaf size and maximum features by hand, as this is a big dataset, so I performed hyperparameter tuning. After the hyperparameter selection, the numbers of true positives and true negatives increased, which suggests that the model is classifying reasonably well.
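For readers unfamiliar with these four counts, here is a small sketch of how they and the accuracy are computed from true and predicted labels. The labels below are toy values, not the model's actual predictions, and treating "profit" as the positive class is an assumption.

```python
def confusion_counts(y_true, y_pred, positive="profit"):
    """Count true/false positives and negatives for a binary classification."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return counts


def accuracy(counts):
    """Accuracy = correctly classified samples / all samples."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())


# Toy labels for illustration only.
y_true = ["profit", "profit", "profit", "loss", "loss",
          "loss", "profit", "loss", "profit", "loss"]
y_pred = ["profit", "profit", "loss", "loss", "loss",
          "profit", "profit", "loss", "profit", "profit"]

counts = confusion_counts(y_true, y_pred)
print(counts, accuracy(counts))
```

An increase in TP and TN after tuning raises the numerator of the accuracy formula directly, which is why those two counts are the ones to watch.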
After tuning, with a maximum depth of 3 and a minimum leaf size of 4, the model achieved a slightly higher accuracy of 75% and an AUC of 0.79, which indicates that it is a reasonably good model.
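One common way to search over maximum depth and minimum leaf size is a cross-validated grid search with scikit-learn. The sketch below assumes that approach; the article does not say which tuning method was actually used. It runs on synthetic data from `make_classification` rather than the movie dataset, and the grid values are illustrative, so its scores will not match the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for 5 features and a binary profit/loss target.
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Illustrative grid; these candidate values are assumptions.
param_grid = {"max_depth": [2, 3, 4, 5], "min_samples_leaf": [2, 4, 7, 10]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

# Evaluate the best tree on the held-out test split.
best_tree = search.best_estimator_
test_accuracy = accuracy_score(y_test, best_tree.predict(X_test))
test_auc = roc_auc_score(y_test, best_tree.predict_proba(X_test)[:, 1])
print(search.best_params_, round(test_accuracy, 2), round(test_auc, 2))
```

Keeping the test split out of the search matters here: the reported accuracy and AUC should come from data the tuning never saw.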
Using this model, one can see which key features matter and within what limits they should fall. By understanding the model, one can experiment with new approaches to making a film, and the results of those experiments can be fed back as new data to retrain the model.
Conclusion
From this study, we found some key factors that affect a movie's success, in keeping with our aim of learning how a filmmaker can get onto the path to success. Visual analytics was a great help in finding them. These are not the only ways to be a successful filmmaker, because filmmaking is all about taking new risks and trying different things, but this analysis is based on risks that filmmakers have already taken in the past, so it can be helpful for current and upcoming producers and filmmakers. From all the observations, we can say that both upcoming and traditional filmmakers can try adventure and animation films for better success. The only disadvantage is that these need a somewhat higher budget than other films; of course, action movies also need heavy budgets.