Predicting the success of Mobile Applications in the Google Play Store

This project was developed by Dominic Teo, Grace Prakaisriroj and Alessandro Luciano as our capstone group project for the undergraduate final year Statistics course ST309.


A more detailed report on the entire project can be found in the PDF document. Unfortunately, the R Markdown file has been lost.

Introduction

The global app economy is projected to be worth 6.3 trillion USD by 2021, up from 1.3 trillion USD in 2018 (AppAnnie, 2019). Over this period the user base is expected to almost double, from 3.4 billion people using apps to around 6.3 billion (AppAnnie, 2019). However, the majority of developers still struggle to break even (AppSurvey, 2013). For unsuccessful app developers, a clear analysis of the characteristics of existing successful apps would provide useful insight into creating apps that users want (Tian, 2015).

Therefore, the main goal of this project is to identify the characteristics that successful apps share and to investigate which of these factors are important for success.

Dataset Variables

Our final cleaned dataset used in modelling has 13 predictor variables (Android.Ver was removed during EDA) in 8 dimensions and 6,738 observations in total.

The 8 dimensions we used for our analysis are:

Success: For our outcome variable, we transformed the ‘installs’ feature into a numeric value and then defined a new binary variable, success, which flags apps with more than 500,000 downloads. Apps are typically profitable at 500,000 downloads (Louis, 2013). 58% of the observations were classified as successful.


Rating: The rating factor is the overall user rating of the app. Higher-rated apps have been shown to have more downloads (Lanza, 2012). Apps are rated from 1 to 5.
Size: The “Size of App” factor captures various information on the app. Large apps might contain more features or better functionality and thus might have better ratings. On the other hand, larger apps are also more likely to contain bugs and might therefore have lower ratings (Zimmermann, 2007).
Category: We recoded the app categories into just 4 groups: “Hobbies”, “Entertainment”, “Lifestyle” and “Productivity”. We coded these as 4 mutually exclusive and collectively exhaustive binary variables.
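
The Success and Category definitions above can be sketched in a few lines. This is an illustration in Python rather than the R used for the project, and several details are assumptions: the raw ‘installs’ format ("1,000,000+"), the strict comparison at the 500,000 threshold, and the names parse_installs, is_success, one_hot_category and the Category.* column names are hypothetical and not taken from the original analysis.

```python
import re

# Threshold from the text: success means more than 500,000 downloads.
SUCCESS_THRESHOLD = 500_000

# The 4 recoded categories named in the text.
CATEGORIES = ["Hobbies", "Entertainment", "Lifestyle", "Productivity"]


def parse_installs(raw):
    """Turn a string such as '1,000,000+' into the integer 1000000.

    The input format is an assumption about the raw data.
    """
    digits = re.sub(r"[^0-9]", "", raw)
    if not digits:
        raise ValueError(f"no digits in installs value: {raw!r}")
    return int(digits)


def is_success(raw_installs):
    """Return 1 if the parsed install count exceeds the threshold, else 0.

    Assumption: the comparison is strict, so a '500,000+' bucket is not
    counted as a success. The original analysis may treat it differently.
    """
    return int(parse_installs(raw_installs) > SUCCESS_THRESHOLD)


def one_hot_category(category):
    """Encode one recoded category as 4 mutually exclusive 0/1 indicators."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return {f"Category.{name}": int(name == category) for name in CATEGORIES}


print(is_success("1,000,000+"), one_hot_category("Lifestyle"))
```

Exactly one indicator is set for every app, which is what makes the 4 variables mutually exclusive and collectively exhaustive.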

Methodology

Our project has two research questions:

What characteristics do successful apps share?
What factors are important for app success on the Google Play Store?

We applied 3 different classification modelling approaches to the data to find the variables that are significant in determining the success of mobile apps.

Logistic Regression
Decision Trees and Pruned Decision Trees
Random Forest
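
Decision trees, and the random forests built from them, are grown by repeatedly choosing the split that most reduces class impurity. The sketch below shows a single such split search. It assumes Gini impurity as the criterion (the original R analysis may have used a different one), and the toy ratings and labels are invented for illustration; gini and best_split are hypothetical helper names.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2 * p * (1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Find the 'value <= threshold' split with the lowest weighted Gini.

    Returns (threshold, weighted_gini). The threshold is None if no split
    improves on the impurity of the unsplit data.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_score = None, gini(labels)
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = (lo + hi) / 2, score
    return best_threshold, best_score


# Invented toy data: app ratings and success labels.
ratings = [3.1, 3.8, 4.0, 4.2, 4.5, 4.7]
success = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_split(ratings, success)
print(threshold, impurity)
```

A full tree applies this search recursively to each resulting subset, pruning removes splits that do not improve held-out performance, and a random forest averages many trees grown on resampled data with random subsets of the predictors.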

Results and Conclusion

Comparing the 4 models (logistic regression, a decision tree, a pruned decision tree and a random forest), we conclude that the random forest model is the most effective at predicting the success of mobile applications. Our random forest model shows that Price, Rating, DaysSinceLastUpdated and Size are the most important variables in the construction of the model.
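
One common way to rank variables in a fitted classifier is permutation importance: shuffle one column and measure how much accuracy drops. The sketch below illustrates the idea only. Whether the original analysis used this measure or an impurity-based one is not stated, and the toy data, toy_model and the function names are invented for illustration.

```python
import random


def accuracy(predict, rows, labels):
    """Share of rows for which predict(row) matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, column, repeats=20, seed=0):
    """Mean drop in accuracy after shuffling one column across rows."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    total_drop = 0.0
    for _ in range(repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in the chosen column.
        permuted = [{**r, column: v} for r, v in zip(rows, shuffled)]
        total_drop += baseline - accuracy(predict, permuted, labels)
    return total_drop / repeats


# Invented toy data, not taken from the project dataset.
rows = [
    {"Price": 0.0, "Size": 12.0},
    {"Price": 0.0, "Size": 40.0},
    {"Price": 0.0, "Size": 8.0},
    {"Price": 2.99, "Size": 25.0},
    {"Price": 0.99, "Size": 30.0},
    {"Price": 4.99, "Size": 15.0},
]
labels = [1, 1, 1, 0, 0, 0]


def toy_model(row):
    """Hypothetical stand-in for a fitted model: predicts success for free apps."""
    return int(row["Price"] == 0.0)


price_importance = permutation_importance(toy_model, rows, labels, "Price")
size_importance = permutation_importance(toy_model, rows, labels, "Size")
print(price_importance, size_importance)
```

A variable the model ignores, such as Size in this toy model, shows no accuracy drop when shuffled, while a variable the model relies on shows a positive drop.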

This is important for developers who want to build successful mobile applications. They should develop free applications and aim for high ratings in the Google Play Store. Developers trying to make money from their apps could therefore make the apps free and concentrate on in-app monetization opportunities instead. Charging for the app itself seems to be a major deterrent to users downloading it.

More interestingly, developers should also update their apps continuously and consistently, as we found that more recently updated apps tend to be more successful. We also found that, overall, larger apps were more successful, although size was less important than the other three variables. We think that size could be a good proxy for complexity and sophistication, so that more complex and more fully developed apps tend to be more popular.
