Starting from the Google Play Store dataset,

All groups and individuals must do the following:

  1. Convert the app sizes to a number
  2. Convert the number of installs to a number
  3. Transform “Varies with device” into a missing value
  4. Convert Current Ver and Android Ver into a dotted number (e.g. 4.0.3 or 4.2)
  5. Remove the duplicates
  6. For each category, compute the number of apps
  7. For each category, compute the average rating
  8. Create two dataframes: one for the genres and one bridging apps and genres. For instance, the app Pixel Draw - Number Art Coloring Book appears twice in the bridging table: once for Art & Design and once for Creativity
  9. For each genre, add a new column to the original dataframe. The new columns must have boolean values (True if the app has the given genre)
  10. For each genre, compute the average rating. Which genre has the highest average rating?
  11. For each app, compute the approximate income, obtained as the product of the number of installs and the price.
  12. For each app, compute its minimum and maximum Sentiment_polarity
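
As a starting point for items 1 to 4, the cleaning steps can be sketched as plain Python helpers. This is only an illustration under stated assumptions: it assumes raw sizes look like `19M` or `201k`, raw install counts look like `10,000+`, and version strings may carry trailing text such as `4.0.3 and up`. The function names, the choice of bytes as the size unit, and the decimal multipliers are all hypothetical choices, not requirements of the exercise.

```python
import math
import re

# Stand-in for a missing value (pandas treats NaN as missing).
MISSING = float("nan")

# Assumed multipliers: k = 1,000 bytes and M = 1,000,000 bytes.
SIZE_FACTORS = {"k": 1e3, "m": 1e6}


def parse_size(raw):
    """Return the size in bytes, or NaN if the text is not a size."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([kKmM])", str(raw).strip())
    if match is None:
        # Covers "Varies with device" and any other unparseable text.
        return MISSING
    return float(match.group(1)) * SIZE_FACTORS[match.group(2).lower()]


def parse_installs(raw):
    """Return the install count as an int, or NaN if it cannot be parsed."""
    digits = str(raw).replace(",", "").replace("+", "").strip()
    return int(digits) if digits.isdigit() else MISSING


def parse_version(raw):
    """Return the first dotted number found (e.g. '4.0.3'), or None."""
    match = re.search(r"\d+(?:\.\d+)*", str(raw))
    return match.group(0) if match else None
```

With pandas, helpers like these could be applied column by column, for example `df["Android Ver"].map(parse_version)`; a vectorised alternative based on the `.str` accessor would work as well.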

The following part of the exercise must be done only by groups with two or three people

  1. For each app, compute the average number of words in its reviews
  2. For each app, compute its longest review
  3. For each app, compute the ratio between the number of installs and the number of reviews
  4. Cluster the apps according to the major Android version, that is, the first two numbers of the version (e.g. for 4.0.3 the major version is 4.0)
  5. For each cluster, compute the average date and the last date of an update.
  6. Excluding the free apps, which content rating has the highest average price?
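
Items 4 and 5 above can be approached with a regular expression and a group-by. The sketch below is one possible reading, not a reference solution: the toy rows are invented, the column name `Last Updated` and the helper column `Major Ver` are assumptions, and the dates are written in ISO format for simplicity (the real column may need an explicit `format=` argument). It also assumes a reasonably recent pandas version.

```python
import pandas as pd

# Invented toy data; the real dataframe comes from the cleaned dataset.
apps = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Android Ver": ["4.0.3", "4.0", "5.1", None],
    "Last Updated": ["2018-01-07", "2018-01-09", "2018-03-01", "2018-06-01"],
})

# Keep only the first two numbers of the version: "4.0.3" -> "4.0".
# Versions that do not start with "number.number" become missing.
apps["Major Ver"] = apps["Android Ver"].str.extract(r"^(\d+\.\d+)", expand=False)

# Dates must be real datetimes before they can be averaged.
apps["Last Updated"] = pd.to_datetime(apps["Last Updated"])

# One row per cluster; rows with a missing major version are dropped.
summary = apps.groupby("Major Ver")["Last Updated"].agg(
    average="mean",
    latest="max",
)
```

Here "cluster" is read simply as "group of apps sharing the same major version"; if a different clustering is intended, only the grouping key changes.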

The following part of the exercise must be done only by groups with three people

  1. What is the genre with the highest total income?
  2. What is the genre with the highest fraction of free apps (over the number of all apps)?
  3. For each rating, compute the average income
  4. For each (Content Rating, Genre) pair, compute the number of reviews and the average rating.
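
The genre-level questions above all need one row per (app, genre) pair. A minimal sketch follows, assuming that multiple genres are stored in a single `Genres` column separated by `;` and that the numeric columns have already been cleaned. The toy rows and the column names `Installs`, `Price`, `Reviews`, `Rating`, `Income`, `Genre` and `Free` are illustrative assumptions. The fraction of free apps is interpreted here as the share of free apps within each genre, and "number of reviews" as the sum of the review counts; both readings are assumptions.

```python
import pandas as pd

# Invented toy data standing in for the cleaned dataset.
apps = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Genres": ["Art & Design;Creativity", "Art & Design", "Tools"],
    "Content Rating": ["Everyone", "Teen", "Everyone"],
    "Installs": [1000, 500, 100],
    "Price": [0.0, 2.0, 5.0],
    "Reviews": [10, 20, 30],
    "Rating": [4.0, 5.0, 3.0],
})

# Approximate income: number of installs times price.
apps["Income"] = apps["Installs"] * apps["Price"]

# One row per (app, genre): split the genre string, then explode the lists.
per_genre = apps.assign(Genre=apps["Genres"].str.split(";")).explode("Genre")
per_genre["Free"] = per_genre["Price"].eq(0)

# Total income per genre and the genre where it is highest.
income_by_genre = per_genre.groupby("Genre")["Income"].sum()
top_genre = income_by_genre.idxmax()

# Share of free apps within each genre (mean of a boolean column).
free_fraction = per_genre.groupby("Genre")["Free"].mean()

# Review total and average rating per (Content Rating, Genre) pair.
pair_stats = per_genre.groupby(["Content Rating", "Genre"]).agg(
    reviews=("Reviews", "sum"),
    avg_rating=("Rating", "mean"),
)
```

Note that an app with two genres contributes its full income to both genres in this sketch; whether that double counting is acceptable is a modelling decision worth stating in the notebook.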

Notes

  1. It is mandatory to use GitHub for developing the project.
  2. The project must be a Jupyter notebook.
  3. There is no restriction on the libraries that can be used, nor on the Python version.
  4. Post any questions on the Discussions forum.