Starting from The Movie Database (TMDB) sample datasets, all groups and individuals must do the following:

  1. For each movie, compute the number of cast members
  2. How many movies do not have a homepage?
  3. For each year, how many movies do not have a homepage?
  4. Extract the domain of each homepage.
  5. Extract a set of normalized tables. That is, each entry of a normalized table must contain exactly one value (not a list or a dictionary). There is no need to use SQL for this point.
  6. For each movie, compute the gross margin (difference between revenue and budget)
  7. For each movie, compute the number of crew members
  8. For each movie, compute the number of directors
  9. For each language, compute the number of movies in which that language is spoken.
  10. For each company and each decade, compute the overall revenue
  11. For each decade, compute the company with maximum revenue
  12. In each year, how many movies have revenue smaller than the budget?
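
A few of the points above (cast counts, missing homepages, homepage domains, gross margin, and normalization) can be sketched with the standard library alone. This is only a minimal illustration on toy rows: the column names, the JSON-encoded `cast` column, and the use of an empty string for a missing homepage are assumptions about the dataset layout and should be checked against the actual files.

```python
import json
from urllib.parse import urlparse

# Toy rows standing in for the real files. The column names and the
# JSON-encoded "cast" column are assumptions about the dataset layout.
movies = [
    {"id": 1, "title": "A", "homepage": "http://www.example.com/a",
     "budget": 100, "revenue": 250,
     "cast": '[{"id": 10, "name": "X", "order": 0},'
             ' {"id": 11, "name": "Y", "order": 1}]'},
    {"id": 2, "title": "B", "homepage": "",
     "budget": 300, "revenue": 120,
     "cast": '[{"id": 10, "name": "X", "order": 0}]'},
]


def homepage_domain(url):
    """Return the network location of a URL, or None when it is missing."""
    if not url:
        return None
    return urlparse(url).netloc or None


# Number of cast members per movie
cast_counts = {m["id"]: len(json.loads(m["cast"])) for m in movies}

# Movies without a homepage (assumed here to be an empty string)
missing_homepage = sum(1 for m in movies if not m["homepage"])

# Domain of each homepage
domains = {m["id"]: homepage_domain(m["homepage"]) for m in movies}

# Gross margin: revenue minus budget
gross_margin = {m["id"]: m["revenue"] - m["budget"] for m in movies}

# Normalization: one row per (movie, cast member), one scalar per field
movie_cast = [
    {"movie_id": m["id"], "person_id": c["id"], "cast_order": c["order"]}
    for m in movies
    for c in json.loads(m["cast"])
]
```

The `movie_cast` list plays the role of a link table between movies and people; the same pattern would apply to any other list-valued column.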

The following part of the exercise must be done only by groups of two or three people:

  1. Distribute each movie's revenue among its actors according to their order of appearance. Assume that the i-th actor contributes twice as much to the revenue as the (i+1)-th actor.
  2. For each actor, find the total revenue attributed to him/her.
  3. Find the actor that is responsible for the most overall revenue.
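
One possible reading of the rule above: with n actors, the weights are 2^(n-1), ..., 2, 1, which sum to 2^n - 1, so the i-th actor (counting from 0) receives revenue * 2^(n-1-i) / (2^n - 1). The sketch below applies that reading to hypothetical data; the `films` list and the actor names are invented for illustration only.

```python
from collections import defaultdict


def revenue_shares(revenue, n_actors):
    """Split revenue so that each actor gets twice the share of the next one."""
    if n_actors == 0:
        return []
    total_weight = 2 ** n_actors - 1  # 2^(n-1) + ... + 2 + 1
    return [
        revenue * 2 ** (n_actors - 1 - i) / total_weight
        for i in range(n_actors)
    ]


# Hypothetical input: (revenue, cast names in order of appearance)
films = [
    (700, ["Ann", "Bob", "Cy"]),
    (300, ["Bob", "Ann"]),
]

# Total revenue attributed to each actor across all films
totals = defaultdict(float)
for revenue, cast in films:
    for actor, share in zip(cast, revenue_shares(revenue, len(cast))):
        totals[actor] += share

# Actor responsible for the most overall revenue
top_actor = max(totals, key=totals.get)
```

With real data, actors are better identified by a numeric id than by name, since two different people can share a name.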

The following part of the exercise must be done only by groups of three people:

  1. For each movie, compute the ratio between males and females in the cast
  2. For each movie, compute the ratio between the attributed revenue of males and females in the cast
  3. Find the director who has the highest average ratio computed in the previous point.
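
A rough sketch of the three points above, under several assumptions: the gender codes (1 for female, 2 for male) must be verified against the data, revenue is attributed with the same halving rule used for groups of two or three, and movies with no female cast members are skipped because the ratio is undefined there. The `films` list and the director names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

FEMALE, MALE = 1, 2  # assumed gender codes; check them against the data


def gender_ratios(revenue, genders):
    """Return (head-count ratio, attributed-revenue ratio) of males to females.

    `genders` lists the cast's gender codes in order of appearance.
    Either ratio is None when its denominator would be zero.
    """
    n = len(genders)
    total_weight = 2 ** n - 1
    # Each actor receives twice the share of the next one
    shares = [revenue * 2 ** (n - 1 - i) / total_weight for i in range(n)]
    males = sum(1 for g in genders if g == MALE)
    females = sum(1 for g in genders if g == FEMALE)
    male_rev = sum(s for g, s in zip(genders, shares) if g == MALE)
    female_rev = sum(s for g, s in zip(genders, shares) if g == FEMALE)
    count_ratio = males / females if females else None
    revenue_ratio = male_rev / female_rev if female_rev else None
    return count_ratio, revenue_ratio


# Hypothetical input: (director, revenue, cast gender codes in order)
films = [
    ("Dee", 700, [2, 1, 2]),
    ("Dee", 300, [1, 2]),
    ("Eli", 150, [2, 2]),
]

# Collect the defined revenue ratios per director, then average them
ratios_by_director = defaultdict(list)
for director, revenue, genders in films:
    _, revenue_ratio = gender_ratios(revenue, genders)
    if revenue_ratio is not None:
        ratios_by_director[director].append(revenue_ratio)

averages = {d: mean(r) for d, r in ratios_by_director.items()}
best_director = max(averages, key=averages.get)
```

How to treat undefined ratios (skip them, as here, or cap them at some value) changes the ranking, so the choice is worth stating explicitly in the notebook.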

Notes

  1. Using GitHub to develop the project is recommended, but not mandatory.
  2. The project must be a Jupyter notebook.
  3. There is no restriction on the libraries that can be used, nor on the Python version.