You have to work on the Kiva
dataset.
Some information on the datasets are on the Kaggle web page.
All groups and individual must do the following:
- Normalize the
loan_lenders
table. In the normalized table, each row must have one loan_id
and one lender
.
- For each loan, add a column
duration
corresponding to the number of days between the disburse time and the planned
expiration time. If any of those two dates is missing, also the duration
must be missing.
- Find the lenders that have funded at least twice.
- For each country, compute how many loans have involved that country as borrowers.
- For each country, compute the overall amount of money borrowed.
- Like the previous point, but expressed as a percentage of the overall amount lent.
- Like the three previous points, but split for each year (with respect to
disburse time
).
- For each lender, compute the overall amount of money lent. For each loan that has more than one lender, you must
assume that all lenders contributed the same amount.
- For each country, compute the difference between the overall amount of money lent and the overall amount of money
borrowed. Since the country of the lender is often unknown, you can assume that the true distribution among the
countries is the same as the one computed from the rows where the country is known.
- Which country has the highest ratio between the difference computed at the previous point and the population?
- Which country has the highest ratio between the difference computed at point 9 and the population that is
not below the poverty line?
- For each year, compute the total amount of loans. Each loan that has planned expiration time and disburse time in
different years must have its amount distributed proportionally to the number of days in each year. For example, a
loan with disburse time December 1st, 2016, planned expiration time January 30th 2018, and amount 5000USD has an
amount of 5000USD * 31 / (31+365+30) = 363.85 for 2016, 5000USD * 365 / (31+365+30) = 4284.04 for 2017, and 5000USD * 30 /
(31+365+30) = 352.11 for 2018.
The following part of the exercise must be done only by groups with two or three people
- For each value of
repayment_interval
, add a new column to the lenders
dataframe that contains the total amount
of money corresponding to loans in such state
- What is the occupation with the highest average amount of money lent (the average must be computed over all lenders
with a given occupation)?
- Cluster the loans according to the year-month of disburse time.
- For each country, compute its overall GNI, by multiplying the per capita GNI with its population.
The following part of the exercise must be done only by groups with three people
- Find the country with the highest rate of irregular repayment interval.
- Find the country with the highest fraction of loaned amount with irregular repayment interval.
Notes
- It is mandatory to use GitHub for developing the project.
- The project must be a jupyter notebook.
- There is no restriction on the libraries that can be used, nor on the Python version.
- Post any question on the Discussions forum