Chapter 1 Assignment: Exploring Global Life Expectancy Data

Overview

In this assignment, you will apply the concepts learned in Chapter 1 to analyze a real-world dataset. You will:

  1. Load and explore a public dataset

  2. Clean the data by handling missing values and data types

  3. Create visualizations (bar charts, histograms)

  4. Calculate descriptive statistics (mean, median, standard deviation)

  5. Draw conclusions based on your analysis

Dataset: Gapminder Life Expectancy Data

We will use the Gapminder dataset, which contains information about countries including:

  • Life expectancy

  • GDP per capita

  • Population

  • Continent

This dataset is publicly available and widely used for teaching data analysis.

Source: https://www.gapminder.org/data/


Instructions

  • Complete all the tasks marked with TODO

  • Write your code in the provided cells

  • Answer the questions in markdown cells

  • Make sure your visualizations have proper labels and titles


Part 1: Loading and Exploring the Data (15 points)

First, let’s import the necessary libraries and load the dataset.

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Load the Gapminder dataset from a public URL
url = "https://raw.githubusercontent.com/plotly/datasets/master/gapminderDataFiveYear.csv"
df = pd.read_csv(url)

# Display the first 10 rows
print("First 10 rows of the dataset:")
df.head(10)

Task 1.1: Explore the Dataset Structure (5 points)

TODO: Use appropriate pandas methods to answer the following questions:

  1. How many rows and columns does the dataset have?

  2. What are the data types of each column?

  3. Are there any missing values?

# TODO: Find the shape of the dataset (rows, columns)
# Hint: Use df.shape

# TODO: Display data types and info about the dataset
# Hint: Use df.info() or df.dtypes

# TODO: Check for missing values in each column
# Hint: Use df.isnull().sum()
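
The three hints above can be tried on a tiny stand-in table before you run them on the real data. This is a minimal sketch: the DataFrame `toy` and all of its values are invented for illustration and are not Gapminder data.

```python
import numpy as np
import pandas as pd

# Toy data: names and values are made up for illustration only.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["X", "X", "Y", "Y"],
    "lifeExp": [70.1, np.nan, 55.3, 61.0],
})

# .shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = toy.shape

# .isnull() marks each missing cell as True; .sum() counts them per column.
missing_per_column = toy.isnull().sum()

print(f"{n_rows} rows, {n_cols} columns")
print(toy.dtypes)
print(missing_per_column)
```

Note that `df.info()` prints its report directly, while `df.dtypes` and `df.isnull().sum()` return objects you can store and inspect further.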

Task 1.2: Understand the Variables (5 points)

TODO: For each categorical column, find the unique values.

# TODO: Find unique continents in the dataset
# Hint: Use df['continent'].unique()

# TODO: Find the range of years covered in the dataset
# Hint: Use df['year'].min() and df['year'].max()

# TODO: How many unique countries are in the dataset?
# Hint: Use df['country'].nunique()
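
To see how `unique()`, `nunique()`, `min()`, and `max()` behave, here is a small sketch on an invented table. The DataFrame `toy` and its values are assumptions made for this example only.

```python
import pandas as pd

# Toy data: three made-up countries observed in two made-up years.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "continent": ["X", "X", "X", "X", "Y", "Y"],
    "year": [2000, 2005, 2000, 2005, 2000, 2005],
})

# unique() returns the distinct values; nunique() returns how many there are.
continent_names = sorted(toy["continent"].unique())
n_countries = toy["country"].nunique()

# min() and max() together give the span of years covered.
first_year = int(toy["year"].min())
last_year = int(toy["year"].max())

print(continent_names, n_countries, first_year, last_year)
```

Because each country appears once per year, counting rows would overstate the number of countries; `nunique()` avoids that.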

Task 1.3: Filter Data for Analysis (5 points)

For the rest of this assignment, we will focus on the most recent year in the dataset.

TODO: Create a new DataFrame containing only the data from year 2007.

# TODO: Filter the dataset to include only year 2007
# Hint: df_2007 = df[df['year'] == 2007]

df_2007 = None  # Replace with your code

# Display the shape of the filtered dataset
print(f"Number of countries in 2007: {len(df_2007) if df_2007 is not None else 'Complete the TODO'}")
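
Filtering with a boolean mask can be sketched on invented data as follows. The DataFrame `toy` is hypothetical; the `.copy()` call is an optional design choice that makes the filtered table independent of the original, so later column assignments do not trigger pandas chained-assignment warnings.

```python
import pandas as pd

# Toy data: made-up countries and years, for illustration only.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "year": [2000, 2005, 2000, 2005, 2000, 2005],
    "lifeExp": [60.0, 62.0, 70.0, 71.0, 50.0, 54.0],
})

# The comparison builds a True/False Series; indexing with it keeps the True rows.
latest_year = toy["year"].max()
is_latest = toy["year"] == latest_year
toy_latest = toy[is_latest].copy()

print(f"Rows kept for {latest_year}: {len(toy_latest)}")
```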

Part 2: Data Cleaning (15 points)

Real-world data often contains issues that need to be addressed before analysis. In this section, you will practice data cleaning techniques.

Task 2.1: Introduce and Handle Missing Values (10 points)

Let’s simulate a real-world scenario by introducing some missing values, then handle them appropriately.

# Create a copy of the 2007 data for cleaning practice
df_clean = df_2007.copy()

# Introduce some missing values (simulating real-world data issues)
np.random.seed(42)  # For reproducibility
missing_indices = np.random.choice(df_clean.index, size=10, replace=False)
df_clean.loc[missing_indices[:5], 'lifeExp'] = np.nan
df_clean.loc[missing_indices[5:], 'gdpPercap'] = np.nan

print("Missing values introduced:")
print(df_clean.isnull().sum())

# TODO: Identify which countries have missing life expectancy values
# Hint: df_clean[df_clean['lifeExp'].isnull()]['country']

# TODO: Fill missing 'lifeExp' values with the median life expectancy of their continent
# Hint: Use groupby and transform with a lambda function
# Example: df_clean['lifeExp'] = df_clean.groupby('continent')['lifeExp'].transform(
#              lambda x: x.fillna(x.median()))

# TODO: Fill missing 'gdpPercap' values with the median GDP of their continent

# TODO: Verify that there are no more missing values
# Hint: Use df_clean.isnull().sum()
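
Group-wise imputation is easier to follow on a table small enough to check by hand. The sketch below uses invented values (the DataFrame `toy` is hypothetical) and a slightly different route from the lambda in the hint: it first computes each row's group median with `transform("median")`, then passes that Series to `fillna`. Both routes should give the same result.

```python
import numpy as np
import pandas as pd

# Toy data: two made-up groups, each with one missing value.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "continent": ["X", "X", "X", "Y", "Y", "Y"],
    "lifeExp": [70.0, np.nan, 74.0, 50.0, 60.0, np.nan],
})

# transform() returns one value per original row, aligned on the index.
# The median skips NaN, so group X gives 72.0 and group Y gives 55.0.
group_median = toy.groupby("continent")["lifeExp"].transform("median")

# fillna() with a Series only touches the rows that are missing.
toy["lifeExp"] = toy["lifeExp"].fillna(group_median)

print(toy)
print("Remaining missing values:", int(toy["lifeExp"].isnull().sum()))
```

Filling with the group median rather than the overall median keeps an imputed value close to what is typical for similar rows, which matters when groups differ a lot.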

Task 2.2: Data Type Validation (5 points)

TODO: Create a new column called pop_millions that contains the population in millions (divide the pop column by 1,000,000).

# TODO: Create a new column 'pop_millions' = df_clean['pop'] / 1_000_000
# Round to 2 decimal places using .round(2)


# Display sample of the result
# df_clean[['country', 'pop', 'pop_millions']].head(10)
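
A derived column can be sketched like this on invented numbers (the DataFrame `toy` is hypothetical). One detail worth knowing: a column named `pop` should be accessed with brackets, because `toy.pop` refers to the DataFrame's built-in `pop()` method rather than the column.

```python
import pandas as pd

# Toy data: made-up population counts.
toy = pd.DataFrame({
    "country": ["A", "B"],
    "pop": [1_234_567, 98_765_432],
})

# Arithmetic on a column applies to every row at once (vectorized).
toy["pop_millions"] = (toy["pop"] / 1_000_000).round(2)

print(toy)
print(toy.dtypes)  # the new column is a float; the original stays an integer
```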

Part 3: Visualization (35 points)

Now let’s create visualizations to understand our data better.

Task 3.1: Bar Chart - Countries per Continent (10 points)

TODO: Create a bar chart showing the number of countries in each continent.

Requirements:

  • Add a title: “Number of Countries per Continent (2007)”

  • Label the x-axis: “Continent”

  • Label the y-axis: “Number of Countries”

  • Add value labels on top of each bar

# TODO: Create a bar chart showing countries per continent
# Step 1: Count countries per continent using value_counts()
# Step 2: Create the bar chart using plt.bar()
# Step 3: Add labels and title
# Step 4: Add value labels on bars

plt.figure(figsize=(10, 6))

# Your code here


plt.tight_layout()
plt.show()
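
The four steps listed in the cell above can be sketched on invented categories as follows. The data, the titles, and the object-oriented `fig, ax` style are all choices made for this example; `plt.bar()` and `plt.text()` work the same way. The `matplotlib.use("Agg")` line is only there so the sketch runs without a display and can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy data: six made-up rows spread over three made-up categories.
toy = pd.DataFrame({"continent": ["X", "X", "X", "Y", "Y", "Z"]})

# Step 1: count rows per category (sorted by name for a stable bar order).
counts = toy["continent"].value_counts().sort_index()

# Step 2: draw the bars.
fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(counts.index, counts.values)

# Step 3: labels and title.
ax.set_title("Rows per Category (toy data)")
ax.set_xlabel("Category")
ax.set_ylabel("Count")

# Step 4: write each bar's height just above its top, centered horizontally.
for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width() / 2, height, f"{int(height)}",
            ha="center", va="bottom")

fig.tight_layout()
```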

Task 3.2: Histogram - Life Expectancy Distribution (10 points)

TODO: Create a histogram showing the distribution of life expectancy across all countries in 2007.

Requirements:

  • Use 10 bins

  • Add a title: “Distribution of Life Expectancy (2007)”

  • Label the x-axis: “Life Expectancy (years)”

  • Label the y-axis: “Number of Countries”

  • Add a vertical line showing the mean life expectancy

# TODO: Create a histogram of life expectancy
# Step 1: Create histogram using plt.hist()
# Step 2: Calculate mean life expectancy
# Step 3: Add vertical line at the mean using plt.axvline()
# Step 4: Add labels, title, and legend

plt.figure(figsize=(10, 6))

# Your code here


plt.tight_layout()
plt.show()
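
A histogram with a reference line at the mean might look like the sketch below. The values are invented, and the colors and line style are arbitrary choices for this example. The `matplotlib.use("Agg")` line only makes the sketch runnable without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Toy data: made-up measurements, not Gapminder values.
values = np.array([52.0, 55.5, 58.0, 61.0, 63.5, 66.0, 70.0, 71.5, 73.0,
                   74.0, 75.5, 76.0, 78.0, 79.5, 80.0, 81.0])

fig, ax = plt.subplots(figsize=(6, 4))

# hist() returns the per-bin counts and the bin edges (bins + 1 of them).
bin_counts, bin_edges, _ = ax.hist(values, bins=10, edgecolor="black")

# axvline() draws a vertical line across the full height of the axes.
mean_value = values.mean()
ax.axvline(mean_value, color="red", linestyle="--",
           label=f"Mean = {mean_value:.1f}")

ax.set_title("Distribution of Values (toy data)")
ax.set_xlabel("Value")
ax.set_ylabel("Count")
ax.legend()  # the legend picks up the label passed to axvline()
fig.tight_layout()
```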

Task 3.3: Conditional Histogram - Life Expectancy by Continent (15 points)

TODO: Create separate histograms of life expectancy for each continent to compare distributions.

Requirements:

  • Create a figure with 5 subplots (one per continent)

  • Use the same x-axis range for all (40 to 85 years)

  • Add appropriate titles and labels

# TODO: Create conditional histograms by continent
# Hint: Use plt.subplots() to create multiple plots
# Loop through continents and create a histogram for each

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()  # Flatten to make indexing easier

continents = df_clean['continent'].unique()

# Your code here - loop through continents and create histograms


# Hide the 6th subplot (we only have 5 continents)
axes[5].set_visible(False)

plt.suptitle('Life Expectancy Distribution by Continent (2007)', fontsize=14)
plt.tight_layout()
plt.show()
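
The loop over groups can be sketched on invented data with three groups in a 2 x 2 grid (the DataFrame `toy` and its values are hypothetical). One point to keep in mind: when you pass `range=` to `hist()`, values outside that range are left out of the bins, so it is worth comparing the chosen limits with the minimum and maximum of your data.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy data: three made-up groups with made-up values.
toy = pd.DataFrame({
    "continent": ["X"] * 4 + ["Y"] * 4 + ["Z"] * 4,
    "lifeExp": [45.0, 50.0, 52.0, 60.0,
                65.0, 70.0, 72.0, 75.0,
                76.0, 78.0, 80.0, 82.0],
})
groups = sorted(toy["continent"].unique())

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes = axes.flatten()

# zip() pairs each group with one subplot and stops when the groups run out.
for ax, name in zip(axes, groups):
    subset = toy.loc[toy["continent"] == name, "lifeExp"]
    ax.hist(subset, bins=9, range=(40, 85), edgecolor="black")
    ax.set_xlim(40, 85)  # identical limits make the panels comparable
    ax.set_title(name)
    ax.set_xlabel("Value")
    ax.set_ylabel("Count")

# Hide any subplot that did not receive a group.
for ax in axes[len(groups):]:
    ax.set_visible(False)

fig.tight_layout()
```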

Part 4: Descriptive Statistics (20 points)

Calculate and interpret key statistics for the data.

Task 4.1: Summary Statistics (10 points)

TODO: Calculate the following statistics for life expectancy in 2007:

  • Mean

  • Median

  • Standard deviation

  • Minimum and Maximum

  • Range (Max - Min)

# TODO: Calculate descriptive statistics for life expectancy
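# Use numpy or pandas methods: np.mean(), np.median(), np.std(), etc.
# Note: np.std() defaults to the population formula (ddof=0), while the
# pandas .std() method defaults to the sample formula (ddof=1).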

life_exp = df_clean['lifeExp']

# Calculate statistics
mean_life = None      # TODO: Calculate mean
median_life = None    # TODO: Calculate median
std_life = None       # TODO: Calculate standard deviation
min_life = None       # TODO: Calculate minimum
max_life = None       # TODO: Calculate maximum
range_life = None     # TODO: Calculate range

print("Life Expectancy Statistics (2007):")
print("="*40)
print(f"Mean:               {mean_life:.2f} years" if mean_life is not None else "TODO")
print(f"Median:             {median_life:.2f} years" if median_life is not None else "TODO")
print(f"Standard Deviation: {std_life:.2f} years" if std_life is not None else "TODO")
print(f"Minimum:            {min_life:.2f} years" if min_life is not None else "TODO")
print(f"Maximum:            {max_life:.2f} years" if max_life is not None else "TODO")
print(f"Range:              {range_life:.2f} years" if range_life is not None else "TODO")
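
The statistics listed above can be checked by hand on a short invented series. The sketch below also shows why the pandas and NumPy standard deviations can disagree: pandas divides by n - 1 by default, NumPy by n. The values are made up for this example.

```python
import numpy as np
import pandas as pd

# Toy data: eight made-up values with a mean of exactly 5.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean_value = values.mean()      # 5.0
median_value = values.median()  # 4.5, the average of the two middle values
value_range = values.max() - values.min()  # 9.0 - 2.0 = 7.0

# Sample standard deviation (pandas default, ddof=1): sqrt(32 / 7).
sample_std = values.std()

# Population standard deviation (NumPy default, ddof=0): sqrt(32 / 8) = 2.0.
population_std = np.std(values.to_numpy())

print(f"Mean: {mean_value:.2f}, Median: {median_value:.2f}")
print(f"Sample std: {sample_std:.3f}, Population std: {population_std:.3f}")
print(f"Range: {value_range:.2f}")
```

Whichever version you report, state which one it is; passing `ddof` explicitly to either function removes the ambiguity.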

Task 4.2: Statistics by Continent (10 points)

TODO: Calculate the mean and standard deviation of life expectancy for each continent.

# TODO: Calculate mean and std of life expectancy by continent
# Hint: Use df_clean.groupby('continent')['lifeExp'].agg(['mean', 'std'])

# continent_stats = ...

# Display the results sorted by mean life expectancy
# continent_stats.sort_values('mean', ascending=False)

# TODO: Create a bar chart comparing mean life expectancy across continents
# Include error bars showing the standard deviation
# Hint: Use plt.bar() with yerr parameter

plt.figure(figsize=(10, 6))

# Your code here


plt.tight_layout()
plt.show()
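
Grouped statistics and error bars can be sketched together on invented data (the DataFrame `toy`, its values, and the `capsize` setting are choices made for this example).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy data: two made-up groups.
toy = pd.DataFrame({
    "continent": ["X", "X", "X", "Y", "Y"],
    "lifeExp": [70.0, 74.0, 78.0, 50.0, 60.0],
})

# agg() with a list of names returns one column per statistic.
# Group X: mean 74.0, sample std 4.0. Group Y: mean 55.0.
stats = (toy.groupby("continent")["lifeExp"]
            .agg(["mean", "std"])
            .sort_values("mean", ascending=False))

fig, ax = plt.subplots(figsize=(6, 4))

# yerr draws a line of +/- one standard deviation around each bar top;
# capsize adds small horizontal caps so the ends are easy to read.
ax.bar(stats.index, stats["mean"], yerr=stats["std"], capsize=5)
ax.set_title("Mean by Group with Standard Deviation (toy data)")
ax.set_xlabel("Group")
ax.set_ylabel("Mean value")
fig.tight_layout()

print(stats)
```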

Part 5: Analysis and Conclusions (15 points)

Based on your analysis, answer the following questions.

Question 5.1: Distribution Shape (5 points)

TODO: Look at your histogram from Task 3.2. Describe the shape of the life expectancy distribution.

Consider:

  • Is it symmetric or skewed?

  • Is it unimodal (one peak) or bimodal (two peaks)?

  • Are there any outliers?

Your Answer:

Write your answer here (double-click to edit)

Question 5.2: Mean vs Median (5 points)

TODO: Compare the mean and median life expectancy you calculated.

  • Which one is larger?

  • What does this tell you about the distribution?

  • Which measure would you use to describe the “typical” life expectancy and why?

Your Answer:

Write your answer here

Question 5.3: Continental Differences (5 points)

TODO: Based on your analysis of life expectancy by continent:

  1. Which continent has the highest average life expectancy?

  2. Which continent has the most variability (highest standard deviation)?

  3. What factors might explain these differences?

Your Answer:

Write your answer here


Bonus Challenge (10 extra points)

For extra credit, complete the following challenge.

Bonus: Temporal Analysis

TODO: Create a line plot showing how the average life expectancy has changed over time for each continent.

Requirements:

  • Calculate mean life expectancy by year and continent

  • Create a line plot with one line per continent

  • Use different colors for each continent

  • Add a legend

  • Add appropriate title and labels

# BONUS: Create a line plot of life expectancy over time by continent
# Hint: Use the original df (not df_2007)
# Group by year and continent, then plot

plt.figure(figsize=(12, 6))

# Your code here


plt.tight_layout()
plt.show()
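
Grouping by two columns and plotting one line per group might look like the sketch below. The data are invented, and reshaping with `unstack()` is one possible approach among several (a loop over filtered subsets would also work).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy data: two made-up groups, two countries each, observed in two years.
toy = pd.DataFrame({
    "continent": ["X", "X", "X", "X", "Y", "Y", "Y", "Y"],
    "year": [2000, 2000, 2005, 2005, 2000, 2000, 2005, 2005],
    "lifeExp": [70.0, 72.0, 73.0, 75.0, 50.0, 54.0, 56.0, 60.0],
})

# Mean per (year, group), then unstack() pivots the groups into columns,
# leaving one row per year.
trend = toy.groupby(["year", "continent"])["lifeExp"].mean().unstack("continent")

fig, ax = plt.subplots(figsize=(6, 4))
for name in trend.columns:
    # Each plot() call takes the next color from matplotlib's default cycle.
    ax.plot(trend.index, trend[name], marker="o", label=name)

ax.set_title("Mean Value over Time by Group (toy data)")
ax.set_xlabel("Year")
ax.set_ylabel("Mean value")
ax.legend(title="Group")
fig.tight_layout()

print(trend)
```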

Bonus Question

What trends do you observe in the line plot? Has the gap between continents increased or decreased over time?

Your Answer:

Write your answer here


Submission Checklist

Before submitting, make sure you have:

  • [ ] Completed all TODO tasks

  • [ ] Run all cells from top to bottom without errors

  • [ ] Added titles and labels to all visualizations

  • [ ] Written answers to all analysis questions

  • [ ] Saved your notebook

Total Points: 100 (+ 10 bonus)

Section                             Points
Part 1: Loading and Exploring       15
Part 2: Data Cleaning               15
Part 3: Visualization               35
Part 4: Descriptive Statistics      20
Part 5: Analysis and Conclusions    15
Bonus                               10


Good luck! 🎉