Chapter 1 Assignment: Exploring Global Life Expectancy Data

Overview

In this assignment, you will apply the concepts learned in Chapter 1 to analyze a real-world dataset. You will:

Load and explore a public dataset
Clean the data by handling missing values and data types
Create visualizations (bar charts, histograms)
Calculate descriptive statistics (mean, median, standard deviation)
Draw conclusions based on your analysis

Dataset: Gapminder Life Expectancy Data

We will use the Gapminder dataset, which contains information about countries, including:

Life expectancy (column lifeExp)
GDP per capita (column gdpPercap)
Population (column pop)
Continent (column continent)

This dataset is publicly available and widely used for teaching data analysis.

Source: https://www.gapminder.org/data/

Instructions

Complete all the tasks marked with TODO
Write your code in the provided cells
Answer the questions in markdown cells
Make sure your visualizations have proper labels and titles

Part 1: Loading and Exploring the Data (15 points)

First, let’s import the necessary libraries and load the dataset.

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Load the Gapminder dataset from a public URL
url = "https://raw.githubusercontent.com/plotly/datasets/master/gapminderDataFiveYear.csv"
df = pd.read_csv(url)

# Display the first 10 rows
print("First 10 rows of the dataset:")
df.head(10)

Task 1.1: Explore the Dataset Structure (5 points)

TODO: Use appropriate pandas methods to answer the following questions:

How many rows and columns does the dataset have?
What are the data types of each column?
Are there any missing values?

# TODO: Find the shape of the dataset (rows, columns)
# Hint: Use df.shape

# TODO: Display data types and info about the dataset
# Hint: Use df.info() or df.dtypes

# TODO: Check for missing values in each column
# Hint: Use df.isnull().sum()
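
As a reference for the three hints above, here is a minimal sketch of the same inspection calls on a small made-up table. The `toy` DataFrame and its values are hypothetical and are not taken from the Gapminder file; the point is only to show what each call returns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Gapminder file
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["X", "X", "Y", "Y"],
    "lifeExp": [70.1, np.nan, 55.3, 61.0],
})

n_rows, n_cols = toy.shape               # (rows, columns) tuple
column_types = toy.dtypes                # one dtype per column
missing_per_column = toy.isnull().sum()  # number of NaN entries per column

print(f"{n_rows} rows, {n_cols} columns")
print(column_types)
print(missing_per_column)
```

Note that `shape` is an attribute (no parentheses), while `isnull()` and `sum()` are methods.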

Task 1.2: Understand the Variables (5 points)

TODO: Find the unique values of the categorical columns (continent and country) and the range of years covered.

# TODO: Find unique continents in the dataset
# Hint: Use df['continent'].unique()

# TODO: Find the range of years covered in the dataset
# Hint: Use df['year'].min() and df['year'].max()

# TODO: How many unique countries are in the dataset?
# Hint: Use df['country'].nunique()
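
The difference between `unique()` and `nunique()` is easy to mix up, so here is a short sketch on a made-up table (the `toy` frame and its labels are hypothetical): `unique()` returns the distinct values themselves, while `nunique()` returns only how many there are.

```python
import pandas as pd

# Hypothetical toy data: three countries observed in two years
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "continent": ["X", "X", "X", "X", "Y", "Y"],
    "year": [2002, 2007, 2002, 2007, 2002, 2007],
})

continent_names = toy["continent"].unique()  # distinct values, in order of first appearance
n_countries = toy["country"].nunique()       # count of distinct values
first_year, last_year = toy["year"].min(), toy["year"].max()

print(list(continent_names), n_countries, first_year, last_year)
```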

Task 1.3: Filter Data for Analysis (5 points)

For the rest of this assignment, we will focus on the most recent year in the dataset.

TODO: Create a new DataFrame containing only the data from year 2007.

# TODO: Filter the dataset to include only year 2007
# Hint: df_2007 = df[df['year'] == 2007]

df_2007 = None  # Replace with your code

# Display the shape of the filtered dataset
print(f"Number of countries in 2007: {len(df_2007) if df_2007 is not None else 'Complete the TODO'}")
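
Row filtering in pandas works through a boolean mask. The sketch below shows the pattern on a made-up table (the `toy` frame is hypothetical); it selects the latest year programmatically rather than hard-coding it, which is one possible variation on the hint above.

```python
import pandas as pd

# Hypothetical toy data: two countries, two years each
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2002, 2007, 2002, 2007],
    "lifeExp": [60.0, 62.5, 71.0, 72.4],
})

is_latest = toy["year"] == toy["year"].max()  # boolean Series, one entry per row
toy_latest = toy[is_latest].copy()            # .copy() so later edits do not touch `toy`

print(toy_latest)
```

Taking a `.copy()` of the filtered rows avoids ambiguity about whether later assignments modify the original table.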

Part 2: Data Cleaning (15 points)

Real-world data often contains issues that need to be addressed before analysis. In this section, you will practice data cleaning techniques.

Task 2.1: Introduce and Handle Missing Values (10 points)

Let’s simulate a real-world scenario by introducing some missing values, then handle them appropriately.

# Create a copy of the 2007 data for cleaning practice
df_clean = df_2007.copy()

# Introduce some missing values (simulating real-world data issues)
np.random.seed(42)  # For reproducibility
missing_indices = np.random.choice(df_clean.index, size=10, replace=False)
df_clean.loc[missing_indices[:5], 'lifeExp'] = np.nan
df_clean.loc[missing_indices[5:], 'gdpPercap'] = np.nan

print("Missing values introduced:")
print(df_clean.isnull().sum())

# TODO: Identify which countries have missing life expectancy values
# Hint: df_clean[df_clean['lifeExp'].isnull()]['country']

# TODO: Fill missing 'lifeExp' values with the median life expectancy of their continent
# Hint: Use groupby and transform with a lambda function
# Example: df_clean['lifeExp'] = df_clean.groupby('continent')['lifeExp'].transform(
#              lambda x: x.fillna(x.median()))

# TODO: Fill missing 'gdpPercap' values with the median GDP of their continent

# TODO: Verify that there are no more missing values
# Hint: Use df_clean.isnull().sum()
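
To see what group-wise median imputation does, here is a minimal sketch on a made-up table (the `toy` frame and its values are hypothetical). It uses `transform("median")` followed by `fillna`, which is an alternative spelling of the lambda shown in the hint and should give the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per group
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "continent": ["X", "X", "X", "Y", "Y", "Y"],
    "lifeExp": [70.0, np.nan, 74.0, 50.0, 54.0, np.nan],
})

# Median of each group (NaN ignored), aligned to the original row order
group_median = toy.groupby("continent")["lifeExp"].transform("median")

# Only rows where lifeExp is NaN receive their group's median
toy["lifeExp"] = toy["lifeExp"].fillna(group_median)

print(toy)
```

Here country B receives 72.0 (the median of 70 and 74) and country F receives 52.0 (the median of 50 and 54); the non-missing rows are unchanged.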

Task 2.2: Data Type Validation (5 points)

TODO: Create a new column called pop_millions that contains the population in millions (divide the pop column by 1,000,000), rounded to 2 decimal places.

# TODO: Create a new column 'pop_millions' = population / 1,000,000
# Round to 2 decimal places using .round(2)


# Display sample of the result
# df_clean[['country', 'pop', 'pop_millions']].head(10)
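
A derived column is created by assigning the result of a vectorised expression. The sketch below uses a made-up table (the `toy` frame and its numbers are hypothetical). One detail worth noticing: because `pop` is also the name of a DataFrame method, the column has to be accessed with brackets rather than as an attribute.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "country": ["A", "B"],
    "pop": [1_318_683_096, 4_115_771],
})

# toy.pop would refer to the DataFrame.pop method, so use toy["pop"]
toy["pop_millions"] = (toy["pop"] / 1_000_000).round(2)

print(toy)
```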

Part 3: Visualization (35 points)

Now let’s create visualizations to understand our data better.

Task 3.1: Bar Chart - Countries per Continent (10 points)

TODO: Create a bar chart showing the number of countries in each continent.

Requirements:

Add a title: “Number of Countries per Continent (2007)”
Label the x-axis: “Continent”
Label the y-axis: “Number of Countries”
Add value labels on top of each bar

# TODO: Create a bar chart showing countries per continent
# Step 1: Count countries per continent using value_counts()
# Step 2: Create the bar chart using plt.bar()
# Step 3: Add labels and title
# Step 4: Add value labels on bars

plt.figure(figsize=(10, 6))

# Your code here


plt.tight_layout()
plt.show()
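
The four steps above can be sketched on a made-up table as follows. The `toy` frame, group labels, and titles are hypothetical, and `Axes.bar_label` (available in Matplotlib 3.4 and later) is only one way to add value labels; a loop over the bars with `plt.text()` is another.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a plain script; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data: one row per item, labelled by group
toy = pd.DataFrame({"continent": ["X", "X", "Y", "Z", "Z", "Z"]})

# Step 1: count rows per group (sorted by label for a stable bar order)
counts = toy["continent"].value_counts().sort_index()

# Step 2: draw the bars
fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(list(counts.index), counts.values)

# Step 3: labels and title
ax.set_title("Rows per group (toy data)")
ax.set_xlabel("Group")
ax.set_ylabel("Count")

# Step 4: write each bar's height above it
labels = ax.bar_label(bars)

fig.tight_layout()
```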

Task 3.2: Histogram - Life Expectancy Distribution (10 points)

TODO: Create a histogram showing the distribution of life expectancy across all countries in 2007.

Requirements:

Use 10 bins
Add a title: “Distribution of Life Expectancy (2007)”
Label the x-axis: “Life Expectancy (years)”
Label the y-axis: “Number of Countries”
Add a vertical line showing the mean life expectancy

# TODO: Create a histogram of life expectancy
# Step 1: Create histogram using plt.hist()
# Step 2: Calculate mean life expectancy
# Step 3: Add vertical line at the mean using plt.axvline()
# Step 4: Add labels, title, and legend

plt.figure(figsize=(10, 6))

# Your code here


plt.tight_layout()
plt.show()
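
As an illustration of combining a histogram with a reference line, here is a minimal sketch on synthetic numbers. The `values` array is randomly generated and hypothetical; it is not life expectancy data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a plain script; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, hypothetical values
rng = np.random.default_rng(0)
values = rng.normal(loc=65, scale=10, size=200)

fig, ax = plt.subplots(figsize=(6, 4))

# hist returns the per-bin counts and the bin edges (10 bins have 11 edges)
bin_counts, bin_edges, _ = ax.hist(values, bins=10, edgecolor="black")

# Dashed vertical line at the mean, with a legend entry
mean_value = values.mean()
ax.axvline(mean_value, color="red", linestyle="--", label=f"Mean = {mean_value:.1f}")

ax.set_title("Histogram with mean marker (synthetic data)")
ax.set_xlabel("Value")
ax.set_ylabel("Count")
ax.legend()
fig.tight_layout()
```

The `label` argument on `axvline` is what makes the line appear in the legend; without it, `ax.legend()` has nothing to show.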

Task 3.3: Conditional Histogram - Life Expectancy by Continent (15 points)

TODO: Create separate histograms of life expectancy for each continent to compare distributions.

Requirements:

Create a figure with 5 histograms, one per continent (the starter code uses a 2×3 grid and hides the unused sixth panel)
Use the same x-axis range for all (40 to 85 years)
Add appropriate titles and labels

# TODO: Create conditional histograms by continent
# Hint: Use plt.subplots() to create multiple plots
# Loop through continents and create a histogram for each

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()  # Flatten to make indexing easier

continents = df_clean['continent'].unique()

# Your code here - loop through continents and create histograms


# Hide the 6th subplot (we only have 5 continents)
axes[5].set_visible(False)

plt.suptitle('Life Expectancy Distribution by Continent (2007)', fontsize=14)
plt.tight_layout()
plt.show()
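
The looping pattern can be sketched on synthetic data with three groups in a 2×2 grid. The `toy` frame, group names, and the 30 to 90 axis range are hypothetical and chosen only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a plain script; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic, hypothetical data: three groups of 30 values each
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "group": np.repeat(["X", "Y", "Z"], 30),
    "value": np.concatenate([
        rng.normal(50, 5, 30), rng.normal(65, 5, 30), rng.normal(75, 3, 30),
    ]),
})

groups = sorted(toy["group"].unique())
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes = axes.flatten()

# zip pairs each group with one panel; any extra panels are left over
for ax, name in zip(axes, groups):
    subset = toy.loc[toy["group"] == name, "value"]
    # range= fixes the bin edges so panels are comparable; values outside it are not counted
    ax.hist(subset, bins=10, range=(30, 90), edgecolor="black")
    ax.set_xlim(30, 90)
    ax.set_title(name)
    ax.set_xlabel("Value")
    ax.set_ylabel("Count")

# Hide the panels that did not receive a group
for ax in axes[len(groups):]:
    ax.set_visible(False)

fig.tight_layout()
```

Passing the same `range` to every `hist` call gives all panels identical bin edges, which makes the shapes directly comparable.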

Part 4: Descriptive Statistics (20 points)

Calculate and interpret key statistics for the data.

Task 4.1: Summary Statistics (10 points)

TODO: Calculate the following statistics for life expectancy in 2007:

Mean
Median
Standard deviation
Minimum and Maximum
Range (Max - Min)

# TODO: Calculate descriptive statistics for life expectancy
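# Use numpy or pandas methods: np.mean(), np.median(), np.std(), etc.
# Note: np.std() defaults to ddof=0 (population formula); pandas .std() defaults to ddof=1 (sample formula)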

life_exp = df_clean['lifeExp']

# Calculate statistics
mean_life = None      # TODO: Calculate mean
median_life = None    # TODO: Calculate median
std_life = None       # TODO: Calculate standard deviation
min_life = None       # TODO: Calculate minimum
max_life = None       # TODO: Calculate maximum
range_life = None     # TODO: Calculate range

print("Life Expectancy Statistics (2007):")
print("="*40)
print(f"Mean:               {mean_life:.2f} years" if mean_life is not None else "TODO")
print(f"Median:             {median_life:.2f} years" if median_life is not None else "TODO")
print(f"Standard Deviation: {std_life:.2f} years" if std_life is not None else "TODO")
print(f"Minimum:            {min_life:.2f} years" if min_life is not None else "TODO")
print(f"Maximum:            {max_life:.2f} years" if max_life is not None else "TODO")
print(f"Range:              {range_life:.2f} years" if range_life is not None else "TODO")
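
NumPy and pandas use different defaults for the standard deviation, which can lead to slightly different answers for the same data. The sketch below shows the difference on a small made-up series (the `values` are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical values with mean 5 and squared deviations summing to 32
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_std = np.std(values.to_numpy())             # ddof=0: divides by n
sample_std = values.std()                              # pandas default ddof=1: divides by n - 1
sample_std_numpy = np.std(values.to_numpy(), ddof=1)   # matches the pandas default

value_range = values.max() - values.min()

print(f"population std: {population_std:.3f}")
print(f"sample std:     {sample_std:.3f}")
print(f"range:          {value_range:.1f}")
```

With 8 values the population formula gives exactly 2.0, while the sample formula gives the square root of 32/7, roughly 2.138. Whichever convention you choose, state it and use it consistently.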

Task 4.2: Statistics by Continent (10 points)

TODO: Calculate the mean and standard deviation of life expectancy for each continent.

# TODO: Calculate mean and std of life expectancy by continent
# Hint: Use df_clean.groupby('continent')['lifeExp'].agg(['mean', 'std'])

# continent_stats = ...

# Display the results sorted by mean life expectancy
# continent_stats.sort_values('mean', ascending=False)

# TODO: Create a bar chart comparing mean life expectancy across continents
# Include error bars showing the standard deviation
# Hint: Use plt.bar() with yerr parameter

plt.figure(figsize=(10, 6))

# Your code here


plt.tight_layout()
plt.show()
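
Here is a minimal sketch of grouped statistics feeding a bar chart with error bars, on a made-up table (the `toy` frame and its values are hypothetical). The `std` computed by `agg` is the pandas default, that is, the sample standard deviation.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a plain script; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data: two groups of three values
toy = pd.DataFrame({
    "group": ["X", "X", "X", "Y", "Y", "Y"],
    "value": [60.0, 70.0, 80.0, 50.0, 52.0, 54.0],
})

# One row per group, with columns "mean" and "std", highest mean first
stats = (
    toy.groupby("group")["value"]
    .agg(["mean", "std"])
    .sort_values("mean", ascending=False)
)

fig, ax = plt.subplots(figsize=(6, 4))
# yerr draws a line of +/- one standard deviation around each bar top
ax.bar(list(stats.index), stats["mean"].to_numpy(),
       yerr=stats["std"].to_numpy(), capsize=5)
ax.set_title("Mean value by group (toy data)")
ax.set_xlabel("Group")
ax.set_ylabel("Mean value")
fig.tight_layout()

print(stats)
```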

Part 5: Analysis and Conclusions (15 points)

Based on your analysis, answer the following questions.

Question 5.1: Distribution Shape (5 points)

TODO: Look at your histogram from Task 3.2. Describe the shape of the life expectancy distribution.

Consider:

Is it symmetric or skewed?
Is it unimodal (one peak) or bimodal (two peaks)?
Are there any outliers?

Your Answer:

Write your answer here (double-click to edit)

Question 5.2: Mean vs Median (5 points)

TODO: Compare the mean and median life expectancy you calculated.

Which one is larger?
What does this tell you about the distribution?
Which measure would you use to describe the “typical” life expectancy and why?

Your Answer:

Write your answer here

Question 5.3: Continental Differences (5 points)

TODO: Based on your analysis of life expectancy by continent:

Which continent has the highest average life expectancy?
Which continent has the most variability (highest standard deviation)?
What factors might explain these differences?

Your Answer:

Write your answer here

Bonus Challenge (10 extra points)

For extra credit, complete the following challenge.

Bonus: Temporal Analysis

TODO: Create a line plot showing how the average life expectancy has changed over time for each continent.

Requirements:

Calculate mean life expectancy by year and continent
Create a line plot with one line per continent
Use different colors for each continent
Add a legend
Add appropriate title and labels

# BONUS: Create a line plot of life expectancy over time by continent
# Hint: Use the original df (not df_2007)
# Group by year and continent, then plot

plt.figure(figsize=(12, 6))

# Your code here


plt.tight_layout()
plt.show()
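
Grouping by two keys and reshaping the result is the core of this kind of plot. The sketch below shows one possible approach on a made-up table (the `toy` frame and its values are hypothetical): `unstack` turns the group labels into columns so that each column can be drawn as one line.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a plain script; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data: two groups, two years, two observations each
toy = pd.DataFrame({
    "year": [2002, 2002, 2002, 2002, 2007, 2007, 2007, 2007],
    "group": ["X", "X", "Y", "Y", "X", "X", "Y", "Y"],
    "value": [60.0, 62.0, 70.0, 74.0, 63.0, 65.0, 72.0, 76.0],
})

# Rows: year; columns: group; cells: mean value
mean_by_year = toy.groupby(["year", "group"])["value"].mean().unstack("group")

fig, ax = plt.subplots(figsize=(6, 4))
for name in mean_by_year.columns:
    # Each call gets the next color from Matplotlib's default color cycle
    ax.plot(mean_by_year.index.to_numpy(), mean_by_year[name].to_numpy(),
            marker="o", label=name)

ax.set_title("Mean value over time by group (toy data)")
ax.set_xlabel("Year")
ax.set_ylabel("Mean value")
ax.legend(title="Group")
fig.tight_layout()

print(mean_by_year)
```

In this toy table the gap between the two groups narrows from 11 to 10 between the two years, which is the kind of comparison the reshaped table makes easy to read.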

Bonus Question

What trends do you observe in the line plot? Has the gap between continents increased or decreased over time?

Your Answer:

Write your answer here

Submission Checklist

Before submitting, make sure you have:

[ ] Completed all TODO tasks
[ ] Run all cells from top to bottom without errors
[ ] Added titles and labels to all visualizations
[ ] Written answers to all analysis questions
[ ] Saved your notebook

Total Points: 100 (+ 10 bonus)

Section	Points
Part 1: Loading and Exploring	15
Part 2: Data Cleaning	15
Part 3: Visualization	35
Part 4: Descriptive Statistics	20
Part 5: Analysis and Conclusions	15
Bonus	10

Good luck! 🎉
