Chapter 2 Assignment: Analyzing Relationships in Data

Overview#

In this assignment, you will apply the concepts learned in Chapter 2 to analyze relationships between variables. You will:

Load and explore a multi-variable dataset
Create 2D visualizations (scatter plots, heatmaps)
Calculate and interpret correlations
Make predictions using linear relationships
Identify correlation pitfalls and draw conclusions

Dataset: Auto MPG Dataset#

We will use the Auto MPG dataset, which contains information about cars:

Miles per gallon (mpg)
Number of cylinders
Engine displacement
Horsepower
Weight
Acceleration
Model year
Origin

This dataset originates from the UCI Machine Learning Repository. The code in Part 1 loads a CSV copy of it from the seaborn-data repository on GitHub.

Source: https://archive.ics.uci.edu/ml/datasets/auto+mpg

Instructions#

Complete all the tasks marked with TODO
Write your code in the provided cells
Answer the questions in markdown cells
Make sure your visualizations have proper labels and titles

Part 1: Loading and Exploring the Data (10 points)#

First, let’s import the necessary libraries and load the dataset.

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set display options
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')

# Load the Auto MPG dataset from a public URL
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv"
df = pd.read_csv(url)

# Display the first 10 rows
print("First 10 rows of the dataset:")
df.head(10)

Task 1.1: Explore the Dataset (5 points)#

TODO: Answer the following questions about the dataset:

How many rows and columns does the dataset have?
What are the numerical columns?
Are there any missing values?
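
The three questions above map onto three pandas calls. The sketch below shows them on a small made-up table (the `toy` DataFrame and its column names are assumptions for illustration, not part of the Auto MPG data), so you can see what each call returns before applying it to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: two numeric columns, one text column, one missing value.
toy = pd.DataFrame({
    "speed": [10.0, 12.5, np.nan, 9.0],
    "mass": [1.0, 2.0, 3.0, 4.0],
    "label": ["a", "b", "c", "d"],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = toy.shape

# select_dtypes keeps only the columns whose dtype is numeric
numeric_cols = toy.select_dtypes(include=[np.number]).columns.tolist()

# isnull().sum() counts the missing entries in each column
missing_per_col = toy.isnull().sum()

print(n_rows, n_cols)
print(numeric_cols)
print(missing_per_col)
```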

# TODO: Find the shape of the dataset
# Hint: Use df.shape

# TODO: Display the data types and identify numerical columns
# Hint: Use df.dtypes or df.select_dtypes(include=[np.number]).columns

# TODO: Check for missing values (you will drop them in Task 1.2)
# Hint: Use df.isnull().sum()

Task 1.2: Clean the Data (5 points)#

TODO: Create a clean dataset by:

Dropping rows with missing values
Selecting only numerical columns for correlation analysis
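
One way to chain the two cleaning steps is sketched below on a made-up table (the `toy` DataFrame is an assumption for illustration). It also shows that `dropna()` returns a new DataFrame and leaves the original untouched, which matters later when you need the `origin` column from `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one missing value and one text column.
toy = pd.DataFrame({
    "speed": [10.0, 12.5, np.nan, 9.0],
    "mass": [1.0, 2.0, 3.0, 4.0],
    "label": ["a", "b", "c", "d"],
})

# Step 1: drop every row that contains a missing value.
# Step 2: keep an explicit list of numeric columns.
toy_clean = toy.dropna()[["speed", "mass"]]

print(toy_clean.shape)  # shape after cleaning
print(toy.shape)        # the original is unchanged
```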

# TODO: Create a clean dataset
# Step 1: Drop rows with missing values
# Step 2: Select numerical columns: mpg, cylinders, displacement, horsepower, weight, acceleration, model_year

df_clean = None  # Replace with your code

# Verify the result
# print(f"Clean dataset shape: {df_clean.shape}")
# print(f"Columns: {df_clean.columns.tolist()}")

Part 2: Scatter Plots and Visual Relationships (25 points)#

Before calculating correlations, it’s essential to visualize the relationships between variables.

Task 2.1: Single Scatter Plot (10 points)#

TODO: Create a scatter plot showing the relationship between weight (x-axis) and mpg (y-axis).

Requirements:

Add a title: “Car Weight vs Fuel Efficiency”
Label axes appropriately with units
Add a trend line (optional but encouraged)
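
If you add the optional trend line, a degree-1 `np.polyfit` is one common way to get it. The sketch below uses synthetic data (the arrays `x` and `y` and the true slope of 2 are assumptions for illustration, not values from the car data) and only computes the line; plotting is left to you.

```python
import numpy as np

# Synthetic data: y = 2x + 1 plus a little random noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# A degree-1 fit returns the coefficients from highest power down:
# [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)

# Evaluate the fitted line on an evenly spaced, sorted grid so that it
# draws as a single clean line rather than a zigzag.
x_line = np.linspace(x.min(), x.max(), 100)
y_line = slope * x_line + intercept

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

Passing `x_line` and `y_line` to `plt.plot()` after the `plt.scatter()` call overlays the line on the points.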

# TODO: Create a scatter plot of weight vs mpg
# Step 1: Create figure with plt.figure()
# Step 2: Create scatter plot with plt.scatter()
# Step 3: Add labels and title
# Step 4 (optional): Add trend line using np.polyfit()

plt.figure(figsize=(10, 6))

# Your code here


plt.tight_layout()
plt.show()

Task 2.2: Scatter Plot with Categories (10 points)#

TODO: Create a scatter plot of horsepower vs mpg, colored by origin (USA, Europe, Japan).

Requirements:

Different colors for each origin
Add a legend
Add appropriate title and labels

# TODO: Create a scatter plot colored by origin
# Hint: You'll need to use the original df (with 'origin' column)
# Use different colors for each origin group

plt.figure(figsize=(10, 6))

# Your code here
# You can use: for origin in df['origin'].unique(): plot each group
# Or use seaborn: sns.scatterplot(data=df, x='horsepower', y='mpg', hue='origin')


plt.tight_layout()
plt.show()

Task 2.3: Pair Plot (5 points)#

TODO: Create a pair plot (scatter plot matrix) for the variables: mpg, horsepower, weight, and acceleration.

# TODO: Create a pair plot
# Hint: Use seaborn's pairplot function
# sns.pairplot(df_clean[['mpg', 'horsepower', 'weight', 'acceleration']])

# Your code here

Part 3: Correlation Analysis (30 points)#

Now let’s quantify the relationships we observed visually.

Task 3.1: Calculate Single Correlation (10 points)#

TODO: Calculate the Pearson correlation coefficient between weight and mpg.

Use at least two methods:

NumPy’s np.corrcoef()
SciPy’s stats.pearsonr() (which also gives you the p-value)
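
The two functions return differently shaped results, which is a frequent source of confusion. The sketch below uses five made-up points (an assumption for illustration) to show how to pull a single coefficient out of each.

```python
import numpy as np
from scipy import stats

# Hypothetical, nearly linear toy data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r(x, y).
r_matrix = np.corrcoef(x, y)
r_numpy = r_matrix[0, 1]

# stats.pearsonr returns the coefficient together with a p-value.
r_scipy, p_value = stats.pearsonr(x, y)

print(r_matrix.shape)
print(f"numpy: {r_numpy:.4f}  scipy: {r_scipy:.4f}  p-value: {p_value:.2e}")
```

Both methods should agree up to floating-point rounding; if they do not, check that you indexed the matrix rather than printing it whole.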

# TODO: Calculate correlation between weight and mpg
# Method 1: np.corrcoef()
# Method 2: stats.pearsonr()

# Your code here


# Print results with interpretation
# print(f"Correlation (numpy): {r_numpy:.4f}")
# print(f"Correlation (scipy): {r_scipy:.4f}")
# print(f"P-value: {p_value:.2e}")

Task 3.2: Correlation Matrix (10 points)#

TODO: Create a correlation matrix for all numerical variables and visualize it as a heatmap.

# TODO: Calculate the correlation matrix
# Hint: Use df_clean.corr()

# corr_matrix = ...

# Print the correlation matrix
# print("Correlation Matrix:")
# print(corr_matrix.round(3))

# TODO: Create a heatmap of the correlation matrix
# Hint: Use sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0)

plt.figure(figsize=(10, 8))

# Your code here


plt.tight_layout()
plt.show()

Task 3.3: Interpret the Correlations (10 points)#

TODO: Based on the correlation matrix, answer the following questions in the markdown cell below:

Which variable has the strongest positive correlation with mpg?
Which variable has the strongest negative correlation with mpg?
Which two variables (other than mpg) have the highest correlation with each other?
Are there any surprising correlations? Explain.
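
You can answer the first three questions by reading the heatmap, but it is worth checking your reading programmatically. The sketch below does this on synthetic data (the `toy` DataFrame and its column names `target`, `pos`, `neg`, and `noise` are assumptions for illustration); the same pattern should carry over to your own correlation matrix with `mpg` in place of `target`.

```python
import numpy as np
import pandas as pd

# Synthetic data: "pos" moves with the target, "neg" moves against it,
# and "noise" is unrelated.
rng = np.random.default_rng(1)
n = 200
a = rng.normal(size=n)
toy = pd.DataFrame({
    "target": a + rng.normal(scale=0.3, size=n),
    "pos": a + rng.normal(scale=0.5, size=n),
    "neg": -a + rng.normal(scale=0.5, size=n),
    "noise": rng.normal(size=n),
})
corr = toy.corr()

# Correlations with the target, excluding its correlation with itself.
with_target = corr["target"].drop("target")
strongest_pos = with_target.idxmax()
strongest_neg = with_target.idxmin()

# Strongest pair among the remaining variables: take absolute values,
# zero the diagonal (each variable with itself), and locate the maximum.
others = corr.drop(index="target", columns="target")
vals = others.abs().to_numpy().copy()
np.fill_diagonal(vals, 0.0)
i, j = np.unravel_index(np.argmax(vals), vals.shape)
top_pair = (others.index[i], others.columns[j])

print(strongest_pos, strongest_neg, top_pair)
```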

Your Answers:

Strongest positive correlation with mpg: Write your answer here
Strongest negative correlation with mpg: Write your answer here
Highest correlation between other variables: Write your answer here
Surprising correlations: Write your answer here

Part 4: Prediction Using Correlation (20 points)#

Now let’s use correlation to make predictions.

Task 4.1: Implement Prediction Function (10 points)#

TODO: Implement a function that predicts y from x using the formula:

\[\hat{y} = \bar{y} + r \frac{\sigma_y}{\sigma_x}(x - \bar{x})\]

Where:

\(r\) is the correlation coefficient
\(\bar{x}, \bar{y}\) are the means
\(\sigma_x, \sigma_y\) are the standard deviations
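
As a worked illustration of the formula, take made-up round numbers (not statistics from the Auto MPG data): \(\bar{x} = 3000\), \(\bar{y} = 24\), \(\sigma_x = 800\), \(\sigma_y = 8\), \(r = -0.8\), and a new value \(x = 3500\). Then:

\[\hat{y} = 24 + (-0.8)\,\frac{8}{800}\,(3500 - 3000) = 24 - 0.8 \times 0.01 \times 500 = 24 - 4 = 20\]

The factor \(r \, \sigma_y / \sigma_x\) plays the role of the slope of the line, so in this example each extra unit of \(x\) changes the prediction by \(-0.008\). A value of \(x\) equal to \(\bar{x}\) always predicts \(\bar{y}\).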

# TODO: Implement the prediction function
def predict_from_correlation(x_new, x_data, y_data):
    """
    Predict y from x using correlation.
    
    Parameters:
    - x_new: the new x value(s) to predict for
    - x_data: array of x values (training data)
    - y_data: array of y values (training data)
    
    Returns:
    - predicted y value(s)
    """
    # Calculate correlation
    r = None  # TODO
    
    # Calculate means
    x_mean = None  # TODO
    y_mean = None  # TODO
    
    # Calculate standard deviations
    x_std = None  # TODO
    y_std = None  # TODO
    
    # Calculate prediction
    y_pred = None  # TODO: Use the formula
    
    return y_pred

# Test the function (uncomment after implementing)
# test_pred = predict_from_correlation(3000, df_clean['weight'], df_clean['mpg'])
# print(f"Predicted mpg for a 3000 lb car: {test_pred:.2f}")

Task 4.2: Make Predictions (10 points)#

TODO: Use your function to predict MPG for cars with the following weights:

2500 lbs
3000 lbs
3500 lbs
4000 lbs
4500 lbs

Then create a plot showing:

The original scatter plot (weight vs mpg)
The regression line, with your predicted points marked on it

# TODO: Make predictions for different weights
weights_to_predict = [2500, 3000, 3500, 4000, 4500]

# Your code here - make predictions


# Print predictions
# print("Weight (lbs) | Predicted MPG")
# print("-" * 30)
# for w, mpg in zip(weights_to_predict, predictions):
#     print(f"{w:12} | {mpg:.2f}")

# TODO: Create a plot with scatter points and regression line
plt.figure(figsize=(10, 6))

# Step 1: Plot original data as scatter
# Step 2: Plot regression line
# Step 3: Mark the predicted points

# Your code here


plt.tight_layout()
plt.show()

Part 5: Correlation Pitfalls (15 points)#

Understanding the limitations of correlation is crucial.

Task 5.1: Non-linear Relationships (5 points)#

TODO: The relationship between acceleration and mpg might not be linear. Create a scatter plot and calculate the Pearson correlation. Is the Pearson coefficient a good summary of this relationship?
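
To see why linearity matters, consider an extreme synthetic case (the arrays below are assumptions for illustration and are not meant to resemble the acceleration data): `y` is completely determined by `x`, yet the Pearson coefficient is essentially zero because the relationship is a symmetric curve rather than a line.

```python
import numpy as np
from scipy import stats

# Synthetic data: a perfect but non-linear (quadratic) relationship.
x = np.linspace(-3, 3, 61)
y = x ** 2

# Pearson's r only measures the strength of a *linear* association.
r, _ = stats.pearsonr(x, y)

print(f"Pearson r for y = x^2 on a symmetric range: {r:.3f}")
```

A low coefficient therefore does not by itself mean "no relationship", which is why the scatter plot should be inspected alongside the number.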

# TODO: Analyze the acceleration vs mpg relationship
# Create scatter plot and calculate correlation

plt.figure(figsize=(10, 6))

# Your code here


plt.tight_layout()
plt.show()

# Calculate and print correlation
# r_accel_mpg = ...
# print(f"Correlation between acceleration and mpg: {r_accel_mpg:.3f}")

Task 5.2: Correlation vs Causation (5 points)#

TODO: Answer the following question in the markdown cell below:

We found a strong negative correlation between weight and mpg. Consider the following:

Does making a car heavier cause it to have lower mpg?
What are potential confounding variables?
Can we conclude causation from this correlation alone?
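
A small simulation can make the idea of a confounder concrete. In the sketch below everything is synthetic and hypothetical: a hidden variable `z` drives both `x` and `y`, while `x` and `y` have no direct link to each other. They still end up strongly correlated, and the correlation largely disappears once the contribution of `z` is removed.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Hypothetical hidden variable z influences both x and y.
# x and y are never computed from one another.
z = rng.normal(size=n)
x = z + 0.5 * rng.normal(size=n)
y = -z + 0.5 * rng.normal(size=n)

# The raw correlation between x and y is strongly negative.
r_xy = np.corrcoef(x, y)[0, 1]

# Subtract z's known contribution from each variable and correlate
# what is left over: only independent noise remains.
r_after_removing_z = np.corrcoef(x - z, y + z)[0, 1]

print(f"r(x, y) = {r_xy:.2f}")
print(f"r after removing z = {r_after_removing_z:.2f}")
```

In real data the confounder's contribution is not known exactly, so this is only a demonstration of the mechanism, not a recipe for your answer.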

Your Answer:

Write your answer here discussing causation vs correlation, and potential confounding variables

Task 5.3: Pearson vs Spearman (5 points)#

TODO: Calculate both Pearson and Spearman correlations for horsepower vs mpg. Which one is more appropriate and why?
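
The two coefficients answer slightly different questions: Pearson measures how close the points are to a straight line, while Spearman measures how consistently one variable rises or falls with the other. The sketch below shows the gap on a synthetic curve (the arrays are assumptions for illustration, not the horsepower data).

```python
import numpy as np
from scipy import stats

# Synthetic data: y always decreases as x increases, but along a curve.
x = np.linspace(1, 10, 50)
y = 1.0 / x

r_pearson, _ = stats.pearsonr(x, y)
r_spearman, _ = stats.spearmanr(x, y)

print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

Because the ranks of `x` and `y` are in exactly opposite order, Spearman reports a perfect negative association, while Pearson is weaker because the points bend away from a line.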

# TODO: Calculate Pearson and Spearman correlations
# Hint: Use stats.pearsonr() and stats.spearmanr()

# Your code here


# print(f"Pearson correlation:  {r_pearson:.4f}")
# print(f"Spearman correlation: {r_spearman:.4f}")

Bonus Challenge (10 extra points)#

For extra credit, complete the following challenge.

Bonus: Multi-variable Analysis#

TODO: Investigate whether the relationship between weight and mpg differs by origin (USA, Europe, Japan).

Calculate the correlation between weight and mpg separately for each origin
Create a visualization showing the different relationships
Explain your findings
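
One way to compute a correlation per group is to iterate over a pandas `groupby`. The sketch below uses synthetic data with two made-up groups, `A` and `B` (all names and numbers here are assumptions for illustration); for the assignment the grouping column would be `origin`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 100
x_a = rng.normal(size=n)
x_b = rng.normal(size=n)

# Hypothetical data: group A has a tight negative relationship,
# group B a much looser one.
toy = pd.DataFrame({
    "group": ["A"] * n + ["B"] * n,
    "x": np.concatenate([x_a, x_b]),
    "y": np.concatenate([
        -2.0 * x_a + rng.normal(scale=0.5, size=n),
        -0.5 * x_b + rng.normal(scale=1.0, size=n),
    ]),
})

# Iterating over a groupby yields (group name, sub-DataFrame) pairs.
r_by_group = {name: g["x"].corr(g["y"]) for name, g in toy.groupby("group")}

for name, r in r_by_group.items():
    print(f"{name}: r = {r:.2f}")
```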

# BONUS: Analyze weight-mpg correlation by origin
# Calculate correlations for each origin group

# Your code here

# BONUS: Create visualization with separate regression lines for each origin

plt.figure(figsize=(12, 6))

# Your code here

plt.tight_layout()
plt.show()

Bonus Question:#

What do the different correlations by origin tell us? Is the weight-mpg relationship consistent across all origins, or are there differences?