ASSIGNMENT 6¶

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import loadmat

Question 1¶

The peanuts data set contains measurements of the alfatoxin $(X)$ and the corresponding percentage of noncontaminated peanuts in the batch $(Y)$. Do a scatterplot of these data. What is a good model for these data? Use cross-validation to choose the best model.


In [13]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

data = loadmat("Data_Files/peanuts.mat")
X = data["X"].reshape(-1, 1)
Y = data["Y"].reshape(-1, 1)

# Scatterplot
plt.figure(figsize=(8, 4))
plt.scatter(X, Y, color='black', facecolors='none', label='Data')
plt.xlabel("Aflatoxin (X)")
plt.ylabel("Percentage of noncontaminated peanuts (Y)")
plt.title("Peanuts Data Scatterplot")
plt.legend()
plt.grid(True)
plt.show()

# Cross-validation 
kf = KFold(n_splits=5, shuffle=True, random_state=1905)

degrees = [1, 2, 3, 4, 5]

plt.figure(figsize=(8, 4))
plt.scatter(X, Y, color='black', facecolors='none', label='Data')
X_seq = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
best_mse = float('inf')
best_degree = 0

for d in degrees:
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=d, include_bias=False)),
        ("linreg", LinearRegression())
    ])
    
    # convert negative MSE values to positive
    mse = -cross_val_score(
        model, X, Y,
        scoring="neg_mean_squared_error",
        cv=kf
    ).mean()

    print(f"Degree {d}: MSE = {mse:.6f}")

    if mse < best_mse:
        best_mse = mse
        best_degree = d
    
    # fit on whole dataset for plotting
    model.fit(X, Y)
    plt.plot(X_seq, model.predict(X_seq), label=f'Degree {d}')


print(f"\n Best Model: Polynomial Degree {best_degree}")
plt.title('Polynomial Model Fits')
plt.xlabel('Alfatoxin (X)')
plt.ylabel('Percentage of noncontaminated peanuts (Y)')
plt.legend()
plt.grid(True)
plt.show()
No description has been provided for this image
Degree 1: MSE = 0.001687
Degree 2: MSE = 0.001672
Degree 3: MSE = 0.001845
Degree 4: MSE = 0.002937
Degree 5: MSE = 0.006057

 Best Model: Polynomial Degree 2
No description has been provided for this image

Scatterplot:¶

From the scatterplot, we observe a decreasing relationship between aflatoxin level (X) and the percentage of non-contaminated peanuts (Y). As aflatoxin increases, the proportion of non-contaminated peanuts declines.

The relationship is smooth and it's not perfectly linear. There is no sharp curvature or oscillatory behavior, this suggests that high-degree polynomials are unnecessary.

Model comparison via cross-validation:¶

We fitted polynomial models of degrees 1 through 5 and used cross-validation mean squared error (CV MSE) as the selection criterion.

The results show that: - Degree 1, which is the linear model, slightly underfits the curvature. - Degrees 2–5 provide very similar fits visually. - Higher degrees start to bend more at the boundaries, near large X values.

Cross-validation selected degree 2 as the best model. This indicates that: - A quadratic term captures best the curvature of the data. - Additional higher-order terms do not improve model performance.


Question 2¶

Use Monte Carlo simulation to compare the performance of the bootstrap and the jackknife methods for estimating the standard error and bias of the sample second central moment. For every Monte Carlo trial, generate 100 standard normal random variables and calculate the bootstrap and jackknife estimates of the standard error and bias. Show the distribution of the bootstrap estimates (of bias and standard error) and the jackknife estimates (of bias and standard error) in a histogram or a box plot. Make some comparisons of the two methods.


In [14]:
# Parameters
MC_trials = 500          # number of Monte Carlo trials
n = 100                  # sample size
B = 1000                 # bootstrap resamples

boot_bias = []
boot_se = []
jack_bias = []
jack_se = []

# true second central moment of N(0,1)
true_var = 1.0

for _ in range(MC_trials):
    # generate data
    x = np.random.randn(n)
    
    # sample second central moment
    theta_hat = np.mean((x - np.mean(x))**2)
    
    # ---------- Bootstrap ----------
    boot_thetas = []
    for _ in range(B):
        xb = np.random.choice(x, size=n, replace=True)
        boot_thetas.append(np.mean((xb - np.mean(xb))**2))
    boot_thetas = np.array(boot_thetas)
    
    # Bootstrap estimates
    boot_bias.append(np.mean(boot_thetas) - theta_hat)
    boot_se.append(np.std(boot_thetas, ddof=1))
    
    # ---------- Jackknife ----------
    jack_thetas = []
    for i in range(n):
        x_leave = np.delete(x, i)
        jack_thetas.append(np.mean((x_leave - np.mean(x_leave))**2))
    jack_thetas = np.array(jack_thetas)
    
    theta_jack_mean = np.mean(jack_thetas)
    
    # Jackknife estimates
    jack_bias.append((n - 1) * (theta_jack_mean - theta_hat))
    jack_se.append(
        np.sqrt((n - 1) / n * np.sum((jack_thetas - theta_jack_mean) ** 2))
    )

boot_bias = np.array(boot_bias)
boot_se = np.array(boot_se)
jack_bias = np.array(jack_bias)
jack_se = np.array(jack_se)

# ---------- Plots ----------
plt.figure(figsize=(14, 10))

# Histograms of Bias
plt.subplot(2, 2, 1)
plt.hist(boot_bias, bins=30, alpha=0.5, label="Bootstrap Bias", color='orange', edgecolor='black')
plt.hist(jack_bias, bins=30, alpha=0.5, label="Jackknife Bias", color='blue', edgecolor='black')
plt.title("Distribution of Bias Estimates")
plt.xlabel('Estimated Bias')
plt.legend()

# Histograms of SE
plt.subplot(2, 2, 2)
plt.hist(boot_se, bins=30, alpha=0.5, label="Bootstrap SE", color='orange', edgecolor='black')
plt.hist(jack_se, bins=30, alpha=0.5, label="Jackknife SE", color='blue', edgecolor='black')
plt.title("Distribution of SE Estimates")
plt.xlabel('Estimated SE')
plt.legend()

# Boxplots of Bias
plt.subplot(2, 2, 3)
plt.boxplot([jack_bias, boot_bias], labels=['Jackknife', 'Bootstrap'])
plt.title('Comparison of Bias Estimates')
plt.ylabel('Bias')
plt.grid(True, linestyle='--', alpha=0.3)

# Boxplots of SE
plt.subplot(2, 2, 4)
plt.boxplot([jack_se, boot_se], labels=['Jackknife', 'Bootstrap'])
plt.title('Comparison of SE Estimates')
plt.ylabel('Standard Error')
plt.grid(True, linestyle='--', alpha=0.3)

plt.tight_layout()
plt.show()

# ---------- Numerical summaries ----------
print(f"--- Summary Statistics (M={MC_trials} trials) ---")
print(f"Jackknife Mean Bias: {np.mean(jack_bias):.5f} (SD: {np.std(jack_bias):.5f})")
print(f"Bootstrap Mean Bias: {np.mean(boot_bias):.5f} (SD: {np.std(boot_bias):.5f})")
print("-" * 40)
print(f"Jackknife Mean SE:   {np.mean(jack_se):.5f} (SD: {np.std(jack_se):.5f})")
print(f"Bootstrap Mean SE:   {np.mean(boot_se):.5f} (SD: {np.std(boot_se):.5f})")
/var/folders/tm/0v_xzs6s5vzfw60jkby5vd980000gn/T/ipykernel_15810/2986901424.py:73: MatplotlibDeprecationWarning: The 'labels' parameter of boxplot() has been renamed 'tick_labels' since Matplotlib 3.9; support for the old name will be dropped in 3.11.
  plt.boxplot([jack_bias, boot_bias], labels=['Jackknife', 'Bootstrap'])
/var/folders/tm/0v_xzs6s5vzfw60jkby5vd980000gn/T/ipykernel_15810/2986901424.py:80: MatplotlibDeprecationWarning: The 'labels' parameter of boxplot() has been renamed 'tick_labels' since Matplotlib 3.9; support for the old name will be dropped in 3.11.
  plt.boxplot([jack_se, boot_se], labels=['Jackknife', 'Bootstrap'])
No description has been provided for this image
--- Summary Statistics (M=500 trials) ---
Jackknife Mean Bias: -0.00994 (SD: 0.00151)
Bootstrap Mean Bias: -0.00967 (SD: 0.00469)
----------------------------------------
Jackknife Mean SE:   0.13870 (SD: 0.02681)
Bootstrap Mean SE:   0.13584 (SD: 0.02625)

Bias estimation:

Both methods produce negative bias estimates, which is expected because the sample second central moment (with denominator 𝑛) is a biased estimator of the population variance.

From the Monte Carlo summaries:

Jackknife mean bias: −0.00994 (SD = 0.00151)
Bootstrap mean bias: −0.00967 (SD = 0.00469)

Key observations:

- The mean bias estimates are very close for the two methods.
- The jackknife bias estimates are much more concentrated, as seen both in the histogram and the smaller standard deviation.
- Bootstrap bias estimates show higher variability, with heavier tails.

This reflects the fact that the jackknife provides a stable, linear approximation to bias, while the bootstrap bias estimate is more sensitive to resampling variability.

Standard error estimation:

For the standard error of the second central moment: - Both methods yield very similar average standard error estimates. - The bootstrap SE is slightly smaller on average, but the difference is negligible. - The distributions of SE estimates largely overlap, as confirmed by the boxplots.

Thus, for standard error estimation, both methods perform similarly in this setting.

Overall comparison: The jackknife provides more stable bias estimates and lower variability across Monte Carlo trials. The bootstrap produces comparable SE estimates and exhibits higher variability in bias estimation.


Question 3¶

Using the law data set, find the jackknife replicates of the median. How many different values are there? What is the jackknife estimate of the standard error of the median? Use the bootstrap method to get an estimate of the standard error of the median. Compare the two estimates of the standard error of the median.


In [18]:
data = loadmat("Data_Files/law.mat")

X = data['law']  # Shape is (15, 2)

def analyze_variable(data_vector, label):
    n = len(data_vector)

    # ---------- Jackknife ----------
    jack_medians = []

    for i in range(n):
        # create dataset with i-th observation removed
        x_leave = np.delete(x, i)
        jack_medians.append(np.median(x_leave))

    jack_medians = np.array(jack_medians)

    # num of distinct jackknife medians
    num_distinct = len(np.unique(jack_medians))

    # Jackknife SE
    jack_mean = np.mean(jack_medians)
    jack_se = np.sqrt((n - 1) / n * np.sum((jack_medians - jack_mean) ** 2))

    # ---------- Bootstrap ----------
    B = 1000
    boot_medians = []

    for _ in range(B):
        # resample with replacement
        xb = np.random.choice(x, size=n, replace=True)
        boot_medians.append(np.median(xb))

    boot_medians = np.array(boot_medians)
    boot_se = np.std(boot_medians, ddof=1)

    # ---------- Plots ----------
    plt.figure()
    plt.hist(jack_medians, bins=15)
    plt.title(f"Jackknife Replicates of the Median for {label}")
    plt.xlabel("Median")
    plt.ylabel("Frequency")
    plt.show()

    plt.figure()
    plt.hist(boot_medians, bins=30)
    plt.title(f"Bootstrap Replicates of the Median for {label}")
    plt.xlabel("Median")
    plt.ylabel("Frequency")
    plt.show()

    # ---------- Output ----------
    print("Number of observations:", n)
    print("Number of distinct jackknife medians:", num_distinct)
    print(f"Jackknife SE of median: {jack_se:.5f}")
    print(f"Bootstrap SE of median: {boot_se:.5f}")

# run for both variables (assuming Col 1 is LSAT, Col 2 is GPA)
analyze_variable(X[:, 0], "Variable 1 (e.g., LSAT)")
analyze_variable(X[:, 1], "Variable 2 (e.g., GPA)")
No description has been provided for this image
No description has been provided for this image
Number of observations: 15
Number of distinct jackknife medians: 2
Jackknife SE of median: 0.02439
Bootstrap SE of median: 0.26446
No description has been provided for this image
No description has been provided for this image
Number of observations: 15
Number of distinct jackknife medians: 2
Jackknife SE of median: 0.02439
Bootstrap SE of median: 0.25740

Jackknife replicates of the median

For each variable in the law data set (total of 15 observations):

- Number of distinct jackknife medians: 2

This confirms the high discreteness of the jackknife distribution for the median:

- Only deletions of observations very close to the sample median affect its value.
- Removing most observations leaves the median unchanged.
- Therefore, the jackknife replicates take only a small number of distinct values.

Standard error estimates:

For each variable, the estimated standard errors are:

- Jackknife SE: 0.02439
- Bootstrap SE: ≈ 0.26


- The bootstrap SE is an order of magnitude larger than the jackknife SE.
- The jackknife SE is extremely small due to the lack of variability in the jackknife replicates.
- This indicates that the jackknife severely underestimates the true variability of the median in this case.

Comparison of methods:

- The jackknife performs poorly for the median because:

    - The statistic is not smooth.
    - The jackknife relies on linear approximations.
    - The replicates exhibit very limited variation.

- The bootstrap, by resampling with replacement, captures the full variability of the median and produces a much more realistic standard error estimate.

Question 4¶

Generate data according to $$y = 4x^3 + 6x^2 - 1 + \varepsilon,$$ where $\varepsilon$ represents some noise with constant variance. Fit a first-degree model to the data and plot the residuals versus the observed predictor values $x_i$ (residual dependence plot). Construct also box plots and histograms of the residuals. Do they show that the model is not adequate? Repeat for $d = 2, 3$.


In [19]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

np.random.seed(1905)

# generate predictor
n = 200
x = np.linspace(-2, 2, n)
X = x.reshape(-1, 1)

# generate response
epsilon = np.random.normal(0, 5, size=n)
y = 4*x**3 + 6*x**2 - 1 + epsilon

# degrees to fit
degrees = [1, 2, 3]

# create subplots: 3 rows (degrees) x 3 columns (plots)
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
plt.subplots_adjust(hspace=0.4, wspace=0.3)

for i, d in enumerate(degrees):
    # Fit polynomial model
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=d, include_bias=False)),
        ("linreg", LinearRegression())
    ])
    model.fit(X, y)
    
    # predict and calculate residuals
    y_hat = model.predict(X)
    residuals = y - y_hat

    # row for the current degree
    ax_scatter = axes[i, 0]
    ax_hist = axes[i, 1]
    ax_box = axes[i, 2]

    # ---------- Residual dependence plot ----------
    ax_scatter.scatter(x, residuals, color='blue', alpha=0.6)
    ax_scatter.axhline(0, color='red', linestyle='--')
    ax_scatter.set_title(f'Degree {d}: Residuals vs x')
    ax_scatter.set_xlabel('x')
    ax_scatter.set_ylabel('Residuals')

    # ---------- Histogram ----------
    # checks for normality
    ax_hist.hist(residuals, bins=30, color='green', edgecolor='black', alpha=0.7)
    ax_hist.set_title(f'Degree {d}: Histogram of Residuals')
    ax_hist.set_xlabel('Residuals')
    
    # ---------- Boxplot ----------
    # Checks for symmetry and outliers
    ax_box.boxplot(residuals, vert=False)
    ax_box.set_title(f'Degree {d}: Box Plot')
    ax_box.set_xlabel('Residuals')

plt.suptitle('Residual Analysis for Polynomial Models (d=1, 2, 3)', fontsize=16)
plt.show()
No description has been provided for this image

Results:¶

1) Degree 1 (Linear Model):¶
  • Residuals vs X: We see a distinct U-shape or wave pattern in the residuals. This structure in the residuals indicates that the linear model failed to capture the non-linear trend of the data. The assumption of random errors is violated.
  • Histogram/Boxplot: The residuals does not look perfectly normal; they are spread out or skewed because the deterministic part of the signal ($x^3, x^2$) is leaking into the error term.
2) Degree 2 (Quadratic Model):¶
  • Residuals vs X: We see a pattern (like an S-shape), because a parabola cannot fit a cubic curve perfectly. The residuals are not randomly scattered around zero.
  • Adequacy: Still not adequate.
3) Degree 3 (Cubic Model):¶
  • Residuals vs X: The residuals appear randomly scattered around the horizontal zero line with no discernible pattern. This indicates the model has successfully captured the underlying signal.
  • Histogram/Boxplot: These resemble a normal distribution centered at zero, consistent with the noise term $\epsilon$ we generated. This confirms the model is adequate.

Question 5¶

Fit the wais data using probit and the complementary log-log link. Compare the results with the logistic regression using appropriate techniques to assess model fit.


In [26]:
import statsmodels.api as sm
import statsmodels.genmod.families.links as links

data = loadmat('Data_Files/wais.mat')
wais_matrix = data['wais']
df = pd.DataFrame(wais_matrix, columns=['WAIS_Score', 'Senile_Status'])

# define X and y
y = df['Senile_Status']
X = df['WAIS_Score']
X = sm.add_constant(X)  # add intercept term

# -------- Probit Model ------------
model_probit = sm.GLM(y, X, family=sm.families.Binomial(link=sm.families.links.Probit()))
res_probit = model_probit.fit()

# -------- Complementary Log-Log Model -----------
model_cloglog = sm.GLM(y, X, family=sm.families.Binomial(link=sm.families.links.cloglog()))
res_cloglog = model_cloglog.fit()

# -------- Logistic Regression (for comparison) -----------
model_logit = sm.GLM(y, X, family=sm.families.Binomial(link=sm.families.links.Logit()))
res_logit = model_logit.fit()

# compare Results
print(f"{'Model':<15} | {'AIC':<10} | {'BIC':<10} | {'Log-Likelihood':<15}")
print("-" * 60)
print(f"{'Probit':<15} | {res_probit.aic:<10.4f} | {res_probit.bic_llf:<10.4f} | {res_probit.llf:<15.4f}")
print(f"{'CLogLog':<15} | {res_cloglog.aic:<10.4f} | {res_cloglog.bic_llf:<10.4f} | {res_cloglog.llf:<15.4f}")
print(f"{'Logistic':<15} | {res_logit.aic:<10.4f} | {res_logit.bic_llf:<10.4f} | {res_logit.llf:<15.4f}")

print("\n------------------------ Probit Model Summary ------------------------\n")
print(res_probit.summary())
Model           | AIC        | BIC        | Log-Likelihood 
------------------------------------------------------------
Probit          | 54.9836    | 58.9615    | -25.4918       
CLogLog         | 55.3107    | 59.2887    | -25.6554       
Logistic        | 55.0174    | 58.9953    | -25.5087       

------------------------ Probit Model Summary ------------------------

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:          Senile_Status   No. Observations:                   54
Model:                            GLM   Df Residuals:                       52
Model Family:                Binomial   Df Model:                            1
Link Function:                 Probit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -25.492
Date:                Wed, 17 Dec 2025   Deviance:                       50.984
Time:                        16:45:03   Pearson chi2:                     51.2
No. Iterations:                     5   Pseudo R-squ. (CS):             0.1816
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.3862      0.685      2.023      0.043       0.043       2.729
WAIS_Score    -0.1880      0.063     -2.984      0.003      -0.312      -0.065
==============================================================================
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/statsmodels/genmod/families/links.py:13: FutureWarning: The cloglog link alias is deprecated. Use CLogLog instead. The cloglog link alias will be removed after the 0.15.0 release.
  warnings.warn(

Data Analysis:¶

  • The dataset contains 54 observations.
  • Column 1 (containing 0s and 1s) is used the binary response variable ($Y$, "senile" status).
  • Column 0 is used as the predictor ($X$, WAIS score).
  • Three models were fitted:
    1. Logistic Regression (Logit Link): AIC = 55.02
    2. Probit Regression (Probit Link): AIC = 54.98
    3. Complementary Log-Log (CLogLog Link): AIC = 55.31

Conclusion: The Probit model provides the best fit for this data, having the lowest AIC (54.98), though the Logistic model is extremely close (55.02). The CLogLog model fits slightly worse than the other two.


Question 6¶

Convert the number of satellites in the horseshoe crab data to a binomial response variable, keeping the carapace width as the predictor. Set $Y = 1$ if a female crab has one or more satellites, and $Y = 0$ if she has no satellites. Fit various models and assess the results.


In [27]:
import statsmodels.api as sm

data = loadmat('Data_Files/crab.mat')
X_mat = data['X']               
sat_counts = data['satellites'] 

width = X_mat[:, 0]
# create binary response: 1 if satellites > 0, else 0
y = (sat_counts > 0).astype(int).flatten()

# prepare design matrix (add intercept)
X = sm.add_constant(width)

# fit models
# compare Logit, Probit, and CLogLog
models = {
    'Logit': sm.families.links.Logit(),
    'Probit': sm.families.links.Probit(),
    'CLogLog': sm.families.links.cloglog()
}

results = {}
colors = ['blue', 'red', 'green']
styles = ['-', '--', '-.']

print("--- Model Assessment Results ---")
print(f"{'Model':<10} | {'AIC':<10} | {'BIC':<10} | {'Log-Likelihood':<15}")
print("-" * 55)

plt.figure(figsize=(10, 6))
# plot observed data
plt.scatter(width, y + np.random.normal(0, 0.02, len(y)), 
            color='black', alpha=0.3, marker='o', label='Observed (Jittered)')

# generate prediction grid
x_plot = np.linspace(width.min(), width.max(), 100)
x_plot_const = sm.add_constant(x_plot)

for i, (name, link) in enumerate(models.items()):
    # Fit GLM
    model = sm.GLM(y, X, family=sm.families.Binomial(link=link))
    res = model.fit()
    results[name] = res
    
    print(f"{name:<10} | {res.aic:<10.4f} | {res.bic_llf:<10.4f} | {res.llf:<15.4f}")
    
    y_pred = res.predict(x_plot_const)
    plt.plot(x_plot, y_pred, label=f'{name} (AIC: {res.aic:.1f})', 
             color=colors[i], linestyle=styles[i], linewidth=2)

plt.title('Probability of Satellites vs Carapace Width')
plt.xlabel('Carapace Width (cm)')
plt.ylabel('Probability of having Satellites (Y=1)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# best model 
best_model_name = min(results, key=lambda k: results[k].aic)
print(f"\nBest Model based on AIC: {best_model_name}")
--- Model Assessment Results ---
Model      | AIC        | BIC        | Log-Likelihood 
-------------------------------------------------------
Logit      | 198.4527   | 204.7592   | -97.2263       
Probit     | 198.0357   | 204.3423   | -97.0179       
CLogLog    | 197.2753   | 203.5818   | -96.6376       
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/statsmodels/genmod/families/links.py:13: FutureWarning: The cloglog link alias is deprecated. Use CLogLog instead. The cloglog link alias will be removed after the 0.15.0 release.
  warnings.warn(
No description has been provided for this image
Best Model based on AIC: CLogLog

Results:¶

  • Data Structure: The variable satellites is used to create the binary response (Y = 1 if count > 0). The predictor Width was extracted from the first column of the matrix $X$.
  • Model Comparison:
    • Complementary Log-Log (CLogLog): This model performed the best with the lowest AIC (197.28) and highest Log-Likelihood (-96.64). This link function is more suitable for binary data derived from counts, which fits the nature of "having at least one satellite".
    • Probit: AIC = 198.04. Very close to CLogLog.
    • Logit: AIC = 198.45. Slightly higher than the others but still a good fit.
  • Conclusion: All models indicate a strong positive relationship: as the carapace width increases, the probability of a female crab having satellites increases significantly. The CLogLog model is the preferred choice based on AIC.

Question 7¶

The cpunish data contain $n = 17$ observations. The response variable is the number of times that capital punishment is implemented in a state for the year 1997.

The explanatory variables are:

  • median per capita income (dollars),
  • the percent of the population living in poverty,
  • the percent of African-American citizens in the population,
  • the $\log(\text{rate})$ of violent crimes,
  • a variable indicating whether a state is in the South,
  • the proportion of the population with a college degree.

Use the glmfit function to fit a model using the Poisson link function. Note that glmfit automatically puts a column of ones for the constant term. Get the deviance (dev) and the statistics (stats). Obtain the deviance and relevant statistics. What are the estimated coefficients and their standard errors? Find the 95% Wald confidence intervals for each estimated coefficient.


In [28]:
import statsmodels.api as sm

data = loadmat('Data_Files/cpunish.mat')
y = data['y']  # response: # of executions
X = data['X']  # predictors

X_const = sm.add_constant(X)

feature_names = [
    'Intercept', 
    'Median Income', 
    'Poverty %', 
    'African-Amer %', 
    'Log(Violent Crime)', 
    'South', 
    'College Degree %'
]

# fit GLM 
model = sm.GLM(y, X_const, family=sm.families.Poisson(link=sm.families.links.log()))
result = model.fit()

deviance = result.deviance
params = result.params        # estimated coefficients
bse = result.bse              # standard errors
conf_int = result.conf_int()  # 95% Wald confidence intervals

print(f"Deviance: {deviance:.4f}\n")
print(f"{'Variable':<20} | {'Coeff':<10} | {'Std. Err':<10} | {'95% CI Lower':<12} | {'95% CI Upper':<12}")
print("-" * 75)

for i, name in enumerate(feature_names):
    lower, upper = conf_int[i]
    print(f"{name:<20} | {params[i]:<10.4f} | {bse[i]:<10.4f} | {lower:<12.4f} | {upper:<12.4f}")
Deviance: 18.5916

Variable             | Coeff      | Std. Err   | 95% CI Lower | 95% CI Upper
---------------------------------------------------------------------------
Intercept            | -6.8015    | 4.1469     | -14.9292     | 1.3262      
Median Income        | 0.0003     | 0.0001     | 0.0002       | 0.0004      
Poverty %            | 0.0778     | 0.0794     | -0.0778      | 0.2334      
African-Amer %       | -0.0949    | 0.0229     | -0.1399      | -0.0500     
Log(Violent Crime)   | 0.2969     | 0.4375     | -0.5606      | 1.1545      
South                | 2.3012     | 0.4284     | 1.4616       | 3.1408      
College Degree %     | -18.7221   | 4.2840     | -27.1185     | -10.3256    
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/statsmodels/genmod/families/links.py:13: FutureWarning: The log link alias is deprecated. Use Log instead. The log link alias will be removed after the 0.15.0 release.
  warnings.warn(

Results:¶

  • Deviance: 18.5916

  • The South variable has a strong positive effect (Coeff: 2.30) and is statistically significant (CI does not include 0), indicating states in the South are associated with a higher number of executions.

  • College Degree % has a strong negative effect (Coeff: -18.72) and is also significant, suggesting higher education levels correlate with fewer executions.

  • Median Income is statistically significant but has a very small coefficient due to the scale of the variable (dollars).


Question 8¶

Apply the penalized approaches — ridge regression, lasso, and elastic net — to the airfoil data. Compare your results with subset selection.


In [30]:
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
import itertools

data = loadmat('Data_Files/airfoil.mat')

# predictors: Freq, Angle, Chord, Veloc, Suction
# response: Sound
df = pd.DataFrame({
    'Freq': data['Freq'].flatten(),
    'Angle': data['Angle'].flatten(),
    'Chord': data['Chord'].flatten(),
    'Veloc': data['Veloc'].flatten(),
    'Suction': data['Suction'].flatten(),
    'Sound': data['Sound'].flatten()
})

X = df.drop(columns=['Sound'])
y = df['Sound']

print(f"Dataset Shape: {df.shape}")

# preprocessing
# scale features for penalized regression 
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)

# split data (70% Train, 30% Test)
X_train, X_test, y_train, y_test = train_test_split(X_scaled_df, y, test_size=0.3, random_state=42)
X_train_orig, X_test_orig, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)

# --- Ridge Regression ---
alphas_ridge = np.logspace(-4, 4, 100)
ridge = RidgeCV(alphas=alphas_ridge, cv=5)
ridge.fit(X_train, y_train)
mse_ridge = mean_squared_error(y_test, ridge.predict(X_test))

print(f"\nRidge Regression:")
print(f"  Best Alpha: {ridge.alpha_:.4f}")
print(f"  MSE: {mse_ridge:.4f}")
print(f"  Coefficients: {ridge.coef_}")

# --- Lasso Regression ---
lasso = LassoCV(cv=5, random_state=42)
lasso.fit(X_train, y_train)
mse_lasso = mean_squared_error(y_test, lasso.predict(X_test))

print(f"\nLasso Regression:")
print(f"  Best Alpha: {lasso.alpha_:.4f}")
print(f"  MSE: {mse_lasso:.4f}")
print(f"  Coefficients: {lasso.coef_}")

# --- Elastic Net Regression ---
enet = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, 1], cv=5, random_state=42)
enet.fit(X_train, y_train)
mse_enet = mean_squared_error(y_test, enet.predict(X_test))

print(f"\nElastic Net Regression:")
print(f"  Best Alpha: {enet.alpha_:.4f}")
print(f"  MSE: {mse_enet:.4f}")
print(f"  Coefficients: {enet.coef_}")

# --- Best Subset Selection ---
best_mse_subset = float('inf')
best_subset_features = []

for k in range(1, len(X.columns) + 1):
    for features in itertools.combinations(X.columns, k):
        features = list(features)
        
        model = LinearRegression()
        model.fit(X_train_orig[features], y_train)
        
        y_pred = model.predict(X_test_orig[features])
        mse = mean_squared_error(y_test, y_pred)
        
        if mse < best_mse_subset:
            best_mse_subset = mse
            best_subset_features = features

print(f"\nBest Subset Selection:")
print(f"  Best Size: {len(best_subset_features)}")
print(f"  Selected Features: {best_subset_features}")
print(f"  MSE: {best_mse_subset:.4f}")

print("\n----- Final Comparison -----")
print(f"{'Method':<15} | {'MSE':<10}")
print("-" * 30)
print(f"{'Ridge':<15} | {mse_ridge:.4f}")
print(f"{'Lasso':<15} | {mse_lasso:.4f}")
print(f"{'ElasticNet':<15} | {mse_enet:.4f}")
print(f"{'Best Subset':<15} | {best_mse_subset:.4f}")
Dataset Shape: (1503, 6)

Ridge Regression:
  Best Alpha: 7.0548
  MSE: 23.7296
  Coefficients: [-3.97042665 -2.10186543 -3.18850678  1.53761334 -2.07640231]

Lasso Regression:
  Best Alpha: 0.0030
  MSE: 23.6948
  Coefficients: [-4.00496626 -2.15110265 -3.22942077  1.5523164  -2.06786733]

Elastic Net Regression:
  Best Alpha: 0.0069
  MSE: 23.7175
  Coefficients: [-3.98330377 -2.11877692 -3.2030425   1.54179897 -2.07336023]

Best Subset Selection:
  Best Size: 5
  Selected Features: ['Freq', 'Angle', 'Chord', 'Veloc', 'Suction']
  MSE: 23.6879

----- Final Comparison -----
Method          | MSE       
------------------------------
Ridge           | 23.7296
Lasso           | 23.6948
ElasticNet      | 23.7175
Best Subset     | 23.6879

Results Analysis:¶

  • Performance: All methods show very similar Mean Squared Errors (MSE $\approx$ 23.69 - 23.73).
  • Best Subset Selection: The best subset included all 5 predictors, effectively making it the full Ordinary Least Squares model. This indicates that all variables contribute significantly to predicting the sound pressure level.
  • Lasso/Regularization: The Lasso and Elastic Net models did not shrink any coefficients to zero, which accords with the finding from the subset selection that the full model is optimal. The optimal penalties were relatively small, suggesting OLS solution was already quite stable and accurate given the sample size ($n=1503$) relative to the number of predictors ($p=5$).

Conclusion: For the airfoil dataset, the penalized approaches did not outperform the full linear model found via subset selection, as the signal appears to be distributed across all features without sparsity.