ASSIGNMENT 6¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import loadmat
Question 1¶
The peanuts data set contains measurements of the aflatoxin $(X)$ and the corresponding percentage of noncontaminated peanuts in the batch $(Y)$. Do a scatterplot of these data. What is a good model for these data? Use cross-validation to choose the best model.
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
data = loadmat("Data_Files/peanuts.mat")
X = data["X"].reshape(-1, 1)
Y = data["Y"].reshape(-1, 1)
# Scatterplot
plt.figure(figsize=(8, 4))
plt.scatter(X, Y, facecolors='none', edgecolors='black', label='Data')
plt.xlabel("Aflatoxin (X)")
plt.ylabel("Percentage of noncontaminated peanuts (Y)")
plt.title("Peanuts Data Scatterplot")
plt.legend()
plt.grid(True)
plt.show()
# Cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=1905)
degrees = [1, 2, 3, 4, 5]
plt.figure(figsize=(8, 4))
plt.scatter(X, Y, facecolors='none', edgecolors='black', label='Data')
X_seq = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
best_mse = float('inf')
best_degree = 0
for d in degrees:
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=d, include_bias=False)),
        ("linreg", LinearRegression())
    ])
    # convert negative MSE values to positive
    mse = -cross_val_score(
        model, X, Y,
        scoring="neg_mean_squared_error",
        cv=kf
    ).mean()
    print(f"Degree {d}: MSE = {mse:.6f}")
    if mse < best_mse:
        best_mse = mse
        best_degree = d
    # fit on whole dataset for plotting
    model.fit(X, Y)
    plt.plot(X_seq, model.predict(X_seq), label=f'Degree {d}')
print(f"\n Best Model: Polynomial Degree {best_degree}")
plt.title('Polynomial Model Fits')
plt.xlabel('Aflatoxin (X)')
plt.ylabel('Percentage of noncontaminated peanuts (Y)')
plt.legend()
plt.grid(True)
plt.show()
Degree 1: MSE = 0.001687
Degree 2: MSE = 0.001672
Degree 3: MSE = 0.001845
Degree 4: MSE = 0.002937
Degree 5: MSE = 0.006057

Best Model: Polynomial Degree 2
Scatterplot:¶
From the scatterplot, we observe a decreasing relationship between aflatoxin level (X) and the percentage of non-contaminated peanuts (Y). As aflatoxin increases, the proportion of non-contaminated peanuts declines.
The relationship is smooth but not perfectly linear. There is no sharp curvature or oscillatory behavior, which suggests that high-degree polynomials are unnecessary.
Model comparison via cross-validation:¶
We fitted polynomial models of degrees 1 through 5 and used cross-validation mean squared error (CV MSE) as the selection criterion.
The results show that: - Degree 1, which is the linear model, slightly underfits the curvature. - Degrees 2–5 provide very similar fits visually. - Higher degrees start to bend more at the boundaries, near large X values.
Cross-validation selected degree 2 as the best model. This indicates that: - A quadratic term best captures the curvature of the data. - Additional higher-order terms do not improve model performance.
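As a sanity check on this selection procedure, the same KFold + Pipeline recipe can be run on synthetic data where the true degree is known. The data-generating curve below is an assumption for illustration only, not the peanuts data:

```python
# Illustration on synthetic data (not the peanuts file): CV MSE should
# penalize the underfitting linear model and not reward extra degrees.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(1905)
x = rng.uniform(-2, 2, 120).reshape(-1, 1)
y = 1.5 * x[:, 0]**2 - x[:, 0] + rng.normal(0, 0.5, 120)  # truth: degree 2

kf = KFold(n_splits=5, shuffle=True, random_state=1905)
mses = {}
for d in range(1, 6):
    pipe = Pipeline([("poly", PolynomialFeatures(degree=d, include_bias=False)),
                     ("linreg", LinearRegression())])
    # cross_val_score returns negative MSE; flip the sign and average folds
    mses[d] = -cross_val_score(pipe, x, y,
                               scoring="neg_mean_squared_error", cv=kf).mean()

best = min(mses, key=mses.get)
print(f"CV-selected degree: {best}")
```

With a strong quadratic signal the linear model's CV MSE is far larger, and CV does not select degree 1.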
Question 2¶
Use Monte Carlo simulation to compare the performance of the bootstrap and the jackknife methods for estimating the standard error and bias of the sample second central moment. For every Monte Carlo trial, generate 100 standard normal random variables and calculate the bootstrap and jackknife estimates of the standard error and bias. Show the distribution of the bootstrap estimates (of bias and standard error) and the jackknife estimates (of bias and standard error) in a histogram or a box plot. Make some comparisons of the two methods.
# Parameters
MC_trials = 500 # number of Monte Carlo trials
n = 100 # sample size
B = 1000 # bootstrap resamples
boot_bias = []
boot_se = []
jack_bias = []
jack_se = []
# true second central moment of N(0,1)
true_var = 1.0
for _ in range(MC_trials):
    # generate data
    x = np.random.randn(n)
    # sample second central moment
    theta_hat = np.mean((x - np.mean(x))**2)
    # ---------- Bootstrap ----------
    boot_thetas = []
    for _ in range(B):
        xb = np.random.choice(x, size=n, replace=True)
        boot_thetas.append(np.mean((xb - np.mean(xb))**2))
    boot_thetas = np.array(boot_thetas)
    # Bootstrap estimates
    boot_bias.append(np.mean(boot_thetas) - theta_hat)
    boot_se.append(np.std(boot_thetas, ddof=1))
    # ---------- Jackknife ----------
    jack_thetas = []
    for i in range(n):
        x_leave = np.delete(x, i)
        jack_thetas.append(np.mean((x_leave - np.mean(x_leave))**2))
    jack_thetas = np.array(jack_thetas)
    theta_jack_mean = np.mean(jack_thetas)
    # Jackknife estimates
    jack_bias.append((n - 1) * (theta_jack_mean - theta_hat))
    jack_se.append(
        np.sqrt((n - 1) / n * np.sum((jack_thetas - theta_jack_mean) ** 2))
    )
boot_bias = np.array(boot_bias)
boot_se = np.array(boot_se)
jack_bias = np.array(jack_bias)
jack_se = np.array(jack_se)
# ---------- Plots ----------
plt.figure(figsize=(14, 10))
# Histograms of Bias
plt.subplot(2, 2, 1)
plt.hist(boot_bias, bins=30, alpha=0.5, label="Bootstrap Bias", color='orange', edgecolor='black')
plt.hist(jack_bias, bins=30, alpha=0.5, label="Jackknife Bias", color='blue', edgecolor='black')
plt.title("Distribution of Bias Estimates")
plt.xlabel('Estimated Bias')
plt.legend()
# Histograms of SE
plt.subplot(2, 2, 2)
plt.hist(boot_se, bins=30, alpha=0.5, label="Bootstrap SE", color='orange', edgecolor='black')
plt.hist(jack_se, bins=30, alpha=0.5, label="Jackknife SE", color='blue', edgecolor='black')
plt.title("Distribution of SE Estimates")
plt.xlabel('Estimated SE')
plt.legend()
# Boxplots of Bias
plt.subplot(2, 2, 3)
plt.boxplot([jack_bias, boot_bias], tick_labels=['Jackknife', 'Bootstrap'])
plt.title('Comparison of Bias Estimates')
plt.ylabel('Bias')
plt.grid(True, linestyle='--', alpha=0.3)
# Boxplots of SE
plt.subplot(2, 2, 4)
plt.boxplot([jack_se, boot_se], tick_labels=['Jackknife', 'Bootstrap'])
plt.title('Comparison of SE Estimates')
plt.ylabel('Standard Error')
plt.grid(True, linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()
# ---------- Numerical summaries ----------
print(f"--- Summary Statistics (M={MC_trials} trials) ---")
print(f"Jackknife Mean Bias: {np.mean(jack_bias):.5f} (SD: {np.std(jack_bias):.5f})")
print(f"Bootstrap Mean Bias: {np.mean(boot_bias):.5f} (SD: {np.std(boot_bias):.5f})")
print("-" * 40)
print(f"Jackknife Mean SE: {np.mean(jack_se):.5f} (SD: {np.std(jack_se):.5f})")
print(f"Bootstrap Mean SE: {np.mean(boot_se):.5f} (SD: {np.std(boot_se):.5f})")
--- Summary Statistics (M=500 trials) ---
Jackknife Mean Bias: -0.00994 (SD: 0.00151)
Bootstrap Mean Bias: -0.00967 (SD: 0.00469)
----------------------------------------
Jackknife Mean SE: 0.13870 (SD: 0.02681)
Bootstrap Mean SE: 0.13584 (SD: 0.02625)
Bias estimation:
Both methods produce negative bias estimates, which is expected because the sample second central moment (with denominator $n$) is a biased estimator of the population variance.
From the Monte Carlo summaries:
Jackknife mean bias: −0.00994 (SD = 0.00151)
Bootstrap mean bias: −0.00967 (SD = 0.00469)
Key observations:
- The mean bias estimates are very close for the two methods.
- The jackknife bias estimates are much more concentrated, as seen both in the histogram and the smaller standard deviation.
- Bootstrap bias estimates show higher variability, with heavier tails.
This reflects the fact that the jackknife provides a stable, linear approximation to bias, while the bootstrap bias estimate is more sensitive to resampling variability.
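For this particular statistic the jackknife's stability has a simple explanation: the delete-1 jackknife bias estimate of the plug-in second central moment has the closed form $-s^2/n$, where $s^2$ is the unbiased sample variance, so it inherits only the sampling variability of $s^2$ itself. A quick numerical check on simulated stand-in data (not tied to the Monte Carlo run above):

```python
# Verify the closed-form identity: for theta_hat = (1/n) * sum((x - xbar)^2),
# the delete-1 jackknife bias estimate equals exactly -s^2/n.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100)
n = len(x)

theta_hat = np.mean((x - x.mean())**2)  # plug-in second central moment

# delete-1 jackknife replicates of the same statistic
jack = np.array([np.mean((np.delete(x, i) - np.delete(x, i).mean())**2)
                 for i in range(n)])
bias_jack = (n - 1) * (jack.mean() - theta_hat)

s2 = x.var(ddof=1)  # unbiased sample variance
print(bias_jack, -s2 / n)  # the two values agree
```

Since the true bias of the plug-in estimator is $-\sigma^2/n$, the jackknife estimate is on target whenever $s^2$ is close to $\sigma^2$.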
Standard error estimation:
For the standard error of the second central moment: - Both methods yield very similar average standard error estimates. - The bootstrap SE is slightly smaller on average, but the difference is negligible. - The distributions of SE estimates largely overlap, as confirmed by the boxplots.
Thus, for standard error estimation, both methods perform similarly in this setting.
Overall comparison: The jackknife provides more stable bias estimates and lower variability across Monte Carlo trials. The bootstrap produces comparable SE estimates and exhibits higher variability in bias estimation.
Question 3¶
Using the law data set, find the jackknife replicates of the median. How many different values are there? What is the jackknife estimate of the standard error of the median? Use the bootstrap method to get an estimate of the standard error of the median. Compare the two estimates of the standard error of the median.
data = loadmat("Data_Files/law.mat")
X = data['law'] # Shape is (15, 2)
def analyze_variable(data_vector, label):
    # work on the passed-in variable (the original code mistakenly used a global x)
    x = np.asarray(data_vector).flatten()
    n = len(x)
    # ---------- Jackknife ----------
    jack_medians = []
    for i in range(n):
        # create dataset with i-th observation removed
        x_leave = np.delete(x, i)
        jack_medians.append(np.median(x_leave))
    jack_medians = np.array(jack_medians)
    # num of distinct jackknife medians
    num_distinct = len(np.unique(jack_medians))
    # Jackknife SE
    jack_mean = np.mean(jack_medians)
    jack_se = np.sqrt((n - 1) / n * np.sum((jack_medians - jack_mean) ** 2))
    # ---------- Bootstrap ----------
    B = 1000
    boot_medians = []
    for _ in range(B):
        # resample with replacement
        xb = np.random.choice(x, size=n, replace=True)
        boot_medians.append(np.median(xb))
    boot_medians = np.array(boot_medians)
    boot_se = np.std(boot_medians, ddof=1)
    # ---------- Plots ----------
    plt.figure()
    plt.hist(jack_medians, bins=15)
    plt.title(f"Jackknife Replicates of the Median for {label}")
    plt.xlabel("Median")
    plt.ylabel("Frequency")
    plt.show()
    plt.figure()
    plt.hist(boot_medians, bins=30)
    plt.title(f"Bootstrap Replicates of the Median for {label}")
    plt.xlabel("Median")
    plt.ylabel("Frequency")
    plt.show()
    # ---------- Output ----------
    print("Number of observations:", n)
    print("Number of distinct jackknife medians:", num_distinct)
    print(f"Jackknife SE of median: {jack_se:.5f}")
    print(f"Bootstrap SE of median: {boot_se:.5f}")
# run for both variables (assuming Col 1 is LSAT, Col 2 is GPA)
analyze_variable(X[:, 0], "Variable 1 (e.g., LSAT)")
analyze_variable(X[:, 1], "Variable 2 (e.g., GPA)")
Number of observations: 15
Number of distinct jackknife medians: 2
Jackknife SE of median: 0.02439
Bootstrap SE of median: 0.26446

Number of observations: 15
Number of distinct jackknife medians: 2
Jackknife SE of median: 0.02439
Bootstrap SE of median: 0.25740
Jackknife replicates of the median
For each variable in the law data set (total of 15 observations):
- Number of distinct jackknife medians: 2
This confirms the high discreteness of the jackknife distribution for the median:
- Only deletions of observations very close to the sample median affect its value.
- Removing most observations leaves the median unchanged.
- Therefore, the jackknife replicates take only a small number of distinct values.
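A toy example (hypothetical data, not the law set) makes this discreteness concrete: with distinct observations and odd $n$, the delete-1 medians can take at most three values, depending on whether the deleted point lies below, at, or above the sample median.

```python
# Delete-1 jackknife replicates of the median for a tiny sorted sample.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])  # n = 7, median = 4
n = len(x)
jack_medians = np.array([np.median(np.delete(x, i)) for i in range(n)])

# Deleting a point below the median gives 4.5, deleting the median itself
# gives 4.0, deleting a point above gives 3.5 -- only three distinct values.
print(np.unique(jack_medians))  # [3.5 4.  4.5]
```

With so few distinct replicates, the jackknife SE formula is fed almost no variability, regardless of how variable the median actually is.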
Standard error estimates:
For each variable, the estimated standard errors are:
- Jackknife SE: 0.02439
- Bootstrap SE: ≈ 0.26
- The bootstrap SE is an order of magnitude larger than the jackknife SE.
- The jackknife SE is extremely small due to the lack of variability in the jackknife replicates.
- This indicates that the jackknife severely underestimates the true variability of the median in this case.
Comparison of methods:
- The jackknife performs poorly for the median because:
- The statistic is not smooth.
- The jackknife relies on linear approximations.
- The replicates exhibit very limited variation.
- The bootstrap, by resampling with replacement, captures the full variability of the median and produces a much more realistic standard error estimate.
Question 4¶
Generate data according to $$y = 4x^3 + 6x^2 - 1 + \varepsilon,$$ where $\varepsilon$ represents some noise with constant variance. Fit a first-degree model to the data and plot the residuals versus the observed predictor values $x_i$ (residual dependence plot). Construct also box plots and histograms of the residuals. Do they show that the model is not adequate? Repeat for $d = 2, 3$.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
np.random.seed(1905)
# generate predictor
n = 200
x = np.linspace(-2, 2, n)
X = x.reshape(-1, 1)
# generate response
epsilon = np.random.normal(0, 5, size=n)
y = 4*x**3 + 6*x**2 - 1 + epsilon
# degrees to fit
degrees = [1, 2, 3]
# create subplots: 3 rows (degrees) x 3 columns (plots)
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
plt.subplots_adjust(hspace=0.4, wspace=0.3)
for i, d in enumerate(degrees):
    # Fit polynomial model
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=d, include_bias=False)),
        ("linreg", LinearRegression())
    ])
    model.fit(X, y)
    # predict and calculate residuals
    y_hat = model.predict(X)
    residuals = y - y_hat
    # row for the current degree
    ax_scatter = axes[i, 0]
    ax_hist = axes[i, 1]
    ax_box = axes[i, 2]
    # ---------- Residual dependence plot ----------
    ax_scatter.scatter(x, residuals, color='blue', alpha=0.6)
    ax_scatter.axhline(0, color='red', linestyle='--')
    ax_scatter.set_title(f'Degree {d}: Residuals vs x')
    ax_scatter.set_xlabel('x')
    ax_scatter.set_ylabel('Residuals')
    # ---------- Histogram ----------
    # checks for normality
    ax_hist.hist(residuals, bins=30, color='green', edgecolor='black', alpha=0.7)
    ax_hist.set_title(f'Degree {d}: Histogram of Residuals')
    ax_hist.set_xlabel('Residuals')
    # ---------- Boxplot ----------
    # Checks for symmetry and outliers
    ax_box.boxplot(residuals, vert=False)
    ax_box.set_title(f'Degree {d}: Box Plot')
    ax_box.set_xlabel('Residuals')
plt.suptitle('Residual Analysis for Polynomial Models (d=1, 2, 3)', fontsize=16)
plt.show()
Results:¶
1) Degree 1 (Linear Model):¶
- Residuals vs X: We see a distinct U-shape or wave pattern in the residuals. This structure in the residuals indicates that the linear model failed to capture the non-linear trend of the data. The assumption of random errors is violated.
- Histogram/Boxplot: The residuals do not look perfectly normal; they are spread out and skewed because the deterministic part of the signal ($x^3$, $x^2$) is leaking into the error term.
2) Degree 2 (Quadratic Model):¶
- Residuals vs X: We see a pattern (like an S-shape), because a parabola cannot fit a cubic curve perfectly. The residuals are not randomly scattered around zero.
- Adequacy: Still not adequate.
3) Degree 3 (Cubic Model):¶
- Residuals vs X: The residuals appear randomly scattered around the horizontal zero line with no discernible pattern. This indicates the model has successfully captured the underlying signal.
- Histogram/Boxplot: These resemble a normal distribution centered at zero, consistent with the noise term $\epsilon$ we generated. This confirms the model is adequate.
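The visual checks have a simple quantitative companion: re-generating data under the same assumed model, the residual standard deviation of an adequate fit should approach the true noise level $\sigma = 5$, while underfit models inflate it with leaked signal.

```python
# Residual SD by degree under the model y = 4x^3 + 6x^2 - 1 + eps, eps ~ N(0, 25).
import numpy as np

rng = np.random.default_rng(1905)
n = 200
x = np.linspace(-2, 2, n)
y = 4 * x**3 + 6 * x**2 - 1 + rng.normal(0, 5, size=n)

sds = {}
for d in (1, 2, 3):
    coefs = np.polyfit(x, y, deg=d)          # least-squares polynomial fit
    resid = y - np.polyval(coefs, x)
    sds[d] = resid.std(ddof=d + 1)           # adjust for d+1 fitted parameters
    print(f"degree {d}: residual SD = {sds[d]:.3f}")
```

The cubic fit's residual SD lands near 5, while degrees 1 and 2 are clearly inflated, mirroring the patterned residual plots.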
Question 5¶
Fit the wais data using probit and the complementary log-log link. Compare the results with the logistic regression using appropriate techniques to assess model fit.
import statsmodels.api as sm
import statsmodels.genmod.families.links as links
data = loadmat('Data_Files/wais.mat')
wais_matrix = data['wais']
df = pd.DataFrame(wais_matrix, columns=['WAIS_Score', 'Senile_Status'])
# define X and y
y = df['Senile_Status']
X = df['WAIS_Score']
X = sm.add_constant(X) # add intercept term
# -------- Probit Model ------------
model_probit = sm.GLM(y, X, family=sm.families.Binomial(link=sm.families.links.Probit()))
res_probit = model_probit.fit()
# -------- Complementary Log-Log Model -----------
model_cloglog = sm.GLM(y, X, family=sm.families.Binomial(link=sm.families.links.CLogLog()))
res_cloglog = model_cloglog.fit()
# -------- Logistic Regression (for comparison) -----------
model_logit = sm.GLM(y, X, family=sm.families.Binomial(link=sm.families.links.Logit()))
res_logit = model_logit.fit()
# compare Results
print(f"{'Model':<15} | {'AIC':<10} | {'BIC':<10} | {'Log-Likelihood':<15}")
print("-" * 60)
print(f"{'Probit':<15} | {res_probit.aic:<10.4f} | {res_probit.bic_llf:<10.4f} | {res_probit.llf:<15.4f}")
print(f"{'CLogLog':<15} | {res_cloglog.aic:<10.4f} | {res_cloglog.bic_llf:<10.4f} | {res_cloglog.llf:<15.4f}")
print(f"{'Logistic':<15} | {res_logit.aic:<10.4f} | {res_logit.bic_llf:<10.4f} | {res_logit.llf:<15.4f}")
print("\n------------------------ Probit Model Summary ------------------------\n")
print(res_probit.summary())
Model | AIC | BIC | Log-Likelihood
------------------------------------------------------------
Probit | 54.9836 | 58.9615 | -25.4918
CLogLog | 55.3107 | 59.2887 | -25.6554
Logistic | 55.0174 | 58.9953 | -25.5087
------------------------ Probit Model Summary ------------------------
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: Senile_Status No. Observations: 54
Model: GLM Df Residuals: 52
Model Family: Binomial Df Model: 1
Link Function: Probit Scale: 1.0000
Method: IRLS Log-Likelihood: -25.492
Date: Wed, 17 Dec 2025 Deviance: 50.984
Time: 16:45:03 Pearson chi2: 51.2
No. Iterations: 5 Pseudo R-squ. (CS): 0.1816
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 1.3862 0.685 2.023 0.043 0.043 2.729
WAIS_Score -0.1880 0.063 -2.984 0.003 -0.312 -0.065
==============================================================================
Data Analysis:¶
- The dataset contains 54 observations.
- Column 1 (containing 0s and 1s) is used as the binary response variable ($Y$, "senile" status).
- Column 0 is used as the predictor ($X$, WAIS score).
- Three models were fitted:
- Logistic Regression (Logit Link): AIC = 55.02
- Probit Regression (Probit Link): AIC = 54.98
- Complementary Log-Log (CLogLog Link): AIC = 55.31
Conclusion: The Probit model provides the best fit for this data, having the lowest AIC (54.98), though the Logistic model is extremely close (55.02). The CLogLog model fits slightly worse than the other two.
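The near-tie among the three links is unsurprising: the inverse links differ mainly in their tails and in symmetry, which a small data-free sketch makes visible by evaluating each at the same linear predictor values.

```python
# Compare the three inverse link functions at common linear-predictor values.
import numpy as np
from scipy.stats import norm

eta = np.linspace(-3, 3, 7)
logit_p = 1 / (1 + np.exp(-eta))          # logistic CDF (symmetric about 0.5)
probit_p = norm.cdf(eta)                  # standard normal CDF (symmetric)
cloglog_p = 1 - np.exp(-np.exp(eta))      # cloglog inverse link (asymmetric)

for e, a, b, c in zip(eta, logit_p, probit_p, cloglog_p):
    print(f"eta={e:+.1f}: logit={a:.3f}  probit={b:.3f}  cloglog={c:.3f}")
```

At $\eta = 0$ the logit and probit both give 0.5, while the cloglog gives $1 - e^{-1} \approx 0.632$; differences of this size only matter when the data concentrate in one tail, which is why AIC barely separates the models here.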
Question 6¶
Convert the number of satellites in the horseshoe crab data to a binomial response variable, keeping the carapace width as the predictor. Set $Y = 1$ if a female crab has one or more satellites, and $Y = 0$ if she has no satellites. Fit various models and assess the results.
import statsmodels.api as sm
data = loadmat('Data_Files/crab.mat')
X_mat = data['X']
sat_counts = data['satellites']
width = X_mat[:, 0]
# create binary response: 1 if satellites > 0, else 0
y = (sat_counts > 0).astype(int).flatten()
# prepare design matrix (add intercept)
X = sm.add_constant(width)
# fit models
# compare Logit, Probit, and CLogLog
models = {
'Logit': sm.families.links.Logit(),
'Probit': sm.families.links.Probit(),
'CLogLog': sm.families.links.CLogLog()
}
results = {}
colors = ['blue', 'red', 'green']
styles = ['-', '--', '-.']
print("--- Model Assessment Results ---")
print(f"{'Model':<10} | {'AIC':<10} | {'BIC':<10} | {'Log-Likelihood':<15}")
print("-" * 55)
plt.figure(figsize=(10, 6))
# plot observed data
plt.scatter(width, y + np.random.normal(0, 0.02, len(y)),
color='black', alpha=0.3, marker='o', label='Observed (Jittered)')
# generate prediction grid
x_plot = np.linspace(width.min(), width.max(), 100)
x_plot_const = sm.add_constant(x_plot)
for i, (name, link) in enumerate(models.items()):
    # Fit GLM
    model = sm.GLM(y, X, family=sm.families.Binomial(link=link))
    res = model.fit()
    results[name] = res
    print(f"{name:<10} | {res.aic:<10.4f} | {res.bic_llf:<10.4f} | {res.llf:<15.4f}")
    y_pred = res.predict(x_plot_const)
    plt.plot(x_plot, y_pred, label=f'{name} (AIC: {res.aic:.1f})',
             color=colors[i], linestyle=styles[i], linewidth=2)
plt.title('Probability of Satellites vs Carapace Width')
plt.xlabel('Carapace Width (cm)')
plt.ylabel('Probability of having Satellites (Y=1)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# best model
best_model_name = min(results, key=lambda k: results[k].aic)
print(f"\nBest Model based on AIC: {best_model_name}")
--- Model Assessment Results ---
Model      | AIC        | BIC        | Log-Likelihood
-------------------------------------------------------
Logit      | 198.4527   | 204.7592   | -97.2263
Probit     | 198.0357   | 204.3423   | -97.0179
CLogLog    | 197.2753   | 203.5818   | -96.6376
Best Model based on AIC: CLogLog
Results:¶
- Data Structure: The variable `satellites` is used to create the binary response ($Y = 1$ if count > 0). The predictor `Width` was extracted from the first column of the matrix $X$.
- Model Comparison:
- Complementary Log-Log (CLogLog): This model performed the best with the lowest AIC (197.28) and highest Log-Likelihood (-96.64). This link function is more suitable for binary data derived from counts, which fits the nature of "having at least one satellite".
- Probit: AIC = 198.04. Very close to CLogLog.
- Logit: AIC = 198.45. Slightly higher than the others but still a good fit.
- Conclusion: All models indicate a strong positive relationship: as the carapace width increases, the probability of a female crab having satellites increases significantly. The CLogLog model is the preferred choice based on AIC.
Question 7¶
The cpunish data contain $n = 17$ observations. The response variable is the number of times that capital punishment is implemented in a state for the year 1997.
The explanatory variables are:
- median per capita income (dollars),
- the percent of the population living in poverty,
- the percent of African-American citizens in the population,
- the $\log(\text{rate})$ of violent crimes,
- a variable indicating whether a state is in the South,
- the proportion of the population with a college degree.
Use the glmfit function to fit a model using the Poisson link function. Note that glmfit automatically puts a column of ones for the constant term. Get the deviance (dev) and the statistics (stats). Obtain the deviance and relevant statistics. What are the estimated coefficients and their standard errors? Find the 95% Wald confidence intervals for each estimated coefficient.
import statsmodels.api as sm
data = loadmat('Data_Files/cpunish.mat')
y = data['y'] # response: # of executions
X = data['X'] # predictors
X_const = sm.add_constant(X)
feature_names = [
'Intercept',
'Median Income',
'Poverty %',
'African-Amer %',
'Log(Violent Crime)',
'South',
'College Degree %'
]
# fit GLM
model = sm.GLM(y, X_const, family=sm.families.Poisson(link=sm.families.links.Log()))
result = model.fit()
deviance = result.deviance
params = result.params # estimated coefficients
bse = result.bse # standard errors
conf_int = result.conf_int() # 95% Wald confidence intervals
print(f"Deviance: {deviance:.4f}\n")
print(f"{'Variable':<20} | {'Coeff':<10} | {'Std. Err':<10} | {'95% CI Lower':<12} | {'95% CI Upper':<12}")
print("-" * 75)
for i, name in enumerate(feature_names):
    lower, upper = conf_int[i]
    print(f"{name:<20} | {params[i]:<10.4f} | {bse[i]:<10.4f} | {lower:<12.4f} | {upper:<12.4f}")
Deviance: 18.5916

Variable             | Coeff      | Std. Err   | 95% CI Lower | 95% CI Upper
---------------------------------------------------------------------------
Intercept            | -6.8015    | 4.1469     | -14.9292     | 1.3262
Median Income        | 0.0003     | 0.0001     | 0.0002       | 0.0004
Poverty %            | 0.0778     | 0.0794     | -0.0778      | 0.2334
African-Amer %       | -0.0949    | 0.0229     | -0.1399      | -0.0500
Log(Violent Crime)   | 0.2969     | 0.4375     | -0.5606      | 1.1545
South                | 2.3012     | 0.4284     | 1.4616       | 3.1408
College Degree %     | -18.7221   | 4.2840     | -27.1185     | -10.3256
Results:¶
Deviance: 18.5916
The South variable has a strong positive effect (Coeff: 2.30) and is statistically significant (CI does not include 0), indicating states in the South are associated with a higher number of executions.
College Degree % has a strong negative effect (Coeff: -18.72) and is also significant, suggesting higher education levels correlate with fewer executions.
Median Income is statistically significant but has a very small coefficient due to the scale of the variable (dollars).
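Because the model uses a log link, the coefficients are most naturally read on the multiplicative scale: $e^{\beta}$ is a rate ratio per unit increase of the predictor. Using the South coefficient reported in the table above:

```python
# Rate-ratio reading of a log-link Poisson coefficient.
import numpy as np

beta_south = 2.3012  # fitted South coefficient from the table above
rate_ratio = np.exp(beta_south)
print(f"South rate ratio: {rate_ratio:.2f}")  # roughly 10x, other predictors held fixed
```

The same reading applies to the other coefficients; for Median Income, $e^{0.0003}$ per dollar looks tiny only because of the variable's scale, which is the point made above.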
Question 8¶
Apply the penalized approaches — ridge regression, lasso, and elastic net — to the airfoil data. Compare your results with subset selection.
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
import itertools
data = loadmat('Data_Files/airfoil.mat')
# predictors: Freq, Angle, Chord, Veloc, Suction
# response: Sound
df = pd.DataFrame({
'Freq': data['Freq'].flatten(),
'Angle': data['Angle'].flatten(),
'Chord': data['Chord'].flatten(),
'Veloc': data['Veloc'].flatten(),
'Suction': data['Suction'].flatten(),
'Sound': data['Sound'].flatten()
})
X = df.drop(columns=['Sound'])
y = df['Sound']
print(f"Dataset Shape: {df.shape}")
# preprocessing
# scale features for penalized regression
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)
# split data (70% Train, 30% Test)
X_train, X_test, y_train, y_test = train_test_split(X_scaled_df, y, test_size=0.3, random_state=42)
X_train_orig, X_test_orig, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
# --- Ridge Regression ---
alphas_ridge = np.logspace(-4, 4, 100)
ridge = RidgeCV(alphas=alphas_ridge, cv=5)
ridge.fit(X_train, y_train)
mse_ridge = mean_squared_error(y_test, ridge.predict(X_test))
print(f"\nRidge Regression:")
print(f" Best Alpha: {ridge.alpha_:.4f}")
print(f" MSE: {mse_ridge:.4f}")
print(f" Coefficients: {ridge.coef_}")
# --- Lasso Regression ---
lasso = LassoCV(cv=5, random_state=42)
lasso.fit(X_train, y_train)
mse_lasso = mean_squared_error(y_test, lasso.predict(X_test))
print(f"\nLasso Regression:")
print(f" Best Alpha: {lasso.alpha_:.4f}")
print(f" MSE: {mse_lasso:.4f}")
print(f" Coefficients: {lasso.coef_}")
# --- Elastic Net Regression ---
enet = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, 1], cv=5, random_state=42)
enet.fit(X_train, y_train)
mse_enet = mean_squared_error(y_test, enet.predict(X_test))
print(f"\nElastic Net Regression:")
print(f" Best Alpha: {enet.alpha_:.4f}")
print(f" MSE: {mse_enet:.4f}")
print(f" Coefficients: {enet.coef_}")
# --- Best Subset Selection ---
best_mse_subset = float('inf')
best_subset_features = []
for k in range(1, len(X.columns) + 1):
    for features in itertools.combinations(X.columns, k):
        features = list(features)
        model = LinearRegression()
        model.fit(X_train_orig[features], y_train)
        y_pred = model.predict(X_test_orig[features])
        mse = mean_squared_error(y_test, y_pred)
        if mse < best_mse_subset:
            best_mse_subset = mse
            best_subset_features = features
print(f"\nBest Subset Selection:")
print(f" Best Size: {len(best_subset_features)}")
print(f" Selected Features: {best_subset_features}")
print(f" MSE: {best_mse_subset:.4f}")
print("\n----- Final Comparison -----")
print(f"{'Method':<15} | {'MSE':<10}")
print("-" * 30)
print(f"{'Ridge':<15} | {mse_ridge:.4f}")
print(f"{'Lasso':<15} | {mse_lasso:.4f}")
print(f"{'ElasticNet':<15} | {mse_enet:.4f}")
print(f"{'Best Subset':<15} | {best_mse_subset:.4f}")
Dataset Shape: (1503, 6)

Ridge Regression:
  Best Alpha: 7.0548
  MSE: 23.7296
  Coefficients: [-3.97042665 -2.10186543 -3.18850678  1.53761334 -2.07640231]

Lasso Regression:
  Best Alpha: 0.0030
  MSE: 23.6948
  Coefficients: [-4.00496626 -2.15110265 -3.22942077  1.5523164  -2.06786733]

Elastic Net Regression:
  Best Alpha: 0.0069
  MSE: 23.7175
  Coefficients: [-3.98330377 -2.11877692 -3.2030425   1.54179897 -2.07336023]

Best Subset Selection:
  Best Size: 5
  Selected Features: ['Freq', 'Angle', 'Chord', 'Veloc', 'Suction']
  MSE: 23.6879

----- Final Comparison -----
Method          | MSE
------------------------------
Ridge           | 23.7296
Lasso           | 23.6948
ElasticNet      | 23.7175
Best Subset     | 23.6879
Results Analysis:¶
- Performance: All methods show very similar Mean Squared Errors (MSE $\approx$ 23.69 - 23.73).
- Best Subset Selection: The best subset included all 5 predictors, effectively making it the full Ordinary Least Squares model. This indicates that all variables contribute significantly to predicting the sound pressure level.
- Lasso/Regularization: The Lasso and Elastic Net models did not shrink any coefficients to zero, which accords with the finding from the subset selection that the full model is optimal. The optimal penalties were relatively small, suggesting OLS solution was already quite stable and accurate given the sample size ($n=1503$) relative to the number of predictors ($p=5$).
Conclusion: For the airfoil dataset, the penalized approaches did not outperform the full linear model found via subset selection, as the signal appears to be distributed across all features without sparsity.
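The "near-OLS" behavior above is consistent with how ridge responds to its penalty. A small sketch on hypothetical synthetic data (not the airfoil set) shows the coefficient norm shrinking monotonically as alpha grows, so the modest CV-chosen penalties leave the solution essentially unshrunk from OLS.

```python
# Ridge coefficient norm as a function of the penalty strength alpha.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.standard_normal((200, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, -1.5]) + rng.normal(0, 1, 200)

# L2 norm of the fitted coefficient vector for increasing penalties
norms = [np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_)
         for a in (0.01, 10, 1000)]
print(norms)  # strictly decreasing
```

With $n \gg p$ and all predictors informative, CV picks a small alpha, and the norms confirm that such penalties barely move the coefficients, exactly the regime the airfoil results display.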