Sugary Content Prediction Model

The purpose of this project is to train a model that predicts the sugar content of a recipe from a given set of features.



RecipeAnalysis

Authors: Wenbin Jiang, Jevan Chahal

Exploratory Data Analysis:

Table of Contents:

Framing the Problem

Prediction Problem:

Response Variable:

Evaluation Metrics:

Baseline Model

The baseline model we used is a linear regression model designed to predict the ‘sugar’ content in food items based on other nutritional information. The model is built using Python’s scikit-learn library.

Features in the Model:

All features in this model are quantitative, representing measurable quantities expressed as numerical values; there are no ordinal or nominal features. Data preprocessing involved converting the string representations of lists in the ‘nutrition’ column into actual Python lists using a custom safe_eval function. Missing values in the features and the target variable were filled with their respective column means.
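The parsing and imputation step could look like the sketch below. The safe_eval name comes from the writeup, but its implementation here (a wrapper around ast.literal_eval), the assumed ordering of the nutrition values, and the recipes.csv file name are illustrative assumptions rather than the project's exact code.

```python
import ast
import pandas as pd

def safe_eval(value):
    """Parse a string like "[51.5, 0.0, 13.0, ...]" into a Python list,
    returning an empty list if parsing fails."""
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []

# Assumed ordering of the values inside each 'nutrition' string.
NUTRITION_COLS = ['calories', 'total_fat', 'sugar', 'sodium',
                  'protein', 'saturated_fat', 'carbohydrates']

recipes = pd.read_csv('recipes.csv')            # hypothetical input file
parsed = recipes['nutrition'].apply(safe_eval)
nutrition = pd.DataFrame(parsed.tolist(), columns=NUTRITION_COLS,
                         index=recipes.index)

# Fill missing values in features and target with their column means.
nutrition = nutrition.fillna(nutrition.mean())
```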

Model Pipeline:

The model pipeline consists of two stages:
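A minimal sketch of such a two-stage pipeline follows, assuming (this is not stated explicitly in the writeup) that the stages are a feature-scaling preprocessor followed by the linear regression estimator, and reusing the nutrition frame built above:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Features are the nutrition columns other than the target 'sugar'.
X = nutrition.drop(columns=['sugar'])
y = nutrition['sugar']

# Hold out a test set; the 80/20 split and random_state are illustrative choices.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Two-stage pipeline: preprocessing followed by the estimator.
baseline_pipeline = Pipeline([
    ('scaler', StandardScaler()),       # assumed preprocessing stage
    ('regressor', LinearRegression()),  # baseline estimator
])
baseline_pipeline.fit(X_train, y_train)
```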

Model Evaluation:

Performance:

The performance of the model was assessed using the Mean Squared Error (MSE) and R² Score for both the training and testing datasets, along with cross-validation scores for overall assessment.

Metric      Training Results        Testing Results
MSE         102.63865766316138      110.07593549277638
R² Score    0.7571468808446675      0.75714688084466755

CV Fold     Score
1           0.67128828
2           0.7337971
3           0.77311845
4           0.70509366
5           0.74905523
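For reference, figures like these can be computed with scikit-learn as sketched below; the 5-fold R² cross-validation is inferred from the five fold scores reported above.

```python
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score

# Training and testing MSE / R² for the fitted baseline pipeline.
for name, X_part, y_part in [('Training', X_train, y_train),
                             ('Testing', X_test, y_test)]:
    preds = baseline_pipeline.predict(X_part)
    print(f'{name} MSE: {mean_squared_error(y_part, preds)}')
    print(f'{name} R²:  {r2_score(y_part, preds)}')

# Five-fold cross-validation (the score defaults to R² for regressors).
cv_scores = cross_val_score(baseline_pipeline, X, y, cv=5)
print('CV fold scores:', cv_scores)
```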

Final Model

In this section, we refined our model by adding new features and employing Lasso Regression for better prediction accuracy and feature selection.

Added Features and Their Rationale:

Modeling Algorithm and Hyperparameters:

Improvement Over Baseline Model:

The final model introduces Lasso regression and additional features, which are expected to enhance prediction accuracy and model interpretability.
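A hedged sketch of how the Lasso model and its hyperparameter search could be set up is shown below; the alpha grid, the scaling step, and the 5-fold RMSE-based search are illustrative assumptions rather than the project's exact configuration.

```python
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Lasso pipeline with a grid search over the regularization strength.
lasso_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', Lasso(max_iter=10_000)),
])
param_grid = {'regressor__alpha': [0.001, 0.01, 0.1, 1.0, 10.0]}  # assumed grid

search = GridSearchCV(lasso_pipeline, param_grid, cv=5,
                      scoring='neg_root_mean_squared_error')
search.fit(X_train, y_train)
print('Best parameters:', search.best_params_)
```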

Performance Metrics:

The model’s performance shows a slight reduction in RMSE and a slight increase in the R² score, suggesting improved prediction accuracy and a better overall fit.

Best Parameters:

Performance of the Final Model:

The final model’s performance, assessed through training and testing data, along with cross-validation, is as follows:

Metric      Training Results        Testing Results
RMSE        99.81431897766402       107.0671157234895
R² Score    0.7677415984965867      0.7702417458530844

CV Fold     Score
1           0.69369467
2           0.76144211
3           0.77577528
4           0.7589276
5           0.72730318

Significance of the Improvements:

Fairness Analysis

Choice of Groups X and Y:

Evaluation Metric:

Null and Alternative Hypotheses:

Test Statistic and Significance Level:

Permutation Test Result:

Conclusion:

We reject the null hypothesis at a significance level of 0.05, indicating a statistically significant difference in model performance between the high-calorie and low-calorie groups and suggesting potential fairness concerns. This result is intuitive: the high-calorie and low-calorie groups can be expected to have different nutritional content, and higher-calorie recipes usually contain more sugar (a milkshake versus a salad, for example).
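As an illustration of the procedure, the sketch below runs a permutation test on the test set, assuming the groups are split at the median calorie value and that the test statistic is the absolute difference in RMSE between the two groups; both choices, and the 1,000 permutations, are assumptions for illustration rather than the project's exact setup.

```python
import numpy as np

# Assumed setup: split the test set into high- and low-calorie groups at the
# median calorie value and compare model error (RMSE) between the groups.
is_high = (X_test['calories'] >= X_test['calories'].median()).to_numpy()
errors = y_test.to_numpy() - search.predict(X_test)

def rmse_gap(high_mask, errs):
    """Absolute difference in RMSE between the two groups."""
    rmse_high = np.sqrt(np.mean(errs[high_mask] ** 2))
    rmse_low = np.sqrt(np.mean(errs[~high_mask] ** 2))
    return abs(rmse_high - rmse_low)

observed = rmse_gap(is_high, errors)

# Permutation test: shuffle the group labels and recompute the statistic.
rng = np.random.default_rng(42)
null_stats = np.array([rmse_gap(rng.permutation(is_high), errors)
                       for _ in range(1_000)])
p_value = np.mean(null_stats >= observed)
print(f'Observed RMSE gap: {observed:.3f}, p-value: {p_value:.3f}')
```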