Labeling Recipes with Logistic Regression

Part 3 - Lunch / Dinner Labels, More Insights

Python code in this project: lunch_label_code.txt + dinner_label_code.txt + embedded snippets.
Having cleaned the data, dealt with possible multicollinearity issues, and looked at breakfast labels in part 2, let's move on to the lunch and dinner labels.

Multiclass Logistic Regression


In part 1 we talked about logistic regression for just two classes, \( y = 0 \) and \( y = 1 \). However, since I am trying to choose between three labels (breakfast, lunch, dinner) here, I need to think about how to generalize to more than two classes. There are many ways to go about this. A few common methods are:

- One-vs-all (also called one-vs-rest): fit a separate binary classifier for each class against everything else.
- One-vs-one: fit a binary classifier for every pair of classes and let the classifiers vote.
- Multinomial (softmax) regression: generalize the logistic function so the model assigns a probability to each class directly.

Just like I did for breakfast labels, I am going to continue using the one-vs-all method, and fit a separate logistic regression model for the lunch and dinner labels.

Note that the logistic regression module in sklearn offers both the one-vs-rest and multinomial schemes: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
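
To make the one-vs-all idea concrete, here is a minimal sketch. The feature matrix X and the 0/1 label vectors below are made-up stand-ins, not the actual recipe dataset:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(200, 10))           # stand-in recipe features
    labels = {name: rng.integers(0, 2, size=200)     # stand-in 0/1 meal labels
              for name in ["breakfast", "lunch", "dinner"]}

    # One-vs-all: fit a separate binary logistic regression per label,
    # each one distinguishing that label from everything else.
    models = {}
    for name in ["breakfast", "lunch", "dinner"]:
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, labels[name])
        models[name] = clf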

Lunch Labels


The Python code for lunch labels can be found here: lunch_label_code.txt. Unlike the breakfast labels, the separation here is not perfect, but the fit is still surprisingly close to perfect!


Just like for breakfast labels, we look at the top 30 and bottom 30 explanatory variables, ranked by the value of their fitted coefficients.
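
In code, that ranking might look like the sketch below. It reuses the stand-in `models` dict from the earlier sketch, and the `feature_names` list is just a placeholder for the real column names:

    import numpy as np

    # Placeholder names, one per column of the feature matrix.
    feature_names = [f"ingredient_{i}"
                     for i in range(models["lunch"].coef_.shape[1])]

    coefs = models["lunch"].coef_.ravel()   # one coefficient per explanatory variable
    order = np.argsort(coefs)
    k = min(30, len(coefs))

    print("top coefficients:")
    for i in order[::-1][:k]:
        print(f"  {feature_names[i]}: {coefs[i]:+.3f}")
    print("bottom coefficients:")
    for i in order[:k]:
        print(f"  {feature_names[i]}: {coefs[i]:+.3f}")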






Dinner Labels

Python code for dinner labels: dinner_label_code.txt. Of the three, the dinner labels performed the worst, but I was extremely surprised to see that the fit is still close to perfect!
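
One rough way to put a number on "close to perfect" is training accuracy. A minimal sketch, reusing the stand-in `models`, `X`, and `labels` from the first sketch:

    from sklearn.metrics import accuracy_score

    # Fraction of recipes whose predicted dinner label matches the actual one.
    preds = models["dinner"].predict(X)
    print("dinner training accuracy:", accuracy_score(labels["dinner"], preds))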


Again, we look at the top 30 and bottom 30 explanatory variables, ranked by the value of their fitted coefficients.






Using Unlabeled Recipes as Test Set

As I mentioned before in part 1, the large majority of the recipes in this dataset are missing breakfast/lunch/dinner labels. We can actually use these as a kind of test set, in the spirit of cross validation.

Cross validation involves training a machine learning model on different subsets of the training data, then validating its accuracy on a disjoint set of test data. The goal is to avoid overfitting to patterns unique to any one training set, while still capturing patterns that generalize.
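
For example, k-fold cross validation partitions the data into k folds, trains on k-1 of them, and tests on the one held out, rotating through all k folds. A minimal sketch with sklearn, reusing the stand-in data from the first sketch:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # 5-fold cross validation: each fold in turn serves as the held-out
    # test set while the model trains on the remaining four folds.
    scores = cross_val_score(LogisticRegression(max_iter=1000),
                             X, labels["lunch"], cv=5)
    print("per-fold accuracy:", scores)
    print("mean accuracy:", scores.mean())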

It is actually quite similar in spirit to bootstrapping in statistics, a technique for studying how a statistical estimator varies across different sample datasets. This classic paper has an excellent write-up about it: http://www.medicine.mcgill.ca/epidemiology/hanley/bios601/Mean-Quantile/EfronDiaconisBootstrap.pdf
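
A tiny illustration of the bootstrap idea, with a made-up sample:

    import numpy as np

    # Bootstrap: resample the data with replacement many times and watch
    # how an estimator (here the sample mean) varies across resamples.
    rng = np.random.default_rng(0)
    data = rng.normal(loc=5.0, scale=2.0, size=100)   # made-up sample

    estimates = []
    for _ in range(1000):
        resample = rng.choice(data, size=len(data), replace=True)
        estimates.append(resample.mean())

    print("bootstrap std. error of the mean:", np.std(estimates))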

This section is a work in progress that will be finished soon!