Saturday, May 28, 2022

Sandwich Data Science 2: Electric Boogaloo


I told myself I wouldn’t do it again. The last time nearly broke me. And yet, just when I thought I was out, they pull me back in.

Against my better judgement, I did another sandwich data science project. Luckily, this one was significantly simpler.

I work at Square, and their NYC office is in SoHo. While there are many reasons not to go into the office these days, one draw is that I can pick up lunch at Alidoro, a tiny Italian sandwich shop that’s nearby. The sandwiches are the quintessential European antithesis to American sandwiches; they consist of only a couple of extremely high quality ingredients.

From these few ingredients emerge 40 different types of sandwiches, and these 40 sandwiches form an impenetrable menu.

Naively, you may think you can pick a sandwich that looks close to what you want and then customize it. Perhaps you’d like the Romeo but with some fresh mozzarella? Well then perhaps you’ll be wrong because customization is not allowed. Did I mention that there are some Soup Nazi vibes to this place? You can only order what’s on the menu, and it took the global pandemic to finally break their will to remain cash only.

Some people like to explore new items on a menu, while I always exploit the one that I’ve been happy with. Case in point: I get the Fellini on Focaccia every time. Still, I remember what it was like to be a newcomer and encounter that impenetrable menu.

And so, this blog post is my attempt at data visualization. My goal is to visualize the menu in such a way that one can quickly scan it to find a sandwich they’d like. As an added bonus, I’ll close with some statistical modeling of the sandwich pricing.

Packaging and Presentation

Like many of my blog posts, I wrote this one in a Jupyter notebook. While I like to show the full code for the blog post, I didn’t want this post to be as impenetrable as Alidoro’s menu. I made a small sandmat package to handle much of the code. The package, along with the Jupyter notebook version of this blog post, can be found on GitHub.

%config InlineBackend.figure_format = 'retina'

import matplotlib.pyplot as plt

from sandmat import scrape, sorting, viz

To start, we need to get the menu and turn it into “data”. For whatever reason, I didn’t feel like using pandas for this blog post, so everything we’ll deal with will be collections of dataclasses.

Within the sandmat package, I make two dataclasses: Ingredient and Sandwich. Most of the fields are self-explanatory with the exception of the ingredient categories. For these, I manually classify ingredients into meat, cheese, topping, or dressing. In hindsight, I probably should’ve made this an Enum field.

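from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Ingredient:
    name: str
    category: str  # one of: meat, cheese, topping, dressing


@dataclass(frozen=True)
class Sandwich:
    name: str
    ingredients: Tuple[Ingredient, ...]  # variable-length tuple
    price: float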
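As a rough sketch of that Enum idea, the category could be restricted to four known values. The Category class and its member names below are illustrative assumptions and are not part of the sandmat package.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """Hypothetical enum of ingredient categories (not in sandmat)."""

    MEAT = "meat"
    CHEESE = "cheese"
    TOPPING = "topping"
    DRESSING = "dressing"


@dataclass(frozen=True)
class Ingredient:
    name: str
    category: Category


# Looking up a member by value raises ValueError on a typo such as "chese".
prosciutto = Ingredient(name="prosciutto", category=Category("meat"))
print(prosciutto.category.name)
```

The main benefit is that a misspelled category fails loudly at construction time instead of silently creating a fifth category.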

I use Beautiful Soup to scrape the menu page of the Alidoro website and grab the section of the HTML that pertains to the menu. I then do some parsing, cleaning, and categorization in order to turn the menu into a list of Sandwich objects.

URL = "https://www.alidoronyc.com/menu/menu/"
sandwiches = scrape.get_sandwiches(URL)
print(f"{len(sandwiches)} sandwiches found.")
print("Displaying the first two:\n")
for sandwich in sandwiches[:2]:
    print(sandwich)
    print()

40 sandwiches found.
Displaying the first two:

Sandwich(name='Matthew', ingredients=(Ingredient(name='prosciutto', category='meat'), Ingredient(name='fresh mozzarella', category='cheese'), Ingredient(name='dressing', category='dressing'), Ingredient(name='arugula', category='topping')), price=14.0)

Sandwich(name='Alyssa', ingredients=(Ingredient(name='smoked chicken breast', category='meat'), Ingredient(name='fresh mozzarella', category='cheese'), Ingredient(name='arugula', category='topping'), Ingredient(name='dressing', category='dressing')), price=14.0)

Ingredient Rank and File

I would like to display the sandwiches in “matrix” form. Each sandwich will be a row, each ingredient will be a column, and the values of the matrix will indicate if a sandwich has a particular ingredient. What’s left is to decide on an order to the sandwich rows and an order to the ingredient columns.

In my initial approach, I coded up a traveling salesman problem in which sandwiches were cities and the overlap in ingredients between any two sandwiches was the “distance” between sandwiches. It would’ve made for the perfect title (“Traveling Sandwich Problem”, obviously), but, contrary to the numerical solution, the result was visually suboptimal.
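To make the “overlap as distance” idea concrete, here is a minimal sketch with made-up sandwiches. It uses Jaccard distance and a greedy nearest-neighbor tour, which is only a heuristic stand-in for a real traveling salesman solver; both choices, and all of the names, are assumptions for illustration.

```python
def jaccard_distance(a, b):
    """1 - (shared ingredients / all ingredients): 0 if identical, 1 if disjoint."""
    return 1 - len(a & b) / len(a | b)


# Toy sandwiches, made up for illustration.
toy = {
    "P": {"prosciutto", "arugula"},
    "Q": {"tuna", "artichokes"},
    "R": {"prosciutto", "arugula", "dressing"},
    "S": {"tuna", "artichokes", "dressing"},
}


def greedy_tour(sets, start):
    """Repeatedly hop to the closest unvisited sandwich."""
    tour = [start]
    remaining = set(sets) - {start}
    while remaining:
        current = sets[tour[-1]]
        # sorted() makes tie-breaking deterministic (alphabetical).
        nearest = min(
            sorted(remaining),
            key=lambda name: jaccard_distance(current, sets[name]),
        )
        tour.append(nearest)
        remaining.remove(nearest)
    return tour


print(greedy_tour(toy, "P"))
```

An ordering like this keeps similar sandwiches adjacent, but nothing in it knows that meats should be read before toppings, which is one plausible reason such a layout can look scattered.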

Thankfully, this is a problem where we can rely on domain expertise. As a sandwich eater myself, I thought about how I typically pick a sandwich. I often look at the meats first, then the cheeses, and then everything else. Okay, let’s sort the ingredient columns by category “rank”: meat, cheese, topping, dressing. Within each category, how about using the recsys go-to of sorting in descending order of popularity? Combining category rank and popularity gives us our full ingredient column order. In SQL, we’d want to do something like

SELECT
  category
  , CASE
    WHEN category = 'meat' THEN 1
    WHEN category = 'cheese' THEN 2
    WHEN category = 'topping' THEN 3
    WHEN category = 'dressing' THEN 4
  END AS category_rank
  , ingredient
  , COUNT(DISTINCT sandwich) AS num_sandwiches
FROM sandwich_ingredients
GROUP BY category, category_rank, ingredient
ORDER BY category_rank ASC, num_sandwiches DESC

ranked_categories = sorting.get_ranked_categories(sandwiches)
ordered_ingredients = sorting.get_ordered_ingredients(ranked_categories)
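For readers who prefer Python to SQL, the same ordering can be sketched with the standard library. This is a minimal illustration on made-up rows, not the sandmat implementation; the names rows, CATEGORY_RANK, and toy_order are assumptions, as is the alphabetical tie-break.

```python
from collections import Counter

# Toy (sandwich, ingredient, category) rows, made up for illustration.
rows = [
    ("A", "prosciutto", "meat"),
    ("A", "fresh mozzarella", "cheese"),
    ("A", "arugula", "topping"),
    ("B", "tuna", "meat"),
    ("B", "arugula", "topping"),
    ("B", "dressing", "dressing"),
    ("C", "prosciutto", "meat"),
    ("C", "dressing", "dressing"),
]

CATEGORY_RANK = {"meat": 1, "cheese": 2, "topping": 3, "dressing": 4}

# COUNT(DISTINCT sandwich) per (category, ingredient): dedupe rows first.
distinct_rows = set(rows)
counts = Counter((category, ingredient) for _, ingredient, category in distinct_rows)

# ORDER BY category_rank ASC, num_sandwiches DESC; the ingredient name
# breaks ties so the result is deterministic.
ordered = sorted(
    counts,
    key=lambda key: (CATEGORY_RANK[key[0]], -counts[key], key[1]),
)
toy_order = [ingredient for _, ingredient in ordered]
print(toy_order)
```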

For ordering our sandwich rows, let’s sort them by a special key: a tuple that contains their most popular ingredient in each category, where the tuple is in order of meat, cheese, topping, dressing.

ordered_sandwiches = sorting.get_ordered_sandwiches(sandwiches, ranked_categories)
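A minimal sketch of such a sort key follows, using toy data. The helper sort_key and the ranked dictionary are hypothetical, and placing sandwiches that lack a category after those that have it is my own assumption about how to handle the gap.

```python
CATEGORIES = ("meat", "cheese", "topping", "dressing")

# Hypothetical popularity order within each category, most popular first.
ranked = {
    "meat": ["prosciutto", "tuna"],
    "cheese": ["fresh mozzarella"],
    "topping": ["arugula", "artichokes"],
    "dressing": ["dressing"],
}

# Toy sandwiches as name -> set of ingredients.
toy_sandwiches = {
    "X": {"tuna", "arugula"},
    "Y": {"prosciutto", "fresh mozzarella", "artichokes"},
    "Z": {"prosciutto", "fresh mozzarella", "arugula", "dressing"},
}


def sort_key(ingredients):
    """Position of the sandwich's most popular ingredient in each category."""
    key = []
    for category in CATEGORIES:
        positions = [
            i for i, ing in enumerate(ranked[category]) if ing in ingredients
        ]
        # Assumption: a sandwich missing a category sorts after those that have it.
        key.append(min(positions) if positions else len(ranked[category]))
    return tuple(key)


ordered_names = sorted(
    toy_sandwiches, key=lambda name: sort_key(toy_sandwiches[name])
)
print(ordered_names)
```

Because Python compares tuples element by element, sandwiches are grouped by meat first, then by cheese within each meat, and so on.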

Visualizing the Matrix

Finally, with our ordered ingredients and sandwiches, we can visualize the Alidoro sandwich menu as a matrix.

sandwich_mat = viz.make_sandwich_matrix(ordered_sandwiches, ordered_ingredients)
fig, ax = viz.plot_sandwiches(sandwich_mat, ordered_sandwiches, ordered_ingredients)
plt.show();
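For intuition, the matrix itself is just a grid of zeros and ones. Here is a minimal sketch with toy data; the variable names are illustrative, and viz.make_sandwich_matrix may well be implemented differently.

```python
# Toy inputs, made up for illustration.
column_order = ["prosciutto", "tuna", "fresh mozzarella", "arugula"]
row_order = [
    ("Z", {"prosciutto", "fresh mozzarella", "arugula"}),
    ("X", {"tuna", "arugula"}),
]

# One row per sandwich, one column per ingredient: 1 if present, else 0.
matrix = [
    [1 if ingredient in contents else 0 for ingredient in column_order]
    for _, contents in row_order
]

for (name, _), row in zip(row_order, matrix):
    print(name, row)
```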

Just for prosciuttos and giggles, I decided to treat my sandwich matrix as a design matrix. I’ll fit a linear regression on the sandwich matrix with the sandwich price as the target variable. The model coefficients will thus be the price of each ingredient, and a bias term will hold the base price of the sandwich (which includes the bread). As you can see, the model is pretty well-calibrated! I guess Alidoro’s sandwich pricing is pretty consistent.
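Written out, the regression described above looks roughly like the following, where x_{ij} is 1 if sandwich i contains ingredient j and 0 otherwise, \beta_0 is the base sandwich price, and \beta_j is the price of ingredient j. The notation is just for illustration.

```latex
\text{price}_i = \beta_0 + \sum_{j=1}^{20} \beta_j \, x_{ij} + \varepsilon_i
```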

import numpy as np
import statsmodels.api as sm

y = np.array([sandwich.price for sandwich in ordered_sandwiches])
X = sandwich_mat.copy()

# Prepend an intercept column; its coefficient is the base sandwich price.
X = sm.add_constant(X, prepend=True)

model = sm.OLS(y, X)
res = model.fit()
res.summary(
    yname="Price ($)", xname=["Base Sandwich Price"] + list(ordered_ingredients)
)
OLS Regression Results

Dep. Variable:      Price ($)           R-squared:            0.971
Model:              OLS                 Adj. R-squared:       0.940
Method:             Least Squares       F-statistic:          31.39
Date:               Sun, 26 Sep 2021    Prob (F-statistic):   1.48e-10
Time:               10:22:02            Log-Likelihood:       9.6979
No. Observations:   40                  AIC:                  22.60
Df Residuals:       19                  BIC:                  58.07
Df Model:           20
Covariance Type:    nonrobust

                           coef  std err        t    P>|t|   [0.025   0.975]
Base Sandwich Price      8.0451    0.265   30.334    0.000    7.490    8.600
prosciutto               2.1138    0.166   12.769    0.000    1.767    2.460
sopressata               1.9554    0.152   12.875    0.000    1.638    2.273
smoked chicken breast    2.0618    0.182   11.323    0.000    1.681    2.443
tuna                     1.7025    0.171    9.940    0.000    1.344    2.061
salami                   2.1288    0.279    7.641    0.000    1.546    2.712
capicollo                2.0982    0.327    6.421    0.000    1.414    2.782
mortadella               3.0738    0.359    8.573    0.000    2.323    3.824
sardines or mackerel     2.4387    0.375    6.497    0.000    1.653    3.224
fresh mozzarella         1.3168    0.174    7.581    0.000    0.953    1.680
smoked mozzarella        1.3141    0.210    6.271    0.000    0.875    1.753
m. bel paese             1.2748    0.223    5.707    0.000    0.807    1.742
provolone cheese         1.3559    0.250    5.429    0.000    0.833    1.879
arugula                  1.2985    0.129   10.076    0.000    1.029    1.568
artichokes               1.2708    0.140    9.074    0.000    0.978    1.564
sun dried tomatoes       1.2414    0.147    8.458    0.000    0.934    1.549
sweet roasted peppers    1.1692    0.135    8.637    0.000    0.886    1.453
hot peppers              1.0734    0.183    5.850    0.000    0.689    1.458
caponata of eggplant     1.0643    0.210    5.074    0.000    0.625    1.503
dressing                 1.0242    0.172    5.963    0.000    0.665    1.384
olive paste              0.5690    0.285    1.998    0.060   -0.027    1.165

Omnibus:            14.030              Durbin-Watson:        2.450
Prob(Omnibus):      0.001               Jarque-Bera (JB):     17.010
Skew:               -1.089              Prob(JB):             0.000202
Kurtosis:           5.337               Cond. No.             15.8

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We can inspect this model visually by plotting the prices of all of the ingredients. I had no idea mortadella was the most expensive meat.

viz.plot_ingredients(ordered_ingredients, res)


And last but not least, we can compare the sandwich price to the model’s predicted price in order to get an idea if any sandwich’s price is wildly inconsistent. Most sandwich prices are consistent, although the Gabriella is apparently cheaper than expected at $11.00 for (only!) fresh mozzarella, dressing, and arugula. I don’t know if I’d call that cheap, but, then again, neither is SoHo.

y_pred = model.predict(res.params)
chart = viz.plot_actual_vs_pred(y, y_pred, ordered_sandwiches)
chart.properties(width=400, height=400)
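As a sanity check on the Gabriella claim, the predicted price can be rebuilt by hand from the fitted coefficients. The sketch below copies the four relevant values from the regression table; the variable names are just for illustration.

```python
# Coefficients copied from the regression table above.
coef = {
    "Base Sandwich Price": 8.0451,
    "fresh mozzarella": 1.3168,
    "dressing": 1.0242,
    "arugula": 1.2985,
}

gabriella_actual = 11.00
# With 0/1 indicators, the prediction is just the sum of the relevant terms.
gabriella_predicted = sum(coef.values())
residual = gabriella_actual - gabriella_predicted

print(f"predicted: ${gabriella_predicted:.2f}, residual: ${residual:.2f}")
```

The model expects roughly $11.68, so the menu price of $11.00 comes in about 68 cents under.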

