What's Cooking? Classification of recipes into cuisines according to ingredients
Source of data
Description of the dataset
The dataset comes from the American recipe website Yummly (www.yummly.com) and contains almost 40,000 recipes and about 6,700 ingredients. Each recipe is classified (by its author) into one of 20 cuisine categories (e.g. British, Japanese, Mexican) and lists the ingredients needed to prepare the dish (e.g. water, vegetable oil, wheat, salt; no quantities are given). I downloaded the original JSON file from Kaggle and prepared two versions of the table for further use, both with columns representing individual ingredients. In the first, complete version (almost 40,000 rows), each row represents one original recipe, with a value of 1 for each ingredient present in that recipe and 0 otherwise. In the second, aggregated version (20 rows), each row represents one cuisine, and each cell contains the number of recipes from that cuisine that use the given ingredient.
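The relationship between the two table versions can be illustrated on a toy example. This is only a sketch in base R: the recipe IDs, cuisines, ingredients and all object names (toy_long, toy_compl, toy_cuisine, toy_cuisines) are invented for illustration and are not part of the dataset.

```r
# Toy stand-in for the recipe data: one row per recipe-ingredient pair
# (all values are invented for illustration)
toy_long <- data.frame (
  id = c (1, 1, 1, 2, 2, 3, 3, 3),
  cuisine = c ('mexican', 'mexican', 'mexican', 'japanese', 'japanese',
               'mexican', 'mexican', 'mexican'),
  ingredients = c ('salt', 'corn', 'water', 'rice', 'salt', 'salt', 'corn', 'lime'),
  stringsAsFactors = FALSE
)

# Complete version: recipes in rows, ingredients in columns, 1 = present, 0 = absent
toy_compl <- unclass (table (toy_long$id, toy_long$ingredients))
toy_compl [toy_compl > 1] <- 1L   # guard against an ingredient listed twice in a recipe

# Cuisine of each recipe, in the same order as the rows of toy_compl
toy_cuisine <- toy_long$cuisine [match (rownames (toy_compl), toy_long$id)]

# Aggregated version: cuisines in rows, cells = number of recipes using the ingredient
toy_cuisines <- rowsum (toy_compl, group = toy_cuisine)

print (toy_compl)
print (toy_cuisines)
```

In this toy case, the aggregated table has two rows (japanese, mexican), and the cell for mexican and salt equals 2 because two of the Mexican recipes list salt.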
Locality
Global (I assume), but perhaps with a geographic bias (since www.yummly.com is an American website).
Data for download
File name | File type | Description |
---|---|---|
recipes_cuisines.txt | tab-delimited txt format | Contains the cuisines × ingredients table (20 cuisines in rows, 6714 ingredients in columns; cell values equal the number of recipes of a given cuisine that mention a given ingredient) |
recipes_cuisines_count.txt | tab-delimited txt format | Contains two columns: the cuisine and the number of recipes in the dataset assigned to this cuisine. |
recipes_long.txt | tab-delimited txt format | All recipes in long format, containing recipe ID, cuisine, and ingredients. Needs to be reshaped into wide format for analysis (code included in the section on direct import of data to R) |
Script for direct import of data to R
Cuisine x ingredients table and recipe count table
    library (readr)
    recipes_cuisines <- read_delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/recipes_cuisines.txt')
    recipes_cuisines_count <- read_delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/recipes_cuisines_count.txt')
Individual recipes (long format) and reshaping into wide format
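    library (readr)
    library (dplyr)   # as_tibble, select, mutate, distinct, %>%
    library (tidyr)   # pivot_wider
    recipes_long <- read_delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/recipes_long.txt')
    # Reshape to wide format: one row per recipe, one column per ingredient;
    # values_fn = sum adds up repeated listings of an ingredient within a recipe
    recipes_compl_tb <- as_tibble (recipes_long) %>%
      select (-cuisine) %>%
      mutate (presence = 1L) %>%
      pivot_wider (names_from = ingredients, values_from = presence, values_fn = sum, values_fill = 0L)
    # Convert to data.frame, use recipe ID as row names and drop the ID column
    recipes_compl <- as.data.frame (recipes_compl_tb)
    rownames (recipes_compl) <- recipes_compl_tb$id
    recipes_compl <- recipes_compl [,-1]
    # Recipe ID and cuisine only (drops the third column, ingredients)
    recipes_compl_cuisine <- distinct (recipes_long[,-3])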
The code used for reshaping the long data from recipes_long results in two objects: recipes_compl, a data.frame containing each recipe as a row and each ingredient as a column, with cell values representing presence-absence data; and recipes_compl_cuisine (a tibble) with the recipe ID and the cuisine to which the recipe was assigned on yummly.com.
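As a sketch of how the two resulting objects can be used together, the example below aligns cuisines with recipes by recipe ID and derives simple summaries. It runs on small invented stand-ins (toy_recipes_compl and toy_recipes_compl_cuisine are hypothetical names with made-up values), so it does not require downloading the data; the same steps should apply to recipes_compl and recipes_compl_cuisine, assuming the structure described above.

```r
# Toy stand-ins for the two reshaped objects (values invented for illustration)
toy_recipes_compl <- data.frame (
  salt = c (1L, 1L, 1L),
  corn = c (1L, 0L, 1L),
  rice = c (0L, 1L, 0L),
  lime = c (0L, 0L, 1L),
  row.names = c ('101', '102', '103')
)
toy_recipes_compl_cuisine <- data.frame (
  id = c (103, 101, 102),   # deliberately in a different order than the rows above
  cuisine = c ('mexican', 'mexican', 'japanese'),
  stringsAsFactors = FALSE
)

# Align cuisines with the rows of the recipe table via the recipe ID,
# so the result does not rely on both objects sharing the same row order
idx <- match (rownames (toy_recipes_compl), toy_recipes_compl_cuisine$id)
cuisine <- toy_recipes_compl_cuisine$cuisine [idx]

# Number of ingredients listed in each recipe
n_ingredients <- rowSums (toy_recipes_compl)

# Number of recipes per cuisine (the same kind of summary as in recipes_cuisines_count.txt)
n_recipes <- table (cuisine)

print (n_ingredients)
print (n_recipes)
```

Matching on the recipe ID rather than on row position is a defensive choice: it keeps the cuisine labels correct even if one of the objects is later sorted or subset.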
References
- Dataset from Kaggle, downloaded from Yummly (yummly.com). Unknown author.