What's Cooking? Classification of recipes into cuisines according to ingredients

Source of data

Kaggle (unknown author), with dataset compiled from Yummly

Description of the dataset

The dataset was downloaded from the American recipe website Yummly and contains almost 40,000 recipes and about 6,700 ingredients. Each recipe is classified (by its author) into one of 20 cuisine categories (e.g. British, Japanese, Mexican) and lists the ingredients needed to prepare the dish (e.g. water, vegetable oil, wheat, salt; no quantities are given). I downloaded the original JSON file from Kaggle and prepared two versions of the table for further use, both with columns representing individual ingredients. In the first, complete version (almost 40,000 rows), each row represents one original recipe, with a value of 1 for each ingredient present in that recipe and 0 otherwise. In the second, aggregated version (20 rows), each row represents one cuisine, and cell values give the number of recipes from that cuisine that use a given ingredient.
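The difference between the two table versions can be illustrated on a tiny made-up example. This is only a sketch: the object toy_long and its values are hypothetical and are not part of the dataset, and base R table () is used here for brevity rather than the code used to prepare the actual files.

```r
# Hypothetical toy data in long format (one row per recipe-ingredient pair)
toy_long <- data.frame (
  id = c (1, 1, 2, 2, 3, 3),
  cuisine = c ('mexican', 'mexican', 'mexican', 'mexican', 'japanese', 'japanese'),
  ingredients = c ('salt', 'corn', 'salt', 'beans', 'salt', 'rice'),
  stringsAsFactors = FALSE
)

# Complete version: recipes in rows, ingredients in columns, 1 = present, 0 = absent
toy_compl <- unclass (table (toy_long$id, toy_long$ingredients))
toy_compl[toy_compl > 0] <- 1

# Aggregated version: cuisines in rows, cells = number of recipes using the ingredient
toy_unique <- unique (toy_long)  # count each ingredient only once per recipe
toy_agg <- unclass (table (toy_unique$cuisine, toy_unique$ingredients))

toy_compl
toy_agg
```

In this toy case, toy_compl has three rows (recipes) and toy_agg has two rows (cuisines); both share the same four ingredient columns.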


Global (I assume), but perhaps with a geographic bias (since Yummly is an American website)

Data for download

File name | File type | Description
recipes_cuisines.txt | tab-delimited txt | Cuisines × ingredients table (20 cuisines in rows, 6714 ingredients in columns; cell values are the counts of recipes of a given cuisine that mention a given ingredient)
recipes_cuisines_count.txt | tab-delimited txt | Two columns: the cuisine and the count of recipes in the dataset assigned to this cuisine
recipes_long.txt | tab-delimited txt | All recipes in long format, containing recipe ID, cuisine, and ingredients. Needs to be reshaped into wide format for analysis (code included in the section on direct import of data to R)
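One possible use of the recipe counts, sketched here on made-up numbers, is to convert the counts in the cuisines × ingredients table into the proportion of recipes of each cuisine that use a given ingredient. The objects toy_cuisines and toy_count and the column names cuisine and count are assumptions made for illustration; check the column names of the actual files before adapting this.

```r
# Hypothetical stand-in for the cuisines x ingredients table (counts of recipes)
toy_cuisines <- matrix (c (2, 1, 1, 0, 0, 1), nrow = 2,
  dimnames = list (c ('mexican', 'japanese'), c ('salt', 'corn', 'rice')))

# Hypothetical stand-in for the recipe count table (column names are assumed)
toy_count <- data.frame (cuisine = c ('japanese', 'mexican'), count = c (1, 2))

# Align the recipe counts with the row order of the cuisines table
n_recipes <- toy_count$count[match (rownames (toy_cuisines), toy_count$cuisine)]

# Divide each row by the number of recipes in that cuisine
toy_prop <- toy_cuisines / n_recipes
toy_prop
```

The match () step matters because the two tables are not guaranteed to list cuisines in the same order.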

Script for direct import of data to R

Cuisines × ingredients table and recipe count table

library (readr)
# insert the path or URL of the corresponding file between the quotes
recipes_cuisines <- read_delim ('')
recipes_cuisines_count <- read_delim ('')

Individual recipes (long format) and reshaping into wide format

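library (dplyr)
library (tidyr)
# insert the path or URL of recipes_long.txt between the quotes
recipes_long <- read_delim ('')
# reshape to wide format: one row per recipe, one column per ingredient
recipes_compl_tb <- as_tibble (recipes_long) %>%
  select (-cuisine) %>%
  mutate (presence = 1L) %>%
  pivot_wider (names_from = ingredients, values_from = presence, values_fn = sum, values_fill = 0L)
# convert to data.frame so that recipe IDs can be stored as row names
recipes_compl <- as.data.frame (recipes_compl_tb)
rownames (recipes_compl) <- recipes_compl_tb$id
recipes_compl <- recipes_compl [,-1]  # drop the id column
# recipe ID and cuisine, one row per recipe (third column, ingredients, removed)
recipes_compl_cuisine <- distinct (recipes_long[,-3])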

The code used for reshaping the long data from recipes_long results in two objects: recipes_compl, a data.frame with each recipe as a row and each ingredient as a column, with cell values recording presence (1) or absence (0) of the ingredient; and recipes_compl_cuisine (a tibble) with the recipe ID and the cuisine to which the recipe was assigned.
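The two resulting objects are meant to be used together, so it may be worth checking that their rows refer to the same recipes in the same order before any analysis. The sketch below does this on small made-up stand-ins (recipes_compl_toy and recipes_cuisine_toy are hypothetical names and values) and then sums the presence-absence values within each cuisine with base R rowsum (), which gives counts of the same kind as in the aggregated cuisines × ingredients table.

```r
# Hypothetical stand-ins for recipes_compl and recipes_compl_cuisine
recipes_compl_toy <- data.frame (
  salt = c (1, 1, 1), corn = c (1, 0, 0), rice = c (0, 0, 1),
  row.names = c ('10', '20', '30')
)
recipes_cuisine_toy <- data.frame (
  id = c (10, 20, 30),
  cuisine = c ('mexican', 'mexican', 'japanese')
)

# Rows of both objects should refer to the same recipes, in the same order
stopifnot (identical (rownames (recipes_compl_toy), as.character (recipes_cuisine_toy$id)))

# Sum presence-absence values within each cuisine
cuisine_counts <- rowsum (recipes_compl_toy, group = recipes_cuisine_toy$cuisine)
cuisine_counts
```

Note that rowsum () sorts the groups, so the cuisines in the result appear in alphabetical order rather than in their order of appearance.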


  • Dataset from Kaggle, downloaded from Yummly. Unknown author.
en/data/recipes.txt · Last modified: 2022/11/25 13:48 by David Zelený
