I'd had the idea since 2012 of creating a piece of software that, given a set of recipes with nutrition data and cost defined, would generate meal plans optimized for various criteria, particularly nutrition and cost. I've worked on this project on and off since then. It was a simple idea that proved disproportionately difficult for me to actually see through. It took until this year, 2025, to produce a minimal, pared-down version of what I'd had in mind.
I spent a fair amount of time gathering nutrition data from the USDA and massaging it into a usable form. I was interested in the Go programming language at the time, so that's what I used to write the program that generates the SQL that creates and fills the database.
The process was a lot less straightforward than I'd thought it would be, so hopefully this will help others trying to make similar use of the USDA's datasets.
I spent some time looking for recipe collections released under the Creative Commons. I initially found Foodista and crawled the entire wiki, generating YAML files from the HTML. I haven't used any of these yet, though, so far preferring to generate meal plans from the recipes I later pulled from Grimgrains.
First, I gathered recipe URLs using this shell script I wrote:
#!/usr/bin/env sh
set -e

# The "pause" indicates how many seconds to wait between pages.
# Pages are 0-indexed. To start from the beginning, pass "0" as the start.
# The "end" is exclusive. To pull through page 282, pass "282" as the end.
# (Page 282 is at index 281.)
[ "$3" ] || { >&2 echo "usage: $0 <pause> <start> <end>" && exit 1; }

pause="$1"
page="$2"

while [ "$page" != "$3" ]; do
    # only show diagnostic output if this is an interactive terminal
    [ ! -t 1 ] || echo "Fetching page $(expr $page + 1)..."
    curl -s "https://www.foodista.com/browse/recipes?page=$page" \
        | grep -oP '<a href="/recipe/\K[^\"]+' \
        | sed 's|^|https://www.foodista.com/recipe/|'
    sleep $pause
    page=$(expr $page + 1)
done
I made it a little more user-friendly than was necessary for my own personal use because I wanted it to be somewhat easy for others to follow in my footsteps if they so chose.
I then fed the URLs into a script that downloaded the HTML for each recipe's page and plopped it into a directory structure matching the structure of the URL given:
#!/usr/bin/env sh [ "$2" ] || { >&2 echo "usage: $0 <cache directory> <url list file>" && exit; } export CACHE_DIR="$1" tmp="$(mktemp)" trap 'rm "$tmp"' EXIT INT HUP cat > "$tmp" <<"EOF" url="$1" path="$CACHE_DIR/$(echo "$url" | sed 's|https\?://||')" if [ -f "$path" ]; then echo "Already exists, skipping: $path" else echo "Caching to $path" dir="$(dirname "$path")" mkdir -p "$dir" curl -s -o "$path" "$url" # rate limit, don't be *too* obnoxious sleep 1 fi EOF chmod +x "$tmp" cat "$2" | xargs -P 10 -n 1 "$tmp"
As I converted these files from HTML to the YAML-based recipe format I'd settled on, based on the format used by Chowdown, I discovered there were 372 naming collisions:
$ find html-cache -type f -exec basename {} \; > recipe-names
$ sort recipe-names | uniq > recipe-names-unique
$ expr $(wc -l < recipe-names) - $(wc -l < recipe-names-unique)
372
$
I resorted to incorporating each recipe's unique ID number into its filename, and I decided to put all the files in the same folder, flattening the hierarchy I'd earlier created. If my effort had been less piecemeal, I might've thought to use this convention during the downloading phase, but for whatever reason I was still working on this project in small pieces with long breaks in between.
#!/usr/bin/env sh
set -e

find "$1" -type f | while read f; do
    tmp="$(echo "$f" | rev | cut -d/ -f-2 | rev)"
    new="$(echo "$tmp" | cut -d/ -f2)-$(echo "$tmp" | cut -d/ -f1)"
    mv "$f" "$1/$new"
    rmdir "$(dirname "$f")"
done
Finally, after a false start with grep and sed, I put together a shell script using pup to turn the HTML files into YAML files, and to download each recipe's photo:
#!/usr/bin/env sh
# Dependency: pup
set -e
[ "$2" ] || { echo "usage: $0 <recipe dir> <html source>"; exit 1; }

echo "Converting $2"
mkdir -p "$1/images" || true

img="$1/images/$(basename "$2").jpg"
imgurl="$(pup -f "$2" 'div.featured-image img attr{src}')"
[ -f "$img" ] || curl -s -o "$img" "$imgurl"

title="$(pup -f "$2" '#page-title text{}')"
author="$(pup -f "$2" '.username text{}')"

imgcredit="$(pup -f "$2" 'div.featured-image a text{}')"
if [ "$imgcredit" ]; then
    imgcrediturl="$(pup -f "$2" 'div.featured-image a attr{href}' | tail -n1)"
else
    imgcrediturl=""
    imgcredit="$author"
fi

description="$(pup -f "$2" 'div.field-type-text-with-summary text{}' \
    | sed -z 's/\n\n\+/\n\n/g')"

ingredients="$(pup -f "$2" 'div[itemprop="ingredients"]' \
    | tr -d "\n" \
    | sed 's|</div>|</div>\n|g; s|<[^>]\+>||g;' \
    | sed 's/^ \+//g; s/^/- /g' | tr -s ' ')"

directions="$(pup -f "$2" 'div[itemprop="recipeInstructions"].step-body' \
    | tr -d "\n" \
    | sed 's|</div>|</div>\n|g; s|<[^>]\+>||g;' \
    | sed 's/^ \+//g; s/^[0-9]\+\. \+//g; s/^/- /g' | tr -s ' ')"

tags="$(pup -f "$2" 'div.field-type-taxonomy-term-reference a text{}' \
    | tr "\n" "," | sed 's/,$//g; s/,/, /g;')"

cat > "$1/$(basename "$2").yml" <<EOF
---
layout: recipe
title: $title
author: $author
license: https://creativecommons.org/licenses/by/3.0/
image: $img
image_credit: $imgcredit
image_credit_url: $imgcrediturl
tags: $tags
ingredients:
$ingredients
directions:
$(echo "$directions" | sed 's/  / /g')
---

$(echo "$description" | sed 's/  / /g')
EOF
Running this script on each HTML file and saving the output under a "recipes" directory produced a 733MB archive of YAML files and images, with 564MB of it being images.
I found myself stalled out on the project after Foodista, and eventually reasoned that I would do better to focus on a small set of recipes that I would actually be able to "complete." So I turned my attention to Grimgrains. This proved to be the right intuition: while it still took me some time after crawling the recipes to get around to actually analyzing their nutrition data, having a clear end goal, and a way to measure my progress against it, helped me put in more effort once I got the task underway than I might have otherwise.
For this, I wound up using make, curl, and pup. I created a git repository on Codeberg called grimgrains-yaml to document and share the process, but as redundancy combats bitrot, and as they're short anyway, I've inlined the scripts that did the crawling here.
The HTML to YAML shell script was a bit simpler than the Foodista one:
#!/bin/bash
cat <<EOF
title: $(pup -f "$1" "main.recipe h1 text{}")
author: Hundred Rabbits (100r.co)
license: https://creativecommons.org/licenses/by-nc-sa/4.0/
image: images/$(basename "$(pup -f "$1" "main.recipe img[src^=../media/recipes] attr{src}" | head -n1)")
image_credit: Hundred Rabbits (100r.co)
image_license: https://creativecommons.org/licenses/by-nc-sa/4.0/
tags:
- vegan
servings: $(pup -f "$1" "main.recipe h2 text{}" | grep -o "^[0-9]\+")
ingredients:
$(paste <(pup -f "$1" "dl.ingredients dt b text{}") \
        <(pup -f "$1" "dl.ingredients dt u text{}") \
    | sed 's|\t| (|; s|^|- |; s|$|)|;')
directions:
$(pup -f "$1" "ul.instructions" \
    | tr -d '\n' \
    | sed "s/'\;/'/g; s/<li>/\n-/g; s/<[^>]\+>//g; s/ \+/ /g; s/ \([\.,!]\)/\1/g;")
EOF
And for crawling the website, grabbing all the HTML and images, and invoking the script above, I used this makefile:
html/:
	mkdir html
	curl --silent "https://grimgrains.com/site/home.html" \
		| pup "ul.recipes li a attr{href}" \
		| grep -v "basic_toothpaste.html" \
		| while read link; do \
			curl --output-dir html -O "https://grimgrains.com/site/$${link}"; \
			sleep 1; \
		done

images/: html/
	mkdir images
	ls html/*.html \
		| while read file; do \
			pup -f "$$file" "main.recipe img[src^=../media/recipes] attr{src}" \
				| head -n1 \
				| sed 's|^../|https://grimgrains.com/|' \
				| xargs curl --output-dir images -O; \
			sleep 1; \
		done

yaml/todo/: html/
	mkdir -p yaml/todo
	ls html/*.html \
		| while read file; do \
			./html-to-yaml.sh "$$file" > "yaml/todo/$$(basename "$${file%.html}.yml")"; \
		done
I spent a fair amount of time trying to put together a TUI-based utility for editing recipe YAML files, much more than I'd anticipated, and with only minor success. This was another point at which the project ended up on the back burner.
In the meantime I learned Guile Scheme and found myself enjoying writing JavaScript-free web apps in the language. After writing some simple personal utilities, I wrote a web app for viewing and searching my nutrition database and my recipe YAML files, with the intention of adding editing capability. In the process of working on the web app, though, I found that having one browser window open for the ingredient search, another open for the recipe being worked on, and a window for editing the recipe YAML in vim was really all I needed for a pleasant-enough editing experience. I finished the Grimgrains nutrition data analysis using this workflow.
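To give a sense of what a JavaScript-free app like this looks like, here is a minimal sketch built on Guile's bundled (web server) module. The "q" parameter, the search-foods stub, and the inline HTML are stand-ins of my own, not code from the actual app:

;; Minimal sketch of a server-rendered search page in Guile Scheme.
;; The route, the "q" parameter, and search-foods are illustrative
;; placeholders, not the real app's code.
(use-modules (web server)
             (web request)
             (web uri)
             (srfi srfi-1)
             (ice-9 match))

(define (query-param request name)
  ;; Pull a single parameter (e.g. "q") out of the query string.
  (let ((query (uri-query (request-uri request))))
    (and query
         (any (lambda (pair)
                (match (string-split pair #\=)
                  ((key value) (and (string=? key name) (uri-decode value)))
                  (_ #f)))
              (string-split query #\&)))))

(define (search-foods term)
  ;; Stub: the real app would query the nutrition database here.
  (list (string-append "results for: " term)))

(define (search-page request body)
  (let* ((term (or (query-param request "q") ""))
         (rows (if (string-null? term) '() (search-foods term))))
    (values '((content-type . (text/html)))
            (string-append
             "<form><input name=\"q\" value=\"" term "\">"
             "<button>Search</button></form><ul>"
             (string-concatenate
              (map (lambda (row) (string-append "<li>" row "</li>")) rows))
             "</ul>"))))

;; Serves on http://localhost:8080/ by default.
(run-server search-page)

Because the form is plain HTML and the results are rendered server-side, the browser never needs to run any script; a vim window alongside two of these pages was enough for the editing workflow described above.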
I may write more about this later, but for now, as it's getting late: with the recipes gathered and their approximate nutrient profiles determined, the rest was fairly simple. I added a "recipes" table to my nutrition database, along with a function in my recipes web app that writes an indicated recipe to that table, as a way of selecting which recipes should be "in the running" for the meal plans. I then wrote a script in Guile Scheme to generate random meal plans, using an elitist genetic algorithm to sift out the ones which best fit my nutrition goals. I have it up on Codeberg as guile-foodbot.
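The core loop has roughly the following shape. This is a simplified sketch of an elitist genetic algorithm rather than guile-foodbot itself; the recipe data, the nutrient goals, and every parameter below are made-up placeholders:

;; Simplified sketch of an elitist genetic algorithm over meal plans.
;; The recipe table, goals, and parameters are invented placeholders.
(use-modules (srfi srfi-1))

;; Per-serving nutrient lists: (calories protein fiber), invented numbers.
(define recipes
  #((450 12 6)     ; lentil soup
    (600 18 4)     ; fried rice
    (350  9 8)     ; oatmeal
    (700 25 5)))   ; bean stew

(define goals '(2000 60 30))   ; daily targets for the same nutrients
(define meals-per-plan 4)
(define population-size 100)
(define elite-count 10)
(define generations 200)

(define (random-plan)
  ;; A plan is a list of recipe indices, one per meal.
  (list-tabulate meals-per-plan
                 (lambda (_) (random (vector-length recipes)))))

(define (plan-totals plan)
  ;; Sum each nutrient across the plan's recipes.
  (fold (lambda (idx totals)
          (map + totals (vector-ref recipes idx)))
        (map (lambda (_) 0) goals)
        plan))

(define (badness plan)
  ;; Sum of squared relative deviations from the goals; lower is better.
  (apply + (map (lambda (total goal)
                  (expt (/ (- total goal) goal) 2))
                (plan-totals plan) goals)))

(define (mutate plan)
  ;; Replace one randomly chosen meal with a random recipe.
  (let ((slot (random meals-per-plan)))
    (map (lambda (idx i)
           (if (= i slot) (random (vector-length recipes)) idx))
         plan (iota meals-per-plan))))

(define (evolve population)
  ;; Keep the best plans unchanged (elitism), refill the rest with mutants.
  (let* ((ranked (sort population (lambda (a b) (< (badness a) (badness b)))))
         (elites (list-head ranked elite-count)))
    (append elites
            (list-tabulate (- population-size elite-count)
                           (lambda (_)
                             (mutate (list-ref elites (random elite-count))))))))

(define (best-plan)
  (let loop ((population (list-tabulate population-size
                                        (lambda (_) (random-plan))))
             (gen 0))
    (if (= gen generations)
        (car (sort population (lambda (a b) (< (badness a) (badness b)))))
        (loop (evolve population) (+ gen 1)))))

(display (best-plan)) (newline)

The elitism is the part doing the real work here: the top-ranked plans survive each generation untouched, so the best solution found so far can never be lost to mutation, while the rest of the population keeps exploring.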
Right now it's hardcoded with my own nutrition goals and genetic algorithm parameters, but I hope to make these more readily user-selectable before I'm done.