🎓 AI Evaluation Case Study • Course Project • 🚧 In Progress

Precision Nutrition Meets Rigorous Evaluation

Building a recipe generator that doesn't hallucinate — and proving it with systematic evaluation methodology learned from the Maven Evals course.

Overview

What is NalaBheem?

A precision nutrition recipe assistant that generates accurate, goal-aligned recipes — backed by rigorous evaluation.

NalaBheem is an AI-powered recipe assistant that generates nutrition-precise recipes tailored to specific fitness goals: weight loss, muscle gain, or maintenance. Named after legendary strength and mastery, it enforces strict macro targets, requires ingredient qualifiers, and validates nutrition calculations.

But building an AI that gives accurate nutrition advice is only half the challenge. The other half? Proving it actually works.

This project applies rigorous evaluation methodology learned from a Maven AI Evals course — combining open coding of 100+ traces, custom error taxonomies, and systematic comparison of manual vs. automated evaluation to surface real issues.

100+

Traces Open Coded

6

Error Codes Defined

65%

Automated Pass Rate

Test Interactions

Methodology

The Evaluation Process

From raw AI outputs to systematic quality measurement — a methodology that surfaces real issues, not just vibes.

🔧

Build the Agent

Created a precision nutrition bot with strict macro targets, ingredient qualifiers, and goal-specific constraints.

🔍

Open Code Traces

Manually reviewed 100+ bot responses, identifying patterns and failure modes without predefined categories.

🏷️

Define Taxonomy

Distilled observations into 6 precise error codes covering macros, ingredients, and nutrition accuracy.

⚡

Run Evaluations

Used PromptFoo for automated evals, then compared against manual annotations to measure evaluator quality.

Error Taxonomy

The 6 Critical Error Codes

A precision-focused taxonomy that catches real nutrition failures — not just formatting issues.

MACRO_UNACHIEVABLE

Impossible Constraints

Triggered when user constraints make the requested macros mathematically impossible to achieve.

ING_QUALIFIER_MISSING

Missing Ingredient Details

Ingredient lacks required qualifiers: fat %, lean %, skin-on/off, raw vs cooked, drained weight.

MACRO_PROTEIN_LOW

Protein Below Target

Protein per serving falls below the goal minimum: under 40 g for weight loss, under 50 g for muscle gain, under 30 g for maintenance.
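
As a rough illustration, the threshold check behind this code could look like the sketch below. The goal keys, the function name `check_protein`, and the choice to treat a value exactly at the minimum as passing are assumptions made for the example, not details taken from the NalaBheem implementation.

```python
from typing import Optional

# Per-serving protein minimums in grams, taken from the taxonomy above.
# The dictionary keys are hypothetical goal identifiers.
PROTEIN_MIN_G = {
    "weight_loss": 40.0,
    "muscle_gain": 50.0,
    "maintenance": 30.0,
}


def check_protein(goal: str, protein_g: float) -> Optional[str]:
    """Return the error code if protein is below the goal minimum, else None."""
    minimum = PROTEIN_MIN_G[goal]  # raises KeyError for an unknown goal
    # Assumption: a value exactly at the minimum counts as meeting the target.
    if protein_g < minimum:
        return "MACRO_PROTEIN_LOW"
    return None


print(check_protein("weight_loss", 35.0))
print(check_protein("muscle_gain", 50.0))
```

Keeping the thresholds in one table means a policy change touches a single place rather than every evaluator that reads them.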

MACRO_CARB_POLICY_VIOLATION

Starchy Carbs Violation

Weight-loss recipe includes starchy carbs by default when the policy prohibits them.

NUTR_FIELD_MISSING

Incomplete Nutrition Panel

Fiber or sodium is missing from the required nutrition panel output fields.
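
A minimal sketch of this check, assuming the nutrition panel has already been parsed into a dictionary. The key names `fiber_g` and `sodium_mg` and the helper `missing_panel_fields` are hypothetical, since the page does not show the actual output schema.

```python
# Fields the taxonomy requires in every nutrition panel.
# Key names are assumptions for this example.
REQUIRED_PANEL_FIELDS = ("fiber_g", "sodium_mg")


def missing_panel_fields(panel: dict) -> list:
    """Return the required fields that are absent or set to None."""
    # A value of 0 is a legitimate measurement, so only None/absent counts as missing.
    return [field for field in REQUIRED_PANEL_FIELDS if panel.get(field) is None]


panel = {"calories": 450, "protein_g": 42, "fiber_g": 8}
missing = missing_panel_fields(panel)
if missing:
    print("NUTR_FIELD_MISSING:", ", ".join(missing))
```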

NUTR_MISMATCH_CALC

Calculation Deviation

Computed values deviate from USDA reference data by more than ±5% for macros or by more than ±20 kcal for calories.
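
One way to express these tolerances is sketched below. It assumes the 5% limit is relative to the reference value and the 20 kcal limit is absolute. The dictionary keys and the function name are hypothetical, and the reference values would come from a USDA lookup that is not shown here.

```python
MACRO_KEYS = ("protein_g", "carbs_g", "fat_g")  # hypothetical field names


def find_mismatches(reported: dict, reference: dict,
                    macro_tol: float = 0.05, kcal_tol: float = 20.0) -> list:
    """Return the fields whose reported value falls outside the tolerance."""
    mismatched = []
    for key in MACRO_KEYS:
        ref = reference[key]
        if ref == 0:
            # Relative error is undefined at zero; flag any non-zero report.
            off = reported[key] != 0
        else:
            off = abs(reported[key] - ref) / ref > macro_tol
        if off:
            mismatched.append(key)
    # Calories use an absolute tolerance in kcal.
    if abs(reported["calories"] - reference["calories"]) > kcal_tol:
        mismatched.append("calories")
    return mismatched


reference = {"protein_g": 40.0, "carbs_g": 30.0, "fat_g": 15.0, "calories": 500.0}
reported = {"protein_g": 45.0, "carbs_g": 30.0, "fat_g": 15.0, "calories": 530.0}
fields = find_mismatches(reported, reference)
if fields:
    print("NUTR_MISMATCH_CALC:", fields)
```

Returning the list of offending fields, rather than a bare pass/fail, gives the annotator something concrete to compare against notes such as "underestimated calories".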

Tools Built

Custom Evaluation Infrastructure

Purpose-built tools for annotating, tracking, and comparing evaluation results.

Screenshot 01: Manual Annotation Interface

Screenshot 02: Annotation Tracker

Screenshot 03: Evaluation Results Comparison

Screenshot 04: Agreement Rate Analysis

Current State

What the Evals Revealed

The evaluation process is working — it's catching real issues that need to be fixed.

Manual vs Automated Evaluation

65% Pass Rate

🔄 Iteration Needed

40%

Agreement Rate

60%

Disagreement Rate

8/20

Cases Aligned

Issues Detected — Ready for Prompt Iteration

Type 1 Error (false pass): 8 cases

PromptFoo PASS, Manual FAIL: The automated evaluators missed nutrition calculation errors that manual review caught. Annotation notes mention "underestimated calories" and "macros way off". These point to real issues in the prompt that need correction.

Type 2 Error (false fail): 4 cases

PromptFoo FAIL, Manual PASS: All four were flagged for "Nutrition label missing required fields: Fiber, Sodium". The NUTR_FIELD_MISSING evaluator may need recalibration for context-specific requirements.
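
The agreement figures and the two disagreement types above can be tallied with a few lines of Python. This is a sketch, not the project's comparison tool. The verdict pairs are reconstructed from the reported counts, and the split of the 8 aligned cases into 5 pass/pass and 3 fail/fail is an assumption inferred from the 65% automated pass rate applying to the same 20 cases.

```python
from collections import Counter


def compare_verdicts(pairs):
    """Summarise (automated, manual) verdict pairs, treating manual as reference."""
    counts = Counter(pairs)
    total = len(pairs)
    agreed = counts[("PASS", "PASS")] + counts[("FAIL", "FAIL")]
    automated_passes = sum(1 for automated, _ in pairs if automated == "PASS")
    return {
        "total": total,
        "aligned": agreed,
        "agreement_rate": agreed / total,
        "automated_pass_rate": automated_passes / total,
        "auto_pass_manual_fail": counts[("PASS", "FAIL")],  # "Type 1" above
        "auto_fail_manual_pass": counts[("FAIL", "PASS")],  # "Type 2" above
    }


# Reconstructed from the reported counts; the 5/3 split is an assumption.
pairs = (
    [("PASS", "FAIL")] * 8
    + [("FAIL", "PASS")] * 4
    + [("PASS", "PASS")] * 5
    + [("FAIL", "FAIL")] * 3
)
summary = compare_verdicts(pairs)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With these counts the tally gives 8 of 20 cases aligned (40% agreement) and a 65% automated pass rate, which matches the figures reported above.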

Roadmap

Next Steps

The evaluation framework is catching issues — now it's time to fix them.

🎯 Prompt Refinement

Address the 8 Type 1 errors by improving macro calculation instructions and adding explicit calorie validation steps to the system prompt.

⚙️ Evaluator Tuning

Recalibrate the NUTR_FIELD_MISSING evaluator to be context-aware, since not every query requires fiber and sodium fields.

🔁 Re-run & Compare

After prompt changes, re-run the full eval suite to measure improvement and track progress toward higher agreement rates.
