This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Overview

Autoevals is a dual-language library (TypeScript + Python) for evaluating AI model outputs. It provides LLM-as-a-judge evaluations, heuristic scorers (e.g., Levenshtein distance), and statistical metrics (e.g., BLEU). Developed by Braintrust.
## Development Commands

### JavaScript/TypeScript

```bash
pnpm install --frozen-lockfile    # Install dependencies
pnpm run build                    # Build JS (outputs to jsdist/)
pnpm run test                     # Run all JS tests with vitest
pnpm run test -- js/llm.test.ts   # Run a single test file
pnpm run test -- -t "test name"   # Run a specific test by name
```

### Python

```bash
make develop                                   # Set up Python venv with all dependencies
source env.sh                                  # Activate the venv
pytest                                         # Run all Python tests
pytest py/autoevals/test_llm.py                # Run a single test file
pytest py/autoevals/test_llm.py::test_openai   # Run a specific test
pytest -k "test_name"                          # Run tests matching a pattern
```

### Linting

```bash
pre-commit run --all-files   # Run all linters (black, ruff, prettier, codespell)
make fixup                   # Same as above
```

## Architecture

The library maintains parallel implementations in TypeScript (`js/`) and Python (`py/autoevals/`). Both share:
- The same evaluation templates (`templates/*.yaml`)
- The same `Score` interface: `{name, score (0-1), metadata}` (see the sketch below)
- The same scorer names and behavior
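Concretely, a returned score is just that small record. A minimal sketch in Python (the scorer name and metadata here are invented for illustration; `Score` is assumed to be importable from the package root):

```python
from autoevals import Score

# Hypothetical scorer result, for illustration only
s = Score(name="ExactMatch", score=1.0, metadata={"reason": "output matched expected"})
assert 0.0 <= s.score <= 1.0
```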
### Key modules

- `llm.ts`/`llm.py` - LLM-as-a-judge scorers (Factuality, Battle, ClosedQA, Humor, Security, Sql, Summary, Translation); see the usage sketch after this list
- `ragas.ts`/`ragas.py` - RAG evaluation metrics (ContextRelevancy, Faithfulness, AnswerRelevancy, etc.)
- `string.ts`/`string.py` - Text similarity (Levenshtein, EmbeddingSimilarity)
- `json.ts`/`json.py` - JSON validation and diff
- `oai.ts`/`oai.py` - OpenAI client wrapper with caching
- `score.ts`/`score.py` - Core Score type and Scorer base class
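For orientation, a typical call into one of the LLM-as-a-judge scorers looks like the sketch below (requires `OPENAI_API_KEY` in the environment; the input/output strings are illustrative):

```python
from autoevals.llm import Factuality

evaluator = Factuality()
result = evaluator(
    output="People's Republic of China",
    expected="China",
    input="Which country has the highest population?",
)
print(result.score)     # numeric score in [0, 1]
print(result.metadata)  # includes the model's rationale
```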
### Template system

YAML templates in `templates/` define LLM classifier prompts. Templates use Mustache syntax (`{{variable}}`). The `LLMClassifier` class loads these templates and handles the following (see the sketch after this list):

- Prompt rendering with a chain-of-thought (CoT) suffix
- Tool-based response parsing via the `select_choice` function
- Score mapping from choice letters to numeric scores
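The same machinery can back custom classifiers. A minimal sketch with an invented prompt and choice set (the constructor arguments follow the public `LLMClassifier` API):

```python
from autoevals import LLMClassifier

# Hypothetical classifier for illustration: {{output}} and {{expected}} are
# Mustache variables filled in from the call-time arguments.
evaluator = LLMClassifier(
    name="TitleQuality",
    prompt_template="Which title is better?\n1: {{output}}\n2: {{expected}}",
    choice_scores={"1": 1, "2": 0},  # maps each choice to a numeric score
    use_cot=True,                    # appends the chain-of-thought suffix
)
result = evaluator(output="Fix login crash on empty password", expected="auth bug")
print(result.score)
```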
### Scorer interface

Python:

```python
class Scorer(ABC):
    def eval(self, output, expected=None, **kwargs) -> Score: ...        # Sync
    async def eval_async(self, output, expected=None, **kwargs): ...     # Async
    def __call__(self, *args, **kwargs): ...                             # Alias for eval()
```

TypeScript:

```typescript
type Scorer<Output, Extra> = (
  args: ScorerArgs<Output, Extra>,
) => Score | Promise<Score>;
// All scorers are async functions
```
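Both entry points can be exercised directly; a short sketch using the `Levenshtein` scorer from `string.py`:

```python
import asyncio

from autoevals.string import Levenshtein

scorer = Levenshtein()

# __call__ is an alias for the synchronous eval()
print(scorer("kitten", "sitting").score)

# eval_async returns the same Score shape
print(asyncio.run(scorer.eval_async("kitten", "sitting")).score)
```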
## Testing

Tests require:

- `OPENAI_API_KEY` or `BRAINTRUST_API_KEY` - For LLM-based evaluations
- `OPENAI_BASE_URL` (optional) - Custom API endpoint
- Python tests use `pytest` with `respx` for HTTP mocking (see the sketch after this list)
- TypeScript tests use `vitest` with `msw` for HTTP mocking
- Tests that call real LLM APIs need valid API keys
- Test files are colocated: `test_*.py` (Python), `*.test.ts` (TypeScript)
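For reference, the general respx pattern looks like the sketch below; the endpoint URL and response body are illustrative assumptions, not fixtures copied from this repo's test suite:

```python
import httpx
import respx

@respx.mock
def test_mocked_chat_completion():
    # Stub the OpenAI chat completions endpoint (illustrative URL/payload)
    route = respx.post("https://api.openai.com/v1/chat/completions").mock(
        return_value=httpx.Response(
            200,
            json={"choices": [{"message": {"role": "assistant", "content": "B"}}]},
        )
    )
    resp = httpx.post(
        "https://api.openai.com/v1/chat/completions",
        json={"model": "gpt-4o", "messages": []},
    )
    assert route.called
    assert resp.json()["choices"][0]["message"]["content"] == "B"
```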