Evals & Continuous Learning Engineer
The Opportunity
Shipping reliable AI applications means closing the loop: capture what happened in production, turn it into evaluation data, measure quality, and feed improvements back into the system.
We already have a real foundation in production: evals and experiments built into Logfire, our observability platform, plus our open-source pydantic-evals library. We're looking for someone who has worked on evaluation or LLM-observability platforms to own this work end to end and to push it toward genuine continuous learning, where systems measurably improve from their own production data.
This might be you if you've worked on a product like Braintrust, Langfuse, LangSmith, Arize/Phoenix, or Humanloop — or built serious internal eval tooling.