About Sully.ai

Sully.ai is building the AI operating system for healthcare. Our mission is to make doctors superhuman through AI, so every patient can get faster, better, and more accessible care.

Healthcare is one of the most important systems in the world, yet clinicians still lose hours every day to documentation, scheduling, intake, coding, and administrative work. We are changing that with AI medical employees: Scribes, Receptionists, Coders, Consultants, and other agents that operate inside real clinical workflows.

We are backed by Amity Ventures, YC, and Baidu Ventures, among others. We have raised $40M+ and are already deployed across hundreds of healthcare organizations. Our products process tens of thousands of clinical encounters every day, automate millions of minutes of administrative work, and help clinicians spend more time with patients.

The role

As an Applied AI Engineer on the Research team, you will work directly with our research and engineering org on the AI medical employees at the core of Sully: Scribe, Receptionist, Consultant, Coder, and what comes next. Your job is to turn the frontier of agent research, within days, into agents and eval systems that run in production for real clinicians.

This is a builder-researcher role. You read the latest work in agent, harness and context engineering, evals, and reinforcement learning, then ship from it. We do not separate "research" and "engineering" — you are expected to do both, and the work is measured by what lands in front of clinicians, not by a benchmark in isolation. Our agents are architecture- and configuration-driven: most of the behavior comes from decomposition, prompts, routing logic, context assembly, and evals rather than one-off code. You will own the systems that decide what the model sees, measure whether it worked, and make it better on its own over time.

You will own three things: evals, agents, and self-improving loops. Just as importantly, you will help define how Sully does applied research at all: the harnesses, eval infrastructure, and feedback loops that make every future agent faster to build and safer to ship. The environment is high-stakes by design: in clinical AI there is no margin for error on accuracy, and that constraint is the most interesting part of the job.

What you’ll do

Build and scale evaluation systems

Design and scale automated evaluation pipelines (LLM-as-judge + clinician review) with clinical-grade benchmarks for accuracy, factuality, attribution, safety, and instruction-following.
Build the tooling and dashboards researchers use to run evals, and the CI gates that block regressions before they reach patients.
Treat measurement as a science: understand reliability, variance, and scalability of your eval methodology, not just the headline number.
Turn fuzzy notions of "is this good enough?" into clear, defensible metrics that researchers, engineers, and clinical stakeholders can act on.

Build and ship medical AI agents

Design, implement, and ship agents (Scribe, Receptionist, Consultant, Coder, and new ones) on Sully's stack — decomposition, parallel specialists, dynamic output contracts, and per-agent model routing.
Engineer the harness: memory, context compression, retrieval, session/state, and communication patterns between agents.
Own prompts as infrastructure — structured, versioned, composable per specialty and per organization — because without a correction loop as a safety net, prompt quality is load-bearing.
Take a research result to production end-to-end, fast.

Build self-improving and auto-evolving loops

Build closed-loop systems where clinician feedback and production traces drive targeted, evidence-grounded improvements to the specific components that need them — prompt quality that improves weekly, not quarterly.
Build self-improvement loops that automate model understanding and surface where each agent is failing and why.
Stand up continuous evaluation so production traffic is the experiment and the observability layer is the eval harness.

Push context and loop engineering forward

Track emerging research (context engineering, loop engineering, agent harnesses, evals, RL) and bring the best ideas into our agents quickly.
Experiment with the frontier: RL environments for clinical tasks, small fine-tuned models for focused extraction, synthetic data generation, and model selection per agent role.
Run rigorous experiments: define the hypothesis, build the pipeline, run the model, analyze the result, decide what to do next.

Partner across research, engineering, and clinical

Work directly with Engineering, QA, and Clinical to validate, harden, and roll out new capabilities.
Explain technical tradeoffs clearly to research, product, and clinical stakeholders.
Write up findings so they shape product decisions — and so others can build on them.

What good looks like

Fuzzy quality questions become numbers: every shipped agent has automated evals with regression gates, and "did this get better?" is answerable in minutes, not weeks.
Deployed agents perform reliably in production — measurable gains in accuracy, hallucination/omission reduction, or safety on the workflows you own.
New agent capabilities go from research idea to production in days, end-to-end, without sacrificing clinical safety.
A working closed-loop improvement system turns clinician feedback and traces into targeted component fixes, so agent quality improves on a weekly cadence.
The harness, eval tooling, and templates you build make the next agent faster to ship — applied research becomes more scalable, not more bespoke.
The research team has a builder who can read a paper at midnight and ship the improvement the same week, and who raises the technical bar for everyone around them.

What we’re looking for

Required

Strong Python and applied ML background; comfortable with PyTorch / Hugging Face and an agent framework.
Proven experience designing agentic systems and LLM evaluation / benchmarking frameworks — not just calling an API, but building the harness and the measurement around it.
Demonstrated ability to design rigorous experiments and translate findings into production.
Fluency with prompting and context engineering: you have opinions about what the model sees and why it matters.
A portfolio of things you have built with LLMs that got them to do hard tasks — agents, benchmarks, synthetic data, fine-tunes, harnesses. Show us your contribution. We weight slope over pedigree.
Strong written communication and documentation habits — you can turn a vague behavioral problem into a hypothesis, an experiment, a result, and a clear writeup for technical and clinical stakeholders.
Hungry for new information: you read papers and ship from them, you stay current, and you have taste about what is signal.
High ownership and bias toward action.

Strong signals

Reinforcement learning, post-training, reward modeling, or fine-tuning small models for focused tasks.
Multi-agent systems and agent harness design (memory, context compression, routing, decomposition).
Experience with healthcare technology, EHRs, or clinical integrations and awareness of HIPAA/PHI handling.
Published research or deep open-source / applied work in LLMs and agent evaluation.
Startup or founder experience, and comfort operating in 0→1, high-ambiguity environments.
Experience using AI-native development tools to move quickly.
High-stakes / regulated domain experience where accuracy has zero margin for error.

Who this role is not for

This role may not be a fit if you:

Want to do pure research with no obligation to ship into production.
Want to do pure application engineering and avoid the measurement and research side.
Treat evals as an afterthought rather than the core craft.
Need fully specified tickets before taking action.
Are uncomfortable with ambiguity or with the stakes of clinical AI.
Need a stable, fully-documented product — our agents move fast and you will help build the harnesses, evals, and playbooks as you go.
Keep up with the field passively but rarely turn new ideas into working systems.

Interview process

Introductory conversation
Technical screen or systems walkthrough
"Show us what you built" — walk through an LLM agent, eval harness, or fine-tune you have shipped
Applied eval / agent exercise (e.g. design an eval harness for a clinical extraction task)
Final cross-functional interview

Location: Open to US remote, but ideally office-based or hybrid in the Bay Area.
