
Evals API

Note – All endpoints sit behind the same authentication layer used across Open Agents Builder. You must send a valid Bearer API key and the database-id-hash header with every request.


1. Overview

With this API you can:

| # | Capability | Endpoint |
|---|------------|----------|
| 1 | Generate test cases directly from an agent's system prompt | `POST /api/agent/{agentId}/evals/generate` |
| 2 | Run one or many cases (streamed ND-JSON) | `POST /api/agent/{agentId}/evals/run` |
| 3 | Adjust a failed case so that it becomes the new source of truth | `POST /api/agent/{agentId}/evals/adjust` |

2. Mandatory Request Headers

| Header | Description |
|--------|-------------|
| `Authorization: Bearer <OPEN_AGENTS_BUILDER_API_KEY>` | API key obtained from Settings → API Keys |
| `database-id-hash: <YOUR_DATABASE_ID_HASH>` | Constant per-workspace hash (visible in Settings) |
| `Content-Type: application/json` | Optional in general; required for POST bodies |

Example

```
Authorization: Bearer abc1234exampleKey
database-id-hash: 35f5c5b139a6b569d4649b788c1851831eb44d8e32b716b8411ec6431af8121d
Content-Type: application/json
```

3. Data Models & Validation Rules (Exact Schema)

Below are the canonical TypeScript-style definitions, reconstructed directly from the Zod validators in the source code.

3.1 Tool Call (inside a test-case request)

```ts
export interface ToolCallRequest {
  name: string;                       // tool identifier, e.g. "checkAvailability"
  arguments: Record<string, unknown>; // JSON-serialisable args
}
```

3.2 Chat Message

```ts
export interface ChatMessage {
  role: 'user' | 'assistant';
  content: string;
  /** present only in assistant messages that simulate tool usage */
  toolCalls?: ToolCallRequest[];
}
```

Validation notes

  • role must be exactly "user" or "assistant".
  • content cannot be empty.
  • toolCalls is optional and, when provided, must be a non-empty array of valid ToolCallRequests.
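
For reference, a message that exercises the optional toolCalls field would look like this (the arguments shown are invented for illustration; only the shape matters):

```json
{
  "role": "assistant",
  "content": "Let me check availability for you.",
  "toolCalls": [
    { "name": "checkAvailability", "arguments": { "service": "haircut", "date": "next Monday" } }
  ]
}
```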

3.3 Test Case (/evals/generate & /evals/run payload)

```ts
export interface TestCase {
  id: string;              // unique across the suite (client-generated or taken from /generate)
  messages: ChatMessage[]; // ≥ 2 items (at least 1 user + 1 assistant)
  expectedResult: string;  // natural-language assertion to grade against
}
```

Zod reference (testCaseSchema):

```ts
z.object({
  id: z.string(),
  messages: z.array(
    z.object({
      role: z.enum(['user', 'assistant']),
      content: z.string(),
      toolCalls: z.array(
        z.object({
          name: z.string(),
          arguments: z.record(z.unknown())
        })
      ).optional()
    })
  ),
  expectedResult: z.string()
})
```
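
If you mirror this schema client-side, you can pre-validate suites before sending them and skip a round-trip that would end in a 400/500. The sketch below only assumes you have zod installed; `candidate` stands for untrusted input such as JSON loaded from disk:

```ts
import { z } from 'zod';

// Client-side mirror of testCaseSchema above, used for pre-flight checks.
const testCaseSchema = z.object({
  id: z.string(),
  messages: z.array(z.object({
    role: z.enum(['user', 'assistant']),
    content: z.string(),
    toolCalls: z.array(z.object({
      name: z.string(),
      arguments: z.record(z.unknown())
    })).optional()
  })),
  expectedResult: z.string()
});

declare const candidate: unknown; // e.g. JSON loaded from disk

const parsed = testCaseSchema.safeParse(candidate);
if (!parsed.success) console.error(parsed.error.issues);
```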

3.4 Evaluation Object (returned by /evals/run & /evals/adjust)

```ts
export interface EvaluationResult {
  isCompliant: boolean; // true ⇢ meets or exceeds expectations
  explanation: string;  // concise reasoning produced by the LLM rubric
  score: number;        // 0.0 → 1.0
}
```

Zod reference (evaluationSchema):

```ts
z.object({
  isCompliant: z.boolean(),
  explanation: z.string(),
  score: z.number().min(0).max(1)
})
```

3.5 Conversation Flow (object streamed back during /evals/run)

```ts
export interface ToolCallRuntime {
  toolCallId: string;
  toolName: string;
  args: Record<string, unknown>;
  result: unknown;
}

export interface ConversationFlow {
  messages: (ChatMessage & { // same structure as request messages
    /** runtime-filled tool call results */
    toolCalls?: ToolCallRuntime[];
  })[];
  /** flattened list of tool calls for quick inspection */
  toolCalls?: ToolCallRuntime[];
}
```
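
For quick debugging, the flattened toolCalls list can be walked directly. A minimal sketch, assuming `flow` holds the conversationFlow field from a test_case_update event:

```ts
// flow: ConversationFlow – e.g. evt.data.conversationFlow from a stream event
declare const flow: ConversationFlow;

for (const call of flow.toolCalls ?? []) {
  console.log(`${call.toolName}(${JSON.stringify(call.args)}) →`, call.result);
}
```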

These models are strictly enforced at runtime via Zod; any payload that deviates will trigger a 400 Bad Request (during adjustment/generation) or a 500 Internal Server Error (during run).

4. Endpoints & Usage

4.1 GENERATE /api/agent/{id}/evals/generate

Create a ready-to-run test-suite from your agent’s prompt.

| Method | Body |
|--------|------|
| POST | `{ "prompt": "Full system prompt…" }` |
Example – Generate test-cases (cURL)
```bash
curl -X POST \
  https://app.example.com/api/agent/agt_123/evals/generate \
  -H "Authorization: Bearer abc1234exampleKey" \
  -H "database-id-hash: 35f5c5b139a6b569..." \
  -H "Content-Type: application/json" \
  -d '{ "prompt": "You are a salon booking assistant..." }'
```

Success (200)

```json
{
  "testCases": [
    {
      "id": "tc-book-haircut",
      "messages": [
        { "role": "user", "content": "I want to book a haircut." },
        { "role": "assistant", "content": "Sure – what time suits you?" }
      ],
      "expectedResult": "The haircut is successfully booked for next Monday at 10 AM."
    },
    ...
  ]
}
```
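
To keep the generated suite in source control (see Best Practices below), a small wrapper works well. The hypothetical generateSuite helper below is only a sketch: the env-var names and file path are assumptions, and TestCase is the interface from §3.3:

```ts
import fetch from 'node-fetch';
import { writeFile } from 'fs/promises';

// Hypothetical helper – not part of the API: generate a suite and commit it to Git.
async function generateSuite(agentId: string, prompt: string): Promise<TestCase[]> {
  const res = await fetch(`https://app.example.com/api/agent/${agentId}/evals/generate`, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPEN_AGENTS_BUILDER_API_KEY}`,
      'database-id-hash': process.env.DATABASE_ID_HASH!,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ prompt })
  });
  if (!res.ok) throw new Error(`generate failed with ${res.status}`);
  const { testCases } = (await res.json()) as { testCases: TestCase[] };
  await writeFile(`evals/${agentId}.json`, JSON.stringify(testCases, null, 2));
  return testCases;
}
```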

Errors

| Code | Body | Meaning |
|------|------|---------|
| 500 | `{ "error": "Failed to generate test cases" }` | LLM failure or invalid input |

4.2 RUN /api/agent/{id}/evals/run

Execute one or many test cases, streaming progress and results as they happen.

| Method | Body |
|--------|------|
| POST | `{ "testCases": [ /* TestCase[] */ ] }` |

The endpoint responds with `Transfer-Encoding: chunked` and emits ND-JSON (one JSON object per line).

4.2.1 Stream Event Shapes

| `type` | Payload (`data`) fields |
|--------|-------------------------|
| `test_case_update` | `status` — one of `running`, `TX`, `RX`, `completed`, `warning`, `failed` — plus `actualResult`, `evaluation`, `conversationFlow`, `sessionId`, … |
| `test_case_error` | `error`, `status: "failed"`, `sessionId`, plus the full original test case |
| `error` | Top-level fatal issue (`error` string) |
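
For orientation, a terminal `test_case_update` line might look like the following; every value is invented and the exact field set can vary:

```json
{"type":"test_case_update","data":{"id":"tc-book-haircut","status":"completed","actualResult":"Sure – what time suits you?","evaluation":{"isCompliant":true,"explanation":"…","score":0.92},"sessionId":"sess_123"}}
```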

4.2.2 Status Colours & Thresholds

| Final status | Condition | Meaning |
|--------------|-----------|---------|
| `completed` | score ≥ 0.75 | Passed |
| `warning` | 0.5 ≤ score < 0.75 | Partial match |
| `failed` | score < 0.5 | Failed |
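
If you need to re-derive the final status client-side (e.g. for custom gating), the mapping from the table is only a few lines; statusForScore is a hypothetical helper, not part of the API:

```ts
type FinalStatus = 'completed' | 'warning' | 'failed';

// Mirrors the thresholds in the table above.
function statusForScore(score: number): FinalStatus {
  if (score >= 0.75) return 'completed';
  if (score >= 0.5) return 'warning';
  return 'failed';
}
```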
Example – Run two cases and print results (Node)

```ts
import fetch from 'node-fetch';
import { createInterface } from 'readline';

// myCases: TestCase[] – e.g. two cases taken from /evals/generate
const body = JSON.stringify({ testCases: myCases });

const res = await fetch('https://app.example.com/api/agent/agt_123/evals/run', {
  method: 'POST',
  headers: {
    Authorization: 'Bearer abc1234exampleKey',
    'database-id-hash': '35f5c5b139a6b569...',
    'Content-Type': 'application/json'
  },
  body
});

// The body streams ND-JSON; read one line (= one event) at a time.
const rl = createInterface({ input: res.body! });
rl.on('line', line => {
  const evt = JSON.parse(line);
  if (evt.type === 'test_case_update' && ['completed', 'failed', 'warning'].includes(evt.data.status)) {
    console.log(evt.data.id, '→', evt.data.status, `(${evt.data.evaluation.score})`);
  }
});
```

Errors

| Code | Body |
|------|------|
| 500 | `{ "error": "Failed to run evaluations" }` |

4.3 ADJUST /api/agent/{id}/evals/adjust

Let the LLM rewrite a failing test-case to match a new, legitimate behaviour.

| Method | Body |
|--------|------|
| POST | `{ "testCaseId": "<string>", "actualResult": "Assistant reply that should become expected" }` |
Example – Adjust after a spec change (cURL)
```bash
curl -X POST \
  https://app.example.com/api/agent/agt_123/evals/adjust \
  -H "Authorization: Bearer abc1234exampleKey" \
  -H "database-id-hash: 35f5c5b139a6b569..." \
  -H "Content-Type: application/json" \
  -d '{
    "testCaseId": "tc-book-haircut",
    "actualResult": "Your appointment is locked for 3 PM on 4 June, see you then!"
  }'
```

Success (200)

```json
{
  "testCase": {
    "id": "tc-book-haircut",
    "messages": [
      { "role": "user", "content": "I'd like to book a haircut." },
      { "role": "assistant", "content": "Your appointment is locked for 3 PM on 4 June, see you then!" }
    ],
    "expectedResult": "Your appointment is locked for 3 PM on 4 June, see you then!",
    "status": "completed",
    "evaluation": {
      "isCompliant": true,
      "explanation": "Test case adjusted to match actual result",
      "score": 1
    }
  }
}
```
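
A typical workflow is to call adjust for each case you have manually verified, then persist the returned test case back into your suite. The hypothetical adjustCase helper below sketches this; the env-var names are assumptions and TestCase is the interface from §3.3:

```ts
import fetch from 'node-fetch';

// Hypothetical helper – not part of the API: promote an observed reply
// into the new expected result and hand back the rewritten case.
async function adjustCase(agentId: string, testCaseId: string, actualResult: string): Promise<TestCase> {
  const res = await fetch(`https://app.example.com/api/agent/${agentId}/evals/adjust`, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPEN_AGENTS_BUILDER_API_KEY}`,
      'database-id-hash': process.env.DATABASE_ID_HASH!,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ testCaseId, actualResult })
  });
  if (!res.ok) throw new Error(`adjust failed with ${res.status}`);
  const { testCase } = (await res.json()) as { testCase: TestCase };
  return testCase; // write this back into your committed suite
}
```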

Errors

| Code | Body |
|------|------|
| 500 | `{ "error": "Failed to adjust test case" }` |

5. Error Handling (General)

| Code | When / Why |
|------|------------|
| 400 | Validation issues (missing body field, malformed JSON, …) |
| 401 / 403 | Invalid or missing API key / database-id-hash |
| 404 | Agent ID not found |
| 500 | Internal errors (LLM time-out, tool failure, etc.) |

Error payloads follow:

```jsonc
{
  "error": "Human-readable description",
  "status": 500 // optional numeric code
}
```
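
A small shared guard keeps client code tidy; the sketch below assumes node-fetch and simply rethrows the payload above as an Error:

```ts
import type { Response } from 'node-fetch';

// Hypothetical guard for the non-streaming endpoints (generate/adjust):
// surfaces the error payload above as a thrown Error.
async function parseOrThrow<T>(res: Response): Promise<T> {
  if (!res.ok) {
    const { error } = (await res.json()) as { error: string };
    throw new Error(`Evals API ${res.status}: ${error}`);
  }
  return (await res.json()) as T;
}
```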

6. Best Practices & Notes

  • Store tests in Git – keep the JSON you receive from Generate in source control.
  • CI automation – call /evals/run in your pipeline and fail the build when any case returns failed (see the sketch after this list).
  • Dynamic dates – prefer relative dates (e.g., “next Monday”) so the suite stays evergreen.
  • Score thresholds – raise the pass bar to ≥ 0.9 for stricter QA, or inspect the explanation string for granular gating.
  • Streaming – treat /run as an open-ended stream until the connection closes; do not wait for a Content-Length header, as none is sent.
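
A minimal CI gate tying the first two bullets together (the file path, agent id, and env-var names are illustrative):

```ts
import fetch from 'node-fetch';
import { createInterface } from 'readline';
import { readFile } from 'fs/promises';

// Load the suite committed to Git (path and agent id are illustrative).
const testCases = JSON.parse(await readFile('evals/agt_123.json', 'utf8'));

const res = await fetch('https://app.example.com/api/agent/agt_123/evals/run', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.OPEN_AGENTS_BUILDER_API_KEY}`,
    'database-id-hash': process.env.DATABASE_ID_HASH!,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ testCases })
});

// Count terminal failures; readline yields one ND-JSON event per line.
let failed = 0;
for await (const line of createInterface({ input: res.body! })) {
  const evt = JSON.parse(line);
  if (evt.type === 'test_case_error') failed++;
  else if (evt.type === 'test_case_update' && evt.data.status === 'failed') failed++;
}

if (failed > 0) {
  console.error(`${failed} eval case(s) failed`);
  process.exit(1);
}
```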

7. Summary of Endpoints

| # | Verb | URL | Purpose |
|---|------|-----|---------|
| 1 | POST | `/api/agent/{id}/evals/generate` | Build a suite from the agent prompt |
| 2 | POST | `/api/agent/{id}/evals/run` | Run one/many tests (stream ND-JSON) |
| 3 | POST | `/api/agent/{id}/evals/adjust` | Auto-repair a failing test |

End of Documentation