Improved flexibility in Agent Evaluations

17 June 2026

Build and maintain evaluation datasets

To keep evaluations accurate and up to date as your agents evolve, you need to be able to maintain the datasets behind them, control the environment they run in, and intervene when something goes wrong mid-run. This release extends the evaluations experience: you can now edit and export datasets, configure setup and teardown for evaluation runs, and cancel a run in progress.

Dataset editing and export

Datasets are now editable. Users can open any saved dataset and add rows, remove rows, or update cell values, including inputs, expected outputs, and expected tool calls. Changes are persisted immediately, so the next evaluation run picks up the updated dataset without a re-import step. Each dataset remains tied to the same Agentic App and Service Action, and the existing column-mapping rules still apply.

Users can also export any dataset as a JSON file. The exported format matches the standard JSON upload format, so it can be re-imported as-is, versioned in source control alongside agent code, or shared with teammates working in a different environment.
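For teams that version exported datasets in source control, it can help to normalize the file before committing so that diffs stay small and stable. The following is a minimal Python sketch of that idea. It assumes only that the export is valid JSON; it does not rely on any particular field names, and the helper name normalize_dataset_json is hypothetical rather than part of the product.

```python
import json


def normalize_dataset_json(text: str) -> str:
    """Re-serialize JSON text with sorted keys and fixed indentation.

    Assumption: the exported dataset is valid JSON. Nothing here depends
    on the dataset's actual schema, which this sketch does not know.
    """
    data = json.loads(text)
    # Sorted keys and a fixed indent give byte-stable output for equal content.
    return json.dumps(data, indent=2, sort_keys=True, ensure_ascii=False) + "\n"


# Two exports with the same content but different key order normalize identically.
first = normalize_dataset_json('{"b": 1, "a": 2}')
second = normalize_dataset_json('{"a": 2, "b": 1}')
print(first == second)
```

Running the normalizer as a pre-commit step is one option. Because Python's JSON parser and serializer round-trip the data, the normalized file should remain valid JSON with the same content, only with reordered keys and consistent whitespace.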

Setup and teardown for evaluation runs

Evaluation runs can now be configured with a setup Service Action and a teardown Service Action. Setup runs before the evaluation starts, and teardown runs after it completes. Both are regular Service Actions from any App or Agentic App, and both are optional. This lets users provision test data in external systems, load fixtures, or reset environment state automatically, without manually preparing or cleaning up between runs. Setup and teardown Service Actions must be fully self-contained and take no input parameters.
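The ordering described above can be pictured with a short Python sketch. This is an illustration only, not the platform's implementation: run_with_hooks and the example hooks are hypothetical names, and the hooks are modeled as zero-argument callables to mirror the rule that setup and teardown take no input parameters. How teardown behaves when a run fails or is cancelled is not covered by this sketch.

```python
from typing import Callable, List, Optional


def run_with_hooks(
    test_cases: List[Callable[[], str]],
    setup: Optional[Callable[[], None]] = None,
    teardown: Optional[Callable[[], None]] = None,
) -> List[str]:
    """Run optional setup, then every test case, then optional teardown."""
    if setup is not None:
        setup()  # hypothetical stand-in for the setup Service Action
    results = [case() for case in test_cases]
    if teardown is not None:
        teardown()  # hypothetical stand-in for the teardown Service Action
    return results


events: List[str] = []


def load_fixtures() -> None:
    events.append("setup")


def reset_state() -> None:
    events.append("teardown")


def make_case(name: str) -> Callable[[], str]:
    def case() -> str:
        events.append(name)
        return f"{name}: done"
    return case


results = run_with_hooks(
    [make_case("case-1"), make_case("case-2")],
    setup=load_fixtures,
    teardown=reset_state,
)
print(events)
```

Both hooks default to None in the sketch, reflecting that each one is optional and can be configured independently of the other.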

Cancelling a running evaluation

Users can cancel an evaluation run in progress. The run report reflects the cancellation, showing partial results for test cases that completed before the cancel and a "Cancelled" status for the run as a whole. This allows you to stop a long-running or mistaken run without waiting for it to finish.
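As a rough mental model of a cancelled run's report, the Python sketch below runs test cases in sequence and checks a cancel flag between them. It is a simplified illustration under stated assumptions: RunReport, run_evaluation, and the "Running" and "Completed" status names are hypothetical, and only the "Cancelled" status and the retention of partial results come from the description above.

```python
from dataclasses import dataclass, field
from threading import Event
from typing import Callable, List


@dataclass
class RunReport:
    status: str = "Running"  # hypothetical initial status
    results: List[str] = field(default_factory=list)


def run_evaluation(test_cases: List[Callable[[], str]], cancel: Event) -> RunReport:
    """Run cases in order; stop before the next case once cancel is set."""
    report = RunReport()
    for case in test_cases:
        if cancel.is_set():
            # Keep results from cases that already finished.
            report.status = "Cancelled"
            return report
        report.results.append(case())
    report.status = "Completed"  # hypothetical status name
    return report


cancel = Event()


def second_case() -> str:
    cancel.set()  # simulate a user cancelling while this case is running
    return "case-2: pass"


report = run_evaluation(
    [lambda: "case-1: pass", second_case, lambda: "case-3: pass"],
    cancel,
)
print(report.status, report.results)
```

In this sketch the third case never starts, while the two cases that finished before the cancel was observed stay in the report.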
