# Experiments

## Overview

Run experiments and batched evaluations over datasets.

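`run_experiment` iterates over every item in a dataset, calls your task function, optionally runs evaluators, and returns the results; when the items come from an InteractiveAI dataset, the results are also recorded as a dataset run. `run_batched_evaluation` fetches existing traces or observations, runs evaluators on each one, and links the resulting scores back to the original entities.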

***

## `run_experiment` [(source)](https://github.com/interactive-ai/interactiveai-python-sdk/blob/main/interactiveai/_client/client.py#L2373)

Run an experiment on a dataset with automatic tracing and evaluation.

This method executes a task function on each item in the provided dataset, automatically traces all executions with InteractiveAI for observability, runs item-level and run-level evaluators on the outputs, and returns comprehensive results with evaluation metrics.

The experiment system provides:

* Automatic tracing of all task executions
* Concurrent processing with configurable limits
* Comprehensive error handling that isolates failures
* Integration with InteractiveAI datasets for experiment tracking
* Flexible evaluation framework supporting both sync and async evaluators
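
To make the concurrency and failure-isolation points above concrete, here is a minimal sketch of the general pattern using `asyncio`. It is not the SDK's implementation: `run_all`, `run_one`, and `flaky_task` are hypothetical names, and the result dict shape is invented for illustration.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def run_all(
    items: List[Dict[str, Any]],
    task: Callable[..., Awaitable[Any]],
    max_concurrency: int = 50,
) -> List[Dict[str, Any]]:
    """Run `task` on every item, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item: Dict[str, Any]) -> Dict[str, Any]:
        async with semaphore:
            try:
                output = await task(item=item)
                return {"item": item, "output": output, "error": None}
            except Exception as exc:
                # Isolate the failure: record it and let other items continue.
                return {"item": item, "output": None, "error": repr(exc)}

    # gather returns results in the same order as the inputs.
    return list(await asyncio.gather(*(run_one(item) for item in items)))


async def flaky_task(*, item: Dict[str, Any], **kwargs: Any) -> str:
    """Hypothetical task that fails on one specific input."""
    if item["input"] == "bad":
        raise ValueError("cannot process this input")
    return item["input"].upper()


results = asyncio.run(
    run_all(
        [{"input": "a"}, {"input": "bad"}, {"input": "c"}],
        flaky_task,
        max_concurrency=2,
    )
)
```

One failing item produces an entry with `error` set while the other items still complete, which is the behavior the list above describes.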

```python
run_experiment(
    *,
    name: str,
    run_name: str | None = None,
    description: str | None = None,
    data: Union[List[LocalExperimentItem], List[ForwardRef('DatasetItemClient')]],
    task: TaskFunction,
    evaluators: List[EvaluatorFunction] = [],
    composite_evaluator: CompositeEvaluatorFunction | None = None,
    run_evaluators: List[RunEvaluatorFunction] = [],
    max_concurrency: int = 50,
    metadata: Dict[str, str] | None = None,
) -> ExperimentResult
```

**Parameters**

* `name` — Human-readable name for the experiment. Used for identification in the InteractiveAI UI.
* `run_name` — Optional exact name for the experiment run. When `data` contains InteractiveAI dataset items, this value is used verbatim as the dataset run name. If omitted, it defaults to the experiment name with an ISO timestamp appended.
* `description` — Optional description explaining the experiment's purpose, methodology, or expected outcomes.
* `data` — Array of data items to process. Can be either:
  * List of dict-like items with `input`, `expected_output`, and `metadata` keys
  * List of InteractiveAI DatasetItem objects from `dataset.items`
* `task` — Function that processes each data item and returns output. Must accept `item` as a keyword argument and can return sync or async results. The task function signature should be: `task(*, item, **kwargs) -> Any`
* `evaluators` — List of functions to evaluate each item's output individually. Each evaluator receives input, output, expected\_output, and metadata. Can return single Evaluation dict or list of Evaluation dicts.
* `composite_evaluator` — Optional function that creates composite scores from item-level evaluations. Receives the same inputs as item-level evaluators (input, output, expected\_output, metadata) plus the list of evaluations from item-level evaluators. Useful for weighted averages, pass/fail decisions based on multiple criteria, or custom scoring logic combining multiple metrics.
* `run_evaluators` — List of functions to evaluate the entire experiment run. Each run evaluator receives all item\_results and can compute aggregate metrics. Useful for calculating averages, distributions, or cross-item comparisons.
* `max_concurrency` — Maximum number of concurrent task executions (default: 50). Controls the number of items processed simultaneously. Adjust based on API rate limits and system resources.
* `metadata` — Optional metadata dictionary to attach to all experiment traces. This metadata will be included in every trace created during the experiment. If `data` are InteractiveAI dataset items, the metadata will be attached to the dataset run, too.
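
The following sketch shows what a task and an item-level evaluator matching the parameter descriptions above might look like. Several details are assumptions rather than confirmed API: that evaluators are called with keyword arguments, that an Evaluation dict uses `name` and `value` keys, and that a configured client object named `client` exposes `run_experiment`. The functions are exercised locally first so they can be checked without the SDK.

```python
from typing import Any, Dict, List, Optional

# Local items use the 'input', 'expected_output', and 'metadata' keys.
data: List[Dict[str, Any]] = [
    {"input": "2 + 2", "expected_output": "4", "metadata": {"topic": "math"}},
    {"input": "3 * 3", "expected_output": "9", "metadata": {"topic": "math"}},
]


def task(*, item: Dict[str, Any], **kwargs: Any) -> str:
    """Toy task standing in for a model call: evaluate 'a + b' or 'a * b'."""
    left, op, right = item["input"].split()
    a, b = int(left), int(right)
    return str(a + b if op == "+" else a * b)


def exact_match(
    *,
    input: Any,
    output: Any,
    expected_output: Any = None,
    metadata: Optional[Dict[str, Any]] = None,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Item-level evaluator. The 'name'/'value' keys are an assumption."""
    return {"name": "exact_match", "value": 1.0 if output == expected_output else 0.0}


# Dry run without the SDK, to check both functions before wiring them in.
scores = [
    exact_match(
        input=item["input"],
        output=task(item=item),
        expected_output=item["expected_output"],
        metadata=item["metadata"],
    )["value"]
    for item in data
]

# Hypothetical call, assuming `client` is an already-configured SDK client:
# result = client.run_experiment(
#     name="arithmetic-baseline",
#     data=data,
#     task=task,
#     evaluators=[exact_match],
#     max_concurrency=5,
# )
```

Accepting `**kwargs` in both functions keeps them tolerant of extra arguments the SDK may pass.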

**Returns**

ExperimentResult containing:

* run\_name: The experiment run name. This equals the dataset run name if the experiment ran on an InteractiveAI dataset.
* item\_results: List of results for each processed item with outputs and evaluations
* run\_evaluations: List of aggregate evaluation results for the entire run
* dataset\_run\_id: ID of the dataset run (if using InteractiveAI datasets)
* dataset\_run\_url: Direct URL to view results in InteractiveAI UI (if applicable)
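
Entries in `run_evaluations` come from run-level evaluators, which receive all item results and return an aggregate. A minimal sketch of one follows. The shape of an item result here (a dict with an `evaluations` list of `name`/`value` dicts) and the `item_results` keyword are assumptions for illustration; the SDK's real result objects may differ.

```python
from typing import Any, Dict, List


def mean_exact_match(*, item_results: List[Dict[str, Any]], **kwargs: Any) -> Dict[str, Any]:
    """Run-level evaluator sketch: average one item-level score across the run."""
    values = [
        evaluation["value"]
        for result in item_results
        for evaluation in result.get("evaluations", [])
        if evaluation["name"] == "exact_match"
    ]
    # Guard against an empty run so the evaluator never divides by zero.
    mean = sum(values) / len(values) if values else 0.0
    return {"name": "mean_exact_match", "value": mean}


# Stand-in item results for a dry run (hypothetical shape).
fake_item_results = [
    {"output": "4", "evaluations": [{"name": "exact_match", "value": 1.0}]},
    {"output": "8", "evaluations": [{"name": "exact_match", "value": 0.0}]},
]
summary = mean_exact_match(item_results=fake_item_results)
```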

***

## `run_batched_evaluation` [(source)](https://github.com/interactive-ai/interactiveai-python-sdk/blob/main/interactiveai/_client/client.py#L2560)

Fetch traces or observations and run evaluations on each item.

This method provides a powerful way to evaluate existing data in InteractiveAI at scale. It fetches items based on filters, transforms them using a mapper function, runs evaluators on each item, and creates scores that are linked back to the original entities. This is ideal for:

* Running evaluations on production traces after deployment
* Backtesting new evaluation metrics on historical data
* Batch scoring of observations for quality monitoring
* Periodic evaluation runs on recent data

The method uses a streaming/pipeline approach to process items in batches, making it memory-efficient for large datasets. It includes comprehensive error handling, retry logic, and resume capability for long-running evaluations.
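
The batch-and-retry pattern described above can be sketched generically as follows. This illustrates the technique only and is not the SDK's code: `fetch_with_retry`, `iter_batches`, and `fake_fetch` are hypothetical names, and page-number pagination is an assumption.

```python
import time
from typing import Any, Callable, Dict, Iterator, List


def fetch_with_retry(
    fetch_page: Callable[[int], List[Any]],
    page: int,
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Any]:
    """Call fetch_page(page); on failure wait 1s, 2s, 4s, ... and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fetch_page(page)
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay * 2**attempt)
    raise AssertionError("unreachable when max_retries >= 0")


def iter_batches(
    fetch_page: Callable[[int], List[Any]],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[List[Any]]:
    """Yield one batch at a time so only a single batch is held in memory."""
    page = 1
    while True:
        batch = fetch_with_retry(fetch_page, page, max_retries=max_retries, sleep=sleep)
        if not batch:
            return
        yield batch
        page += 1


# Fake data source that fails once on page 2, then succeeds.
pages: Dict[int, List[str]] = {1: ["t1", "t2"], 2: ["t3"]}
failures = {"remaining": 1}
delays: List[float] = []


def fake_fetch(page: int) -> List[str]:
    if page == 2 and failures["remaining"] > 0:
        failures["remaining"] -= 1
        raise ConnectionError("transient")
    return pages.get(page, [])


# Record delays instead of sleeping so the demo runs instantly.
batches = list(iter_batches(fake_fetch, sleep=delays.append))
```

Because `iter_batches` is a generator, each batch can be evaluated and released before the next one is fetched.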

```python
run_batched_evaluation(
    *,
    scope: Literal['traces', 'observations'],
    mapper: MapperFunction,
    filter: str | None = None,
    fetch_batch_size: int = 50,
    max_items: int | None = None,
    max_retries: int = 3,
    evaluators: List[EvaluatorFunction],
    composite_evaluator: CompositeEvaluatorFunction | None = None,
    max_concurrency: int = 50,
    metadata: Dict[str, Any] | None = None,
    resume_from: BatchEvaluationResumeToken | None = None,
    verbose: bool = False,
) -> BatchEvaluationResult
```

**Parameters**

* `scope` — The type of items to evaluate. Must be one of:
  * `"traces"`: Evaluate complete traces with all their observations
  * `"observations"`: Evaluate individual observations (spans, generations, events)
* `mapper` — Function that transforms API response objects into evaluator inputs. Receives a trace/observation object and returns an EvaluatorInputs instance with input, output, expected\_output, and metadata fields. Can be sync or async.
* `evaluators` — List of evaluation functions to run on each item. Each evaluator receives the mapped inputs and returns Evaluation object(s). Evaluator failures are logged but don't stop the batch evaluation.
* `filter` — Optional JSON filter string for querying items (same format as InteractiveAI API). Default: None (fetches all items). Examples:
  * `'{"tags": ["production"]}'`
  * `'{"user_id": "user123", "timestamp": {"operator": ">", "value": "2024-01-01"}}'`
* `fetch_batch_size` — Number of items to fetch per API call and hold in memory. Larger values may be faster but use more memory. Default: 50.
* `max_items` — Maximum total number of items to process. If None, processes all items matching the filter. Useful for testing or limiting evaluation runs. Default: None (process all).
* `max_concurrency` — Maximum number of items to evaluate concurrently. Controls parallelism and resource usage. Default: 50.
* `composite_evaluator` — Optional function that creates a composite score from item-level evaluations. Receives the original item and its evaluations, returns a single Evaluation. Useful for weighted averages or combined metrics. Default: None.
* `metadata` — Optional metadata dict to add to all created scores. Useful for tracking evaluation runs, versions, or other context. Default: None.
* `max_retries` — Maximum number of retry attempts for failed batch fetches. Uses exponential backoff (1s, 2s, 4s). Default: 3.
* `verbose` — If True, logs progress information to console. Useful for monitoring long-running evaluations. Default: False.
* `resume_from` — Optional resume token from a previous incomplete run. Allows continuing evaluation after interruption or failure. Default: None.
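
A mapper and a filter string consistent with the descriptions above might be put together as in this sketch. The SDK's `EvaluatorInputs` class is replaced by a local stand-in dataclass because its import path is not given here. The attribute names read from the trace object, the mapper's call style, the `name`/`value` Evaluation keys, and the `client` object are all assumptions.

```python
import json
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Dict, Optional


@dataclass
class EvaluatorInputsStandIn:
    """Local stand-in for the SDK's EvaluatorInputs (same four fields)."""

    input: Any
    output: Any
    expected_output: Any = None
    metadata: Optional[Dict[str, Any]] = None


def map_trace(item: Any, **kwargs: Any) -> EvaluatorInputsStandIn:
    """Mapper sketch: copy fields from a trace-like object (assumed attributes)."""
    return EvaluatorInputsStandIn(
        input=getattr(item, "input", None),
        output=getattr(item, "output", None),
        metadata=getattr(item, "metadata", None),
    )


def non_empty_output(*, output: Any, **kwargs: Any) -> Dict[str, Any]:
    """Evaluator sketch: score 1.0 when the output is a non-blank string."""
    ok = isinstance(output, str) and bool(output.strip())
    return {"name": "non_empty_output", "value": 1.0 if ok else 0.0}


# Build the filter with json.dumps rather than hand-writing the JSON string.
filter_json = json.dumps({"tags": ["production"]})

# Dry run against a fake trace object.
fake_trace = SimpleNamespace(input="hi", output="hello", metadata={"env": "prod"})
mapped = map_trace(fake_trace)
score = non_empty_output(output=mapped.output)

# Hypothetical call, assuming `client` is an already-configured SDK client
# and the real EvaluatorInputs class is used in the mapper:
# result = client.run_batched_evaluation(
#     scope="traces",
#     mapper=map_trace,
#     filter=filter_json,
#     evaluators=[non_empty_output],
#     max_items=100,
# )
```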

**Returns**

BatchEvaluationResult containing:

* total\_items\_fetched: Number of items fetched from API
* total\_items\_processed: Number of items successfully evaluated
* total\_items\_failed: Number of items that failed evaluation
* total\_scores\_created: Scores created by item-level evaluators
* total\_composite\_scores\_created: Scores created by composite evaluator
* total\_evaluations\_failed: Individual evaluator failures
* evaluator\_stats: Per-evaluator statistics (success rate, scores created)
* resume\_token: Token for resuming if incomplete (None if completed)
* completed: True if all items processed
* duration\_seconds: Total execution time
* failed\_item\_ids: IDs of items that failed
* error\_summary: Error types and counts
* has\_more\_items: True if max\_items reached but more exist
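
The `completed` and `resume_token` fields suggest a simple resume loop: call again with the previous token until the run reports completion. The sketch below shows that control flow against a fake runner. `run_until_complete` and `make_fake_runner` are hypothetical helpers, and the token values are invented; with the real SDK, `run_batch` could be a `functools.partial` over `run_batched_evaluation` with the other arguments bound.

```python
from types import SimpleNamespace
from typing import Any, Callable, List, Optional, Tuple


def run_until_complete(run_batch: Callable[..., Any], max_attempts: int = 5) -> List[Any]:
    """Call run_batch repeatedly, passing the last resume token, until completed."""
    results: List[Any] = []
    token: Optional[Any] = None
    for _ in range(max_attempts):
        result = run_batch(resume_from=token)
        results.append(result)
        if result.completed or result.resume_token is None:
            return results
        token = result.resume_token
    raise RuntimeError(f"evaluation still incomplete after {max_attempts} attempts")


def make_fake_runner(interruptions: int) -> Tuple[Callable[..., Any], List[Any]]:
    """Build a fake runner that reports `interruptions` incomplete runs first."""
    state = {"left": interruptions}
    seen_tokens: List[Any] = []

    def fake_run(*, resume_from: Optional[Any] = None) -> SimpleNamespace:
        seen_tokens.append(resume_from)
        if state["left"] > 0:
            state["left"] -= 1
            token = f"token-{len(seen_tokens)}"
            return SimpleNamespace(completed=False, resume_token=token)
        return SimpleNamespace(completed=True, resume_token=None)

    return fake_run, seen_tokens


fake_run, seen_tokens = make_fake_runner(interruptions=2)
results = run_until_complete(fake_run)
```

Capping the number of attempts keeps a persistently failing run from looping forever.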


