# Experiments

## Overview

Run experiments and batched evaluations over datasets.

`run_experiment` iterates over every item in a dataset, calls your task function, optionally runs evaluators, and records results as a dataset run. `run_batched_evaluation` fetches existing traces or observations and runs evaluators on each, writing the resulting scores back to the original entities.

***

## `run_experiment` [(source)](https://github.com/interactive-ai/interactiveai-python-sdk/blob/main/interactiveai/_client/client.py#L2373)

Run an experiment on a dataset with automatic tracing and evaluation.

This method executes a task function on each item in the provided dataset, automatically traces all executions with InteractiveAI for observability, runs item-level and run-level evaluators on the outputs, and returns comprehensive results with evaluation metrics.

The experiment system provides:

* Automatic tracing of all task executions
* Concurrent processing with configurable limits
* Comprehensive error handling that isolates failures
* Integration with InteractiveAI datasets for experiment tracking
* Flexible evaluation framework supporting both sync and async evaluators

```python
run_experiment(
    *,
    name: str,
    run_name: str | None = None,
    description: str | None = None,
    data: Union[List[LocalExperimentItem], List["DatasetItemClient"]],
    task: TaskFunction,
    evaluators: List[EvaluatorFunction] = [],
    composite_evaluator: CompositeEvaluatorFunction | None = None,
    run_evaluators: List[RunEvaluatorFunction] = [],
    max_concurrency: int = 50,
    metadata: Dict[str, str] | None = None,
) -> ExperimentResult
```
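
A minimal, hedged sketch of a call (not a verbatim example from the SDK): `client` is assumed to be an already-initialised InteractiveAI client, `call_model` is a placeholder for the system under test, and the Evaluation dict fields (`name`, `value`) are assumptions based on the parameter descriptions below.

```python
def call_model(prompt: str) -> str:
    # Placeholder for the system under test (e.g. an LLM call).
    return "4" if "2 + 2" in prompt else "unknown"

def my_task(*, item, **kwargs):
    # Local experiment items are dict-like, so the input is read by key.
    return call_model(item["input"])

def exact_match(*, input, output, expected_output, metadata, **kwargs):
    # Item-level evaluator returning a single Evaluation dict (assumed shape).
    return {"name": "exact_match", "value": float(output == expected_output)}

result = client.run_experiment(
    name="prompt-v2 smoke test",
    description="Check prompt v2 against a handful of known answers",
    data=[
        {"input": "2 + 2", "expected_output": "4", "metadata": {"topic": "math"}},
        {"input": "Capital of France?", "expected_output": "Paris", "metadata": {}},
    ],
    task=my_task,
    evaluators=[exact_match],
    max_concurrency=10,
)
```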

**Parameters**

* `name` — Human-readable name for the experiment. Used for identification in the InteractiveAI UI.
* `run_name` — Optional exact name for the experiment run. If provided, it is used verbatim as the dataset run name when `data` contains InteractiveAI dataset items. If omitted, it defaults to the experiment name with an ISO timestamp appended.
* `description` — Optional description explaining the experiment's purpose, methodology, or expected outcomes.
* `data` — List of data items to process. Can be either:
  * a list of dict-like items with 'input', 'expected\_output', and 'metadata' keys, or
  * a list of InteractiveAI DatasetItem objects from dataset.items
* `task` — Function that processes each data item and returns its output. Must accept `item` as a keyword argument and may be synchronous or asynchronous. The expected signature is `task(*, item, **kwargs) -> Any`.
* `evaluators` — List of functions that evaluate each item's output individually. Each evaluator receives input, output, expected\_output, and metadata, and can return a single Evaluation dict or a list of Evaluation dicts (see the sketch after this parameter list).
* `composite_evaluator` — Optional function that creates composite scores from item-level evaluations. Receives the same inputs as item-level evaluators (input, output, expected\_output, metadata) plus the list of evaluations from item-level evaluators. Useful for weighted averages, pass/fail decisions based on multiple criteria, or custom scoring logic combining multiple metrics.
* `run_evaluators` — List of functions to evaluate the entire experiment run. Each run evaluator receives all item\_results and can compute aggregate metrics. Useful for calculating averages, distributions, or cross-item comparisons.
* `max_concurrency` — Maximum number of concurrent task executions (default: 50). Controls the number of items processed simultaneously. Adjust based on API rate limits and system resources.
* `metadata` — Optional metadata dictionary to attach to all experiment traces. This metadata is included in every trace created during the experiment. If `data` consists of InteractiveAI dataset items, the metadata is also attached to the dataset run.
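
The three evaluator hooks can be sketched as follows. This is illustrative only: the keyword names the SDK passes (e.g. `evaluations` for the composite evaluator, `item_results` for run evaluators) and the exact shape of evaluation results are assumptions based on the descriptions above.

```python
def length_check(*, input, output, expected_output, metadata, **kwargs):
    # Item-level: one Evaluation dict per item (assumed shape).
    return {"name": "length_ok", "value": float(len(str(output)) < 500)}

def overall_pass(*, input, output, expected_output, metadata, evaluations, **kwargs):
    # Composite: collapse the item-level evaluations into a single pass/fail score.
    passed = all(e.get("value") == 1.0 for e in evaluations)
    return {"name": "overall_pass", "value": float(passed)}

def mean_length_ok(*, item_results, **kwargs):
    # Run-level: aggregate one metric across every processed item
    # (the per-item result shape is assumed here).
    values = [
        e["value"]
        for r in item_results
        for e in r.evaluations
        if e.get("name") == "length_ok"
    ]
    return {"name": "mean_length_ok", "value": sum(values) / len(values) if values else 0.0}
```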

**Returns**

ExperimentResult containing:

* run\_name: The experiment run name. This equals the dataset run name when the experiment was run on an InteractiveAI dataset.
* item\_results: List of results for each processed item with outputs and evaluations
* run\_evaluations: List of aggregate evaluation results for the entire run
* dataset\_run\_id: ID of the dataset run (if using InteractiveAI datasets)
* dataset\_run\_url: Direct URL to view results in InteractiveAI UI (if applicable)
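
Continuing the sketch above, these fields can be read off the returned object. Whether they are attributes or mapping keys is not specified in this reference, so attribute access here is an assumption.

```python
print(result.run_name)                        # experiment / dataset run name
for item_result in result.item_results:       # per-item outputs and evaluations
    print(item_result)
for run_evaluation in result.run_evaluations:
    print(run_evaluation)                     # aggregate metrics for the whole run
if result.dataset_run_url:                    # only set when run on an InteractiveAI dataset
    print("View in UI:", result.dataset_run_url)
```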

***

## `run_batched_evaluation` [(source)](https://github.com/interactive-ai/interactiveai-python-sdk/blob/main/interactiveai/_client/client.py#L2560)

Fetch traces or observations and run evaluations on each item.

This method provides a powerful way to evaluate existing data in InteractiveAI at scale. It fetches items based on filters, transforms them using a mapper function, runs evaluators on each item, and creates scores that are linked back to the original entities. This is ideal for:

* Running evaluations on production traces after deployment
* Backtesting new evaluation metrics on historical data
* Batch scoring of observations for quality monitoring
* Periodic evaluation runs on recent data

The method uses a streaming/pipeline approach to process items in batches, making it memory-efficient for large datasets. It includes comprehensive error handling, retry logic, and resume capability for long-running evaluations.

```python
run_batched_evaluation(
    *,
    scope: Literal['traces', 'observations'],
    mapper: MapperFunction,
    filter: str | None = None,
    fetch_batch_size: int = 50,
    max_items: int | None = None,
    max_retries: int = 3,
    evaluators: List[EvaluatorFunction],
    composite_evaluator: CompositeEvaluatorFunction | None = None,
    max_concurrency: int = 50,
    metadata: Dict[str, Any] | None = None,
    resume_from: BatchEvaluationResumeToken | None = None,
    verbose: bool = False,
) -> BatchEvaluationResult
```
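
For orientation before the parameter details, here is a hedged sketch that re-scores production traces. `client` is assumed to be an initialised InteractiveAI client, the import path for `EvaluatorInputs` is assumed, and the attributes read from `trace` inside the mapper are assumptions about the fetched object.

```python
from interactiveai import EvaluatorInputs  # import path assumed

def trace_mapper(trace):
    # Transform a fetched trace into evaluator inputs.
    return EvaluatorInputs(
        input=trace.input,
        output=trace.output,
        expected_output=None,
        metadata=trace.metadata,
    )

def non_empty_output(*, input, output, expected_output, metadata, **kwargs):
    # Item-level evaluator: did the trace produce any output at all?
    return {"name": "non_empty_output", "value": float(bool(output))}

result = client.run_batched_evaluation(
    scope="traces",
    mapper=trace_mapper,
    evaluators=[non_empty_output],
    filter='{"tags": ["production"]}',
    max_items=500,
    verbose=True,
)
```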

**Parameters**

* `scope` — The type of items to evaluate. Must be one of:
  * `"traces"`: Evaluate complete traces with all their observations
  * `"observations"`: Evaluate individual observations (spans, generations, events)
* `mapper` — Function that transforms API response objects into evaluator inputs. Receives a trace/observation object and returns an EvaluatorInputs instance with input, output, expected\_output, and metadata fields. Can be sync or async.
* `evaluators` — List of evaluation functions to run on each item. Each evaluator receives the mapped inputs and returns Evaluation object(s). Evaluator failures are logged but don't stop the batch evaluation.
* `filter` — Optional JSON filter string for querying items (same format as the InteractiveAI API). Default: None (fetches all items). Examples (see also the sketch after this list):
  * `'{"tags": ["production"]}'`
  * `'{"user_id": "user123", "timestamp": {"operator": ">", "value": "2024-01-01"}}'`
* `fetch_batch_size` — Number of items to fetch per API call and hold in memory. Larger values may be faster but use more memory. Default: 50.
* `max_items` — Maximum total number of items to process. If None, processes all items matching the filter. Useful for testing or limiting evaluation runs. Default: None (process all).
* `max_concurrency` — Maximum number of items to evaluate concurrently. Controls parallelism and resource usage. Default: 50.
* `composite_evaluator` — Optional function that creates a composite score from item-level evaluations. Receives the original item and its evaluations, returns a single Evaluation. Useful for weighted averages or combined metrics. Default: None.
* `metadata` — Optional metadata dict to add to all created scores. Useful for tracking evaluation runs, versions, or other context. Default: None.
* `max_retries` — Maximum number of retry attempts for failed batch fetches. Uses exponential backoff (1s, 2s, 4s). Default: 3.
* `verbose` — If True, logs progress information to console. Useful for monitoring long-running evaluations. Default: False.
* `resume_from` — Optional resume token from a previous incomplete run. Allows continuing evaluation after interruption or failure. Default: None.
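
As referenced in the `filter` description, a short sketch of the `filter` and `composite_evaluator` parameters. Building the filter string with `json.dumps` avoids hand-escaping JSON; the keyword names received by the composite evaluator (`item`, `evaluations`) are assumptions based on the description above, and `trace_mapper` / `non_empty_output` come from the earlier sketch.

```python
import json

# Build the JSON filter string programmatically (same format as the examples above).
recent_production = json.dumps({
    "tags": ["production"],
    "timestamp": {"operator": ">", "value": "2024-01-01"},
})

def mean_score(*, item, evaluations, **kwargs):
    # Composite evaluator: one combined score per item (assumed shapes).
    values = [e["value"] for e in evaluations if isinstance(e.get("value"), (int, float))]
    return {"name": "mean_score", "value": sum(values) / len(values) if values else 0.0}

result = client.run_batched_evaluation(
    scope="traces",
    mapper=trace_mapper,
    evaluators=[non_empty_output],
    composite_evaluator=mean_score,
    filter=recent_production,
    metadata={"eval_run": "weekly-quality-check"},
)
```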

**Returns**

BatchEvaluationResult containing:

* total\_items\_fetched: Number of items fetched from the API
* total\_items\_processed: Number of items successfully evaluated
* total\_items\_failed: Number of items that failed evaluation
* total\_scores\_created: Scores created by item-level evaluators
* total\_composite\_scores\_created: Scores created by the composite evaluator
* total\_evaluations\_failed: Number of individual evaluator failures
* evaluator\_stats: Per-evaluator statistics (success rate, scores created)
* resume\_token: Token for resuming if incomplete (None if completed)
* completed: True if all items were processed
* duration\_seconds: Total execution time in seconds
* failed\_item\_ids: IDs of items that failed
* error\_summary: Error types and counts
* has\_more\_items: True if max\_items was reached but more items exist
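
Because long runs can be interrupted, the result carries a resume token. A minimal sketch of the resume loop, reusing the functions from the sketches above and assuming the fields listed here are exposed as attributes:

```python
result = client.run_batched_evaluation(
    scope="traces",
    mapper=trace_mapper,
    evaluators=[non_empty_output],
    verbose=True,
)

while not result.completed and result.resume_token is not None:
    # Continue from where the previous (interrupted) run stopped.
    result = client.run_batched_evaluation(
        scope="traces",
        mapper=trace_mapper,
        evaluators=[non_empty_output],
        resume_from=result.resume_token,
        verbose=True,
    )

print(f"{result.total_items_processed} items evaluated in {result.duration_seconds:.1f}s")
if result.total_items_failed:
    print("Failed items:", result.failed_item_ids)
```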
