# Experiments

An experiment executes your AI system against every item in a dataset and records the outputs. Think of it as a **controlled experiment**: you take a fixed set of test cases, run them through a specific configuration (a particular prompt version, model, or parameter set), and capture exactly what your system produced. This gives you a **snapshot of performance** that you can compare against other runs.

### Why Experiments Matter

Without experiments, comparing configurations is guesswork. Experiments let you:

* Compare how **different models** perform on **identical inputs**
* **Test prompt variations** against the same test cases
* Track performance over time as your **system evolves**
* **Generate outputs** for scoring by automated evaluators or human reviewers
* **Detect regressions** before they reach production

***

### Running an Experiment

To create an experiment:

{% tabs %}
{% tab title="Via InteractiveAI Platform" %}

1. Navigate to **Improve → Datasets** and open your dataset
2. Click **+ New Experiment** in the top right, then select **Create**
3. Configure your experiment:
   * **Experiment name**: A descriptive identifier (e.g., "gpt-4o-mini-v2-temperature-0.7")
   * **Description**: Optional context about what you're testing
   * **Model**: Select the provider and model to use
   * **Dataset**: Confirm the target dataset
   * **Evaluators**: Any active evaluators will automatically score your results
4. Click **Create** to start the run

<div data-with-frame="true"><figure><img src="https://708770081-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F1ICwJbq7EJdn5kBgXnQu%2Fuploads%2F5yzV1Oklvadx71HcL1jf%2Fimage.png?alt=media&#x26;token=a374cb20-0d0e-4a5d-a9bd-f636ac43f433" alt=""><figcaption></figcaption></figure></div>

{% hint style="info" %}
If you have **evaluators** configured, they'll automatically score the results as they come in. To learn more about evaluators, see the [Evaluators](https://docs.interactive.ai/improve/evaluators) page.
{% endhint %}
{% endtab %}

{% tab title="Via InteractiveAI SDK" %}
Use `run_experiment()` to run experiments programmatically. There are two ways to call it, depending on where your data lives.

1. With local data:

```python
import interactiveai

# Task function: receives each dataset item and returns your system's output.
def my_task(*, item, **kwargs):
    return llm.generate(item["input"])  # llm is your own model client

# Evaluator: compares the produced output against the item's expected output
# and returns a named score between 0.0 and 1.0.
def accuracy_evaluator(*, input, output, expected_output=None, **kwargs):
    is_correct = expected_output and expected_output.lower() in output.lower()
    return {"name": "accuracy", "value": 1.0 if is_correct else 0.0}

result = interactiveai.run_experiment(
    name="Summarization Test",
    data=[
        {"input": "Long article text...", "expected_output": "Expected summary"},
        {"input": "Another article...", "expected_output": "Another summary"}
    ],
    task=my_task,
    evaluators=[accuracy_evaluator]
)

print(result.format())
```

2. With an InteractiveAI dataset:

```python
dataset = interactiveai.get_dataset("my-eval-dataset")

result = dataset.run_experiment(
    name="Production Model Evaluation",
    task=my_task,
    evaluators=[accuracy_evaluator]
)

# Results automatically linked to dataset in InteractiveAI UI
print(f"View results: {result.dataset_run_url}")
```

When using `dataset.run_experiment()`, the experiment results are automatically linked to the dataset in the platform, enabling comparison between runs in the UI.
{% endtab %}
{% endtabs %}

For the full `run_experiment()` API reference including advanced features like composite evaluators, run evaluators, and async tasks, see the [SDK Documentation](https://app.gitbook.com/s/jHEEbkpMbUW2x51XS8Ez/experiments).
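
As a quick illustration of one of those advanced features, the sketch below shows what an asynchronous task might look like. It is an assumption-laden example, not the definitive API: it assumes `run_experiment()` accepts coroutine task functions (as described in the SDK documentation linked above), and `async_llm` is a hypothetical stand-in for your own async model client.

```python
import interactiveai

# Hypothetical sketch: an async task, assuming run_experiment() accepts
# coroutine functions as described in the SDK documentation.
async def my_async_task(*, item, **kwargs):
    # async_llm is a placeholder for your own asynchronous model client.
    return await async_llm.generate(item["input"])

result = interactiveai.run_experiment(
    name="Async Summarization Test",
    data=[
        {"input": "Long article text...", "expected_output": "Expected summary"}
    ],
    task=my_async_task,
    evaluators=[accuracy_evaluator],
)
```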

***

### Viewing Runs

Switch to the **Runs** tab on your dataset to see all experiment runs. The view includes two charts at the top showing latency and average model cost trends across runs, making it easy to spot performance changes over time.

<div data-with-frame="true"><figure><img src="https://708770081-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F1ICwJbq7EJdn5kBgXnQu%2Fuploads%2FTOMdltI42JKp2wX5csmF%2Fimage.png?alt=media&#x26;token=53edbb1c-c20f-4534-a789-f0950ed73d68" alt=""><figcaption></figcaption></figure></div>

The table below lists each run with its name, description, number of items processed, average latency, average model cost, and evaluation scores. Click any run to drill into the details and see exactly what happened with each dataset item.

<div data-with-frame="true"><figure><img src="https://708770081-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F1ICwJbq7EJdn5kBgXnQu%2Fuploads%2FKph2mwbhTWNkXcgo814b%2Fimage.png?alt=media&#x26;token=fac8365a-bca9-4d2a-b231-8408d247a951" alt=""><figcaption></figcaption></figure></div>

When you open a specific run, you see every dataset item that was processed along with its trace link, latency, cost, the input that was sent, the output your system produced, and the expected output for comparison. This view lets you **quickly scan** for items where the output diverged from expectations or where latency spiked unexpectedly.

### Comparing Runs

The real power of experiments is comparison. Click **Compare** to open a side-by-side view where you can select multiple runs from the same dataset and see how they performed on each item.

<div data-with-frame="true"><figure><img src="https://708770081-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F1ICwJbq7EJdn5kBgXnQu%2Fuploads%2FXvz9RGATuXNhSd6wuOc5%2Fimage.png?alt=media&#x26;token=d74e896b-5f55-4363-aede-33ba0edb3be1" alt=""><figcaption></figcaption></figure></div>

The comparison view shows each dataset item as a row, with columns for input, expected output, and metadata, followed by the results from each selected run, including their scores. This makes it immediately obvious where one configuration outperformed another and helps you make informed decisions about which changes to ship.
