Evaluators
Evaluators automate quality assessment using LLM-as-a-Judge, a method where a language model scores the outputs of your AI application. The judge model receives trace data, applies your chosen evaluation criteria, and returns a score with clear chain-of-thought reasoning.
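Conceptually, a judge call assembles a prompt from the rubric and trace data, then parses a structured score and reasoning out of the model's reply. The sketch below is illustrative only: the prompt layout, JSON reply shape, and `build_judge_prompt`/`parse_judgement` helpers are assumptions, and the real model call is stubbed with a canned reply.

```python
import json

def build_judge_prompt(rubric: str, query: str, generation: str) -> str:
    """Assemble the prompt sent to the judge model (hypothetical layout)."""
    return (
        f"{rubric}\n\n"
        f"User query:\n{query}\n\n"
        f"Model output:\n{generation}\n\n"
        'Reply as JSON: {"score": <0..1>, "reasoning": "<why>"}'
    )

def parse_judgement(raw: str) -> tuple[float, str]:
    """Extract the score and chain-of-thought reasoning from the judge's reply."""
    data = json.loads(raw)
    return float(data["score"]), data["reasoning"]

prompt = build_judge_prompt(
    rubric="Rate the answer's relevance to the query on a 0-1 scale.",
    query="What is tracing?",
    generation="Tracing records each step your application takes.",
)

# Stubbed judge reply, standing in for a real model call:
reply = '{"score": 0.8, "reasoning": "Answer is relevant but slightly terse."}'
score, reasoning = parse_judgement(reply)
```

Requesting a machine-parseable reply (here, a small JSON object) is what lets the platform store both the numeric score and the reasoning comment alongside each trace.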
Why Evaluators Matter
Manual evaluation doesn't scale. Reviewing thousands of outputs by hand is slow, expensive, and inconsistent. LLM-as-a-Judge offers a practical alternative:
Scalability: Evaluate thousands of outputs in minutes rather than days.
Nuance: Captures subjective dimensions like helpfulness, relevance, and tone better than rule-based metrics.
Repeatability: With a fixed rubric, the same evaluation prompt produces consistent scores across runs.
Important: LLM evaluators can inherit biases, favor verbose responses, or produce inconsistent scores. Periodically validate automated scores against human annotation to ensure alignment. See Annotations for establishing human baselines and calibration.
Evaluator Library View
Navigate to Improvement → Evaluators and select Evaluator Library in the top right to see all available evaluators in your project. The library displays each evaluator's name, maintainer (InteractiveAI or User), last edit date, usage count, version information, and ID. The library contains two main types of evaluators: InteractiveAI's Out-of-the-Box Evaluators and your Custom Evaluators.

Out-of-the-Box Evaluators
InteractiveAI provides a catalog of ready-to-use evaluators maintained by the platform. These capture best-practice evaluation prompts for common quality dimensions such as:
Conciseness
Brevity without unnecessary content
Context-correctness
Whether retrieved context is accurate
Context-relevance
Whether retrieved context fits the query
Correctness
Factual accuracy compared to ground truth
Hallucination
Fabricated or unsupported information in the output
Helpfulness
Whether the response assists the user effectively
Relevance
How directly the output addresses the query
Toxicity
Harmful, offensive, or inappropriate language
Click any evaluator to inspect its prompt, scoring criteria, and configuration in depth.
Custom Evaluators
When the built-in evaluators don't fit your needs, you can create your own. Custom evaluators appear in the library with "User" as the maintainer.
To create a custom evaluator:
Navigate to Evaluator Library
Click + Set Up Evaluator
Configure the evaluator:
Name: Identifier for this evaluator
Model: Toggle to use the default evaluation model or select a specific one
Score reasoning prompt: Instructions for how the judge should explain its reasoning
Score range prompt: Description of the scoring scale (e.g., "Score between 0 and 1")
Evaluation prompt: The full prompt template with {{variables}} placeholders
Click Save

Your prompt can use variables like {{query}}, {{generation}}, and {{ground_truth}} that will be mapped to trace data when the evaluator runs.
Running Evaluators View
This view shows all your evaluator runs with their ID, score name, creation date, current status (active or inactive), and result counts. You'll also see links to the referenced evaluator configuration, the target that was evaluated, and which filters were applied.

When you need to dig deeper, click Open in the Logs column to open the detailed execution log. This shows every trace that was evaluated along with its score value, the model's reasoning comment, and the final status. If something looks off, or you want to understand exactly how the judge reached a particular score, click the trace link to inspect the full evaluation.

Since every LLM-as-a-Judge execution creates its own trace, you get complete visibility into what happened: the exact prompt sent to the judge, how the model responded, whether variable mapping worked correctly, and how many tokens were consumed.
Run an Evaluation
Once you have an evaluator (built-in or custom), configure where and how it should run.
To run an evaluator:
Navigate to Running Evaluators
Click Set Up Evaluator in the top right corner
Select your evaluator for this run
Configure the run:
Evaluator Runs On: Choose what data to evaluate.
New traces: Run on traces as they arrive
Existing traces: Run once on historical traces
Target Data: Select the data source.
Live Tracing Data: Production traces from your application
Experiment Runs: Outputs from dataset experiments
Target Filter: Narrow evaluation to specific subsets using filters like trace name, tags, environment, or dataset (dataset filtering is available only if you selected Experiment Runs in the previous step).
Variable Mapping: Map each prompt variable to trace data.
Object: Select Trace or a specific observation
Object Variable: Choose Input, Output, or Metadata
JsonPath: Optionally extract a specific field from the JSON
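To make the JsonPath step concrete, here is a minimal sketch of resolving a dot-and-index path like `$.output.messages[0].content` against a trace's JSON. This toy resolver handles only plain keys and integer indices; it is an illustration of the idea, not the JSONPath dialect the platform actually supports.

```python
import json
import re

def extract(data, path: str):
    """Resolve a simple JSONPath-style expression like '$.output.messages[0].content'."""
    if path == "$":
        return data
    current = data
    # Split "$.a.b[0].c" into key tokens and integer-index tokens.
    for key, index in re.findall(r"\.(\w+)|\[(\d+)\]", path):
        current = current[key] if key else current[int(index)]
    return current

# A hypothetical trace output payload:
trace_output = json.loads(
    '{"output": {"messages": [{"role": "assistant", "content": "Hi there"}]}}'
)
value = extract(trace_output, "$.output.messages[0].content")
```

Without a JsonPath, the whole Input/Output/Metadata object is passed to the variable; with one, only the extracted field (here, the assistant message text) reaches the judge prompt.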
Review the Preview Sample Matched Traces to verify your filters match the expected data
Click Execute
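Taken together, the run configuration above behaves like a loop: select traces that pass the target filters, map each prompt variable to a trace field, and hand the result to the judge. The sketch below is a simplified mental model, assuming flat dict-shaped traces, exact-match filters, and a stubbed judge function; none of these names are platform APIs.

```python
def matches(trace: dict, filters: dict) -> bool:
    """Apply target filters (e.g. environment, trace name) as exact matches."""
    return all(trace.get(key) == value for key, value in filters.items())

def run_evaluator(traces, filters, mapping, judge):
    """Score each matched trace; `mapping` maps prompt variables to trace fields."""
    results = []
    for trace in traces:
        if not matches(trace, filters):
            continue
        variables = {var: trace[field] for var, field in mapping.items()}
        results.append({"trace_id": trace["id"], "score": judge(variables)})
    return results

traces = [
    {"id": "t1", "environment": "prod", "input": "hi", "output": "hello!"},
    {"id": "t2", "environment": "dev", "input": "yo", "output": "hey"},
]
results = run_evaluator(
    traces,
    filters={"environment": "prod"},
    mapping={"query": "input", "generation": "output"},
    judge=lambda v: 1.0 if v["generation"] else 0.0,  # stub judge model
)
```

The Preview Sample Matched Traces step corresponds to inspecting which traces pass `matches` before committing to a run, which is cheap insurance against scoring the wrong subset.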