Added in 2026.12 (beta)
This feature is in beta. We encourage you to try it out and provide us with feedback.
Custom Evaluation lets you define your own scoring criteria in addition to those provided by Basic Analysis. Use it to evaluate behavior specific to your business domain, compliance requirements, or product context.
[Figure: Custom Evaluation Dashboard Overview]

When to Use Custom Evaluation

  • When predefined criteria aren’t enough — evaluate business-specific behaviors not covered by Basic Analysis.
  • For compliance monitoring — create criteria to track whether AI Agents follow regulatory or policy requirements specific to your industry.
  • After launching new products or services — create criteria to measure how well the AI Agent handles new offerings.
  • During A/B testing — compare AI Agent behavior across different prompt or instruction variants.

Restrictions

  • You can add up to 10 custom evaluation criteria.

Configuration

  1. In the left-side menu of the Insights interface, go to Configuration.
  2. In the Custom Criteria section, click + Add Custom Criterion and select a widget from the Widget Type list:
    A binary criterion has a pass or fail outcome. Use it for simple checks where you want to know whether a specific behavior occurred. It has the following fields:
    • Title — a short display name for the criterion. For example, Accurate Product Information.
    • Instructions for the LLM — enter a statement that the LLM must evaluate and return as pass or fail. For example, The agent provides accurate information about financial products, rates, fees, or terms without contradicting known product documentation.
    The widget is rendered as a ring chart on the Custom Evaluation dashboard. The chart shows the percentage of conversations that passed this criterion and the percentage that failed it.
  3. Fill in the required fields for the selected widget type, then save your changes.
[Figure: Custom Criteria Config]
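
To make the pieces of a binary criterion concrete, here is a minimal Python sketch. It is illustrative only: criteria are configured through the Insights interface as described above, and the names `PassFailCriterion`, `add_criterion`, and `pass_fail_shares` are hypothetical, not part of any product API. The sketch models the two fields (Title and Instructions for the LLM), the limit of 10 criteria from the Restrictions section, and the pass/fail percentages that a ring chart would display.

```python
from dataclasses import dataclass

MAX_CUSTOM_CRITERIA = 10  # limit stated in the Restrictions section


@dataclass(frozen=True)
class PassFailCriterion:
    """Hypothetical model of a binary criterion (not a product API)."""

    title: str
    instructions: str  # the statement the LLM evaluates as pass or fail

    def __post_init__(self) -> None:
        # Both fields are treated as required in this sketch.
        if not self.title.strip():
            raise ValueError("Title is required")
        if not self.instructions.strip():
            raise ValueError("Instructions for the LLM are required")


def add_criterion(
    existing: list[PassFailCriterion], new: PassFailCriterion
) -> list[PassFailCriterion]:
    """Return a new list with `new` appended, enforcing the 10-criterion limit."""
    if len(existing) >= MAX_CUSTOM_CRITERIA:
        raise ValueError(f"At most {MAX_CUSTOM_CRITERIA} custom criteria are allowed")
    return [*existing, new]


def pass_fail_shares(outcomes: list[bool]) -> tuple[float, float]:
    """Return (pass %, fail %) for a list of outcomes, where True means pass."""
    if not outcomes:
        return (0.0, 0.0)
    pass_pct = 100.0 * sum(outcomes) / len(outcomes)
    return (pass_pct, 100.0 - pass_pct)


criterion = PassFailCriterion(
    title="Accurate Product Information",
    instructions=(
        "The agent provides accurate information about financial products, "
        "rates, fees, or terms without contradicting known product documentation."
    ),
)
criteria = add_criterion([], criterion)
print(pass_fail_shares([True, True, True, False]))
```

Writing the instruction as a single statement that is either true or false for a conversation keeps the pass and fail percentages easy to interpret.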

Tips for Writing LLM Instructions

The quality of your custom criteria results depends directly on how clearly you write the LLM instructions.
  • Be specific. Vague instructions produce inconsistent scores. Instead of “Was the agent helpful?”, describe exactly what behavior to look for and what each score level means.
  • Define the scale explicitly. For numeric and scale-based criteria, always describe what the lowest and highest scores represent.
  • Use plain language. Write instructions the way you would explain the task to a colleague, not in technical jargon.
  • One criterion, one thing. Each criterion should evaluate a single, well-defined behavior. Combining multiple checks in one criterion makes scores harder to interpret.
  • Test with known conversations. After adding a custom criterion, run a manual analysis on conversations where you already know the expected outcome to verify the scoring behaves as intended.
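
The last tip can be turned into a simple check. The Python sketch below assumes you have written down the expected pass or fail outcome for a handful of conversations and then noted the outcome the criterion actually produced for each one. How you collect those outcomes (for example, by reading them from the dashboard) is outside the sketch, and the function name and conversation IDs are hypothetical.

```python
def compare_with_expected(
    expected: dict[str, bool], actual: dict[str, bool]
) -> tuple[float, list[str]]:
    """Compare expected and actual pass/fail outcomes per conversation ID.

    Returns the agreement rate (0.0 to 1.0) over conversations present in
    both mappings, plus the sorted IDs where the outcomes differ.
    """
    shared = sorted(expected.keys() & actual.keys())
    if not shared:
        return (0.0, [])
    mismatches = [cid for cid in shared if expected[cid] != actual[cid]]
    agreement = 1.0 - len(mismatches) / len(shared)
    return (agreement, mismatches)


# Hypothetical data: True means pass, False means fail.
expected = {"conv-1": True, "conv-2": False, "conv-3": True, "conv-4": True}
actual = {"conv-1": True, "conv-2": True, "conv-3": True, "conv-4": True}

agreement, mismatches = compare_with_expected(expected, actual)
print(f"Agreement: {agreement:.0%}; review: {mismatches}")
```

Conversations listed as mismatches are good candidates for rereading alongside the instruction text, since they often reveal wording that is vague or that combines more than one check.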
Last modified on June 9, 2026