This guide is also available as a Jupyter Notebook.
Using evaluations with the SDK
In this cookbook we show how to interact with evaluations in Agenta programmatically, using the SDK (the same operations can also be performed through the raw API).
We will do the following:
- Create a test set
- Create and configure an evaluator
- Run an evaluation
- Retrieve the results of evaluations
We assume that you have already created an LLM application and variants in Agenta.
Architectural Overview:
In this scenario, evaluations are executed on the Agenta backend. Specifically, Agenta invokes the LLM application for each row in the test set and subsequently processes the output using the designated evaluator. This operation is managed through Celery tasks. The interactions with the LLM application are asynchronous, batched, and include retry mechanisms. Additionally, the batching configuration can be adjusted to avoid exceeding the rate limits imposed by the LLM provider.
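To make the batching and retry behavior concrete, here is an illustrative sketch of that loop. It is not actual Agenta code: call_llm_app and run_evaluators are hypothetical stand-ins for the Celery tasks the backend dispatches, and the real calls are issued asynchronously rather than sequentially as written here.
import time

def call_llm_app(row):            # hypothetical stand-in for the real LLM invocation
    return f"output for {row}"

def run_evaluators(output, row):  # hypothetical stand-in for the evaluator step
    pass

def evaluation_loop(rows, batch_size=10, max_retries=3, retry_delay=2, delay_between_batches=5):
    # Illustrative pseudologic only; the backend manages this via Celery tasks.
    for start in range(0, len(rows), batch_size):
        for row in rows[start:start + batch_size]:  # the backend runs these calls in parallel
            for _ in range(max_retries):
                try:
                    output = call_llm_app(row)
                    run_evaluators(output, row)
                    break
                except Exception:
                    time.sleep(retry_delay)          # back off before retrying a failed call
        time.sleep(delay_between_batches)            # pause between batches to respect provider rate limits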
Setup
! pip install -U agenta
Configuration Setup
# Assuming an application has already been created through the user interface, you will need to obtain the application ID.
# In this example we will use the default template single_prompt which has the prompt "Determine the capital of {country}"
# You can find the application ID in the URL. For example, in the URL https://cloud.agenta.ai/apps/666dde95962bbaffdb0072b5/playground?variant=app.default, the application ID is `666dde95962bbaffdb0072b5`.
from agenta.client.backend.client import AgentaApi
# You can create the API key under the settings page. If you are using the OSS version, keep this as an empty string.
api_key = "EUqJGOUu.xxxx"
# Host.
host = "https://cloud.agenta.ai"
# Initialize the client
client = AgentaApi(base_url=host + "/api", api_key=api_key)
# Let's list the applications (this requires the client initialized above)
client.apps.list_apps()
app_id = "667d8cfad1812781f7e375d9"
Create a test set
from agenta.client.backend.types.new_testset import NewTestset
# Note: the second row deliberately contains a wrong capital; we will correct it with an update below.
csvdata = [
    {"country": "france", "capital": "Paris"},
    {"country": "Germany", "capital": "paris"},
]
response = client.testsets.create_testset(request=NewTestset(name="test set", csvdata=csvdata))
test_set_id = response.id
# Let's now update the test set to correct the second row
csvdata = [
    {"country": "france", "capital": "Paris"},
    {"country": "Germany", "capital": "Berlin"},
]
client.testsets.update_testset(testset_id=test_set_id, request=NewTestset(name="test set", csvdata=csvdata))
Create evaluators
# Create an evaluator that performs an exact match comparison on the 'capital' column
# You can find the list of evaluator keys and their configuration options in https://github.com/Agenta-AI/agenta/blob/main/agenta-backend/agenta_backend/resources/evaluators/evaluators.py
response = client.evaluators.create_new_evaluator_config(app_id=app_id, name="capital_evaluator", evaluator_key="auto_exact_match", settings_values={"correct_answer_key": "capital"})
exact_match_eval_id = response.id
code_snippet = """
from typing import Dict
def evaluate(
app_params: Dict[str, str],
inputs: Dict[str, str],
output: str, # output of the llm app
datapoint: Dict[str, str] # contains the testset row
) -> float:
if output and output[0].isupper():
return 1.0
else:
return 0.0
"""
response = client.evaluators.create_new_evaluator_config(app_id=app_id, name="capital_letter_evaluator", evaluator_key="auto_custom_code_run", settings_values={"code": code_snippet})
letter_match_eval_id = response.id
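You can also sanity-check the custom snippet locally by executing the string and calling the function directly. This runs entirely on your machine (no Agenta API involved); the sample row below is made up for illustration.
namespace = {}
exec(code_snippet, namespace)
score = namespace["evaluate"](
    app_params={},
    inputs={"country": "Germany"},
    output="Berlin",
    datapoint={"country": "Germany", "capital": "Berlin"},
)
print(score)  # expected 1.0, since "Berlin" starts with an uppercase letter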
# Get the list of all evaluator configurations for this app
client.evaluators.get_evaluator_configs(app_id=app_id)
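For a quick overview you can iterate over the returned configurations. This assumes each config object exposes name and id attributes; inspect one object if your SDK version returns a different shape.
configs = client.evaluators.get_evaluator_configs(app_id=app_id)
for config in configs:
    print(config.name, config.id)  # assumed attributes; adjust if needed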
Run an evaluation
response = client.apps.list_app_variants(app_id=app_id)
print(response)
myvariant_id = response[0].variant_id
# Run an evaluation
from agenta.client.backend.types.llm_run_rate_limit import LlmRunRateLimit
response = client.evaluations.create_evaluation(
    app_id=app_id,
    variant_ids=[myvariant_id],
    testset_id=test_set_id,
    evaluators_configs=[exact_match_eval_id, letter_match_eval_id],
    rate_limit=LlmRunRateLimit(
        batch_size=10,            # number of rows to call in parallel
        max_retries=3,            # max number of times to retry a failed llm call
        retry_delay=2,            # delay before retrying a failed llm call
        delay_between_batches=5,  # delay between batches
    ),
)
print(response)
# Check the status (replace the id below with the evaluation id returned by create_evaluation above)
client.evaluations.fetch_evaluation_status('667d98fbd1812781f7e3761a')
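If you want to wait until the run completes before fetching results, a simple polling loop like the sketch below works. The completion check is deliberately crude because the exact shape of the status payload may vary across SDK versions; inspect one response and adjust the condition accordingly.
import time

evaluation_id = '667d98fbd1812781f7e3761a'  # replace with your evaluation id
while True:
    status = client.evaluations.fetch_evaluation_status(evaluation_id)
    print(status)
    if "finished" in str(status).lower() or "failed" in str(status).lower():  # crude check; adjust to the actual status field
        break
    time.sleep(5)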
# fetch the overall results
response = client.evaluations.fetch_evaluation_results('667d98fbd1812781f7e3761a')
results = [(evaluator["evaluator_config"]["name"], evaluator["result"]) for evaluator in response["results"]]
# fetch the detailed results
client.evaluations.fetch_evaluation_scenarios(evaluations_ids='667d98fbd1812781f7e3761a')