This guide is also available as a Jupyter Notebook.
Using evaluations with the SDK
In this cookbook we show how to interact with evaluations in Agenta programmatically, using the SDK (the same operations can also be performed through the raw API).
We will do the following:
- Create a test set
- Create and configure an evaluator
- Run an evaluation
- Retrieve the results of evaluations
We assume that you have already created an LLM application and variants in Agenta.
Architectural Overview:
In this scenario, evaluations are executed on the Agenta backend. Specifically, Agenta invokes the LLM application for each row in the test set and subsequently processes the output using the designated evaluator. This operation is managed through Celery tasks. The interactions with the LLM application are asynchronous, batched, and include retry mechanisms. Additionally, the batching configuration can be adjusted to avoid exceeding the rate limits imposed by the LLM provider.
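To make the batching and retry behavior concrete, here is an illustrative sketch of that loop. It is not actual Agenta code: call_llm_app and run_evaluators are hypothetical stand-ins for the Celery tasks the backend dispatches, and the real calls are issued asynchronously rather than sequentially as written here.
import time

def call_llm_app(row):            # hypothetical stand-in for the real LLM invocation
    return f"output for {row}"

def run_evaluators(output, row):  # hypothetical stand-in for the evaluator step
    pass

def evaluation_loop(rows, batch_size=10, max_retries=3, retry_delay=2, delay_between_batches=5):
    # Illustrative pseudologic only; the backend manages this via Celery tasks.
    for start in range(0, len(rows), batch_size):
        for row in rows[start:start + batch_size]:  # the backend runs these calls in parallel
            for _ in range(max_retries):
                try:
                    output = call_llm_app(row)
                    run_evaluators(output, row)
                    break
                except Exception:
                    time.sleep(retry_delay)          # back off before retrying a failed call
        time.sleep(delay_between_batches)            # pause between batches to respect provider rate limits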
Setup
! pip install -U agenta
Configuration Setup
# Assuming an application has already been created through the user interface, you will need to obtain the application ID.
# In this example we will use the default template single_prompt which has the prompt "Determine the capital of {country}"
# You can find the application ID in the URL. For example, in the URL https://cloud.agenta.ai/apps/666dde95962bbaffdb0072b5/playground?variant=app.default, the application ID is `666dde95962bbaffdb0072b5`.
from agenta.client.backend.client import AgentaApi
# You can create the API key under the settings page. If you are using the OSS version, keep this as an empty string.
api_key = "EUqJGOUu.xxxx"
# Host.
host = "https://cloud.agenta.ai"
# Initialize the client
client = AgentaApi(base_url=host + "/api", api_key=api_key)
# Let's list the applications (this requires the client initialized above)
client.apps.list_apps()
app_id = "667d8cfad1812781f7e375d9"
Create a test set
from agenta.client.backend.types.new_testset import NewTestset
# Note: the second row deliberately contains a wrong capital; we will correct it with an update below.
csvdata = [
    {"country": "france", "capital": "Paris"},
    {"country": "Germany", "capital": "paris"},
]
response = client.testsets.create_testset(request=NewTestset(name="test set", csvdata=csvdata))
test_set_id = response.id
# Let's now update the test set to correct the second row
csvdata = [
    {"country": "france", "capital": "Paris"},
    {"country": "Germany", "capital": "Berlin"},
]
client.testsets.update_testset(testset_id=test_set_id, request=NewTestset(name="test set", csvdata=csvdata))
Create evaluators
# Create an evaluator that performs an exact match comparison on the 'capital' column
# You can find the list of evaluator keys and their configuration options in https://github.com/Agenta-AI/agenta/blob/main/agenta-backend/agenta_backend/resources/evaluators/evaluators.py
response = client.evaluators.create_new_evaluator_config(app_id=app_id, name="capital_evaluator", evaluator_key="auto_exact_match", settings_values={"correct_answer_key": "capital"})
exact_match_eval_id = response.id
code_snippet = """
from typing import Dict
def evaluate(
app_params: Dict[str, str],
inputs: Dict[str, str],
output: str, # output of the llm app
datapoint: Dict[str, str] # contains the testset row
) -> float:
if output and output[0].isupper():
return 1.0
else:
return 0.0
"""
response = client.evaluators.create_new_evaluator_config(app_id=app_id, name="capital_letter_evaluator", evaluator_key="auto_custom_code_run", settings_values={"code": code_snippet})
letter_match_eval_id = response.id
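You can also sanity-check the custom snippet locally by executing the string and calling the function directly. This runs entirely on your machine (no Agenta API involved); the sample row below is made up for illustration.
namespace = {}
exec(code_snippet, namespace)
score = namespace["evaluate"](
    app_params={},
    inputs={"country": "Germany"},
    output="Berlin",
    datapoint={"country": "Germany", "capital": "Berlin"},
)
print(score)  # expected 1.0, since "Berlin" starts with an uppercase letter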
# Get the list of all evaluator configurations for this app
client.evaluators.get_evaluator_configs(app_id=app_id)
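For a quick overview you can iterate over the returned configurations. This assumes each config object exposes name and id attributes; inspect one object if your SDK version returns a different shape.
configs = client.evaluators.get_evaluator_configs(app_id=app_id)
for config in configs:
    print(config.name, config.id)  # assumed attributes; adjust if needed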
Run an evaluation
response = client.apps.list_app_variants(app_id=app_id)
print(response)
myvariant_id = response[0].variant_id
# Run an evaluation
from agenta.client.backend.types.llm_run_rate_limit import LlmRunRateLimit
response = client.evaluations.create_evaluation(
    app_id=app_id,
    variant_ids=[myvariant_id],
    testset_id=test_set_id,
    evaluators_configs=[exact_match_eval_id, letter_match_eval_id],
    rate_limit=LlmRunRateLimit(
        batch_size=10,            # number of rows to call in parallel
        max_retries=3,            # max number of times to retry a failed llm call
        retry_delay=2,            # delay before retrying a failed llm call
        delay_between_batches=5,  # delay between batches
    ),
)
print(response)
# Check the status (replace the id below with the evaluation id returned by create_evaluation above)
client.evaluations.fetch_evaluation_status('667d98fbd1812781f7e3761a')
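If you want to wait until the run completes before fetching results, a simple polling loop like the sketch below works. The completion check is deliberately crude because the exact shape of the status payload may vary across SDK versions; inspect one response and adjust the condition accordingly.
import time

evaluation_id = '667d98fbd1812781f7e3761a'  # replace with your evaluation id
while True:
    status = client.evaluations.fetch_evaluation_status(evaluation_id)
    print(status)
    if "finished" in str(status).lower() or "failed" in str(status).lower():  # crude check; adjust to the actual status field
        break
    time.sleep(5)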
# fetch the overall results
response = client.evaluations.fetch_evaluation_results('667d98fbd1812781f7e3761a')
results = [(evaluator["evaluator_config"]["name"], evaluator["result"]) for evaluator in response["results"]]
# fetch the detailed results
client.evaluations.fetch_evaluation_scenarios(evaluations_ids='667d98fbd1812781f7e3761a')