Skip to main content

Quick Introduction

DeepEval is an open-source evaluation framework for LLMs. DeepEval makes it extremely easy to build and iterate on LLM (applications) and was built with the following principles in mind:

  • Easily "unit test" LLM outputs in a similar way to Pytest.
  • Plug-and-use 14+ LLM-evaluated metrics, most with research backing.
  • Synthetic dataset generation with state-of-the-art evolution techniques.
  • Metrics are simple to customize and covers all use cases.
  • Real-time evaluations in production.

Additionally, DeepEval integrates natively with Confident AI, which allows anyone to evaluate, regression test, and monitor LLM applications on the cloud.

Delivered by
Confident AI

Setup A Python Environment

Go to the root directory of your project and create a virtual environment (if you don't already have one). In the CLI, run:

python3 -m venv venv
source venv/bin/activate

Installation

In your newly created virtual environment, run:

pip install -U deepeval

deepeval runs evaluations locally on your enviornment. However, you can also run evaluations directly on Confident AI, an all in one evaluation platform:

deepeval login
tip

Running evaluations on Confident AI's infrastructure may use Confident AI's proprietary models instead. Contact us if you're dealing with sensitive data that has to reside in your private VPCs.

Create Your First Test Case

Run touch test_example.py to create a test file in your root directory. Open test_example.py and paste in your first test case:

test_example.py
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def test_answer_relevancy():
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
test_case = LLMTestCase(
input="What if these shoes don't fit?",
# Replace this with the actual output of your LLM application
actual_output="We offer a 30-day full refund at no extra cost.",
retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."]
)
assert_test(test_case, [answer_relevancy_metric])

Run deepeval test run from the root directory of your project:

deepeval test run test_example.py

Congratulations! Your test case should have passed ✅ Let's breakdown what happened.

  • The variable input mimics a user input, and actual_output is a placeholder for what your application's supposed to output based on this input.
  • The variable retrieval_context contains the retrieved context from your knowledge base, and AnswerRelevancyMetric(threshold=0.5) is an default metric provided by deepeval for you to evaluate your LLM output's relevancy based on the provided retrieval context.
  • All metric scores range from 0 - 1, which the threshold=0.5 threshold ultimately determines if your test have passed or not.
info

You'll need to set your OPENAI_API_KEY as an enviornment variable before running the AnswerRelevancyMetric, since the AnswerRelevancyMetric is an LLM-evaluated metric.

To use ANY custom LLM of your choice, check out this part of the docs.

You can also save test results locally for each test run. Simply set the DEEPEVAL_RESULTS_FOLDER environment variable to your relative path of choice:

export DEEPEVAL_RESULTS_FOLDER="./data"

Create Your First Metric

Did you know?

You can now run all of deepeval's metrics for free on Confident AI's infrastructure. Start by following this guide.

deepeval provides two types of LLM evaluation metrics to evaluate LLM outputs: plug-and-use default metrics, and custom metrics for any evaluation criteria.

Default Metrics

deepeval offers 14+ research backed default metrics covering a wide range of use-cases (such as RAG and fine-tuning). To create a metric, simply import from the deepeval.metrics module:

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

test_case = LLMTestCase(input="...", actual_output="...")
relevancy_metric = AnswerRelevancyMetric(threshold=0.5)

relevancy_metric.measure(test_case)
print(relevancy_metric.score, relevancy_metric.reason)

Note that you can run a metric as a standalone or as part of a test run as shown in previous sections.

info

All default metrics are evaluated using LLMs, and you can use ANY LLM of your choice. For more information, visit the metrics introduction section.

Custom Metrics

deepeval provides G-Eval, a state-of-the-art LLM evaluation framework for anyone to create a custom LLM-evaluated metric using natural language. Here's an example:

from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

test_case = LLMTestCase(input="...", actual_output="...", expected_output="...")
correctness_metric = GEval(
name="Correctness",
criteria="Correctness - determine if the actual output is correct according to the expected output.",
evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
strict_mode=True
)

correctness_metric.measure(test_case)
print(correctness_metric.score, correctness_metric.reason)

Under the hood, deepeval first generates a series of evaluation steps, before using these steps in conjuction with information in an LLMTestCase for evaluation. For more information, visit the G-Eval documentation page.

tip

If you wish to customize your metrics a bit more, you can choose to implement your own metric. You can find a comprehensive step-by-step guide here, but here's a quick example of how you can create a metric that is NOT evaluated using LLMs:

from deepeval.scorer import Scorer
from deepeval.metrics import BaseMetric

class RougeMetric(BaseMetric):
def __init__(self, threshold: float = 0.5):
self.threshold = threshold
self.scorer = Scorer()

def measure(self, test_case: LLMTestCase):
self.score = self.scorer.rouge_score(
prediction=test_case.actual_output,
target=test_case.expected_output,
score_type="rouge1"
)
self.success = self.score >= self.threshold
return self.score

# Async implementation of measure(). If async version for
# scoring method does not exist, just reuse the measure method.
async def a_measure(self, test_case: LLMTestCase):
return self.measure(test_case)

def is_successful(self):
return self.success

@property
def __name__(self):
return "Rouge Metric"

#####################
### Example Usage ###
#####################
test_case = LLMTestCase(input="...", actual_output="...", expected_output="...")
metric = RougeMetric()

metric.measure(test_case)
print(metric.is_successful())

You'll notice that although not documented, deepeval additionally offers a scorer module for more traditional NLP scoring method and can be found here.

You can also create a custom metric to combine several different metrics into one. For example. combining the AnswerRelevancyMetric and FaithfulnessMetric to test whether an LLM output is both relevant and faithful (ie. not hallucinating).

Measure Multiple Metrics At Once

To avoid redundant code, deepeval offers an easy way to apply as many metrics as you wish for a single test case.

test_example.py
...

def test_everything():
assert_test(test_case, [answer_relevancy_metric, correctness_metric])

In this scenario, test_everything only passes if all metrics are passing. Run deepeval test run again to see the results:

deepeval test run test_example.py
info

deepeval optimizes evaluation speed by running all metrics for each test case concurrently.

Create Your First Dataset

A dataset in deepeval, or more specifically an evaluation dataset, is simply a collection of LLMTestCases and/or Goldens.

note

A Golden is simply an LLMTestCase with no actual_output, and it is an important concept if you're looking to generate LLM outputs at evaluation time. To learn more about Goldens, click here.

To create a dataset, simply initialize an EvaluationDataset with a list of LLMTestCases or Goldens:

from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

first_test_case = LLMTestCase(input="...", actual_output="...")
second_test_case = LLMTestCase(input="...", actual_output="...")

dataset = EvaluationDataset(test_cases=[first_test_case, second_test_case])

Then, using deepeval's Pytest integration, you can utilize the @pytest.mark.parametrize decorator to loop through and evaluate your dataset.

test_dataset.py
import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
...

# Loop through test cases using Pytest
@pytest.mark.parametrize(
"test_case",
dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
assert_test(test_case, [answer_relevancy_metric])
tip

You can also evaluate entire datasets without going through the CLI (if you're in a notebook environment):

from deepeval import evaluate
...

evaluate(dataset, [answer_relevancy_metric])

Additionally, you can avoid re-evaluating test cases by reading from deepeval's local cache using the optional -c flag:

deepeval test run test_dataset.py -c

Or run test cases in parallel by using the optional -n flag followed by a number (that determines the number of processes that will be used) when executing deepeval test run:

deepeval test run test_dataset.py -n 2

Visit the evaluation introduction section to learn about the different types of flags you can use with the deepeval test run command.

Generate Synthetic Datasets

deepeval offers a synthetic data generator that uses state-of-the-art evolution techniques to make synthetic (aka. AI generated) datasets realistic. This is especially helpful if you don't have a prepared evaluation dataset, as it will help you generate the initiate testing data you need to get up and running with evaluation.

caution

You should aim to manually inspect and edit any synthetic data where possible.

Simply supply a list of local document paths to generate a synthetic dataset from your knowledge base.

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.generate_goldens_from_docs(
document_paths=['example.txt', 'example.docx', 'example.pdf'],
max_goldens_per_document=10
)

After you're done with generating, simply evaluate your dataset as shown above. Note that deepeval's synthesizer does NOT generate actual_outputs for each golden. This is because actual_outputs are meant to be generated by your LLM (application), not deepeval's synthesizer.

Visit the synthesizer section to learn how to customize deepeval's synthetic data generation capabilities to your needs.

note

Remember, a Golden is basically an LLMTestCase but with no actual_output.

Red Team Your LLM application

LLM red teaming refers to the process of attacking your LLM application to expose any safety risks it may have, including but not limited to vulnerabilities such as bias, racism, encouraging illegal actions, etc. It is an automated way to test for LLM safety by prompting it with adversarial attacks, which will be all taken care of by deepeval.

info

Red teaming is a different form of testing from what you've seen above because while standard LLM evaluation tests your LLM on its intended functionality, red teaming is meant to test your LLM application against, intentional, adversarial attacks from malicious users.

Here's how you can scan your LLM for vulnerabilities in a few lines of code using deepeval's RedTeamer, an extremely powerful tool to automatically scan for 40+ vulnerabilities. First instantiate a RedTeamer instance:

from deepeval.red_teaming import RedTeamer

red_teamer = RedTeamer(
# describe purpose of your LLM application
target_purpose="Provide financial advice related to personal finance and market trends.",
# supply system prompt template
target_system_prompt="You are a financial assistant designed to help users with financial planning"
)

Then, supply the target LLM application you wish to scan:

from deepeval.red_teaming import AttackEnhancement, Vulnerability
...

results = red_teamer.scan(
# your target LLM of type DeepEvalBaseLLM
target_model=TargetLLM(),
attacks_per_vulnerability=5,
vulnerabilities=[v for v in Vulnerability],
attack_enhancements={
AttackEnhancement.BASE64: 0.25,
AttackEnhancement.GRAY_BOX_ATTACK: 0.25,
AttackEnhancement.JAILBREAK_CRESCENDO: 0.25,
AttackEnhancement.MULTILINGUAL: 0.25,
},
)
print("Red Teaming Results: ", results)

deepeval's RedTeamer is highly customizable and offers a range of different advanced red teaming capabilities for anyone to leverage. We highly recommend you read more about the RedTeamer at the red teaming section.

tip

The TargetLLM() you see being provided as argument to the target_model parameter is of type DeepEvalBaseLLM, which is basically a wrapper to wrap your LLM application into deepeval's ecosystem for easy evaluating. You can learn how to create a custom TargetLLM in a few lines of code here.

And that's it! You now know how to not only test your LLM application for its functionality, but also for any underlying risks and vulnerabilities it may expose and make your systems susceptible to malicious attacks.

Using Confident AI

If you have reached this point, you've likely ran deepeval test run multiple times. To keep track of all future evaluation results created by deepeval, login to Confident AI by running the following command in the CLI:

deepeval login
note

Follow the instructions displayed on the CLI to create an account, get your Confident API key, and paste it in the CLI. You should see a message congratulating your successful login.

Confident AI is an all-in-one platform that unlocks deepeval's full potential, and allows anyone to easily:

  • Run evaluations directly on Confident AI (instead of locally via deepeval)
  • Keep track of and debug historical test run results
  • Discover optimal hyperparameters, such as the best models and prompt templates to use
  • Centralize and synthesize evaluation datasets on the cloud
  • Safeguard against breaking changes in CI/CD pipelines
  • Monitor and trace LLM applications
  • Run real-time evaluations in production, with custom metrics
tip

Click here for the full documentation on using Confident AI through deepeval.

By logging in, you can:

  • Run evaluations directly on Confident AI's infrastructure (recommended)
  • Run evaluations locally via deepeval and have it sent to Confident AI upon completion

Both workflows would allow you to view, share, edit, and comment on resulting test runs on Confident AI.

info

If you do decide to stick to running evaluations locally instead, you'll be able to view test run results on Confident AI each time you execute a test run:

test_example.py
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def test_answer_relevancy():
test_case = LLMTestCase(input="...", actual_output="...")
assert_test(test_case, [AnswerRelevancyMetric()])
deepeval test run test_example.py

Once evaluation has finished, the test run result will be automatically sent to Confident AI. You should now see a link being returned upon test completion, and you can paste it in your browser to view results.

Running Evaluations On Confident AI

Confident AI allows anyone to define (custom) LLM evaluation metrics and run evaluations scalably on the cloud. To begin, all you need to do is create an experiment on Confident AI and send over test cases with required fields such as actual_outputs generated by your LLM application that you wish to be evaluated.

note

All compute and LLMs required for evaluation are provided by Confident AI.

from deepeval import confident_evaluate
from deepeval.test_case import LLMTestCase

confident_evaluate(
experiment_name="Your Experiment Name",
test_cases=[LLMTestCase(...)]
)

Managing and Generating Datasets

deepeval allows you to push and pull datasets stored on Confident AI, which is similar to pushing and pulling a repo from GitHub. To push a dataset to Confident AI, create an EvaluationDataset instance, populate it with test cases, and use the push() method:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset(test_cases=[...])
dataset.push(alias="My First Dataset")

You can now edit, comment on, and manage test cases on the cloud instead of locally in a CSV file.

note

You can also create a dataset directly on Confident AI without going through deepeval.

To pull the dataset for evaluation, use the pull() method and evaluate it as shown in previous sections:

from deepeval import confident_evaluate
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My First Dataset")

confident_evaluate(experiment_name="Your Experiment Name", dataset)
tip

In reality, you'll often times want to process the pulled dataset before evaluating it, since test cases in a dataset are stored as Goldens, which might not always be ready for evaluation (ie. missing an actual_output).

To see a concrete example and a more detailed explanation, visit the evaluating datasets section.

Optimizing Hyperparameters

Confident AI helps you easily discover the optimal set of hyperparameters, which in deepeval refers to properties such as the models, prompt templates, etc. used when generating the actual_outputs for each LLMTestCase.

caution

We're currently in the process of adding greater support for this feature but for now this section is only applicable if you're running evaluations locally using deepeval.

To discover which set of hyperparameters gives you the best evaluation metrics results, use the @deepeval.log_hyperparameters decorator:

test_example.py
import pytest
import deepeval
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

dataset = EvaluationDataset()
dataset.pull(alias="My First Dataset")

# Loop through test cases using Pytest
@pytest.mark.parametrize(
"test_case",
dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
assert_test(test_case, [answer_relevancy_metric])


# You should aim to make these values dynamic
@deepeval.log_hyperparameters(model="gpt-4", prompt_template="...")
def hyperparameters():
# Return a dict to log additional hyperparameters.
# You can also return an empty dict {} if there's no additional parameters to log
return {
"temperature": 1,
"chunk size": 500
}
note

The hyperparameters() function DOESN'T necessarily have to be named 'hyperparameters'. All you need in order to log hyperparameters on Confident AI is to decorate a function that returns a valid dictionary.

Once you've added this decorator, execute test_example.py once more:

deepeval test run test_example.py

The @deepeval.log_hyperparameters decorator helps Confident AI keep track of the hyperparameters used when generating the actual_outputs for a particular test run. This allows you to identify which combination of hyperparameters gives the best evaluation metric results over time.

tip

You can also log hyperparameters via the evaluate() function:

from deepeval import evaluate
...

evaluate(
test_cases=[...],
metrics=[...],
hyperparameters={"model": "gpt4o", "prompt template": "..."}
)

Feel free to execute this in a nested for loop to figure out the best hyperparameter combination!

Monitoring LLM Responses

Confident AI allows anyone to monitor, trace, and evaluate LLM responses in real-time. A single API request is all that's required, and this would typically happen at your servers right before returning an LLM response to your users:

import deepeval

# At the end of your LLM call
response_id = deepeval.monitor(
event_name="Chatbot",
model="gpt-4",
input="Example input.",
response="Example response.",
retrieval_context=["..."]
)

Confident AI will automatically run evaluations for each incoming LLM response on metrics you have turned on. Simply head over to the 'Project Details' page on Confident AI to turn on these real-time metrics.

info

You can also trace LLM applications on Confident AI. Learn more about how to setup tracing here.

Sending Human Feedback

Confident AI allows you to send human feedback on LLM responses monitored in production, all via one API call by using the previously returned response_id from deepeval.monitor():

import deepeval
...

deepeval.send_feedback(
response_id=response_id,
provider="user",
rating=7,
explanation="Although the response is accurate, I think the spacing makes it hard to read."
)

Confident AI allows you to keep track of either "user" feedback (ie. feedback provided by end users interacting with your LLM application), or "reviewer" feedback (ie. feedback provided by reviewers manually checking the quality of LLM responses in production).

note

To learn more, visit the human feedback section page.

Full Example

You can find the full example here on our Github.