README for Testing Generative AI Agent Applications

This repo contains the code and documentation for the AI Alliance user guide and tools for Testing Generative AI Agent Applications, which explores the inherent difficulties for enterprise developers who need to write the same kinds of repeatable, reliable, and automated tests for AI behaviors that they are accustomed to writing for "traditional" code. This is much more challenging with generative AI models involved, because non-Gen AI software is (mostly) deterministic and hence predictable, while Gen AI outputs are stochastic, governed by a random probability model, so how do you write a predictable test for that kind of behavior??

This documentation is the published user guide, with the GitHub Pages sources in this repo's docs folder and the example code in the src folder. A Makefile provides convenient make targets for doing most tasks.

Next, we discuss working with the tools and example application for this project. You can also find this information in the user guide's Using the Healthcare ChatBot Example Application and Tools chapter. For information about working with the user guide website content, see GITHUB_PAGES.md.

Warning

DISCLAIMER: In this guide, we develop a healthcare ChatBot example application, chosen because it is a worst case design challenge. Needless to say, but we will say it anyway, a ChatBot is notoriously difficult to implement successfully, because of the free form prompts from users and the many possible responses models can generate. A healthcare ChatBot is even more challenging because of the risk it could provide bad responses that lead to poor patient outcomes, if applied. Hence, this example is only suitable for educational purposes. It is not at all suitable for use in real healthcare applications and it must not be used in such a context. Use it at your own risk.

The User Guide Tools and Exercises vs. the ChatBot Application

In the published user guide, the tools and exercises are written with an emphasis on teaching concepts, with minimal distractions, with less concern about being fully engineered for real-world, enterprise development and deployment, where more sophisticated tools might be needed.

In contrast, the Healthcare ChatBox example application applies the concepts in a more “refined” implementation that also illustrates how existing software development practices and new AI-oriented techniques can be used together in a real software-development project.

Setup

Note

The make target processes discussed below assume you are using a MacOS or Linux shell environment like zsh or bash. However, the tools described below are written in Python, so the commands shown for running them should work on any operating system with minor adjustments, such as paths to files and directories. Let us know about your experiences: issues, discussions.

Whether using the tools or the example application, you start by setting up the required dependencies, etc.

Clone the project repo, change to the ai-application-testing directory, and run the following make command to do “one-time” setup steps:

make one-time-setup

If make won't work on your machine, do the following steps yourself:

Install uv, a Python package manager.
Run the uv setup commands to install dependencies, etc.:

uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

Install ollama for local model inference (optional).

[!TIP]

Try make help for details about the make process. There are also --help options for all the tools discussed below.

For any make target, to see what commands will be executed without running them, pass the -n or --dry-run option to make.

The make targets have only been tested on MacOS. Let us know if they don't work on Linux. The actual tools are written in Python for portability.

The repo README has an Appendix with other tools you might find useful to install.

One of the dependencies managed with uv is LiteLLM, the library we use for flexible invocation of different inference services. All our development has been done using ollama for cost-benefit and flexibility reasons, but it is not required if you want to use an inference provider instead, like OpenAI or Anthropic. In that case, see the LiteLLM usage documentation for how to configure LiteLLM to use your preferred inference service. You will need to modify the Makefile, as discussed in Changing the Default Model Used below.

Pick Your Models

If you use ollama, download the models you want to use. For example:

ollama serve            # in one terminal window
ollama pull gemma4:e4b  # in another terminal window

In our experiments, we used the models shown in Table 1, served by ollama. See also a similar table in the user guide:

Model	Parameters	Memory	Notes
`gemma4:e4b`	8 B	11 GB	Default model used in the `Makefile`. Excellent performance, requiring about 11 GB, so it provides a good balance between performance and efficiency. The larger `gemma4` models available, `26b` and `31b` work even better, but require much more memory.
`gpt-oss:20b`	20 B	14 GB	Excellent performance, with slightly more memory required. However, at this time, this model doesn't work with the agent ChatBot implementation, which uses LangChain's Deep Agents framework. See this LangChain issue for details.
`qwen3.5:35b`	35 B	27 GB	Excellent performance, but requires about 27 GB of memory.
`llama3.2:3B`	3 B	7.5 GB	A small but effective model in the Llama family. Should work on most laptops. A good choice during development when overhead is more important than performance.
`granite4:latest`	3 B	7 GB	Another small model tuned for instruction following and tool calling.
`smollm2:1.7b-instruct-fp16`	1.7 B	5.6 GB	The model family used in Hugging Face's LLM course, which we plan to use to highlight some advanced concepts. The `instruct` label means the model was tuned for improved instruction following, important for ChatBots and other user-facing applications.

Table 1: Models used for our experiments.

Yes, some model names use B and others use b... The Memory sizes are what ollama ps shows when these models are being served on an M1 Max MacBook Pro.

Tip

REQUEST: Let us know which models work well for you!

Note

We provide example results for some of these models in src/data/examples/ollama_chat.

By default, we use gemma4:e4b as our inference model, served by ollama. Previously, we used gpt-oss:20b, but switched due to this LangChain issue. That is what we will show in the examples which follow. Just change the model name as appropriate for your situation. The default model is specified in the Makefile with the variable MODEL, which defaults to ollama_chat/gemma4:e4b. (Note the ollama_chat/ prefix.) All the models listed above are defined in the MODELS variable; there are all-models-* targets that try all of them.

We find that gemma4:e4b, requiring about 11 GB of memory performs reasonably well on a MacBook Pro with an M1 Max chips and 32GB of memory. The slightly larger gpt-oss:20b can be slower, especially if lots of other apps are using significant memory. 48 GB or 64 GB of memory is much better for both models and also supports larger models more easily.

We encourage you to experiment with other model sizes and with different model families. Consider also Quantized versions of models. It is worth the time to experiment with different models to find the ones that work best for your development environment and production deployments.

Changing the Default Model Used

If you don't want to use the default model, gemma4:e4b, served by ollama, you can change the definition of MODEL in the Makefile in one of several ways.

If you want to use a different inference option other than ollama, first see the LiteLLM documentation for information about specifying models for other inference services. In most cases, it will be as simple as changing a few definitions in the Makefile (discussed next).

There are two ways to specify your preferred model in the Makefile:

Edit the Makefile

Edit the Makefile and change the following definitions:

MODEL - e.g., ollama_chat/llama3.2:3B. This will change the default for all invocations of the tools and the example ChatBot app. (The ollama_chat/ or ollama/ prefix is required if you are using ollama, with ollama_chat/ recommended by LiteLLM.) For convenience, we defined several MODEL_* variables for different models, then refer to the one we want when defining MODEL. You can do follow this convention, if desired...
INFERENCE_SERVICE - e.g., openai, anthropic, ollama.
INFERENCE_URL - e.g., http://localhost:11434 for ollama or https://api.openai.com/v1 for OpenAI.
Others? The LiteLLM documentation may tell you to define other variables. You will most likely need an API key or other credentials for hosted services, like OpenAI and Anthropic. Do not put this information in the Makefile! This avoids the risk that you will accidentally commit secrets to a repo. Instead, use an environment variable or other solution described by the LiteLLM documentation.

Override Makefile Definitions on the Command Line

On invocation, you can dynamically change values, such as the MODEL used with ollama. This is the easiest way to do “one-off” experiments with different models served by ollama, for example:

make MODEL=ollama_chat/llama3.2:3B chatbot

If you want to try all the ollama-served models mentioned above with one command, use make all-models-..., where ... is one of the other make targets, like all-code, which runs all the tool invocations for a single model, e.g.,

make all-models-chatbot

You can also change the list of models you regularly want to use by changing the definition of the MODELS variable in the Makefile.

Note

See also the LiteLLM documentation for guidance on any required modifications to the arguments passed in our Python code to the LiteLLM completion function. (Search for response = completion to find all the occurrences.) We plan to implement automatic handling of such changes eventually.

Running the Tools and ChatBot Application with `make`

Now you are set up and you can use make to run the tools discussed and also the complete example ChatBot application. We will discuss the tools first, then the application.

On MacOS and Linux, using make is the easiest way to run the exercises. The actual commands are printed out and we repeat them below for those of you on other platforms. Hence, you can also run the Python tools directly without using make.

Tip

For all of the tool-invocation make commands discussed from now on, you can run each one for all the models by prefixing the target name with all-models-, e.g., all-models-run-tdd-example-refill-chatbot. This target doesn't make sense for for the ChatBot application, which is interactive, but you can use it for the tests and integration-tests targets.
For a given model (as defined by the Makefile variable MODEL), you can run all of the tools with one command, make all-code. Hence, you can run all the examples for all the models discussed above using make all-models-all-code.

Run `tdd-example-refill-chatbot`

This tool is explained in the section on Test-Driven Development. It introduces some concepts, but isn't intended to be a tool you use long term, in contrast to the next two unit-benchmark-data-* tools discussed below.

Run this make command:

make run-tdd-example-refill-chatbot

There are shorthand "aliases" for this target:

make run-terc
make terc

This target first checks the following:

The uv command is installed and on your path.
Two directories defined by make variables exist. If not, they are created with mkdir -p, where the option -p ensures that missing parent directories are also created:
- OUTPUT_LOG_DIR, where most output is written, which is temp/output/ollama_chat/gemma4_e4b/logs, when MODEL is defined to be ollama_chat/gemma4:e4b. (The : is converted to _, because : is not an allowed character in MacOS file system names.) Because MODEL has a /, we end up with a directory ollama_chat that contains a gemma4_e4b subdirectory.
- OUTPUT_DATA_DIR, where data files are written, which is temp/output/ollama_chat/gemma4_e4b/data.

If you don't use the make command, make sure you have uv installed and either manually create the same directories or modify the corresponding paths shown in the next command.

After the setup, the make target runs the following command:

cd src && time uv run tools/tdd-example-refill-chatbot.py \
	--model ollama_chat/gemma4:e4b \
	--service-url http://localhost:11434 \
	--template-dir tools/prompts/templates \
	--data-dir .../output/ollama_chat/gemma4_e4b/data \
	--log-file .../output/ollama_chat/gemma4_e4b/logs/${TIMESTAMP}/tdd-example-refill-chatbot.log

TIMESTAMP will be the current time when the uv command started, of the form YYYYMMDD-HHMMSS, and the values passed for --data-dir and --log-file are absolute paths. The other paths shown are relative to the src directory.

The time command prints execution time information for the uv command. It is optional and you can omit it when running this command directly yourself or on a system without this command. It returns how much system, user, and "wall clock" times were used for execution on MacOS and Linux systems. Note that uv is used to run tools/tdd-example-refill-chatbot.py.

The arguments are as follows:

Argument	Purpose
`--model ollama_chat/gemma4:e4b`	The model to use, defined by the `make` variable `MODEL`, as discussed above.
`--service-url http://localhost:11434`	The `ollama` local server URL. Some other inference services may also require this argument.
`--template-dir tools/prompts/templates`	Where we keep prompt templates we use for all the examples.
`--data-dir .../output/ollama_chat/gemma4_e4b/data`	Where any generated data files are written. (Not used by all tools.)
`--log-file .../output/ollama_chat/gemma4_e4b/logs/${TIMESTAMP}/tdd-example-refill-chatbot.log`	Where log output is captured.

The tdd-example-refill-chatbot.py tool runs two experiments, one with the template file q-and-a_patient-chatbot-prescriptions.yaml and the other with q-and-a_patient-chatbot-prescriptions-with-examples.yaml. The only difference is the second file contains embedded examples in the prompt, so in principal the results should be better, but in fact, they are often the same, as discussed in the TDD chapter.

Note

These template files were originally designed for use with the llm CLI tool (see the Appendix below for details about llm). In our Python tools, LiteLLM is used instead to invoke inference. We extract the content we need from the templates and construct the prompts we send through LiteLLM.

The tdd-example-refill-chatbot.py tool passes a number of hand-written prompts that are either prescription refill requests or something else, then checks what was returned by the model. As the TDD chapter explains, this is a very ad-hoc approach to creating and testing a unit benchmark.

Run `unit-benchmark-data-synthesis`

Described in Unit Benchmarks, this tool uses an LLM to generate Q&A (question and answer) pairs for unit benchmarks. It addresses some of the limitations of the more ad-hoc approach to benchmark creation used in the previous TDD exercise:

make run-unit-benchmark-data-synthesis

There are shorthand "aliases" for this target:

make run-ubds
make ubds

After the same setup steps as before, the following command is executed:

cd src && time uv run tools/unit-benchmark-data-synthesis.py \
	--model ollama_chat/gemma4:e4b \
	--service-url http://localhost:11434 \
	--template-dir tools/prompts/templates \
	--data-dir .../output/ollama_chat/gemma4_e4b/data \
	--log-file .../output/ollama_chat/gemma4_e4b/logs/${TIMESTAMP}/unit-benchmark-data-synthesis.log

Note

If you run the previous tool command, then this one, the two values for TIMESTAMP will be different. However, when you make all-code or any all-models-* target, the same value will be used for TIMESTAMP for all the invocations.

The arguments are the same as before, e.g., the --data-dir argument specifies the location where the Q&A pairs are written, one file per unit benchmark, with subdirectories for each model used. For example, after running this tool with ollama_chat/gemma4:e4b, the output will be in .../output/data/ollama_chat/gemma4_e4b, as discussed previously. This directory will have the following files of synthetic Q&A pairs:

synthetic-q-and-a_patient-chatbot-emergency-data.jsonl
synthetic-q-and-a_patient-chatbot-non-prescription-refills-data.jsonl
synthetic-q-and-a_patient-chatbot-prescription-refills-data.jsonl

(Examples can be found in the repo's src/data/examples/ollama_chat directory.)

They cover three unit-benchmarks:

emergency: The patient prompt suggests the patient needs urgent or emergency care, so they should stop using the ChatBot and call 911 (in the US) immediately.
refill: The patient is asking for a prescription refill.
other: (i.e., non-prescription-refills) All other patient questions.

The prompt used tells the model to return one of these classification labels, along with some additional information, rather than an ad-hoc, generated text like, "It sounds like you are having an emergency. Please call 911..."

Each of these data files are generated with a single inference invocation, with each invocation using these corresponding template files:

There is also an additional, optional argument --use-cases use-case1 use-case2 that allows you to specify one or more use cases to process. The current list of use cases is the following:

prescription-refills
non-prescription-refills
emergency

Hence, you have to quote the first two names. The default is to process all of them.

Run `unit-benchmark-data-validation`

Described in LLM as a Judge, this tool uses a teacher model to validate the quality of the Q&A pairs that were generated in the previous exercise:

make run-unit-benchmark-data-validation

There are shorthand "aliases" for this target:

make run-ubdv
make ubdv

After the same setup steps, the following command is executed:

cd src && time uv run tools/unit-benchmark-data-validation.py \
	--model ollama_chat/gemma4:e4b \
	--service-url http://localhost:11434 \
	--template-dir tools/prompts/templates \
	--data-dir .../output/ollama_chat/gemma4_e4b/data \
	--log-file .../output/ollama_chat/gemma4_e4b/logs/TIMESTAMP/unit-benchmark-data-validation.log \

In this case, the --data-dir argument specifies where to read the previously-generated Q&A files, and for each file, a corresponding “validation” file is written back to the same directory:

synthetic-q-and-a_patient-chatbot-emergency-data-validation.jsonl
synthetic-q-and-a_patient-chatbot-non-prescription-refills-data-validation.jsonl
synthetic-q-and-a_patient-chatbot-prescription-refills-data-validation.jsonl

(See examples in src/data/examples/ollama_chat.)

These files “rate” each Q&A pair from 1 (bad) to 5 (great). Also, summary statistics are written to stdout and to the output file .../output/ollama_chat/<model>/unit-benchmark-data-validation.out. Currently, we show the counts of each rating, meaning how good the teacher LLM rates the Q&A pair. For simplicity, we used the same model as the teacher that we used for generation, but for real use, consider using a different model.

From one of our test runs, the following table was printed:

Files:	1	2	3	4	5	Total
synthetic-q-and-a_patient-chatbot-emergency-data.jsonl	0	4	7	12	168	191
synthetic-q-and-a_patient-chatbot-prescription-refills-data.jsonl	0	0	0	0	108	108
synthetic-q-and-a_patient-chatbot-non-prescription-refills-data.jsonl	2	2	0	1	168	173

Totals:	2	6	7	13	444	472

Total count: 475 (includes errors), total errors: 3

The teacher model is asked to provide reasoning for its ratings. It is instructive to look at the output *-validation.jsonl files that we saved in src/data/examples/ollama_chat/gpt-oss_20b/data/.

Note that the emergency Q&A pairs had the greatest ambiguities, where the teacher model didn't think that many of the Q&A pairs represented real emergencies (lowest scores) or the situation was "ambiguous" (middle scores).

In fact, the program deliberately ignores the actual file where the Q&A pair appears. For example, we found that the emergency file contains some refill and other questions. Only the actual label in the answer corresponding to a question was evaluated.

Is this okay? Each data file is supposed to be for a particular use case, yet in fact the Q&A pairs have some mixing across files. You could decide to resort all the data by label to keep them separated by use case or you could just concatenate all the Q&A pairs into one big set. We think the first choice is more in the spirit of how focused, automated, use-case specific tests are supposed to work, but it is up to you...

The same template file was used for evaluating the three data files:

synthetic-q-and-a_patient-chatbot-data-validation.yaml

This tool also supports the optional argument --use-cases use-case1 use-case2 that allows you to specify one or more use cases to process, defaulting to all of them.

Run the Langflow Pipeline for the Tools

Preliminary support is also provided for running the unit benchmark data synthesis and validation tools as a Langflow pipeline. See src/tools/langflow/README.md for details.

Run the ChatBot Example Application

This purpose of this application is to represent something closer to what you would actually build, with automated unit and integration tests, a growing list of features, added incrementally, etc.

There are actually two implementations of this application:

ChatBotSimple - A "simple" implementation that just uses LLM inference wrapped with some custom Python code, but without agent tools.
ChatBotAgent - A a more advanced "agent" implementation that uses Langchain's deep agent tools for more advanced behaviors, like using agent skills to define new behaviors.

A command-line argument --which-chatbot is used with a shared code base to select which implementation to use. In the Makefile, the variable WHICH_CHATBOT is defined to be agent by default, but it can be overridden on the command line with the value simple to use the other implementation.

The application can be invoked in one of several ways:

make chatbot               # Run the interactive ChatBot with a simple, command-line "UX".
make run-chatbot           # Synonym for "chatbot".
make help-chatbot          # Show help for the ChatBot.

After the same setup steps, like output directory creation, the following command is executed, which you can run directly, where we show the values for arguments as defined by Makefile variables:

cd src && time uv run python -m apps.chatbot.main \
  --model ollama_chat/gemma4:e4b \
  --service-url http://localhost:11434 \
  --template-dir apps/chatbot/prompts/templates \
  --data-dir data \
  --confidence-threshold 0.9 \
  --which-chatbot agent \
  --log-file .../logs/.../chatbot.log

A simple prompt is presented where you can enter "patient prompts" and see the replies. If you prefer a nicer GUI interface, see Using the ChatBot with Open WebUI below.

The arguments are similar to the previously-discussed arguments, with a new argument --confidence-threshold, which we will explain below.

The source code, etc. for this application and the automated tests are located in these locations:

Content	Location	Notes
Source code	`src/apps/chatbot`	Main source code for the ChatBot and the MCP server.
Agent Skills	`src/apps/chatbot/skills`	Skills for the Deep Agent implementation, including the appointment management skill.
Prompt Templates	`src/apps/chatbot/prompts/templates`	The prompts used by the ChatBot application and related tests. There are different prompts for the "simple" vs. "agent" implementations.
Unit Tests	`src/tests/unit/`	Conventional unit tests and AI-specific tests for the chatbot in `src/tests/unit/apps/chatbot`. The AI-specific tests are executed with both ChatBot implementations.
Integration Tests	`src/tests/integration/`	The `make` target `integration-tests` also runs the unit tests in a more exhaustive way, as discussed below.
Test Data	`src/tests/data`	Test Q&A data for the AI tests.
Test Logs	`src/tests/logs/${MODEL_FILE_NAME}`	Special log output for easier examination of AI-related test results, where `MODEL_FILE_NAME` will be `ollama_chat/gemma4_e4b`, by default. It is computed from the value of the `MODEL` variable, where any colons are replaced with underscores.

Appointment Management Feature

The ChatBot Agent implementation includes a simple appointment management skill for demonstration purposes, implemented with "agent skills". It allows patients to:

Schedule appointments: Create new appointments during weekday business hours
Cancel appointments: Cancel existing appointments
Confirm appointments: Confirm scheduled appointments
Change appointments: Reschedule to a different time
List appointments: View all scheduled appointments

The appointment system enforces the following constraints:

Appointments must be scheduled on the hour (e.g., 10:00, 11:00, not 10:30)
Only weekdays (Monday-Friday) are available
Common USA holidays are excluded
Only one patient per time slot

The appointment data is stored in a JSONL file (data/appointments.jsonl) that persists across sessions.

Testing the Appointment Feature

Unit tests for the appointment tool can be run with:

make unit-tests-appointments

These tests verify all appointment operations including creation, cancellation, confirmation, rescheduling, and validation of business rules.

An MCP Server for the ChatBot

Running the MCP server is very similar. Since it runs the ChatBot for you, it takes the same arguments as the ChatBot. The only difference is the Python module invoked: uv run python -m apps.chatbot.mcp_server.server. The same --which-chatbot argument is used to select the ChatBot implementation.

make mcp-server            # Run the MCP server for the ChatBot (Also runs the ChatBot, so don't run both!)
make run-mcp-server        # Synonym for "mcp-server".
make help-mcp-server       # Show help for the MCP server.
make inspect-mcp-server    # Run the server with the `npx @modelcontextprotocol/inspector` tool.
make check-mcp-server      # Runs a 'sanity check' that the MCP server works.

For more details on running the MCP server, see the src/apps/chatbot/mcp_server/README.md.

Tip

Use the make inspect-mcp-server command to run the MCP server and inspect it with the npx @modelcontextprotocol/inspector tool. Node.js is required to run the inspector.

An OpenAI-compatible API Server for the ChatBot

Similarly, running the OpenAI-compatible API server is very similar. Since it runs the ChatBot for you, it takes the same arguments as the ChatBot, plus two additional options we will discuss shortly. The only other difference is the Python module invoked: uv run python -m apps.chatbot.api_server.server. The same --which-chatbot argument is used to select the ChatBot implementation.

make api-server            # Run the OpenAI-compatible API server for the ChatBot (Also runs the ChatBot, so don't run both!)
make run-api-server        # Synonym for "api-server".
make help-api-server       # Show help for the API server.
make check-api-server      # Runs a 'sanity check' that the API server works.

make view-api-server-docs  # Open a browser showing the API server "docs".
make view-api-server-redoc # Open a browser showing the API server "redoc".

the other two options are --host HOST and --port PORT for the API's own web server, not to be confused with --service-url SERVICE_URL for the Ollama server. The host cannot have the http:// prefix in the value. Just use localhost, 0.0.0.0, 192.168.0.1, etc. The default values for these two options are localhost and 8000, respectively.

For more details on running the OpenAI-compatible API server, see the src/apps/chatbot/api_server/README.md.

Using the ChatBot with Open WebUI

Finally, if you prefer using a GUI instead of the CLI prompt for the ChatBot, an integration is provided with Open WebUI. See the user guide for details. See also the src/apps/chatbot/open-webui/README.md.

Automated Testing: Practical Enhancements

Now we have automated tests in the src/tests directory and test data, which is a set of JSONL files with example prompt/answer pairs (with other metadata) used to test each supported use case of the ChatBot. In other words, this data is used for the unit benchmarks we have been advocating you use. This data was adapted from the example outputs of the tools found in src/data/examples/ollama_chat/gpt-oss_20b/data, but with changes reflecting what we learned while iterating on the development of the ChatBot application!

See the section Automated Testing: Practical Enhancements in the The Working Example and Tools chapter for more details.

Getting Involved

We welcome contributions as PRs, either to our code examples or our user guide. Please see our Alliance community repo for general information about contributing to any of our projects. This section provides some specific details you need to know.

In particular, see the AI Alliance CONTRIBUTING instructions. You will need to agree with the AI Alliance Code of Conduct.

Tip

Before submitting a PR, build the make target before-pr:

make before-pr   # Equivalent to 'make tests format lint type-check'

Licenses

All code contributions are licensed under the Apache 2.0 LICENSE (which is also in this repo, LICENSE.Apache-2.0).

All documentation contributions are licensed under the Creative Commons Attribution 4.0 International (which is also in this repo, LICENSE.CC-BY-4.0).

All data contributions are licensed under the Community Data License Agreement - Permissive - Version 2.0 (which is also in this repo, LICENSE.CDLA-2.0).

We use the "Developer Certificate of Origin" (DCO).

Warning

Before you make any git commits with changes, understand what's required for DCO.

See the Alliance contributing guide section on DCO for details. In practical terms, supporting this requirement means you must use the -s flag with your git commit commands.

The Website

The website for this repo is found in the docs directory. It is published using GitHub Pages. See GITHUB_PAGES.md for details.

Appendix: The `llm` and `jq` CLI Tools

If you want an excellent command-line tool for LLM inference, try llm from Simon Willison. Not only does it handle various inference options, from ollama to services like ChatGPT and Anthropic, it has features like a template system, tool registration, etc.

There are several make targets at the end of Makefile that you can use to install and use llm. Run make help-llm for details. Here we summarize a few key points to know.

We use several “templates” in src/tools/prompts/templates and src/apps/chatbot/prompts/templates with LiteLLM, but they were original designed for use with llm. The make install-llm-templates command will install them as llm templates, which you could use to make CLI versions of our Python tools, if you prefer.

The target first executes the following llm command to see where llm has templates installed on your system:

llm templates path

On MacOS, it will be $HOME/Library/Application Support/io.datasette.llm/templates. The target then copies all the YAML files in the src/tools/prompts/templates and src/apps/chatbot/prompts/templates directories to the correct location for templates on your system.

Try llm help for more details on the CLI.

Similarly, if you want a powerful CLI for parsing JSON, see jq ("JSON query"). The query language takes some practice to learn, but the documentation provides good examples.

Name		Name	Last commit message	Last commit date
Latest commit History 444 Commits
.github		.github
blogs		blogs
docs		docs
src		src
.gitignore		.gitignore
.python-version		.python-version
GITHUB_PAGES.md		GITHUB_PAGES.md
Gemfile		Gemfile
LICENSE.Apache-2.0		LICENSE.Apache-2.0
LICENSE.CC-BY-4.0		LICENSE.CC-BY-4.0
LICENSE.CDLA-2.0		LICENSE.CDLA-2.0
Makefile		Makefile
README.md		README.md
check-external-links.sh		check-external-links.sh
pylintrc.toml		pylintrc.toml
pyproject.toml		pyproject.toml
setup.cfg		setup.cfg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

README for Testing Generative AI Agent Applications

The User Guide Tools and Exercises vs. the ChatBot Application

Setup

Pick Your Models

Changing the Default Model Used

Edit the Makefile

Override Makefile Definitions on the Command Line

Running the Tools and ChatBot Application with `make`

Run `tdd-example-refill-chatbot`

Run `unit-benchmark-data-synthesis`

Run `unit-benchmark-data-validation`

Run the Langflow Pipeline for the Tools

Run the ChatBot Example Application

Appointment Management Feature

Testing the Appointment Feature

An MCP Server for the ChatBot

An OpenAI-compatible API Server for the ChatBot

Using the ChatBot with Open WebUI

Automated Testing: Practical Enhancements

Getting Involved

Licenses

The Website

Appendix: The `llm` and `jq` CLI Tools

About

Uh oh!

Releases 4

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

README for Testing Generative AI Agent Applications

The User Guide Tools and Exercises vs. the ChatBot Application

Setup

Pick Your Models

Changing the Default Model Used

Edit the Makefile

Override Makefile Definitions on the Command Line

Running the Tools and ChatBot Application with make

Run tdd-example-refill-chatbot

Run unit-benchmark-data-synthesis

Run unit-benchmark-data-validation

Run the Langflow Pipeline for the Tools

Run the ChatBot Example Application

Appointment Management Feature

Testing the Appointment Feature

An MCP Server for the ChatBot

An OpenAI-compatible API Server for the ChatBot

Using the ChatBot with Open WebUI

Automated Testing: Practical Enhancements

Getting Involved

Licenses

The Website

Appendix: The llm and jq CLI Tools

About

Topics

Resources

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 4

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Running the Tools and ChatBot Application with `make`

Run `tdd-example-refill-chatbot`

Run `unit-benchmark-data-synthesis`

Run `unit-benchmark-data-validation`

Appendix: The `llm` and `jq` CLI Tools

Packages