Commit f52b283

feat: add README.md in the dataset folder
Signed-off-by: Jorge Garcia Oncins <jgarciao@redhat.com>
1 parent 5b230b0 commit f52b283

1 file changed, 17 additions and 0 deletions

@@ -0,0 +1,17 @@
+# Llama Stack test fixtures (internal)
+
+These files are for **internal Open Data Hub / OpenShift AI integration tests** only. We use them to hit **[Llama Stack](https://github.com/meta-llama/llama-stack) vector store APIs** (think ingest, indexing, search, and the plumbing around that), not as a shipped dataset or for model training.
+
+## IBM finance PDFs (`corpus/finance/`)
+
+The PDFs here are IBM **quarterly earnings press releases** (the same material IBM posts for investors). If you need to replace or refresh them, download the official PDFs from IBM’s site:
+
+[Quarterly earnings announcements](https://www.ibm.com/investor/financial-reporting/quarterly-earnings) (choose year and quarter, then open the press release PDF).
+
+## PDF edge cases (`corpus/pdf-testing/`)
+
+This folder is for **weird PDFs on purpose**: password-protected files, digitally signed ones (e.g. PAdES), and similar cases so we can test how ingestion and parsers behave when the file is not a plain “print to PDF” document.
+
+## Small print
+
+Not for external distribution as a “dataset.” PDFs stay under their publishers’ terms; don’t reuse them outside this test context without checking those terms.
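
For the edge cases described under `corpus/pdf-testing/`, a test helper might label each fixture before handing it to ingestion, so a test can assert on the expected parser behavior per category. The following is a minimal sketch and not part of this commit: `classify_pdf` and `classify_folder` are hypothetical names, and the checks are byte-level heuristics (looking for the `/Encrypt` and `/ByteRange` keys and the `ETSI.CAdES.detached` signature subfilter), not a real PDF parse. They can miss or misreport unusual files.

```python
from pathlib import Path


def classify_pdf(data: bytes) -> set[str]:
    """Return coarse, heuristic labels for a PDF given its raw bytes.

    Assumption: a plain byte scan is good enough for test fixtures;
    this does not parse the PDF structure.
    """
    labels: set[str] = set()

    # A PDF file is expected to start with a "%PDF-" header.
    if not data.startswith(b"%PDF-"):
        labels.add("not-a-pdf")
        return labels

    # Encrypted (e.g. password-protected) files reference an /Encrypt dictionary.
    if b"/Encrypt" in data:
        labels.add("encrypted")

    # Signature dictionaries carry a /ByteRange entry; PAdES signatures
    # typically declare the ETSI.CAdES.detached subfilter.
    if b"/ByteRange" in data:
        labels.add("signed")
        if b"ETSI.CAdES.detached" in data:
            labels.add("pades")

    if not labels:
        labels.add("plain")
    return labels


def classify_folder(folder: Path) -> dict[str, set[str]]:
    """Label every *.pdf file directly inside `folder` (hypothetical helper)."""
    return {
        path.name: classify_pdf(path.read_bytes())
        for path in sorted(folder.glob("*.pdf"))
    }
```

A test could call `classify_folder(Path("corpus/pdf-testing"))` and parametrize ingestion checks by label, for example expecting a clear error rather than a silent empty index for files labeled `encrypted`.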
