Merged
2 changes: 1 addition & 1 deletion .github/workflows/checks.yml
@@ -6,7 +6,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: ['3.9', '3.10', '3.11', '3.12']
+        python-version: ['3.9', '3.10', '3.11', '3.12', '3.13']
     steps:
       - uses: actions/checkout@v3
       - uses: ./.github/actions/setup-poetry
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -52,7 +52,7 @@ repos:
     hooks:
       - id: docs
         name: Docs
-        entry: poetry run ds_generate_docs docs
+        entry: poetry run generate_docs docs
         pass_filenames: false
         language: system
         files: '\.py$'
16 changes: 8 additions & 8 deletions README.md
@@ -1,7 +1,7 @@
 # Docling Core

 [![PyPI version](https://img.shields.io/pypi/v/docling-core)](https://pypi.org/project/docling-core/)
-![Python](https://img.shields.io/badge/python-3.9%20%7C%203.10%20%7C%203.11%20%7C%203.12-blue)
+![Python](https://img.shields.io/badge/python-3.9%20%7C%203.10%20%7C%203.11%20%7C%203.12%20%7C%203.13-blue)
 [![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
 [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
@@ -21,7 +21,7 @@ pip install docling-core

 ### Development setup

-To develop for Docling Core, you need Python 3.9 / 3.10 / 3.11 / 3.12 and Poetry. You can then install from your local clone's root dir:
+To develop for Docling Core, you need Python 3.9 / 3.10 / 3.11 / 3.12 / 3.13 and Poetry. You can then install from your local clone's root dir:
 ```bash
 poetry install
 ```
@@ -45,14 +45,14 @@ poetry run pytest test
 Document.model_validate_json(data_str)
 ```

-- You can generate the JSON schema of a model with the script `ds_generate_jsonschema`.
+- You can generate the JSON schema of a model with the script `generate_jsonschema`.

 ```py
 # for the `Document` type
-ds_generate_jsonschema Document
+generate_jsonschema Document

 # for the `Record` type
-ds_generate_jsonschema Record
+generate_jsonschema Record
 ```

 ## Documentation
@@ -61,12 +61,12 @@ Docling supports 3 main data types:

 - **Document** for publications like books, articles, reports, or patents. When Docling converts an unstructured PDF document, the generated JSON follows this schema.
 The Document type also models the metadata that may be attached to the converted document.
-Check [Document](docs/Document.md) for the full JSON schema.
+Check [Document](docs/Document.json) for the full JSON schema.
 - **Record** for structured database records, centered on an entity or _subject_ that is provided with a list of attributes.
 Related to records, the statements can represent annotations on text by Natural Language Processing (NLP) tools.
-Check [Record](docs/Record.md) for the full JSON schema.
+Check [Record](docs/Record.json) for the full JSON schema.
 - **Generic** for any data representation, ensuring minimal configuration and maximum flexibility.
-Check [Generic](docs/Generic.md) for the full JSON schema.
+Check [Generic](docs/Generic.json) for the full JSON schema.

 The data schemas are defined using [pydantic](https://pydantic-docs.helpmanual.io/) models, which provide built-in processes to support the creation of data that adhere to those models.
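
Because the README links now point at exported `.json` files rather than Markdown pages, the schemas can be read with any JSON parser. The following is a minimal sketch, assuming an exported file follows the usual JSON Schema layout with top-level `title` and `properties` keys. The helper `summarize_schema` and the sample schema are hypothetical stand-ins and are not taken from Docling.

```python
import json
import tempfile
from pathlib import Path


def summarize_schema(path: Path) -> dict:
    """Return the title and sorted top-level property names of a JSON schema file."""
    with open(path, encoding="utf8") as schema_file:
        schema = json.load(schema_file)
    return {
        "title": schema.get("title"),
        "properties": sorted(schema.get("properties", {})),
    }


# Hypothetical stand-in for a generated file such as docs/Generic.json.
SAMPLE_SCHEMA = {
    "title": "Generic",
    "type": "object",
    "properties": {"name": {"type": "string"}, "description": {"type": "string"}},
}

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        sample_path = Path(tmp_dir) / "Generic.json"
        sample_path.write_text(json.dumps(SAMPLE_SCHEMA), encoding="utf8")
        print(summarize_schema(sample_path))
```

Pointing `summarize_schema` at a real exported file should work the same way, provided that file has the assumed top-level keys.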

144 changes: 0 additions & 144 deletions docling_core/utils/ds_generate_docs.py

This file was deleted.

82 changes: 82 additions & 0 deletions docling_core/utils/generate_docs.py
@@ -0,0 +1,82 @@
+#
+# Copyright IBM Corp. 2024 - 2024
+# SPDX-License-Identifier: MIT
+#
+
+"""Generate documentation of Docling types as JSON schema.
+
+Example:
+    python docling_core/utils/generate_docs.py /tmp/docling_core_files
+"""
+import argparse
+import json
+import os
+from argparse import BooleanOptionalAction
+from pathlib import Path
+from shutil import rmtree
+from typing import Final
+
+from docling_core.utils.generate_jsonschema import generate_json_schema
+
+MODELS: Final = ["Document", "Record", "Generic"]
+
+
+def _prepare_directory(folder: str, clean: bool = False) -> None:
+    """Create a directory or empty its content if it already exists.
+
+    Args:
+        folder: The name of the directory.
+        clean: Whether any existing content in the directory should be removed.
+    """
+    if os.path.isdir(folder):
+        if clean:
+            for path in Path(folder).glob("**/*"):
+                if path.is_file():
+                    path.unlink()
+                elif path.is_dir():
+                    rmtree(path)
+    else:
+        os.makedirs(folder, exist_ok=True)
+
+
+def generate_collection_jsonschema(folder: str):
+    """Generate the JSON schema of Docling collections and export them to a folder.
+
+    Args:
+        folder: The name of the directory.
+    """
+    for item in MODELS:
+        json_schema = generate_json_schema(item)
+        with open(
+            os.path.join(folder, f"{item}.json"), mode="w", encoding="utf8"
+        ) as json_file:
+            json.dump(json_schema, json_file, ensure_ascii=False, indent=2)
+
+
+def main() -> None:
+    """Generate the JSON Schema of Docling collections and export documentation."""
+    argparser = argparse.ArgumentParser()
+    argparser.add_argument(
+        "directory",
+        help=(
+            "Directory to generate files. If it exists, its content is kept unless"
+            " --clean is given."
+        ),
+    )
+    argparser.add_argument(
+        "--clean",
+        help="Whether any existing content in directory should be removed.",
+        action=BooleanOptionalAction,
+        dest="clean",
+        default=False,
+        required=False,
+    )
+    args = argparser.parse_args()
+
+    _prepare_directory(args.directory, args.clean)
+
+    generate_collection_jsonschema(args.directory)
+
+
+if __name__ == "__main__":
+    main()
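
The new script combines a `--clean` flag declared with `BooleanOptionalAction` and a directory-preparation step. The sketch below shows how those two pieces interact, using only the standard library. It is not the PR's code: the names `build_parser` and `prepare_directory` are illustrative, and the cleanup walks direct children with `iterdir()` rather than the `glob("**/*")` used in the diff, relying on `rmtree` to handle nested content.

```python
import argparse
import tempfile
from pathlib import Path
from shutil import rmtree


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a positional directory and a --clean/--no-clean pair."""
    parser = argparse.ArgumentParser()
    parser.add_argument("directory")
    # BooleanOptionalAction (Python 3.9+) registers both --clean and --no-clean.
    parser.add_argument(
        "--clean", action=argparse.BooleanOptionalAction, default=False
    )
    return parser


def prepare_directory(folder: str, clean: bool = False) -> None:
    """Create the folder, or empty it when it exists and clean is True."""
    target = Path(folder)
    if target.is_dir():
        if clean:
            # Only direct children are visited; rmtree removes nested content.
            for child in target.iterdir():
                if child.is_dir() and not child.is_symlink():
                    rmtree(child)
                else:
                    child.unlink()
    else:
        target.mkdir(parents=True, exist_ok=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        out_dir = Path(tmp_dir) / "docs"
        prepare_directory(str(out_dir))  # creates the folder
        (out_dir / "stale.json").write_text("{}", encoding="utf8")

        args = build_parser().parse_args([str(out_dir), "--clean"])
        prepare_directory(args.directory, args.clean)  # empties the folder
        print(sorted(p.name for p in out_dir.iterdir()))
```

With `default=False`, omitting the flag leaves existing files in place, so stale schema files are removed only when `--clean` is passed explicitly.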
Original file line number Diff line number Diff line change
@@ -6,7 +6,7 @@
 """Generate the JSON Schema of pydantic models and export them to files.

 Example:
-    python docling_core/utils/ds_generate_jsonschema.py legacy_doc.base.TableCell
+    python docling_core/utils/generate_jsonschema.py doc.document.TableCell
 """
 import argparse
@@ -27,10 +27,10 @@ def _import_class(class_reference: str) -> Any:


 def generate_json_schema(class_reference: str) -> Union[dict, None]:
-    """Generate a jsonable dict of a model's schema from DS data types.
+    """Generate a jsonable dict of a model's schema from a data type.

     Args:
-        class_reference: The reference to a class in 'src.data_types'.
+        class_reference: The reference to a class in 'docling_core.types'.

     Returns:
         A jsonable dict of the model's schema.
@@ -48,7 +48,7 @@ def main() -> None:
     """Print the JSON Schema of a model."""
     argparser = argparse.ArgumentParser()
     argparser.add_argument(
-        "class_ref", help="Class reference, e.g., legacy_doc.base.TableCell"
+        "class_ref", help="Class reference, e.g., doc.document.TableCell"
     )
     args = argparser.parse_args()
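
The docstrings in this file describe a class reference such as `doc.document.TableCell` being resolved against `docling_core.types`. The body of `_import_class` is not part of this diff, so the following is only a sketch of one way such a lookup could be written with `importlib`. The function name `import_class` and its `base_package` parameter are assumptions made for illustration, and the demo uses a standard-library package as a stand-in.

```python
import importlib
from typing import Any


def import_class(class_reference: str, base_package: str) -> Any:
    """Resolve 'module.path.ClassName' relative to base_package."""
    # Split the reference into an optional module path and the class name.
    module_ref, _, class_name = class_reference.rpartition(".")
    module_path = f"{base_package}.{module_ref}" if module_ref else base_package
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


if __name__ == "__main__":
    # Standard-library stand-in: 'abc.Mapping' under the 'collections' package.
    print(import_class("abc.Mapping", "collections"))
```

A bare name with no dots, like the `Document` argument used in the README, is looked up directly on the base package in this sketch. Whether the real helper behaves the same way is not confirmed by the diff.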
