Commit 29ca4a4

danielle-unstructured-io, claude, and awalker4 authored
PLU-347: feat(box): add ACL permissions metadata to Box connector (#704)
## Summary

- Extends the Box connector to populate `permissions_data` on `FileDataSourceMetadata`, consistent with the Confluence and Google Drive implementations
- Permissions are fetched at **index time** in `BoxIndexer.run()` so they're available to all downstream pipeline stages
- `BoxDownloader.run()` retains a fallback for standalone usage (CLI, integration tests without the SND plugin layer)

## What changed

**`unstructured_ingest/processes/connectors/fsspec/box.py`**

- Added `BOX_ROLE_MAPPING`: maps Box collaboration roles to `[read]`, `[read, update]`, or `[read, update, delete]`; `uploader` excluded (write-only)
- Added module-level `_normalize_collaborations()`, `_get_collaborations_for_folder()`, `_get_permissions_for_file()` helpers
- `BoxIndexer.run()` override: initializes a Box SDK client once, then for each indexed file walks `path_collection` ancestor folders (LRU-cached, max 5 entries) plus direct file collaborations to build normalized permissions
- `BoxDownloader.run()` fallback: only fetches permissions if `permissions_data is None` (i.e., indexer wasn't run)
- `BoxConnectionConfig.get_box_client()`: returns an authenticated `boxsdk.Client` via JWT

**Tests**

- 21 unit tests covering `BOX_ROLE_MAPPING`, `_normalize_collaborations`, and `_get_permissions_for_file` (all mock-based)
- Integration test + expected-results fixtures for both top-folder and second-tier subfolder files

## Test plan

- [x] All 21 unit tests pass
- [x] Verified end-to-end in SND: root-level and subfolder Box files both have `permissions_data` populated with correct user IDs
- [ ] Integration tests pass in CI

Closes PLU-347

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Adds new Box SDK calls and ACL normalization/caching logic that affects security-relevant `permissions_data` emitted for every indexed file, with risk of under/over-granting if role or inheritance handling is incorrect.
>
> **Overview**
> Adds Box ACL pass-through by populating `FileDataSourceMetadata.permissions_data` from Box collaborations, including inherited folder collaborations (ancestor walk with small LRU cache), role-to-operation normalization (`read`/`update`/`delete`), skipping access-only and all-users-group grants, and a configurable per-file permission cap via `BoxIndexerConfig.max_num_metadata_permissions`.
>
> Introduces Box SDK client creation (`get_box_client`), fetches permissions at index time in `BoxIndexer.run()` with a downloader fallback for standalone usage, and adds unit + integration tests/fixtures for top-level and nested folder ACL scenarios. Also fixes integration fixture comparisons by stripping the randomized `unstructured_<random>/` tempdir prefix from downloaded file paths, updates docs image links, and bumps version to `1.6.0`.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 8416d47. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Austin Walker <austin@unstructured.io>
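The role normalization described in the commit message can be pictured with a small sketch. The `box.py` diff is not reproduced on this page, so most of what follows is an assumption made for illustration: the `ROLE_TO_OPERATIONS` table (in particular which Box role lands in which operation set), the function name `normalize_collaborations_sketch`, and the plain-dict shape of a collaboration record. Only the three operation sets, the exclusion of `uploader`, the skipping of access-only collaborations, and the output shape (taken from the fixtures in this commit) come from the source.

```python
from typing import Any

# Hypothetical role table. The commit only says roles map to [read],
# [read, update], or [read, update, delete] and that "uploader" is excluded;
# the assignment of individual roles below is an assumption.
ROLE_TO_OPERATIONS = {
    "viewer": ["read"],
    "previewer": ["read"],
    "editor": ["read", "update"],
    "co-owner": ["read", "update", "delete"],
    "owner": ["read", "update", "delete"],
}
OPERATIONS = ("read", "update", "delete")


def normalize_collaborations_sketch(
    collaborations: list[dict[str, Any]],
) -> list[dict[str, dict[str, list[str]]]]:
    """Fold collaboration records into the read/update/delete schema."""
    buckets = {op: {"users": [], "groups": []} for op in OPERATIONS}
    for collab in collaborations:
        # Access-only collaborations are skipped to avoid overgranting.
        if collab.get("is_access_only"):
            continue
        # Unknown or excluded roles (e.g. "uploader") grant nothing.
        operations = ROLE_TO_OPERATIONS.get(collab.get("role", ""), [])
        grantee = collab.get("accessible_by") or {}
        grantee_id = grantee.get("id")
        if not operations or grantee_id is None:
            continue
        # Group IDs are stored directly, without member expansion.
        kind = "groups" if grantee.get("type") == "group" else "users"
        for op in operations:
            if str(grantee_id) not in buckets[op][kind]:
                buckets[op][kind].append(str(grantee_id))
    return [{op: buckets[op]} for op in OPERATIONS]
```

Keeping the table as data rather than branching logic makes an over-grant easy to spot in review, which matters given the "Medium Risk" note above.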
1 parent 3ab29e6 commit 29ca4a4

14 files changed

Lines changed: 867 additions & 13 deletions


CHANGELOG.md

Lines changed: 10 additions & 0 deletions
@@ -1,3 +1,13 @@
+## [1.6.0]
+
+### Enhancements
+
+- **feat(box): pass through ACL permission metadata.** Extract Box collaboration data and normalize to the standard read/update/delete schema. Permissions are fetched during indexing with an LRU-cached ancestor folder walk to handle inherited collaborations, plus a per-parent-folder `path_collection` cache so only the first file in a given parent pays the `file.get()` round-trip. Access-only collabs (`is_access_only=true`) are skipped to avoid overgranting; group IDs are stored directly without member expansion (consistent with Confluence). `boxsdk` is now installed via the `box` extra. Both the permissions cap and ancestor-cache size are configurable on `BoxIndexerConfig` (`max_num_metadata_permissions`, `permissions_cache_max_size`) and `BoxDownloaderConfig` for the standalone fallback path.
+
+### Fixes
+
+- **fix(test): strip randomized tempdir prefix from FsspecDownloader fixture paths.** `get_files()` now drops the leading `unstructured_<random>/` segment so `directory_structure.json` captures the logical structure rather than the per-run random suffix injected by `tempfile.mkdtemp`.
+
 ## [1.5.2]
 
 ### Enhancements
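The LRU-cached ancestor walk mentioned in the 1.6.0 entry can be sketched as follows. This is a minimal illustration, not the connector's code: `FolderCollabCache`, `collaborations_for_file`, and the injected `fetch` callable are hypothetical names, no real Box SDK call is made, and skipping the root folder ID `"0"` is an assumption. The default size of 5 mirrors the "max 5 entries" figure in the commit message.

```python
from collections import OrderedDict
from typing import Callable


class FolderCollabCache:
    """Tiny LRU cache of folder ID -> collaboration records (hypothetical)."""

    def __init__(self, fetch: Callable[[str], list], max_size: int = 5):
        self._fetch = fetch          # stand-in for the real per-folder API call
        self._max_size = max_size
        self._entries = OrderedDict()  # oldest entry first

    def get(self, folder_id: str) -> list:
        if folder_id in self._entries:
            # Cache hit: mark as most recently used.
            self._entries.move_to_end(folder_id)
            return self._entries[folder_id]
        collabs = self._fetch(folder_id)
        self._entries[folder_id] = collabs
        if len(self._entries) > self._max_size:
            # Evict the least recently used folder.
            self._entries.popitem(last=False)
        return collabs


def collaborations_for_file(
    ancestor_ids: list, direct_collabs: list, cache: FolderCollabCache
) -> list:
    """Merge inherited folder collaborations with the file's own."""
    merged = []
    for folder_id in ancestor_ids:
        if folder_id == "0":  # assumption: the root folder is skipped
            continue
        merged.extend(cache.get(folder_id))
    merged.extend(direct_collabs)
    return merged
```

Files in the same folder share every ancestor, so after the first file the walk is served from the cache; a small cache is enough when files are indexed folder by folder.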

docs/README.md

Lines changed: 2 additions & 2 deletions
@@ -101,7 +101,7 @@ In checklist form, the above steps are summarized as:
 
 The ingest flow is similar to an ETL pipeline that gets defined at runtime based on user input:
 
-![unstructured ingest diagram](assets/pipeline.png)
+![unstructured ingest diagram](pipeline.png)
 
 
 
@@ -117,7 +117,7 @@ The ingest flow is similar to an ETL pipeline that gets defined at runtime based
 
 
 ### Sequence Diagram
-![unstructured ingest sequence diagram](assets/sequence.png)
+![unstructured ingest sequence diagram](sequence.png)
 
 
 ### Parallel Execution

docs/connector_development.md

Lines changed: 1 addition & 1 deletion
@@ -349,7 +349,7 @@ If you have any questions post in the public Slack channel `ask-for-help-open-so
 
 Yellow (without the Uncompressing) represents the steps in a source connector. Orange represents a destination connector.
 
-![unstructured_ingest diagram](assets/pipeline.png)
+![unstructured_ingest diagram](pipeline.png)
 
 
 

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ astradb = ["astrapy>2.0.0"]
 azure-ai-search = ["azure-search-documents"]
 azure = ["adlfs", "fsspec"]
 biomed = ["beautifulsoup4", "requests"]
-box = ["boxfs", "fsspec"]
+box = ["boxfs", "boxsdk", "fsspec"]
 chroma = ["chromadb"]
 clarifai = ["clarifai"]
 confluence = ["atlassian-python-api", "requests"]
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+{
+  "directory_structure": [
+    "catalog.pdf"
+  ]
+}
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
+{
+  "identifier": "8b0303ba-7c77-5e47-b0ff-790b8fc9881f",
+  "connector_type": "box",
+  "source_identifiers": {
+    "filename": "catalog.pdf",
+    "fullpath": "/TestACLs-topfolder/TestACLs-secondtier/catalog.pdf",
+    "rel_path": "catalog.pdf"
+  },
+  "metadata": {
+    "url": "box:///TestACLs-topfolder/TestACLs-secondtier/catalog.pdf",
+    "version": "2216144540657",
+    "record_locator": {
+      "protocol": "box",
+      "remote_file_path": "box://TestACLs-topfolder/TestACLs-secondtier",
+      "file_id": "2216144540657"
+    },
+    "date_created": "1777662782.0",
+    "date_modified": "1777662782.0",
+    "date_processed": "1777665707.7073228",
+    "permissions_data": [
+      {
+        "read": {
+          "users": [
+            "50881967280",
+            "50882409531"
+          ],
+          "groups": []
+        }
+      },
+      {
+        "update": {
+          "users": [
+            "50881967280"
+          ],
+          "groups": []
+        }
+      },
+      {
+        "delete": {
+          "users": [
+            "50881967280"
+          ],
+          "groups": []
+        }
+      }
+    ],
+    "filesize_bytes": 296006
+  },
+  "additional_metadata": {
+    "name": "/TestACLs-topfolder/TestACLs-secondtier/catalog.pdf",
+    "size": 296006,
+    "type": "file",
+    "id": "2216144540657",
+    "modified_at": "2026-05-01T12:13:02-07:00",
+    "created_at": "2026-05-01T12:13:02-07:00",
+    "original_file_path": "/TestACLs-topfolder/TestACLs-secondtier/catalog.pdf"
+  },
+  "reprocess": false,
+  "local_download_path": "/private/var/folders/gf/qwh2bdg93kb9gzxd_xhb49wc0000gn/T/tmpekwnxs4a/unstructured_uvopv4ry/catalog.pdf",
+  "display_name": "/TestACLs-topfolder/TestACLs-secondtier/catalog.pdf"
+}
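The `permissions_data` shape in the fixture above is a list of single-key objects, one per operation. A downstream consumer might check access along these lines; `is_allowed` is a hypothetical helper written for illustration and is not part of the connector or of `unstructured_ingest`. The sample data is copied from the fixture.

```python
from typing import Iterable, Optional


def is_allowed(
    permissions_data: Optional[list],
    operation: str,
    user_id: str,
    group_ids: Iterable[str] = (),
) -> bool:
    """Return True if the user, or one of their groups, holds `operation`."""
    groups = set(group_ids)
    for entry in permissions_data or []:
        grant = entry.get(operation)
        if grant is None:
            continue  # this entry describes a different operation
        if user_id in grant.get("users", []):
            return True
        if groups.intersection(grant.get("groups", [])):
            return True
    return False


# Values taken from the fixture in this commit.
fixture_permissions = [
    {"read": {"users": ["50881967280", "50882409531"], "groups": []}},
    {"update": {"users": ["50881967280"], "groups": []}},
    {"delete": {"users": ["50881967280"], "groups": []}},
]
```

Treating a missing `permissions_data` as "nobody is allowed" is a design choice of this sketch; it fails closed rather than open.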
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+{
+  "directory_structure": [
+    "Billing issue - Example 1.pdf"
+  ]
+}
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
+{
+  "identifier": "11333818-b47e-5991-b32f-701975b2caca",
+  "connector_type": "box",
+  "source_identifiers": {
+    "filename": "Billing issue - Example 1.pdf",
+    "fullpath": "/TestACLs-topfolder/Billing issue - Example 1.pdf",
+    "rel_path": "Billing issue - Example 1.pdf"
+  },
+  "metadata": {
+    "url": "box:///TestACLs-topfolder/Billing issue - Example 1.pdf",
+    "version": "2216145342898",
+    "record_locator": {
+      "protocol": "box",
+      "remote_file_path": "box://TestACLs-topfolder",
+      "file_id": "2216145342898"
+    },
+    "date_created": "1777662769.0",
+    "date_modified": "1777662769.0",
+    "date_processed": "1777665696.530676",
+    "permissions_data": [
+      {
+        "read": {
+          "users": [
+            "50881967280",
+            "50882409531"
+          ],
+          "groups": []
+        }
+      },
+      {
+        "update": {
+          "users": [
+            "50881967280"
+          ],
+          "groups": []
+        }
+      },
+      {
+        "delete": {
+          "users": [
+            "50881967280"
+          ],
+          "groups": []
+        }
+      }
+    ],
+    "filesize_bytes": 142776
+  },
+  "additional_metadata": {
+    "name": "/TestACLs-topfolder/Billing issue - Example 1.pdf",
+    "size": 142776,
+    "type": "file",
+    "id": "2216145342898",
+    "modified_at": "2026-05-01T12:12:49-07:00",
+    "created_at": "2026-05-01T12:12:49-07:00",
+    "original_file_path": "/TestACLs-topfolder/Billing issue - Example 1.pdf"
+  },
+  "reprocess": false,
+  "local_download_path": "/private/var/folders/gf/qwh2bdg93kb9gzxd_xhb49wc0000gn/T/tmpqw6nq7zk/unstructured_aqpewcxk/Billing issue - Example 1.pdf",
+  "display_name": "/TestACLs-topfolder/Billing issue - Example 1.pdf"
+}
Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
+import os
+
+import pytest
+
+from test.integration.connectors.utils.constants import BLOB_STORAGE_TAG, SOURCE_TAG
+from test.integration.connectors.utils.validation.source import (
+    SourceValidationConfigs,
+    source_connector_validation,
+)
+from test.integration.utils import requires_env
+from unstructured_ingest.processes.connectors.fsspec.box import (
+    CONNECTOR_TYPE,
+    BoxAccessConfig,
+    BoxConnectionConfig,
+    BoxDownloader,
+    BoxDownloaderConfig,
+    BoxIndexer,
+    BoxIndexerConfig,
+)
+
+
+def make_box_components(remote_url: str, download_dir):
+    app_config = os.environ["BOX_APP_CONFIG"]
+    connection_config = BoxConnectionConfig(
+        access_config=BoxAccessConfig(box_app_config=app_config)
+    )
+    index_config = BoxIndexerConfig(remote_url=remote_url)
+    download_config = BoxDownloaderConfig(download_dir=download_dir)
+    indexer = BoxIndexer(connection_config=connection_config, index_config=index_config)
+    downloader = BoxDownloader(connection_config=connection_config, download_config=download_config)
+    return indexer, downloader
+
+
+@pytest.mark.asyncio
+@pytest.mark.tags(CONNECTOR_TYPE, SOURCE_TAG, BLOB_STORAGE_TAG)
+@requires_env("BOX_APP_CONFIG")
+async def test_box_top_folder(temp_dir):
+    """
+    Integration test for Box source connector against the top-level ACL test folder.
+    Validates that permissions_data is populated from direct folder collaborations.
+    """
+    indexer, downloader = make_box_components(
+        remote_url="box://TestACLs-topfolder",
+        download_dir=temp_dir,
+    )
+    await source_connector_validation(
+        indexer=indexer,
+        downloader=downloader,
+        configs=SourceValidationConfigs(
+            test_id="box_top_folder",
+            validate_downloaded_files=False,
+            validate_file_data=True,
+            exclude_fields_extend=[
+                "metadata.date_processed",
+            ],
+        ),
+    )
+
+
+@pytest.mark.asyncio
+@pytest.mark.tags(CONNECTOR_TYPE, SOURCE_TAG, BLOB_STORAGE_TAG)
+@requires_env("BOX_APP_CONFIG")
+async def test_box_second_tier(temp_dir):
+    """
+    Integration test for Box source connector against the nested ACL test folder.
+    Validates that permissions_data reflects inherited permissions from the parent folder.
+    """
+    indexer, downloader = make_box_components(
+        remote_url="box://TestACLs-topfolder/TestACLs-secondtier",
+        download_dir=temp_dir,
+    )
+    await source_connector_validation(
+        indexer=indexer,
+        downloader=downloader,
+        configs=SourceValidationConfigs(
+            test_id="box_second_tier",
+            validate_downloaded_files=False,
+            validate_file_data=True,
+            exclude_fields_extend=[
+                "metadata.date_processed",
+            ],
+        ),
+    )

test/integration/connectors/utils/validation/source.py

Lines changed: 19 additions & 5 deletions
@@ -1,5 +1,6 @@
 import json
 import os
+import re
 import shutil
 from pathlib import Path
 from typing import Callable, Optional
@@ -86,9 +87,17 @@ def omit_ignored_fields(self, data: dict) -> dict:
         return copied_data
 
 
+# FsspecDownloader writes each file into a fresh tempfile.mkdtemp("unstructured_") subdir
+# to avoid path collisions. Strip that segment so fixtures capture the logical structure
+# rather than a randomized suffix that changes every run.
+_FSSPEC_TEMP_DIR_PATTERN = re.compile(r"^unstructured_[a-zA-Z0-9_-]+/")
+
+
 def get_files(dir_path: Path) -> list[str]:
     return [
-        str(f).replace(str(dir_path), "").lstrip("/") for f in dir_path.rglob("*") if f.is_file()
+        _FSSPEC_TEMP_DIR_PATTERN.sub("", str(f).replace(str(dir_path), "").lstrip("/"))
+        for f in dir_path.rglob("*")
+        if f.is_file()
     ]
 
 
@@ -129,12 +138,17 @@ def check_raw_file_contents(
     current_output_dir: Path,
     configs: SourceValidationConfigs,
 ):
-    current_files = get_files(dir_path=current_output_dir)
     found_diff = False
     files = []
-    for current_file in current_files:
-        current_file_path = current_output_dir / current_file
-        expected_file_path = expected_output_dir / current_file
+    for current_file_path in current_output_dir.rglob("*"):
+        if not current_file_path.is_file():
+            continue
+        relative = str(current_file_path.relative_to(current_output_dir))
+        # Strip the unstructured_<random>/ tempdir segment when locating the
+        # corresponding fixture; the on-disk file still lives under the random
+        # subdir so don't strip it from current_file_path.
+        expected_relative = _FSSPEC_TEMP_DIR_PATTERN.sub("", relative)
+        expected_file_path = expected_output_dir / expected_relative
         if configs.detect_diff(expected_file_path, current_file_path):
             found_diff = True
             files.append(str(expected_file_path))
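To see what the prefix-stripping pattern in this diff does and does not match, here is a small standalone sketch. The regular expression is copied from the diff; the wrapper name `logical_path` is hypothetical and exists only for this illustration.

```python
import re

# Same pattern as _FSSPEC_TEMP_DIR_PATTERN in the diff above.
TEMP_DIR_PATTERN = re.compile(r"^unstructured_[a-zA-Z0-9_-]+/")


def logical_path(relative: str) -> str:
    """Drop a leading unstructured_<random>/ segment, if one is present."""
    return TEMP_DIR_PATTERN.sub("", relative)


# Leading random segment is removed.
print(logical_path("unstructured_uvopv4ry/catalog.pdf"))
# The pattern is anchored with ^, so a nested segment is left alone.
print(logical_path("nested/unstructured_abc123/report.pdf"))
# Paths without the prefix pass through unchanged.
print(logical_path("catalog.pdf"))
```

Because of the `^` anchor, only the first path component is ever stripped, which is why the comparison loop in `check_raw_file_contents` applies the pattern to the path relative to `current_output_dir` rather than to the absolute path.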
