feat(): Support context document import from local filesystem and GitHub, Notion, Confluence by jjoyce0510 · Pull Request #16903 · datahub-project/datahub

jjoyce0510 · 2026-04-03T21:23:18Z

Summary

This PR adds support for importing context into the Documents section from the local filesystem (one-time upload), GitHub (ingestion), Notion (ingestion), and Confluence (ingestion).

It also extends the Notion and Confluence sources so the document source type is configurable, which lets you ingest content as native documents and edit it in the Documents experience.

See the video below for the full experience.

Video Walkthrough

import-docs-flow.mov

Screenshots

Status

Ready for review.

codecov · 2026-04-03T21:25:03Z

Bundle Report

Changes will increase total bundle size by 535.74kB (1.8%) ⬆️. This is within the configured threshold ✅

Detailed changes

| Bundle name | Size | Change |
| --- | --- | --- |
| datahub-react-web-esm | 30.23MB | 535.74kB (1.8%) ⬆️ |

Affected Assets, Files, and Routes:

view changes for bundle: datahub-react-web-esm

Assets Changed:

| Asset Name | Size Change | Total Size | Change (%) |
| --- | --- | --- | --- |
| `assets/index-*.js` | -3.38MB | 502.07kB | -87.05% |
| `assets/index-*.js` | -5.12MB | 3.88MB | -56.88% |
| `assets/index-*.js` (New) | 9.01MB | 9.01MB | 100.0% 🚀 |
| `assets/en-*.js` | 2.37kB | 275.83kB | 0.87% |
| `assets/iconLoader-*.js` | 127 bytes | 184.03kB | 0.07% |
| `assets/githublogo-*.png` (New) | 14.9kB | 14.9kB | 100.0% 🚀 |
| `assets/Upload-*.js` (Deleted) | -3.29kB | 0 bytes | -100.0% 🗑️ |

Files in assets/index-*.js:

./src/app/context/import/ImportDocumentsModal.tsx → Total Size: 3.6kB
./src/app/context/ContextSidebar.tsx → Total Size: 15.5kB
./src/app/context/import/buildConfluenceDocumentsIngestionState.ts → Total Size: 636 bytes
./src/alchemy-components/components/Input/utils.ts → Total Size: 164 bytes
./src/app/analytics/event.ts → Total Size: 15.63kB
./src/app/context/import/ImportDocumentsButton.tsx → Total Size: 1.07kB
./src/alchemy-components/components/FileDragAndDropArea/FileDragAndDropArea.tsx → Total Size: 3.53kB

alwaysmeticulous · 2026-04-03T21:26:49Z

🔴 Meticulous spotted visual differences in 5 of 1554 screens tested: view and approve differences detected.

Meticulous evaluated ~10 hours of user flows against your PR.

Last updated for commit 2961e5f (fix(deps): pin @xmldom/xmldom to 0.8.13 for CVE-2026-41672). This comment will update as new commits are pushed.

codecov · 2026-04-03T21:27:05Z

Codecov Report

❌ Patch coverage is 91.15930% with 106 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...ub/ingestion/source/github_documents/github_api.py | 88.49% | 26 Missing ⚠️ |
| ...eact/src/app/context/import/__tests__/testSetup.ts | 55.31% | 21 Missing ⚠️ |
| ...graphql/resolvers/knowledge/DocumentResolvers.java | 14.28% | 12 Missing ⚠️ |
| ...source/github_documents/github_documents_config.py | 78.18% | 12 Missing ⚠️ |
| ...stV2/source/builder/RecipeForm/github-documents.ts | 90.00% | 8 Missing ⚠️ |
| ...source/github_documents/github_documents_source.py | 95.32% | 8 Missing ⚠️ |
| ...adata/service/docimport/DocumentImportService.java | 96.42% | 3 Missing and 1 partial ⚠️ |
| ...rs/knowledge/ImportDocumentsFromFilesResolver.java | 93.87% | 2 Missing and 1 partial ⚠️ |
| .../ingestion/source/unstructured/document_builder.py | 66.66% | 3 Missing ⚠️ |
| ...com/linkedin/datahub/graphql/GmsGraphQLEngine.java | 0.00% | 1 Missing ⚠️ |

... and 8 more

alexsku

PR Review

Type: behavior change (new feature)
Size: +4432 / -60 across 49 files

This PR adds the ability to import context documents into DataHub from two sources: local file upload (via browser-side text extraction) and GitHub repositories (via the GitHub REST API). It introduces a new DocumentImportService backend, three new GraphQL resolvers (import from files, import from GitHub, preview from GitHub), a React modal UI with source selection, and parent document hierarchy support. It also adds ignore_above to keyword mappings in ES to prevent indexing failures on long text values.

Key invariants:

- Authorization: All import operations require the MANAGE_DOCUMENTS platform privilege.
- Idempotency: Documents are created with deterministic IDs derived from source IDs, so re-imports update rather than duplicate.
- Hierarchy: GitHub folder structure is preserved as parent-child document relationships; candidates must be ordered parents-first.
- The GitHub token is passed per-request and not stored server-side.
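
To make the parents-first requirement concrete, here is a minimal Python sketch of one way to order candidates so that every folder precedes its contents. It is not the PR's Java implementation; the function names `order_parents_first` and `parent_of` and the sample paths are hypothetical.

```python
from pathlib import PurePosixPath


def order_parents_first(paths):
    """Sort paths so that every folder comes before anything inside it.

    A parent always has fewer path segments than its children, so sorting
    by segment count (then by name, for stable output) is sufficient.
    """
    return sorted(paths, key=lambda p: (len(PurePosixPath(p).parts), p))


def parent_of(path):
    """Return the parent folder of a path, or None for top-level entries."""
    parent = PurePosixPath(path).parent
    return None if str(parent) == "." else str(parent)


# Hypothetical candidate list, deliberately out of order.
candidates = ["docs/guides/setup.md", "docs", "docs/guides", "README.md"]
ordered = order_parents_first(candidates)
```

With this ordering, a parent document already exists by the time a child that references it is created.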

Risk Assessment


Risk: Medium
Blast radius: Document entities only; the ES mapping change affects all entities with TEXT/TEXT_PARTIAL fields
Rollback: Safe for the import feature (new code, no migrations). The ES mapping change (`ignore_above`) is additive and backward-compatible
Rollout: Ship it; import resolvers are null-guarded if the factory bean is absent
CI: Failing: `build-and-test (except_metadata_ingestion, UTC)`

Blocking Issues

None found.

High-Priority Issues

[HIGH] GitHub token passed through GraphQL — potential for logging/persistence

Category: security | Confidence: medium
Location: datahub-graphql-core/src/main/resources/documents.graphql:871 — githubToken: String!

The GitHub PAT is passed as a plain String! in the GraphQL input. While it's used only server-side and not stored, GraphQL query logging, APM tracing, or error serialization could inadvertently capture the token in server logs. The same ImportDocumentsFromGitHubInput type is reused for both the previewDocumentsFromGitHub query and the importDocumentsFromGitHub mutation, meaning the token is sent even for read-only preview operations.

Suggested fix: Ensure GraphQL request logging is configured to redact githubToken. Consider whether the preview query truly needs the token in the same input type (it does for private repos, so this is likely acceptable). Document the logging risk.
How to validate: Check if the DataHub GMS logging configuration logs full GraphQL variables. Search for query logging middleware.
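
As an illustration of the suggested redaction, the sketch below masks sensitive fields in a GraphQL variables payload before it reaches a logger. It is written in Python for brevity and does not reflect DataHub's actual logging code; `redact_variables`, `SENSITIVE_KEYS`, and the sample payload are assumptions, and only the `githubToken` field name comes from the review.

```python
SENSITIVE_KEYS = {"githubToken"}  # field name from the review; extend as needed
MASK = "***REDACTED***"


def redact_variables(value):
    """Return a copy of GraphQL variables with sensitive fields masked.

    Walks nested dicts and lists so the token is masked wherever the
    input type is nested. The original object is not modified.
    """
    if isinstance(value, dict):
        return {
            key: MASK if key in SENSITIVE_KEYS else redact_variables(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact_variables(item) for item in value]
    return value


# Hypothetical variables for the import mutation.
variables = {"input": {"repo": "acme/docs", "githubToken": "ghp_example"}}
safe_to_log = redact_variables(variables)
```

Redacting on a copy keeps the real token available to the resolver while ensuring that only the masked version is passed to logging or tracing.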

[HIGH] No limit on number of files fetched from GitHub — unbounded server-side work

Category: performance | Confidence: high
Location: metadata-service/services/src/main/java/com/linkedin/metadata/service/docimport/GitHubDocumentSource.java:166-187 — fetchDocuments()

The fetchDocuments method iterates over all matching files from the GitHub tree API and fetches each file's content one-by-one via individual HTTP requests (line 167: fetchFileContent). For a large repo with thousands of matching files, this creates an unbounded number of sequential HTTP requests from the GMS server, tying up a GraphQL worker thread for potentially minutes. There's no configurable limit or pagination.

Suggested fix: Add a maxFiles parameter (defaulting to something like 500) and stop fetching after the limit. Log a warning when truncated.
How to validate: Point the import at a large repo (e.g., a docs monorepo with 1000+ .md files) and observe GMS behavior.
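
The suggested cap could look roughly like the following Python sketch. It is not the Java `fetchDocuments` method; the `fetch_documents` signature, the `fetch_file_content` callback, and the default of 500 (taken from the suggestion above) are assumptions for illustration.

```python
import logging

logger = logging.getLogger(__name__)

DEFAULT_MAX_FILES = 500  # default proposed in the review, not an existing setting


def fetch_documents(matching_paths, fetch_file_content, max_files=DEFAULT_MAX_FILES):
    """Fetch content for at most max_files paths and warn when truncating.

    The list is sliced before any content is fetched, so no HTTP request
    is issued for files beyond the limit.
    """
    paths = list(matching_paths)
    if len(paths) > max_files:
        logger.warning(
            "Truncating import: %d matching files, limit is %d",
            len(paths),
            max_files,
        )
    return [(path, fetch_file_content(path)) for path in paths[:max_files]]
```

Applying the limit before fetching is the important design choice: it bounds the number of sequential requests rather than discarding results after the work is done.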

Other Issues

- [MEDIUM / edge-case] GitHubDocumentSource.java:391: matchesFilters uses startsWith(pathPrefix) without a / boundary. Prefix "doc" would match "document/README.md". Fix: append / to a non-empty prefix before the startsWith check, or use filePath.startsWith(pathPrefix + "/").
- [MEDIUM / edge-case] GitHubDocumentSource.java:90-92: The GitHub tree API returns a truncated: true field when the tree is too large (>100K entries). This is not checked. If truncated, the user silently gets an incomplete file list. Fix: check tree.path("truncated").asBoolean() and warn the user.
- [MEDIUM / performance] DocumentImportService.java:190: entityClient.exists() is called for every candidate inside a loop. For a batch import of N documents, this makes N individual existence checks. Fix: consider batching existence checks or accepting that the first import will always be "create".
- [MEDIUM / correctness] DocumentImportService.java:275-283: makeDocumentId can produce collisions. Two different source IDs that differ only in characters replaced by - (e.g., "docs/a b" and "docs/a-b") will produce the same document ID. File-upload source IDs like upload.my file and upload.my-file would collide.
- [MEDIUM / dx] ImportDocumentsModal.tsx:210-212: <Text color="gray" colorLevel={1700}> passes color directly to alchemy <Text> components. Per the project's theming guidance, colors should come from theme tokens rather than being passed as props to alchemy components.
- [LOW / operability] GitHubDocumentSource.java:362-386: executeGitHubGet returns null on non-200 responses but doesn't distinguish between 401 (bad token), 403 (rate limit), and 404 (repo not found). A more specific error message would help users troubleshoot.
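
Two of the issues above, the prefix boundary and the ID collision, can be sketched in a few lines of Python. This is illustrative only: `matches_prefix` and `make_document_id` are hypothetical stand-ins for the Java methods, and the character set kept by the slug is an assumption, since the real sanitization rules are not shown in this thread.

```python
import hashlib
import re


def matches_prefix(file_path, path_prefix):
    """Match only on a directory boundary, so "doc" does not match "document/"."""
    prefix = path_prefix.strip("/")
    if not prefix:
        return True  # an empty prefix matches every file
    return file_path.startswith(prefix + "/")


def make_document_id(source_id):
    """Derive a deterministic ID that stays distinct for distinct source IDs.

    The readable slug alone can collide ("docs/a b" and "docs/a-b" both
    become "docs-a-b"), so a short hash of the raw source ID is appended.
    """
    slug = re.sub(r"[^A-Za-z0-9._-]", "-", source_id)
    digest = hashlib.sha256(source_id.encode("utf-8")).hexdigest()[:12]
    return f"{slug}-{digest}"
```

The hash suffix keeps IDs deterministic, which preserves idempotent re-imports, but adopting a scheme like this would change the IDs of documents that were already imported under the old scheme.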

What's Missing

No limit on imported file count: The GitHub import has no cap on the number of files processed. Both the frontend and backend should enforce a maximum.
No cancellation support: The GitHub import is synchronous on the GraphQL thread. Consider async execution with progress tracking for large imports.
Bundle size increase: Codecov reports +528KB (2.33%) from the mammoth DOCX library. Consider lazy-loading it only when the import modal is opened.

Test Plan

| Invariant | Covered? | Suggested test |
| --- | --- | --- |
| Auth check on all import endpoints | Yes: resolver tests verify `AuthorizationException` | - |
| Idempotent re-import (same URN) | Yes: smoke test `test_import_file_upload_idempotent` | - |
| Parent-child hierarchy | Yes: smoke test `test_import_file_upload_with_parent` | - |
| `makeDocumentId` collision | No | Unit test: verify distinct source IDs don't collide |
| Large repo truncation handling | No | Integration test: mock truncated tree response |
| ES `ignore_above` behavior | Partial: `FieldTypeMapperTest` covers mapping generation | Could add integration test indexing a >8191 char keyword |
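
The truncation test suggested in the table could start from a mocked tree response, as in this Python sketch. The `tree` and `truncated` fields follow the GitHub Git Trees API response shape; `list_blob_paths` and the test function are hypothetical and stand in for the Java source and its test suite.

```python
def list_blob_paths(tree_response):
    """Return (file paths, truncated flag) from a Git Trees API response."""
    paths = [
        entry["path"]
        for entry in tree_response.get("tree", [])
        if entry.get("type") == "blob"  # skip folders ("tree" entries)
    ]
    return paths, bool(tree_response.get("truncated", False))


def test_truncated_tree_is_reported():
    # Mocked response: GitHub sets "truncated" when the tree is too large.
    response = {
        "tree": [
            {"path": "docs", "type": "tree"},
            {"path": "docs/intro.md", "type": "blob"},
        ],
        "truncated": True,
    }
    paths, truncated = list_blob_paths(response)
    assert paths == ["docs/intro.md"]
    assert truncated, "caller should warn the user about an incomplete list"


test_truncated_tree_is_reported()
```

Returning the flag alongside the paths lets the caller decide how to surface the warning instead of silently importing a partial file list.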

Questions for Author

Is the CI build failure (build-and-test) unrelated to this PR, or does it need to be fixed before merge?
Was a file count limit for GitHub imports intentionally omitted, or is it planned for a follow-up?
Should the mammoth dependency be lazy-loaded to reduce the bundle impact for users who don't use the import feature?
The ES ignore_above / keyword mapping changes seem orthogonal to the document import feature — was this bundled intentionally due to document text fields hitting the limit, or could it be split into a separate PR for cleaner rollback?

datahub-connector-tests · 2026-06-19T03:44:44Z

Connector Tests Results

All connector tests passed for commit 2961e5f

To skip connector tests, add the skip-connector-tests label (org members only).

Autogenerated by the connector-tests CI pipeline.

Rebase document-import work onto latest oss/master, merging OSS sidebar filters/i18n with ImportDocumentsButton and adapting github-documents stale-removal to the current StaleEntityRemovalHandler API. Co-authored-by: Cursor <cursoragent@cursor.com>

Force the patched version pulled in transitively by mammoth for docx import. Co-authored-by: Cursor <cursoragent@cursor.com>

shirshanka

Approving - please check CI!

addressed or not applicable!

…Hub, Notion, Confluence (#16903) Co-authored-by: John Joyce <john@ip-192-168-1-212.us-west-2.compute.internal> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: John Joyce <john@Mac-5837.lan> Co-authored-by: John Joyce <john@Mac-5917.lan> Co-authored-by: John Joyce <john@Mac-6389.lan>

github-actions Bot added the ingestion, product, devops, and smoke_test labels Apr 3, 2026

github-actions Bot deployed to datahub-project-web-react (Preview) April 3, 2026 21:26 View deployment

vercel Bot deployed to Production April 3, 2026 21:31 View deployment

vercel Bot deployed to Preview April 3, 2026 21:35 View deployment

github-actions Bot deployed to datahub-project-web-react (Preview) April 3, 2026 23:35 View deployment

vercel Bot deployed to Preview April 3, 2026 23:46 View deployment

maggiehays added the needs-review label Apr 4, 2026

github-actions Bot deployed to datahub-project-web-react (Preview) April 6, 2026 15:55 View deployment

vercel Bot deployed to Preview April 6, 2026 16:07 View deployment

alexsku reviewed Apr 6, 2026

github-actions Bot deployed to datahub-project-web-react (Preview) April 6, 2026 18:37 View deployment

vercel Bot deployed to Preview April 6, 2026 18:49 View deployment

github-actions Bot deployed to datahub-project-web-react (Preview) April 6, 2026 19:08 View deployment

vercel Bot deployed to Preview April 6, 2026 19:20 View deployment

github-actions Bot deployed to datahub-project-web-react (Preview) April 6, 2026 19:45 View deployment

vercel Bot deployed to Preview April 6, 2026 19:54 View deployment

github-actions Bot deployed to datahub-wheels (Preview) April 7, 2026 19:13 View deployment

github-actions Bot deployed to datahub-project-web-react (Preview) April 7, 2026 19:16 View deployment

vercel Bot had a problem deploying to Preview April 7, 2026 19:19 Failure

jjoyce0510 force-pushed the jj--support-context-document-import-oss branch from 4484385 to d8a61f3 Compare May 26, 2026 23:46

github-actions Bot deployed to datahub-project-web-react (Preview) May 26, 2026 23:51 View deployment

vercel Bot had a problem deploying to Preview May 26, 2026 23:53 Failure

github-actions Bot deployed to datahub-project-web-react (Preview) May 27, 2026 01:09 View deployment

vercel Bot had a problem deploying to Preview May 27, 2026 01:12 Failure

github-actions Bot deployed to datahub-project-web-react (Preview) May 28, 2026 23:13 View deployment

vercel Bot had a problem deploying to Preview May 28, 2026 23:14 Failure

github-actions Bot had a problem deploying to preview May 29, 2026 15:38 Failure

github-actions Bot had a problem deploying to preview May 29, 2026 22:05 Failure

jjoyce0510 force-pushed the jj--support-context-document-import-oss branch from 50b3b11 to 9b1ba47 Compare June 19, 2026 03:03

github-actions Bot deployed to datahub-wheels (Preview) June 19, 2026 03:05 View deployment

github-actions Bot deployed to datahub-project-web-react (Preview) June 19, 2026 03:08 View deployment

vercel Bot had a problem deploying to Preview June 19, 2026 03:10 Failure

github-actions Bot deployed to datahub-wheels (Preview) June 19, 2026 17:17 View deployment

github-actions Bot deployed to datahub-project-web-react (Preview) June 19, 2026 17:20 View deployment

vercel Bot had a problem deploying to Preview June 19, 2026 17:22 Failure

github-actions Bot deployed to datahub-wheels (Preview) June 19, 2026 18:54 View deployment

github-actions Bot deployed to datahub-project-web-react (Preview) June 19, 2026 18:59 View deployment

vercel Bot deployed to Preview June 19, 2026 19:07 View deployment

John Joyce and others added 6 commits June 22, 2026 09:16

Committing changes

2358318

Final fixups

a4f03a0

fix ci

5dfe6e7

Adding ci fixes

d38fcad

fix(deps): pin @xmldom/xmldom to 0.8.13 for CVE-2026-41672

2961e5f

Force the patched version pulled in transitively by mammoth for docx import. Co-authored-by: Cursor <cursoragent@cursor.com>

jjoyce0510 force-pushed the jj--support-context-document-import-oss branch from 674f412 to 2961e5f Compare June 22, 2026 16:45

github-actions Bot deployed to datahub-wheels (Preview) June 22, 2026 16:48 View deployment

github-actions Bot deployed to datahub-project-web-react (Preview) June 22, 2026 16:50 View deployment

shirshanka approved these changes Jun 22, 2026

vercel Bot deployed to Preview June 22, 2026 17:00 View deployment

jjoyce0510 merged commit c78c7d5 into master Jun 22, 2026
110 of 114 checks passed

jjoyce0510 deleted the jj--support-context-document-import-oss branch June 22, 2026 18:53
