
Feature/generic api connector#13545

Open
ahmadintisar wants to merge 43 commits into infiniflow:main from Attili-sys:feature/generic-api-connector

Conversation

@ahmadintisar
Contributor

@ahmadintisar ahmadintisar commented Mar 11, 2026

feat: Add Generic REST API Connector

What problem does this PR solve?

RAGFlow supports many specific data source connectors (MySQL, Slack, Google Drive, etc.), but there was no way to connect an arbitrary REST API as a data source. Users with custom or third-party APIs had to write a new connector class for each one.

This PR adds a generic, configuration-driven REST API connector that lets users connect any REST API as a data source entirely through the UI — no code changes needed per API.


Features

Core Connector (common/data_source/rest_api_connector.py)

  • Implements LoadConnector and PollConnector interfaces for full and incremental sync
  • Configurable authentication: None, API Key (custom header), Bearer Token, Basic Auth
  • Pluggable pagination: Page-based, Offset-based, Cursor-based, or None
  • Smart page-size inference from user's query parameters to avoid duplicate/conflicting params
  • Configurable request delay between pages to prevent API rate limiting
  • Auto-detection of the items array in JSON responses (items, results, data, records, or first list found)
  • Advanced field mapping with dot-notation (country.name), array wildcards (newsType[*].name), type hints, and default values
  • Optional content template rendering ("Title: {title}\nBody: {body}")
  • HTML stripping for content fields
  • Stable document IDs via hash128 from a configurable ID field or auto-generated from item content
  • Pydantic configuration schema with automatic coercion of UI string inputs to dicts/lists
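
To make the pagination options above concrete, here is a minimal sketch of how page-based and offset-based strategies might generate per-request query parameters under a max-pages cap. The helper name `iter_page_params` and the parameter names `page`, `offset`, and `limit` are assumptions for illustration; they are not taken from the connector's code.

```python
from typing import Dict, Iterator


def iter_page_params(
    pagination_type: str,
    page_size: int = 20,
    max_pages: int = 1000,
) -> Iterator[Dict[str, int]]:
    """Yield query parameters for each successive request.

    Hypothetical helper: the parameter names are illustrative only.
    """
    if pagination_type == "none":
        # A single request with no pagination parameters.
        yield {}
        return
    for page_index in range(max_pages):
        if pagination_type == "page":
            yield {"page": page_index + 1, "limit": page_size}
        elif pagination_type == "offset":
            yield {"offset": page_index * page_size, "limit": page_size}
        else:
            raise ValueError(f"unsupported pagination type: {pagination_type}")
```

A caller would typically stop iterating as soon as a page comes back empty; the `max_pages` cap only bounds the worst case.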

Backend Registration (rag/svr/sync_data_source.py, common/constants.py, common/data_source/config.py)

  • REST_API sync class wired into RAGFlow's func_factory
  • Full sync (load_from_state) and incremental polling (poll_source) support
  • Credentials and config passed from task to connector following existing patterns (MySQL, SeaFile, etc.)
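
As a rough illustration of incremental polling, the sketch below keeps only the items whose timestamp field falls inside a polling window. It assumes ISO 8601 timestamps and a half-open window `(start, end]`; the helper `filter_by_window` is hypothetical and is not the connector's `poll_source` implementation.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


def filter_by_window(
    items: List[Dict[str, Any]],
    timestamp_field: str,
    start: datetime,
    end: datetime,
) -> List[Dict[str, Any]]:
    """Keep items whose timestamp lies in (start, end].

    start and end must be timezone-aware when the item timestamps are.
    """
    kept = []
    for item in items:
        raw = item.get(timestamp_field)
        if not isinstance(raw, str):
            continue  # no usable timestamp on this item
        try:
            ts = datetime.fromisoformat(raw)
        except ValueError:
            continue  # not an ISO 8601 string
        if start < ts <= end:
            kept.append(item)
    return kept


# Usage with made-up items; only the first one falls inside the window.
tz = timezone(timedelta(hours=3))
sample = [
    {"id": 1, "published": "2026-03-15T15:47:13+03:00"},
    {"id": 2, "published": "2026-03-10T00:00:00+03:00"},
    {"id": 3},
]
recent = filter_by_window(
    sample, "published", datetime(2026, 3, 14, tzinfo=tz), datetime(2026, 3, 16, tzinfo=tz)
)
```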

Test Connection Endpoint (api/apps/connector_app.py)

  • POST /v1/connector/<id>/test validates config schema, authentication, and API connectivity without triggering a sync
  • Clear error messages for auth failures vs. config issues
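
One way to separate config issues from auth failures is to validate the auth settings before any request is sent. The sketch below builds request headers for the four auth types and raises `ValueError` on incomplete configuration. The function name, argument names, and auth-type strings are assumptions for illustration only.

```python
import base64
from typing import Dict, Optional


def build_auth_headers(
    auth_type: str,
    api_key_header: Optional[str] = None,
    api_key_value: Optional[str] = None,
    token: Optional[str] = None,
    username: Optional[str] = None,
    password: Optional[str] = None,
) -> Dict[str, str]:
    """Return HTTP headers for the chosen auth type (hypothetical helper)."""
    if auth_type == "none":
        return {}
    if auth_type == "api_key":
        if not api_key_header or not api_key_value:
            raise ValueError("API key auth needs both a header name and a value")
        return {api_key_header: api_key_value}
    if auth_type == "bearer":
        if not token:
            raise ValueError("Bearer auth needs a token")
        return {"Authorization": f"Bearer {token}"}
    if auth_type == "basic":
        if username is None or password is None:
            raise ValueError("Basic auth needs a username and a password")
        raw = f"{username}:{password}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    raise ValueError(f"unknown auth type: {auth_type}")
```

A `ValueError` here would map to a config error, whereas a 401/403 from the remote API would map to an auth failure.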

Frontend UI (web/src/pages/user-setting/data-source/constant/)

  • Postman-style configuration: Base URL, Query Parameters (key=value per line), Auth, Content Fields, Metadata Fields, Pagination Type
  • Auth-type-aware form: fields for API key header/value, Bearer token, or Basic username/password appear only when relevant
  • Advanced Settings toggle for: Custom Headers, Max Pages, Request Delay, Poll Timestamp Field, Request Body (POST)
  • Connector icon (SVG) and i18n strings (English)
  • "Test Connection" button to validate before syncing
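
The "key=value per line" query parameter format could be parsed roughly as follows. This is a sketch under stated assumptions: `parse_query_params` and `merge_pagination` are hypothetical names, and the rule that the pagination value overrides a user-supplied key is an illustration of how duplicate keys can be avoided, not a description of the actual code.

```python
from typing import Dict


def parse_query_params(text: str) -> Dict[str, str]:
    """Parse 'key=value' lines into a dict, skipping blank or malformed lines."""
    params: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, value = line.split("=", 1)
        key = key.strip()
        if key:
            params[key] = value.strip()
    return params


def merge_pagination(params: Dict[str, str], page_param: str, page: int) -> Dict[str, str]:
    """Return a copy with the page parameter set exactly once."""
    merged = dict(params)
    merged[page_param] = str(page)
    return merged
```

Because the parameters live in a dict rather than in the URL string, a key such as `page` can only appear once in the final request.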

Controls & Safety

  • Configurable max pages safety cap (default: 1000, adjustable in UI)
  • Configurable request delay between pages (default: 0.5s, adjustable in UI)
  • Auth errors (401/403) fail immediately without retries; transient errors retry with exponential backoff
  • Diagnostic logging: auth setup confirmation, request details on failure, content field extraction status
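
The retry policy described above can be sketched as follows. Here `send` is a hypothetical callable returning `(status_code, body)`, and treating 429 and 5xx responses as the transient errors is an assumption; the PR text does not say which statuses count as transient.

```python
import time
from typing import Callable, Optional, Tuple


class AuthError(Exception):
    """Raised for 401/403 responses, which are never retried."""


def fetch_with_retry(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call send() until it succeeds, backing off exponentially between tries."""
    last_status: Optional[int] = None
    for attempt in range(max_attempts):
        status, body = send()
        if status in (401, 403):
            raise AuthError(f"authentication failed with HTTP {status}")
        if status < 400:
            return body
        if status != 429 and status < 500:
            # Other client errors will not improve on retry.
            raise RuntimeError(f"non-retryable HTTP {status}")
        last_status = status
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    raise RuntimeError(
        f"request failed after {max_attempts} attempts (last status {last_status})"
    )
```

Passing `sleep` as a parameter keeps the function easy to test without real delays.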

Type of change

  • New Feature (non-breaking change which adds functionality)

Visual Screenshots of Features
Screenshot 2026-03-11 at 5 19 52 PM
(Connector can be configured within the external data sources tab)

Configuration Parameters:
Screenshot 2026-03-11 at 5 20 46 PM
Screenshot 2026-03-11 at 5 20 54 PM

Connection can be tested before attaching to dataset:
Screenshot 2026-03-11 at 5 21 40 PM

Ingestion tested with API connector (works perfectly fine):
Screenshot 2026-03-11 at 5 22 30 PM

Search & Retrieval works as well with metadata flow:
Screenshot 2026-03-11 at 5 23 05 PM

Ahmad Intisar and others added 29 commits March 9, 2026 15:01
…array join, type hints, defaults, templates)
Separate base URL from query params to prevent pagination from injecting
duplicate keys (e.g. &page=1&page=1). Add a dedicated query_params field
(key=value per line, like Postman) so users no longer embed params in the
URL.
@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Mar 11, 2026
@yingfeng yingfeng added the ci Continue Integration label Mar 12, 2026
@yingfeng yingfeng marked this pull request as draft March 12, 2026 05:38
@yingfeng yingfeng marked this pull request as ready for review March 12, 2026 05:38
@yingfeng yingfeng requested a review from Magicbook1108 March 12, 2026 05:57
@codecov

codecov bot commented Mar 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.52%. Comparing base (af40be6) to head (1ac432c).

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #13545   +/-   ##
=======================================
  Coverage   96.52%   96.52%           
=======================================
  Files          10       10           
  Lines         690      690           
  Branches      108      108           
=======================================
  Hits          666      666           
  Misses          8        8           
  Partials       16       16           


@ahmadintisar
Contributor Author

@Magicbook1108 Hi, could you please review the PR?

@Magicbook1108
Contributor

Magicbook1108 commented Mar 16, 2026

Hello, I tried to list datasets from RAGFlow using this connector. Can you help me with this issue?
image

image image

@ahmadintisar
Contributor Author

@Magicbook1108
Thanks for testing! I reviewed your attached screenshots.

The issue is with the field mapping configuration. The RAGFlow datasets API returns items in a data array, where code is just the status code (0 = success), not a content field. That's why your documents show as None.txt with 0 bytes — the connector couldn't find matching fields in the response items.

Suggested Fix

Try this configuration:

| Field | Value |
| --- | --- |
| Content Fields | name (or name,description); these should be the fields containing the actual text content you want to vectorize |
| Metadata Fields | chunk_num,document_count |
| ID Field | id (under Advanced Settings) |

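The connector auto-detects the items array in the response: it checks items, results, data, and records first, then falls back to the first list it finds (which is how a key such as stories is picked up).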
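
A minimal sketch of that detection logic, assuming the key order given in the PR description (items, results, data, records, then the first list found). The function name `find_items` is hypothetical.

```python
from typing import Any, List, Optional

# Keys checked first, in order, before falling back to any list value.
PREFERRED_KEYS = ("items", "results", "data", "records")


def find_items(payload: Any) -> Optional[List[Any]]:
    """Return the list of items inside a decoded JSON response, if any."""
    if isinstance(payload, list):
        return payload
    if not isinstance(payload, dict):
        return None
    for key in PREFERRED_KEYS:
        if isinstance(payload.get(key), list):
            return payload[key]
    # Fallback: the first top-level value that is a list.
    for value in payload.values():
        if isinstance(value, list):
            return value
    return None
```

With a response shaped like `{"code": 0, "data": [...]}`, this picks the `data` list and ignores the scalar `code`.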


Reference Example

Here's how I configured it for a news API that returns this structure:

Sample API Response

```json
{
  "totalStories": 20,
  "stories": [
    {
      "id": 27856,
      "title": "رئيس الوزراء العراقي...",
      "content": "رئيس الوزراء العراقي: استمرار الصراع...",
      "published": "2026-03-15T15:47:13+03:00",
      "category": "أخبار",
      "country": { "id": "108", "name": "العراق", "ISO2": "IQ" },
      "newsType": [{ "name": "عاجل" }],
      "mediaType": "نص",
      "source": "تيليجرام"
    }
  ]
}
```

My configuration:

| Field | Value |
| --- | --- |
| Content Fields | title,content |
| Metadata Fields | category,country.name,country.ISO2,mediaType,source,newsType[*].name |
| ID Field | id (under Advanced Settings) |
| Pagination Type | Page-based (?page=1&story_per_page=20) |
| Poll Timestamp Field | published (under Advanced Settings, for incremental sync) |

Note: The connector supports dot-notation for nested fields (country.name) and array wildcards (newsType[*].name) to extract values from nested objects and arrays.
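
To show how such paths might resolve, here is a small sketch that handles dot-notation and the `[*]` wildcard. It is an illustration under assumptions, not the connector's implementation: `extract_path` is a hypothetical name, and returning a list for wildcard paths and `None` for missing plain paths is a choice made for this example.

```python
from typing import Any, List


def extract_path(item: Any, path: str) -> Any:
    """Resolve a path like 'country.name' or 'newsType[*].name'.

    Returns a list when the path contains a [*] wildcard,
    otherwise a single value (None if the path is missing).
    """
    current: List[Any] = [item]
    has_wildcard = False
    for part in path.split("."):
        wildcard = part.endswith("[*]")
        key = part[:-3] if wildcard else part
        if wildcard:
            has_wildcard = True
        next_values: List[Any] = []
        for value in current:
            if not isinstance(value, dict) or key not in value:
                continue
            child = value[key]
            if wildcard:
                # Fan out over every element of the array.
                if isinstance(child, list):
                    next_values.extend(child)
            else:
                next_values.append(child)
        current = next_values
    if has_wildcard:
        return current
    return current[0] if current else None
```

For the sample story above, `country.ISO2` would resolve to a single string, while `newsType[*].name` would resolve to a list with one name per array element.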


How I Can Help

Could you share the JSON structure of your API response? I can then tell you exactly which fields to map for Content Fields, Metadata Fields, and ID Field.

Also, if your API returns paginated results, make sure to set the Pagination Type accordingly (Page-based, Offset-based, or Cursor-based) — you currently have it set to None.

Contributor

Copilot AI left a comment


Pull request overview

This PR adds a generic, configuration-driven REST API connector that allows users to connect any REST API as a data source through the UI without code changes. It implements full and incremental sync, configurable authentication, pagination, and field mapping.

Changes:

  • New RestAPIConnector class with support for multiple auth types, pagination strategies, and flexible field extraction
  • Backend registration in sync_data_source.py and a test-connection endpoint in connector_app.py
  • Frontend UI forms, i18n strings, and SVG icon for the REST API data source

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
| --- | --- |
| common/data_source/rest_api_connector.py | Core connector implementation with auth, pagination, field mapping |
| common/data_source/config.py | Adds REST_API to DocumentSource enum |
| common/data_source/__init__.py | Exports RestAPIConnector |
| common/constants.py | Adds REST_API to FileSource enum |
| rag/svr/sync_data_source.py | Wires REST_API sync class into the factory |
| api/apps/connector_app.py | Test connection endpoint |
| web/src/pages/user-setting/data-source/constant/index.tsx | Form fields, defaults, and data source info for REST API |
| web/src/pages/user-setting/data-source/hooks.ts | useTestDataSource hook |
| web/src/pages/user-setting/data-source/data-source-detail-page/index.tsx | Test Connection button |
| web/src/services/data-source-service.ts | testDataSource service call |
| web/src/utils/api.ts | Test endpoint URL |
| web/src/locales/en.ts | English i18n strings |
| web/src/assets/svg/data-source/rest-api.svg | Connector icon |


@ahmadintisar
Contributor Author

@yingfeng @Magicbook1108

All comments have been resolved, and CI is passing as well. Feel free to review :)

ahmadintisar and others added 4 commits March 19, 2026 20:21
GitHub Actions runners time out cloning infinity resources from Gitee.
Default Dockerfile already uses GitHub; workflows were overriding with
NEED_MIRROR=1. Set NEED_MIRROR=0 in tests and release workflows.
@ahmadintisar
Contributor Author

CI failure is unrelated to this PR — the Build ragflow:nightly step fails on apt install with a 502 Bad Gateway from archive.ubuntu.com while fetching libglvnd0. This is a transient Ubuntu mirror issue. Ruff and Go build steps pass. @Magicbook1108 Could you re-run the job when the mirror is back?

@Magicbook1108
Contributor

We will restart CI for you. Please revert these changes.
image

@ahmadintisar
Contributor Author

@Magicbook1108 I have reverted the changes for CI. Please restart the CI!

@ahmadintisar
Contributor Author

@Magicbook1108 @yingfeng Could you please complete the review?

@ahmadintisar
Contributor Author

@Magicbook1108 @yingfeng Could you please complete the review?


Labels

ci Continue Integration 💞 feature Feature request, pull request that fullfill a new feature. 🌈 python Pull requests that update Python code size:XXL This PR changes 1000+ lines, ignoring generated files. 🧰 typescript Pull requests that update Typescript code
