fix(salesforce): lock-aware token refresh for RTR safety #97
Open
malvfr wants to merge 48 commits into
fix: add a hard 12 month limit to tasks extraction
fix: fields limit
fix: increase maximum fields limit to 500 and ensure unique field selection
fix (salesforce): prioritise pull config selected fields
…17-09-25 Security: Fix requests vulnerability (GHSA-9hjg-9r4m-mvj7)
RGI-368 : Salesforce Quota Tracking
RGI-552 - tap-salesforce : Fix concurrent write race condition and bookmark corruption
RGI-646 - Extended bisection logic to also cover OPERATION_TOO_LARGE errors.
fix: Switch from queryAll to query endpoint to exclude soft-deleted records. This will continue bisecting until it hits the 1 hour floor
RGI-755 - MDI: duplication in extracted events since handling the tasks queryAll / data-too-large issue
https://hgdata.atlassian.net/browse/RGI-765 When a new column/attribute is added to a Salesforce pull config, the system previously had no way to automatically backfill historical data for that column. The start date was also set too recently (2025-01-01) to capture full history for core entity objects. The objective is to have full history from the start date (2000-01-01) for Lead, Contact, Account, Opportunity, and User, and at most 9 months of data for Task, Campaign, and CampaignMember, when a new column is added in the UI. This also triggers a dbt full refresh when a new column is added.
RGI-964 : Add per-SFDC-call observability to surface composite-batch cost
feat: add support for login url overrides for simulators
Replaces the fire-and-forget login() with a distributed-lock-aware implementation that coordinates with Argo (same lock protocol used by mk-node-libs lockAwareRefreshFn). Fixes the 15-min crash where the refresh timer reused a revoked refresh_token after Salesforce RTR rotation.
- Acquires Argo refresh-lock before calling SF
- Re-reads credentials after lock acquire (adopts if another service already refreshed)
- Persists new AT + RT to Argo atomically via lock-release endpoint
- Falls back to simple (in-memory-only) login when ARGO_URL/TENANT env vars are absent
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
4be897b to
69fb4ff
Compare
…rd, tests
- _release_lock_with_tokens: retry 3× with backoff; on all failures log CRITICAL with the new refresh_token (manual recovery path) then raise
- lock_released set optimistically before adopt-path _release_lock call so the except block never double-releases when Argo is down (I4 from review)
- Remove uuid alias (no clash), inline _read_credentials_from_argo (called once), dict comprehension for token body, delete ascii-art banners, log polling event in _acquire_lock, fix timeout check to cap sleep at remaining budget, warn on empty ARGO_CONNECTOR_API_KEY
- Add tests/test_credentials.py: 8 tests covering simple login, happy path, adopt path, first-login skip, persist retry, double-release guard, polling
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…rt; 22 tests
Logging:
- Log SF response body (status + text) before raise_for_status on SF 4xx; critical for invalid_grant diagnosis in RTR incidents
- Log event=sf_rtr_disabled when SF returns no new refresh_token (RTR off)
- Log event=adopt_no_rt WARNING when Argo returns new AT without refresh_token (in-memory RT may be stale; next cycle may get invalid_grant)
- Lock polling now includes heldBy/heldByService/ttlRemainingMs from Argo reject body in both lock_polling and lock_acquire_timeout events
- Log event=token_persist_rejected on 409 (no retry warranted)
Logic:
- _release_lock_with_tokens: fast-fail on Argo 409; a CAS/lock mismatch is permanent, and retrying 3x wastes 4.5s and emits misleading warnings
Tests (22 total, up from 8):
- T6: 409 fast-fail verified (no sleep, no retry)
- T7: adopt-path no RT; old RT preserved, warning fired
- T8: persist all fail + cleanup release also fails
- SF timeout path, Argo GET fails after lock, SF 4xx body, timer restart on failure, acquire timeout with heldBy in message, API key warning
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
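For reviewers, the retry and 409 fast-fail behavior described in this commit can be sketched roughly as follows. This is an illustration only, not the code in credentials.py: persist_with_retry, PersistRejected, PersistFailed, and the send callback are hypothetical names, and the linear 1.5 s delay schedule is an assumption picked so that three attempts sleep 4.5 s in total, matching the figure quoted above.

```python
import time
from typing import Callable, Optional


class PersistRejected(Exception):
    """Hypothetical error: the store answered 409, so retrying cannot help."""


class PersistFailed(Exception):
    """Hypothetical error: every attempt failed with a retryable status."""


def persist_with_retry(
    send: Callable[[], int],
    attempts: int = 3,
    base_delay_s: float = 1.5,  # assumed schedule, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a 2xx status; return the attempt number used.

    A 409 is treated as permanent (CAS/lock mismatch) and raises immediately,
    without sleeping or retrying.
    """
    last_status: Optional[int] = None
    for attempt in range(1, attempts + 1):
        status = send()
        if 200 <= status < 300:
            return attempt
        if status == 409:
            raise PersistRejected("token persist rejected with 409; not retrying")
        last_status = status
        if attempt < attempts:
            # Linear backoff between attempts; no sleep after the final one.
            sleep(base_delay_s * attempt)
    raise PersistFailed(f"persist failed after {attempts} attempts (last status {last_status})")
```

Injecting sleep keeps the no-sleep, no-retry property of the 409 path easy to assert in a unit test.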
…ignores
Argo validates requests with Authorization: Basic base64(key + ":"). All four Argo calls (_acquire_lock, re-read GET, _release_lock, _release_lock_with_tokens) were sending X-Api-Key and would have received 401 from staging/prod.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
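A minimal sketch of the header format described above, using only the standard library. The helper name argo_basic_auth_header is hypothetical; the PR may build the header differently.

```python
import base64
from typing import Dict


def argo_basic_auth_header(api_key: str) -> Dict[str, str]:
    """Build the Authorization header as Basic base64(key + ":").

    The API key is used as the username and the password is left empty.
    """
    encoded = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {encoded}"}
```

With the requests library, passing auth=(api_key, "") to a call should produce the same header, since a tuple is sent as HTTP Basic auth.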
- _simple_login: add timeout=30 to match lock-aware path
- RT in CRITICAL log: log only first 8 chars (hint), not the full value; refresh tokens are credentials and must not land in shared log systems
- Replace # ponytail: comment with a real explanation of intent
- SF timeout log: clarify that cleanup release is attempted (was misleading)
- lock_released=True pre-call: expand comment explaining the double-release guard is intentional (Copilot flagged as wrong; it is correct by design)
- wait_ms/total_ms added to lock_acquired and tokens_persisted log events
- test: monotonic mock gets 3 values after acquire_start call was added
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
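The 8-character hint could look something like the sketch below. The names token_hint and log_persist_failure, and the event name in the log line, are hypothetical and not taken from the PR.

```python
import logging

LOGGER = logging.getLogger(__name__)


def token_hint(token: str, visible: int = 8) -> str:
    """Return only the first `visible` characters of a token, followed by "...".

    Real refresh tokens are much longer than 8 characters, so the hint is
    enough to match a token during manual recovery without exposing it.
    """
    if not token:
        return "<empty>"
    return token[:visible] + "..."


def log_persist_failure(new_refresh_token: str) -> None:
    """Log a CRITICAL line carrying the hint rather than the full credential."""
    # Hypothetical event name, used here for illustration only.
    LOGGER.critical("event=token_persist_failed rt_hint=%s", token_hint(new_refresh_token))
```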
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- old_rt: add comment that it must be captured before any _credentials mutation; it is the CAS value sent to Argo's lock-release endpoint
- max_wait_s → poll_budget_s: the constant bounds polling budget, not hard wall-clock time (each HTTP call can consume up to timeout=10s before elapsed is checked); rename + comment avoids false precision
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
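To illustrate why poll_budget_s is a polling budget rather than a wall-clock limit, here is a rough sketch of a polling loop that caps each sleep at the remaining budget. The function name, the try_acquire callback, and the default values are assumptions; only the poll_budget_s name comes from the commit.

```python
import time
from typing import Callable


def acquire_with_budget(
    try_acquire: Callable[[], bool],
    poll_budget_s: float = 30.0,    # assumed default
    poll_interval_s: float = 2.0,   # assumed default
    monotonic: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll try_acquire() until it succeeds or the polling budget is spent.

    The budget is only checked between attempts, so a slow try_acquire()
    (for example an HTTP call with its own timeout) can push total wall-clock
    time past poll_budget_s.
    """
    start = monotonic()
    while True:
        if try_acquire():
            return True
        remaining = poll_budget_s - (monotonic() - start)
        if remaining <= 0:
            return False
        # Never sleep longer than what is left of the budget.
        sleep(min(poll_interval_s, remaining))
```

Passing monotonic and sleep as parameters allows a test to drive the loop with a fake clock instead of mocking the time module.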
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Summary
- SalesforceAuthOAuth.login() ignored the rotated refresh_token returned by Salesforce, leaving a revoked token in memory. The 15-minute refresh timer would then call SF again with that revoked token and get invalid_grant.
- Uses the same Argo lock protocol (lockAwareRefreshFn in mk-node-libs) so token rotation is coordinated across mk-tap-salesforce, mk-bongo, mk-push, and mk-pull.
- Falls back to simple login when the ARGO_URL or TENANT env vars are absent; maintains backward-compat for local/test runs.
What changed
tap_salesforce/salesforce/credentials.py, SalesforceAuthOAuth only:
- login() now dispatches to _lock_aware_login() (when Argo env vars present) or _simple_login() (fallback).
- _simple_login(): original behavior, plus captures refresh_token from the SF response in memory.
- _lock_aware_login(): acquires Argo refresh-lock → re-reads credentials (adopts if another service already refreshed) → calls SF → persists new AT + RT to Argo atomically via lock-release endpoint → releases lock.
- Helpers: _acquire_lock, _read_credentials_from_argo, _release_lock, _release_lock_with_tokens.
All other classes (SalesforceAuth, SalesforceAuthPassword, OAuthCredentials, PasswordCredentials, parse_credentials) are unchanged.
Argo lock endpoints used
- POST /v1/tenant/{tenant}/connectors/salesforce/refresh-lock: acquire
- GET /v1/tenant/{tenant}/connectors/salesforce: re-read after lock
- PUT /v1/tenant/{tenant}/connectors/salesforce/refresh-lock/release: release (with or without tokens)
Test plan
- ARGO_URL/TENANT unset → confirm warning log + simple login path (existing behavior)
- ARGO_URL/TENANT/ARGO_CONNECTOR_API_KEY set against staging Argo → confirm lock acquired, tokens written to Argo on refresh
- invalid_grant no longer occurs after 15 minutes when RTR is enabled on the SF connected app
🤖 Generated with Claude Code
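As a reading aid, the lock-aware flow in the description (acquire, re-read, adopt or refresh, persist, release) can be sketched as below. This is a simplified model under stated assumptions, not the code in credentials.py: Tokens, LockStore, and lock_aware_refresh are hypothetical names, the store stands in for the Argo endpoints, and detecting "another service already refreshed" by comparing access tokens is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class Tokens:
    access_token: str
    refresh_token: str


class LockStore(Protocol):
    """Hypothetical stand-in for the Argo lock and credential endpoints."""

    def acquire(self) -> None: ...
    def read(self) -> Tokens: ...
    def release(self) -> None: ...
    def release_with_tokens(self, old_refresh_token: str, new: Tokens) -> None: ...


def lock_aware_refresh(
    current: Tokens,
    store: LockStore,
    refresh_with_sf: Callable[[str], Tokens],
) -> Tokens:
    # Capture the old refresh token before anything mutates the credentials;
    # it is sent back as the compare-and-swap value when persisting.
    old_rt = current.refresh_token
    store.acquire()
    released = False
    try:
        stored = store.read()
        if stored.access_token != current.access_token:
            # Assumed adopt check: another service refreshed while we waited.
            # Mark released before the call so cleanup never releases twice.
            released = True
            store.release()
            return stored
        new = refresh_with_sf(old_rt)
        # Persisting the new tokens also releases the lock.
        store.release_with_tokens(old_rt, new)
        released = True
        return new
    finally:
        if not released:
            # Cleanup release after an SF or persist failure.
            store.release()
```

In this sketch the adopt path never calls Salesforce, which is what keeps a second service from spending an already-rotated refresh token.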