feat: implement HTTP caching with mitmproxy's native format #646

Draft: wants to merge 18 commits into base: main

Conversation

devin-ai-integration[bot]
Contributor

@devin-ai-integration devin-ai-integration bot commented Mar 30, 2025

This PR implements HTTP caching for connectors using mitmproxy. It supports four cache modes, separate read/write directories, and configurable serialization formats including mitmproxy's native format for better interoperability.

Link to Devin run: https://app.devin.ai/sessions/9bbcc89c5dc047cabfe064370d8ca798
Requested by: Aaron ("AJ") Steers ([email protected])

Contributor Author

🤖 Devin AI Engineer

Original prompt from Aaron:

# Task: Implement HTTP Caching in PyAirbyte

## Background
We need to port the MITM (Man-In-The-Middle) proxy functionality from the Airbyte CI system into the PyAirbyte repository. This functionality allows for caching and replaying HTTP requests, which is useful for working around rate limits and situations where credentials might not be available.

## Requirements

### 1. Create a new module called `http_caching`
- Create a new module in the PyAirbyte repository called `http_caching` (separate from the existing `caches` module)
- The main class should be called `AirbyteConnectorCache`
- Implement the functionality using mitmproxy's Python API (not the command-line interface)

### 2. Configuration
- By default, cache files should be stored in a local directory called `.airbyte-http-cache`
- Create a constant that points to this directory
- Add support for an environment variable to override the cache location
- The primary means of sending traffic to this proxy should be via the HTTP_PROXY environment variable
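
A minimal sketch of the configuration rules above, not the PR's actual code. The override variable name `AIRBYTE_HTTP_CACHE_DIR` and the helpers `resolve_cache_dir` and `proxy_env` are hypothetical (the prompt only asks for "an environment variable"), and setting `HTTPS_PROXY` alongside `HTTP_PROXY` is an assumption.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional

# Default cache location named in the requirements.
DEFAULT_HTTP_CACHE_DIR = Path(".airbyte-http-cache")

# Hypothetical variable name; the requirements do not specify one.
CACHE_DIR_ENV_VAR = "AIRBYTE_HTTP_CACHE_DIR"


def resolve_cache_dir(environ: Optional[Mapping[str, str]] = None) -> Path:
    """Return the override directory if the variable is set, else the default."""
    environ = os.environ if environ is None else environ
    override = environ.get(CACHE_DIR_ENV_VAR)
    return Path(override) if override else DEFAULT_HTTP_CACHE_DIR


def proxy_env(port: int, host: str = "127.0.0.1") -> Dict[str, str]:
    """Build the variables a connector process would need to route traffic via the proxy."""
    url = f"http://{host}:{port}"
    # HTTPS_PROXY is an assumption; the requirements mention only HTTP_PROXY.
    return {"HTTP_PROXY": url, "HTTPS_PROXY": url}
```

Passing the environment mapping as a parameter keeps the lookup easy to test without mutating `os.environ`.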

### 3. Modify `get_source` implementation
- Add a new optional parameter called `http_cache` to the `get_source` function in `airbyte/sources/util.py`
- When this parameter is specified, configure the source to use the HTTP caching functionality

### 4. Core Functionality
The implementation should support:
- Recording HTTP traffic between connectors and sources
- Replaying requests to the same URL from a previous run
- Avoiding rate limiting issues
- Handling certificate management for HTTPS interception
- Proper handling of sensitive data in cached responses
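
The record and replay requirements above can be illustrated with a small stdlib-only sketch. It is independent of mitmproxy and stores plain JSON rather than mitmproxy's native format; the function names, the key scheme (method plus URL), and the set of headers treated as sensitive are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional

# Assumed list of headers to scrub; the requirements do not enumerate any.
SENSITIVE_HEADERS = {"authorization", "cookie", "set-cookie"}


def cache_key(method: str, url: str) -> str:
    """Derive a stable file name from the request method and URL."""
    return hashlib.sha256(f"{method.upper()} {url}".encode()).hexdigest()


def redact(headers: dict) -> dict:
    """Mask sensitive header values before they are written to disk."""
    return {
        name: "***" if name.lower() in SENSITIVE_HEADERS else value
        for name, value in headers.items()
    }


def record(cache_dir: Path, method: str, url: str,
           status: int, headers: dict, body: str) -> None:
    """Store one response under the key for its request."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = {"status": status, "headers": redact(headers), "body": body}
    (cache_dir / f"{cache_key(method, url)}.json").write_text(json.dumps(entry))


def replay(cache_dir: Path, method: str, url: str) -> Optional[dict]:
    """Return the cached response for this request, or None on a cache miss."""
    path = cache_dir / f"{cache_key(method, url)}.json"
    return json.loads(path.read_text()) if path.exists() else None
```

A replay hit avoids the outbound request entirely, which is how a cache of this kind sidesteps rate limits and missing credentials.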

## Reference Files
- `airbyte/sources/util.py` - Contains the `get_source` function that needs modification
- `airbyte/sources/registry.py` - Related to source registration
- `airbyte/__init__.py` - Main module initialization

## Notes
- This is separate from the existing data caching system in `airbyte.caches`
- Do not implement Docker wrapper functionality at this time
- Focus on making a clean Python-based implementation

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add "(aside)" to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

Contributor

@aaronsteers aaronsteers left a comment

For the binary format, you used pickle. I was expecting this to be a format that is natively readable by another tool. Please check Airbyte repo's live-tests implementation for the correct format to use here. I don't recall specifically, but you can take this opportunity to check there and migrate similarly here if possible.

Also, they used something like "hal" or similar for their JSON format. I don't know what that is, so please explain it as well (but don't migrate to it yet).

@devin-ai-integration devin-ai-integration bot changed the title from "feat: implement HTTP caching for connectors (do not merge)" to "feat: implement HTTP caching with mitmproxy's native format" Mar 30, 2025
) -> Iterator[str]:
    """Execute the declarative source."""
-    _ = stdin  # Not used
+    _ = stdin, env  # Not used
Contributor

TODO: We may need to (later, not here) implement a cache option for declarative execution. This runs through our same Python process, so we may either need to patch the requests library or have another implementation. Lmk if you have thoughts, but don't implement.

Co-Authored-By: Aaron <AJ> Steers <[email protected]>
Contributor Author

Devin is archived and cannot be woken up. Please unarchive Devin if you want to continue using it.

def find_free_port() -> int:
    """Find a free port to use for the proxy."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))

Check warning

Code scanning / CodeQL

Binding a socket to all network interfaces Medium

'' binds a socket to all interfaces.

Copilot Autofix

AI about 9 hours ago

To fix the problem, bind the socket to a specific address instead of all interfaces. Binding to the loopback address (localhost) restricts the socket to connections from the local machine, so external hosts cannot reach it.

Suggested changeset 1
airbyte/http_caching/mitm_proxy.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/airbyte/http_caching/mitm_proxy.py b/airbyte/http_caching/mitm_proxy.py
--- a/airbyte/http_caching/mitm_proxy.py
+++ b/airbyte/http_caching/mitm_proxy.py
@@ -31,3 +31,3 @@
     with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
-        s.bind(("", 0))
+        s.bind(("localhost", 0))
         return s.getsockname()[1]
EOF
Copilot is powered by AI and may make mistakes. Always verify output.
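
For reference, a self-contained sketch of the helper with the loopback binding applied. It uses the literal address `127.0.0.1` instead of the patch's `"localhost"` to skip name resolution, and the `host` parameter is an addition for illustration; neither is taken from the PR.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port on the loopback interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Port 0 lets the OS pick a free port; binding to loopback keeps
        # the socket unreachable from other machines.
        s.bind((host, 0))
        return s.getsockname()[1]
```

The port is released when the `with` block exits, so another process could claim it before the proxy binds to it. Callers that need a guarantee should bind the proxy to port 0 directly where possible.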