
Commit 974436b

Merge branch 'main' into koen/enclave-datasets
2 parents a3026b6 + 44ed2e2 commit 974436b

File tree

11 files changed: +2211 additions, -5 deletions


README.md

Lines changed: 8 additions & 1 deletion

@@ -8,6 +8,13 @@

Syft client lets data scientists submit computations that are run by data owners on private data — all through cloud storage their organizations already use (Google Drive, Microsoft 365, etc.). No new infrastructure required.

## Docs

- [Workflow](docs/workflow.md) — End-to-end privacy-preserving data analysis workflow
- [API Reference](docs/API.md) — All public client methods and properties
- [Authentication & Setup](docs/auth.md) — Google Cloud OAuth setup for local/Jupyter usage
- [Background Services](packages/syft-bg/README.md) — Email notifications, auto-approval, and TUI dashboard

## Features

- **Privacy-preserving** — Private data never leaves the data owner's machine; only approved results are shared

@@ -28,7 +35,7 @@ import syft_client as sc

```python
# Login (Colab auth; outside Colab, pass token_path)
do = sc.login_do(email="do@org.com")
ds = sc.login_ds(email="ds@org.com")
```

docs/API.md

Lines changed: 199 additions & 0 deletions

@@ -0,0 +1,199 @@

# Client API Reference

## Creating a Client

### `login_do(email, token_path=None)`

Create a Data Owner client.

```python
# Google Colab
do_client = login_do(email="owner@example.com")

# Jupyter Lab (local)
do_client = login_do(email="owner@example.com", token_path="path/to/token.json")
```

### `login_ds(email, token_path=None)`

Create a Data Scientist client.

```python
# Google Colab
ds_client = login_ds(email="scientist@example.com")

# Jupyter Lab (local)
ds_client = login_ds(email="scientist@example.com", token_path="path/to/token.json")
```

---

## Properties

### `client.email`

The email address of the client.

### `client.peers`

Get the list of peers. Auto-syncs before returning.

- **DO**: Returns approved peers followed by pending peer requests.
- **DS**: Returns all connected peers.

Returns a `PeerList`.

### `client.jobs`

Get the list of jobs. Auto-syncs before returning.

Returns a `JobsList`.

### `client.datasets`

Get the dataset manager. Auto-syncs before returning.

Returns a `SyftDatasetManager`. Use `.get_all()` or `.get(name, datasite)` to query datasets.

---
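The query pattern of the dataset manager can be illustrated with a minimal stand-in. This is a sketch only: the real `SyftDatasetManager` auto-syncs with Google Drive, and everything here other than the `get_all` / `get(name, datasite)` signatures (the `Dataset` record shape, the class internals) is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    # Hypothetical shape of a dataset record, for illustration only.
    name: str
    datasite: str  # email of the owning Data Owner

class FakeDatasetManager:
    """Stand-in for SyftDatasetManager; the real one syncs via Drive."""

    def __init__(self, datasets):
        self._datasets = list(datasets)

    def get_all(self):
        # Return every dataset visible to this client.
        return list(self._datasets)

    def get(self, name, datasite):
        # Look up one dataset by name and owning datasite.
        for ds in self._datasets:
            if ds.name == name and ds.datasite == datasite:
                return ds
        return None

manager = FakeDatasetManager([Dataset("my dataset", "owner@example.com")])
print(len(manager.get_all()))                            # 1
print(manager.get("my dataset", "owner@example.com").name)  # my dataset
```

With a real client, `client.datasets` returns the manager already populated, so only the two query calls apply.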
## Peer Management

### `client.add_peer(peer_email)`

Request a peer connection.

- **DS** calls this to request access to a DO.
- The DO must approve the request before syncing is enabled.

```python
ds_client.add_peer("owner@example.com")
```

### `client.load_peers()`

Reload the peer list from the transport layer.

### `client.approve_peer_request(email_or_peer)`

Approve a pending peer request. **DO only.**

```python
do_client.approve_peer_request("scientist@example.com")
```

### `client.reject_peer_request(email_or_peer)`

Reject a pending peer request. **DO only.**

```python
do_client.reject_peer_request("scientist@example.com")
```

---

## Syncing

### `client.sync(auto_checkpoint=True, checkpoint_threshold=50)`

Sync local state with Google Drive.

- **DO**: Pulls incoming messages from approved peers and optionally creates a checkpoint.
- **DS**: Pushes pending changes and pulls results from peers.

```python
client.sync()
```

---

## Datasets

### `client.create_dataset(name, mock_path, private_path=None, summary=None, users=None, upload_private=False)`

Create and upload a dataset. **DO only.**

- `mock_path`: Path to public mock data (shared with approved peers).
- `private_path`: Path to private data (never leaves the DO).
- `users`: List of emails to share with, or `"any"` for all approved peers.

```python
do_client.create_dataset(
    name="my dataset",
    mock_path="/path/to/mock.csv",
    private_path="/path/to/private.csv",
    summary="Example dataset",
    users=["scientist@example.com"],
)
```

### `client.delete_dataset(name, datasite)`

Delete a dataset. **DO only.**

```python
do_client.delete_dataset(name="my dataset", datasite="owner@example.com")
```

### `client.share_dataset(tag, users)`

Share an existing dataset with additional users. **DO only.**

- `tag`: Dataset name.
- `users`: List of email addresses or `"any"`.

```python
do_client.share_dataset("my dataset", users=["new_user@example.com"])
```

---
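`create_dataset` expects a mock file that mirrors the private file's schema without exposing any private values. One way to derive such a mock from a private CSV, as a standard-library sketch (the file names, helper name, and placeholder scheme are all illustrative, not part of syft_client):

```python
import csv

def make_mock(private_path, mock_path, n_rows=3, placeholder="XXX"):
    """Copy only the header of a private CSV; fill rows with placeholders."""
    with open(private_path, newline="") as f:
        header = next(csv.reader(f))  # read the column names only
    with open(mock_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for _ in range(n_rows):
            # No private cell is ever copied, only the schema.
            writer.writerow([placeholder] * len(header))
    return header

# Example: build a private file, then derive a schema-matching mock.
with open("private.csv", "w", newline="") as f:
    csv.writer(f).writerows([["age", "income"], ["41", "52000"]])

header = make_mock("private.csv", "mock.csv")
print(header)  # ['age', 'income']
```

The resulting pair of paths can then be passed as `mock_path` and `private_path` to `create_dataset`.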
## Jobs

### `client.submit_python_job(user, code_path, job_name=None, entrypoint=None)`

Submit a Python job to a Data Owner. **DS only.**

- `user`: DO email to submit the job to.
- `code_path`: Path to a Python script or folder.
- `entrypoint`: Entry script (auto-detected if `main.py` exists in the folder).

```python
ds_client.submit_python_job(
    user="owner@example.com",
    code_path="/path/to/script.py",
)
```

### `client.submit_bash_job(user, code_path, job_name=None)`

Submit a bash job to a Data Owner. **DS only.**

```python
ds_client.submit_bash_job(
    user="owner@example.com",
    code_path="/path/to/script.sh",
)
```

### `client.process_approved_jobs(stream_output=True, timeout=None, force_execution=False)`

Run all approved jobs. **DO only.**

- `stream_output`: Stream stdout/stderr in real time.
- `timeout`: Timeout in seconds per job (default: 300).
- `force_execution`: Skip version compatibility checks.

```python
do_client.process_approved_jobs()
```

---
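On the DS side, results only appear after the DO has approved and processed the job, so a submit-then-poll loop around `sync()` is the typical shape. A runnable sketch with a stub client (only `submit_python_job`, `sync`, and the `jobs` property mirror the real API; the stub's internals, the `StubJob` fields, and the status strings are hypothetical):

```python
class StubJob:
    def __init__(self, name):
        self.name = name
        self.status = "pending"  # hypothetical states: pending -> done

class StubClient:
    """Stand-in for a DS client; the real one syncs via Google Drive."""

    def __init__(self):
        self._jobs = []

    def submit_python_job(self, user, code_path, job_name=None):
        job = StubJob(job_name or code_path)
        self._jobs.append(job)
        return job

    def sync(self):
        # The real client pushes pending changes and pulls results;
        # here we simply simulate the DO having finished every job.
        for job in self._jobs:
            job.status = "done"

    @property
    def jobs(self):
        return self._jobs

ds = StubClient()
job = ds.submit_python_job(user="owner@example.com", code_path="analysis.py")

while job.status != "done":  # poll until the DO has processed the job
    ds.sync()

print(job.status)  # done
```

With a real client, each `sync()` round-trips through Drive, so the loop would typically sleep between iterations.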
## Cleanup

### `client.delete_syftbox(verbose=True, broadcast_delete_events=True)`

Delete all SyftBox state: Google Drive files, local caches, and the local folder.

- `broadcast_delete_events`: Notify approved peers about deleted files before cleanup.

docs/auth.md

Lines changed: 84 additions & 0 deletions

@@ -0,0 +1,84 @@

# Authentication

When you log in with a Gmail account in Google Colab, Colab handles authentication automatically via a browser pop-up. Once authenticated, Syft Client uses Google Drive as its communication protocol — all messages, events, and files are synced through the Drive API.

**If you're using Google Colab, you can skip the rest of this page.**

## Local / Jupyter Lab Setup

To use Syft Client outside of Google Colab, you need to set up a Google Cloud project with OAuth credentials.

## Step 1: Create a Google Cloud Project

1. Go to the [Google Cloud Console](https://console.cloud.google.com/)
2. Click **Select a project** in the top navigation bar
3. Click **New Project** in the dialog that appears
4. Enter a project name (e.g., "Syft Client")
5. Click **Create**
6. Wait for the project to be created, then select it

## Step 2: Enable the Google Drive API

1. In your project, go to **APIs & Services** > **Library**
2. Search for "Google Drive API"
3. Click on **Google Drive API**
4. Click **Enable**

## Step 3: Configure OAuth Consent Screen

1. Go to **APIs & Services** > **OAuth consent screen**
2. Select **External** user type (unless you have a Google Workspace organization)
3. Click **Create**
4. Fill in the required fields:
   - **App name**: "Syft Client" (or your preferred name)
   - **User support email**: Your email address
   - **Developer contact information**: Your email address
5. Click **Save and Continue**
6. On the **Scopes** page:
   - Click **Add or Remove Scopes**
   - Search for and select `https://www.googleapis.com/auth/drive`
   - Click **Update**
   - Click **Save and Continue**
7. On the **Test users** page:
   - Click **Add Users**
   - Add the email addresses of users who will test the app
   - Click **Save and Continue**
8. Review the summary and click **Back to Dashboard**

## Step 4: Create OAuth Client Credentials

1. Go to **APIs & Services** > **Credentials**
2. Click **Create Credentials** > **OAuth client ID**
3. Select **Desktop app** as the application type
4. Enter a name (e.g., "Syft Client Desktop")
5. Click **Create**
6. **Download the JSON file** - this contains your client credentials
7. Save this file securely (e.g., as `credentials.json`)

## Step 5: Publish the App

For testing, your app can remain in "Testing" mode with up to 100 test users. To allow any Google user to authenticate:

1. Go to **APIs & Services** > **OAuth consent screen**
2. Click **Publish App**
3. Confirm publishing when prompted

**Important:** If your app is not published (i.e., it remains in "Testing" mode), OAuth tokens expire every 7 days and users will need to re-authenticate. Publishing the app removes this limitation.

> **Note:** Publishing may require verification for apps requesting sensitive scopes like Google Drive access.

## Generating a Token

Once you've completed the Google Cloud Console setup, generate a token:

```bash
python scripts/create_token.py --credentials path/to/credentials.json --output token.json
```

Then pass the token path when logging in:

```python
do_client = login_do(email="your@email.com", token_path="path/to/token.json")
```

If your app is not published, tokens expire every 7 days and you'll need to regenerate them.

docs/workflow.md

Lines changed: 43 additions & 0 deletions

@@ -0,0 +1,43 @@

# Privacy-Preserving Data Analysis Workflow

The following diagram demonstrates the complete workflow for privacy-preserving data analysis using Beach Notebooks, involving both the Data Owner (DO) and the Data Scientist (DS).

```mermaid
sequenceDiagram
    participant DO as Data Owner
    participant DON as DO Notebook
    participant DSN as DS Notebook
    participant DS as Data Scientist

    Note over DO,DON: 1. Dataset Publication
    DON->>DO: Create & publish dataset
    DO-->>DS: Dataset available

    Note over DS,DSN: 2. Mock Data Testing
    DSN->>DS: Download mock data
    DS->>DSN: Test analysis code

    Note over DS,DSN: 3. Job Submission
    DSN->>DO: Submit analysis job

    Note over DO,DON: 4. Job Review
    DON->>DO: View pending jobs
    DO->>DON: Review code
    DON->>DO: Approve job

    Note over DO,DON: 5. Job Processing
    DON->>DO: Process approved jobs
    DO->>DS: Results available

    Note over DS,DSN: 6. View Results
    DSN->>DS: Retrieve results
```

## Workflow Steps

1. **Dataset Publication**: The Data Owner publishes a dataset with both mock (public) and private components.
2. **Mock Data Testing**: The Data Scientist downloads the mock data to explore the structure and test their analysis code locally.
3. **Job Submission**: Once satisfied with the code on mock data, the Data Scientist submits the analysis job to the Data Owner.
4. **Job Review**: The Data Owner views pending jobs, reviews the code for safety and privacy, and approves it.
5. **Job Processing**: The Data Owner processes the approved jobs, executing the code on the private data in a controlled environment.
6. **View Results**: The Data Scientist retrieves the results of the analysis.
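The review gate in steps 4-5 is the privacy boundary: submitted code touches private data only after an explicit approval. A standalone sketch of that lifecycle (all names and states here are hypothetical illustrations, not the syft_client API):

```python
# Hypothetical model of the DO-side job lifecycle from steps 4-5.
PENDING, APPROVED, REJECTED, DONE = "pending", "approved", "rejected", "done"

class Job:
    def __init__(self, name, code):
        self.name, self.code, self.status = name, code, PENDING
        self.result = None

def review(job, looks_safe):
    # Step 4: the DO reads the submitted code and decides.
    job.status = APPROVED if looks_safe else REJECTED

def process_approved(jobs, private_data):
    # Step 5: only approved code is ever executed on private data.
    for job in jobs:
        if job.status == APPROVED:
            job.result = job.code(private_data)
            job.status = DONE

private_data = [3, 1, 4, 1, 5]
jobs = [
    Job("mean", lambda d: sum(d) / len(d)),  # aggregate: safe to approve
    Job("dump", lambda d: d),                # leaks raw rows: reject
]

review(jobs[0], looks_safe=True)
review(jobs[1], looks_safe=False)
process_approved(jobs, private_data)

print(jobs[0].status, jobs[0].result)  # done 2.8
print(jobs[1].status, jobs[1].result)  # rejected None
```

The rejected job never runs, so only the approved aggregate result (step 6) ever becomes visible to the Data Scientist.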
