A robust Python application that syncs Gmail messages to a local SQLite database for analysis and archival purposes.
- Incremental Sync: Only downloads new messages by default
- Full Sync: Option to download all messages and detect deletions
- Parallel Processing: Multi-threaded message fetching for improved performance
- Robust Error Handling: Automatic retries with exponential backoff
- Graceful Shutdown: Handles interruption signals cleanly
- Type Safety: Comprehensive type hints throughout the codebase
- Python 3.8 or higher
- Google Cloud Project with Gmail API enabled
- OAuth 2.0 credentials file (
credentials.json
)
-
Clone the repository:
git clone https://github.com/marcboeker/gmail-to-sqlite.git cd gmail-to-sqlite
-
Install dependencies:
# Using uv uv sync
-
Set up Gmail API credentials:
- Go to the Google Cloud Console
- Create a new project or select an existing one
- Enable the Gmail API
- Create OAuth 2.0 credentials (Desktop application)
- Download the credentials file and save it as
credentials.json
in the project root
# Incremental sync (default)
python main.py sync --data-dir ./data
# Full sync with deletion detection
python main.py sync --data-dir ./data --full-sync
# Sync a specific message
python main.py sync-message --data-dir ./data --message-id MESSAGE_ID
# Detect and mark deleted messages only
python main.py sync-deleted-messages --data-dir ./data
# Use custom number of worker threads
python main.py sync --data-dir ./data --workers 8
command
: Required. One ofsync
,sync-message
, orsync-deleted-messages
--data-dir
: Required. Directory where the SQLite database will be stored--full-sync
: Optional. Forces a complete sync of all messages--message-id
: Required forsync-message
. The ID of a specific message to sync--workers
: Optional. Number of worker threads (default: number of CPU cores)
The application supports graceful shutdown when you press CTRL+C:
- Stops accepting new tasks
- Waits for currently running tasks to complete
- Saves progress of completed work
- Exits cleanly
Pressing CTRL+C a second time will force an immediate exit.
The application creates a SQLite database with the following schema:
Field | Type | Description |
---|---|---|
message_id | TEXT | Unique Gmail message ID |
thread_id | TEXT | Gmail thread ID |
sender | JSON | Sender information (name, email) |
recipients | JSON | Recipients by type (to, cc, bcc) |
labels | JSON | Array of Gmail labels |
subject | TEXT | Message subject |
body | TEXT | Message body (plain text) |
size | INTEGER | Message size in bytes |
timestamp | DATETIME | Message timestamp |
is_read | BOOLEAN | Read status |
is_outgoing | BOOLEAN | Whether sent by user |
is_deleted | BOOLEAN | Whether deleted from Gmail |
last_indexed | DATETIME | Last sync timestamp |
SELECT sender->>'$.email', COUNT(*) AS count
FROM messages
GROUP BY sender->>'$.email'
ORDER BY count DESC
This is great to determine who is spamming you the most with uninteresting emails.
SELECT sender->>'$.email', COUNT(*) AS count
FROM messages
WHERE is_read = 0
GROUP BY sender->>'$.email'
ORDER BY count DESC
- For years:
strftime('%Y', timestamp)
- For months in a year:
strftime('%m', timestamp)
- For days in a month:
strftime('%d', timestamp)
- For weekdays:
strftime('%w', timestamp)
- For hours in a day:
strftime('%H', timestamp)
SELECT strftime('%Y', timestamp) AS period, COUNT(*) AS count
FROM messages
GROUP BY period
ORDER BY count DESC
This is an amateurish way to find all newsletters and group them by sender. It's not perfect, but it's a start. You could also use
SELECT sender->>'$.email', COUNT(*) AS count
FROM messages
WHERE body LIKE '%newsletter%' OR body LIKE '%unsubscribe%'
GROUP BY sender->>'$.email'
ORDER BY count DESC
SELECT sender->>'$.email', sum(size)/1024/1024 AS size
FROM messages
GROUP BY sender->>'$.email'
ORDER BY size DESC
SELECT count(*)
FROM messages
WHERE EXISTS (
SELECT 1
FROM json_each(messages.recipients->'$.to')
WHERE json_extract(value, '$.email') = '[email protected]'
)
AND sender->>'$.email' = '[email protected]'
SELECT sender->>'$.email', sum(size)/1024/1024 as total_size
FROM messages
WHERE is_outgoing=false
GROUP BY sender->>'$.email'
ORDER BY total_size DESC
SELECT message_id, subject, timestamp
FROM messages
WHERE is_deleted=1
ORDER BY timestamp DESC