A Python + React platform for collecting typing speed and keystroke data, designed to replicate the data collection methodology from the 136 Million Keystrokes project.
The frontend used for keystroke data collection looks like this:
*(Demo video: `Screen.Recording.2025-11-15.at.7.45.05.PM.mov`)*
Each keystroke is captured with fine-grained timing and metadata.
The dataset includes the following fields (an example record follows the list):
- PARTICIPANT_ID — Unique identifier for each participant.
- TEST_SECTION_ID — The test or section during which data was recorded.
- SENTENCE — The sentence the participant was instructed to type.
- KEYSTROKE_ID — Sequential index of each keystroke.
- PRESS_TIME — Timestamp of when a key was pressed.
- RELEASE_TIME — Timestamp of when a key was released.
- LETTER — The typed character or key (e.g., letters, SHIFT, BKSP).
- KEYCODE — Numerical keycode for the corresponding physical key.
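For illustration, a single record might look like the Python dict below. The identifier values are hypothetical; the timing values mirror the keystroke example used in the API section further down.

```python
# One keystroke record with hypothetical values, for illustration only.
sample_record = {
    "PARTICIPANT_ID": "p-001",      # hypothetical participant
    "TEST_SECTION_ID": "ts-042",    # hypothetical test section
    "SENTENCE": "The quick brown fox...",
    "KEYSTROKE_ID": 1,
    "PRESS_TIME": 1473284537607,    # ms timestamp of key press
    "RELEASE_TIME": 1473284537771,  # ms timestamp of key release
    "LETTER": "T",
    "KEYCODE": 84,
}
```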
This platform captures detailed keystroke metrics including:
- Key press and release timestamps
- Key codes and character values
- Typing patterns and inter-key intervals (derivable from the timestamps, as sketched below)
- Session and participant tracking
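Hold times and inter-key intervals are not stored as separate columns; they can be derived from the press/release timestamps. A minimal sketch using the field names above:

```python
def derive_timing_features(keystrokes):
    """Compute hold times and inter-key intervals (ms) from raw events.

    `keystrokes` is a list of dicts containing PRESS_TIME and RELEASE_TIME,
    ordered by KEYSTROKE_ID.
    """
    hold_times = [k["RELEASE_TIME"] - k["PRESS_TIME"] for k in keystrokes]
    inter_key_intervals = [
        nxt["PRESS_TIME"] - cur["PRESS_TIME"]
        for cur, nxt in zip(keystrokes, keystrokes[1:])
    ]
    return hold_times, inter_key_intervals
```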
Data is stored both locally in CSV files (organized by user and session) and in Databricks Delta tables for real-time analytics and ML model training.
- Backend: FastAPI (Python) - REST API with JWT authentication for receiving and storing keystroke data
- Frontend: React - Typing test interface with keystroke event capture
- Storage:
  - CSV files organized by user and session: `data/{user_id}/{timestamp}/`
  - Databricks Delta tables for real-time ingestion and analytics
- Authentication: JWT-based user authentication with secure password hashing
```
cognitive-load/
├── backend/                          # Python FastAPI backend
│   ├── main.py                       # API endpoints
│   ├── models.py                     # Pydantic data models
│   ├── config.py                     # Configuration (loads from .env)
│   ├── auth.py                       # JWT authentication logic
│   ├── databricks_client/            # Databricks integration
│   │   ├── client.py                 # Databricks SQL client
│   │   └── ingestion.py              # Data ingestion pipeline
│   ├── storage/
│   │   └── csv_writer.py             # CSV persistence layer
│   ├── test/                         # Test scripts
│   │   ├── test_databricks_connection.py
│   │   ├── test_data_insertion.py
│   │   └── ...
│   ├── upload_csv_to_databricks.py   # Standalone CSV upload script
│   ├── .env.example                  # Environment variables template
│   ├── requirements.txt              # Python dependencies
│   └── run.sh                        # Backend startup script
└── frontend/                         # React frontend
    ├── src/
    │   ├── components/
    │   │   ├── Auth.js               # Authentication component
    │   │   └── TypingTest.js         # Main typing test component
    │   └── App.js
    └── package.json
```
Note: The following directories and files are created at runtime and are not tracked in git:

- `data/` - CSV data files organized by user and session
- `users.json` - User database file
- `venv/` - Python virtual environment
- Python 3.8+
- Node.js 16+ and npm
- Databricks account (for real-time data ingestion)
- Navigate to the backend directory:

  ```bash
  cd backend
  ```

- Create a virtual environment (recommended):

  ```bash
  python3 -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Configure environment variables (required):

  Create a `.env` file in the `backend/` directory:

  ```bash
  cp .env.example .env
  ```

  Edit `.env` and add your Databricks credentials:

  ```bash
  # Databricks Configuration (REQUIRED)
  DATABRICKS_SERVER_HOSTNAME=your-databricks-server-hostname.cloud.databricks.com
  DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/your-warehouse-id
  DATABRICKS_ACCESS_TOKEN=your-databricks-access-token

  # JWT Configuration
  JWT_SECRET_KEY=your-secret-key-change-in-production

  # Optional: Data directory (defaults to ../data)
  DATA_DIR=../data

  # Optional: Users database path (defaults to users.json)
  USERS_DB_PATH=users.json
  ```

  Important: Never commit the `.env` file to version control. It's already in `.gitignore`.
- Run the backend server (from the project root):

  ```bash
  cd ..  # Return to project root
  uvicorn backend.main:app --reload --port 8000
  ```

  The API will be available at http://localhost:8000

  API documentation: http://localhost:8000/docs
Note: The backend will fail to start if Databricks credentials are not set in `.env`. See SETUP_DATABRICKS.md for detailed Databricks setup instructions.
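For context, a fail-fast check like the one below is a common way to implement this behavior. This is a sketch, not necessarily the exact logic in `config.py`; the variable names match `.env.example`.

```python
# config.py sketch: load .env, then fail fast if required settings are absent.
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()

REQUIRED_VARS = [
    "DATABRICKS_SERVER_HOSTNAME",
    "DATABRICKS_HTTP_PATH",
    "DATABRICKS_ACCESS_TOKEN",
]
missing = [name for name in REQUIRED_VARS if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")

# Optional settings fall back to the documented defaults.
DATA_DIR = os.getenv("DATA_DIR", "../data")
USERS_DB_PATH = os.getenv("USERS_DB_PATH", "users.json")
```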
- Navigate to the frontend directory:

  ```bash
  cd frontend
  ```

- Install dependencies:

  ```bash
  npm install
  ```

- Set environment variables (optional):

  ```bash
  export REACT_APP_API_URL=http://localhost:8000
  ```

- Start the development server:

  ```bash
  npm start
  ```

  The frontend will be available at http://localhost:3000
- Start the backend server (see Backend Setup above)
- Start the frontend server (see Frontend Setup above)
- Open http://localhost:3000 in your browser
- Register or Login: Create an account or log in with existing credentials
- Read and accept the consent form
- Click "Start Test" to begin
- Type the displayed sentences
- Complete all sentences to finish the test
Note: All API endpoints (except registration) require authentication. Users must register/login before starting a test.
CSV data files are automatically created during test sessions and organized by user and session timestamp in the `data/` directory:
```
data/
  {user_id_1}/
    20241115_143022/
      keystrokes.csv
      sessions.csv
    20241115_150145/
      keystrokes.csv
      sessions.csv
  {user_id_2}/
    20241115_160000/
      keystrokes.csv
      sessions.csv
```
Each session creates a unique timestamped folder containing:
- keystrokes.csv: Individual keystroke events for that session
- sessions.csv: Session summary statistics
This organization allows for:
- Per-user data isolation
- Easy session tracking
- Simple batch uploads to Databricks (see the sketch below)
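For example, enumerating every session's `keystrokes.csv` for a batch upload takes only a few lines of standard-library Python:

```python
from pathlib import Path

def iter_keystroke_csvs(data_dir: str = "data"):
    """Yield (user_id, session_timestamp, path) for each session's keystrokes.csv."""
    for user_dir in sorted(Path(data_dir).iterdir()):
        if not user_dir.is_dir():
            continue
        for session_dir in sorted(user_dir.iterdir()):
            csv_path = session_dir / "keystrokes.csv"
            if csv_path.exists():
                yield user_dir.name, session_dir.name, csv_path

for user_id, session_ts, path in iter_keystroke_csvs():
    print(f"{user_id} / {session_ts}: {path}")
```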
Each keystroke event includes:
- PARTICIPANT_ID: Unique participant identifier
- TEST_SECTION_ID: Unique session identifier
- SENTENCE: Target sentence being typed
- USER_INPUT: Actual user input at capture time
- KEYSTROKE_ID: Sequential keystroke identifier
- PRESS_TIME: Key press timestamp (milliseconds)
- RELEASE_TIME: Key release timestamp (milliseconds)
- LETTER: Character or key name (e.g., 'a', 'SHIFT', 'BKSP')
- KEYCODE: JavaScript keyCode value
Register a new user account.

Request:

```json
{
  "email": "user@example.com",
  "password": "secure-password"
}
```

Response:

```json
{
  "user_id": "uuid-here",
  "email": "user@example.com",
  "created_at": "2024-01-15T10:30:00"
}
```

Login and receive a JWT access token.
Request:

```json
{
  "email": "user@example.com",
  "password": "secure-password"
}
```

Response:

```json
{
  "access_token": "jwt-token-here",
  "token_type": "bearer",
  "user": {
    "user_id": "uuid-here",
    "email": "user@example.com",
    "created_at": "2024-01-15T10:30:00"
  }
}
```

Get current authenticated user info (requires a Bearer token in the Authorization header).
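For example, registering and logging in from a Python client might look like the sketch below. The endpoint paths (`/auth/register`, `/auth/login`) are assumptions for illustration; check http://localhost:8000/docs for the actual routes.

```python
import requests

BASE_URL = "http://localhost:8000"
credentials = {"email": "user@example.com", "password": "secure-password"}

# Hypothetical paths; verify against http://localhost:8000/docs.
requests.post(f"{BASE_URL}/auth/register", json=credentials).raise_for_status()

login = requests.post(f"{BASE_URL}/auth/login", json=credentials)
login.raise_for_status()
token = login.json()["access_token"]

# Every authenticated request carries the JWT as a Bearer token.
headers = {"Authorization": f"Bearer {token}"}
```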
All test endpoints require authentication (Bearer token in Authorization header).
Create a new typing test session.
Request:

```json
{
  "question_count": 10  // Number of sentences in the test
}
```

Response:

```json
{
  "participant_id": "uuid-here",
  "test_section_id": "uuid-here",
  "message": "Session created successfully with 10 questions"
}
```

Create a new test section for a sentence.
Request:

```json
{
  "participant_id": "uuid",
  "sentence": "The quick brown fox..."
}
```

Submit a batch of keystroke events.
Request:

```json
{
  "participant_id": "uuid",
  "test_section_id": "uuid",
  "sentence": "The quick brown fox...",
  "user_input": "The quick brown fox...",
  "keystrokes": [
    {
      "press_time": 1473284537607,
      "release_time": 1473284537771,
      "keycode": 84,
      "letter": "T"
    }
  ]
}
```

Mark a sentence as complete and trigger Databricks ingestion.
End the test session and finalize all data.
Get statistics for a session.
Health check endpoint (no authentication required).
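Putting the test endpoints together, a minimal client flow might look like the following. It continues from the auth sketch above, and the endpoint paths are again assumptions; consult http://localhost:8000/docs for the real routes.

```python
# Continues from the auth sketch: requests, BASE_URL, and headers are
# defined there. All endpoint paths below are assumptions.
session = requests.post(
    f"{BASE_URL}/sessions", json={"question_count": 10}, headers=headers
)
session.raise_for_status()
info = session.json()

batch = {
    "participant_id": info["participant_id"],
    "test_section_id": info["test_section_id"],
    "sentence": "The quick brown fox...",
    "user_input": "T",  # input captured so far
    "keystrokes": [
        {
            "press_time": 1473284537607,
            "release_time": 1473284537771,
            "keycode": 84,
            "letter": "T",
        }
    ],
}
requests.post(f"{BASE_URL}/keystrokes", json=batch, headers=headers).raise_for_status()
```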
This platform includes real-time Databricks integration for data ingestion and analytics.
- Real-time Ingestion: Data is automatically sent to Databricks after each sentence completion
- Delta Tables: Data is stored in Delta tables (`keystrokes` and `sessions`) for efficient querying
- Upsert Logic: Re-running tests replaces existing data rather than appending (see the MERGE sketch after this list)
- Automatic Table Creation: Tables are created automatically on first use
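The upsert behavior can be pictured as a Delta `MERGE` keyed on the identifiers that uniquely locate a keystroke. The statement below is a sketch of that idea; the staging table name is hypothetical, and the actual SQL lives in `databricks_client/ingestion.py` and may differ.

```python
# Sketch of the upsert idea; `staged_keystrokes` is a hypothetical staging
# table, and the real statement in ingestion.py may differ.
from databricks import sql  # pip install databricks-sql-connector

MERGE_STMT = """
MERGE INTO keystrokes AS target
USING staged_keystrokes AS source
  ON  target.participant_id  = source.participant_id
  AND target.test_section_id = source.test_section_id
  AND target.keystroke_id    = source.keystroke_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""

with sql.connect(
    server_hostname="your-databricks-server-hostname.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/your-warehouse-id",
    access_token="your-databricks-access-token",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute(MERGE_STMT)
```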
- Configure Credentials: Set Databricks credentials in `backend/.env` (see Backend Setup above)
- Start SQL Warehouse: Ensure your Databricks SQL warehouse is running
- Test Connection: Run the connection test script:

  ```bash
  cd backend
  python test/test_databricks_connection.py
  ```
Data is automatically uploaded to Databricks during the typing test. No manual steps required.
Upload existing CSV files using the standalone script:

```bash
cd backend
python upload_csv_to_databricks.py ../data/{user_id}/{timestamp}/keystrokes.csv
```

You can also create a Databricks notebook for batch processing CSV files from the data directory.
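A notebook cell for that batch path could be as simple as the sketch below, assuming the CSV files have been uploaded somewhere the workspace can read; the volume path is hypothetical.

```python
# Databricks notebook sketch: bulk-load session CSVs into the keystrokes
# Delta table. The volume path is hypothetical; point it wherever the
# data/{user_id}/{timestamp}/ folders were uploaded.
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/Volumes/main/default/typing_data/*/*/keystrokes.csv")
)

# mode("append") is shown for simplicity; the live pipeline upserts instead.
df.write.mode("append").saveAsTable("keystrokes")
```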
`keystrokes` table:
- participant_id, test_section_id, sentence, user_input
- keystroke_id, press_time, release_time, letter, keycode
- session_timestamp, created_at
`sessions` table:
- participant_id, test_section_id, created_at
- sentence_count, total_keystrokes, average_wpm
- session_timestamp
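With the data in Delta, session-level analytics reduce to SQL. For example, a per-participant summary over the `sessions` columns above, using the same connection placeholders as the earlier sketch:

```python
from databricks import sql  # pip install databricks-sql-connector

SUMMARY_QUERY = """
SELECT participant_id,
       AVG(average_wpm)       AS mean_wpm,
       SUM(total_keystrokes)  AS total_keystrokes,
       COUNT(test_section_id) AS session_count
FROM sessions
GROUP BY participant_id
ORDER BY mean_wpm DESC
"""

with sql.connect(
    server_hostname="your-databricks-server-hostname.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/your-warehouse-id",
    access_token="your-databricks-access-token",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute(SUMMARY_QUERY)
        for participant_id, mean_wpm, keystrokes, sessions in cursor.fetchall():
            print(participant_id, mean_wpm, keystrokes, sessions)
```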
Troubleshooting Databricks Connection:
- Ensure your SQL warehouse is running in Databricks
- Verify credentials in the `.env` file are correct
- Check firewall/network connectivity to Databricks
- Verify the access token has not expired
Backend Connection Tests:
```bash
cd backend

# Test Databricks connection
python test/test_databricks_connection.py

# Test data insertion
python test/test_data_insertion.py

# Full integration test
python test/test_databricks.py
```

Frontend tests:

```bash
cd frontend
npm test
```

Backend: Follow PEP 8 and use the Black formatter.
Frontend: Follow the ESLint rules from react-scripts.
- Users must provide explicit consent before data collection
- Data collection is transparent and clearly explained
- Participants can decline to participate
- Data should be anonymized before sharing or analysis (one approach is sketched after this list)
- Follow applicable data protection regulations (GDPR, etc.)
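For the anonymization step, one common approach is to replace participant identifiers with keyed hashes before data leaves the platform. This is a sketch of that idea, not this repo's implementation; the salt handling is an assumption.

```python
import hashlib
import hmac

def pseudonymize(participant_id: str, secret_salt: bytes) -> str:
    """Map a participant ID to a stable keyed hash before data is shared."""
    digest = hmac.new(secret_salt, participant_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

# Example: the same ID always maps to the same pseudonym for a given salt.
print(pseudonymize("p-001", secret_salt=b"keep-this-secret"))
```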
MIT License - See LICENSE file for details
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
- Environment Variables: All sensitive credentials (Databricks tokens, JWT secrets) must be stored in the `.env` file
- Never Commit Secrets: The `.env` file is in `.gitignore`; never commit it to version control
- Use `.env.example`: Copy `.env.example` to `.env` and fill in your actual values
- JWT Secret: Change the default JWT secret key in production
For issues or questions, please open an issue on GitHub.