Dataflow

Dataflow is a comprehensive data processing platform that enables users to build, orchestrate, and execute automated data pipelines through visual pipeline design, code execution, and data transformation capabilities.

Chinese Documentation

Overview

Dataflow covers the main stages of enterprise data processing. Whether you need to process large volumes of data, transform and analyze information, or integrate complex data sources, it supplies the tools and services to do so.

Core Capabilities

Build and execute automated data pipelines with visual pipeline design, code execution, and data transformation capabilities.

Key Features:

Visual pipeline designer for data flows
Sandboxed Python code execution
Data transformation and analysis
Document processing (Word, Excel, PDF)
OCR and text extraction
Scheduled and event-driven execution
Real-time data streaming

Use Cases:

ETL (Extract, Transform, Load) pipelines
Data quality validation and cleansing
Automated report generation
Document processing and analysis
Image and text recognition pipelines

Architecture

Dataflow uses a microservices architecture with the following components:

┌─────────────────────────────────────────────────────────┐
│                  Frontend Layer                         │
│  - dia-flow-web: Data flow visual designer              │
└─────────────────────────────────────────────────────────┘
                           ↕
┌─────────────────────────────────────────────────────────┐
│               Data Processing Services                  │
│  - flow-automation: Data flow orchestration             │
│  - coderunner: Sandboxed code execution                 │
│  - flow-stream-data-pipeline: Real-time streaming       │
│  - ecron: Scheduled task management                     │
└─────────────────────────────────────────────────────────┘
                           ↕
┌─────────────────────────────────────────────────────────┐
│                 Shared Libraries                        │
│  - ide-go-lib: Common Go libraries                      │
└─────────────────────────────────────────────────────────┘

Services Overview

Data Processing Services

flow-automation

Core data flow orchestration service that manages the complete lifecycle of data pipeline executions.

Language: Go
Framework: Gin
Key Features: DAG management, executor management, trigger system, data connections
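
To illustrate what DAG management involves, the following Python sketch orders pipeline nodes so that each node runs only after its upstream nodes. It is not flow-automation's implementation (the service is written in Go); the `execution_order` helper and the node names are hypothetical and exist only for this example.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies):
    """Return node names in an order where every node follows its upstream nodes.

    `dependencies` maps each node to the set of nodes it depends on.
    Raises ValueError if the graph contains a cycle (so it is not a DAG).
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # The second element of args holds the nodes that form the cycle.
        raise ValueError(f"pipeline is not a DAG: {exc.args[1]}") from exc


# Hypothetical pipeline: extract -> clean -> (report, export)
pipeline = {
    "extract": set(),
    "clean": {"extract"},
    "report": {"clean"},
    "export": {"clean"},
}
order = execution_order(pipeline)
```

Here `order` starts with `extract`, then `clean`; `report` and `export` follow in either order because they do not depend on each other.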

coderunner

Sandboxed Python code execution service for running custom data processing logic.

Language: Python 3.9
Key Features: RestrictedPython execution, package management, document processing, OCR

flow-stream-data-pipeline

Real-time data streaming pipeline service.

Key Features: Stream processing, real-time data transformation

ecron

Distributed cron job scheduling and execution service.

Language: Go
Key Features: Cron-based scheduling, immediate execution, task monitoring, multi-node support
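
As a rough illustration of cron-based scheduling, the Python sketch below checks whether a point in time satisfies a five-field cron expression. It is an assumption-laden example, not ecron's parser (ecron is written in Go, and the exact expression syntax it accepts is not described here); the function names are hypothetical.

```python
from datetime import datetime


def _field_matches(field, value, minimum, maximum):
    """Check one cron field. Supports '*', 'n', 'a-b', comma lists, and '/step'."""
    for part in field.split(","):
        step = 1
        has_step = "/" in part
        if has_step:
            part, step_text = part.split("/")
            step = int(step_text)
        if part == "*":
            low, high = minimum, maximum
        elif "-" in part:
            low_text, high_text = part.split("-")
            low, high = int(low_text), int(high_text)
        else:
            low = int(part)
            high = maximum if has_step else low
        if low <= value <= high and (value - low) % step == 0:
            return True
    return False


def cron_matches(expression, moment):
    """Return True if `moment` satisfies a five-field cron expression.

    Simplification: all five fields must match. Standard cron treats
    day-of-month and day-of-week as alternatives when both are restricted.
    """
    minute, hour, day, month, weekday = expression.split()
    cron_weekday = (moment.weekday() + 1) % 7  # cron: 0 = Sunday; Python: 0 = Monday
    return (
        _field_matches(minute, moment.minute, 0, 59)
        and _field_matches(hour, moment.hour, 0, 23)
        and _field_matches(day, moment.day, 1, 31)
        and _field_matches(month, moment.month, 1, 12)
        and _field_matches(weekday, cron_weekday, 0, 6)
    )
```

A scheduler could call `cron_matches` once per minute for each registered task and dispatch the tasks that match.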

Frontend Applications

dia-flow-web

Visual designer for building data processing flows.

Technology: Modern web framework
Features: Drag-and-drop pipeline design, node configuration, execution monitoring

Shared Libraries

ide-go-lib

Common Go libraries shared across Go-based services.

Tech Stack

Backend Services

Go: flow-automation, ecron, ide-go-lib
Python: coderunner, flow-stream-data-pipeline

Frameworks & Libraries

Go: Gin, with client libraries for MongoDB, Redis, and Kafka
Python: Tornado, RestrictedPython, pandas, SQLAlchemy

Infrastructure

Databases: MongoDB, MySQL/MariaDB, Redis
Message Queues: Kafka, NSQ
Container Orchestration: Kubernetes (Helm)
Authentication: OAuth2

Getting Started

Prerequisites

Docker and Docker Compose
Kubernetes cluster (for production deployment)
MongoDB, MySQL/MariaDB, Redis
Kafka or NSQ message queue

Quick Start with Docker Compose

# Clone the repository
git clone <repository-url>
cd dataflow

# Start all services
docker-compose up -d

# Access the application
# Data Flow Designer: http://localhost:3000

Individual Service Setup

Each service can be run independently. Refer to the README in each service directory for specific setup instructions.

Deployment

Kubernetes Deployment

Each service includes Helm charts for Kubernetes deployment:

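# Run these commands from the repository root.
# Each deployment runs in a subshell so the working directory is restored afterwards.

# Deploy flow-automation
(cd flow-automation/helm && helm install flow-automation . -f values.yaml)

# Deploy coderunner
(cd coderunner/helm && helm install coderunner . -f values.yaml)

# Deploy ecron
(cd ecron/helm && helm install ecron . -f values.yaml)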

Configuration

Each service uses environment variables or configuration files. Key configuration areas:

Database connections: MongoDB, MySQL, Redis
Message queue: Kafka/NSQ endpoints
Authentication: OAuth2 service endpoints
Service discovery: Internal service URLs
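
The configuration areas above might be loaded from environment variables along the following lines. This is a minimal Python sketch: every variable name (`MONGODB_URI`, `REDIS_URL`, `KAFKA_BROKERS`, `OAUTH2_TOKEN_URL`) and every default value is a hypothetical placeholder, so check each service's README for the names it actually reads.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceConfig:
    mongodb_uri: str
    redis_url: str
    kafka_brokers: list
    oauth2_token_url: str


def load_config(env=None):
    """Build a ServiceConfig from environment variables, falling back to local defaults.

    All variable names and default values here are hypothetical placeholders.
    """
    env = os.environ if env is None else env
    return ServiceConfig(
        mongodb_uri=env.get("MONGODB_URI", "mongodb://localhost:27017"),
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
        # A comma-separated broker list becomes a Python list.
        kafka_brokers=env.get("KAFKA_BROKERS", "localhost:9092").split(","),
        oauth2_token_url=env.get("OAUTH2_TOKEN_URL", "http://localhost:8080/oauth2/token"),
    )
```

Passing a plain dictionary as `env` keeps the loader easy to test without touching the real process environment.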

Integration

External System Integration

OAuth2 Authentication: Integrate with external identity providers
Message Queues: Connect to Kafka/NSQ for event-driven architectures
REST APIs: All services expose RESTful APIs for integration
Webhooks: Configure webhooks for event notifications

Development

Project Structure

dataflow/
├── flow-automation/       # Data flow orchestration (Go)
├── coderunner/           # Code execution service (Python)
├── ecron/                # Scheduled tasks (Go)
├── flow-stream-data-pipeline/  # Streaming pipeline (Python)
├── dia-flow-web/         # Data flow UI
└── ide-go-lib/          # Shared Go libraries

Contributing

Choose the service you want to contribute to
Follow the development guide in the service's README
Write tests for your changes
Submit a pull request

Code Style

Go: Follow Go standard conventions, use golangci-lint
Python: Follow PEP 8, use black and pylint

Documentation

Use Case Examples

Example 1: Automated Data Processing Pipeline

Design a data flow in dia-flow-web
Configure data source connections
Add transformation nodes with custom Python code
Set up scheduled execution via ecron
Monitor execution in the flow-automation dashboard
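
The custom Python code in a transformation node could look something like the sketch below. The `transform` function name, the list-of-dictionaries record layout, and the `id` and `amount` fields are assumptions made for this example; they are not a contract defined by coderunner.

```python
def transform(records):
    """Cleanse a list of row dictionaries.

    Trims string values, drops rows without an id, and converts the
    `amount` field to a float (None when it cannot be parsed).
    """
    cleaned = []
    for row in records:
        row = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        if not row.get("id"):
            continue  # skip rows that fail the basic validity check
        try:
            row["amount"] = float(row.get("amount", 0))
        except (TypeError, ValueError):
            row["amount"] = None  # keep the row but flag the unparseable value
        cleaned.append(row)
    return cleaned
```

Keeping the transformation a pure function of its input makes it straightforward to test outside the pipeline before wiring it into a node.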

Example 2: Document Processing Pipeline

Create a data flow for document ingestion
Add OCR and text extraction nodes
Configure data transformation and validation
Set up automated report generation
Monitor processing results

Example 3: Real-time Data Streaming

Configure data sources for streaming
Design transformation pipeline
Set up real-time processing rules
Monitor streaming data flow
Export processed results
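
The streaming steps above can be pictured as a chain of lazy stages, sketched here with Python generators. This is only an illustration: an in-memory iterable stands in for a real stream such as a Kafka topic, and the stage names, the `sensor,value` line format, and the threshold rule are all hypothetical rather than part of flow-stream-data-pipeline.

```python
def parse(lines):
    """Yield (sensor, value) pairs, skipping malformed lines."""
    for line in lines:
        sensor, sep, raw = line.partition(",")
        if not sep:
            continue  # no comma: not a valid record
        try:
            yield sensor.strip(), float(raw)
        except ValueError:
            continue  # value is not numeric


def above_threshold(events, threshold):
    """Pass through only events whose value exceeds the threshold."""
    for sensor, value in events:
        if value > threshold:
            yield sensor, value


def run_stream(lines, threshold=50.0):
    """Chain the stages lazily so each record is processed as it arrives."""
    return above_threshold(parse(lines), threshold)
```

Because every stage is a generator, no stage buffers the whole input; a record flows through all stages before the next one is read.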

Monitoring and Observability

Health Checks: All services expose health endpoints
Metrics: Prometheus metrics for monitoring
Logging: Structured logging across all services
Tracing: Distributed tracing support
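
As an example of what structured logging can mean in practice, the sketch below emits each log record as one JSON object using only the Python standard library. The field names and the `run_id` correlation field are assumptions for illustration; the actual log schema used by the services is not specified here.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical extra field for correlating logs with a pipeline run.
        if hasattr(record, "run_id"):
            payload["run_id"] = record.run_id
        return json.dumps(payload)


def build_logger(name):
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger
```

A call such as `build_logger("coderunner").info("run finished", extra={"run_id": "r1"})` would then produce one JSON line that a log aggregator can index by field.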

Security

Authentication: OAuth2-based authentication
Authorization: Role-based access control
Code Execution: Sandboxed execution environment
Data Isolation: Multi-tenant data isolation
Audit Trails: Comprehensive audit logging

Support

For questions and support:

Check service-specific README files
Review API documentation
Contact the development team
