Dataflow is a comprehensive data processing platform that enables users to build, orchestrate, and execute automated data pipelines through visual pipeline design, code execution, and data transformation capabilities.
Dataflow targets enterprise data processing needs: handling large volumes of data, transforming and analyzing information, and integrating complex data sources.
Key Features:
- Visual pipeline designer for data flows
- Sandboxed Python code execution
- Data transformation and analysis
- Document processing (Word, Excel, PDF)
- OCR and text extraction
- Scheduled and event-driven execution
- Real-time data streaming
Use Cases:
- ETL (Extract, Transform, Load) pipelines
- Data quality validation and cleansing
- Automated report generation
- Document processing and analysis
- Image and text recognition pipelines
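As an illustration of the data quality validation and cleansing use case, here is a minimal sketch of what a custom Python step might look like. It uses only the standard library; the function names (`cleanse`, `validate`, `run_quality_step`) and the record layout are assumptions made for this example, not part of Dataflow's API.

```python
from typing import Any, Dict, Iterable, List, Tuple

Record = Dict[str, Any]


def cleanse(record: Record) -> Record:
    """Trim whitespace from string values and turn empty strings into None."""
    cleaned: Record = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip() or None
        cleaned[key] = value
    return cleaned


def validate(record: Record, required: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the record is valid."""
    return [f"missing field: {name}" for name in required if record.get(name) is None]


def run_quality_step(
    records: Iterable[Record], required: List[str]
) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Split cleansed records into valid ones and rejected ones with reasons."""
    valid: List[Record] = []
    rejected: List[Tuple[Record, List[str]]] = []
    for raw in records:
        record = cleanse(raw)
        problems = validate(record, required)
        if problems:
            rejected.append((record, problems))
        else:
            valid.append(record)
    return valid, rejected


# Hypothetical input rows for illustration only.
rows = [{"id": "1", "email": " a@example.com "}, {"id": "2", "email": "   "}]
good, bad = run_quality_step(rows, required=["id", "email"])
print(good)  # records that passed validation
print(bad)   # rejected records paired with the reasons
```

Keeping rejected records together with their reasons, rather than dropping them, makes it easier to feed them into a later reporting or review step.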
Dataflow uses a microservices architecture with the following components:
┌─────────────────────────────────────────────────────────┐
│ Frontend Layer │
│ - dia-flow-web: Data flow visual designer │
└─────────────────────────────────────────────────────────┘
↕
┌─────────────────────────────────────────────────────────┐
│ Data Processing Services │
│ - flow-automation: Data flow orchestration │
│ - coderunner: Sandboxed code execution │
│ - flow-stream-data-pipeline: Real-time streaming │
│ - ecron: Scheduled task management │
└─────────────────────────────────────────────────────────┘
↕
┌─────────────────────────────────────────────────────────┐
│ Shared Libraries │
│ - ide-go-lib: Common Go libraries │
└─────────────────────────────────────────────────────────┘
flow-automation: Core data flow orchestration service that manages the complete lifecycle of data pipeline executions.
- Language: Go
- Framework: Gin
- Key Features: DAG management, executor management, trigger system, data connections
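To make the idea of DAG management concrete, the sketch below shows how an execution order can be derived from node dependencies with a topological sort. It is written in Python with the standard `graphlib` module for brevity; flow-automation itself is a Go service, and the node names and helper functions here are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+
from typing import Dict, List, Set


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every node runs after its dependencies.

    `dependencies` maps a node name to the set of nodes that must finish first.
    Raises graphlib.CycleError if the graph is not a DAG.
    """
    return list(TopologicalSorter(dependencies).static_order())


def is_valid_dag(dependencies: Dict[str, Set[str]]) -> bool:
    """Report whether the dependency graph is free of cycles."""
    try:
        execution_order(dependencies)
    except CycleError:
        return False
    return True


# Hypothetical four-step pipeline used only for illustration.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"validate"},
}

print(execution_order(pipeline))
print(is_valid_dag({"a": {"b"}, "b": {"a"}}))
```

Rejecting cyclic graphs before execution is the main point: a pipeline with a cycle has no valid run order, so it is cheaper to detect that at design time than at run time.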
coderunner: Sandboxed Python code execution service for running custom data processing logic.
- Language: Python 3.9
- Key Features: RestrictedPython execution, package management, document processing, OCR
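The general idea behind restricted execution can be sketched with the standard library alone: inspect the user's code before running it, then run it with only an allow-list of builtins. The example below is a simplified illustration under those assumptions. It does not use RestrictedPython's actual API, it is not how coderunner is implemented, and it is not a security boundary on its own; the names `SAFE_BUILTINS`, `check_source`, and `run_user_code` are hypothetical.

```python
import ast
from typing import Any, Dict

# Hypothetical allow-list of builtins exposed to user code.
SAFE_BUILTINS = {
    "len": len, "sum": sum, "min": min, "max": max,
    "sorted": sorted, "range": range,
}


class UnsafeCodeError(Exception):
    """Raised when user code uses a construct this sketch refuses to run."""


def check_source(source: str) -> ast.Module:
    """Parse the source and reject imports and double-underscore access."""
    tree = ast.parse(source, mode="exec")
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            raise UnsafeCodeError("imports are not allowed")
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            raise UnsafeCodeError(f"access to {node.attr!r} is not allowed")
        if isinstance(node, ast.Name) and node.id.startswith("__"):
            raise UnsafeCodeError(f"name {node.id!r} is not allowed")
    return tree


def run_user_code(source: str, inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Run checked code with limited builtins and return its variables."""
    tree = check_source(source)
    scope: Dict[str, Any] = {"__builtins__": SAFE_BUILTINS, **inputs}
    exec(compile(tree, "<user-code>", "exec"), scope)
    return {key: value for key, value in scope.items() if key != "__builtins__"}


result = run_user_code("total = sum(values)", {"values": [1, 2, 3]})
print(result["total"])  # 6
```

A production sandbox typically adds process isolation and resource limits on top of this kind of static check.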
flow-stream-data-pipeline: Real-time data streaming pipeline service.
- Key Features: Stream processing, real-time data transformation
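Stream processing with real-time transformation can be pictured as a chain of lazy steps that handle events one at a time. The following is a minimal generator-based sketch, assuming an in-memory iterable stands in for a real Kafka or NSQ consumer; `transform` and `tumbling_window` are hypothetical helper names, not functions of flow-stream-data-pipeline.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def transform(stream: Iterable[T], fn: Callable[[T], U]) -> Iterator[U]:
    """Apply fn to each event as it arrives, without buffering the stream."""
    for event in stream:
        yield fn(event)


def tumbling_window(stream: Iterable[T], size: int) -> Iterator[List[T]]:
    """Group events into consecutive, non-overlapping batches of `size`."""
    if size <= 0:
        raise ValueError("size must be positive")
    batch: List[T] = []
    for event in stream:
        batch.append(event)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final partial batch so no events are lost.
        yield batch


# Hypothetical sensor readings standing in for a message-queue consumer.
readings = [1, 2, 3, 4, 5]
doubled = transform(readings, lambda value: value * 2)
for window in tumbling_window(doubled, size=2):
    print(window, sum(window) / len(window))
```

Because both steps are generators, each event flows through the whole chain before the next one is read, which is the property that keeps memory use bounded on an unbounded stream.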
ecron: Distributed cron job scheduling and execution service.
- Language: Go
- Key Features: Cron-based scheduling, immediate execution, task monitoring, multi-node support
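Cron-based scheduling comes down to checking whether a point in time matches a five-field expression (minute, hour, day of month, month, day of week). The sketch below shows that check in Python for illustration; ecron is a Go service and its supported syntax may differ. This simplified matcher supports `*`, single values, lists, ranges, and steps on `*` or on a range, and it requires all five fields to match, whereas classic cron treats day of month and day of week as alternatives when both are restricted.

```python
from datetime import datetime
from typing import Set

# (min, max) for minute, hour, day of month, month, day of week (0 = Sunday).
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def expand_field(field: str, low: int, high: int) -> Set[int]:
    """Expand one cron field such as '*/15' or '1-5,7' into a set of ints."""
    values: Set[int] = set()
    for part in field.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/", 1)
            step = int(step_text)
        if part == "*":
            start, end = low, high
        elif "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if step <= 0 or start < low or end > high or start > end:
            raise ValueError(f"invalid cron field: {field!r}")
        values.update(range(start, end + 1, step))
    return values


def matches(expression: str, moment: datetime) -> bool:
    """Return True if `moment` satisfies every field of the expression."""
    fields = expression.split()
    if len(fields) != 5:
        raise ValueError("expected 5 space-separated fields")
    # datetime.weekday() uses Monday = 0; cron uses Sunday = 0.
    actual = [
        moment.minute, moment.hour, moment.day, moment.month,
        (moment.weekday() + 1) % 7,
    ]
    return all(
        value in expand_field(field, low, high)
        for field, (low, high), value in zip(fields, FIELD_RANGES, actual)
    )


monday_morning = datetime(2024, 1, 1, 9, 30)  # a Monday
print(matches("*/15 9 * * 1-5", monday_morning))  # True
print(matches("0 9 * * *", monday_morning))       # False
```

A scheduler would typically evaluate a check like this once per minute for each registered job, with some form of locking so that only one node in a multi-node setup fires a given run.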
dia-flow-web: Visual designer for building data processing flows.
- Technology: Modern web framework
- Features: Drag-and-drop pipeline design, node configuration, execution monitoring
ide-go-lib: Common Go libraries shared across Go-based services.
Languages:
- Go: flow-automation, ecron, ide-go-lib
- Python: coderunner, flow-stream-data-pipeline
Frameworks and Libraries:
- Go: Gin, MongoDB, Redis, Kafka
- Python: Tornado, RestrictedPython, pandas, SQLAlchemy
Infrastructure:
- Databases: MongoDB, MySQL/MariaDB, Redis
- Message Queues: Kafka, NSQ
- Container Orchestration: Kubernetes (Helm)
- Authentication: OAuth2
Prerequisites:
- Docker and Docker Compose
- Kubernetes cluster (for production deployment)
- MongoDB, MySQL/MariaDB, Redis
- Kafka or NSQ message queue
# Clone the repository
git clone <repository-url>
cd dataflow
# Start all services
docker-compose up -d
# Access the application
# Data Flow Designer: http://localhost:3000
Each service can be run independently. Refer to the README in each service directory for specific setup instructions.
Each service includes Helm charts for Kubernetes deployment:
# Deploy flow-automation (paths are relative to the repository root)
cd flow-automation/helm
helm install flow-automation . -f values.yaml
cd ../..
# Deploy coderunner
cd coderunner/helm
helm install coderunner . -f values.yaml
cd ../..
# Deploy ecron
cd ecron/helm
helm install ecron . -f values.yaml
Each service uses environment variables or configuration files. Key configuration areas:
- Database connections: MongoDB, MySQL, Redis
- Message queue: Kafka/NSQ endpoints
- Authentication: OAuth2 service endpoints
- Service discovery: Internal service URLs
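The configuration areas above could be read from environment variables along the following lines. This is a minimal sketch, not the services' actual configuration code: the variable names (`MONGODB_URI`, `REDIS_URL`, `KAFKA_BROKERS`, `OAUTH2_ISSUER_URL`) and the defaults are assumptions chosen for the example, so check each service's README for the real keys.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class ServiceConfig:
    mongodb_uri: str
    redis_url: str
    kafka_brokers: Tuple[str, ...]
    oauth2_issuer_url: str


def load_config(env: Mapping[str, str] = os.environ) -> ServiceConfig:
    """Build a ServiceConfig from environment variables (hypothetical names)."""

    def require(name: str) -> str:
        value = env.get(name)
        if not value:
            raise RuntimeError(f"missing required environment variable: {name}")
        return value

    # Comma-separated broker list, e.g. "kafka-1:9092,kafka-2:9092".
    brokers = env.get("KAFKA_BROKERS", "localhost:9092")
    return ServiceConfig(
        mongodb_uri=require("MONGODB_URI"),
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
        kafka_brokers=tuple(b.strip() for b in brokers.split(",") if b.strip()),
        oauth2_issuer_url=require("OAUTH2_ISSUER_URL"),
    )


# Passing a plain dict instead of os.environ keeps the example self-contained.
example_env = {
    "MONGODB_URI": "mongodb://localhost:27017/dataflow",
    "OAUTH2_ISSUER_URL": "https://auth.example.com",
    "KAFKA_BROKERS": "kafka-1:9092, kafka-2:9092",
}
print(load_config(example_env))
```

Failing fast on a missing required variable surfaces misconfiguration at startup instead of on the first database or queue call.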
- OAuth2 Authentication: Integrate with external identity providers
- Message Queues: Connect to Kafka/NSQ for event-driven architectures
- REST APIs: All services expose RESTful APIs for integration
- Webhooks: Configure webhooks for event notifications
dataflow/
├── flow-automation/ # Data flow orchestration (Go)
├── coderunner/ # Code execution service (Python)
├── ecron/ # Scheduled tasks (Go)
├── flow-stream-data-pipeline/ # Streaming pipeline (Python)
├── dia-flow-web/ # Data flow UI
└── ide-go-lib/ # Shared Go libraries
- Choose the service you want to contribute to
- Follow the development guide in the service's README
- Write tests for your changes
- Submit a pull request
- Go: Follow Go standard conventions, use golangci-lint
- Python: Follow PEP 8, use black and pylint
- Design a data flow in dia-flow-web
- Configure data source connections
- Add transformation nodes with custom Python code
- Set up scheduled execution via ecron
- Monitor execution in the flow-automation dashboard
- Create a data flow for document ingestion
- Add OCR and text extraction nodes
- Configure data transformation and validation
- Set up automated report generation
- Monitor processing results
- Configure data sources for streaming
- Design transformation pipeline
- Set up real-time processing rules
- Monitor streaming data flow
- Export processed results
- Health Checks: All services expose health endpoints
- Metrics: Prometheus metrics for monitoring
- Logging: Structured logging across all services
- Tracing: Distributed tracing support
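A health endpoint usually aggregates a few dependency checks into a status code and a small JSON body. The sketch below shows one possible shape using only Python's standard library; the `/healthz` path, the payload fields, and the static `checks` dictionary are assumptions for illustration, since the actual endpoints are defined by each service.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Tuple


def health_payload(checks: Dict[str, bool]) -> Tuple[int, dict]:
    """Map dependency check results to an HTTP status and a JSON-ready body."""
    healthy = all(checks.values())
    body = {"status": "ok" if healthy else "degraded", "checks": checks}
    return (200 if healthy else 503), body


class HealthHandler(BaseHTTPRequestHandler):
    # Hypothetical static results; a real service would probe its dependencies.
    checks = {"mongodb": True, "redis": True}

    def do_GET(self) -> None:
        if self.path != "/healthz":
            self.send_error(404)
            return
        status, body = health_payload(self.checks)
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(host: str = "127.0.0.1", port: int = 8080) -> None:
    """Start a blocking HTTP server that exposes the health endpoint."""
    HTTPServer((host, port), HealthHandler).serve_forever()


print(health_payload({"mongodb": True, "redis": False}))
```

Calling `serve()` starts the server. Returning 503 when any dependency check fails lets a Kubernetes readiness probe take the pod out of rotation until it recovers.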
- Authentication: OAuth2-based authentication
- Authorization: Role-based access control
- Code Execution: Sandboxed execution environment
- Data Isolation: Multi-tenant data isolation
- Audit Trails: Comprehensive audit logging
For questions and support:
- Check service-specific README files
- Review API documentation
- Contact the development team