Develop sreemanth #980

Open: wants to merge 2 commits into base: develop
3 changes: 3 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,3 @@
{
    "java.configuration.updateBuildConfiguration": "automatic"
}
115 changes: 115 additions & 0 deletions README.md
@@ -175,6 +175,64 @@ rates below are specified as *records/second*.
| High (167 Directives) | 426 | 127,946,398 | 82,677,845,324 | 106,367.27 |
| High (167 Directives) | 426 | 511,785,592 | 330,711,381,296 | 105,768.93 |

## Byte Size and Time Duration Parsers

The Wrangler library includes built-in support for parsing and aggregating byte sizes (e.g., "1.5MB", "2GB") and time durations (e.g., "500ms", "2.5h"), so columns holding such values can be totaled without manual unit conversion.

### Supported Units

#### Byte Size Units
- B (Bytes)
- KB (Kilobytes)
- MB (Megabytes)
- GB (Gigabytes)
- TB (Terabytes)
- PB (Petabytes)

#### Time Duration Units
- ns (Nanoseconds)
- us (Microseconds)
- ms (Milliseconds)
- s (Seconds)
- m (Minutes)
- h (Hours)
- d (Days)

### Using the Aggregate Stats Directive

The `aggregate-stats` directive totals byte sizes and time durations across rows. Syntax:

```
aggregate-stats :size_column :time_column total_size_column total_time_column [output_size_unit] [output_time_unit]
```

Parameters:
- `:size_column` - Column containing byte sizes (e.g., "1.5MB", "2GB")
- `:time_column` - Column containing time durations (e.g., "500ms", "2.5h")
- `total_size_column` - Name of the output column for total size
- `total_time_column` - Name of the output column for total time
- `output_size_unit` - (Optional) Unit for the output size (default: "MB")
- `output_time_unit` - (Optional) Unit for the output time (default: "s")

Example:
```
# Input data:
# | data_size | response_time |
# |-----------|---------------|
# | 1.5MB | 500ms |
# | 2.5MB | 750ms |
# | 1MB | 250ms |

# Directive:
aggregate-stats :data_size :response_time total_size total_time MB s

# Output:
# | total_size | total_time |
# |------------|------------|
# | 5.0 | 1.5 |
```

The directive automatically handles mixed units in the input data, converting everything to a common base unit (bytes for sizes, nanoseconds for times) before aggregating and then converting to the requested output units.
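
The conversion described above can be sketched in a few lines of plain Java. This is an illustration only, not the Wrangler implementation: the class name `UnitAggregationSketch` and its methods are hypothetical, and the sketch assumes binary multiples (1 KB = 1024 B), which this README does not specify.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; names and the 1024 multiplier are assumptions. */
final class UnitAggregationSketch {
  // Multipliers to the base unit: bytes for sizes, nanoseconds for durations.
  static final Map<String, Double> BYTES = new LinkedHashMap<>();
  static final Map<String, Double> NANOS = new LinkedHashMap<>();

  static {
    double factor = 1;
    for (String unit : List.of("B", "KB", "MB", "GB", "TB", "PB")) {
      BYTES.put(unit, factor);
      factor *= 1024; // assumption: binary multiples
    }
    NANOS.put("ns", 1.0);
    NANOS.put("us", 1e3);
    NANOS.put("ms", 1e6);
    NANOS.put("s", 1e9);
    NANOS.put("m", 60e9);
    NANOS.put("h", 3600e9);
    NANOS.put("d", 86400e9);
  }

  // A non-negative number (optionally fractional) followed by a unit suffix.
  private static final Pattern TOKEN =
      Pattern.compile("\\s*(\\d+(?:\\.\\d+)?)\\s*([A-Za-z]+)\\s*");

  /** Converts one value such as "2.5h" to the base unit of the given table. */
  static double toBase(String text, Map<String, Double> units) {
    Matcher m = TOKEN.matcher(text);
    if (!m.matches() || !units.containsKey(m.group(2))) {
      throw new IllegalArgumentException("Cannot parse value: " + text);
    }
    return Double.parseDouble(m.group(1)) * units.get(m.group(2));
  }

  /** Sums mixed-unit values in the base unit, then expresses the total in outputUnit. */
  static double total(List<String> values, Map<String, Double> units, String outputUnit) {
    Double divisor = units.get(outputUnit);
    if (divisor == null) {
      throw new IllegalArgumentException("Unknown unit: " + outputUnit);
    }
    double sum = 0;
    for (String value : values) {
      sum += toBase(value, units);
    }
    return sum / divisor;
  }

  public static void main(String[] args) {
    // Same rows as the example table above.
    System.out.println(total(List.of("1.5MB", "2.5MB", "1MB"), BYTES, "MB"));   // prints 5.0
    System.out.println(total(List.of("500ms", "750ms", "250ms"), NANOS, "s")); // prints 1.5
  }
}
```

Summing in the base unit first means rows such as "1GB" and "512MB" can be mixed freely; only the final division depends on the requested output unit.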

## Contact

@@ -216,3 +274,60 @@ Cask is a trademark of Cask Data, Inc. All rights reserved.

Apache, Apache HBase, and HBase are trademarks of The Apache Software Foundation. Used with
permission. No endorsement by The Apache Software Foundation is implied by the use of these marks.

# Wrangler

A data preparation tool for cleaning, transforming, and preparing data for analysis.

## Unit Parsers

### Byte Size Parser
The byte size parser supports the following units:
- B (bytes)
- KB (kilobytes)
- MB (megabytes)
- GB (gigabytes)
- TB (terabytes)
- PB (petabytes)

Example usage:
```
1B // 1 byte
1KB // 1 kilobyte
1MB // 1 megabyte
1GB // 1 gigabyte
1TB // 1 terabyte
1PB // 1 petabyte
```

### Time Duration Parser
The time duration parser supports the following units:
- ns (nanoseconds)
- us (microseconds)
- ms (milliseconds)
- s (seconds)
- m (minutes)
- h (hours)
- d (days)

Example usage:
```
1ns // 1 nanosecond
1us // 1 microsecond
1ms // 1 millisecond
1s // 1 second
1m // 1 minute
1h // 1 hour
1d // 1 day
```

### Usage in Directives
Both byte size and time duration values can be used in directives for data transformation and aggregation:

```
// Aggregate byte sizes and time durations, reporting totals in MB and seconds
aggregate-stats :column1 :column2 total_size total_time MB s
```
137 changes: 137 additions & 0 deletions clickhouse-flatfile-ingestion/README.md
@@ -0,0 +1,137 @@
# ClickHouse Flat File Ingestion Tool

A web-based application for bidirectional data transfer between a ClickHouse database and flat files.

## Features

- Bidirectional data ingestion between ClickHouse and Flat Files
- JWT-based authentication
- Schema discovery and validation
- Progress tracking for large data transfers
- Support for various file formats (CSV, JSON, etc.)
- Configurable data mapping
- Error handling and logging

## Technology Stack

### Backend
- Spring Boot
- Spring Security with JWT
- ClickHouse JDBC Driver
- Apache Commons CSV
- Jackson for JSON processing

### Frontend
- React
- Material-UI
- Axios
- React Router
- React Query

## Prerequisites

- Java 11 or higher (the version set by `java.version` in `backend/pom.xml`)
- Node.js 16 or higher
- ClickHouse server
- PostgreSQL (for user management)

## Environment Variables

Create a `.env` file in the backend directory with the following variables:

```properties
# Database Configuration
DB_URL=jdbc:postgresql://localhost:5432/ingestion_db
DB_USERNAME=your_db_username
DB_PASSWORD=your_db_password

# ClickHouse Configuration
CLICKHOUSE_HOST=your_clickhouse_host
CLICKHOUSE_PORT=8443
CLICKHOUSE_DATABASE=your_database
CLICKHOUSE_USER=your_username
CLICKHOUSE_PASSWORD=your_password

# JWT Configuration
JWT_SECRET=your_jwt_secret_key

# File Upload Configuration
UPLOAD_DIR=./uploads
```
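
As a rough illustration of how the `CLICKHOUSE_*` variables might be combined, the sketch below builds a JDBC URL from a map of settings. The class and method names are hypothetical and not taken from this project. The `jdbc:clickhouse://` prefix and the `ssl=true` option reflect my understanding of the clickhouse-jdbc driver and should be checked against the driver version in use. How the `.env` file reaches the process environment is left open here.

```java
import java.util.Map;

/** Illustrative sketch only; not code from this project. */
final class ClickHouseUrlSketch {

  /** Builds a JDBC URL from the CLICKHOUSE_* variables listed above. */
  static String jdbcUrl(Map<String, String> env) {
    String host = require(env, "CLICKHOUSE_HOST");
    String port = env.getOrDefault("CLICKHOUSE_PORT", "8443");
    String database = require(env, "CLICKHOUSE_DATABASE");
    // Assumption: port 8443 is the TLS endpoint, so ssl is enabled only for it.
    boolean ssl = port.equals("8443");
    return "jdbc:clickhouse://" + host + ":" + port + "/" + database
        + (ssl ? "?ssl=true" : "");
  }

  /** Fails fast with a readable message when a required variable is absent. */
  private static String require(Map<String, String> env, String key) {
    String value = env.get(key);
    if (value == null || value.isBlank()) {
      throw new IllegalStateException("Missing environment variable: " + key);
    }
    return value;
  }

  public static void main(String[] args) {
    // A fixed map stands in for System.getenv() so the example runs anywhere.
    Map<String, String> env = Map.of(
        "CLICKHOUSE_HOST", "localhost",
        "CLICKHOUSE_DATABASE", "default");
    System.out.println(jdbcUrl(env)); // prints jdbc:clickhouse://localhost:8443/default?ssl=true
  }
}
```

In a running service the same method could be called with `System.getenv()`; the user name and password would be passed separately to the driver rather than embedded in the URL.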

## Installation

1. Clone the repository
2. Set up environment variables
3. Build and run the backend:
   ```bash
   cd backend
   ./mvnw clean install
   ./mvnw spring-boot:run
   ```
4. Build and run the frontend:
   ```bash
   cd frontend
   npm install
   npm start
   ```

## Usage

1. Access the application at `http://localhost:3000`
2. Log in with your credentials
3. Select source (ClickHouse or Flat File)
4. Configure connection parameters
5. Select tables and columns
6. Preview data
7. Start ingestion process

## Security Considerations

- Sensitive values (database credentials, the JWT secret) are read from environment variables
- JWT tokens expire after 24 hours
- Passwords are hashed using BCrypt
- SSL/TLS encryption for database connections
- Input validation and sanitization
- Rate limiting on API endpoints
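
To make the 24-hour token lifetime concrete, here is a minimal sketch of an HS256 token whose `exp` claim is set 24 hours after issuance. The backend declares the jjwt library in its `pom.xml`; this sketch deliberately uses only JDK classes so that it runs on its own, and it is not the project's code. The class name, the claim set (`sub`, `iat`, `exp`), and the expiry check are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.time.Duration;
import java.time.Instant;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Illustrative sketch only; a real service would use a JWT library. */
final class JwtExpirySketch {
  static final Duration LIFETIME = Duration.ofHours(24);
  private static final Base64.Encoder B64 = Base64.getUrlEncoder().withoutPadding();

  /** Creates a compact HS256 token whose exp claim is 24 hours after issuedAt. */
  static String issue(String subject, Instant issuedAt, String secret) {
    long iat = issuedAt.getEpochSecond();
    long exp = issuedAt.plus(LIFETIME).getEpochSecond();
    String header = "{\"alg\":\"HS256\",\"typ\":\"JWT\"}";
    // Assumption: subject contains no characters that need JSON escaping.
    String payload = "{\"sub\":\"" + subject + "\",\"iat\":" + iat + ",\"exp\":" + exp + "}";
    String signingInput = encode(header) + "." + encode(payload);
    try {
      Mac mac = Mac.getInstance("HmacSHA256");
      mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
      byte[] signature = mac.doFinal(signingInput.getBytes(StandardCharsets.UTF_8));
      return signingInput + "." + B64.encodeToString(signature);
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException("HMAC-SHA256 unavailable or key invalid", e);
    }
  }

  /** A token counts as expired once the current time reaches its exp claim. */
  static boolean isExpired(long expEpochSeconds, Instant now) {
    return now.getEpochSecond() >= expEpochSeconds;
  }

  private static String encode(String json) {
    return B64.encodeToString(json.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    Instant issuedAt = Instant.now();
    String token = issue("alice", issuedAt, "your_jwt_secret_key");
    // Decode the middle segment to inspect the claims.
    String payload = new String(
        Base64.getUrlDecoder().decode(token.split("\\.")[1]), StandardCharsets.UTF_8);
    System.out.println(payload);
  }
}
```

The signature check on incoming tokens is omitted; the point of the sketch is only that expiry is a claim inside the token, fixed at issuance, which is why a refresh endpoint is needed for longer sessions.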

## API Documentation

### Authentication
- POST /api/auth/login - Login endpoint
- POST /api/auth/refresh - Refresh token endpoint

### Ingestion
- POST /api/ingestion/export - Export data from ClickHouse to file
- POST /api/ingestion/import - Import data from file to ClickHouse
- GET /api/ingestion/progress/{jobId} - Get ingestion progress
- GET /api/ingestion/schema - Get table schema
- GET /api/ingestion/preview - Get data preview
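
A client call to the progress endpoint might be assembled as follows, using the JDK's `java.net.http` API. This is a sketch under assumptions: the base URL `http://localhost:8080`, the `Bearer` authorization scheme, and the JSON response type are not stated in this README, and the class name is hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

/** Illustrative sketch only; header scheme and base URL are assumptions. */
final class ProgressRequestSketch {

  /** Builds (but does not send) a GET request for /api/ingestion/progress/{jobId}. */
  static HttpRequest progressRequest(String baseUrl, String jobId, String token) {
    // Assumption: jobId is URL-safe; otherwise it would need percent-encoding.
    return HttpRequest.newBuilder(URI.create(baseUrl + "/api/ingestion/progress/" + jobId))
        .header("Authorization", "Bearer " + token) // assumed auth scheme
        .header("Accept", "application/json")       // assumed response type
        .timeout(Duration.ofSeconds(10))
        .GET()
        .build();
  }

  public static void main(String[] args) {
    HttpRequest request = progressRequest("http://localhost:8080", "job-42", "token-from-login");
    // prints GET http://localhost:8080/api/ingestion/progress/job-42
    System.out.println(request.method() + " " + request.uri());
  }
}
```

Sending the request would use `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`; it is left out so the example runs without a server.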

## Error Handling

The application includes comprehensive error handling for:
- Invalid credentials
- Connection failures
- Schema mismatches
- File format errors
- Data validation errors
- Network timeouts

## Logging

- Application logs are stored in `logs/application.log`
- Log levels can be configured in `application.properties`
- Structured logging format for better analysis

## Contributing

1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request

## License

This project is licensed under the MIT License - see the LICENSE file for details.
100 changes: 100 additions & 0 deletions clickhouse-flatfile-ingestion/backend/pom.xml
@@ -0,0 +1,100 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.7.0</version>
    </parent>
    <groupId>com.wrangler</groupId>
    <artifactId>clickhouse-flatfile-ingestion</artifactId>
    <version>1.0-SNAPSHOT</version>
    <name>clickhouse-flatfile-ingestion</name>
    <description>Bidirectional ClickHouse &amp; Flat File Data Ingestion Tool</description>

    <properties>
        <java.version>11</java.version>
        <clickhouse-jdbc.version>0.3.2</clickhouse-jdbc.version>
        <jjwt.version>0.11.5</jjwt.version>
        <commons-csv.version>1.9.0</commons-csv.version>
        <lombok.version>1.18.24</lombok.version>
    </properties>

    <dependencies>
        <!-- Spring Boot -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-validation</artifactId>
        </dependency>

        <!-- ClickHouse -->
        <dependency>
            <groupId>com.clickhouse</groupId>
            <artifactId>clickhouse-jdbc</artifactId>
            <version>${clickhouse-jdbc.version}</version>
        </dependency>

        <!-- JWT -->
        <dependency>
            <groupId>io.jsonwebtoken</groupId>
            <artifactId>jjwt-api</artifactId>
            <version>${jjwt.version}</version>
        </dependency>
        <dependency>
            <groupId>io.jsonwebtoken</groupId>
            <artifactId>jjwt-impl</artifactId>
            <version>${jjwt.version}</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>io.jsonwebtoken</groupId>
            <artifactId>jjwt-jackson</artifactId>
            <version>${jjwt.version}</version>
            <scope>runtime</scope>
        </dependency>

        <!-- CSV Processing -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-csv</artifactId>
            <version>${commons-csv.version}</version>
        </dependency>

        <!-- Lombok -->
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>${lombok.version}</version>
            <scope>provided</scope>
        </dependency>

        <!-- Test -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <configuration>
                    <excludes>
                        <exclude>
                            <groupId>org.projectlombok</groupId>
                            <artifactId>lombok</artifactId>
                        </exclude>
                    </excludes>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>