24 changes: 24 additions & 0 deletions README.md
@@ -89,6 +89,7 @@ These directives are currently available:
| [Write JSON Object](wrangler-docs/directives/write-as-json-object.md) | Composes a JSON object based on the fields specified. |
| [Format as Currency](wrangler-docs/directives/format-as-currency.md) | Formats a number as currency as specified by locale. |
| **Transformations** | |
| [Aggregate Stats](wrangler-docs/directives/aggregate-stats.md) | Analyzes byte size and time duration values, generating statistics |
| [Changing Case](wrangler-docs/directives/changing-case.md) | Changes the case of column values |
| [Cut Character](wrangler-docs/directives/cut-character.md) | Selects parts of a string value |
| [Set Column](wrangler-docs/directives/set-column.md) | Sets the column value to the result of an expression execution |
@@ -175,6 +176,28 @@ rates below are specified as *records/second*.
| High (167 Directives) | 426 | 127,946,398 | 82,677,845,324 | 106,367.27 |
| High (167 Directives) | 426 | 511,785,592 | 330,711,381,296 | 105,768.93 |

## Byte Size and Time Duration Support

The Wrangler library provides support for parsing and aggregating byte size and time duration values. This feature allows you to work with human-readable size and duration values directly in your recipes.

### Byte Size Units

The following byte size units are supported:

- B: Bytes
- KB: Kilobytes (1024 bytes)
- MB: Megabytes (1024 \* 1024 bytes)
- GB: Gigabytes (1024 \* 1024 \* 1024 bytes)
- TB: Terabytes (1024 \* 1024 \* 1024 \* 1024 bytes)
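
As a rough illustration of the binary multipliers listed above, the following standalone sketch converts a size string to bytes. The class `ByteUnitsSketch` and its `toBytes` method are hypothetical names used only for this example; they are not part of the Wrangler API.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Wrangler: maps the units listed above to byte multipliers.
class ByteUnitsSketch {
  private static final Map<String, Long> MULTIPLIERS = new LinkedHashMap<>();

  static {
    // Two-letter units are checked first so that "KB" is not mistaken for a bare "B" suffix.
    MULTIPLIERS.put("KB", 1024L);
    MULTIPLIERS.put("MB", 1024L * 1024);
    MULTIPLIERS.put("GB", 1024L * 1024 * 1024);
    MULTIPLIERS.put("TB", 1024L * 1024 * 1024 * 1024);
    MULTIPLIERS.put("B", 1L);
  }

  static long toBytes(String text) {
    String s = text.trim().toUpperCase(Locale.ROOT);
    for (Map.Entry<String, Long> unit : MULTIPLIERS.entrySet()) {
      if (s.endsWith(unit.getKey())) {
        String number = s.substring(0, s.length() - unit.getKey().length());
        // Fractions of a byte are truncated by the cast.
        return (long) (Double.parseDouble(number) * unit.getValue());
      }
    }
    throw new IllegalArgumentException("Unsupported byte size: " + text);
  }

  public static void main(String[] args) {
    System.out.println(toBytes("10KB"));  // 10240
    System.out.println(toBytes("2.5MB")); // 2621440
    System.out.println(toBytes("1GB"));   // 1073741824
  }
}
```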

### Time Duration Units

The following time duration units are supported:

- ms: Milliseconds
- s: Seconds
- m: Minutes
- h: Hours
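
The conversion to a canonical unit can be sketched in the same way for durations. The example below assumes the unit spellings listed above (so `m` for minutes, not `min`), treats units as case-sensitive, and rounds to the nearest millisecond; `DurationUnitsSketch` and `toMillis` are hypothetical names, not Wrangler classes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Wrangler: converts "<number><unit>" to milliseconds.
class DurationUnitsSketch {
  // A decimal number followed by exactly one of the units listed above.
  private static final Pattern DURATION = Pattern.compile("(\\d+(?:\\.\\d+)?)(ms|s|m|h)");

  static long toMillis(String text) {
    Matcher matcher = DURATION.matcher(text.trim());
    if (!matcher.matches()) {
      throw new IllegalArgumentException("Unsupported duration: " + text);
    }
    double value = Double.parseDouble(matcher.group(1));
    switch (matcher.group(2)) {
      case "ms":
        return Math.round(value);
      case "s":
        return Math.round(value * 1_000);
      case "m":
        return Math.round(value * 60_000);
      default: // "h" is the only remaining alternative in the pattern
        return Math.round(value * 3_600_000);
    }
  }

  public static void main(String[] args) {
    System.out.println(toMillis("5ms"));  // 5
    System.out.println(toMillis("1.5s")); // 1500
    System.out.println(toMillis("2m"));   // 120000
    System.out.println(toMillis("1h"));   // 3600000
  }
}
```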

## Contact

@@ -214,5 +237,6 @@ and limitations under the License.

Cask is a trademark of Cask Data, Inc. All rights reserved.


Apache, Apache HBase, and HBase are trademarks of The Apache Software Foundation. Used with
permission. No endorsement by The Apache Software Foundation is implied by the use of these marks.
Binary file added README.pdf
Binary file not shown.
10 changes: 10 additions & 0 deletions pom.xml
@@ -176,6 +176,16 @@
    <testSourceDirectory>${testSourceLocation}</testSourceDirectory>
    <pluginManagement>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-surefire-plugin</artifactId>
          <version>2.22.2</version> <!-- use 2.22.0 or later -->
          <configuration>
            <forkCount>1</forkCount>
            <reuseForks>true</reuseForks>
            <argLine>--add-opens java.base/java.lang=ALL-UNNAMED</argLine>
          </configuration>
        </plugin>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-compiler-plugin</artifactId>
125 changes: 125 additions & 0 deletions prompts.text
@@ -0,0 +1,125 @@
Prompt 1
I am modifying an ANTLR grammar for a Java project. Help me write lexer and parser rules in Directives.g4 for
two new tokens:

BYTE_SIZE → matches values like "10KB", "2.5MB", "1GB".

TIME_DURATION → matches values like "5ms", "3.2s", "1min".

Also, include helpful fragments like BYTE_UNIT, TIME_UNIT. Finally, show how to update the value parser rule
(or create byteSizeArg, timeDurationArg if needed) so the new tokens are accepted as directive arguments.



Prompt 2: Create ByteSize and TimeDuration Token Classes
I am working on a Java project where tokens represent directive arguments. Help me create two new token classes:

ByteSize.java and TimeDuration.java

Each class should:

Extend io.cdap.wrangler.api.parser.Token

Parse strings like "10KB", "2.5MB", "1GB" (for ByteSize) and "500ms", "1.2s", "3min" (for TimeDuration)

Internally store the value in canonical units (bytes for ByteSize, milliseconds or nanoseconds for TimeDuration)

Provide getter methods like getBytes() and getMilliseconds()


Prompt 3: Update Token Types and Directive Argument Support
I am extending a token parsing framework in Java for a data transformation tool. Guide me to:

Add two new token types: BYTE_SIZE and TIME_DURATION in the token registry or enum used (if any).

Update the logic that defines valid argument types in directives,
so that BYTE_SIZE and TIME_DURATION can be accepted where appropriate.

Mention any necessary updates in registration/configuration files or classes if applicable.



Prompt 4: Add Visitor Methods for New Parser Rules
In my ANTLR-based Java parser for a directive language,
I’ve added two new parser rules: byteSizeArg and timeDurationArg. Help me:

Implement visitor methods visitByteSizeArg and visitTimeDurationArg in the appropriate visitor or parser class.

These methods should return instances of ByteSize and TimeDuration tokens respectively using ctx.getText().

Ensure these token instances are added to the TokenGroup for the directive being parsed.



Prompt 5: Implement New AggregateStats Directive
I’m creating a new directive class called AggregateStats in a Java-based data transformation engine. Guide me to:

Implement the Directive interface

Accept at least 4 arguments:

Source column (byte sizes)

Source column (time durations)

Target column for total size

Target column for total/average time

Optionally accept:

Aggregation type (total, avg)

Output unit (MB, GB, seconds, minutes)

In initialize, store the argument values

In execute, use ExecutorContext.getStore() to:

Accumulate byte size and time duration values (convert to canonical units)

In finalize, return a single Row with converted results (e.g., MB, seconds)


Prompt 6: Write Unit Tests for ByteSize and TimeDuration
Help me write JUnit tests for two Java classes: ByteSize and TimeDuration.
These classes parse strings like "10KB" and "500ms" respectively.

Test valid cases: "10KB", "1.5MB", "1GB" for ByteSize and "500ms", "2s", "1min" for TimeDuration.

Verify that getBytes() or getMilliseconds() return the correct canonical values.

Include a few invalid input tests and assert that they throw proper exceptions.




Prompt 7: Write Parser Tests for New Grammar
I’ve added BYTE_SIZE and TIME_DURATION tokens to an ANTLR grammar. Help me write parser tests in Java to:

Validate that inputs like "10KB", "1.5MB", "5ms", "3min" are accepted in directive recipes.

Use test classes like GrammarBasedParserTest.java or RecipeCompilerTest.java.

Also test invalid values (e.g., "10KBB", "1..5MB", "ms5") and ensure they are rejected.
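
The accept/reject cases above can be approximated outside the grammar with plain regular expressions. This is only a sketch of the token shapes, not the actual rules in Directives.g4; the unit alternatives are an assumption drawn from the examples in these prompts, and `TokenShapeSketch` is a hypothetical name.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the lexer rules: a number followed by a known unit.
class TokenShapeSketch {
  static final Pattern BYTE_SIZE = Pattern.compile("\\d+(\\.\\d+)?(B|KB|MB|GB|TB)");
  static final Pattern TIME_DURATION = Pattern.compile("\\d+(\\.\\d+)?(ms|s|min|h)");

  // Returns the name of the token shape the whole input matches, or "INVALID".
  static String classify(String text) {
    if (BYTE_SIZE.matcher(text).matches()) {
      return "BYTE_SIZE";
    }
    if (TIME_DURATION.matcher(text).matches()) {
      return "TIME_DURATION";
    }
    return "INVALID";
  }

  public static void main(String[] args) {
    for (String input : new String[] {"10KB", "1.5MB", "5ms", "3min", "10KBB", "1..5MB", "ms5"}) {
      System.out.println(input + " -> " + classify(input));
    }
  }
}
```

Because `matches()` requires the entire input to fit the pattern, trailing characters such as the extra `B` in "10KBB" cause a rejection rather than a partial match.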




Prompt 8: Write Integration Test for AggregateStats Directive
I’ve created an AggregateStats directive that aggregates byte size and time duration columns. Help me write an integration test using TestingRig to:

Create input data: List<Row> with columns like data_transfer_size and response_time using values like "1MB", "500KB", "2s", "500ms".

Define a recipe like:

    String[] recipe = new String[] {
      "aggregate-stats :data_transfer_size :response_time total_size_mb total_time_sec"
    };

Execute with TestingRig.execute(recipe, rows)

Assert that the resulting row contains correct aggregated values (in MB and seconds)

Use a delta tolerance (e.g., 0.001) for comparing float values
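
The arithmetic such a test would check can be sketched independently of the directive. The example below accumulates values that are already in canonical units (bytes and milliseconds) and converts only at the end; the pairing of sample values into two rows is an assumption, and `AggregateStatsSketch` is a hypothetical class, not the AggregateStats directive itself.

```java
// Hypothetical accumulator mirroring the "sum in canonical units, convert at the end" idea.
class AggregateStatsSketch {
  private static final double BYTES_PER_MB = 1024.0 * 1024.0;
  private static final double MILLIS_PER_SECOND = 1000.0;

  private long totalBytes;
  private long totalMillis;
  private int count;

  // Adds one row's values, already converted to bytes and milliseconds.
  void add(long bytes, long millis) {
    totalBytes += bytes;
    totalMillis += millis;
    count++;
  }

  double totalMegabytes() {
    return totalBytes / BYTES_PER_MB;
  }

  double totalSeconds() {
    return totalMillis / MILLIS_PER_SECOND;
  }

  double averageSeconds() {
    return count == 0 ? 0.0 : totalSeconds() / count;
  }

  public static void main(String[] args) {
    AggregateStatsSketch stats = new AggregateStatsSketch();
    stats.add(1024L * 1024, 2_000L); // assumed row 1: "1MB", "2s"
    stats.add(500L * 1024, 500L);    // assumed row 2: "500KB", "500ms"

    double delta = 0.001;
    // 1MB + 500KB = 1,560,576 bytes = 1.48828125 MB; 2s + 500ms = 2.5 s
    System.out.println(Math.abs(stats.totalMegabytes() - 1.48828125) < delta);
    System.out.println(Math.abs(stats.totalSeconds() - 2.5) < delta);
    System.out.println(Math.abs(stats.averageSeconds() - 1.25) < delta);
  }
}
```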
31 changes: 30 additions & 1 deletion wrangler-api/pom.xml
@@ -39,6 +39,35 @@
      <version>${cdap.version}</version>
      <scope>provided</scope>
    </dependency>

  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.rat</groupId>
        <artifactId>apache-rat-plugin</artifactId>
        <configuration>
          <excludesFile>rat-excludes.txt</excludesFile>
          <numUnapprovedLicenses>2</numUnapprovedLicenses>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-checkstyle-plugin</artifactId>
        <version>3.0.0</version>
        <configuration>
          <configLocation>checkstyle.xml</configLocation>
          <suppressionsLocation>suppressions.xml</suppressionsLocation>
        </configuration>
        <executions>
          <execution>
            <id>validate</id>
            <phase>validate</phase>
            <goals>
              <goal>check</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
156 changes: 156 additions & 0 deletions wrangler-api/src/main/java/io/cdap/wrangler/api/parser/ByteSize.java
@@ -0,0 +1,156 @@
/*
 * Copyright © 2017-2019 Cask Data, Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */

package io.cdap.wrangler.api.parser;

import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import io.cdap.wrangler.api.annotations.PublicEvolving;

/**
 * Represents a ByteSize token, capable of parsing strings like "10KB", "1.5MB",
 * and converting them into bytes.
 */
@PublicEvolving
public class ByteSize implements Token {

  // Multipliers for each unit
  private static final double KILOBYTE = 1024.0;
  private static final double MEGABYTE = KILOBYTE * 1024.0;
  private static final double GIGABYTE = MEGABYTE * 1024.0;
  private static final double TERABYTE = GIGABYTE * 1024.0;

  // Parsed byte value stored as long
  private final long bytesValue;

  /**
   * Constructs a ByteSize token by parsing the given size string.
   *
   * @param sizeString The string to parse (e.g., "10KB", "1.5MB").
   * @throws IllegalArgumentException If the string format is invalid.
   */
  public ByteSize(String sizeString) {
    this.bytesValue = parseSize(sizeString);
  }

  /**
   * Parses a size string and converts it into bytes.
   *
   * @param sizeString The input string representing a byte size.
   * @return The size in bytes.
   */
  private long parseSize(String sizeString) {
    if (sizeString == null || sizeString.trim().isEmpty()) {
      throw new IllegalArgumentException("Size string must not be null or empty.");
    }

    sizeString = sizeString.trim().toUpperCase();
    String numericPart;
    double multiplier;

    try {
      if (sizeString.endsWith("KB")) {
        numericPart = sizeString.substring(0, sizeString.length() - 2);
        multiplier = KILOBYTE;
      } else if (sizeString.endsWith("MB")) {
        numericPart = sizeString.substring(0, sizeString.length() - 2);
        multiplier = MEGABYTE;
      } else if (sizeString.endsWith("GB")) {
        numericPart = sizeString.substring(0, sizeString.length() - 2);
        multiplier = GIGABYTE;
      } else if (sizeString.endsWith("TB")) {
        numericPart = sizeString.substring(0, sizeString.length() - 2);
        multiplier = TERABYTE;
      } else if (sizeString.endsWith("B")) {
        numericPart = sizeString.substring(0, sizeString.length() - 1);
        multiplier = 1.0;
      } else {
        throw new IllegalArgumentException("Invalid byte size format or unsupported unit in string: " + sizeString);
      }

      if (numericPart.isEmpty()) {
        throw new IllegalArgumentException("Missing numeric value in size string: " + sizeString);
      }

      double parsedValue = Double.parseDouble(numericPart);
      if (parsedValue < 0) {
        throw new IllegalArgumentException("Size value cannot be negative: " + sizeString);
      }

      return (long) (parsedValue * multiplier); // Truncate to long
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException("Invalid numeric value in size string: " + sizeString, e);
    }
  }

  /**
   * @return Size in bytes.
   */
  public long getBytes() {
    return bytesValue;
  }

  /**
   * @return Size in kilobytes.
   */
  public double getKiloBytes() {
    return bytesValue / KILOBYTE;
  }

  /**
   * @return Size in megabytes.
   */
  public double getMegaBytes() {
    return bytesValue / MEGABYTE;
  }

  /**
   * @return Size in gigabytes.
   */
  public double getGigaBytes() {
    return bytesValue / GIGABYTE;
  }

  /**
   * @return Size in terabytes.
   */
  public double getTeraBytes() {
    return bytesValue / TERABYTE;
  }

  @Override
  public Object value() {
    return bytesValue;
  }

  @Override
  public TokenType type() {
    return TokenType.BYTE_SIZE;
  }

  @Override
  public JsonElement toJson() {
    JsonObject object = new JsonObject();
    object.addProperty("type", TokenType.BYTE_SIZE.name());
    object.addProperty("value", bytesValue);
    return object;
  }

  @Override
  public String toString() {
    return bytesValue + "B";
  }
}
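
One detail of `parseSize` worth illustrating is the `(long)` cast: it truncates fractional bytes and, when the product exceeds `Long.MAX_VALUE`, saturates silently instead of failing. The sketch below contrasts that behavior with a possible `BigDecimal`-based alternative that raises an error on overflow. This alternative is not part of the change above, and `ExactByteMathSketch` and its methods are hypothetical names.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical comparison of two ways to turn "number x multiplier" into a long.
class ExactByteMathSketch {
  // Mirrors the cast used in parseSize: truncates, and saturates at Long.MAX_VALUE.
  static long castToBytes(double value, double multiplier) {
    return (long) (value * multiplier);
  }

  // Alternative: exact decimal arithmetic that throws ArithmeticException on overflow.
  static long exactBytes(String number, long multiplier) {
    return new BigDecimal(number)
        .multiply(BigDecimal.valueOf(multiplier))
        .setScale(0, RoundingMode.DOWN) // drop fractional bytes, as the cast does
        .longValueExact();
  }

  public static void main(String[] args) {
    long terabyte = 1024L * 1024 * 1024 * 1024;
    System.out.println(exactBytes("1.5", 1024L));                              // 1536
    System.out.println(castToBytes(9_999_999_999.0, terabyte) == Long.MAX_VALUE); // true
    try {
      exactBytes("9999999999", terabyte);
    } catch (ArithmeticException e) {
      System.out.println("overflow detected");
    }
  }
}
```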