Merged
3 changes: 3 additions & 0 deletions README.md
@@ -62,13 +62,16 @@ Project is divided into two modules:

### bigfiles

- [How to Run](bigfiles/README.md#how-to-run)
- a bigfile is a file that does not fit in RAM
- module for comparing big files
- written in Scala
- more about the bigfiles module can be found in the [bigfiles README](bigfiles/README.md)


### smallfiles

- [How to Run](smallfiles/README.md#how-to-run)
- a smallfile is a file that fits in RAM
- module for comparing small files
- written in Python
15 changes: 14 additions & 1 deletion bigfiles/README.md
@@ -1,6 +1,6 @@
# Scala CPS-Dataset-Comparison

This is scala implementation of the project. It is used for comparing big files.
This is the Scala implementation of the project. It is used for comparing big files (files that do not fit in RAM).

- [How to run](#how-to-run)
- [Requirements](#requirements)
@@ -15,6 +15,7 @@ Then run:

```bash
spark-submit target/scala-2.12/dataset-comparison-assembly-1.0.jar -o <output-path> --inputA <A-file-path> --inputB <B-file-path>

```
### Parameters:
| Parameter | Description | Required |
@@ -26,6 +27,18 @@ spark-submit target/scala-2.12/dataset-comparison-assembly-1.0.jar -o <output-pa
|`-d` or `--diff` [Row] |difference compute type| **optional**|
|`-e` or `--exclude`|columns to exclude|**optional**|

Example:
```bash
spark-submit --class africa.absa.cps.DatasetComparison \
--conf "spark.driver.extraJavaOptions=-Dconfig.file=/../bigfiles/src/main/resources/application.conf" \
target/scala-2.11/dataset-comparison-assembly-0.1.0.jar \
-o "/test_files/output_names$(date '+%Y-%m-%d_%H%M%S')" \
--inputA /test_files/namesA.parquet \
--inputB /test_files/namesB.parquet \
-d Row

```
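
The `-o` value in the example above uses shell command substitution so that each run writes to a fresh, timestamped directory. A minimal sketch of that expansion on its own is shown here; `make_output_path` is a hypothetical helper introduced only for illustration and is not part of the project.

```bash
# Hypothetical helper (not part of the project): append a timestamp to a
# path prefix, using the same date format as the example above.
make_output_path() {
  prefix="$1"
  printf '%s%s\n' "$prefix" "$(date '+%Y-%m-%d_%H%M%S')"
}

# Build the output path once so the same value can be reused after the run.
output_path="$(make_output_path /test_files/output_names)"
echo "$output_path"
```

Storing the path in a variable, rather than inlining `$(date ...)` in the `spark-submit` call, makes it easier to locate the results directory after the job finishes.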

### Run with specific config

```bash
4 changes: 3 additions & 1 deletion smallfiles/README.md
@@ -1,6 +1,8 @@
# Python CPS-Dataset-Comparison

This is python implementation of the project. It is used for comparing small files.
> This module is not yet implemented.

This is the Python implementation of the project. It is used for comparing small files (files that fit in RAM).

- [Create and run environment](#create-and-run-environment)
- [Run main](#run-main)