[SYSTEMDS-3540] Check columns for One-Hot-Encoding before compression #2054
Open
smyomous wants to merge 16 commits into apache:main from smyomous:branch_backup
Commits:
d621272 first commit (smyomous)
a4d5656 Merge pull request #1 from smyomous/youssef_first_branch (smyomous)
1395d7e Add more logic (smyomous)
e380bcd Merge pull request #2 from smyomous/second_branch (smyomous)
78e3a1f Fixed empty cocoders and handle transpose in compressOHE (smyomous)
155fa34 remove flag (smyomous)
1e04d95 dmlConfig, transpose (smyomous)
2330b6b fix pom.xml (smyomous)
68d4a44 Merge branch 'apache:main' into branch_backup (smyomous)
2645ca7 Use nnzCols for sample and add experiments (smyomous)
0446263 Merge branch 'apache:main' into branch_backup (smyomous)
51de91f Implementation fixes, formatting changes (smyomous)
1a701a0 Merge branch 'branch_backup' of https://github.com/smyomous/systemds … (smyomous)
bcef43f Add documentation, parsing script, remove timing in CoCoderFactory (smyomous)
d2448e0 Added licenses to new files (smyomous)
cb7534b Minor cleanups (smyomous)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
#------------------------------------------------------------- | ||
# | ||
# Licensed to the Apache Software Foundation (ASF) under one | ||
# or more contributor license agreements. See the NOTICE file | ||
# distributed with this work for additional information | ||
# regarding copyright ownership. The ASF licenses this file | ||
# to you under the Apache License, Version 2.0 (the | ||
# "License"); you may not use this file except in compliance | ||
# with the License. You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, | ||
# software distributed under the License is distributed on an | ||
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
# KIND, either express or implied. See the License for the | ||
# specific language governing permissions and limitations | ||
# under the License. | ||
# | ||
#------------------------------------------------------------- | ||
|
||
log4j.rootLogger=ERROR, console | ||
|
||
log4j.logger.org.apache.sysds=INFO | ||
log4j.logger.org.apache.sysds.runtime.compress=DEBUG | ||
log4j.logger.org.apache.spark=ERROR | ||
log4j.logger.org.apache.spark.SparkContext=OFF | ||
log4j.logger.org.apache.hadoop=ERROR | ||
|
||
log4j.appender.console=org.apache.log4j.ConsoleAppender | ||
log4j.appender.console.target=System.err | ||
log4j.appender.console.layout=org.apache.log4j.PatternLayout | ||
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n | ||
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,151 @@ | ||
<!-- | ||
{% comment %} | ||
Licensed to the Apache Software Foundation (ASF) under one or more | ||
contributor license agreements. See the NOTICE file distributed with | ||
this work for additional information regarding copyright ownership. | ||
The ASF licenses this file to you under the Apache License, Version 2.0 | ||
(the "License"); you may not use this file except in compliance with | ||
the License. You may obtain a copy of the License at | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, software | ||
distributed under the License is distributed on an "AS IS" BASIS, | ||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
{% end comment %} | ||
--> | ||
|
||
# Tests for Checking One-Hot Encoding before Compression | ||
|
||
To run all tests for One Hot Encoding Checks: | ||
* install SystemDS, | ||
* make sure that the paths in `SYSTEMDS_ROOT`, `JAVA_HOME`, `HADOOP_HOME`, and `LOG4JPROP` are set correctly, | ||
|
||
* run `experiments.sh`. | ||
|
||
Alternatively, to run the `experiment.dml` script directly with OHE checks enabled, use this command: | ||
|
||
`$SYSTEMDS_ROOT/bin/systemds $SYSTEMDS_ROOT/target/SystemDS.jar experiment.dml --config ohe.xml` | ||
|
||
Note: You can use `-nvargs` to set the variables `rows`, `cols`, `dummy`, `distinct`, and `repeats` (how many times the generated matrix is transform-encoded and compressed). | ||
|
||
`dummy` is the array of column indexes to one-hot encode; for example, `dummy="[1]"` one-hot encodes the first column. | ||
|
||
To collect the metrics from the logs for easier comparison, run `parse_logs.py`; it creates an Excel file called `combined_metrics.xlsx` in this directory. | ||
|
||
--- | ||
# Documentation of Changes to codebase for Implementing OHE Checks | ||
|
||
## Flag to enable/disable OHE checks (Disabled by default) | ||
- Added ``COMPRESSED_ONEHOTDETECT = "sysds.compressed.onehotdetect"`` to ``DMLConfig`` and adjusted the relevant methods. | ||
- Added the attribute ``public final boolean oneHotDetect`` to ``CompressionSettings`` and adjusted the affected methods. | ||
- Adjusted ``CompressionSettingsBuilder`` to enable the checks when ``COMPRESSED_ONEHOTDETECT`` is set to true. | ||
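
As a rough illustration of the opt-in pattern described above, the following sketch threads a default-off boolean flag from a configuration into an immutable settings object. The classes `OheFlagSketch`, `Settings`, and `Builder` are hypothetical stand-ins and use `java.util.Properties` for simplicity; they are not the SystemDS `DMLConfig`, `CompressionSettings`, or `CompressionSettingsBuilder` classes.

```java
import java.util.Properties;

public class OheFlagSketch {
    // Config key as named in the text above.
    static final String COMPRESSED_ONEHOTDETECT = "sysds.compressed.onehotdetect";

    // Hypothetical stand-in for the settings object: the flag is immutable once built.
    static final class Settings {
        public final boolean oneHotDetect;

        Settings(boolean oneHotDetect) {
            this.oneHotDetect = oneHotDetect;
        }
    }

    // Hypothetical stand-in for the settings builder.
    static final class Builder {
        private boolean oneHotDetect = false; // checks are disabled by default

        Builder fromConfig(Properties conf) {
            // Boolean.parseBoolean returns false for null and for anything other than "true".
            oneHotDetect = Boolean.parseBoolean(conf.getProperty(COMPRESSED_ONEHOTDETECT));
            return this;
        }

        Settings create() {
            return new Settings(oneHotDetect);
        }
    }
}
```

With an empty configuration the flag stays `false`; setting the key to `true` (as `ohe.xml` does for the real config) turns the checks on.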
|
||
## Changes in `CoCoderFactory` | ||
|
||
### 1. Introduction of OHE Detection | ||
|
||
**Condition Addition:** | ||
- Added a condition to check for `cs.oneHotDetect` along with the existing condition `!containsEmptyConstOrIncompressable` in the `findCoCodesByPartitioning` method. This ensures that the process considers OHE detection only if it is enabled in the compression settings. | ||
- Original code only checked for `containsEmptyConstOrIncompressable` and proceeded to cocode all columns if false. The updated code includes an additional check for `cs.oneHotDetect`. | ||
|
||
### 2. New Data Structures for OHE Handling | ||
|
||
**New Lists:** Introduced two new lists to manage the OHE detection process: | ||
- `currentCandidates`: To store the current candidate columns that might form an OHE group. | ||
- `oheGroups`: To store lists of columns that have been validated as OHE groups. | ||
|
||
### 3. Filtering Logic Enhancements | ||
|
||
**Column Filtering:** Enhanced the loop that iterates over columns to identify OHE candidates: | ||
- Columns that are empty, constant, or incompressible are filtered into respective lists. | ||
- All other columns are added to `currentCandidates` if the `isCandidate` function deems them candidates. | ||
|
||
### 4. Addition of `isHotEncoded` Function | ||
|
||
**Function Creation:** Created a new `isHotEncoded` function to evaluate if the accumulated columns form a valid OHE group. | ||
- **Parameters:** Takes a list of column groups (`colGroups`), a boolean flag (`isSample`), an array of non-zero counts (`nnzCols`), and the number of rows (`numRows`). | ||
- **Return Type:** Returns a `String` indicating the status of the current candidates: | ||
  - `"POTENTIAL_OHE"`: When the current candidates could still form an OHE group. | ||
  - `"NOT_OHE"`: When the current candidates cannot form an OHE group. | ||
  - `"VALID_OHE"`: When the current candidates form a valid OHE group. | ||
- **Logic:** The function calculates the total number of distinct values and offsets, and checks if they meet the criteria for forming an OHE group. | ||
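
The three-state result can be illustrated with a small sketch. The criterion used here is an assumption, not the rule implemented in the PR: it treats the candidates as binary columns and compares the sum of their non-zero counts with the number of rows. The class `OheStatusSketch` and the method `classify` are hypothetical names.

```java
public class OheStatusSketch {
    /**
     * Classifies accumulated candidate columns from their non-zero counts.
     * Assumption: every candidate column is binary, so in a one-hot group the
     * non-zero counts must add up to exactly the number of rows.
     */
    static String classify(int[] nnzCandidates, int numRows) {
        long total = 0;
        for (int nnz : nnzCandidates)
            total += nnz;
        if (total > numRows)
            return "NOT_OHE"; // some row must hold more than one non-zero
        if (total == numRows)
            // A single column covering every row is constant, not one-hot.
            return nnzCandidates.length > 1 ? "VALID_OHE" : "NOT_OHE";
        return "POTENTIAL_OHE"; // rows still uncovered, more columns may follow
    }
}
```

The count comparison is only a necessary condition: it does not prove that each row holds exactly one `1`, which is why a row-level validation with a fallback is still needed at compression time.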
|
||
### 5. Enhanced Group Handling | ||
|
||
**Candidate Processing:** Within the loop, after adding a column to `currentCandidates`: | ||
- Calls `isHotEncoded` to check the status of the candidates. | ||
- If `isHotEncoded` returns `"NOT_OHE"`, moves the candidates to regular groups and clears the candidates list. | ||
- If `isHotEncoded` returns `"VALID_OHE"`, moves the candidates to `oheGroups` and clears the candidates list. | ||
- If `isHotEncoded` returns `"POTENTIAL_OHE"`, continues accumulating candidates. | ||
|
||
### 6. Final Candidate Check | ||
|
||
**Post-loop Check:** After the loop, checks any remaining `currentCandidates`: | ||
- If they form a valid OHE group, adds them to `oheGroups`. | ||
- Otherwise, adds them to regular groups. | ||
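
The accumulate-and-classify loop from sections 5 and 6 can be sketched as follows. This is a simplified illustration under stated assumptions: columns are plain integer indexes, the filtering of empty, constant, and incompressible columns is omitted, and the status check is passed in as a function. `OheGroupingSketch`, `Result`, and `partition` are hypothetical names, not SystemDS code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class OheGroupingSketch {
    /** Holds validated OHE groups and the columns that stay in regular groups. */
    static final class Result {
        final List<List<Integer>> oheGroups = new ArrayList<>();
        final List<Integer> regular = new ArrayList<>();
    }

    static Result partition(int numCols, Function<List<Integer>, String> isHotEncoded) {
        Result r = new Result();
        List<Integer> currentCandidates = new ArrayList<>();
        for (int c = 0; c < numCols; c++) {
            currentCandidates.add(c);
            String status = isHotEncoded.apply(currentCandidates);
            if (status.equals("NOT_OHE")) {
                // Candidates cannot form an OHE group: hand them to the regular groups.
                r.regular.addAll(currentCandidates);
                currentCandidates.clear();
            }
            else if (status.equals("VALID_OHE")) {
                // Copy the list before clearing, otherwise the stored group is emptied too.
                r.oheGroups.add(new ArrayList<>(currentCandidates));
                currentCandidates.clear();
            }
            // "POTENTIAL_OHE": keep accumulating.
        }
        // Post-loop check for candidates that were still pending.
        if (!currentCandidates.isEmpty()) {
            if (isHotEncoded.apply(currentCandidates).equals("VALID_OHE"))
                r.oheGroups.add(new ArrayList<>(currentCandidates));
            else
                r.regular.addAll(currentCandidates);
        }
        return r;
    }
}
```

Passing the check as a `Function` keeps the loop independent of how the status is computed, so the same loop works with an exact or a sample-based check.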
|
||
### 7. Overwrite and CoCode Groups | ||
|
||
- **Overwrite Groups:** Updates `colInfos.compressionInfo` with the processed `groups`. | ||
- **OHE Group Integration:** Combines indexes for validated OHE groups and adds them to the final `groups`. | ||
|
||
## One Hot Encoded Columns Compression in `ColGroupFactory` | ||
|
||
### Description | ||
|
||
The `compressOHE` function is designed to compress columns that are one-hot encoded (OHE). It validates and processes the input data to ensure it meets the criteria for one-hot encoding, and if so, it compresses the data accordingly. If the data does not meet the OHE criteria, it falls back to a direct compression method (`directCompressDDC`). | ||
|
||
### Implementation Details | ||
|
||
1. **Validation of `numVals`**: | ||
   - Ensures the number of distinct values (`numVals`) in the column group is greater than 0. | ||
   - Throws a `DMLCompressionException` if `numVals` is less than or equal to 0. | ||
|
||
2. **Handling Transposed Matrix**: | ||
   - If the matrix is transposed (`cs.transposed` is `true`): | ||
     - Creates a `MapToFactory` data structure with an additional unique value. | ||
     - Iterates through the sparse block of the matrix, checking for non-one values or multiple ones in the same row. | ||
     - If a column index in the sparse block is empty, or if non-one values or multiple ones are found, it falls back to `directCompressDDC`. | ||
|
||
3. **Handling Non-Transposed Matrix**: | ||
   - If the matrix is not transposed (`cs.transposed` is `false`): | ||
     - Creates a `MapToFactory` data structure. | ||
     - Iterates through each row of the matrix: | ||
       - Checks for the presence of exactly one '1' in the columns specified by `colIndexes`. | ||
       - If multiple ones are found in the same row, or if no '1' is found in a sample row, it falls back to `directCompressDDC`. | ||
|
||
4. **Return Value**: | ||
   - If the data meets the OHE criteria, returns a `ColGroupDDC` created with the column indexes, an `IdentityDictionary`, and the data. | ||
   - If the data does not meet the OHE criteria, returns the result of `directCompressDDC`. | ||
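
The row-wise validation for the non-transposed case can be sketched on a plain dense array. This is a minimal illustration, assuming a `double[][]` input and an `int[]` mapping in place of the SystemDS matrix block and mapping classes; `OheRowScanSketch` and `mapOneHot` are hypothetical names, and a `null` return stands for the fallback to `directCompressDDC`.

```java
public class OheRowScanSketch {
    /**
     * Returns, for each row, the position within colIndexes of its single 1,
     * or null if any row violates the one-hot property (the caller would
     * then fall back to a regular compression path).
     */
    static int[] mapOneHot(double[][] data, int[] colIndexes) {
        int[] map = new int[data.length];
        for (int r = 0; r < data.length; r++) {
            int hot = -1;
            for (int j = 0; j < colIndexes.length; j++) {
                double v = data[r][colIndexes[j]];
                if (v == 0)
                    continue;
                if (v != 1 || hot >= 0)
                    return null; // non-one value, or a second 1 in the same row
                hot = j;
            }
            if (hot < 0)
                return null; // no 1 found in this row
            map[r] = hot;
        }
        return map;
    }
}
```

Because each valid row is fully described by the position of its `1`, the mapping together with an identity dictionary is enough to reconstruct the group, which is the idea behind returning a `ColGroupDDC` with an `IdentityDictionary`.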
|
||
## Add method in `ColGroupSizes` | ||
Added method ``estimateInMemorySizeOHE(int nrColumns, boolean contiguousColumns, int nrRows)`` | ||
|
||
## Add method in `AComEst` | ||
Added a getter method `getNnzCols` | ||
|
||
## Edit `distinctCountScale` method in `ComEstSample` | ||
|
||
```java | ||
if(freq == null || freq.length == 0) | ||
return numOffs+1; | ||
``` | ||
And added condition: | ||
```java | ||
if(sampleFacts.numRows>sampleFacts.numOffs) | ||
est += 1; | ||
``` | ||
<span style="color:red">Warning: This change will cause some tests to fail</span>. | ||
|
||
|
||
## Edit constructor in `CompressedSizeInfoColGroup` | ||
Added a case for OHE to the switch statement. | ||
|
||
## Added attribute in `CompressionStatistics` | ||
Added the attribute ``public double sparsity;`` (sparsity of the input matrix) so that it can be logged in ``CompressedMatrixBlockFactory``. | ||
|
||
## Fix Bug in `extractFacts` method in `SparseEncoding` | ||
The number of distinct values returned was wrong. | ||
Fix: in the return statements, changed `map.getUnique()` to `getUnique()`. | ||
|
||
## Fix Bug in `outputMatrixPostProcessing` method in `MultiColumnEncoder` | ||
Instead of just recomputing nonzeroes in the else block, added `output.examSparsity(k);` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,47 @@ | ||
#------------------------------------------------------------- | ||
# | ||
# Licensed to the Apache Software Foundation (ASF) under one | ||
# or more contributor license agreements. See the NOTICE file | ||
# distributed with this work for additional information | ||
# regarding copyright ownership. The ASF licenses this file | ||
# to you under the Apache License, Version 2.0 (the | ||
# "License"); you may not use this file except in compliance | ||
# with the License. You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, | ||
# software distributed under the License is distributed on an | ||
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
# KIND, either express or implied. See the License for the | ||
# specific language governing permissions and limitations | ||
# under the License. | ||
# | ||
#------------------------------------------------------------- | ||
|
||
## This script generates a random matrix, one-hot encodes the selected columns, and then compresses the result. | ||
|
||
# Set default values | ||
default_rows = 1000 | ||
default_cols = 10 | ||
default_dummy = "[1]" | ||
default_repeats = 1 | ||
default_num_distinct = 10 | ||
|
||
#nvargs | ||
rows = ifdef($rows, default_rows) | ||
cols = ifdef($cols, default_cols) | ||
dummy = ifdef($dummy, default_dummy) | ||
repeats = ifdef($repeats, default_repeats) | ||
num_distinct = ifdef($distinct, default_num_distinct) | ||
|
||
# Generate random matrix and apply transformations | ||
x = rand(rows=rows, cols=cols, min=0, max=num_distinct) | ||
x = floor(x) | ||
Fall = as.frame(x) | ||
jspec = "{ids: true, dummycode: " + dummy + "}"; | ||
for(i in 1:repeats){ | ||
  [T,M] = transformencode(target=Fall, spec=jspec) | ||
  xc = compress(T) | ||
} | ||
print(toString(xc)) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,61 @@ | ||
#!/usr/bin/env bash | ||
#------------------------------------------------------------- | ||
# | ||
# Licensed to the Apache Software Foundation (ASF) under one | ||
# or more contributor license agreements. See the NOTICE file | ||
# distributed with this work for additional information | ||
# regarding copyright ownership. The ASF licenses this file | ||
# to you under the Apache License, Version 2.0 (the | ||
# "License"); you may not use this file except in compliance | ||
# with the License. You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, | ||
# software distributed under the License is distributed on an | ||
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
# KIND, either express or implied. See the License for the | ||
# specific language governing permissions and limitations | ||
# under the License. | ||
# | ||
#------------------------------------------------------------- | ||
|
||
# Create the log directories (no error if they already exist) | ||
mkdir -p BaselineLogs OHELogs | ||
|
||
# Arguments: rows, cols, dummy (columns to encode), distinct, experiment id | ||
run_base() { | ||
  "$SYSTEMDS_ROOT/bin/systemds" "$SYSTEMDS_ROOT/target/SystemDS.jar" experiment.dml \ | ||
    --seed 42 --debug -nvargs rows="$1" cols="$2" dummy="$3" distinct="$4" \ | ||
    > "BaselineLogs/${5}_${1}_rows_${2}_cols_${3}_encoded_base.txt" 2>&1 | ||
} | ||
|
||
# Same as run_base, but with OHE detection enabled through ohe.xml | ||
run_ohe() { | ||
  "$SYSTEMDS_ROOT/bin/systemds" "$SYSTEMDS_ROOT/target/SystemDS.jar" experiment.dml \ | ||
    --seed 42 --debug --config ohe.xml -nvargs rows="$1" cols="$2" dummy="$3" distinct="$4" \ | ||
    > "OHELogs/${5}_${1}_rows_${2}_cols_${3}_encoded_ohe.txt" 2>&1 | ||
} | ||
|
||
# Run same experiments but checking One-Hot-Encoded columns first | ||
run_ohe 1000 1 "[1]" 10 1 | ||
run_ohe 1000 5 "[2]" 10 2 | ||
run_ohe 1000 5 "[1,2]" 10 3 | ||
run_ohe 1000 5 "[1,2,3]" 10 4 | ||
run_ohe 1000 5 "[1,2,3,4,5]" 10 5 | ||
run_ohe 1000 10 "[1,3,5]" 10 6 | ||
run_ohe 1000 10 "[1,2,5,6]" 10 7 | ||
run_ohe 100000 1 "[1]" 100 8 | ||
run_ohe 100000 5 "[1,2]" 100 9 | ||
run_ohe 100000 5 "[1,2,3]" 100 10 | ||
run_ohe 100000 100 "[1,3,50,60,70,80]" 100 11 | ||
run_ohe 100000 100 "[1,2,24,25,50,51]" 100 12 | ||
|
||
# Run baseline experiments | ||
run_base 1000 1 "[1]" 10 1 | ||
run_base 1000 5 "[2]" 10 2 | ||
run_base 1000 5 "[1,2]" 10 3 | ||
run_base 1000 5 "[1,2,3]" 10 4 | ||
run_base 1000 5 "[1,2,3,4,5]" 10 5 | ||
run_base 1000 10 "[1,3,5]" 10 6 | ||
run_base 1000 10 "[1,2,5,6]" 10 7 | ||
run_base 100000 1 "[1]" 100 8 | ||
run_base 100000 5 "[1,2]" 100 9 | ||
run_base 100000 5 "[1,2,3]" 100 10 | ||
run_base 100000 100 "[1,3,50,60,70,80]" 100 11 | ||
run_base 100000 100 "[1,2,24,25,50,51]" 100 12 | ||
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
<!-- | ||
* Licensed to the Apache Software Foundation (ASF) under one | ||
* or more contributor license agreements. See the NOTICE file | ||
* distributed with this work for additional information | ||
* regarding copyright ownership. The ASF licenses this file | ||
* to you under the Apache License, Version 2.0 (the | ||
* "License"); you may not use this file except in compliance | ||
* with the License. You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, | ||
* software distributed under the License is distributed on an | ||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
* KIND, either express or implied. See the License for the | ||
* specific language governing permissions and limitations | ||
* under the License. | ||
--> | ||
|
||
<root> | ||
<sysds.compressed.onehotdetect>true</sysds.compressed.onehotdetect> | ||
</root> |