Commit a97471f

lsleonard committed
Updates for random data
1. Modified random data check and added a later check for random data.
2. Added an early call to single value mode.
3. Updated unused extended string mode to calculate high bit clear during processing and check for overflow as late as possible.
1 parent 569afa7 commit a97471f

7 files changed, +189 -45 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ td512 filename [loopCount]
 
 loopCount (default 1) is the loop count to use for performance testing. Also see BENCHMARK_LOOP_COUNT macro in main.c.
 
-Tiny data compression is not usually supported by compression programs. Now with td512 you can compress data from 6 to 512 bytes. td512 is available under the GPL-3.0 License at https://github.com/lsleonard/tiny-data-compression. Although Zstandard and Snappy get better compression at 512 bytes than td512, Zstandard is very slow for tiny datasets and both programs steadily decline in compression ratio as the number of bytes decreases to 128. At 64 bytes, neither program produces compression. td512 combines the compressed output of td64 for each block of 64 bytes in the input, meaning that the compression achieved at 512 bytes is the same as that for 64 bytes. The td512 algorithm emphasizes speed, and running on a 2 GHz processor, gets 24% average compression at 272 Mbytes per second on the Squash benchmark test data (see https://quixdb.github.io/squash-benchmark/#). Although Huffman coding, with its optimal compression using frequency analysis of values, has been used effectively for many applications, for tiny datasets the compression modes used in td512 approach or exceed the results of using the Huffman algorithm. And with a focus on speed of execution, Huffman and arithmetic coding are not practical algorithms for applications of tiny data. Two areas where high-speed compression using td512 might be applied are small message text and programmatic objects.
+Tiny data compression is not usually supported by compression programs. Now with td512 you can compress data from 6 to 512 bytes. td512 is available under the GPL-3.0 License at https://github.com/lsleonard/tiny-data-compression. Although for some types of data, programs QuickLZ, Zstandard and Snappy can get better compression at 512 bytes than td512, all steadily decline in compression ratio as the number of bytes decreases to 128. At 64 bytes, none of these programs produces compression. td512 combines the compressed output of td64 for each block of 64 bytes in the input, meaning that the compression achieved at 512 bytes is the same as that for 64 bytes. The td512 algorithm emphasizes speed, and running on a 2 GHz processor, gets 24% average compression at 323 Mbytes per second on the Squash benchmark test data (see https://quixdb.github.io/squash-benchmark/#). Although Huffman coding, with its optimal compression using frequency analysis of values, has been used effectively for many applications, for tiny datasets the compression modes used in td512 approach or exceed the results of using the Huffman algorithm. And with a focus on speed of execution, Huffman and arithmetic coding are not practical algorithms for applications of tiny data. Two areas where high-speed compression using td512 might be applied are small message text and programmatic objects.
 
 You can call the td512 and td512d functions to compress and decompress 1 to 512 bytes. The td512 interface performs compression of 6 to 512 bytes, but accepts 1 to 5 bytes and stores them without compression. td512 acts as a wrapper that uses the td64 interface to compress blocks of 64 bytes until the final block of 64 or fewer bytes is compressed. Along with the number of bytes processed, a pass/fail bit is stored for each 64-byte (or smaller) block compressed, and the compressed or uncompressed data is output.
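The README paragraph above describes td512 as a wrapper that hands 64-byte blocks to td64 and records a pass/fail indication per block. The blocking loop might be sketched as follows. This is only an illustration under stated assumptions: `block_compress_fn`, `wrap_blocks`, `toy_single_value` and the output layout (a flag byte and a length byte ahead of each block) are hypothetical and are not the td512 format, which this diff does not show.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 64u

/* Hypothetical stand-in for td64: returns the number of compressed bytes
   written to out, or 0 when the block could not be compressed. */
typedef uint32_t (*block_compress_fn)(const unsigned char *in, unsigned char *out, uint32_t n);

/* Illustrative layout only (NOT the td512 format): for each block, one flag
   byte (1 = compressed, 0 = stored raw), one length byte, then the payload.
   Returns the total number of bytes written to out. */
static uint32_t wrap_blocks(const unsigned char *in, uint32_t nBytes,
                            unsigned char *out, block_compress_fn compress)
{
    assert(nBytes >= 1 && nBytes <= 512);
    uint32_t inPos = 0, outPos = 0;
    while (inPos < nBytes)
    {
        /* the final block may hold fewer than 64 bytes */
        uint32_t n = nBytes - inPos < BLOCK_SIZE ? nBytes - inPos : BLOCK_SIZE;
        unsigned char tmp[BLOCK_SIZE];
        uint32_t c = compress(in + inPos, tmp, n);
        if (c != 0 && c < n)
        {
            out[outPos++] = 1;                 /* pass: compressed payload */
            out[outPos++] = (unsigned char)c;
            memcpy(out + outPos, tmp, c);
            outPos += c;
        }
        else
        {
            out[outPos++] = 0;                 /* fail: store block unchanged */
            out[outPos++] = (unsigned char)n;
            memcpy(out + outPos, in + inPos, n);
            outPos += n;
        }
        inPos += n;
    }
    return outPos;
}

/* Toy compressor for the example: succeeds only when every byte in the
   block is identical, and emits that single byte. */
static uint32_t toy_single_value(const unsigned char *in, unsigned char *out, uint32_t n)
{
    for (uint32_t i = 1; i < n; i++)
        if (in[i] != in[0])
            return 0;
    out[0] = in[0];
    return 1;
}
```

With 100 input bytes, the loop produces one full 64-byte block and one 36-byte block, and each block succeeds or falls back to raw storage on its own, which is the behavior the README attributes to the per-block pass/fail bit.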

Tiny Data Compression with td512.docx

-1.52 KB
Binary file not shown.

td512.h

Lines changed: 6 additions & 0 deletions
@@ -66,6 +66,12 @@
 3. Set the initial loop in td64 to 7/16 of input values for 24 or
 more inputs. This provides a better result for adaptive text mode.
 */
+// Notes for version 1.1.6
+/*
+1. Modified random data check and added a later check for random data.
+2. Added an early call to single value mode.
+3. Updated unused extended string mode to calculate high bit clear during processing and check for overflow as late as possible.
+*/
 #ifndef td512_h
 #define td512_h

td64.c

Lines changed: 115 additions & 34 deletions
@@ -7,6 +7,7 @@
 // Copyright © 2021 L. Stevan Leonard. All rights reserved.
 #include "td64.h"
 #include "td64_internal.h"
+#include "tdString.h"
 
 #ifdef TD64_TEST_MODE
 // these globals can be used to collect info
@@ -15,6 +16,7 @@ uint32_t g_td64FailedStringMode=0;
 uint32_t g_td64MaxStringModeUniquesExceeded=0;
 uint32_t g_td64Text8bitCount=0;
 uint32_t g_td64AdaptiveText8bitCount=0;
+uint32_t g_td64StringBlocks=0;
 #endif
 
 // fixed bit compression (fbc): for the number of uniques in input, the minimum number of input values for 25% compression
@@ -661,8 +663,9 @@ int32_t encodeAdaptiveTextMode(unsigned char *inVals, unsigned char *outVals, co
 
 // save uniques for possible failure
 memcpy(saveUniques, outVals+1, nUniquesIn);
-if (predefinedTextCharCnt > nValues * 3/4)
+if (predefinedTextCharCnt)
 {
+// predefined text char count is high enough to guarantee compression even if remainder of checked values contain no text chars
 // use standard text table, accept compression even if maxBytes exceeded
 outVals[0] = 0x7; // indicate text mode with standard text
 outVals[1] = 0; // init first value used by esmOutputBits
@@ -677,18 +680,19 @@ int32_t encodeAdaptiveTextMode(unsigned char *inVals, unsigned char *outVals, co
 else
 {
 // output char not predefined or adaptive
+if (nextOutIx > maxBytes)
+{
+// main verifies up to 1/2 of data values looked at are text
+// reset uniques in output array
+memcpy(outVals+1, saveUniques, nUniquesIn);
+return 0; // requested compression not met
+}
 esmOutputBits(outVals, 3, 0x5, &nextOutIx, &nextOutBit);
 esmOutputBits(outVals, 8, inVal, &nextOutIx, &nextOutBit); // output 8 bits
 #ifdef TD64_TEST_MODE
 g_td64Text8bitCount++;
 #endif
 }
-if (nextOutIx > maxBytes)
-{
-// reset uniques in output array
-memcpy(outVals+1, saveUniques, nUniquesIn);
-return 0; // requested compression not met
-}
 }
 }
 else
@@ -707,20 +711,23 @@ int32_t encodeAdaptiveTextMode(unsigned char *inVals, unsigned char *outVals, co
 else
 {
 // output char not predefined or adaptive
+if (nextOutIx > maxBytes)
+{
+// main verifies only 1/2 of data values looked at are text
+resetAdaptiveChars(adaptiveUsed); // prep for next time
+// reset uniques in output array
+memcpy(outVals+1, saveUniques, nUniquesIn);
+return 0; // requested compression not met
+}
 esmOutputBits(outVals, 3, 0x5, &nextOutIx, &nextOutBit);
 esmOutputBits(outVals, 8, inVal, &nextOutIx, &nextOutBit); // output 8 bits
 #ifdef TD64_TEST_MODE
-g_td64AdaptiveText8bitCount++;
+if ((outVals[0] & 0x37) == 7)
+g_td64Text8bitCount++;
+else
+g_td64AdaptiveText8bitCount++;
 #endif
 }
-if (nextOutIx > maxBytes)
-{
-// main verifies only 1/2 of data values looked at are text
-resetAdaptiveChars(adaptiveUsed); // prep for next time
-// reset uniques in output array
-memcpy(outVals+1, saveUniques, nUniquesIn);
-return 0; // requested compression not met
-}
 }
 }
 return nextOutIx * 8 + nextOutBit;
@@ -994,6 +1001,30 @@ int32_t encodeStringMode(const unsigned char *inVals, unsigned char *outVals, co
 return 0; // not compressible
 } // end encodeStringMode
 
+static inline uint32_t getNum2char(const unsigned char *inVals, const uint32_t *uniqueOccurrence, const uint32_t nValues)
+{
+// unique offsets must be preset from main loop
+uint32_t n2char=0;
+uint32_t inPos=1; // first value preset
+int32_t twoVals[32];
+uint32_t UOinVal;
+uint32_t UOnextIn=0; // first value is unique offset 0
+memset(twoVals, 255, sizeof(twoVals));
+
+while (inPos < nValues-1)
+{
+UOinVal = UOnextIn;
+UOnextIn = uniqueOccurrence[inVals[inPos++]];
+if (UOinVal > 31 || UOnextIn > 31)
+continue; // only support up to 32 unique values
+if (twoVals[UOinVal] == -1)
+twoVals[UOinVal] = UOnextIn;
+else if (twoVals[UOinVal] == UOnextIn)
+n2char++;
+}
+return n2char;
+} // end getNum2char
+
 int32_t td64(unsigned char *inVals, unsigned char *outVals, const uint32_t nValues)
 // td64: Compress nValues bytes. Return 0 if not compressible (no output bytes),
 // -1 if error; otherwise, number of bits written to outVals.
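The getNum2char helper added in the hunk above appears to count how often a value is followed by the same successor that followed it the first time, a rough signal for repeated two-character sequences. A self-contained variant of that idea is sketched below. It is an assumption-laden illustration rather than the commit's code: `count_repeat_pairs` is a hypothetical name, it builds its own first-seen index table instead of relying on the `uniqueOccurrence` array preset by the td64 main loop, and it scans every adjacent pair.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical variant of the pair-counting idea: count adjacent pairs whose
   first value is followed by the same successor recorded for it earlier.
   Only the first 32 distinct values are tracked, as in the hunk above. */
static uint32_t count_repeat_pairs(const unsigned char *in, uint32_t n)
{
    assert(n >= 2);
    int32_t firstSeen[256]; /* index given to each distinct byte, -1 if unseen */
    int32_t successor[32];  /* first successor index recorded per tracked value */
    uint32_t nUniques = 0, count = 0;

    for (uint32_t i = 0; i < 256; i++)
        firstSeen[i] = -1;
    for (uint32_t i = 0; i < 32; i++)
        successor[i] = -1;

    /* assign indexes in order of first appearance */
    for (uint32_t i = 0; i < n; i++)
        if (firstSeen[in[i]] < 0)
            firstSeen[in[i]] = (int32_t)nUniques++;

    for (uint32_t i = 0; i + 1 < n; i++)
    {
        int32_t cur = firstSeen[in[i]];
        int32_t next = firstSeen[in[i + 1]];
        if (cur > 31 || next > 31)
            continue; /* beyond the 32 tracked values */
        if (successor[cur] == -1)
            successor[cur] = next; /* remember the first successor */
        else if (successor[cur] == next)
            count++;               /* same pair seen again */
    }
    return count;
}
```

For the 8-byte input "abababab" the first "ab" and "ba" pairs are recorded and the remaining five pairs are counted, while an input of eight distinct bytes yields 0.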
@@ -1038,10 +1069,11 @@ int32_t td64(unsigned char *inVals, unsigned char *outVals, const uint32_t nValu
 highBitCheck |= inVal; // keep watch on high bit of unique values
 }
 }
-if (nUniqueVals > nValsInitLoop * 7/8 + 1)
+if (nUniqueVals > nValsInitLoop * 7/8 - 1)
 {
-// supported unique values exceeded--skip this for < 16 values
-if (nValues >= MIN_VALUES_7_BIT_MODE && (highBitCheck & 0x80) == 0)
+// supported unique values exceeded
+// check highBitCheck for high bit clear
+if ((highBitCheck & 0x80) == 0 && nValues >= MIN_VALUES_7_BIT_MODE)
 {
 // attempt to compress based on high bit clear across all values
 // confirm remaining values have high bit clear
@@ -1053,10 +1085,10 @@ int32_t td64(unsigned char *inVals, unsigned char *outVals, const uint32_t nValu
 outVals[0] = 0; // indicate random data failure
 return 0; // too many uniques to compress with fixed bit coding
 }
-if (nUniqueVals > uniqueLimit/2 && predefinedTextCharCnt > nValsInitLoop / 2)
+if (nUniqueVals > uniqueLimit/2 && predefinedTextCharCnt > nValsInitLoop/2)
 {
 // encode in text mode if at least 11% compression expected
-uint32_t retBits=encodeAdaptiveTextMode(inVals, outVals, nValues, val256, nUniqueVals, predefinedTextCharCnt, nValues-nValues/8);
+uint32_t retBits=encodeAdaptiveTextMode(inVals, outVals, nValues, val256, nUniqueVals, predefinedTextCharCnt > nValsInitLoop*3/4, nValues-nValues/8);
 if (retBits != 0)
 return retBits;
 #ifdef TD64_TEST_MODE
@@ -1084,6 +1116,28 @@ int32_t td64(unsigned char *inVals, unsigned char *outVals, const uint32_t nValu
 break; // continue loop without further checking
 }
 }
+if (singleValue >= 0 && nUniqueVals > uniqueLimit)
+{
+// early opportunity for single value mode
+// single value mode is fast and set to get minimum 12% compression for 64 values
+// only single value mode can have more than MAX_STRING_MODE_UNIQUES
+return encodeSingleValueMode(inVals, outVals, nValues, singleValue);
+}
+const uint32_t nUniquesRandom=nValues*5/8 < MAX_STRING_MODE_UNIQUES ? nValues*5/8 : MAX_STRING_MODE_UNIQUES;
+if (nUniqueVals > nUniquesRandom)
+{
+if ((highBitCheck & 0x80) == 0 && nValues >= MIN_VALUES_7_BIT_MODE)
+{
+// attempt to compress based on high bit clear across all values
+// confirm remaining values have high bit clear
+while (inPos < nValues)
+highBitCheck |= inVals[inPos++];
+if ((highBitCheck & 0x80) == 0)
+return encode7bits(inVals, outVals, nValues);
+}
+outVals[0] = 0; // indicate random data failure
+return 0; // too many uniques to compress with fixed bit coding
+}
 if (nUniqueVals <= uniqueLimit) // confirm unique limit has not been exceeded
 {
 // continue fixed bit loop with checks for high bit set and repeat counts,
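The later random-data check added in this hunk caps the allowed number of unique values at the smaller of 5/8 of the input count and MAX_STRING_MODE_UNIQUES, and falls back to 7-bit mode when no value has its high bit set. The decision can be sketched in isolation as below. The constant values are assumptions inferred from comments in this diff (32 for MAX_STRING_MODE_UNIQUES, 16 for MIN_VALUES_7_BIT_MODE), and `random_unique_limit`, `all_high_bits_clear` and `looks_random` are hypothetical names, not functions from td64.c.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_STRING_MODE_UNIQUES 32u /* assumed; the diff comments mention "more than 32 uniques" */
#define MIN_VALUES_7_BIT_MODE   16u /* assumed; the version notes mention a minimum of 16 */

/* Threshold from the hunk above: min(nValues*5/8, MAX_STRING_MODE_UNIQUES). */
static uint32_t random_unique_limit(uint32_t nValues)
{
    uint32_t limit = nValues * 5 / 8;
    return limit < MAX_STRING_MODE_UNIQUES ? limit : MAX_STRING_MODE_UNIQUES;
}

/* OR all bytes together and test bit 7, as highBitCheck does in td64. */
static int all_high_bits_clear(const unsigned char *in, uint32_t n)
{
    unsigned char acc = 0;
    for (uint32_t i = 0; i < n; i++)
        acc |= in[i];
    return (acc & 0x80) == 0;
}

/* Returns 1 when the block would be labeled random: too many uniques and
   7-bit mode is not available. Returns 0 otherwise. */
static int looks_random(const unsigned char *in, uint32_t nValues, uint32_t nUniqueVals)
{
    assert(nValues >= 1 && nValues <= 64);
    if (nUniqueVals <= random_unique_limit(nValues))
        return 0; /* few enough uniques for other modes to try */
    if (nValues >= MIN_VALUES_7_BIT_MODE && all_high_bits_clear(in, nValues))
        return 0; /* 7-bit mode can still save one bit per value */
    return 1;
}
```

Under these assumed constants the limit is 32 for 64 inputs, 20 for 32 inputs and 10 for 16 inputs, so the 5/8 rule only takes over from the 32-unique cap below 52 input values.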
@@ -1105,22 +1159,41 @@ int32_t td64(unsigned char *inVals, unsigned char *outVals, const uint32_t nValu
 // fixed bit coding failed, try for other compression modes
 if (singleValue >= 0)
 {
+// second chance for single value mode
+// single value mode is fast and set to get minimum 12% compression for 64 values
+// only single value mode can have more than MAX_STRING_MODE_UNIQUES
 return encodeSingleValueMode(inVals, outVals, nValues, singleValue);
 }
 if ((nValues >= MIN_VALUES_STRING_MODE))
 {
-if ((nUniqueVals > MAX_STRING_MODE_UNIQUES))
+uint32_t maxBits = ((highBitCheck & 0x80) == 0 && nValues >= MIN_VALUE_7_BIT_MODE_12_PERCENT) ? nValues*7 : nValues*7+nValues/2 ;
+int32_t retBits;
+#ifdef TD64_TEST_MODE
+g_td64StringBlocks++;
+#endif
+if (nUniqueVals > MAX_STRING_MODE_UNIQUES)
 {
+// extended string mode supports up to 64 uniques but is slow and not guaranteed to achieve any particular compression, and is needed less than 5% of time in data tested; could be used if a quick metric to predict compression level can be found
+// NOTE: more than 32 uniques is currently being labeled random data
 #ifdef TD64_TEST_MODE
 g_td64MaxStringModeUniquesExceeded++;
 #endif
-}
+/*
+// extended string mode
+uint32_t nValuesOut;
+int32_t retBits=encodeStringModeExtended(inVals, outVals, nValues, &nValuesOut);
+if (retBits < 0)
+return retBits;
+if (retBits < maxBits)
+return retBits;
+#ifdef TD64_TEST_MODE
+g_td64FailedStringMode++;
+#endif
+*/ }
 else
 {
-// string mode for 32+ values with 32 or fewer uniques
-int32_t retBits;
+// string mode for 32+ values with 17 to 32 uniques
 // max bits set to 12% if high bit clear and enough input values, else 6%
-uint32_t maxBits = ((highBitCheck & 0x80) == 0 && nValues >= MIN_VALUE_7_BIT_MODE_12_PERCENT) ? nValues*7 : nValues*7+nValues/2 ;
 if ((retBits=encodeStringMode(inVals, outVals, nValues, nUniqueVals, uniqueOccurrence, (highBitCheck & 0x80) == 0, maxBits)) != 0)
 return retBits;
 #ifdef TD64_TEST_MODE
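The maxBits expression in this hunk encodes the "12% else 6%" rule from its comment: 7 bits per value saves 1 bit in 8 (12.5%), and 7.5 bits per value saves half a bit in 8 (6.25%). A small sketch of that arithmetic follows. `string_mode_max_bits` is a hypothetical helper, and the value 24 for MIN_VALUE_7_BIT_MODE_12_PERCENT is an assumption based on the version 1.1.2 notes in td64.h, not a value shown in this diff.

```c
#include <assert.h>
#include <stdint.h>

#define MIN_VALUE_7_BIT_MODE_12_PERCENT 24u /* assumed from the 1.1.2 notes ("fewer than 24 input values") */

/* Output bit budget for string mode, mirroring the maxBits expression:
   nValues*7 bits (12.5% saved) when all high bits are clear and there are
   enough inputs, otherwise nValues*7 + nValues/2 bits (6.25% saved). */
static uint32_t string_mode_max_bits(uint32_t nValues, int highBitClear)
{
    assert(nValues >= 1 && nValues <= 64);
    return (highBitClear && nValues >= MIN_VALUE_7_BIT_MODE_12_PERCENT)
               ? nValues * 7
               : nValues * 7 + nValues / 2;
}
```

For a full 64-byte block (512 input bits) the budget is 448 bits when the high bit is clear everywhere and 480 bits otherwise. An encoder result is accepted only if it fits within that budget.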
@@ -1136,18 +1209,21 @@ int32_t td64(unsigned char *inVals, unsigned char *outVals, const uint32_t nValu
 outVals[0] = 1; // indicate general failure to compress
 return 0; // unable to compress
 }
-else if (nUniqueVals > 8 && singleValue >= 0)
+else if (nUniqueVals > 8)
 {
-// check for benefit of single value mode when 4-bit fixed bit encoding
-// requires at least 38 input values to have 9 or more uniques
-const uint32_t singleValueOverFixexBitRepeats=nValues/2-nValues/16;
-if (val256[singleValue] >= singleValueOverFixexBitRepeats)
+if (singleValue >= 0)
 {
-// favor single value over fixed 4-bit encoding
-return encodeSingleValueMode(inVals, outVals, nValues, singleValue);
+// check for benefit of single value mode when 4-bit fixed bit encoding
+// requires at least 38 input values to have 9 or more uniques
+// FUTURE: graduate based on number of uniques versus fixed 31%
+const uint32_t singleValueOverFixexBitRepeats=nValues/2-nValues/16; //
+if (val256[singleValue] >= singleValueOverFixexBitRepeats)
+{
+// favor single value over fixed 4-bit encoding
+return encodeSingleValueMode(inVals, outVals, nValues, singleValue);
+}
 }
 }
-
 // process fixed bit coding
 uint32_t i;
 uint32_t nextOut;
@@ -1761,6 +1837,11 @@ int32_t td64d(const unsigned char *inVals, unsigned char *outVals, const uint32_
 
 // first bit of first byte 1: decode one of four modes
 const unsigned char firstByte=inVals[0];
+if (firstByte == 0x7f)
+{
+// string mode extended
+return decodeStringModeExtended(inVals, outVals, nOriginalValues, bytesProcessed);
+}
 if ((firstByte & 7) == 0x01)
 {
 // string mode

td64.h

Lines changed: 58 additions & 5 deletions
@@ -15,21 +15,74 @@
 You should have received a copy of the GNU General Public License
 along with this program. If not, see <https://www.gnu.org/licenses/>.//
 */
-// Notes for version 2.1.0:
+// Notes for version 1.1.0:
 /*
-1. Modified the td64 interface after studying the results from compressing
-up to 512 bytes in the td512 interface.
+1. Main program reads a file into memory that is compressed by
+calling td512 repeatedly. When complete, the compressed data is
+written to a file and read for decompression by calling td512d.
+td512 filename [loopCount]
+filename is required argument 1.
+loopCount is optional argument 2 (default: 1). Looping is performed over the entire input file.
 */
+// Notes for version 1.1.1:
+/*
+1. Updated some descriptive comments.
+*/
+// Notes for version 1.1.2:
+/*
+1. Moved 7-bit mode defines to td64.h because they are used
+outside of the 7-bit mode.
+2. When fewer than minimum values to use 7-bit mode of 16, don't
+accumulate high bit when reasonable. Main loop keeps this in because
+time required is minimal.
+3. When fewer than 24 input values, but greater than or equal to minimum
+values of 16 to use 7-bit mode, use 6% as minimum compression for
+compression modes used prior to 7-bit mode.
+*/
+// Notes for version 1.1.3:
+/*
+1. Fixed bugs in td5 and td5d functions.
+2. Recognize random data starting at 16 input values.
+*/
+// Notes for version 1.1.4:
+/*
+1. Added bit text mode that uses variable length encoding bits
+to maximize compression. td5 still uses the fixed bit text mode.
+2. Changed the random data metric to use number values init
+loop * 7/8 + 1 to be threshold for random data.
+3. Implemented a static global for decoding bit text mode and
+string mode to limit reads of input values.
+*/
+// Notes for version 1.1.5
+/*
+1. Added adaptive text mode that looks for occurrences of characters
+that are common to a particular data type when fewer than 3/4 of
+the input values are matched by a predefined character. Defined
+XML and HTML based on '<', '>', '/' and '"'. Defined C or other
+code files based on '*', '=', ';' and '\t'. Eight characters
+common to the text type are defined in the last 8 characters of
+the characters encoded.
+2. Added compression of high bit in unique characters in string mode
+when the high bit is 0 for all values.
+3. Set the initial loop in td64 to 7/16 of input values for 24 or
+more inputs. This provides a better result for adaptive text mode.
+*/
+// Notes for version 1.1.6
+/*
+1. Modified random data check and added a later check for random data.
+2. Added an early call to single value mode.
+3. Updated unused extended string mode to calculate high bit clear during processing and check for overflow as late as possible.
+*/
 #ifndef td64_h
 #define td64_h
 
 #include <stdint.h>
 #include <string.h>
 #include <stdlib.h>
-//#define NDEBUG // disable asserts
+#define NDEBUG // disable asserts
 #include <assert.h>
 
-#define TD64_VERSION "v2.1.0"
+#define TD64_VERSION "v1.1.6"
 #define MAX_TD64_BYTES 64 // max input vals supported
 #define MIN_TD64_BYTES 1 // min input vals supported
 #define MAX_UNIQUES 16 // max uniques supported in input
