Commit 097a56d

feat: allow custom handling of duplicate rows
1 parent 91f78fd commit 097a56d

7 files changed: 439 additions and 487 deletions

README.md

Lines changed: 66 additions & 8 deletions
@@ -122,6 +122,31 @@ const stats = await diff({
 console.log(stats);
 ```
 
+### Diff 2 CSV files on the console with a single case insensitive primary key (using a custom comparer)
+
+```Typescript
+import { diff, CellValue, cellComparer, stringComparer } from 'tabular-data-differ';
+
+function caseInsensitiveCompare(a: CellValue, b: CellValue): number {
+    if (typeof a === 'string' && typeof b === 'string') {
+        return stringComparer(a.toLowerCase(), b.toLowerCase());
+    }
+    return cellComparer(a, b);
+}
+
+const stats = await diff({
+    oldSource: './tests/a.csv',
+    newSource: './tests/b.csv',
+    keys: [
+        {
+            name: 'id',
+            comparer: caseInsensitiveCompare,
+        }
+    ],
+}).to('console');
+console.log(stats);
+```
+
 ### Diff 2 CSV files and only get the stats
 
 ```Typescript
@@ -294,6 +319,37 @@ const stats = await ctx.to({
 console.log(stats);
 ```
 
+### Duplicate key handling
+
+If your data sources contain duplicate keys, then the diffing will fail by default, but you can configure this behavior using the duplicateKeyHandling option.
+
+You can resolve the conflict by keeping the first or last row of the duplicates:
+```Typescript
+import { diff } from 'tabular-data-differ';
+const stats = await diff({
+    oldSource: './tests/a.csv',
+    newSource: './tests/b.csv',
+    keys: ['id'],
+    duplicateKeyHandling: 'keepFirstRow', // or 'keepLastRow'
+    duplicateRowBufferSize: 2000,
+}).to('console');
+console.log(stats);
+```
+
+Or, if you need more control in the row selection, then you can provide your own handler:
+```Typescript
+import { diff } from 'tabular-data-differ';
+const stats = await diff({
+    oldSource: './tests/a.csv',
+    newSource: './tests/b.csv',
+    keys: ['id'],
+    duplicateKeyHandling: (rows) => rows[0], // same as 'keepFirstRow'
+    duplicateRowBufferSize: 2000,
+}).to('console');
+console.log(stats);
+```
+
+
 ### Order 2 CSV files and diff them on the console
 
 Don't forget to install first my other lib: `npm i huge-csv-sorter`.
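
Review note on the "Duplicate key handling" section added in this hunk: the custom handler example only reproduces 'keepFirstRow', so a handler that selects by content may be a more useful illustration. The sketch below assumes that a row is an array of cell values and that the handler receives every row sharing a key and returns the one to keep, which is what `(rows) => rows[0]` suggests. The `Cell` and `Row` aliases and the `keepMostCompleteRow` name are hypothetical and defined locally rather than imported from the library.

```typescript
// Hypothetical local aliases; the library's own types may differ.
type Cell = string | number | boolean | null;
type Row = Cell[];

// Counts the cells that carry a value (neither null nor an empty string).
function filledCells(row: Row): number {
    return row.filter((cell) => cell !== null && cell !== '').length;
}

// Returns the duplicate with the most filled cells.
// On a tie the earliest row wins, which mirrors 'keepFirstRow'.
// Throws a TypeError when called with an empty list.
function keepMostCompleteRow(rows: Row[]): Row {
    return rows.reduce((best, row) =>
        filledCells(row) > filledCells(best) ? row : best
    );
}

const duplicates: Row[] = [
    ['1', 'alice', ''],
    ['1', 'alice', 'alice@example.com'],
    ['1', '', ''],
];
// Logs the second row, the only one with all three cells filled.
console.log(keepMostCompleteRow(duplicates));
```

If the option accepts a function of this shape, it could be passed as `duplicateKeyHandling: keepMostCompleteRow`.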
@@ -524,14 +580,16 @@ sortDirection| no | ASC | specifies if the column is sorted in ascen
 
 ### Differ options
 
-Name |Required|Default value|Description
-----------------|--------|-------------|-----------
-oldSource | yes | | either a string filename, a URL or a SourceOptions
-newSource | yes | | either a string filename, a URL or a SourceOptions
-keys | yes | | the list of columns that form the primary key. This is required for comparing the rows. A key can be a string name or a {ColumnDefinition}
-includedColumns | no | | the list of columns to keep from the input sources. If not specified, all columns are selected.
-excludedColumns | no | | the list of columns to exclude from the input sources.
-rowComparer | no | | specifies a custom row comparer.
+Name |Required|Default value|Description
+----------------------|--------|-------------|-----------
+oldSource | yes | | either a string filename, a URL or a SourceOptions
+newSource | yes | | either a string filename, a URL or a SourceOptions
+keys | yes | | the list of columns that form the primary key. This is required for comparing the rows. A key can be a string name or a {ColumnDefinition}
+includedColumns | no | | the list of columns to keep from the input sources. If not specified, all columns are selected.
+excludedColumns | no | | the list of columns to exclude from the input sources.
+rowComparer | no | | specifies a custom row comparer.
+duplicateKeyHandling |no | fail | specifies how to handle duplicate rows in a source. It will fail by default and throw a UniqueKeyViolationError exception. But you can ignore, keep the first or last row, or even provide your own function that will receive the duplicates and select the best candidate.
+duplicateRowBufferSize|no | 1000 | specifies the maximum size of the buffer used to accumulate duplicate rows.
 
 ### diff function
 
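
Review note on the two new options: the sketch below shows one way duplicate resolution over key-sorted rows could work. Consecutive rows with the same key are accumulated in a buffer, and the buffer is resolved when the key changes. This is not the library's implementation. The `resolveDuplicates` function, the `Handling` type, plain `Error` in place of `UniqueKeyViolationError`, and the choice to throw when the buffer limit is exceeded are all assumptions made for illustration. The "ignore" mode mentioned in the table is left out because its exact behavior is not described in this diff.

```typescript
// Hypothetical local aliases; the library's own types may differ.
type Cell = string | number | boolean | null;
type Row = Cell[];
type Handling = 'fail' | 'keepFirstRow' | 'keepLastRow' | ((rows: Row[]) => Row);

// Collapses runs of rows sharing the same key into a single row.
// The input must already be sorted by the key column.
function resolveDuplicates(
    sortedRows: Row[],
    keyIndex: number,
    handling: Handling = 'fail',
    bufferSize = 1000,
): Row[] {
    const result: Row[] = [];
    let buffer: Row[] = [];
    let currentKey: Cell | undefined;

    // Emits the row chosen for the buffered group, then clears the buffer.
    const flush = (): void => {
        if (buffer.length === 0) return;
        if (typeof handling === 'function') {
            result.push(handling(buffer));
        } else if (handling === 'keepLastRow') {
            result.push(buffer[buffer.length - 1]);
        } else {
            result.push(buffer[0]);
        }
        buffer = [];
    };

    for (const row of sortedRows) {
        const key = row[keyIndex];
        // A new key closes the current group.
        if (buffer.length > 0 && key !== currentKey) flush();
        currentKey = key;
        buffer.push(row);
        if (handling === 'fail' && buffer.length > 1) {
            throw new Error(`Duplicate key: ${String(key)}`);
        }
        // Assumed behavior: exceeding the buffer limit is an error.
        if (buffer.length > bufferSize) {
            throw new Error(`More than ${bufferSize} rows share the key ${String(key)}`);
        }
    }
    flush();
    return result;
}

const rows: Row[] = [
    ['1', 'first'],
    ['1', 'second'],
    ['2', 'only'],
];
// Keeps ['1', 'second'] and ['2', 'only'].
console.log(resolveDuplicates(rows, 0, 'keepLastRow'));
```

Buffering whole groups is what makes a custom handler possible, since the handler needs to see every duplicate at once; that is presumably why a size limit is exposed alongside the handling mode.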

package-lock.json

Lines changed: 3 additions & 22 deletions
Some generated files are not rendered by default.

package.json

Lines changed: 1 addition & 2 deletions
@@ -1,6 +1,6 @@
 {
   "name": "tabular-data-differ",
-  "version": "1.0.2",
+  "version": "1.1.0",
   "description": "A very efficient library for diffing two sorted streams of tabular data, such as CSV files.",
   "keywords": [
     "table",
@@ -33,7 +33,6 @@
   "devDependencies": {
     "@jest/globals": "29.3.1",
     "@types/jest": "29.2.4",
-    "@types/n-readlines": "1.0.3",
     "@types/node": "18.11.17",
     "jest": "29.3.1",
     "ts-jest": "29.0.3",
