Skip to content

Commit b8b4467

Browse files
KDBX-564TPPub option to publish to TP
1 parent 48c593e commit b8b4467

6 files changed

Lines changed: 262 additions & 96 deletions

File tree

docs/reference.md

Lines changed: 62 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -1,38 +1,52 @@
11
# NYSE TAQ Data Loader module
22

3-
This module provides high-performance utilities for parsing [NYSE TAQ (Trade and Quote) PSV files](https://ftp.nyse.com/Historical%20Data%20Samples/DAILY%20TAQ/), performing data transformations, and loading the results either into a date-partitioned kdb+ database (HDB) or directly into in-memory tables (RDB).
3+
This module provides high-performance utilities for parsing [NYSE TAQ (Trade and Quote) PSV files](https://ftp.nyse.com/Historical%20Data%20Samples/DAILY%20TAQ/) and performing data transformations. Results can be:
4+
5+
- persisted into a date-partitioned kdb+ database (HDB)
6+
- loaded into in-memory tables (RDB)
7+
- replayed into a [tickerplant](https://code.kx.com/q/architecture/tickq/) (TP)
48

59
## Prerequisites: Input PSV Files
610

711
The module requires NYSE TAQ master, trade, and quote PSV files. These can be downloaded from the [NYSE FTP server](https://ftp.nyse.com/Historical%20Data%20Samples/DAILY%20TAQ/) and extracted into a local directory.
812

913
> [!WARNING]
10-
> The NYSE TAQ files are large. Depending on your network bandwidth, downloading them may take a long time and may require tens of gigabytes of disk space. Consider passing `SIZE=small` to `getCSVs.sh`.
14+
> The NYSE TAQ files are large. Depending on your network bandwidth, downloading them may take a long time and may require tens of gigabytes of disk space. Consider passing `--size small` to `getCSVs.sh`.
1115
12-
A utility script, `getCSVs.sh` (located in directory `scripts`), is provided to automate the download and extraction process via `curl`. To download and unzip all available TAQ files to `/tmp/nysetaqpsv`:
16+
A utility script, `getCSVs.sh` (located in the `scripts` directory), is provided to automate the download and extraction process via `curl`. To download and unzip all available TAQ files to `/tmp/nysetaqpsv`:
1317

1418
```bash
1519
# Extract available dates from the NYSE FTP
1620
DATES=$(curl -s "https://ftp.nyse.com/Historical%20Data%20Samples/DAILY%20TAQ/" | grep -oE '"EQY_US_ALL_TRADE_[0-9]{8}\.gz"' | grep -oE '[0-9]{8}' |paste -sd,)
17-
./scripts/getCSVs.sh /tmp/nysetaqpsv "$DATES"
21+
./scripts/getCSVs.sh --csvdir /tmp/nysetaqpsv --dates "$DATES"
1822
```
1923

2024
To manage disk space and bandwidth, you can restrict the download scope by:
2125

2226
1. **Limiting Dates:** Target a specific date (e.g., the most recent available).
23-
1. **Using the `SIZE` Variable**: Filter by symbol ranges using the `SIZE` environment variable.
27+
1. **Using the `--size` flag**: Filter by symbol ranges using the `--size` flag (or `-s`).
2428

2529
```bash
2630
# Example: Download PSVs for only a single date
2731
DATES=$(curl -s https://ftp.nyse.com/Historical%20Data%20Samples/DAILY%20TAQ/| grep -oE 'EQY_US_ALL_TRADE_2[0-9]{7}' | grep -oE '2[0-9]{7}'|head -1)
28-
SIZE=small ./scripts/getCSVs.sh /tmp/nysetaqpsv "$DATES"
32+
./scripts/getCSVs.sh --csvdir /tmp/nysetaqpsv --dates "$DATES" --size small
33+
```
34+
35+
### For replay: sorting by time
36+
37+
To replay NYSE TAQ data to TP, input PSV files must be sorted by time. `./scripts/sort.sh` sorts trade and individual quote files using the [sort](https://man7.org/linux/man-pages/man1/sort.1.html) command, then merges the quote files into a single large file in linear time via `sort -m`.
38+
39+
> **Note:** `sort` requires temporary disk space during sorting. If the default `/tmp` directory has insufficient space, set the `TMPDIR` environment variable to a directory with more space before running the script.
40+
41+
```bash
42+
./scripts/sort.sh --csvdir /tmp/nysetaqpsv --dates "$DATES" --size small
2943
```
3044

3145
### Dataset Statistics (Reference: 2025.07.01)
3246

3347
The following table estimates the data footprint by `SIZE` parameter:
3448

35-
| `SIZE` | Symbol Range (First Letter) | Uncompressed PSVs Size | Uncompressed HDB Size | Symbol Nr | Quote Nr |
49+
| `SIZE` | Symbol Range (First Letter) | Uncompressed PSVs Size | Uncompressed HDB Size | Symbol Count | Quote Count |
3650
| --- | ---: | ---: | ---: | ---: | ---: |
3751
| `small` | Z | ~10 GB | ~0.3 GB | 246 | 4,041,795 |
3852
| `medium` | I | ~20 GB | ~8.1 GB | 1,313 | 125,442,373 |
@@ -41,20 +55,21 @@ The following table estimates the data footprint by `SIZE` parameter:
4155

4256
## Quickstart
4357

44-
This module exposes two primary functions:
58+
This module exposes three primary functions:
4559

4660
- **`parseToDisk`** — parses TAQ data and persists it into a date-partitioned HDB on disk.
4761
- **`parseToMemory`** — parses TAQ data and loads it directly into in-memory tables, suitable for RDB-style workflows.
62+
- **`parseToTP`** — parses TAQ data and sends batches asynchronously to a TP via `.u.upd`, as if data arrived from the exchange. Input PSV files must be [sorted by time](#for-replay-sorting-by-time).
4863

4964
```q
50-
([parseToMemory; parseToDisk]): use `kx.taq
65+
([parseToMemory; parseToDisk; parseToTP]): use `kx.taq
5166
```
5267

5368
### parseToDisk
5469

5570
`parseToDisk` requires at least three parameters.
5671

57-
To create the `trade`, `quote`, `master` tables and `exnames` dictionary for October 2, 2025, and save them to `/tmp/kdbdb`:
72+
To create the `trade`, `quote`, `master` tables, and `exnames` dictionary for October 2, 2025, and save them to `/tmp/kdbdb`:
5873

5974
```q
6075
parseToDisk["/tmp/nysetaqpsv"; 2025.10.02; "/tmp/kdbdb"]
@@ -86,14 +101,14 @@ AMZN 0D04:00:00.009709706 219.5 3 219.41 219.98
86101
AMZN 0D04:00:00.010213563 219.7 3 219.41 219.98
87102
AMZN 0D04:00:00.010379075 219.7 2 219.41 219.98
88103
AMZN 0D04:00:00.010640417 219.98 100 219.41 219.98
89-
..
104+
...
90105
```
91106

92107
### parseToMemory
93108

94109
`parseToMemory` requires at least two parameters — no destination path is needed as data is loaded directly into memory.
95110

96-
To load the `trade`, `quote`, `master` tables and `exnames` dictionary for October 2, 2025, into memory:
111+
To load the `trade`, `quote`, `master` tables, and `exnames` dictionary for October 2, 2025, into memory:
97112

98113
```q
99114
(trade; quote; master; exnames): parseToMemory["/tmp/nysetaqpsv"; 2025.10.02]
@@ -122,17 +137,42 @@ Storing a full day of NYSE TAQ data in memory is RAM-intensive. The table below
122137
| `large` | A-H | ~79 GB |
123138
| `full` | A-Z | ~170 GB |
124139

140+
### parseToTP
141+
142+
Input PSV files must be [sorted by time](#for-replay-sorting-by-time) for proper replay of the data. `parseToTP` requires the tickerplant address as a third parameter:
143+
144+
```q
145+
parseToTP["/tmp/nysetaqpsv"; 2025.10.02; `:localhost:5010; ([batchsize: 5000])]
146+
```
147+
148+
The third parameter is passed directly to [hopen](https://code.kx.com/kdb-x/ref/hopen.html), so you can also pass a port number if the TP is on the same box:
149+
150+
```q
151+
parseToTP["/tmp/nysetaqpsv"; 2025.10.02; 5010; ([batchsize: 5000])]
152+
```
153+
154+
`parseToTP` calls `.u.upd` on the remote q process with table name as the first parameter and the records (as a table) as the second parameter. The simplest way to test this function is to start a q process on the provided port (5010) and define `.u.upd` as `upsert`:
155+
156+
```bash
157+
$ q -p 5010
158+
...
159+
q).u.upd: upsert
160+
```
161+
162+
`parseToTP` publishes quotes after trades. If you prefer simultaneous publication, start two kdb+ processes and use the `tbls` optional parameter to control which tables each process publishes.
163+
125164
## Configuration Options
126165

127-
Both `parseToDisk` and `parseToMemory` accept an optional dictionary as their last argument to customize the ingestion process.
166+
All three functions accept an optional dictionary as their last argument to customize the ingestion process.
128167

129168
### Common parameters
130169

131170
| Key | Default | Description |
132171
| --- | ---: | --- |
133172
| `letters` | `"A-Z"` | Restricts ingestion to symbols whose first letter falls within the specified range (e.g., `"K-N"`). |
134-
| `includetestsymbols` | `0b` | If `1b`, includes instruments flagged as test symbols in the `master` PSV. |
135173
| `batchsize` | `10 000 000` | Number of rows processed per chunk. Set to `0` to read the entire file in one pass for faster throughput if RAM permits. |
174+
| `tbls` | ``` `master`trade`quote ``` | Tables to process. |
175+
| `includetestsymbols` | `0b` | If `1b`, includes instruments flagged as test symbols in the `master` PSV. |
136176
| `logger` | logger created by `.logger.createLog[]` of the [KX log module](https://code.kx.com/kdb-x/modules/logging/overview.html) | Logger used for status updates during the ingestion process. |
137177

138178
### parseToDisk extra parameters
@@ -154,11 +194,18 @@ Both `parseToDisk` and `parseToMemory` accept an optional dictionary as their la
154194
Sorting by `sym` and having a parted attribute on `sym`:
155195

156196
```q
157-
parseToMemory["/tmp/nysetaqpsv"; 2025.10.02; ([letters: "Y-Z"; sortcols: `sym; symattr: `p])]
197+
(trade; quote; master; exnames): parseToMemory["/tmp/nysetaqpsv"; 2025.10.02; ([letters: "Y-Z"; sortcols: `sym; symattr: `p])]
158198
```
159199

160200
The original data is sorted by time within each symbol, so sorting by `sym` actually means sorting by `sym` and `time`.
161201

202+
### parseToTP extra parameters
203+
204+
| Key | Default | Description |
205+
| --- | ---: | --- |
206+
| `starttime` | `0D04:00` | Records before `starttime` are ignored. |
207+
208+
162209
## Performance Notes
163210

164211
* **Multithreading**: The PSV parsing engine is multithreaded. Start your ingestion process with the `-s` flag (e.g., `q -s 8`) to make use of available CPU cores.

docs/release-notes.md

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,14 @@
22

33
_This document provides the version history of the KDB-X Taq Module, detailing released versions, fixes, and improvements._
44

5+
## 1.3.0
6+
7+
**Release Date**: 2026-05-19
8+
9+
### Fixes and Improvements
10+
11+
- New function `parseToTP` (and `script/sort.sh`) to publish data to TP as if data arrived from the exchange.
12+
513
## 1.2.0
614

715
**Release Date**: 2026-05-17

scripts/common.sh

Lines changed: 54 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,54 @@
1+
function get_letters () {
2+
local size=$1
3+
case "$size" in
4+
"full") echo 'A-Z' ;;
5+
"large") echo 'A-H' ;;
6+
"medium") echo 'I-I' ;;
7+
"small") echo 'Z-Z' ;;
8+
esac
9+
}
10+
11+
function getFilename() {
12+
local type=$1 letter=$2 date=$3
13+
echo "${type}_US_ALL_${letter}_${date}.gz"
14+
}
15+
16+
# BSD sed (macOS) requires an explicit empty string after -i; GNU sed does not accept it as a separate arg
17+
if [[ "$(uname -s)" == "Darwin" ]]; then
18+
SED_INPLACE=(sed -i '')
19+
else
20+
SED_INPLACE=(sed -i)
21+
fi
22+
23+
# Default values for optional arguments
24+
CSVDIR=""
25+
DATES_RAW=""
26+
SIZE="full"
27+
28+
usage() {
29+
echo "Usage: $0 --csvdir <dir> --dates <date1,date2,...> [--size small|medium|large|full]"
30+
exit 1
31+
}
32+
33+
while [[ $# -gt 0 ]]; do
34+
case "$1" in
35+
--csvdir|-c) CSVDIR="$2"; shift 2 ;;
36+
--dates|-d) DATES_RAW="$2"; shift 2 ;;
37+
--size|-s) SIZE="$2"; shift 2 ;;
38+
*) echo "Unknown option: $1"; usage ;;
39+
esac
40+
done
41+
42+
[[ -z "$CSVDIR" ]] && { echo "Error: --csvdir is required"; usage; }
43+
[[ -z "$DATES_RAW" ]] && { echo "Error: --dates is required"; usage; }
44+
45+
case "$SIZE" in
46+
small|medium|large|full) ;;
47+
*) echo "Error: --size must be one of: small, medium, large, full"; usage ;;
48+
esac
49+
50+
IFS=',' read -r -a DATEARRAY <<< "$DATES_RAW"
51+
52+
53+
LETTERS=$(get_letters "$SIZE")
54+
LETTERARRAY=($(eval echo "{${LETTERS:0:1}..${LETTERS:2:1}}"))

scripts/getCSVs.sh

Lines changed: 2 additions & 41 deletions
Original file line numberDiff line numberDiff line change
@@ -3,47 +3,11 @@
33
set -euo pipefail
44

55
script_dir=$(dirname "${BASH_SOURCE[0]}")
6-
7-
CSVDIR="$1"
8-
IFS=',' read -r -a DATEARRAY <<< "$2"
9-
SIZE="${SIZE:-full}"
6+
# shellcheck source=common.sh
7+
source "${script_dir}/common.sh"
108

119
readonly URLPREFIX="https://ftp.nyse.com/Historical%20Data%20Samples/DAILY%20TAQ/"
1210

13-
# BSD sed (macOS) requires an explicit empty string after -i; GNU sed does not accept it as a separate arg
14-
if [[ "$(uname -s)" == "Darwin" ]]; then
15-
SED_INPLACE=(sed -i '')
16-
else
17-
SED_INPLACE=(sed -i)
18-
fi
19-
20-
function get_letters () {
21-
SIZE=$1
22-
23-
readonly VALID_SIZES=("full" "large" "medium" "small")
24-
: "${SIZE:?Error: SIZE must be set to 'full', 'large', 'medium', or 'small'}"
25-
26-
if [[ ! " ${VALID_SIZES[*]} " =~ " ${SIZE} " ]]; then
27-
echo "Error: Unknown SIZE: $SIZE. Valid options are: ${VALID_SIZES[*]}"
28-
exit 1
29-
fi
30-
31-
case "$SIZE" in
32-
"full") LETTERS='A-Z' ;;
33-
"large") LETTERS='A-H' ;;
34-
"medium") LETTERS='I-I' ;;
35-
"small") LETTERS='Z-Z' ;;
36-
esac
37-
38-
echo ${LETTERS}
39-
}
40-
41-
42-
function getFilename() {
43-
local type=$1 letter=$2 date=$3
44-
echo "${type}_US_ALL_${letter}_${date}.gz"
45-
}
46-
4711
function get_CSVs () {
4812
local date=$1
4913
local letterarray=$2
@@ -108,9 +72,6 @@ function get_CSVs () {
10872
echo "NYSE TAQ CSV capture started."
10973
readonly start=$(date +%s)
11074

111-
LETTERS=$(get_letters $SIZE)
112-
LETTERARRAY=($(eval echo {${LETTERS:0:1}..${LETTERS:2:1}}))
113-
11475
for date in ${DATEARRAY[@]}; do
11576
get_CSVs $date $LETTERARRAY
11677
done

scripts/sort.sh

Lines changed: 42 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,42 @@
1+
#!/usr/bin/env bash
2+
3+
set -euo pipefail
4+
5+
script_dir=$(dirname "${BASH_SOURCE[0]}")
6+
# shellcheck source=common.sh
7+
source "${script_dir}/common.sh"
8+
9+
function sort_by_time () {
10+
local date=$1
11+
local f
12+
13+
f="${CSVDIR}/$(getFilename "EQY" "TRADE" "${date}")"
14+
f="${f%.*}.psv"
15+
echo "Sorting trade file by time: ${f}"
16+
{ head -1 "$f"; tail -n +2 "$f" | sort -t'|' -k1,1; } > "${f}.tmp" && mv "${f}.tmp" "$f"
17+
18+
local merged="${CSVDIR}/EQY_US_ALL_BBO_${date}.psv"
19+
head -n 1 $(ls ${CSVDIR}/SPLITS_US_ALL_BBO_[$LETTERS]_${date}.psv | head -n 1) > "$merged"
20+
local tmpdir
21+
tmpdir=$(mktemp -d "${CSVDIR}/.tmp_XXXXXX")
22+
for f in ${CSVDIR}/SPLITS_US_ALL_BBO_[$LETTERS]_${date}.psv; do
23+
echo "Sorting quote file by time: ${f}"
24+
tail -n +2 "$f" | sort -t'|' -k1,1 > "${tmpdir}/$(basename "$f")"
25+
rm $f
26+
done
27+
28+
echo "Merging quote files into: ${merged}"
29+
sort -m -t'|' -k1,1 "${tmpdir}"/* >> "$merged"
30+
rm -rf "$tmpdir"
31+
}
32+
33+
echo "NYSE TAQ CSV resort by time started."
34+
readonly start=$(date +%s)
35+
36+
for date in "${DATEARRAY[@]}"; do
37+
sort_by_time "$date"
38+
done
39+
40+
readonly end=$(date +%s)
41+
readonly duration=$((end - start))
42+
echo "TAQ data resort by time completed in ${duration} seconds."

0 commit comments

Comments
 (0)