docs/reference.md
+62 -15 (62 additions & 15 deletions)
@@ -1,38 +1,52 @@
1
1
# NYSE TAQ Data Loader module
2
2
3
-
This module provides high-performance utilities for parsing [NYSE TAQ (Trade and Quote) PSV files](https://ftp.nyse.com/Historical%20Data%20Samples/DAILY%20TAQ/), performing data transformations, and loading the results either into a date-partitioned kdb+ database (HDB) or directly into in-memory tables (RDB).
3
+
This module provides high-performance utilities for parsing [NYSE TAQ (Trade and Quote) PSV files](https://ftp.nyse.com/Historical%20Data%20Samples/DAILY%20TAQ/) and performing data transformations. Results can be:
4
+
5
+
- persisted into a date-partitioned kdb+ database (HDB)
6
+
- loaded into in-memory tables (RDB)
7
+
- replayed into a [tickerplant](https://code.kx.com/q/architecture/tickq/) (TP)
4
8
5
9
## Prerequisites: Input PSV Files
6
10
7
11
The module requires NYSE TAQ master, trade, and quote PSV files. These can be downloaded from the [NYSE FTP server](https://ftp.nyse.com/Historical%20Data%20Samples/DAILY%20TAQ/) and extracted into a local directory.
8
12
9
13
> [!WARNING]
10
-
> The NYSE TAQ files are large. Depending on your network bandwidth, downloading them may take a long time and may require tens of gigabytes of disk space. Consider passing `SIZE=small` to `getCSVs.sh`.
14
+
> The NYSE TAQ files are large. Depending on your network bandwidth, downloading them may take a long time and may require tens of gigabytes of disk space. Consider passing `--size small` to `getCSVs.sh`.
11
15
12
-
A utility script, `getCSVs.sh` (located in directory `scripts`), is provided to automate the download and extraction process via `curl`. To download and unzip all available TAQ files to `/tmp/nysetaqpsv`:
16
+
A utility script, `getCSVs.sh` (located in the `scripts` directory), is provided to automate the download and extraction process via `curl`. To download and unzip all available TAQ files to `/tmp/nysetaqpsv`:
./scripts/getCSVs.sh --csvdir /tmp/nysetaqpsv --dates "$DATES" --size small
33
+
```
34
+
35
+
### For replay: sorting by time
36
+
37
+
To replay NYSE TAQ data to a TP, the input PSV files must be sorted by time. `./scripts/sort.sh` sorts the trade file and each individual quote file using the [sort](https://man7.org/linux/man-pages/man1/sort.1.html) command, then merges the sorted quote files into a single large file in linear time via `sort -m`.
38
+
39
+
> **Note:** `sort` requires temporary disk space during sorting. If the default `/tmp` directory has insufficient space, set the `TMPDIR` environment variable to a directory with more space before running the script.
40
+
41
+
```bash
42
+
./scripts/sort.sh --csvdir /tmp/nysetaqpsv --dates "$DATES" --size small
29
43
```
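
The sort-then-merge approach described above can be illustrated on toy data. This is only a sketch of the general technique, not the contents of `sort.sh`: the file names, the three-field layout, and the assumption that the time is the first pipe-separated field are all hypothetical, and real TAQ files may need extra handling (for example header lines) that is not shown here.

```shell
#!/usr/bin/env bash
set -euo pipefail
export LC_ALL=C                      # byte-wise ordering, so results are reproducible

# Hypothetical scratch directory; also used for sort's temporary files.
workdir="$(mktemp -d)"
export TMPDIR="$workdir"

# Two tiny pipe-separated files, time in the first field (assumed layout).
printf '093000500|B|10.2\n093000100|A|10.1\n' > "$workdir/quotes_1.psv"
printf '093000400|D|20.2\n093000200|C|20.1\n' > "$workdir/quotes_2.psv"

# Step 1: sort each file independently on the first field.
for f in "$workdir/quotes_1.psv" "$workdir/quotes_2.psv"; do
  sort -t '|' -k1,1 -o "$f.sorted" "$f"
done

# Step 2: merge the already-sorted files. With -m, sort only interleaves
# its inputs instead of sorting them again.
sort -m -t '|' -k1,1 -o "$workdir/quotes_merged.psv" "$workdir"/quotes_*.psv.sorted

cat "$workdir/quotes_merged.psv"
```

Because the time field in this toy data is a fixed-width digit string, plain lexical comparison under `LC_ALL=C` gives chronological order. Exporting `TMPDIR` before calling `sort` is the same mechanism the note above relies on when `/tmp` is too small.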
30
44
31
45
### Dataset Statistics (Reference: 2025.07.01)
32
46
33
47
The following table estimates the data footprint by `SIZE` parameter:
34
48
35
-
|`SIZE`| Symbol Range (First Letter) | Uncompressed PSVs Size | Uncompressed HDB Size | Symbol Nr| Quote Nr|
49
+
|`SIZE`| Symbol Range (First Letter) | Uncompressed PSVs Size | Uncompressed HDB Size | Symbol Count| Quote Count|
@@ -41,20 +55,21 @@ The following table estimates the data footprint by `SIZE` parameter:
41
55
42
56
## Quickstart
43
57
44
-
This module exposes two primary functions:
58
+
This module exposes three primary functions:
45
59
46
60
- **`parseToDisk`**: parses TAQ data and persists it into a date-partitioned HDB on disk.
47
61
- **`parseToMemory`**: parses TAQ data and loads it directly into in-memory tables, suitable for RDB-style workflows.
62
+
- **`parseToTP`**: parses TAQ data and sends batches asynchronously to a TP via `.u.upd`, as if the data arrived from the exchange. Input PSV files must be [sorted by time](#for-replay-sorting-by-time).
48
63
49
64
```q
50
-
([parseToMemory; parseToDisk]): use `kx.taq
65
+
([parseToMemory; parseToDisk; parseToTP]): use `kx.taq
51
66
```
52
67
53
68
### parseToDisk
54
69
55
70
`parseToDisk` requires at least three parameters.
56
71
57
-
To create the `trade`, `quote`, `master` tables and `exnames` dictionary for October 2, 2025, and save them to `/tmp/kdbdb`:
72
+
To create the `trade`, `quote`, `master` tables, and `exnames` dictionary for October 2, 2025, and save them to `/tmp/kdbdb`:
@@ -122,17 +137,42 @@ Storing a full day of NYSE TAQ data in memory is RAM-intensive. The table below
122
137
|`large`| A-H |~79 GB |
123
138
|`full`| A-Z |~170 GB |
124
139
140
+
### parseToTP
141
+
142
+
Input PSV files must be [sorted by time](#for-replay-sorting-by-time) for proper replay of the data. `parseToTP` requires the tickerplant address as a third parameter:
The third parameter is passed directly to [hopen](https://code.kx.com/kdb-x/ref/hopen.html), so you can also pass a port number if the TP is on the same host:
`parseToTP` calls `.u.upd` on the remote q process with the table name as the first parameter and the records (as a table) as the second parameter. The simplest way to test this function is to start a q process listening on the port used above (5010) and define `.u.upd` as `upsert`:
155
+
156
+
```bash
157
+
$ q -p 5010
158
+
...
159
+
q).u.upd: upsert
160
+
```
161
+
162
+
`parseToTP` publishes quotes after trades. If you prefer simultaneous publication, start two kdb+ processes and use the optional `tbls` parameter to control which tables each process publishes.
163
+
125
164
## Configuration Options
126
165
127
-
Both `parseToDisk` and `parseToMemory` accept an optional dictionary as their last argument to customize the ingestion process.
166
+
All three functions accept an optional dictionary as their last argument to customize the ingestion process.
128
167
129
168
### Common parameters
130
169
131
170
| Key | Default | Description |
132
171
| --- | ---: | --- |
133
172
|`letters`|`"A-Z"`| Restricts ingestion to symbols whose first letter falls within the specified range (e.g., `"K-N"`). |
134
-
|`includetestsymbols`|`0b`| If `1b`, includes instruments flagged as test symbols in the `master` PSV. |
135
173
|`batchsize`|`10 000 000`| Number of rows processed per chunk. Set to `0` to read the entire file in one pass for faster throughput if RAM permits. |
174
+
|`tbls`|``` `master`trade`quote ```| Tables to process. |
175
+
|`includetestsymbols`|`0b`| If `1b`, includes instruments flagged as test symbols in the `master` PSV. |
136
176
|`logger`| logger created by `.logger.createLog[]` of the [KX log module](https://code.kx.com/kdb-x/modules/logging/overview.html)| Logger used for status updates during the ingestion process. |
137
177
138
178
### parseToDisk extra parameters
@@ -154,11 +194,18 @@ Both `parseToDisk` and `parseToMemory` accept an optional dictionary as their la
154
194
Sorting by `sym` and having a parted attribute on `sym`:
The original data is sorted by time within each symbol, so sorting by `sym` actually means sorting by `sym` and `time`.
161
201
202
+
### parseToTP extra parameters
203
+
204
+
| Key | Default | Description |
205
+
| --- | ---: | --- |
206
+
|`starttime`|`0D04:00`| Records before `starttime` are ignored. |
207
+
208
+
162
209
## Performance Notes
163
210
164
211
* **Multithreading**: The PSV parsing engine is multithreaded. Start your ingestion process with the `-s` flag (e.g., `q -s 8`) to make use of available CPU cores.
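
One possible way to choose the `-s` value is to derive it from the number of online CPU cores. The sketch below only prints the resulting command line rather than starting q. Using one secondary thread per core is an assumption made for illustration, not a recommendation from this module, and the variable names are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Number of CPU cores currently online (getconf is available on Linux and macOS).
threads="$(getconf _NPROCESSORS_ONLN)"

# Build the command line; one secondary thread per core is an assumption.
cmd="q -s $threads"

# Print instead of executing, so the sketch runs without q installed.
echo "$cmd"
```

On a shared machine, a smaller value may be more appropriate so that ingestion does not starve other processes.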