This module provides high-performance utilities for parsing NYSE TAQ (Trade and Quote) PSV files and performing data transformations. Results can be:
- persisted into a date-partitioned kdb+ database (HDB)
- loaded into in-memory tables (RDB)
- replayed into a tickerplant (TP)
The module requires NYSE TAQ master, trade, and quote PSV files. These can be downloaded from the NYSE FTP server and extracted into a local directory.
Warning
The NYSE TAQ files are large. Downloading them may take a long time, depending on your network bandwidth, and the extracted files may require tens of gigabytes of disk space. Consider passing --size small to getCSVs.sh.
A utility script, getCSVs.sh (located in the scripts directory), is provided to automate the download and extraction process via curl. To download and unzip all available TAQ files to /tmp/nysetaqpsv:
# Extract available dates from the NYSE FTP
DATES=$(curl -s "https://ftp.nyse.com/Historical%20Data%20Samples/DAILY%20TAQ/" | grep -oE '"EQY_US_ALL_TRADE_[0-9]{8}\.gz"' | grep -oE '[0-9]{8}' | paste -sd, -)
./scripts/getCSVs.sh --csvdir /tmp/nysetaqpsv --dates "$DATES"
To manage disk space and bandwidth, you can restrict the download scope by:
- Limiting Dates: Target a specific date (e.g., the most recent available).
- Using the --size flag: Filter by symbol ranges with --size (or -s).
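Limiting the download to the most recent date can be sketched offline. The listing below is a made-up sample rather than the real NYSE page, and latest_date is a hypothetical helper name. Because YYYYMMDD strings sort chronologically, the sketch sorts the extracted dates instead of relying on the order of the listing.

```shell
#!/bin/sh
# Made-up sample of a directory listing, deliberately out of date order.
# The real NYSE page layout may differ.
LISTING='<a href="EQY_US_ALL_TRADE_20251002.gz">trade</a>
<a href="EQY_US_ALL_TRADE_20250930.gz">trade</a>
<a href="EQY_US_ALL_TRADE_20251001.gz">trade</a>
<a href="SOME_OTHER_FILE_20251003.gz">other</a>'

# Read a listing on stdin and print the most recent trade-file date.
# YYYYMMDD strings sort chronologically, so the last sorted line is the latest.
latest_date() {
  grep -oE 'EQY_US_ALL_TRADE_[0-9]{8}' | grep -oE '[0-9]{8}' | sort -u | tail -n 1
}

DATES=$(printf '%s\n' "$LISTING" | latest_date)
echo "$DATES"
```

With the real listing, the printf would be replaced by the curl command used elsewhere in this section.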
# Example: Download PSVs for only a single date
DATES=$(curl -s "https://ftp.nyse.com/Historical%20Data%20Samples/DAILY%20TAQ/" | grep -oE 'EQY_US_ALL_TRADE_2[0-9]{7}' | grep -oE '2[0-9]{7}' | head -n 1)
./scripts/getCSVs.sh --csvdir /tmp/nysetaqpsv --dates "$DATES" --size small
To replay NYSE TAQ data to a TP, input PSV files must be sorted by time. ./scripts/sort.sh sorts the trade file and each individual quote file using the sort command, then merges the sorted quote files into a single large file in linear time via sort -m.
Note:
sort requires temporary disk space during sorting. If the default /tmp directory has insufficient space, set the TMPDIR environment variable to a directory with more space before running the script.
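The sort-then-merge approach and the TMPDIR override can be tried on toy files. This is a minimal sketch, not the contents of sort.sh: the file names and the three-column layout (time, symbol, bid) are invented for illustration and do not reflect the real TAQ schema.

```shell
#!/bin/sh
# Toy stand-ins for two quote PSV files; the column layout (time|sym|bid)
# is invented for illustration and is not the real TAQ schema.
WORK=$(mktemp -d)
printf '%s\n' '093000|ZTS|10.1' '040000|ZBH|9.9' > "$WORK/quote_a.psv"
printf '%s\n' '100000|ZION|5.2' '050000|ZM|7.3' > "$WORK/quote_b.psv"

# sort writes temporary files to $TMPDIR (default /tmp) on large inputs;
# point it at a directory with enough free space. Here it is the work directory.
export TMPDIR="$WORK"

# Step 1: sort each file on the first pipe-separated field.
for f in quote_a quote_b; do
  sort -t '|' -k1,1 "$WORK/$f.psv" > "$WORK/$f.sorted"
done

# Step 2: merge the already-sorted files in a single linear pass.
sort -m -t '|' -k1,1 "$WORK/quote_a.sorted" "$WORK/quote_b.sorted" > "$WORK/quote.psv"
cat "$WORK/quote.psv"
# Remove the work directory with: rm -rf "$WORK"
```

The merge step only interleaves its inputs, which is why each file has to be sorted on the same key beforehand.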
./scripts/sort.sh --csvdir /tmp/nysetaqpsv --dates "$DATES" --size small
The following table estimates the data footprint by SIZE parameter:
| SIZE | Symbol Range (First Letter) | Uncompressed PSVs Size | Uncompressed HDB Size | Symbol Count | Quote Count |
|---|---|---|---|---|---|
| small | Z | ~10 GB | ~0.3 GB | 246 | 4,041,795 |
| medium | I | ~20 GB | ~8.1 GB | 1,313 | 125,442,373 |
| large | A-H | ~51 GB | ~36 GB | 10,693 | 516,394,615 |
| full | A-Z | ~133 GB | ~106 GB | 24,377 | 1,570,602,937 |
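Before downloading, the free space on the target filesystem can be compared against the PSV sizes in the table above. The following is a rough sketch: required_gb, free_gb, and has_room are hypothetical helper names, and the check covers only the uncompressed PSVs, so the HDB and any sort temporaries need additional room.

```shell
#!/bin/sh
# Approximate uncompressed PSV size in GB per SIZE, taken from the table above.
required_gb() {
  case "$1" in
    small)  echo 10 ;;
    medium) echo 20 ;;
    large)  echo 51 ;;
    full)   echo 133 ;;
    *) echo "unknown size: $1" >&2; return 1 ;;
  esac
}

# Free space, in whole GB, on the filesystem that holds directory $1.
# df -P prints one line per filesystem; column 4 is the available 1K blocks.
free_gb() {
  df -Pk "$1" | awk 'NR==2 { print int($4 / 1024 / 1024) }'
}

# Succeeds when the filesystem holding directory $2 has room for SIZE $1.
has_room() {
  need=$(required_gb "$1") || return 2
  [ "$(free_gb "$2")" -ge "$need" ]
}

if has_room small /tmp; then
  echo "enough space for --size small"
else
  echo "not enough space for --size small"
fi
```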
This module exposes three primary functions:
- parseToDisk: parses TAQ data and persists it into a date-partitioned HDB on disk.
- parseToMemory: parses TAQ data and loads it directly into in-memory tables, suitable for RDB-style workflows.
- parseToTP: parses TAQ data and sends batches asynchronously to a TP via .u.upd, as if data arrived from the exchange. Input PSV files must be sorted by time.
([parseToMemory; parseToDisk; parseToTP]): use `kx.taq
parseToDisk requires at least three parameters: the PSV directory, the date, and the destination HDB directory.
To create the trade, quote, master tables, and exnames dictionary for October 2, 2025, and save them to /tmp/kdbdb:
parseToDisk["/tmp/nysetaqpsv"; 2025.10.02; "/tmp/kdbdb"]
Once the data is generated, you can load it into a q session with 4 worker threads using the following command (or by \l in a running q session):
$ q /tmp/kdbdb -s 4
You can then execute standard q queries against the partitioned data:
/ Calculate total size by exchange names for the most recent date
q)asc select sum size by exch: exnames ex from trade where date=last date
exch | size
----------------------------------| ----------
New York Stock Exchange | 738731806
Long-Term Stock Exchange | 1364829
NASDAQ OMX PSX | 23555592
...
/ Perform an as-of join (aj) between trades and quotes
q)aj[`sym`time; select sym, time, price, size from trade where date=first date, sym in `MSFT`GOOG`AMZN; select sym, time, bid, ask from quote where date=first date]
sym time price size bid ask
---------------------------------------------------
AMZN 0D04:00:00.009709706 219.5 3 219.41 219.98
AMZN 0D04:00:00.010213563 219.7 3 219.41 219.98
AMZN 0D04:00:00.010379075 219.7 2 219.41 219.98
AMZN 0D04:00:00.010640417 219.98 100 219.41 219.98
...
parseToMemory requires at least two parameters; no destination path is needed because data is loaded directly into memory.
To load the trade, quote, master tables, and exnames dictionary for October 2, 2025, into memory:
(trade; quote; master; exnames): parseToMemory["/tmp/nysetaqpsv"; 2025.10.02]
The resulting trade and quote tables are sorted by time and carry a grouped attribute on sym, matching the layout of a typical RDB. Unlike the HDB tables produced by parseToDisk, in-memory tables do not include a date column.
/ Perform an as-of join (aj) between the in-memory trades and quotes
q)aj[`sym`time; select sym, time, price, size from trade where sym in `MSFT`GOOG`AMZN; select sym, time, bid, ask from quote]
sym time price size bid ask
---------------------------------------------------
AMZN 0D04:00:00.009709706 219.5 3 219.41 219.98
AMZN 0D04:00:00.010213563 219.7 3 219.41 219.98
AMZN 0D04:00:00.010379075 219.7 2 219.41 219.98
AMZN 0D04:00:00.010640417 219.98 100 219.41 219.98
...
Storing a full day of NYSE TAQ data in memory is RAM-intensive. The table below shows approximate memory requirements by SIZE parameter, measured against data from 2025.10.02.
| SIZE | Symbol Range (First Letter) | Memory Required |
|---|---|---|
| small | Z | ~2 GB |
| medium | I | ~14 GB |
| large | A-H | ~79 GB |
| full | A-Z | ~170 GB |
Input PSV files must be sorted by time for proper replay of the data. parseToTP requires the tickerplant address as a third parameter:
parseToTP["/tmp/nysetaqpsv"; 2025.10.02; `:localhost:5010; ([batchsize: 5000])]
The third parameter is passed directly to hopen, so you can also pass a port number if the TP runs on the same host:
parseToTP["/tmp/nysetaqpsv"; 2025.10.02; 5010; ([batchsize: 5000])]
parseToTP calls .u.upd on the remote q process with the table name as the first parameter and the records (as a table) as the second parameter. The simplest way to test this function is to start a q process on the provided port (5010) and define .u.upd as upsert:
$ q -p 5010
...
q).u.upd: upsert
parseToTP publishes quotes after trades. If you prefer simultaneous publication, start two kdb+ processes and use the optional tbls parameter to control which tables each process publishes.
All three functions accept an optional dictionary as their last argument to customize the ingestion process.
| Key | Default | Description |
|---|---|---|
| letters | "A-Z" | Restricts ingestion to symbols whose first letter falls within the specified range (e.g., "K-N"). |
| batchsize | 10 000 000 | Number of rows processed per chunk. Set to 0 to read the entire file in one pass for faster throughput if RAM permits. |
| tbls | `master`trade`quote | Tables to process. |
| includetestsymbols | 0b | If 1b, includes instruments flagged as test symbols in the master PSV. |
| logger | logger created by .logger.createLog[] of the KX log module | Logger used for status updates during the ingestion process. |

| Key | Default | Description |
|---|---|---|
| compparam | ([master: 0 0 0; trade: 0 0 0; quote: 0 0 0]), i.e. no compression | Table-specific compression settings for .z.zd. Example: ([master: 0 0 0; trade: 17 2 6; quote: 17 2 6]). Pass a dictionary of dictionaries to specify column-level compression. |
| linked | 0b | Set to 1b to add a linked column master to the trade and quote tables, linking via sym to the master table. |

| Key | Default | Description | Options |
|---|---|---|---|
| symattr | `g | Attribute for the sym column of the trade and quote tables. | Either `g or `p |
| sortcols | `time | Sort column or a list of sort columns for the trade and quote tables. | Any subset of `time`ex`sym`cond`corr`seq`source`participantTimestamp |
To sort by sym and apply a parted attribute on sym:
(trade; quote; master; exnames): parseToMemory["/tmp/nysetaqpsv"; 2025.10.02; ([letters: "Y-Z"; sortcols: `sym; symattr: `p])]
The original data is sorted by time within each symbol, so sorting by sym effectively means sorting by sym and then by time.
| Key | Default | Description |
|---|---|---|
| starttime | 0D04:00 | Records before starttime are ignored. |
- Multithreading: The PSV parsing engine is multithreaded. Start your ingestion process with the -s flag (e.g., q -s 8) to make use of available CPU cores.
- Memory Management: If you encounter memory pressure, reduce batchsize in the options dictionary. Conversely, increasing it (or setting it to 0) will speed up the process.
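One way to choose the -s value is to ask the system how many cores are online. This is a small sketch under the assumption that one thread per core is a reasonable starting point; the best value for a given machine may differ. It prints the command instead of running it, so it works without q installed.

```shell
#!/bin/sh
# Number of online CPU cores; falls back to 1 if getconf cannot report it.
CORES=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Print the command rather than executing it.
echo "q -s $CORES"
```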