Conversation


JStooke commented Feb 7, 2025


Key Changes

✨ Feature

  • Enable passing of bytes object instead of a file path
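A hedged usage sketch of this feature (the exact constructor behaviour is assumed from the description above, not verified against the code):

from access_parser import AccessParser

# Assumption: AccessParser now accepts raw bytes as well as a path string.
with open('example.accdb', 'rb') as f:
    db_bytes = f.read()

db = AccessParser(db_bytes)  # previously only a file path was accepted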

🐛 Fixes


🙏 Appreciation

Big thanks for this package! It's been incredibly useful, and I really appreciate the effort put into maintaining it. These changes have enabled me to utilise the package in various projects, and I hope they help!


The value is signed, so the struct format character was changed from Q to q.
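In struct terms the one-character change means the value is decoded as signed; a minimal illustration:

import struct

raw = struct.pack('<q', -1)                      # a signed 64-bit value on disk
assert struct.unpack('<Q', raw)[0] == 2**64 - 1  # 'Q' misreads it as unsigned
assert struct.unpack('<q', raw)[0] == -1         # 'q' decodes it correctly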
Deleting records from the database can leave "table_linked_pages" out of date. Each table's header holds a separate pointer to a page usage map, and these usage maps need to be parsed to identify all of the table's current data and free-space pages.

Various tweaks were required to make the page lists available to the relevant functions.

It is these owned pages that must be parsed to produce a complete result.
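A minimal sketch of decoding such a map, assuming the common inline-bitmap layout (one type byte, a 4-byte little-endian start page, then a bitmap where bit N marks page start_page + N as owned); the function name is illustrative rather than the library's API:

import struct

def pages_from_inline_usage_map(usage_map):
    # Type 0x00 is an inline bitmap; type 0x01 is a reference map that
    # points at further bitmap pages and needs extra lookups.
    if usage_map[0] != 0x00:
        raise NotImplementedError('reference maps need page lookups')
    (start_page,) = struct.unpack_from('<I', usage_map, 1)
    pages = []
    for byte_index, byte in enumerate(usage_map[5:]):
        for bit in range(8):
            if byte & (1 << bit):
                pages.append(start_page + byte_index * 8 + bit)
    return pages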

The check at the end of get_overflow_record was also preventing the real end of an overflow record from being identified, returning the rest of the page rather than just the tail of the record; it has been updated to find the end correctly.
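As a sketch of the corrected end detection (assumed names, not the actual implementation): a record should end at the next higher offset in the page's row-offset table, or at the page boundary, rather than running to the rest of the page.

def overflow_record_end(row_offsets, start, page_size=4096):
    # Rows are packed from the end of the page downwards, so this record
    # stops at the smallest offset that sits above its own start.
    later = [off for off in row_offsets if off > start]
    return min(later) if later else page_size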
It's possible for null tables, records, and table headers to disagree on column count. This can be down to columns being added to a table with existing data where the records themselves are not updated, or to a column being deleted from the table, in which case records present before the removal still carry a reference to it, and so on.

Needed to confirm the field count from the record unpack instead of relying on the header.
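A hedged illustration of that check: Jet4 stores a two-byte little-endian field count at the start of each data row (Jet3 uses a single byte), so the count can be read from the record itself rather than the table-definition header. The function name is illustrative.

import struct

def record_field_count(record, jet4=True):
    # Trust the row's own field count over the table-definition header.
    return struct.unpack_from('<H' if jet4 else '<B', record, 0)[0]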

Needed to tweak how has_value is set for null-table column-count mismatches.
For tables that have been changed a lot, the variable column location doesn't always match the index location in the column_map; the parsed column metadata does, however, include the variable_column_number to use instead.
Because fixed and variable columns are parsed in groups, the final output table matched the internal storage order rather than the order presented when the table is viewed in Access. A minor change was added to match the expected field order.
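A sketch of the kind of reordering involved (the attribute names name and column_index are illustrative, not the library's actual metadata fields): sort the parsed columns by their declared position before building the output.

def reorder_columns(parsed, columns):
    # 'columns' is parsed column metadata; 'column_index' stands in for
    # whichever field records the column's presented position in Access.
    ordered = sorted(columns, key=lambda col: col.column_index)
    return {col.name: parsed[col.name] for col in ordered}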
Because of historical delete references, the parse function would still run against an empty table; to avoid column-ordering issues, create_empty_table is called instead.

Btibert3 commented May 8, 2025

@JStooke I installed your branch and saw a new issue, which is promising!

ERROR:access_parser:Could not find table MSysObjects in DataBase

AttributeError Traceback (most recent call last)
in <cell line: 0>()
----> 1 db = AccessParser("IPEDS201617.accdb")

2 frames
/usr/local/lib/python3.11/dist-packages/access_parser/access_parser.py in parse_table(self, table_name)
189 :return defaultdict(list) with the parsed table -- table[column][row_index]
190 """
--> 191 return self.get_table(table_name).parse()
192
193 def print_database(self):

AttributeError: 'NoneType' object has no attribute 'parse'

Expanded usage map functionality to handle reference maps.

Streamlined DB header parsing.

Outstanding issues remain with the example DB for LVAL type 2.

Btibert3 commented May 13, 2025

I attempted to install the change from https://github.com/JStooke/access_parser/tree/jsdev-branch, and now I get this error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-2-e0dbd7fa4a09> in <cell line: 0>()
----> 1 from access_parser import AccessParser

1 frames
/usr/local/lib/python3.11/dist-packages/access_parser/access_parser.py in <module>
      8 from .parsing_primitives import parse_relative_object_metadata_struct, parse_table_head, parse_data_page_header, \
      9     ACCESSHEADER, MEMO, parse_table_data, TDEF_HEADER, LVPROP, parse_buffer_custom
---> 10 from .utils import categorize_pages, parse_type, TYPE_MEMO, TYPE_TEXT, TYPE_BOOLEAN, read_db_file, numeric_to_string, \
     11     TYPE_96_BIT_17_BYTES, TYPE_OLE, SimpleVarLenMetadata
     12 from .jetformat import BaseFormat, Jet3Format, PageTypes

ImportError: cannot import name 'SimpleVarLenMetadata' from 'access_parser.utils' (/usr/local/lib/python3.11/dist-packages/access_parser/utils.py)


I am testing on Google Colab.


Fine-tuned the table parser with the relevant masks.

Fixed an old reference to an invalid class.

JStooke commented May 16, 2025

Hi @Btibert3, sorry for not responding sooner. I've just committed some significant changes that, as far as I can see, enable the library to parse all the tables in your test Access DB. Some of the tables are pretty big, so they can take some time to export, but the exported contents all appear to match the DB content.

A little snippet of code below to give it a test:

from access_parser import AccessParser
import pandas as pd

accessfile = 'IPEDS201617.accdb'

db = AccessParser(accessfile)

for table in list(db.catalog.keys()):
    if any(exc in table for exc in ['MSys', 'f_']):
        continue  # skip system ('MSys') and filter ('f_') tables
    parsedTable = db.parse_table(table)
    tableDf = pd.DataFrame.from_dict(parsedTable)

    # write each parsed table out to its own CSV
    dest = accessfile + '_' + table + '_processed' + '.csv'
    tableDf.to_csv(dest, sep=',', na_rep='', index=False)

At some point it may be worth building a generator for the table parser, so it can stream rows to the calling function rather than having to wait for the entire table to complete before returning.
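Something along these lines, as a purely hypothetical sketch (parse_rows does not exist in the library; get_table does):

def iter_table_rows(db, table_name):
    # Hypothetical streaming variant of parse_table: yield one row at a
    # time instead of materialising the whole table before returning.
    table = db.get_table(table_name)
    for row in table.parse_rows():  # assumed per-row parsing hook
        yield row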

@Btibert3

Absolutely no apologies necessary, thank you for sticking with this. The file above was one of the few that gave me fits, and I can confirm that I can now parse the full database. Using Colab, it took ~19 minutes, but everything ran end to end.

Thank you for pinning down the issue!

JStooke added 3 commits May 23, 2025 13:54
…rior to current row were deleted.

Without that pointer, the end offset would return 0, which would typically be less than the start, so a blank row object would be parsed rather than the actual data. With the extra bit included in the mask, offsets like 53248 return 4096 (the page limit) instead, which is what we're after.
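The arithmetic behind that, as a quick sanity check (the mask value is assumed from the 4096-byte page size; it is not quoted from the code):

OFFSET_MASK = 0x1FFF                  # keep the low 13 bits of a row offset
assert 53248 & OFFSET_MASK == 4096    # 0xD000 -> 0x1000, the page limit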