Conversation


JStooke commented Feb 7, 2025


Key Changes

✨ Feature

  • Enable passing of bytes object instead of a file path
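A hedged usage sketch of this feature (the exact constructor behaviour is assumed from the description above, not verified against the code):

from access_parser import AccessParser

# Assumption: AccessParser now accepts raw bytes as well as a path string.
with open('example.accdb', 'rb') as f:
    db_bytes = f.read()

db = AccessParser(db_bytes)  # previously only a file path was accepted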

🐛 Fixes


🙏 Appreciation

Big thanks for this package! It's been incredibly useful, and I really appreciate the effort put into maintaining it. These changes have enabled me to utilise the package in various projects, and I hope they help!


The value is signed, so the struct format character was changed from Q to q.
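In struct terms the one-character change means the value is decoded as signed; a minimal illustration:

import struct

raw = struct.pack('<q', -1)                      # a signed 64-bit value on disk
assert struct.unpack('<Q', raw)[0] == 2**64 - 1  # 'Q' misreads it as unsigned
assert struct.unpack('<q', raw)[0] == -1         # 'q' decodes it correctly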
Deleting records from the database can leave "table_linked_pages" out of date. Each table's header holds a separate pointer to a page usage map, and these usage maps need to be parsed to identify all of the table's current data and free-space pages.

Various tweaks were required to make the page lists available to the relevant functions.

It is these owned pages that must be parsed to produce a complete result.
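A minimal sketch of decoding such a map, assuming the common inline-bitmap layout (one type byte, a 4-byte little-endian start page, then a bitmap where bit N marks page start_page + N as owned); the function name is illustrative rather than the library's API:

import struct

def pages_from_inline_usage_map(usage_map):
    # Type 0x00 is an inline bitmap; type 0x01 is a reference map that
    # points at further bitmap pages and needs extra lookups.
    if usage_map[0] != 0x00:
        raise NotImplementedError('reference maps need page lookups')
    (start_page,) = struct.unpack_from('<I', usage_map, 1)
    pages = []
    for byte_index, byte in enumerate(usage_map[5:]):
        for bit in range(8):
            if byte & (1 << bit):
                pages.append(start_page + byte_index * 8 + bit)
    return pages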

The check at the end of get_overflow_record was also preventing the real end of an overflow record from being identified, returning the rest of the page rather than just the tail of the record; it has been updated to find the end correctly.
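As a sketch of the corrected end detection (assumed names, not the actual implementation): a record should end at the next higher offset in the page's row-offset table, or at the page boundary, rather than running to the rest of the page.

def overflow_record_end(row_offsets, start, page_size=4096):
    # Rows are packed from the end of the page downwards, so this record
    # stops at the smallest offset that sits above its own start.
    later = [off for off in row_offsets if off > start]
    return min(later) if later else page_size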
It's possible for null tables, records, and table headers to disagree on column count. This can be down to columns being added to a table with existing data where the records themselves are not updated, or to a column being deleted from the table, in which case records present before the removal still carry a reference to it, and so on.

Needed to confirm the field count from the record unpack instead of relying on the header.
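A hedged illustration of that check: Jet4 stores a two-byte little-endian field count at the start of each data row (Jet3 uses a single byte), so the count can be read from the record itself rather than the table-definition header. The function name is illustrative.

import struct

def record_field_count(record, jet4=True):
    # Trust the row's own field count over the table-definition header.
    return struct.unpack_from('<H' if jet4 else '<B', record, 0)[0]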

Needed to tweak how has_value is set for null-table column-count mismatches.
For tables that have been changed a lot, the variable column location doesn't always match the index location in the column_map; the parsed column metadata does, however, include the variable_column_number to use instead.
Because fixed and variable columns are parsed in groups, the final output table matched the internal storage order rather than the order presented when the table is viewed in Access. A minor change was added to match the expected field order.
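A sketch of the kind of reordering involved (the attribute names name and column_index are illustrative, not the library's actual metadata fields): sort the parsed columns by their declared position before building the output.

def reorder_columns(parsed, columns):
    # 'columns' is parsed column metadata; 'column_index' stands in for
    # whichever field records the column's presented position in Access.
    ordered = sorted(columns, key=lambda col: col.column_index)
    return {col.name: parsed[col.name] for col in ordered}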
Because of historical delete references, the parse function would still run against an empty table; to avoid column-ordering issues, create_empty_table is called instead.

Btibert3 commented May 8, 2025

@JStooke I installed your branch and saw a new issue, which is promising!

ERROR:access_parser:Could not find table MSysObjects in DataBase

AttributeError Traceback (most recent call last)
in <cell line: 0>()
----> 1 db = AccessParser("IPEDS201617.accdb")

2 frames
/usr/local/lib/python3.11/dist-packages/access_parser/access_parser.py in parse_table(self, table_name)
189 :return defaultdict(list) with the parsed table -- table[column][row_index]
190 """
--> 191 return self.get_table(table_name).parse()
192
193 def print_database(self):

AttributeError: 'NoneType' object has no attribute 'parse'

Expanded usage map functionality to handle reference maps.

Streamlined DB header parsing.

Outstanding issues remain with the example DB for LVAL type 2.

Btibert3 commented May 13, 2025

I attempted to install the change from https://github.com/JStooke/access_parser/tree/jsdev-branch, and now I get this error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-2-e0dbd7fa4a09> in <cell line: 0>()
----> 1 from access_parser import AccessParser

1 frames
/usr/local/lib/python3.11/dist-packages/access_parser/access_parser.py in <module>
      8 from .parsing_primitives import parse_relative_object_metadata_struct, parse_table_head, parse_data_page_header, \
      9     ACCESSHEADER, MEMO, parse_table_data, TDEF_HEADER, LVPROP, parse_buffer_custom
---> 10 from .utils import categorize_pages, parse_type, TYPE_MEMO, TYPE_TEXT, TYPE_BOOLEAN, read_db_file, numeric_to_string, \
     11     TYPE_96_BIT_17_BYTES, TYPE_OLE, SimpleVarLenMetadata
     12 from .jetformat import BaseFormat, Jet3Format, PageTypes

ImportError: cannot import name 'SimpleVarLenMetadata' from 'access_parser.utils' (/usr/local/lib/python3.11/dist-packages/access_parser/utils.py)


I am testing on Google Colab.


Fine-tuned the table parser with the relevant masks.

Fixed an old reference to an invalid class.

JStooke commented May 16, 2025

Hi @Btibert3, sorry for not responding sooner. I've just committed some significant changes that, as far as I can see, enable the library to parse all the tables in your test Access DB. Some of the tables are pretty big, so they can take some time to export, but the exported contents all appear to match the DB content.

A little snippet of code below to give it a test:

from access_parser import AccessParser
import pandas as pd

accessfile = 'IPEDS201617.accdb'

db = AccessParser(accessfile)

for table in list(db.catalog.keys()):
    if any(exc in table for exc in ['MSys', 'f_']):
        continue  # skip system ('MSys') and filter ('f_') tables
    parsedTable = db.parse_table(table)
    tableDf = pd.DataFrame.from_dict(parsedTable)

    # write each parsed table out to its own CSV
    dest = accessfile + '_' + table + '_processed' + '.csv'
    tableDf.to_csv(dest, sep=',', na_rep='', index=False)

At some point it may be worth building a generator for the table parser, so it can stream rows to the calling function rather than having to wait for the entire table to complete before returning.
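Something along these lines, as a purely hypothetical sketch (parse_rows does not exist in the library; get_table does):

def iter_table_rows(db, table_name):
    # Hypothetical streaming variant of parse_table: yield one row at a
    # time instead of materialising the whole table before returning.
    table = db.get_table(table_name)
    for row in table.parse_rows():  # assumed per-row parsing hook
        yield row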

@Btibert3

Absolutely no apologies necessary, thank you for sticking with this. The file above was one of the few that gave me fits, and I can confirm that I can now parse the full database. Using Colab, it took ~19 minutes, but everything ran end to end.

Thank you for pinning down the issue!

JStooke added 3 commits May 23, 2025 13:54
…rior to current row were deleted.

Without that pointer, the end offset would return 0, which would typically be less than the start, so a blank row object would be parsed rather than the actual data. With the extra bit included in the mask, offsets like 53248 return 4096 (the page limit) instead, which is what we're after.
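The arithmetic behind that, as a quick sanity check (the mask value is assumed from the 4096-byte page size; it is not quoted from the code):

OFFSET_MASK = 0x1FFF                  # keep the low 13 bits of a row offset
assert 53248 & OFFSET_MASK == 4096    # 0xD000 -> 0x1000, the page limit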