Text parsing overhaul, bug fixes, quality of life improvements. #33
Conversation
The value is signed, so the struct format was changed from Q (unsigned) to q (signed).
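For context, a minimal illustration of the difference (standalone example, not the library's own code): struct's Q format reads the bytes as an unsigned 64-bit integer, while q interprets the same bytes as signed.

```python
import struct

raw = (-2).to_bytes(8, "little", signed=True)   # bytes of a negative 64-bit value
print(struct.unpack("<Q", raw)[0])              # 18446744073709551614 (wrapped around)
print(struct.unpack("<q", raw)[0])              # -2 (correct signed interpretation)
```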
If records have been deleted from the database, the "table_linked_pages" list can be out of date. The header holds a separate pointer for each table to a page usage map, and these usage maps need to be parsed to identify all of the table's current data and free-space pages; various tweaks were required to make those page lists available to the relevant functions. It is these owned pages that must be parsed to ensure a complete result is generated. Also, the check at the end of get_overflow_record was preventing the real end of the overflow record from being identified, returning the rest of the page rather than just the end of the record; it has been updated to identify the end correctly.
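As a rough illustration of the usage-map idea (a simplified sketch, not the library's actual implementation; real JET usage maps also have a non-bitmap "reference" form and a start-page field), a page usage map can be read as a bitmap where bit N set means page N is owned by the table:

```python
def pages_from_bitmap(bitmap: bytes, start_page: int = 0) -> list[int]:
    """Expand a usage-map bitmap into the list of page numbers it marks as owned."""
    pages = []
    for byte_index, byte in enumerate(bitmap):
        for bit in range(8):
            if byte & (1 << bit):
                pages.append(start_page + byte_index * 8 + bit)
    return pages

# Example: 0b00000101 marks pages 0 and 2 as owned by the table
print(pages_from_bitmap(b"\x05"))  # [0, 2]
```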
It's possible for the null table, the records, and the table header to disagree on column count. This can happen when columns are added to a table with existing data but the old records themselves are not updated, or when a column is deleted from the table, in which case records written before the removal still reference it. The field count therefore needs to be confirmed from the record unpack instead of relying on the header, and how has_value is set needed tweaking to cope with null-table column-count mismatches.
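A hedged sketch of the general approach (illustrative names, not the library's actual functions): trust the field count carried by the record itself, and treat any column beyond that count as having no value rather than indexing past the record's null bitmap:

```python
def effective_field_count(record_field_count: int, header_field_count: int) -> int:
    """Columns may have been added or removed after this record was written, so the
    per-record count, not the table header, says how many fields can be unpacked."""
    return min(record_field_count, header_field_count)

def has_value(null_table: list[bool], column_index: int, record_field_count: int) -> bool:
    """A column beyond the record's own field count was added later and holds no data;
    otherwise consult the record's null bitmap (True meaning a value is present)."""
    if column_index >= record_field_count:
        return False
    return null_table[column_index]
```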
For tables that have been changed a lot, a variable column's location doesn't always match its index in the column_map; the parsed column metadata does, however, include the variable_column_number that should be used instead.
Because fixed and variable columns are parsed in separate groups, the final output table matched the internal storage order rather than the order presented when viewed in Access. A minor change reorders the output to match the expected field order.
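To illustrate the reordering (a minimal sketch with assumed structures and names, not the actual code): since fixed-length and variable-length columns are unpacked in separate passes, the parsed values can be re-keyed by each column's declared index so the output matches the order Access displays:

```python
def reorder_columns(parsed: dict[str, list], column_index: dict[str, int]) -> dict[str, list]:
    """Return the parsed table with columns sorted by their declared (displayed) index
    rather than by the internal fixed-then-variable storage order."""
    return {name: parsed[name] for name in sorted(parsed, key=lambda n: column_index[n])}

parsed = {"zip": ["02138"], "id": [1], "name": ["Example"]}   # internal storage order
column_index = {"id": 0, "name": 1, "zip": 2}                 # order shown in Access
print(list(reorder_columns(parsed, column_index)))            # ['id', 'name', 'zip']
```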
Because of historical delete references, the parse function would still run against an empty table; to avoid column-ordering issues, create_empty_table is called instead.
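A minimal illustration of the idea (hypothetical signature, not necessarily the library's): an empty table still needs all of its column headings in the result, just with no rows, and skipping the row parser entirely means stale delete references cannot mis-order the columns:

```python
def create_empty_table(column_names: list[str]) -> dict[str, list]:
    """Build the result structure for a table whose pages hold no live records."""
    return {name: [] for name in column_names}
```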
@JStooke I installed your branch and saw a new issue, which is promising!
ERROR:access_parser:Could not find table MSysObjects in DataBase
AttributeError Traceback (most recent call last)
2 frames
AttributeError: 'NoneType' object has no attribute 'parse'
Expanded usage map functionality to handle reference maps; streamlined db header parsing. Outstanding issues remain with the example db for LVAL type 2.
I attempted to install the change from https://github.com/JStooke/access_parser/tree/jsdev-branch, and now I get this error: I am testing on Google Colab.
Fine-tuned the table parser with the relevant masks. Fixed an old reference to an invalid class.
Hi @Btibert3, sorry for not responding sooner. I've just committed some significant changes that, as far as I can see, enable the library to parse all the tables in your test Access db. Some of the tables are pretty big, so they can take some time to export, but the exported contents all appear to match the db content. A little snippet of code below to give it a test:

from access_parser import AccessParser
import pandas as pd

accessfile = 'IPEDS201617.accdb'
db = AccessParser(accessfile)
for table in list(db.catalog.keys()):
    if any(exc in table for exc in ['MSys', 'f_']):
        continue  # ignore system tables
    parsedTable = db.parse_table(table)
    tableDf = pd.DataFrame.from_dict(parsedTable)
    # output to csv
    dest = (accessfile + '_' + table + '_processed' + '.csv')
    tableDf.to_csv(dest, sep=',', na_rep='', index=False)

At some point it may be worth building a generator for the table parser, so it can stream rows to the calling function rather than having to wait for the entire table to complete before returning.
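As a rough sketch of that streaming idea (a hypothetical wrapper around the existing parse_table API, not a change to the library itself; a true generator would yield rows page by page inside the parser instead):

```python
from typing import Iterator

def iter_table_rows(db, table_name: str) -> Iterator[dict]:
    """Yield row dicts one at a time so callers can write them out incrementally,
    rather than holding the whole table's output in memory at once."""
    parsed = db.parse_table(table_name)      # existing column -> list-of-values dict
    columns = list(parsed)
    for values in zip(*parsed.values()):
        yield dict(zip(columns, values))
```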
Absolutely no apologies necessary, thank you for sticking with this. The file above was one of the few that gave me fits, and I can confirm that I can now parse that full database. Using Colab, it took ~19 minutes, but everything ran end to end. Thank you for pinning down the issue!
…rior to the current row were deleted. Without that pointer the end offset would return 0, which would typically be less than the start, and a blank row object would be parsed rather than the actual data. With the extra bit, offsets like 53248 return 4096 (the page limit) instead, which is what we're after.
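To illustrate the masking arithmetic (mask names are illustrative, not the parser's actual constants): 53248 is 0xD000, i.e. flag bits set above the real in-page offset. Keeping only 12 bits masks it to 0, while keeping the extra 13th bit yields the intended 4096:

```python
raw_offset = 53248            # 0xD000: flag bits set above the real offset
print(raw_offset & 0x0FFF)    # 0     -> end < start, row parses as blank
print(raw_offset & 0x1FFF)    # 4096  -> end of record at the page limit, as intended
```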

Key Changes
✨ Feature
Support passing a bytes object instead of a file path
🐛 Fixes
(linked_pages from headers was often inaccurate)
🙏 Appreciation
Big thanks for this package! It's been incredibly useful, and I really appreciate the effort put into maintaining it. These changes have enabled me to utilise the package in various projects, and I hope they help!