Refactor AtomDB: Migrate Elements Data from CSV to HDF5 #153
Conversation
Hi @enjyashraf18, looks good to me. For the docstrings, please use NumPy style; see, for example, this docstring in the promolecule module (lines 104 to 123 in 464f116). It should be possible to automate this formatting in your IDE; for example, these are instructions for how to do it in VS Code.
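For reference, a NumPy-style docstring looks roughly like this (the function name and parameters here are hypothetical, not taken from the module):

```python
def nelec(atnum, charge):
    """Return the number of electrons of a species.

    Parameters
    ----------
    atnum : int
        Atomic number of the species.
    charge : int
        Charge of the species.

    Returns
    -------
    int
        Number of electrons (``atnum - charge``).
    """
    return atnum - charge
```

The key sections are ``Parameters`` and ``Returns``, each underlined with dashes, with one entry per argument giving its name, type, and a short description.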
Force-pushed 2d40a52 to b7bb79e
@enjyashraf18 Here is my small proof-of-concept, plus notes on the data tables:

```python
# import atexit
import os
import tempfile
import uuid

import tables


# This lives in memory and contains external links to datasets
class MemoryDB:
    _tempdir = tempfile.mkdtemp()

    def __init__(self):
        path = f"{os.path.join(MemoryDB._tempdir, uuid.uuid4().hex)}.h5"
        self.h5f = tables.open_file(
            path, "a", driver="H5FD_CORE", driver_core_backing_store=0
        )
        # atexit.register(self.h5f.close)

    def __del__(self):
        self.h5f.close()


# Species will reference this:
db = MemoryDB()


# Attaching datasets while handling version control:
def attach_dataset(dataset, version=-1):
    # If dataset not present, download it...
    if version == -1:
        # find latest version...
        version = 0
    db.h5f.create_external_link("/", dataset, f"dataset_v{version:02d}.h5:/")


attach_dataset("test", version=1)

# Database structure
# ##################
# /(root)
# ----/dataset1
# --------/001_001_001 (atomic no., charge, mult) (*)
# ------------properties...
# --------more species...
# ----more datasets...
# ...
# ----/elements (†)
# --------properties....
```

Species are a thin wrapper around the table denoted by (*) above and the element data (†). I think we should leave the property accessors as dict lookups (i.e. `species['nelec']` vs `species.nelec`) in order to communicate that they are PyTables data tables, so that the PyTables querying functions can be used transparently without needing to write bindings, and so that an appropriate `KeyError` is raised when data is not present (some species data is incomplete).

I will make a class for this once the pieces have been coded.
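The dict-lookup semantics proposed above can be sketched without PyTables; in this illustrative stand-in (all names hypothetical), a plain dict replaces the species table so the `KeyError` behaviour is easy to see:

```python
class SpeciesSketch:
    """Minimal stand-in for the planned Species wrapper.

    A plain dict replaces the PyTables table here so the accessor
    semantics are visible without an HDF5 file.
    """

    def __init__(self, data):
        self._data = data

    def __getitem__(self, key):
        # Dict-style lookup: a missing property raises KeyError,
        # signalling to the caller that this species' data is incomplete.
        if key not in self._data:
            raise KeyError(key)
        return self._data[key]


sp = SpeciesSketch({"nelec": 10, "mult": 1})
print(sp["nelec"])  # → 10
```

In the real wrapper, `__getitem__` would instead look up a node in the HDF5 group, translating a missing-node error into the same `KeyError`.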
```python
import csv
import tables as pt
import numpy as np
from importlib_resources import \
```
Correct the import statement in L4; it should be a single-line statement.

Hi @enjyashraf18, after this you'll still need to check and correct the broken tests for the different datasets.
This is good for now!
```python
return None
```

```python
# open the HDF5 file in read mode
with pt.open_file(elements_hdf5_file, mode="r") as h5file:
```
Use the global database instead of opening/closing the file each time (for the future, when the global database is implemented).
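The suggested pattern could be sketched with a cached open handle; here `functools.lru_cache` memoizes the open call, and a plain `open()` stands in for `tables.open_file` (the name `get_db` is hypothetical):

```python
import functools


@functools.lru_cache(maxsize=None)
def get_db(path):
    # Opened once on first call, then the same handle is reused by
    # every caller; with PyTables this line would instead be
    # tables.open_file(path, mode="r").
    return open(path, "r")
```

Every call with the same path returns the identical handle, so the file is never reopened per property lookup.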
atomdb/species.py (Outdated)

```python
if table.nrows == 1:
    value = table[0]["value"]
    # if the value is an int, return it as an int
    if isinstance(value_col, pt.Int32Col):
```
@enjyashraf18 Check the type of the `value` variable instead of `value_col`, and try using the `Integral` data type from Python's `numbers` library, for example as is done here (line 417 in 464f116):

```python
if not isinstance(spinpol, Integral):
```
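An `Integral` check on the value itself accepts any integer-like type while excluding floats; a small sketch of the idea (the helper name `as_scalar` is illustrative):

```python
from numbers import Integral


def as_scalar(value):
    # Return integer-like values as plain Python int, everything else
    # unchanged; this mirrors checking the type of the value itself
    # rather than the column descriptor.
    if isinstance(value, Integral):
        return int(value)
    return value


print(as_scalar(5))    # → 5
print(as_scalar(2.5))  # → 2.5
```

Because NumPy registers its integer scalar types with `numbers.Integral`, the same check also covers values read out of a PyTables `Int32Col`.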
Force-pushed 0d78b49 to 22285b2
To address issue #147, a structured HDF5 file has been created to store periodic data. This part introduces a migration script that reads atomic data from two CSV files and stores the information in an organized HDF5 schema. The updates include:

- `scalar` method, which now reads periodic data directly from the HDF5 file.
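As a rough sketch of the migration step (the column names, table layout, and function name below are assumptions for illustration, not the actual AtomDB schema), CSV rows can be written into an HDF5 table with PyTables like so:

```python
import csv
import io

import tables as pt


def migrate_csv_to_hdf5(csv_text, h5_path):
    """Write rows of an elements CSV (atnum,symbol,mass) into an HDF5 table."""

    # Hypothetical row layout; the real schema has many more properties.
    class Element(pt.IsDescription):
        atnum = pt.Int32Col()
        symbol = pt.StringCol(3)
        mass = pt.Float64Col()

    with pt.open_file(h5_path, mode="w") as h5f:
        table = h5f.create_table("/", "elements", Element)
        row = table.row
        for rec in csv.DictReader(io.StringIO(csv_text)):
            row["atnum"] = int(rec["atnum"])
            row["symbol"] = rec["symbol"]
            row["mass"] = float(rec["mass"])
            row.append()
        table.flush()
```

Storing the data as a typed PyTables table (rather than raw CSV) is what lets the `scalar` accessor query properties directly, with types enforced at write time.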