Skip to content

Conversation

@enjyashraf18
Copy link
Collaborator

@enjyashraf18 enjyashraf18 commented Jun 17, 2025

To address issue #147, a structured HDF5 file has been created to store periodic data.

This part introduces a migration script that reads atomic data from two CSV files, and stores the information in an organized HDF5 schema. The updates include:

  • Table schema definitions for element data.
  • Clear documentation for the migration process.
  • An updated scalar method that now reads periodic data directly from the HDF5 file.

@gabrielasd
Copy link
Collaborator

Hi @enjyashraf18 looks good to me.
Let's just make a final adjustment to the format of the docstrings after which you can move forward with refactoring the periodic module.

For the docstrings, please use Numpy's style, see for example this docstring in the promolecule module

r"""
Compute the electron density of the promolecule at the desired points.
Parameters
----------
points: np.ndarray((N, 3), dtype=float)
Points at which to compute the density.
spin: ('t' | 'a' | 'b' | 'm'), default='t'
Type of density to compute; either total, alpha-spin, beta-spin,
or magnetization density.
log: bool, default=False
Whether to compute the log of the density instead of the density.
May be slightly more accurate.
Returns
-------
density: np.ndarray((N,), dtype=float)
Density evaluated at N points.
"""

It should be possible to automate this formatting in your IDE, for example these are instructions for how to do it in VS Code

@msricher
Copy link
Collaborator

@enjyashraf18 Here is my small proof-of-concept + notes on the data tables:

# import atexit
import os
import tempfile
import uuid

import tables


# This lives in memory and contains external links to datasets
class MemoryDB:
    _tempdir = tempfile.mkdtemp()

    def __init__(self):
        path = f"{os.path.join(MemoryDB._tempdir, uuid.uuid4().hex)}.h5"
        self.h5f = tables.open_file(path, "a", driver="H5FD_CORE", driver_core_backing_store=0)
        # atexit.register(self.h5f.close)

    def __del__(self):
        self.h5f.close()


# Species will reference this:
db = MemoryDB()


# Attaching datasets while handling version control:
def attach_dataset(dataset, version=-1):
    # If dataset not present, download it...
    if version == -1:
        # find latest version...
        version = 0
    db.h5f.create_external_link("/", dataset, f"dataset_v{version:02d}.h5:/")


attach_dataset("test", version=1)


# Database structure
# ##################
# /(root)
# ----/dataset1
# --------/001_001_001 (atomic no., charge, mult) (*)
# ------------properties...
# --------more species...
# ----more datasets...
# ...
# ----/elements (†)
# --------properties....


# Species are a thin wrapper around the table denoted by (*) above and the element data (†).
# I think we should leave the property accessors as dict lookups (i.e. species['nelec'] vs
# species.nelec) in order to communicate that they are PyTables data tables, and so the PyTables
# querying functions can be used transparently without needing to write bindings,
# and so an appropriate KeyError is raised when data is not present (some species data is
# incomplete).
#
# I will make a class for this once the pieces have been coded.

import csv
import tables as pt
import numpy as np
from importlib_resources import \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct import statement in L4, should be a single line statement

@gabrielasd
Copy link
Collaborator

Hi @enjyashraf18
I looked at the issue with the broken PR checks. Like we discussed part of the problem is the missing PyTables dependency. To (partially) fix it add the following line to the dependencies list in the pyproject.toml file:
"tables>=3.9.2"
For python 3.10 up this will fix the installation issues for the CI checks.

After this you'll still need to check/correct the broken tests for the different datasets.

@gabrielasd gabrielasd requested a review from msricher June 26, 2025 18:25
Copy link
Collaborator

@msricher msricher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good for now!

@gabrielasd gabrielasd requested review from FarnazH and removed request for FarnazH June 27, 2025 14:23
return None

# open the HDF5 file in read mode
with pt.open_file(elements_hdf5_file, mode="r") as h5file:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the global database instead of opening/closeing the file each time. For future, when the global database is implemented.

if table.nrows == 1:
value = table[0]["value"]
# if the value is an int, return it as an int
if isinstance(value_col, pt.Int32Col):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@enjyashraf18 check the type of the value variable instead of the value_col one, and try using Integral data type from python numbers library. for example like is done here

if not isinstance(spinpol, Integral):

@FarnazH FarnazH self-requested a review July 25, 2025 14:32
@msricher msricher merged commit 9cd8ec4 into theochem:dev-gsoc Aug 25, 2025
0 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants