Refactor AtomDB: Migrate Elements Data from CSV to HDF5 #153

enjyashraf18 · 2025-06-17T19:36:42Z

To address issue #147, a structured HDF5 file has been created to store periodic data.

This part introduces a migration script that reads atomic data from two CSV files and stores it in an organized HDF5 schema. The updates include:

- Table schema definitions for element data.
- Clear documentation for the migration process.
- An updated scalar method that now reads periodic data directly from the HDF5 file.
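
A minimal sketch of what a CSV-to-HDF5 migration of this kind could look like, assuming PyTables and a hypothetical three-column CSV layout (`atnum`, `symbol`, `mass`). The column names, the `Element` schema, and the function names are illustrative assumptions and are not taken from the actual migration script in this PR.

```python
import csv
import io

# Hypothetical CSV layout; the real files in the PR may use different columns.
SAMPLE_CSV = """atnum,symbol,mass
1,H,1.008
2,He,4.0026
"""


def read_elements(csv_text):
    """Parse CSV text into a list of (atnum, symbol, mass) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(int(r["atnum"]), r["symbol"].encode(), float(r["mass"])) for r in reader]


def write_elements(rows, path):
    """Store the rows in an HDF5 table (requires PyTables to be installed)."""
    import tables as pt

    # Assumed schema: one row per element, columns in the same order as the tuples.
    class Element(pt.IsDescription):
        atnum = pt.Int32Col(pos=0)
        symbol = pt.StringCol(3, pos=1)
        mass = pt.Float64Col(pos=2)

    with pt.open_file(path, mode="w") as h5file:
        table = h5file.create_table("/", "elements", Element, title="Element data")
        table.append(rows)
        table.flush()


rows = read_elements(SAMPLE_CSV)
```

Calling `write_elements(rows, "elements.h5")` would then write the table; the file name here is only a placeholder.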

gabrielasd · 2025-06-18T05:50:11Z

Hi @enjyashraf18, this looks good to me.
Let's make one final adjustment to the format of the docstrings, after which you can move forward with refactoring the periodic module.

For the docstrings, please use NumPy's style; see, for example, this docstring in the promolecule module:

AtomDB/atomdb/promolecule.py

Lines 104 to 123 in 464f116

    
    r"""
    Compute the electron density of the promolecule at the desired points.

    Parameters
    ----------
    points: np.ndarray((N, 3), dtype=float)
        Points at which to compute the density.
    spin: ('t' | 'a' | 'b' | 'm'), default='t'
        Type of density to compute; either total, alpha-spin, beta-spin,
        or magnetization density.
    log: bool, default=False
        Whether to compute the log of the density instead of the density.
        May be slightly more accurate.

    Returns
    -------
    density: np.ndarray((N,), dtype=float)
        Density evaluated at N points.

    """

It should be possible to automate this formatting in your IDE; for example, these are instructions for how to do it in VS Code.
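
As an illustration of the requested layout applied to a small lookup function, here is a sketch. The function `element_mass` and its arguments are hypothetical and do not come from the AtomDB code base; the `name: type` spacing follows the promolecule docstring quoted above.

```python
def element_mass(masses, atnum):
    r"""
    Look up the mass of an element by its atomic number.

    Parameters
    ----------
    masses: dict
        Mapping from atomic number to mass.
    atnum: int
        Atomic number of the element.

    Returns
    -------
    mass: float
        Mass stored for the element.

    Raises
    ------
    KeyError
        If no mass is stored for ``atnum``.

    """
    return masses[atnum]
```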

msricher · 2025-06-23T13:05:41Z

@enjyashraf18 Here is my small proof-of-concept + notes on the data tables:

# import atexit
import os
import tempfile
import uuid

import tables


# This lives in memory and contains external links to datasets
class MemoryDB:
    _tempdir = tempfile.mkdtemp()

    def __init__(self):
        path = f"{os.path.join(MemoryDB._tempdir, uuid.uuid4().hex)}.h5"
        self.h5f = tables.open_file(path, "a", driver="H5FD_CORE", driver_core_backing_store=0)
        # atexit.register(self.h5f.close)

    def __del__(self):
        self.h5f.close()


# Species will reference this:
db = MemoryDB()


# Attaching datasets while handling version control:
def attach_dataset(dataset, version=-1):
    # If dataset not present, download it...
    if version == -1:
        # find latest version...
        version = 0
    db.h5f.create_external_link("/", dataset, f"dataset_v{version:02d}.h5:/")


attach_dataset("test", version=1)


# Database structure
# ##################
# /(root)
# ----/dataset1
# --------/001_001_001 (atomic no., charge, mult) (*)
# ------------properties...
# --------more species...
# ----more datasets...
# ...
# ----/elements (†)
# --------properties....


# Species are a thin wrapper around the table denoted by (*) above and the element data (†).
# I think we should leave the property accessors as dict lookups (i.e. species['nelec'] vs
# species.nelec) in order to communicate that they are PyTables data tables, and so the PyTables
# querying functions can be used transparently without needing to write bindings,
# and so an appropriate KeyError is raised when data is not present (some species data is
# incomplete).
#
# I will make a class for this once the pieces have been coded.
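
The dict-style access described in these notes could be sketched as follows. `SpeciesView` is a hypothetical name, plain dicts stand in for the PyTables tables marked (*) and (†), and the lookup order (species data first, then element data) is an assumption rather than a decided design.

```python
class SpeciesView:
    """Hypothetical thin wrapper over species data (*) and element data (†)."""

    def __init__(self, species_data, element_data):
        self._species = species_data
        self._elements = element_data

    def __getitem__(self, key):
        # Assumed precedence: species-specific properties come first.
        if key in self._species:
            return self._species[key]
        # A missing property surfaces as a KeyError rather than a silent None.
        return self._elements[key]


species = SpeciesView({"nelec": 1}, {"symbol": "H"})
```

With this shape, `species["nelec"]` and `species["symbol"]` both resolve, while a property stored in neither table raises `KeyError`, which matches the behaviour the notes ask for when species data is incomplete.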

gabrielasd · 2025-06-25T04:00:42Z

atomdb/migration/periodic/elements_data.py

+import csv
+import tables as pt
+import numpy as np
+from importlib_resources import \


Correct the import statement on line 4; it should be a single-line statement.

atomdb/datasets/slater/db/H_000_002_000.msg

gabrielasd · 2025-06-26T15:51:56Z

Hi @enjyashraf18,
I looked into the broken PR checks. As we discussed, part of the problem is the missing PyTables dependency. To (partially) fix it, add the following line to the dependencies list in the pyproject.toml file:
"tables>=3.9.2"
For Python 3.10 and up, this will fix the installation issues in the CI checks.

After this, you'll still need to check and correct the broken tests for the different datasets.

msricher

This is good for now!

msricher · 2025-06-27T14:25:03Z

atomdb/species.py

+            return None
+
+        # open the HDF5 file in read mode
+        with pt.open_file(elements_hdf5_file, mode="r") as h5file:


Use the global database instead of opening/closing the file each time. This is for the future, once the global database is implemented.
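
The general idea of reusing one open handle rather than reopening the file on every call can be illustrated with a small cache. This is only a sketch of the pattern, not the planned global database: `cached_opener` is a hypothetical helper, and a stand-in opener is used so the example runs without PyTables.

```python
import functools


def cached_opener(open_func):
    """Return a function that opens each path once and then reuses the handle."""

    @functools.lru_cache(maxsize=None)
    def get_handle(path):
        return open_func(path)

    return get_handle


# Stand-in opener that records its calls; with PyTables this could instead be
# something like: lambda path: tables.open_file(path, mode="r")
calls = []
get_handle = cached_opener(lambda path: calls.append(path) or f"handle:{path}")
first = get_handle("elements.h5")  # placeholder file name
second = get_handle("elements.h5")
```

The second call returns the cached handle without invoking the opener again, so `calls` records a single open.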

gabrielasd · 2025-07-04T20:33:14Z

atomdb/species.py

+            if table.nrows == 1:
+                value = table[0]["value"]
+                # if the value is an int, return it as an int
+                if isinstance(value_col, pt.Int32Col):


@enjyashraf18 check the type of the value variable instead of value_col, and try using the Integral type from Python's numbers module, for example as is done here:

AtomDB/atomdb/species.py

Line 417 in 464f116

if not isinstance(spinpol, Integral):
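
A minimal sketch of the suggested check, applied to the value itself rather than to the column descriptor. The helper name `normalize_value` is hypothetical; NumPy integer scalars are also registered as `numbers.Integral`, so the same test would cover values read from a table.

```python
from numbers import Integral


def normalize_value(value):
    """Return integral values as a plain int and leave everything else unchanged."""
    if isinstance(value, Integral):
        return int(value)
    return value
```

Note that `bool` also counts as `Integral` in Python, so `True` would be returned as `1` by this helper.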

enjyashraf18 force-pushed the refactor-atomdb branch from 2d40a52 to b7bb79e on June 21, 2025 19:16

gabrielasd requested changes Jun 25, 2025

gabrielasd requested a review from msricher June 26, 2025 18:25

msricher approved these changes Jun 26, 2025

gabrielasd requested review from FarnazH and removed request for FarnazH June 27, 2025 14:23

msricher reviewed Jun 27, 2025

gabrielasd requested changes Jul 4, 2025

enjyashraf18 added 2 commits July 23, 2025 21:56

Add migration script to convert periodic data to HDF5 format

476e144

refactor scalar method to read data from the HDF5 file.

22285b2

enjyashraf18 force-pushed the refactor-atomdb branch from 0d78b49 to 22285b2 on July 23, 2025 19:00

FarnazH self-requested a review July 25, 2025 14:32

msricher merged commit 9cd8ec4 into theochem:dev-gsoc Aug 25, 2025
0 of 6 checks passed
