Centralizing component information across databases? #556

Garren-H · 2026-03-05T23:13:08Z

Hi all

Currently, all database files (.csv files) store several pieces of component information, including but not limited to multiple name variations, CAS numbers, molecular weights, and sometimes InChI keys, SMILES, etc. In some cases an alternative name might be updated in one sheet and not in another. In other cases, molecular weights are reported with different levels of accuracy. In terms of structure, it would be better to centralize compound information in a single sheet. My idea would be to have identifiers in one column, each mapping to a common IUPAC name used throughout the other databases. Water, for instance, may have several entries in the sheet:

Identifier	Normalized Name
H2O	water
7732-18-5	water

We would store these in separate rows to avoid string parsing (which is currently done in the database files with ~|~ markers). We would then use a single identifier, "water", across all other database files. Another "global" database could store the other constant compound information, perhaps with columns:

Normalized Name	SMILES	CAS#	INCHI-Key	MW	Tc	Pc	Vc	acentric_factor	...

This keeps all constant compound information centralized. A lookup would then entail "User input" -> Identifier -> Normalized name -> global information + local/model parameters.
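The lookup chain described here can be sketched roughly as follows. This is only an illustration of the proposal, not how Clapeyron currently works: the two Dicts stand in for the proposed sheets, and the names IDENTIFIERS, GLOBAL_INFO and lookup are hypothetical.

```julia
# Hypothetical stand-in for the proposed identifier sheet:
# one row per identifier, each pointing at a single normalized name.
const IDENTIFIERS = Dict(
    "H2O"       => "water",
    "7732-18-5" => "water",
    "water"     => "water",
)

# Hypothetical stand-in for the proposed "global" sheet:
# normalized name => constant compound information.
const GLOBAL_INFO = Dict(
    "water" => (SMILES = "O", CAS = "7732-18-5", Mw = 18.015),
)

# "User input" -> Identifier -> Normalized name -> global information
function lookup(user_input::AbstractString)
    name = get(IDENTIFIERS, user_input, nothing)
    name === nothing && error("unknown component: $user_input")
    return name, GLOBAL_INFO[name]
end

name, info = lookup("H2O")
println(name, " ", info.CAS)
```

Local/model parameters would then be read from the model's own sheet using the same normalized name as the key.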

Just my opinion, but this would make the database much cleaner and centralize control and updating. I am not sure what the implications of this choice would be for speed on the user side, or for user-supplied databases.

Interested to hear your thoughts on this approach.

longemen3000 · 2026-03-07T04:12:17Z

Maintainer

Some thoughts I have, mainly about the data we store and how we parse it:

If I remember correctly, the database infrastructure was built with the main focus of being as easy to edit as possible. Anyone capable of working with a CSV could create their own and pass it to userlocations. Later, a way to pass user parameters was bolted on (the parameters use the same CSV pipeline, give or take). Probably the first breaking change in Clapeyron 0.7 would be changing userlocations to userparams.
At first, there was no standardization at all across the databases. Different databases had different names, and some of those names clashed. I did a lot of standardization work (with the help of the ChemicalIdentifiers.jl package), in addition to some manual lookup and help from users in finding duplicates or incorrect data. An important point is that we always compare the "normalized" (no spaces, lowercase) name with the normalized version of the synonym list, so CarbonDioxide and carbon dioxide resolve to the same entry. The result of that standardization work is the identifiers.csv file.
I decided to use that long string of synonyms as the unique identifier, because neither CAS (some are missing), nor SMILES (problems with substances defined as a racemic mix vs. one of their isomers), nor InChI (same issue as SMILES) is actually a unique identifier. For example, we have a compound named para-hydrogen (QCPR, SAFTVRQMie and SingleFluid). That component does not have any standardized identifier (it has a CAS for CoolProp compatibility). For standardizing any new names, I have some (undocumented) utility functions (Clapeyron.normalize_components_sym and Clapeyron.write_csv!).
If you look at it from the data perspective only, critical properties are not "constant", in the sense that not every EoS uses the same values. For example, tcPR has a different set of critical values than our critical database, empirical Helmholtz models define their own critical point that is numerically close but not equal to the real critical point, and similar discrepancies are found in quantum gases.
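As a rough illustration of the normalized comparison mentioned above (no spaces, lowercase), here is a minimal sketch. It does not reproduce Clapeyron's internal code: normalize_name and matches are hypothetical helpers, and the synonym row is made up.

```julia
# Assumed normalization: lowercase with spaces removed.
normalize_name(s::AbstractString) = lowercase(replace(s, " " => ""))

# One synonym string per component, joined with the ~|~ marker (made-up row).
synonyms = "carbon dioxide~|~CarbonDioxide~|~CO2"

# Split on the marker, normalize every synonym, and compare with the normalized input.
matches(input, syns) = normalize_name(input) in normalize_name.(split(syns, "~|~"))

println(matches("Carbon Dioxide", synonyms))   # true
println(matches("carbondioxide", synonyms))    # true
println(matches("carbon monoxide", synonyms))  # false
```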

2 replies

longemen3000 Mar 7, 2026
Maintainer

With that said, our standardization effort is relatively recent (identifiers.csv was created in 0.6.4), so it is not really well integrated with our database files. A script could probably be written to standardize all databases at once, or to check in the test suite whether the databases are standardized.

I have not really put a lot of thought into integrating the identifier database with our database files (I put an emphasis on "our", as you can change the default database directory by changing the string in Clapeyron.DB_PATH). We could proceed with changing the names to a standardized identifier, but I recommend making that change in an automated way that is available to future Clapeyron contributors.

Garren-H Mar 7, 2026
Author

Thanks for the detailed explanation Andrés! Just a few points to clarify:

> The result of that standardization work is the identifiers.csv file.

The identifiers.csv file is more or less what I envisioned for this purpose, but with some modifications. I am also working on a fairly large dataset of compounds, where I used ClassyFire and the PubChem API in an attempt to normalize some compound information.

> I decided to use that long string of synonyms as the unique identifier, because neither CAS (some are missing), nor SMILES (problems with substances defined as a racemic mix vs. one of their isomers), nor InChI (same issue as SMILES) is actually a unique identifier.

My point of contention with using ~|~ is that it should introduce additional overhead when looking up compounds. I am not sure what is currently used in the parsing, but consider a statement like findfirst(Base.Fix1(occursin,compound),list), where compound is the normalized user identifier (as outlined for the CO2 example) and list is a vector/dataframe column containing a bunch of synonym strings. Each find has to look through the entire synonym string, which probably involves some regex matching. If the strings are instead split manually on ~|~, the overhead will be even greater.

Instead of introducing this overhead, one could perform findfirst(Base.Fix1(==,compound),flat_list), which searches a larger list but avoids the regex parsing (or manual splitting) of each synonym string. When constructing a single model this is probably insignificant, but when constructing multiple models, potentially in series, one should see a noticeable difference in speed.
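The two lookups compared above can be placed side by side in a small sketch. All data here is made up for illustration, the helper names (find_joined, find_flat, index) are hypothetical, and the Dict variant is an extra option that was not part of the original suggestion.

```julia
# Made-up data for illustration only.
synonym_strings = ["water~|~h2o~|~7732-18-5", "methane~|~ch4~|~74-82-8"]

# Approach 1: substring search inside each joined synonym string.
find_joined(compound) = findfirst(Base.Fix1(occursin, compound), synonym_strings)

# Approach 2: exact comparison against a flat list, one identifier per row.
flat_list  = ["water", "h2o", "7732-18-5", "methane", "ch4", "74-82-8"]
flat_names = ["water", "water", "water", "methane", "methane", "methane"]
function find_flat(compound)
    i = findfirst(Base.Fix1(==, compound), flat_list)
    return i === nothing ? nothing : flat_names[i]
end

# Extra option (an assumption, not from the thread): build a Dict once,
# so that each lookup is a hash lookup instead of a linear scan.
index = Dict(zip(flat_list, flat_names))

println(find_joined("ch4"))          # 2 (row index of the methane string)
println(find_flat("ch4"))            # methane
println(get(index, "ch4", nothing))  # methane
```

One thing to keep in mind with plain substring search is that it can also return partial matches: with the data above, find_joined("ethane") returns 2 because "ethane" is a substring of "methane", while find_flat("ethane") returns nothing. No timings are given here; measuring the actual difference would need a benchmark against the real database files.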

> I have some (undocumented) utility functions (Clapeyron.normalize_components_sym and Clapeyron.write_csv!).

This is good to know. I will definitely have a look at this.

> If you look at it from the data perspective only, critical properties are not "constant", in the sense that not every EoS uses the same values. For example, tcPR has a different set of critical values than our critical database, empirical Helmholtz models define their own critical point that is numerically close but not equal to the real critical point, and similar discrepancies are found in quantum gases.

This is a fair point. I would argue that in this case what is used as "critical properties" should actually be interpreted as model parameters. An easy way to deal with these discrepancies, while still keeping the data centralized, is to prioritize model-specific information over global constants. I.e. we could have a Tc globally and also within the database sheet for tcPR. If that is the case, we just use the value from the local database. So in short we prioritize user_locations/input to models -> local database files -> global database.
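The priority order above could look something like this minimal sketch. The three Dicts and the resolve function are hypothetical, and the "local" Tc is a placeholder value, not a real entry from any Clapeyron database.

```julia
# Hypothetical parameter sources for water, highest priority first.
user_params   = Dict{String,Float64}()                    # user-supplied input (empty here)
local_params  = Dict("Tc" => 647.3)                       # model-specific sheet (placeholder value)
global_params = Dict("Tc" => 647.096, "Pc" => 22.064e6)   # proposed global sheet

# Return the value from the first source that defines the parameter.
function resolve(param, sources...)
    for src in sources
        haskey(src, param) && return src[param]
    end
    return nothing
end

println(resolve("Tc", user_params, local_params, global_params))  # 647.3, local overrides global
println(resolve("Pc", user_params, local_params, global_params))  # 2.2064e7, falls back to global
println(resolve("Vc", user_params, local_params, global_params))  # nothing, not defined anywhere
```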

> I have not really put a lot of thought into integrating the identifier database with our database files (I put an emphasis on "our", as you can change the default database directory by changing the string in Clapeyron.DB_PATH). We could proceed with changing the names to a standardized identifier, but I recommend making that change in an automated way that is available to future Clapeyron contributors.

Noted, thanks!
