A software suite for processing the ExoMol line lists into standardised inputs for the Lida database.
The code inside this repository is designed to be executed directly on the exoweb server, as it reads the exomol line list data and process them into the LiDa-compatible lifetimes data.
This documentation page compiles hopefully all the information to get anyone fluent in python up to speed and running.
Clean virtual environment on the exoweb server.
Clone the repo:
git clone git@github.com:ExoMol/lida-web.gitCreate the
config/config_local.pyfile. The contents of which should befrom pathlib import Path EXOMOL_DATA_DIR = Path("/absolute/path/to/exomol3_data") # the directory with individual molecules
This points to the data, and can be changed for testing on a local PC anywhere.
Install dependencies:
pip install -r requirements.txtTest the package if all ok:
pytestwill test all docstrings, integration tests and unit tests,pytest tests_integrationandpytest tests_unitbreak down the testing.Note: The outputs for processed molecules are saved in the
outputsdirectory. I have run the code on 27 molecules already, outputs of which can be found on exoweb inmhaniciuser folder on the exoweb server. This might be the best way to familiarize oneself with the structure of the outputs. The outputs are by design excluded from the git VCS.
The project structure is as follows. Parts of the project will be later discussed in more detail.
configpackage containsconfigandconfig_local(not under VCS) modules with some project-wide configuration options.exomol2lidapackage contains all the code for the actual data generation and processing.inputfolder contains the input files configuring the processing of individual molecules.outputfolder is where the data processing outputs are saved to. Contents of this folder (paths to) are what needs to be passed to the data population script in the lida-web. project. This folder is not a part of VCS (while input folder is). Reason is outputs can be quite bulky. I have, however, processed 27 molecules in the past already, outputs of which can be found on exoweb inmhaniciuser folder (incode, where this repo is cloned). This might be the best way to familiarize oneself with the structure of the outputs.preferred_isotopologuesis a package grouping some functionality for guessing which isotopologue for each molecule is likely the most abundant one, and should be a part of LIDA data. This is just a utility package and is not used implicitly by any other parts of theexomol2lidaworkflow.tests_unitandtests_integrationare testing harnesses written forpytest.process.pyandpostprocess.pyare top-level scripts doing processing and post-processing for a single molecule.
The input files located in the input folder are currently the following:
input/mapping_el.pyinput/molecules.py
Let us start with the molecules.py input file.
The main input file contains the configuration for each molecule that is supposed to
be processed by exomol2lida. The structure of the molecules.py file is the
dict with molecule formulas as keys and dictionaries with number of mandatory and
optional attributes, as follows:
molecules = {
"molecule formula": {
"mol_slug": "molecule slug", # mandatory
"iso_slug": "isotopologue slug", # mandatory
"dataset_name": "the name of the dataset", # mandatory
"resolve_vib": ["quantum 1", ..., "quantum n"], # at least one of "resolve_*" needs to be given
"resolve_el": ["quantum 1", ..., "quantum m"], # at least one of "resolve_*" needs to be given
"states_header": ["i", "E", "g_tot", "J", ..., "q_1", ..., "q_k"], # if given, both "resolve_*" ignored
"energy_max": int("maximal energy [eV]"), # optional, if not present, all data are used
"only_with": {"quantum": value}, # optional, if not present, all data are used
"only_without": {"quantum": value} # optional, if not present, all data are used
},
...,
"other molecule formula": {...}
}The molecule formula here needs to be a pyvalem compatible formula, but does not
need to be the same as the ExoMol formula (but generally will be, with exception
of distinguishing between isomers and different isotopologues of hydrogen).
It might be best to show an example:
molecules = {
"CO": {
"mol_slug": "CO",
"iso_slug": "12C-16O",
"dataset_name": "Li2015",
"resolve_vib": ["v"]
},
"HCN": {
"mol_slug": "HCN",
"iso_slug": "1H-12C-14N",
"dataset_name": "Harris",
"resolve_vib": ["v1", "v2", "v3"],
"only_with": {"iso": "0"}
},
"HNC": {
"mol_slug": "HCN",
"iso_slug": "1H-12C-14N",
"dataset_name": "Harris",
"resolve_vib": ["v1", "v2", "v3"],
"only_with": {"iso": "1"},
"energy_max": 5.0
},
"VO": {
"mol_slug": "VO",
"iso_slug": "51V-16O",
"dataset_name": "VOMYT",
"states_header": [
"i", "E", "g_tot", "J", "tau", "+/-", "e/f", "State", "v", "Lambda", "Sigma",
"Omega"
],
"resolve_el": ["State"],
"resolve_vib": ["v"],
"only_without": {"State": "0"},
},
"HD+":{
"mol_slug": "H2",
"iso_slug": "1H-2H_p",
"dataset_name": "CLT",
...
},
...
}The mandatory "mol_slug", "iso_slug", "dataset_name" attributes identify
the data within the ExoMol ecosystem. The "resolve_el" and "resolve_vib"
attributes need to exist as columns in the .states file for the given dataset and these
quanta will be resolved in the final lida data. All the other quanta columns in the .states
file will be lumped and averaged over. At least one of the "resolve_el" and
"resolve_vib" attributes need to be specified for each molecule.
The "states_header" defines the names of all columns in the .states file for the
dataset, and needs to have the same length as the number of the .states file's columns.
Of course, the resolve_el | resolve_vib need to be subset of the states_header.
The "states_header" is optional in the configuration, if not provided, the columns
are inferred from the .def file, if possible, or an error is raised. Therefore the
states_header attribute serves as a workaround for inconsistent .def/.states files.
Finally, the "energy_max", "only_with", and "only_without" attributes
specify the filtering of the data, in the way that states with higher energy than
specified, states with quanta values given by only_without and all the states
other than with quanta values given by only_with, will be completely ignored, and
their transitions will not be considered at all for calculations of the lifetimes
of the final lumped states.
This is shown on the "HCN" and "HNC" example, which produces two LIDA molecules
out of a single ExoMol dataset, each only considering states with one of the
isomers, denoted in the ExoMol dataset by the "iso" column in the .states file.
Similarly, the "only_without" parameter can be used to filter out some unphysical
or nonsensical states, such as was done for the "VO" example, which has a state
(in the .states file) with value "0" under the "State" column, which needed to
be ignored. This could be used filter out all the states (and transitions to and from)
with a certain value of some specified quanta. One application would be to filter out
all the states with some vibrational quanta with values "*" or -1, which indeed
do exist in many ExoMol dataset. But this was such a common occurrence, that such
filtering is hard-coded into the algorithm and does not need to be explicitly defined
by the input configuration file.
The "HCN" isomers, as well as the "HD+" molecule are examples of the
resulting LIDA molecule formulas differing from the ExoMol molecule formulas. The
keys in the molecules dictionary specify the LiDa molecule names, which need to be
unique within the LiDa ecosystem, while the first three mandatory parameters for each
molecule define the path to the correct dataset within the ExoMol database.
The Lida database will require pyvalem
compatible formulas of species, isotopologues and states. For them to be constructed,
the electronic states resolved for each species need to take form of valid molecular
term symbols, which pyvalem can parse. This is often the case without any
intervention, often, when ExoMol dataset resolved electronic states, there exists a
"State" column in the .states file, populated with values which are in the
pyvalem compatible form already. In the cases where this is not the case, however,
a mapping between the ExoMol electronic states and the LiDa (pyvalem compatible)
electronic state labels needs to be provided.
The structure of this input file is made clear by the following self-explanatory
example of the mapping_el.py input file:
mapping_el = {
"SiH": {
("a4Sigma",): "a(4SIGMA-)",
("B2Sigma",): "B(2SIGMA-)",
},
"NaH": {
("X",): "X(1SIGMA+)",
("A",): "A(1SIGMA+)"
},
"CN": {
("X",): "X(2SIGMA+)",
("A",): "A(2PI)",
("B",): "B(2SIGMA+)"
},
...
}In theory, there might be more than a single column of the ExoMol .states file
associated with the electronic state, all necessary to resolve for LiDa, which is
the reason for the keys of the mapping above being tuples. In all the examples above
(and indeed in all the datasets processed so far), however, there is only a single
column in the .states file describing the electronic state, which has been considered
important to resolve for the lumped LiDa states. That is why all the tuple keys in
the mapping_el dicts have only a single value. In the example above, the "X" and
"A" as keys on the "NaH" molecule actually represent all the possible values
of the "State" column on the .states file for the NaH ExoMol dataset, where the
corresponding input in the molecules.py would be
"NaH": {..., "resolve_el": ["State"], ...}.
Running the exomol2lida algorithm (described further below) for all the molecules
defined in the input files will result in the output files saved into the output
folder with a similar structure as:
$ tree output/
output
├── ...
| ...
├── CN
│ ├── meta_data.json
│ ├── states_composite_map.py
│ ├── states_data.csv
│ ├── states_electronic.csv
│ ├── states_electronic_raw.csv
│ ├── states_vibrational.csv
│ └── transitions_data.csv
├── CO
│ ├── meta_data.json
│ ├── states_composite_map.py
│ ├── states_data.csv
│ ├── states_vibrational.csv
│ └── transitions_data.csv
| ...
├── ...This file compiles all the metadata about the processed dataset, as an example, the
following was recorded for the CN molecule:
The metadata file contains the original input file for the molecule exactly as was when the dataset was processed, plus some more metadata belonging to the ExoMol dataset. In particular, the version recorded might be used in some automatic management of newly released ExoMol line lists and their propagation into the LiDa database.
This is a file recording the lifetimes and energies of all the newly defined lumped
states, generated by the exomol2lida algorithm from the original states from the
ExoMol dataset's .states file. The lifetimes are in [s] and the energies in [eV]. The
example for the CN molecule looks as follows:
$ nano output/CN/states_data.csv
i, tau, E
0, inf, 0.00023
1, 0.10943243874229817, 0.25345
2, 0.057139791266185895, 0.50341
3, 0.03905472113608493, 0.75011
...
98, 1.4510310386401931e-06, 7.25698
99, 0.33311171927799127, 7.26205
100, 0.0001961755103469539, 7.27088The i column gives the unique ids of the lumped states generated from the original
ExoMol highly resolved states.
This file gives the mapping between the ids of the lumped states and the ids of the original ExoMol states (from the first column of the .states file). Again, the example for the molecule above would be
$ nano output/CN/states_composite_map.py
data = {
0: {1, 102, 203, ..., 27798, 27868, 27937},
1: {...},
...
99: {...},
100: {101, 202, 342, ..., 5275, 5413, 5551}
}This mapping makes for easy checks which of the original highly resolved states belong to each lumped state (or composite state).
This file specifies the resolved vibrational quanta per each lumped state, e.g.
$ nano output/CN/states_vibrational.csv
i, v
0, 0
1, 1
2, 2
...
98, 20
99, 39
100, 39In this example, only a single column v exists, because input config for this dataset
would have been set as "CN": {..., "resolve_vib": ["v"], ...}, but more vibrational
quanta will generally be resolved for polyatomic molecules, resulting in more columns, such
as columns i, v1, v2, v3 for the HCN molecule for example.
For the datasets which do not resolve vibrational states, these files will not exist.
These files specify the resolved electronic quanta per each lumped state. The raw file
simply gives the original values of the columns in the "resolve_el": ["q1", ...]
input parameter. The second file reflects the mapping from input/mapping_el.py and
gives the pyvalem compatible molecular term symbols, such as
$ nano output/CN/states_electronic_raw.csv
i, State
0, X
1, X
2, X
...
98, B
99, X
100, A
$ nano output/CN/states_electronic.csv
i, State
0, X(2SIGMA+)
1, X(2SIGMA+)
2, X(2SIGMA+)
...
98, B(2SIGMA+)
99, X(2SIGMA+)
100, A(2PI)In the cases where the electronic quanta values from the ExoMol .states file are already
pyvalem compatible molecular terms and no input/mapping_el.py entry is needed,
both the output/formula/states_electronic_raw.csv and
output/formula/states_electronic.csv will exist and will simply be identical.
For the datasets which will not resolve electronic states, these files will not exist.
While the states_electronic_raw.csv is produced by the exomol2lida.process_dataset
module, the states_electronic.csv is produced by the exomol2lida.postprocess_dataset
The last output file contains the transitions between the lumped states with their
calculated partial lifetimes. The same example of the CN molecule gives
$ nano output/CN/transitions_data.csv
i, f, tau_if
1, 0, 0.10943243874229817
2, 0, 1.2059110179290697
2, 1, 0.059981920029654155
...
100, 29, 56790.90109600513
100, 30, 0.9100356742558192
100, 31, 0.00029494955495862026The i, f values refer to the ids of the lumped states.
In this section, the main state-lumping algorithm is briefly described verbally. The algorithm follows these steps:
Filter the original states from the .states file. This is where the input parameters
energy_max,only_withandonly_withoutcome in, see the inputs section. All the original states which do not survive this filtering will be completely ignored for the further calculations.Define the lumped states, create mapping between the original state ids and the lumps. Each lumped state is uniquely identified by the set of distinct values of the columns defined by the
resolve_elandresolve_vibin the input file. In the example case of{..., "resolve_vib": ["v1", "v2", "v3"], "resolve_el": ["State"], ...}, each lumped state is a collection of distinct values of theState, v1, v2, v3columns from the ExoMol .states file. Each lump will consist of a number of original ExoMol states characterised with the same resolved quanta, but generally with different values under other columns, and other quanta, such as "+/-", "J", etc. A mapping is created between the lumped indicesiand the original states indicesi_orig,i -> {i_orig}Assign energies to the lumped states. Within each lump, the lowest-J states are identified and the energy of the lump is set to be the average of the original resolved states with the lowest J number, weighted by the
g_tottotal degeneracies.Lump the transitions and calculate the partial lifetimes of the lumped transitions. First, all the in-lump transitions are ignored and not used in any way in the calculations. Also, any transitions from and to a non-existing lump (such as to and from the original states that did not survive the filtering within the state lumping process, either too high energy, or regarding the
only_with,only_withoutparameters) are completely ignored. Ifi,fare the indices of the lumped states andi_orig,f_origare indices of the filtered original states from the .trans file, the partial lifetimes of each lumped transitioni -> fis calculated as\tau_{i \to f} = \mathrm{avg}_{f_\mathrm{orig} \in f} (\tau_{i \to f_\mathrm{orig}}),\frac{1}{\tau_{i \to f_\mathrm{orig}}} = \sum_{i_\mathrm{orig} \in i} A_{i_\mathrm{orig} \to f_\mathrm{orig}}.Here,
Arefers to the einstein coefficients from the original .trans file.Calculate the total lifetimes of the lumped states, as
\frac{1}{\tau_i} = \sum_{f} \tau_{i \to f}.
The algorithm described above is implemented in the exomol2lida.process_dataset
module inside the DataProcessor class. The class is well documented and the
docstrings and the in-line comments serve as the best source of documentation and
usage. A the end of the processing workflow, most of the output files (discussed above)
are created. Apart the DataProcessor, there is also the
exomol2lida.postprocess_dataset.DatasetPostProcessor class, which handles the
conversion between electronic states as are in the ExoMol database, and pyvalem
compatible molecular term symbol labels expected by the LiDa database.
Two top-level script exist which trigger the whole workflow. Assuming there exist an
input entry for molecule, e.g. "H2O" in the input/molecules.py file, the
calculation of the LiDa data for this molecule can be run by
python process.py H2OSimilarly, the data can be post-processed (after processing finished) aby running
python postprocess.py H2OThis assumes that if electronic states are resolved for this molecule, they either
can be parsed automatically by the DatasetPostProcessor, or the
input/mapping_el.py input file has defined the mapping to the valid molecular term
symbols.
Both can be run together by
python process.py H2O --postprocessIf anything goes wrong, hopefully the error message will give a hint on what happened. The processing and post-processing workflow can also be run on all the molecules found in the input files, by running
python process.py all --postprocessThis section discusses some of the known issues of the codebase and ideas for further development.
A more powerful state filtering system is needed. As an example, the H3+ dataset contains the ground state with (v1, v2) = (0, 0), which does not physically exist, therefore does not have energy defined. The
exomol2lidacode runs fine, but the lida database code breaks, as the energy of the (0, 0) state is set toNaNby theexomol2lida, which cannot be dealt with by the lida code (different repo). What is needed is the way to specify something like"H3+": {..., "only_without": {("v1", "v2"): (0, 0)}, ...}. This is currently not supported, and the option of"only_without": {"v1": 0, "v2": 0}, which is supported, would also filter out states with vibrational quanta (0, 1), (0, 2), ..., (1, 0), (2, 0), ...In the case of LiDa data built from the multiple isomers, such as the case of HCN, built from
"HCN": {..., "only_with": {"iso": 0}, ...}, and HNC, built from"HNC": {..., "only_with": {"iso": 1}, ...}, both LiDa datasets are built from a single ExoMol dataset, under the same molecule slug and isotopologue slug. Theexomol2lidacode runs fine, but problems are created for the lida code (different repo), because both isomers have the same isotopologue formula saved in themeta_data.jsonoutput, specifically"(1H)(12C)(14N)". This is used by the LiDa database code and there cannot exist twoIsotopologueinstances with the same formula. What is needed is something likemolecules = { ... "HNC": { "mol_slug": "HCN", "iso_slug": "1H-12C-14N", "dataset_name": "Harris", "resolve_vib": ["v1", "v2", "v3"], "only_with": {"iso": "1"}, "surrogate_iso_formula": "(1H)(14N)(12C)" }, ... }
and change in the processing code, which would save the surrogate formula into the output
meta_data.jsonfile.The process -> postprocess workflow as currently implemented works fine for the case where there is a 1-to-1 relationship between the original state labels and the needed
pyvalemcompatible labels defined in theinput/mapping_el.py. Unfortunatelly there are cases of many-to-1 relationships, such as in the case of the VO molecule, for which the mapping looks likemapping_el = { ... "VO": { ("b2Gamma",): "b(2GAMMA)", ("X",): "X(4SIGMA-)", ("a2",): "a(2SIGMA-)", ("Ap",): "A'(4PHI)", ("A",): "A(4PI)", ("b2",): "b(2GAMMA)", ("c2",): "c(2DELTA)", ("d2",): "d(2SIGMA+)", ("B",): "B(4PI)", ("e2",): "e(2PHI)", ("C",): "C(4SIGMA-)", ("f2",): "f(2PI)", ("D",): "D(4DELTA)", ("g2",): "g(2PI)", }, ... }
As we can see, the
"b2Gamma"and"b2"original states both need to be mapped into a singlepyvalemcompatible"b(2GAMMA)"state string. This breaks the code unfortunately, as the lumped states are created using the original state strings and in the post-processing we end up with two lumped states which should have been a single one. To fix this, the electronic state label substitution for thepyvalemcompatible strings needs to happen inside the lumping procedure within the processing. This would than shift the post-processing workflow to the beginning of the processing workflow - the state re-labeling will happen inside processing and post-processing will not exist anymore. This needs to be implemented, as currently VO molecule fails in the post-processing step due to the reasons described.Some datasets show very high lifetimes after processing, which are surely unphysical. I have not looked into this in detail, but it does deserve some attention, to decide if the ExoMol data are incorrect or if there is a problem with the algorithm. An example of this is the SiH2 molecule, where all the states have lifetimes between 1e3 - 1e14 s. But as other molecules have reasonable lifetimes, it is more probable that the .trans file for this ExoMol dataset is in some way faulty, like incorrect units of einstein coefficients, of swapped columns, etc. Investigation is needed, as well as some safeguards against this happening.
We get quite a lot of transitions from lower-energy states to higher-energy states. This is understood, and it is the result of how the states are lumped, as each lump includes states of the whole range of J numbers, but energy of the lump is calculated only regarding the lowest-energy J. This is just to draw attention to the issue.
In some cases, we see ground states with finite lifetimes. This should probably never happen, and it can signal some possible problems with the algorithm (or inconsistent .trans files to begin with, where there exist transitions from the ground states). Further investigation is needed, examples of molecules with finite lifetimes of the ground states are e.g. CaO, H2, NaCl, and others.
Another non-bug is that we see some high-energy states with very high lifetimes, caused by the .trans file not being complete and missing some radiation channels for the very high energy states. This will often be a problem, because the .trans files must be truncated somewhere. Question is where should the LiDa states and transitions truncated, and how to detect incomplete transitions, so the lumped transition and its initial state could be simply ignored as incomplete and therefore with compromised data. One option could be truncate LiDa states by energy, but question is where to set the threshold.
The fact is that there probably is a strong need for some truncation. The algorithm runs very slowly on the big molecules. Fact is that only molecules of manageable size have been so far processed, and the largest molecules will probably take way too long to be done by the algorithm as it is. The lumped states and transitions probably will have to be truncated to limit the computational load, as well as the size of the lumped dataset. Another thing is that the lifetimes and transitions are so far lumped over all the J numbers, which might not be feasible for the large molecules, and instead a reference J could be picked and the data calculated only for that single J.
Apart from the fact that some molecules are simply too large for the algorithm, there appears to be a problem with the computational cost, where state lumping and transitions lumping algorithm takes more and more time with each chunk of the dataset data read from the exoweb server for a single molecule. For some reason, more data is processed already, more time it takes to process the remaining data, which should not be the case. Probably some profiling and optimization is needed, I did not have the time to pinpoint the issue, but I'm fairly certain that the algo could be optimized.