This package is developed and maintained by the NFSI for quality control assessment and pre-processing operations performed on data acquired by NFSI's Aquarius ocean-bottom seismometers (OBS). The code is subject to ongoing maintenance and development, so instructions for use may change from time to time. Reasonable efforts will be made to maintain backwards compatibility.
For generating PDF reports, an installation of pandoc and texlive is required. All other package requirements are included in the conda environment definition.
- This package uses a conda environment (Anaconda/Miniconda), running Python 3
- Python 3.7 to 3.10, inclusive, are compatible with the full dependency list at this time
- Run `conda env create` to set up the environment, or `conda env update` to update an existing environment after a package version change
- Python 3 (3.7 to 3.10)
- numpy (1.21)
- matplotlib
- scipy
- obspy (1.3.1 or greater)
- pandas
- openpyxl
- bitstring
- pypandoc
- jinja2
- scikit-learn
- seaborn
- psutil
- stdb
- obstools
- pynmeagps
- ioos_qc (modified to allow sample rate greater than 1Hz)
- By default, this package expects to be located at `[base_dir]/OBSDataPipeline`, with a folder called `resource` at the same level as `base_dir`. If this is not the case the package will still function, but the data QC script (`OBS_QC.py`) cannot be run without inputs (default test case) and `log_dir` must be specified in the `common` section of the default config file.
- To use the data QC script (`OBS_QC.py`):
  - Copy `configs/config.ini.stock` to either `resource/OBSDataPipeline/configs/` or the existing `configs` directory in the repository and name the copy `config.ini`. Alternatively, the script will do this automatically the first time it runs if `config.ini` does not exist.
  - Edit `config.ini` as necessary for your particular setup.
  - Some basic settings fall back to this default configuration file if not specified in an individual project's `config.ini` file
- Configuration is not required for other functionality, including editing of StationXML information and data pre-processing for distribution.
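
The first-run behaviour described above (copying the stock file when `config.ini` is missing) can be sketched as follows. This is a minimal illustration, not the package's own code: the function name `ensure_config` is hypothetical, and only the file names `config.ini.stock` and `config.ini` come from this README.

```python
import shutil
import tempfile
from pathlib import Path


def ensure_config(configs_dir):
    """Return the path to config.ini, creating it from the stock file if absent."""
    configs_dir = Path(configs_dir)
    stock = configs_dir / "config.ini.stock"
    target = configs_dir / "config.ini"
    if not target.exists():
        # First run: start from the stock template; an existing file is never overwritten
        shutil.copy(stock, target)
    return target


# Demonstration in a throwaway directory (hypothetical contents)
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.ini.stock").write_text("[common]\n")
    print(ensure_config(tmp).name)
```

Because an existing `config.ini` is left untouched, local edits survive later runs.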
This script is intended to be run on raw data downloaded from the Aquarius OBS. It may be run at any time after downloading the data, but is currently only designed to be run on the Aquarius-specific data package as downloaded from the OBS (data in miniSEED format, metadata as dataless SEED or StationXML, data for each auxiliary channel spans a maximum of 3 miniSEED files).
General settings are included in a config.ini file, which can typically be specified at a project level. Station-specific settings are normally few enough to use command line arguments without being too cumbersome. Most settings may be specified either as command line arguments or parameters in the INI file, except of course the path to the INI file itself if not using the default one. In the case of settings specified by multiple input methods, the command line arguments will take precedence.
The script QC_many.py is designed to run this QC for several instruments in sequence, with a JSON file specifying all command line parameters for each instrument. This is especially useful for test datasets, where several instruments will be "recovered" at the same time, or for batch updating field data after a recovery cruise.
| Name | Default Value | Description |
|---|---|---|
| `config` | | Path to `config.ini` file. Must be absolute path unless `base_dir` and `relative_paths` are also given as command line arguments |
Command line arguments override parameters also present in the configuration file.
| Name | Default Value | Description |
|---|---|---|
| `base_dir` | | Absolute path to base directory if using relative paths |
| `relative_paths` | False | Flag to specify all other paths relative to `base_dir` |
| `data_dir` | | Directory where OBS data is stored (script will find ALL miniSEED files in this directory). Station-specific. |
| `datalog` | | Deployment summary spreadsheet, including deployment and recovery information. Preferred format is XLSX following NFSI template. |
| `logdelimiter` | , | Optional delimiter if datalog file is delimited text (default comma-delimited) |
| `logcolnames` | False | Maintained for backwards compatibility. If true, use column names from datalog file, rather than hard-coded (legacy) names. Will be deprecated in future. |
| `obsid` | | Unique identifier for OBS (normally station name or OBS serial number) |
| `start` | | Start date of deployment to be analyzed (if multiple deployments of same instrument present in log file), as YYYYMMDD. Normally only required for test datasets. |
| `network` | XX | FDSN-style network code assigned for the project |
| `outdir` | `data_dir` | Output directory for QC results |
| `channelmap` | | Spreadsheet or delimited text file mapping correct SEED codes to existing channel identifiers in raw data. Optional. |
| `metadata` | | Optional path to metadata file (dataless SEED or StationXML). If not specified, code will search `data_dir` for a suitable file. |
| `extra_meta` | | Optional JSON file with extra description, QC information and per-channel plot settings |
| `detrend_seismic` | False | Detrend seismic data prior to analysis (RMS linear fit) |
| `use_existing_plots` | False | Do not re-create plots which already exist in output directory (not fully implemented) |
| `projectname` | | Project name. If not specified, code looks in `extra_meta` JSON file instead. |
| `colormap` | viridis | Matplotlib colormap to use for spectrograms |
| `ignore_seismic` | False | Do not analyze data channels for seismometer or hydrophone |
| `limited_seismic` | False | Analysis of seismometer and hydrophone data limited to only data extent, readability, and gaps |
| `debug` | False | Set logging level to debug (see package logging) |
All CLI-optional parameters listed above belong in section dataset.
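
The precedence rule (command line arguments over INI values) can be illustrated with a short sketch. It assumes the parameters are passed as `--name value` options and uses only two of them; the function `resolve_settings` and the sample INI text are hypothetical, and this is not the package's actual parsing code.

```python
import argparse
import configparser

# Hypothetical project-level INI content; the section name comes from this README
INI_TEXT = """
[dataset]
network = 2S
colormap = viridis
"""


def resolve_settings(argv, ini_text):
    """Merge [dataset] values from the INI text with CLI values; CLI wins."""
    parser = argparse.ArgumentParser()
    # default=None distinguishes "not given on the command line" from a real value
    parser.add_argument("--network", default=None)
    parser.add_argument("--colormap", default=None)
    cli = vars(parser.parse_args(argv))

    config = configparser.ConfigParser()
    config.read_string(ini_text)
    settings = dict(config["dataset"]) if config.has_section("dataset") else {}

    # Only arguments that were actually supplied override the INI values
    settings.update({key: value for key, value in cli.items() if value is not None})
    return settings


print(resolve_settings(["--network", "XX"], INI_TEXT))
# {'network': 'XX', 'colormap': 'viridis'}
```

Leaving the argparse defaults as `None` is what keeps an omitted argument from silently overwriting a value set in the INI file.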
| Name | Section | Default Value | Description |
|---|---|---|---|
| `log_dir` | common | ~/resource/OBSDataPipeline/logs | Directory where run-time logs are to be saved. Will recognize `:base` and `:data` as `base_dir` and `data_dir`, respectively. Otherwise, specify full path. |
| `pdftex_path` | common | | Path to the pdfLaTeX executable |
| `window_length` | seismic | 3600 | Window length to be used for PSD calculations, in seconds (default 1 hour) |
| `overlap_percent` | seismic | 50 | Percent overlap for PSD windows (default 50%) |
| `spectrogram_window` | seismic | 60 | NOT IMPLEMENTED. Window length to be used for spectrogram, in seconds (default 1 minute). The spectrogram window is currently set equal to the PSD window length; a different value is not yet properly supported. |
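
As a rough sketch of how these INI-only settings might be consumed, the example below reads the `seismic` values with the documented defaults and expands the `:base` and `:data` placeholders for `log_dir`. Both function names are hypothetical, and treating the placeholders as path prefixes (for example `:base/logs`) is an assumption rather than documented behaviour.

```python
import configparser
from pathlib import Path

# Defaults taken from the table above
DEFAULTS = {"window_length": "3600", "overlap_percent": "50"}


def psd_window_step(ini_text):
    """Return (window length, step between window starts) in seconds."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    seismic = config["seismic"] if config.has_section("seismic") else {}
    window = float(seismic.get("window_length", DEFAULTS["window_length"]))
    overlap = float(seismic.get("overlap_percent", DEFAULTS["overlap_percent"]))
    # With 50% overlap, each new window starts half a window after the last
    return window, window * (1.0 - overlap / 100.0)


def resolve_log_dir(value, base_dir, data_dir):
    """Expand :base / :data (assumed to be prefixes); otherwise use the path as given."""
    for token, root in ((":base", base_dir), (":data", data_dir)):
        if value.startswith(token):
            # Anything after the placeholder is treated as a sub-path
            return Path(root) / value[len(token):].lstrip("/\\")
    return Path(value).expanduser()


print(psd_window_step("[seismic]\nwindow_length = 600\n"))
print(resolve_log_dir(":base/logs", "/data/project", "/data/project/L102"))
```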
The code does not require a specific folder structure outside of data_dir itself. Below is a recommended structure for ease of organization and clarity.
- base_dir
  - station_1_dir
  - station_2_dir
  - ...
  - station_n_dir
  - Project_Deployment_Summary.xlsx
  - project_info.json
  - project_channel_map.xlsx (if required)
Additionally, a directory named QC reports is often created under the base directory to keep QC results separate from the raw data. Each station's QC output is then saved in a sub-directory named as the station ID or OBS identifier.
Raw data as downloaded directly from the Aquarius OBS is in miniSEED file format, with one channel per file and maximum file size of 128 MB, using the STEIM2 data compression algorithm. This means that the time span covered by any particular file varies with sampling rate and compression efficiency.
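
To make the dependence on sampling rate and compression concrete, the sketch below estimates the time span of one full-size file. The average number of bytes per sample achieved by STEIM2 depends on the signal, so the values used here are assumptions for illustration only, and record header overhead is ignored.

```python
def file_duration_hours(sample_rate_hz, bytes_per_sample, file_size_mb=128.0):
    """Approximate time span of one miniSEED file, in hours.

    bytes_per_sample is the average compressed size of a sample; it is an
    assumed input here, not a property reported by the instrument.
    """
    n_samples = file_size_mb * 1024 * 1024 / bytes_per_sample
    return n_samples / sample_rate_hz / 3600.0


# Hypothetical cases: same file size, different rates and compression efficiency
for rate, bps in [(100, 1.5), (100, 3.0), (1, 1.5)]:
    print(f"{rate} Hz at {bps} B/sample: {file_duration_hours(rate, bps):.1f} h")
```

Doubling either the sampling rate or the bytes needed per sample halves the time covered by a file, which is why raw file boundaries do not line up with calendar days.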
The standard used by SeisComP (SDS archive) and more familiar to seismology researchers has data in miniSEED format, with each file including data for a single channel over a 24-hour period (UTC day). These files are organized in a standard folder structure, and have standardized filenames. See the SeisComP documentation for more information.
The miniseed_recut.py script can be used to convert raw Aquarius data to the SDS structure. The SDS_many.py script allows this operation to be run for several stations in sequence with a JSON file providing command line inputs.
- Base folder
  - Year
    - Network
      - Station
        - Channel (optional ".D" suffix)
          - miniSEED data files
Filename ("data" channel): `[Net].[Sta].[Loc].[Chan].D.[Year].[JulianDay].mseed`
- Example: `2S.L102..CH3.D.2023.352.mseed`
- Auxiliary channels omit the ".D" from the file name and channel folder name
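
The folder and file naming described above can be sketched as a small path builder. The function `sds_path` and its `aux` argument are hypothetical and are not part of `miniseed_recut.py`; the layout follows the structure and filename pattern given in this section.

```python
from datetime import date
from pathlib import PurePosixPath


def sds_path(base, net, sta, loc, chan, day, aux=False):
    """Build the archive path of the file holding one channel for one UTC day."""
    suffix = "" if aux else ".D"      # auxiliary channels omit ".D"
    day_of_year = day.strftime("%j")  # zero-padded to three digits
    name = f"{net}.{sta}.{loc}.{chan}{suffix}.{day.year}.{day_of_year}.mseed"
    return PurePosixPath(base) / str(day.year) / net / sta / f"{chan}{suffix}" / name


print(sds_path("SDS", "2S", "L102", "", "CH3", date(2023, 12, 18)))
# SDS/2023/2S/L102/CH3.D/2S.L102..CH3.D.2023.352.mseed
```

An empty location code produces the two consecutive dots seen in the example filename.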
Some channel identifiers used by default on the Aquarius OBS do not follow the SEED convention, and require correction using the same channel map file used by the data QC script. This pre-processing script also applies a linear clock drift correction based on the final clock offset measurement collected at instrument recovery, if available.
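
A linear clock drift correction of this kind can be sketched as below. The sketch assumes the clock was synchronized (zero offset) at a known time before deployment and that the offset grew linearly until the measurement at recovery. The sign convention (offset = instrument clock minus reference clock) and both function names are assumptions, not taken from the package.

```python
from datetime import datetime, timedelta


def drift_at(t, t_sync, t_recovery, final_offset_s):
    """Linearly interpolated clock offset, in seconds, at instrument time t."""
    total = (t_recovery - t_sync).total_seconds()
    elapsed = (t - t_sync).total_seconds()
    return final_offset_s * elapsed / total


def corrected_time(t, t_sync, t_recovery, final_offset_s):
    """Remove the interpolated offset from a recorded timestamp."""
    # Assumed convention: a positive offset means the instrument clock ran ahead
    return t - timedelta(seconds=drift_at(t, t_sync, t_recovery, final_offset_s))


# Hypothetical deployment: 1.0 s of accumulated drift over 10 days
sync = datetime(2023, 1, 1)
recovery = datetime(2023, 1, 11)
print(drift_at(datetime(2023, 1, 6), sync, recovery, 1.0))  # halfway through: 0.5
```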
| Name | Default Value | Description |
|---|---|---|
| `data_dir` | | Absolute path to directory where Aquarius data package is stored |
| `archive_dir` | | Absolute path to directory where SDS archive format files are to be saved |
| `subfolders` | | Comma-separated list of direct child directories in `data_dir` to be processed |
| `channels` | | Comma-separated list of channels to process (optional). Defaults to only seismometer and hydrophone. Will recognize "all" to process all channels present in the data directory. |
| `start` | | Start date/time for data processing, as YYYYMMDD[HH[MM[SS]]] |
| `end` | | End date/time for data processing, as YYYYMMDD[HH[MM[SS]]] |
| `correct_metadata` | False | Flag to correct channel IDs in output files |
| `network` | XX | FDSN network code assigned to the data. Default XX for test data. |
| `relative_paths` | False | Specify all paths relative to `data_dir`, with the exception of `archive_dir` and `log_dir` |
| `metadata` | | Path to metadata file (dataless SEED or StationXML). Channel IDs should match the raw data. |
| `channelmap` | | Spreadsheet or delimited text file mapping correct SEED codes to existing identifiers in raw data (same as for QC script). Optional. |
| `extra_meta` | | Optional JSON file, as used for QC script. Channel descriptions are taken from here if present. |
| `log_dir` | | Absolute path to directory where runtime logs are to be saved |
| `datalog` | | Deployment summary spreadsheet, same as used for QC script |
| `logdelimiter` | , | Optional delimiter if datalog file is delimited text (default comma-separated) |
| `legacylogcols` | False | If true, use legacy column names for datalog file (opposite behaviour to `logcolnames` in QC script). Will be deprecated in future. |
| `auxseparate` | False | If true, save output data files for auxiliary channels in a separate SDS folder structure, named the same as `archive_dir` with '_AUX' suffix |
| `debug` | False | Set logging level to debug for extra information |
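
The `start` and `end` values accept a date with optional hour, minute and second fields. A minimal parser for the `YYYYMMDD[HH[MM[SS]]]` pattern might look like the sketch below; the function name `parse_timestamp` is hypothetical and the script's own parsing may differ.

```python
from datetime import datetime

# Input length determines which optional fields are present
_FORMATS = {
    8: "%Y%m%d",
    10: "%Y%m%d%H",
    12: "%Y%m%d%H%M",
    14: "%Y%m%d%H%M%S",
}


def parse_timestamp(text):
    """Parse YYYYMMDD[HH[MM[SS]]] into a datetime; omitted fields default to zero."""
    fmt = _FORMATS.get(len(text))
    if fmt is None or not text.isdigit():
        raise ValueError(f"expected YYYYMMDD[HH[MM[SS]]], got {text!r}")
    return datetime.strptime(text, fmt)


print(parse_timestamp("20231218"))      # 2023-12-18 00:00:00
print(parse_timestamp("202312180930"))  # 2023-12-18 09:30:00
```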
This script is most often used as part of a batch process for data collected from an entire array of OBS (SDS_many.py). A JSON file is used as input for the batch script, with an item named instruments to specify station-specific parameters. Common CLI parameters are specified at the top level, with boolean flags given as a list labeled flags.
Common parameters normally included:
- data_dir (project directory on disk)
- archive_dir ("SDS" optionally in project directory)
- channels ("all")
- channelmap
- extra_meta
- datalog
- network
- flags
  - correct_metadata
  - relative_paths
  - auxseparate
Station-specific parameters typically used:
- subfolders
- metadata
- start (touchdown at seafloor)
- end (release from seafloor)
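
The batch file layout described above (common parameters at the top level, a `flags` list, and an `instruments` item) could be expanded into per-station argument lists roughly as follows. Whether `instruments` is a list of objects, the `--name value` option style, and every path and station name below are assumptions made for illustration; consult `SDS_many.py` for the format it actually expects.

```python
import json

# Hypothetical batch description; all values are placeholders
BATCH_JSON = """
{
  "data_dir": "/data/project",
  "archive_dir": "SDS",
  "channels": "all",
  "network": "2S",
  "flags": ["correct_metadata", "relative_paths"],
  "instruments": [
    {"subfolders": "L101", "start": "20230101", "end": "20231201"},
    {"subfolders": "L102", "start": "20230102", "end": "20231202"}
  ]
}
"""


def build_arg_lists(batch):
    """Return one command line argument list per instrument."""
    common = {k: v for k, v in batch.items() if k not in ("flags", "instruments")}
    flags = [f"--{name}" for name in batch.get("flags", [])]
    arg_lists = []
    for instrument in batch["instruments"]:
        params = {**common, **instrument}  # station-specific values win
        args = []
        for key, value in params.items():
            args += [f"--{key}", str(value)]
        arg_lists.append(args + flags)
    return arg_lists


for args in build_arg_lists(json.loads(BATCH_JSON)):
    print(" ".join(args))
```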
The Aquarius OBS produce StationXML files which include all response information for channels available on the instrument. Values for location may be set during programming of the instrument, but these are generally not known accurately prior to deployment. This script updates all relevant values from the project/station metadata, and corrects SEED codes as necessary to align with the final pre-processed data files. Some NFSI-specific information is also added, including contact information.
The StationXML files produced by this script should be compatible with SeisComP, but include only a single station at this time. Automatically combining into a full-network StationXML when run for several input files is a planned future development.
| Name | Default Value | Description |
|---|---|---|
| `input_dir` | | Absolute path to directory where input (partial) StationXML files are stored and/or base directory for relative paths |
| `output_dir` | | Path to directory where output StationXML files are to be saved |
| `log_dir` | | Path to directory where runtime logs are to be saved |
| `relative_paths` | False | Specify all paths relative to `input_dir` |
| `xml` | | Path to input StationXML file if only processing a single file. Will override `input_dir` for this purpose if specified (does not affect `relative_paths` behaviour). |
| `datalog` | | Deployment summary spreadsheet, same as used for QC script |
| `legacylogcols` | False | If true, use legacy column names for datalog file (opposite behaviour to `logcolnames` in QC script). Will be deprecated in future. |
| `channelmap` | | Spreadsheet or delimited text file mapping correct SEED codes to existing identifiers in raw data (same as for QC script). Optional. |
| `out_channels` | | List of channel IDs to include in output StationXML (after any correction), either comma-separated or text file. If not specified, channels included in `channelmap` will be output. If no channel map given, all channels in input will be output. |
| `other_meta` | | Optional JSON file providing extra information to be added to files. Normally used to specify network DOI and sub-sensor serial numbers (hydrophone and Keller). |
| `dataless` | False | Set to true if the input files are dataless SEED rather than StationXML |
| `survey` | Triangulation | Survey method used to determine seafloor locations. Overridden by column "Survey Calculation Method" in datalog if present. |
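
The fallback order described for `out_channels` (explicit list, then channels in the channel map, then everything in the input) can be sketched as below. The function name, the representation of the channel map as a dictionary from raw identifier to corrected SEED code, and the channel codes in the example are all hypothetical.

```python
def select_output_channels(input_channels, out_channels=None, channel_map=None):
    """Return the corrected channel IDs that would be written to the output file."""
    channel_map = channel_map or {}
    # Apply the correction first: out_channels refers to IDs after correction
    corrected = [channel_map.get(ch, ch) for ch in input_channels]
    if out_channels:
        wanted = set(out_channels)
    elif channel_map:
        wanted = set(channel_map.values())
    else:
        return corrected  # no list and no map: keep every input channel
    return [ch for ch in corrected if ch in wanted]


raw = ["CH1", "CH2", "CH3", "TMP"]
mapping = {"CH1": "HH1", "CH2": "HH2", "CH3": "HHZ"}
print(select_output_channels(raw, channel_map=mapping))         # ['HH1', 'HH2', 'HHZ']
print(select_output_channels(raw, ["HHZ"], channel_map=mapping))  # ['HHZ']
print(select_output_channels(raw))                              # ['CH1', 'CH2', 'CH3', 'TMP']
```

An empty result from this selection corresponds to the case noted below, where a file containing none of the requested channels is not copied to the output directory.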
This script is most often run for a full array of OBS, using the project directory as input_dir. If StationXML files exist within the input directory but do not contain any channels specified for output, updated copies of these files will not be saved in the output directory.
Typical parameters used:
- input_dir
- relative_paths
- output_dir ("StationXML")
- datalog
- channelmap
- other_meta