Description
Use Cases:
- As a data engineer, I need to transform FHIR Observations to a flattened DataFrame with Observation Codes as columns for analysis and reporting purposes.
- As a UI developer, I need to fetch and display patient-specific observations from a FHIR server in a tabular format with observation codes as columns. Given there are many observations, I need an efficient way to group these facets into categories.
Example
References
Goals/Forces/Motivation:
The primary reason for using the pivot function is to reshape the data: it transforms data from long to wide format, which makes comparing different variables easier. This reshaping is fundamental in preparing datasets for analysis or visualization, as it allows a more structured and readable representation of the data.
Data pivoting is the process of transforming data from a long format (rows) to a wide format (columns), typically to make it more understandable or suitable for analysis.
Common scenarios for data pivoting include summarizing data, creating cross-tabulations, or presenting data in a more structured format for reporting purposes.
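The long-to-wide reshape described above can be sketched with pandas; the subjects, codes, and values below are made-up illustrations:

```python
import pandas as pd

# Long format: one row per (subject, code) observation.
long_df = pd.DataFrame([
    {"subject": "Patient/1", "code": "heart_rate", "value": 72},
    {"subject": "Patient/1", "code": "body_weight", "value": 81.5},
    {"subject": "Patient/2", "code": "heart_rate", "value": 64},
])

# Wide format: one row per subject, one column per observation code.
# Codes without a value for a subject become NaN.
wide_df = long_df.pivot(index="subject", columns="code", values="value").reset_index()
print(wide_df)
```

Note that `pivot` requires unique (index, columns) pairs; real Observation data with repeated measurements needs the temporal dimension in the index, as described later in this issue.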
See R's tidyverse
Tidy data refers to ‘rectangular’ data. These are the data we typically see in spreadsheet software like Google Sheets, Microsoft Excel, or in a relational database like MySQL, PostgreSQL, or Microsoft Access. The three principles for tidy data are:
- Variables make up the columns
- Observations (or cases) go in the rows
- Values are in cells
Put them together, and these three statements make up the contents in a tidy data frame or tibble. While these principles might seem obvious at first, many of the data arrangements we encounter in real life don’t adhere to this guidance.
In the context of FHIR, the Observation is the "tidy data", aka “indexed”, and has a defined, fixed schema.
The dataframe is the "wide" or "Cartesian" data (see ggplot2); its schema is not fixed, as the "variables" are defined by Observation.code.
Workflow
- Create dataframe and maintain ES dataframe mapping (g3t meta dataframe)
- Based on newly submitted Observations, maintain a hierarchy of facets for the dataframe - driven by the Observation.category and Observation.code
- Automatically update the explorer configuration while reading submitted data
- Signal the guppy service to restart after updating the ES mapping
- Signal the portal to read the new explorer configuration on change
Data Frame Creation:
Create an initial pandas DataFrame with columns for each unique observation code extracted from the observations.
Each row represents a set of observations, with their associated attributes (subject, encounter, value, effectiveDateTime, etc.), for a given subject, specimen, and focus at a given time.
Use observation codes as columns to ensure each code has its dedicated column in the DataFrame.
Denormalization by Observation.subject:
- Identify the subject resourceType attribute within each Observation resource, which represents the entity (e.g., patient, device, location) to which the observation applies.
- Normalize the value[x] attribute in the Observation resource, extracting the value based on the data type specified in the value[x] field (e.g., valueQuantity, valueCodeableConcept, valueString).
- Temporal Data Extraction:
- Extract the effectiveDateTime attribute from the Observation resource, representing the date and time of the observation.
- If there is no Observation.effectiveDateTime, include the onsetAge, or use the effectiveDateTime attribute from the specimen or focus resource, representing the temporal data as an age. If no temporal data is available, the field should be null.
- In any case, ensure that a new dataframe row is created for each corresponding temporal data.
- Retrieve the dataframe row from the working set, or create it
- If we need to create the row:
  - retrieve the subject, specimen and focus entities referenced in the Observation resource, expanding the DataFrame to include their scalar attributes and coding; prefix each attribute with the lower case resourceType, e.g. patient_*, specimen_*
  - create an identifier for the row based on the subject, specimen, focus, and temporal data (age), ensuring uniqueness
- If we need to create a column:
  - Add a new column to the data frame
  - Verify the new column exists in explorer_config, or should be added to it
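The flattening steps above can be sketched in Python; the function names and the dict-based FHIR handling are illustrative assumptions, not a confirmed implementation:

```python
import hashlib
from typing import Any, Optional

def normalize_value(observation: dict) -> Any:
    """Extract the polymorphic value[x] from a FHIR Observation (dict form)."""
    if "valueQuantity" in observation:
        return observation["valueQuantity"].get("value")
    if "valueCodeableConcept" in observation:
        concept = observation["valueCodeableConcept"]
        codings = concept.get("coding", [])
        return codings[0].get("display") if codings else concept.get("text")
    if "valueString" in observation:
        return observation["valueString"]
    return None  # no recognized value[x]

def extract_time(observation: dict, fallback: Optional[dict] = None) -> Optional[Any]:
    """Prefer Observation.effectiveDateTime, then onsetAge, then temporal
    data from the specimen/focus resource; null if nothing is available."""
    if "effectiveDateTime" in observation:
        return observation["effectiveDateTime"]
    if "onsetAge" in observation:
        return observation["onsetAge"].get("value")
    if fallback and "effectiveDateTime" in fallback:
        return fallback["effectiveDateTime"]
    return None

def row_identifier(subject: str, specimen: Optional[str] = None,
                   focus: Optional[str] = None, age: Optional[Any] = None) -> str:
    """Deterministic, unique row id from the (subject, specimen, focus, age) tuple."""
    key = "|".join(str(part) for part in (subject, specimen, focus, age))
    return hashlib.sha256(key.encode()).hexdigest()[:16]
```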
- After processing all Observations, write dataframe to ES, maintaining the ES index mapping
- If the explorerConfig is updated:
  - upsert the document in ES
  - signal the guppy & portal to read the new configuration
Facet Management: Category and Coding Extraction
While creating the "dataframe" we also need to discover and maintain a hierarchy of facets.
- Before reading incoming Observations, read the explorer_config ES index to retrieve the current explorer configuration.
- This index will contain a single document, which holds the explorerConfig needed by the front end.
- Specifically, maintain a "has changed" flag on that document.
- Changes to dataframer:
- In the case of multiple code values, select the well known code (e.g., LOINC, SNOMED) for normalization.
- Apply "string to variable name conversion" to normalize 'code' to column name.
- Maintain a "gitops" explorer fragment:
- see gen3 documentation
- Apply inflection.titleize to normalize 'category' name
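A minimal stdlib sketch of the two normalizations above; the document names inflection.titleize for categories, so the titleize below is an illustrative stdlib stand-in:

```python
import re

def code_to_column(display: str) -> str:
    """'String to variable name' conversion: Observation.code display -> column name."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", display.strip()).strip("_").lower()

def titleize(category: str) -> str:
    """Stdlib approximation of inflection.titleize for category names."""
    return " ".join(word.capitalize() for word in re.split(r"[-_\s]+", category.strip()))
```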
explorerConfig:

```yaml
{{ for all unique subject.resourceType }}
- tabTitle: <resourceType>             # e.g., Patient, Specimen, etc.
  charts: []                           # manually add charts
  filters:
    tabs:
      {{ for all categories }}         # e.g., 'Vital Signs', 'Laboratory', etc.
      - title: <category>
        fields: <codes from category>  # e.g., 'heart_rate', 'blood_pressure', etc.
        <scalars and extensions>
  table:
    enabled: true
    detailsConfig:                     # manually add detailsConfig
      fields:
        # all flattened references
        # all codes
```
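The template above could be rendered from the discovered facet hierarchy roughly as follows; build_explorer_config and the shape of facets are hypothetical, not a confirmed gen3 schema:

```python
def build_explorer_config(facets: dict) -> list:
    """Render the explorerConfig fragment from a discovered facet hierarchy.

    `facets` maps subject resourceType -> category -> list of code column names.
    Keys mirror the yaml template above; treat this as a sketch, not the
    definitive implementation.
    """
    config = []
    for resource_type, categories in facets.items():
        config.append({
            "tabTitle": resource_type,
            "charts": [],  # manually curated
            "filters": {
                "tabs": [
                    {"title": title, "fields": fields}
                    for title, fields in categories.items()
                ]
            },
            "table": {"enabled": True},
        })
    return config
```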
Notes: Out of scope or static elements in explorerConfig.
- We expect that the tabTitle corresponds 1:1 with a dataframer; as such, we expect that the entry in the explorer config will be initialized as the dataframer is developed. In other words, if the tabTitle is missing, the dataframer will initialize it with [detailsConfig, charts], etc.
Example: from Prostate_Microenvironments
Summary view of dataframe e.g. patient-centric
Changes to guppy config
```yaml
# Guppy configuration
guppy:
  enabled: true
  dbRestore: false
  indices:
    - index: observation
      type: observation
    - index: file
      type: file
    # added to support facet management
    - index: explorer_config
      type: explorerConfig
  configIndex: gen3.aced.io_array-config
```
End to End
Guppy PR link: uc-cdis/guppy#273
FF issue: ACED-IDP/gen3-frontend-framework#12
Existing work
```text
g3t meta dataframe --help

Usage: g3t meta dataframe [OPTIONS] [DIRECTORY_PATH] [OUTPUT_PATH]

  Render a metadata dataframe.

  directory_path: The directory path to the metadata.
  output_path: The output path for the dataframe. default [meta.csv]

Options:
  --dtale                         Open the graph in a browser using the dtale
                                  package for interactive data exploration.
  --data_type [Patient|Specimen|Observation|DocumentReference]
                                  Create a data frame for a specific data
                                  type.  [required]
  --debug
```

E.g.:

```text
g3t meta dataframe --data_type Observation META/ --dtale
```
Discussion points
- Independent of submission ("push") method
- Dependency on guppy PR (guppy/_restart)
- Responsibility for facet category delegated to Observation.category/.code, AKA "etl driven facet hierarchy"
- Does guppy's graphql support fetching a nested document (explorerConfig)?
- Does FEF support reading explorerConfig dynamically via graphql (packages/sampleCommons/config/aced/explorer.json)?
- If guppy/graphql does not support nested documents, can the FEF read the document dynamically via a url (public bucket)?
