Description
Use Cases:
- As a data engineer, I need to transform FHIR Observations to a flattened DataFrame with Observation Codes as columns for analysis and reporting purposes.
- As a UI developer, I need to fetch and display patient-specific observations from a FHIR server in a tabular format with observation codes as columns. Given there are many observations, I need an efficient way to group these facets into categories.
Example
References
Goals/Forces/Motivation:
The primary reason for using the pivot function is to reshape the data: it transforms data from long to wide format, which makes comparing different variables easier. This reshaping is fundamental in preparing datasets for analysis or visualization, as it allows a more structured and readable representation of the data.
Data pivoting is the process of transforming data from a long format (rows) to a wide format (columns), typically to make it more understandable or suitable for analysis.
Common scenarios for data pivoting include summarizing data, creating cross-tabulations, or presenting data in a more structured format for reporting purposes.
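The long-to-wide reshape described above can be sketched with pandas; the subjects, codes, and values below are made-up illustrations:

```python
import pandas as pd

# Long format: one row per (subject, code) observation.
long_df = pd.DataFrame([
    {"subject": "Patient/1", "code": "heart_rate", "value": 72},
    {"subject": "Patient/1", "code": "body_weight", "value": 81.5},
    {"subject": "Patient/2", "code": "heart_rate", "value": 64},
])

# Wide format: one row per subject, one column per observation code.
# Codes without a value for a subject become NaN.
wide_df = long_df.pivot(index="subject", columns="code", values="value").reset_index()
print(wide_df)
```

Note that `pivot` requires unique (index, columns) pairs; real Observation data with repeated measurements needs the temporal dimension in the index, as described later in this issue.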
See R's tidyverse
Tidy data refers to ‘rectangular’ data. These are the data we typically see in spreadsheet software like Google Sheets, Microsoft Excel, or in a relational database like MySQL, PostgreSQL, or Microsoft Access. The three principles for tidy data are:
- Variables make up the columns
- Observations (or cases) go in the rows
- Values are in cells
Put them together, and these three statements make up the contents in a tidy data frame or tibble. While these principles might seem obvious at first, many of the data arrangements we encounter in real life don’t adhere to this guidance.
In the context of FHIR, the Observation is the "tidy data", aka “indexed”, and has a defined, fixed schema.
The dataframe is the "wide" or "Cartesian" data (see ggplot2); its schema is not fixed, as the "variables" are defined by Observation.code.
Workflow
- Create dataframe and maintain ES dataframe mapping (g3t meta dataframe)
- Based on newly submitted Observations, maintain a hierarchy of facets for the dataframe - driven by the Observation.category and Observation.code
- Automatically update the explorer configuration while reading submitted data
- Signal the guppy service to restart after updating the ES mapping
- Signal the portal to read the new explorer configuration on change
Data Frame Creation:
Create an initial pandas DataFrame with columns for each unique observation code extracted from the observations.
Each row represents a set of observations, with their associated attributes (subject, encounter, value, effectiveDateTime, etc.), for a given subject, specimen, and focus at a given time.
Use observation codes as columns to ensure each code has its dedicated column in the DataFrame.
Denormalization by Observation.subject:
- Identify the subject resourceType attribute within each Observation resource, which represents the entity (e.g., patient, device, location) to which the observation applies.
- Normalize the value[x] attribute in the Observation resource, extracting the value based on the data type specified in the value[x] field (e.g., valueQuantity, valueCodeableConcept, valueString).
- Temporal Data Extraction:
- Extract the effectiveDateTime attribute from the Observation resource, representing the date and time of the observation.
- If there is no Observation.effectiveDateTime, include the onsetAge, or use the effectiveDateTime attribute from the specimen or focus resource, representing the temporal data as an age. If no temporal data is available, the field should be null.
- In any case, ensure that a new dataframe row is created for each corresponding temporal data.
- Retrieve the dataframe row from the working set, or create it
- If we need to create the row:
  - retrieve the subject, specimen and focus entities referenced in the Observation resource, expanding the DataFrame to include their scalar attributes and coding; prefix each attribute with the lower case resourceType, e.g. patient_*, specimen_*
  - create an identifier for the row based on the subject, specimen, focus, and temporal data (age), ensuring uniqueness
- If we need to create a column:
  - Add a new column to the data frame
  - Verify the new column exists in explorer_config, or should be added to it
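The flattening steps above can be sketched in Python; the function names and the dict-based FHIR handling are illustrative assumptions, not a confirmed implementation:

```python
import hashlib
from typing import Any, Optional

def normalize_value(observation: dict) -> Any:
    """Extract the polymorphic value[x] from a FHIR Observation (dict form)."""
    if "valueQuantity" in observation:
        return observation["valueQuantity"].get("value")
    if "valueCodeableConcept" in observation:
        concept = observation["valueCodeableConcept"]
        codings = concept.get("coding", [])
        return codings[0].get("display") if codings else concept.get("text")
    if "valueString" in observation:
        return observation["valueString"]
    return None  # no recognized value[x]

def extract_time(observation: dict, fallback: Optional[dict] = None) -> Optional[Any]:
    """Prefer Observation.effectiveDateTime, then onsetAge, then temporal
    data from the specimen/focus resource; null if nothing is available."""
    if "effectiveDateTime" in observation:
        return observation["effectiveDateTime"]
    if "onsetAge" in observation:
        return observation["onsetAge"].get("value")
    if fallback and "effectiveDateTime" in fallback:
        return fallback["effectiveDateTime"]
    return None

def row_identifier(subject: str, specimen: Optional[str] = None,
                   focus: Optional[str] = None, age: Optional[Any] = None) -> str:
    """Deterministic, unique row id from the (subject, specimen, focus, age) tuple."""
    key = "|".join(str(part) for part in (subject, specimen, focus, age))
    return hashlib.sha256(key.encode()).hexdigest()[:16]
```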
- After processing all Observations, write dataframe to ES, maintaining the ES index mapping
- If the explorerConfig is updated:
  - upsert the document in ES
  - signal the guppy & portal to read the new configuration
Facet Management: Category and Coding Extraction
While creating the "dataframe" we also need to discover and maintain a hierarchy of facets.
- Before reading incoming Observations, read the explorer_config ES index to retrieve the current explorer configuration.
- This index will contain a single document, which holds the explorerConfig needed by the front end.
- Specifically, maintain a "has changed" flag on that document.
- Changes to dataframer:
- In the case of multiple code values, select the well known code (e.g., LOINC, SNOMED) for normalization.
- Apply "string to variable name conversion" to normalize 'code' to column name.
- Maintain a "gitops" explorer fragment:
- see gen3 documentation
- Apply inflection.titleize to normalize 'category' name
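A minimal stdlib sketch of the two normalizations above; the document names inflection.titleize for categories, so the titleize below is an illustrative stdlib stand-in:

```python
import re

def code_to_column(display: str) -> str:
    """'String to variable name' conversion: Observation.code display -> column name."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", display.strip()).strip("_").lower()

def titleize(category: str) -> str:
    """Stdlib approximation of inflection.titleize for category names."""
    return " ".join(word.capitalize() for word in re.split(r"[-_\s]+", category.strip()))
```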
explorerConfig:

```yaml
{{ for all unique subject.resourceType }}
- tabTitle: <resourceType>             # e.g., Patient, Specimen, etc.
  charts: []                           # manually add charts
  filters:
    tabs:
      {{ for all categories }}         # e.g., 'Vital Signs', 'Laboratory', etc.
      - title: <category>
        fields: <codes from category>  # e.g., 'heart_rate', 'blood_pressure', etc.
        <scalars and extensions>
  table:
    enabled: true
    detailsConfig:                     # manually add detailsConfig
      fields:
        # all flattened references
        # all codes
```
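The template above could be rendered from the discovered facet hierarchy roughly as follows; build_explorer_config and the shape of facets are hypothetical, not a confirmed gen3 schema:

```python
def build_explorer_config(facets: dict) -> list:
    """Render the explorerConfig fragment from a discovered facet hierarchy.

    `facets` maps subject resourceType -> category -> list of code column names.
    Keys mirror the yaml template above; treat this as a sketch, not the
    definitive implementation.
    """
    config = []
    for resource_type, categories in facets.items():
        config.append({
            "tabTitle": resource_type,
            "charts": [],  # manually curated
            "filters": {
                "tabs": [
                    {"title": title, "fields": fields}
                    for title, fields in categories.items()
                ]
            },
            "table": {"enabled": True},
        })
    return config
```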
Notes: Out of scope or static elements in explorerConfig.
- We expect that the tabTitle corresponds 1:1 with a dataframer; as such, we expect that the entry in the explorer config will be initialized as the dataframer is developed. In other words, if the tabTitle is missing, the dataframer will initialize it with [detailsConfig, charts], etc.
Example: from Prostate_Microenvironments
Summary view of dataframe e.g. patient-centric
Changes to guppy config
```yaml
# Guppy configuration
guppy:
  enabled: true
  dbRestore: false
  indices:
    - index: observation
      type: observation
    - index: file
      type: file
    # added to support facet management
    - index: explorer_config
      type: explorerConfig
  configIndex: gen3.aced.io_array-config
```
End to End
Guppy PR link: uc-cdis/guppy#273
FF issue: ACED-IDP/gen3-frontend-framework#12
Existing work
```text
g3t meta dataframe --help

Usage: g3t meta dataframe [OPTIONS] [DIRECTORY_PATH] [OUTPUT_PATH]

  Render a metadata dataframe.

  directory_path: The directory path to the metadata.
  output_path: The output path for the dataframe. default [meta.csv]

Options:
  --dtale                         Open the graph in a browser using the dtale
                                  package for interactive data exploration.
  --data_type [Patient|Specimen|Observation|DocumentReference]
                                  Create a data frame for a specific data
                                  type.  [required]
  --debug
```

E.g.:

```text
g3t meta dataframe --data_type Observation META/ --dtale
```
Discussion points
- Independent of submission ("push") method
- Dependency on guppy PR (guppy/_restart)
- Responsibility for facet category delegated to Observation.category/.code, AKA "etl driven facet hierarchy"
- Does guppy's graphql support fetching a nested document (explorerConfig)?
- Does FEF support reading explorerConfig dynamically via graphql (packages/sampleCommons/config/aced/explorer.json)?
- If guppy/graphql does not support nested documents, can the FEF read the document dynamically via a url (public bucket)?
