33 changes: 32 additions & 1 deletion docs/source/concepts/pipeline_schema/index.rst
@@ -157,6 +157,17 @@ to show the breadth of possible specifications:
Each subdirectory must contain two files, where each is
in a tabular format."

.. _input_data_list_quirk:

.. warning::

   There is currently **one** exception to the "files or directories"
   description.
   Input data that come directly from the user (through an ``input_data.yaml``)
   are represented and passed to implementations as *lists* of file paths.
   We plan to change this in the future to make it consistent with other data
   specifications.

Data specifications are enforced by EasyLink;
a pipeline will fail if any data do not follow their specification.

@@ -497,6 +508,8 @@ EasyLink pipeline schema

.. image:: images/easylink_pipeline_schema.drawio.png

.. _input_datasets:

Input datasets
^^^^^^^^^^^^^^

@@ -505,11 +518,16 @@ A set of named datasets.
Each dataset contains observations recorded about (some) entities in the population of interest for analysis.

**Specification:**
A directory of files, where each file is in a tabular format.
A list of files, where each file is in a tabular format.
Each file's name identifies the name of that input dataset.
Each file may have any number of columns,
but one of them must be called “Record ID” and it must have unique values.

.. note::

   This is a **list** of files, not a directory.
   See :ref:`this note above <input_data_list_quirk>` for context.
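
The specification above can be illustrated with a small check.
This is only a sketch, not EasyLink's own validation code:
the function name ``check_input_datasets`` is hypothetical,
and it assumes for illustration that every file is a CSV,
which is just one possible tabular format.

```python
import csv
from pathlib import Path


def check_input_datasets(file_paths):
    """Check a *list* of CSV files against the input datasets specification.

    Hypothetical helper for illustration only; not part of EasyLink.
    Returns a mapping from dataset name (the file's stem) to row count.
    """
    # The specification calls for a list of files, not a single path.
    if not isinstance(file_paths, (list, tuple)):
        raise ValueError("expected a list of file paths, not a single path")

    summary = {}
    for path in map(Path, file_paths):
        with path.open(newline="") as f:
            reader = csv.DictReader(f)
            # Any number of columns is allowed, but "Record ID" is required.
            if "Record ID" not in (reader.fieldnames or []):
                raise ValueError(f"{path.name}: no 'Record ID' column")
            ids = [row["Record ID"] for row in reader]
        # "Record ID" must have unique values within each file.
        if len(ids) != len(set(ids)):
            raise ValueError(f"{path.name}: 'Record ID' values are not unique")
        # The file's name identifies the name of the input dataset.
        summary[path.stem] = len(ids)
    return summary
```

Passing a single directory path instead of a list raises an error here,
which mirrors the distinction the note above draws.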

**Example:**

.. list-table::
@@ -757,6 +775,19 @@ Pandas code dropping records with matching record IDs.
Note that if the default implementation is used,
input and output data specifications do not need to be checked.

Datasets
^^^^^^^^

**Interpretation:**
See :ref:`input datasets <input_datasets>`.

**Specification:**

Exactly the same as :ref:`input datasets <input_datasets>`, except that it is
a *directory* of files rather than a *list* of files.
This difference is a result of the current quirk that
:ref:`input datasets have a different kind of specification than other data dependencies <input_data_list_quirk>`.
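
To make the list-versus-directory difference concrete, the sketch below
turns a directory of dataset files into the equivalent list of file paths.
It is only an illustration: ``files_in_dataset_directory`` is a hypothetical
name, not an EasyLink function, and the sorted order is an arbitrary choice.

```python
from pathlib import Path


def files_in_dataset_directory(directory):
    """Return the files directly inside a datasets directory as a sorted list.

    Hypothetical helper for illustration only; not part of EasyLink.
    """
    directory = Path(directory)
    if not directory.is_dir():
        raise ValueError(f"{directory} is not a directory")
    # Each file in the directory is one dataset; subdirectories are ignored.
    return sorted(p for p in directory.iterdir() if p.is_file())
```

The resulting list has the same shape as the list of paths that
implementations currently receive for input datasets.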

New clusters
^^^^^^^^^^^^
