
Commit 9b7bcca

Starting to split off the state of the data stream into its own class
hierarchy.
1 parent c9d643b commit 9b7bcca

File tree

8 files changed: +986, -686 lines

docs/data_ingestion.rst

Lines changed: 71 additions & 0 deletions
@@ -37,3 +37,74 @@ Two of the new format data readers are the ``python``, ``SMILES``, and

Several of these readers (SMILES and
:ref:`HDF5<sec:hdf5_data_reader>`) support the use of :ref:`sample
lists<sec:sample-lists>`.

"Really New" Data Subsystem
---------------------------
44+
During execution LBANN will ingest one or more streams of data. There
45+
will be unique streams of data for each execution mode:
46+
- training
47+
- validation
48+
- tournament
49+
- testing
50+
- inference
51+

Note that execution modes should become more flexible and should be
able to be arbitrarily named.

The data stream object is responsible for keeping track of the
"count" / state of that data stream for that execution context. For
bounded / batched data streams this is the current position within
the stream (the index) and the total number of passes over the
stream (the epoch).

For infinite streams the object will just maintain the index /
position within the stream.

In both cases the object must also track the "step" size (i.e., the
mini-batch size). Additionally, because the data stream will be
accessed in parallel, it is necessary to track each rank's offset
within the stream.
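
The per-rank bookkeeping above amounts to simple strided arithmetic:
within one mini-batch, each rank starts at its own offset and strides
by the number of ranks. The ``rank_view`` name and layout below are
an assumed illustration of that arithmetic, not LBANN's data layout:

```cpp
#include <cstdint>

// Hypothetical sketch: with num_ranks ranks reading one stream in
// parallel, rank r fetches samples r, r + num_ranks, ... of the
// current mini-batch (the "step").
struct rank_view {
  std::uint64_t stream_index;     // start of the current mini-batch
  std::uint64_t mini_batch_size;  // the step size
  std::uint64_t rank;             // this rank's offset in the batch
  std::uint64_t num_ranks;

  // How many samples of each mini-batch this rank fetches.
  std::uint64_t samples_per_rank() const {
    return mini_batch_size / num_ranks;
  }

  // Global position of the i-th sample this rank fetches from the
  // current mini-batch.
  std::uint64_t sample_position(std::uint64_t i) const {
    return stream_index + rank + i * num_ranks;
  }
};
```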

..
   Data source class file: The data source class tracks the stateful
   aspects of one logical stream of data. Data sources are either
   bounded or infinite data sources. The class is responsible for
   keeping track of state with respect to

   Sample list:

   Tracks how to retrieve a data set from the outside world. This is
   typically a set of file locations for each sample as well as a
   count of how many samples are in the set.

   Data coordinator:

   Responsible for managing one or more data streams for each
   execution context. It is

   Data reader / loader:

   Function to ingest bits from the outside world and place them into
   an in-memory object that is managed by the data coordinator.

   Data store:

   In-memory data repository for holding samples that have been read
   in.

   io_data_buffer:

   Holds the sample being fetched, or the future of it.

   Data packer:

   Copies data fields from Conduit nodes and maps them to Hydrogen
   matrices. Specific to a data set.

   Data Set:

   Composed of:

   - data reader
   - data stream
   - sample list
   - data packer
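
One piece of the notes above, the ``io_data_buffer`` that holds "the
sample being fetched, or the future of it", maps naturally onto
``std::future``. This is a hypothetical sketch under that assumption,
not LBANN's implementation:

```cpp
#include <future>
#include <utility>
#include <vector>

// Hypothetical sample type: a flat vector of values.
using sample = std::vector<float>;

// Hypothetical sketch: the buffer owns a future for an in-flight
// fetch; get() blocks until the reader has produced the sample.
class io_data_buffer {
public:
  void fetch_async(std::future<sample> f) { m_pending = std::move(f); }
  sample get() { return m_pending.get(); }

private:
  std::future<sample> m_pending;
};
```

A reader thread (e.g. launched with ``std::async``) can hand its
future to the buffer, letting the consumer overlap fetching with
computation.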
