book/00-introduction/00-purpose.md
14 additions & 9 deletions
@@ -2,15 +2,20 @@
2
2
title: Purpose
3
3
---
4
4
5
-
#
6
-
This book is a comprehensive guide to *DataJoint for Python* — a framework for building reliable, efficient, and scalable scientific data workflows, a practice we call **Scientific Operations (SciOps)**.
5
+
## This Book
6
+
This book is a comprehensive guide to *DataJoint for Python*, a framework for building reliable and scalable scientific data workflows.
7
+
It is designed for collaborative, data-intensive research, where the complexity of data and computations requires a principled approach.
8
+
9
+
DataJoint was developed to overcome the limitations of managing research with scripts, spreadsheets, and complex folder structures—an approach that is often slow, error-prone, and difficult to scale.
10
+
The value of a more rigorous framework was demonstrated in ambitious projects like **MICrONS (Machine Intelligence from Cortical Networks)**[@10.1038/s41586-025-08790-w]. A nine-year effort to map a piece of the brain, MICrONS generated a deluge of data from electron microscopy, neurophysiology, and animal behavior. A project of this scale and complexity would have been intractable with traditional methods. DataJoint, already a mature framework by the start of the project, proved essential for managing the data pipeline and enabling a large team to collaborate effectively.
11
+
7
12
8
13
Most research begins with ad-hoc processes, managing data with scripts, spreadsheets, and complex folder structures. This approach often proves slow, error-prone, and difficult to scale, especially in collaborative projects. This challenge became starkly apparent during my work on **MICrONS (Machine Intelligence from Cortical Networks)**[@10.1038/s41586-025-08790-w], a nine-year effort to map a piece of the brain that generated a deluge of data from electron microscopy, neurophysiology, and animal behavior.
9
14
10
15
The traditional methods simply collapsed under this complexity. This experience led directly to the creation of **DataJoint**, a tool designed to bring the rigor of relational databases to the dynamic and evolving world of scientific research.
11
16
12
-
```{note}
13
-
DataJoint builds upon the relational database model but introduces a crucial innovation: it treats computational dependencies as a first-class feature. This allows you to define, execute, and reproduce entire data processing pipelines with precision and efficiency.
17
+
```{admonition} Key Innovation
18
+
DataJoint builds upon the relational database model but introduces a crucial innovation: it treats computational dependencies as a first-class feature. This allows you to define, execute, and reproduce entire data processing pipelines with precision and efficiency. @10.48550/arXiv.1807.11104
14
19
```
15
20
16
21
By integrating data storage, processing, and analysis into a unified system, DataJoint empowers you to describe not just the structure of your data, but the full sequence of computations that transform raw inputs into meaningful results. This formalized approach eliminates the need for ad-hoc scripting and manual data wrangling, ensuring every step is transparent, reproducible, and easy to manage.
@@ -20,28 +25,28 @@ Throughout this book, our goal is to learn how to implement rigorous Scientific
20
25
21
26
Recognizing the need for more structured approaches in research, we recently partnered with other neuroinformatics leaders to define a roadmap for enhancing operations in neuroscience projects. This roadmap is designed to guide research teams from ad-hoc processes toward automated and scalable collaborations, enabling them to tackle more significant problems. The ultimate goal is to achieve closed-loop studies that seamlessly integrate human ingenuity with AI efficiency [@10.48550/arXiv.2401.00077].
22
27
23
-
### A Principled Approach to Scientific Data
28
+
## A Principled Approach to Scientific Data
24
29
25
30
Programming is often seen as a way to communicate with machines, but it is more importantly the art of thinking clearly and communicating precisely with other people. The primary goal is to write code that humans can easily read, understand, and extend, especially in dynamic, collaborative projects.
26
31
27
32
In database design, this clarity is paramount. The structure of the data and its integrity constraints must reflect the logic of the problem you are solving. DataJoint is designed with simplicity and clarity in mind, helping teams manage shared data workflows where analysis pipelines and project priorities rapidly evolve. It acts as a foundational building block for transforming research labs into efficient data generation machines, helping guide teams from ad-hoc processes toward automated and scalable collaborations. This book provides the foundational database skills to build that ladder, moving your research from fragile scripts to a robust, queryable, and collaborative scientific enterprise.
28
33
29
-
### Who This Book Is For
34
+
## Who This Book Is For
30
35
31
36
This book provides an accessible introduction to relational database programming for data science and research applications, such as neuroscience and machine learning. It is designed to help scientists and engineers build a solid understanding from scratch. While proficiency in Python is assumed, no prior experience with databases is required.
32
37
33
-
### Learning DataJoint and SQL
38
+
## Learning DataJoint and SQL
34
39
35
40
**SQL (Structured Query Language)** is the standard language for managing relational databases. DataJoint is built on the same relational theory but uses a modern Python-based syntax. DataJoint statements are automatically converted into SQL, combining the power of relational databases with the convenience of Python.
36
41
37
42
While you can become proficient in relational concepts using DataJoint without ever writing SQL directly, this book aims to be a comprehensive introduction to databases. Therefore, we will teach the equivalent SQL concepts and syntax alongside DataJoint, with executable examples for both. You will not only learn DataJoint but also gain a solid foundation in SQL programming.
38
43
39
-
### The Role of AI and Neuroscience
44
+
## The Role of AI and Neuroscience
40
45
41
46
As of 2025, AI assistance has become a transformative force in programming. This book will explore the impact of AI on database schema design, computation, and queries, as these core elements are poised for significant evolution in AI-infused environments.
42
47
43
48
Furthermore, DataJoint has its roots in systems neuroscience, and many examples in this book are drawn from that field. However, these examples are chosen to illustrate broader principles and techniques that can be adapted to any computationally intensive discipline.
44
49
45
-
### Contributions
50
+
## Contributions
46
51
47
52
I, Dimitri Yatsenko, am the principal author and editor of this book, which incorporates text from prior documentation written by our broader team. I welcome your contributions, whether as a reviewer or a contributor. All contributions will be gratefully acknowledged. Please feel free to contact me directly or submit an issue in the book's GitHub repository.
## Academic Origins: From Lab Tool to Open-Source Project
15
13
16
-
# Early Work at BCM
17
-
In the summer of 2008, I joined Dr. Andreas Tolias's new lab at Baylor College of Medicine's Department of Neuroscience.
18
-
The lab was focused on complex neurophysiology experiments.
19
-
A group of students and postdocs, including Alex Ecker, Philipp Berens, Andreas Hoenselaar, and R. James Cotton, had already started a MATLAB-based library called "Steinbruch," which used MySQL to link data through computational dependencies.
14
+
The story of DataJoint begins in 2008 in Dr. Andreas Tolias's neuroscience lab at Baylor College of Medicine. The lab was tackling complex neurophysiology experiments, and a group including **Alex Ecker**, **Philipp Berens**, **Andreas Hoenselaar**, and **R. James Cotton** had already created a MATLAB library called "Steinbruch" that used MySQL to manage data.
20
15
21
-
In the fall of 2009, I started thinking about how relational database principles could be adapted for scientific data analysis.
22
-
I wanted to create a database system that could naturally reflect the complexities of a scientific study and be easy for a research team to use.
23
-
This led me to develop the first version of DataJoint for MATLAB, which I used in my neurophysiology experiments.
16
+
Building on these ideas, I began to formalize a new approach in the fall of 2009. My goal was to design a system based on rigorous relational principles that could naturally represent a scientific workflow, with native support for computational dependencies—a feature missing from mainstream database models. This led to the first version of DataJoint for MATLAB, which I developed and applied to my own neurophysiology experiments.
24
17
25
-
My goal was to design a database system based on strong principles, with a focus on data integrity and reliable transaction processing, with native support for computational dependencies and orchestration.
26
-
I noticed that mainstream database models lacked computational dependencies, so I created DataJoint to fill that gap.
18
+
By 2011, with the help of early adopters **Manolis Froudarakis** and **Jacob Reimer**, DataJoint was fully integrated into our lab's workflow. Recognizing its potential, **Dr. Tolias** supported its broader release, and I launched DataJoint as an open-source project on Google Code. By the time I defended my Ph.D. in 2014, the framework was already in use at research institutions worldwide.
27
19
28
-
By 2011, DataJoint was fully integrated into our lab's workflow, thanks to the early adopters Manolis Froudarakis and Jacob Reimer.
29
-
Dr. Tolias recognized its potential and supported its use, which led me to release DataJoint as an open-source project on [Google Code](https://code.google.com/archive/p/datajoint/).
20
+
Although I had started a Python version in 2011, it gained serious momentum between 2014 and 2015 when lab members **Edgar Y. Walker** and **Fabian Sinz** joined the effort to create a full Python package. The Tolias lab's participation in the IARPA **MICrONS project**[^1] further validated DataJoint's approach. Its ability to manage the work of a large, multidisciplinary team proved essential and significantly boosted its adoption in the scientific community.
30
21
31
-
# Gaining Recognition
32
-
By 2014, the year I defended my Ph.D. thesis, DataJoint had already spread beyond the lab and was used in research institutions worldwide.
22
+
[^1]: MICrONS Consortium et al., *Functional connectomics spanning multiple dimensions of mouse visual cortex*, Nature **(2025)**, [https://www.nature.com/immersive/d42859-025-00001-w/index.html](https://www.nature.com/immersive/d42859-025-00001-w/index.html)
33
23
34
-
Although I had started working on a Python version of DataJoint in 2011, significant progress was made when two other lab members, Edgar Y. Walker and Fabian Sinz, worked with me from 2014 to 2015 to create a full Python package.
24
+
***
35
25
36
-
My work at BCM coincided with the Tolias lab's participation in the [IARPA MICrONS project](https://www.iarpa.gov/research-programs/microns), which aimed to devise new forms of machine intelligence by learning the structure and function of biological neural networks. DataJoint’s ability to manage a large, multidisciplinary team made it an essential tool, further boosting its adoption.
26
+
## Commercialization: From Consulting to a Collaborative Platform
37
27
38
-
# Consulting Business
39
-
In 2016, four members of the Tolias lab (Dimitri Yatsenko, Jacob Reimer, Edgar Y. Walker, and Andreas Tolias) started a company, Vathes LLC, to provide consulting and data engineering services around DataJoint to research labs.
40
-
This was in response to a DARPA-funded effort to commercialize neuroscience data tools.
41
-
In 2017, Vathes received a Phase I SBIR grant to explore the commercial potential of DataJoint. Edgar and Dimitri split their time between managing the company operations and their ongoing academic work.
28
+
As DataJoint's user base grew, so did the need for professional support. In 2016, **Dimitri Yatsenko**, **Jacob Reimer**, **Edgar Y. Walker**, and **Andreas Tolias** founded **Vathes LLC** to provide data engineering and consulting services to research labs, spurred by a DARPA initiative to commercialize neuroscience tools.
42
29
43
-
By 2018, Vathes had added key members: Shan Shen, Thinh Nguyen, Chris Turner, and Raphael Guzman, who played crucial roles in developing DataJoint further and integrating it into the workflows of large labs that became our customers. Our growth, collaborations, and new team members helped shape our approach to data-driven projects. Collaborations with Prof. Carlos Brody and Prof. Karel Svoboda also significantly increased DataJoint's use.
30
+
In 2017, Vathes received a Phase I SBIR grant, allowing Edgar and me to balance company operations with our academic work. The team expanded in 2018 with key members **Shan Shen**, **Thinh Nguyen**, **Chris Turner**, and **Raphael Guzman**, who were crucial in developing the framework and integrating it into large-scale lab workflows. Collaborations with **Prof. Carlos Brody** and **Prof. Karel Svoboda** also significantly increased its use.
44
31
45
-
2020 was a pivotal year for us. With a significant 5-year NIH grant, we began developing [DataJoint Elements](https://datajoint.com/docs/elements)—a set of reference implementations for DataJoint pipelines in neurophysiology studies. This period also brought key leadership changes, including the addition of Dr. Kabilar Gunalan, who played a vital role in advancing the DataJoint Elements initiative.
32
+
The year 2020 was pivotal. A major 5-year NIH grant enabled the development of **DataJoint Elements**—a collection of curated data pipelines for neuroscience. This initiative was advanced by **Dr. Kabilar Gunalan**, who joined the team and played a vital leadership role. In 2021, we rebranded the company as **DataJoint** to reflect our core product and shifted our focus to commercial technology for research collaboration, at which point I transitioned to a full-time role as CEO.
46
33
47
-
In 2021, we rebranded the company as **DataJoint** to better align with our core product, shifting our focus toward commercial technology for research collaboration. During this time, I transitioned to a full-time role as CEO.
34
+
In 2022, a Phase II SBIR commercialization grant from the NIH funded the development of our online collaborative platform: the **DataJoint Platform**. **Monty Kosma**, who joined that year, spearheaded the platform's development, eventually becoming a co-founder and President and guiding the company's transformation into a product-focused enterprise.
48
35
49
-
# The DataJoint Platform
36
+
The **DataJoint Platform** officially launched in 2024, empowering its first cohort of labs. In the fall of that year, the company entered a new phase of growth with the addition of **Jim Olson**, former CEO of the data operations platform Flywheel.io. Jim's appointment as CEO in December 2024 brought fresh strategic vision to scale the company's impact.
50
37
51
-
In 2022, we were awarded a Phase II SBIR commercialization grant from the NIH to develop an innovative online collaborative platform: DataJoint Works. This marked a significant milestone in our journey, enabling us to expand our capabilities beyond traditional consulting services. Monty Kosma, who joined the company that year, spearheaded the development of the platform, bringing his vision and leadership to the forefront. Monty's contributions were instrumental, and he later became a co-founder and President, guiding DataJoint's transformation from a consulting firm into a product-focused enterprise.
38
+
In August 2025, DataJoint Inc. raised its seed round, bringing in venture capital and bolstering its vision to accelerate the platform's development.
52
39
53
-
The **DataJoint Works** platform officially launched in 2024, empowering the first cohort of labs to seamlessly operate their experiments using its robust and intuitive features. This milestone demonstrated the potential of DataJoint Works to revolutionize how scientific data is managed and shared, setting a strong foundation for its adoption across the research community.
54
-
55
-
In the fall of 2024, the company entered a new phase of growth with the addition of Jim Olson, a seasoned executive with a proven track record. Jim had previously served as CEO of Flywheel.io, a leading data operations platform, where he established himself as a visionary leader in the field. Joining DataJoint as CEO in December 2024, Jim brought fresh perspectives and strategic insights, positioning the company to scale its impact and reach new heights in the years ahead.
56
-
57
-
Today, DataJoint exemplifies a harmonious blend of community-driven open-source development and a powerful online platform for hosting and managing DataJoint pipelines. This dual approach ensures that our tools remain accessible, reliable, and continuously evolving, while providing researchers with a secure, collaborative environment to advance their work.
40
+
Today, DataJoint embodies a unique blend of community-driven open-source development and a powerful online platform. This dual approach ensures our tools remain accessible and continuously evolving while providing researchers with a secure, collaborative environment to advance their work.