Commit 518183a

Add blog post about OLAP book
1 parent db771f5 commit 518183a

2 files changed

+163
-0
lines changed

@@ -0,0 +1,163 @@
---
layout: post
title: Core Principles and Design Practices of OLAP Engines
author: Yiteng Xu, Yingju Gao, Manfred Moser
excerpt_separator: <!--more-->
image: /assets/blog/core-principles-olap-book.jpg
---

Yiteng Xu and Yingju Gao proudly announce their new book "Core Principles and
Design Practices of OLAP Engines" from China Machine Press. This is great news
for the Trino community, since the book is based on the open source project
Trino, specifically Trino 350. It took the two authors more than four years to
finish writing. All concepts and details are explained with a Trino flavor and
generalized to all OLAP engines. Let us walk through the chapters, and you will
find that the two authors dive deep into the source code and bring you many
treasures.

<!--more-->

## Author introduction

[Yiteng (Ivan) Xu](https://github.com/medsmeds) is a data security engineer who
currently uses Trino, Spark, and Calcite for SQL analysis. His work covers
various scenarios, including data warehouse metrics, SQL auto-rewriting, SQL
purpose detection, and the development of a SQL-based purpose-aware access
control system.

[Yingju (Gary) Gao](https://github.com/garyelephant) is an Apache SeaTunnel PMC
member and the lead of the time series database team. He currently serves as the
technical lead for the observability-engine team, and is responsible for
building the ecosystem for observability data, including metrics, traces, logs,
and events, providing a high-performance, high-throughput data pipeline from
ingestion to consumption, storage, querying, and data warehousing. Additionally,
he oversees metrics stability, multi-tenant access, and user requirement
integration.

Both authors are passionate about sharing their technical knowledge. They have
delved deep into source code and excel in technical writing, breaking down
complex underlying principles into a linear and comprehensible format for
readers. They firmly believe that sharing is a virtue and are committed to
continuing their technical contributions.

So now it is time to get the book, or read on for a walkthrough of the content:

<div class="card-deck spacer-30">
<a class="btn btn-pink" target="_blank"
href="https://item.m.jd.com/product/10136949561522.html">
Core Principles and Design Practices of OLAP Engines
</a>
</div>

## Walkthrough

Let's have a look at the different chapters in a high-level walkthrough.

### Part 1: Background knowledge

**Chapter 1**: Introduces the concept of OLAP (Online Analytical Processing) and
compares different engines such as Trino, Impala, Doris, and others.

**Chapter 2**: Provides a comprehensive introduction to the Trino engine,
covering its principles, architecture, enterprise use cases, compilation, and
execution. It also compares Trino with the Presto project and introduces the
SQL statements that are referenced throughout the book.

### Part 2: Core principles

**Chapter 3**: Offers an overview of the distributed SQL query process, serving
as a high-level introduction to the subsequent chapters.

**Chapter 4**: Begins with the generation of query execution plans, including
the transformation of SQL into abstract syntax trees, semantic analysis, and the
creation of initial logical plans. It then delves into the theoretical knowledge
of optimizers and the overall framework of the Trino optimizer.

### Part 3: Classic SQL

**Chapter 5**: Explains the generation and optimization of execution plans for
SQL statements involving only `TableScan`, `Filter`, and `Project` operations,
along with their scheduling and execution processes.

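
To give a feel for what a `TableScan`, `Filter`, and `Project` plan does, here
is a minimal Python sketch of the three operators chained together. It is not
taken from the book or from Trino: the function names, the `ORDERS` sample data,
and the row-at-a-time dictionaries are all assumptions made for illustration,
and a real engine works on batches of columnar data instead.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def table_scan(rows: Iterable[Row]) -> Iterator[Row]:
    """Leaf operator: emit every row of the source table."""
    yield from rows


def filter_rows(source: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Keep only the rows for which the WHERE predicate holds."""
    return (row for row in source if predicate(row))


def project(source: Iterable[Row], columns: List[str]) -> Iterator[Row]:
    """Keep only the columns named in the SELECT list."""
    return ({name: row[name] for name in columns} for row in source)


# Hypothetical sample data standing in for an `orders` table.
ORDERS: List[Row] = [
    {"orderkey": 1, "status": "O", "totalprice": 120.0},
    {"orderkey": 2, "status": "F", "totalprice": 80.0},
    {"orderkey": 3, "status": "O", "totalprice": 45.5},
]

# Roughly: SELECT orderkey, totalprice FROM orders WHERE status = 'O'
plan = project(
    filter_rows(table_scan(ORDERS), lambda row: row["status"] == "O"),
    ["orderkey", "totalprice"],
)
result = list(plan)
```

Each operator pulls rows from the one below it, so nothing is computed until
`list(plan)` drains the pipeline.
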
**Chapter 6**: Focuses on SQL statements with `Limit` and `Sort` operations,
detailing the generation and optimization of execution plans, as well as their
scheduling and execution.

**Chapter 7**: Introduces the basic principles of aggregate queries. It then
covers the generation and optimization of execution plans for grouped and
non-grouped aggregate SQL statements, along with their scheduling and execution
processes.

**Chapter 8**: Discusses SQL statements with count distinct and multiple
aggregate operations, explaining the generation and optimization of execution
plans, as well as their scheduling and execution. This includes the
`Scatter-Gather` model and `MarkDistinct` optimization. Finally, a complex SQL
statement is used to tie together the concepts from Chapters 5 to 8.

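
As a rough illustration of the two ideas named in Chapter 8, the following
Python sketch scatters values to workers by hash, lets each worker mark the
first occurrence of every value, and gathers the partial counts. This is only
one simplified reading of scatter-gather and of marking distinct rows; the
function names and the in-memory partitions are assumptions, not the book's or
Trino's implementation.

```python
from typing import Dict, Hashable, Iterable, Iterator, List, Tuple


def scatter(values: Iterable[Hashable], workers: int) -> List[List[Hashable]]:
    """Hash-partition values so equal values always reach the same worker."""
    partitions: List[List[Hashable]] = [[] for _ in range(workers)]
    for value in values:
        partitions[hash(value) % workers].append(value)
    return partitions


def mark_distinct(values: Iterable[Hashable]) -> Iterator[Tuple[Hashable, bool]]:
    """Attach a flag that is True only for the first occurrence of a value."""
    seen = set()
    for value in values:
        is_first = value not in seen
        seen.add(value)
        yield value, is_first


def worker_aggregate(partition: Iterable[Hashable]) -> Dict[str, int]:
    """Compute count(*) and count(distinct x) for one partition in one pass."""
    total = 0
    distinct = 0
    for _, is_first in mark_distinct(partition):
        total += 1
        if is_first:
            distinct += 1
    return {"count": total, "count_distinct": distinct}


def gather(partials: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Sum the partial results; valid because partitions hold disjoint values."""
    final = {"count": 0, "count_distinct": 0}
    for partial in partials:
        final["count"] += partial["count"]
        final["count_distinct"] += partial["count_distinct"]
    return final


values = ["a", "b", "a", "c", "b", "a"]
summary = gather(worker_aggregate(p) for p in scatter(values, workers=3))
```

Because the flag turns the distinct count into a plain conditional count, both
aggregates can share a single pass over the data.
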
### Part 4: Data exchange mechanism

**Chapter 9**: Introduces the overall concept of data exchange mechanisms and
how data exchange is incorporated during the query optimization phase via the
`AddExchanges` optimizer, along with the design principles for scheduling and
execution.

**Chapter 10**: Explains how tasks establish connections during the query
scheduling phase and the mechanisms for upstream and downstream data flow during
execution. It also covers the principles of intra-task data exchange, RPC
interaction mechanisms, and analyzes backpressure, Limit semantics, and
out-of-order request handling.

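
Backpressure is easier to picture with a small example. The sketch below uses a
bounded in-memory queue between a producer thread and a consumer, so the
producer has to wait whenever the buffer is full. It is a generic illustration
under stated assumptions (the `run_exchange` name, the integer "pages", and the
capacity are all hypothetical) and does not mirror how Trino's exchange is
actually built.

```python
import queue
import threading
from typing import List, Sequence, Tuple


def run_exchange(pages: Sequence[int], capacity: int) -> Tuple[List[int], int]:
    """Move pages from an upstream producer to a downstream consumer.

    Returns the pages in arrival order and the largest buffer size observed.
    """
    buffer: "queue.Queue[object]" = queue.Queue(maxsize=capacity)
    done = object()  # sentinel telling the consumer that no more pages follow

    def producer() -> None:
        for page in pages:
            buffer.put(page)  # blocks while the buffer is full: backpressure
        buffer.put(done)

    upstream = threading.Thread(target=producer)
    upstream.start()

    received: List[int] = []
    max_buffered = 0
    while True:
        max_buffered = max(max_buffered, buffer.qsize())
        item = buffer.get()
        if item is done:
            break
        received.append(item)  # type: ignore[arg-type]

    upstream.join()
    return received, max_buffered


received, max_buffered = run_exchange(list(range(10)), capacity=2)
```

However slow the consumer is, the buffer never holds more than `capacity`
items, which is the property a backpressure mechanism is meant to guarantee.
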
### Part 5: Plugin mechanisms and connectors

**Chapter 11**: Begins with an introduction to Trino's plugin system and SPI
mechanism, including plugin loading and the JVM's class loading principles. It
then dissects connectors, covering metadata modules, read modules, and pushdown
optimization, and provides in-depth insights into connector design.

**Chapter 12**: Uses the example-http connector to help readers understand
connector design and implements a simple data source using Python's Flask
framework.

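
For readers who want a feel for what such a data source looks like, here is a
minimal sketch that serves comma-separated rows over HTTP. Note the swap: the
book uses Flask, while this sketch uses Python's standard `http.server` module
so that it runs without extra dependencies. The `/numbers.csv` path, the sample
rows, and the helper names are assumptions for illustration, not the book's
code.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical table contents, one comma-separated row per line.
CSV_DATA = "1,alice\n2,bob\n"


class CsvHandler(BaseHTTPRequestHandler):
    """Serve one CSV resource; every other path returns 404."""

    def do_GET(self) -> None:
        if self.path != "/numbers.csv":
            self.send_error(404)
            return
        body = CSV_DATA.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/csv")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args: object) -> None:
        pass  # keep the example quiet


def fetch_csv_once() -> str:
    """Start the server on a free local port, fetch the CSV, then shut down."""
    server = HTTPServer(("127.0.0.1", 0), CsvHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        connection.request("GET", "/numbers.csv")
        payload = connection.getresponse().read().decode("utf-8")
        connection.close()
        return payload
    finally:
        server.shutdown()
        server.server_close()
```

Calling `fetch_csv_once()` returns the CSV text, which is the kind of payload a
connector that reads rows from a URL would then split into columns.
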
### Part 6: Function principles and development

**Chapter 13**: Provides an overview of Trino's function system, including
function types, lifecycle, and several function development methods. It delves
into the data structures and annotations related to functions and explains the
function registration and parsing process during semantic analysis.

**Chapter 14**: Focuses on how to write a user-defined function (UDF) in
practice. It covers annotation-based development methods for scalar functions,
as well as low-level development methods using `codeGen` or `methodHandle` APIs.
For aggregate functions, it introduces annotation-based development methods and
low-level methods where developers handle serialization and state on their own.

### Why Trino?

In 2020, one of the authors, Yiteng Xu, encountered a scenario at work where
data needed to be read from two Hive instances, each modified by different
internal teams. The company's infrastructure team attempted a simple solution by
registering virtual tables and using MapReduce for federated queries. However,
this approach proved inadequate for the agile analysis needs of data analysts,
with complex queries taking nearly 12 hours to complete. A single mistake in a
SQL statement meant an entire day was wasted.

Later, another team researched and adopted Presto (before Trino became
independent). By adapting the Hive engine at the connector level, they enabled
federated queries across the two Hive instances without data migration or
extensive code changes. Users only needed to be aware of a catalog prefix,
making the process incredibly convenient. The author later had the opportunity
to participate in the project and developed a strong interest in its source
code. The elegance of the open-source project, its plugin design, and the inner
workings of connectors and the Airlift framework sparked a deep curiosity,
leading the author on a journey of source code exploration. As the PrestoSQL
project was more active and receptive to developer feedback, the author chose to
continue following it as the Trino project when it emerged in late 2020.

## Get your copy

<div class="card-deck spacer-30">
<a class="btn btn-pink" target="_blank"
href="https://item.m.jd.com/product/10136949561522.html">
Get the book - Core Principles and Design Practices of OLAP Engines
</a>
</div>