-
Notifications
You must be signed in to change notification settings - Fork 0
/
Snowflake ID.tex
484 lines (423 loc) · 15.3 KB
/
Snowflake ID.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
\documentclass{article}
\usepackage[english]{babel}
\usepackage[a4paper,top=2cm,bottom=2cm,left=3cm,right=3cm,marginparwidth=1.75cm
]{geometry}
\usepackage{amsmath}
\usepackage[utf8]{inputenc}
\usepackage{graphicx}
\usepackage[colorlinks=true, allcolors=blue]{hyperref}
\usepackage{xcolor}
\usepackage{titling}
\usepackage{float}
\usepackage{tikz}
\usepackage[RPvoltages]{circuitikz}
\usepackage{fancyvrb}
\usepackage{newverbs}
\addto\captionsenglish{
\renewcommand{\contentsname}%
{Table of Contents}%
}
\newcommand{\code}[1]{\colorbox{cverbbg}{\texttt{#1}}}
\newcommand{\hn}[0]{\hfill \\}
\definecolor{cverbbg}{gray}{0.93}
\newenvironment{lcverbatim}
{\SaveVerbatim{cverb}}
{\endSaveVerbatim{}
\flushleft\fboxrule=0pt\fboxsep=.5em
\colorbox{cverbbg}{%
\makebox[\dimexpr\linewidth-2\fboxsep][l]{\BUseVerbatim{cverb}}%
}
\endflushleft{}
}
\title{\textbf{Snowflake Id}}
\author{Finn Behrend}
\begin{document}
\begin{titlingpage}
\maketitle
\begin{center}
\vspace{50px}
Complete sheet and specification about Twitter's Snowflake Id
concept
\vfill
\footnotesize{Copyright \copyright{} 2023 --- Finn Behrend
(https://github.com/liquiddevelopmentnet)}
\end{center}
\end{titlingpage}
\tableofcontents
\pagebreak
\section{Introduction}
The `Snowflake' concept was originally invented by
\href{https://twitter.com}{Twitter} and is designed to generate short and
guaranteed unique Id\footnote{Short for \textbf{Identifier}, is a name
that
identifies (that is, labels the identity of) either a unique object or
a
unique
class of objects, where the `object' or class may be an idea, physical
countable object (or class thereof), or physical non-countable
substance
(or
class thereof).} numbers at high scale.
\subsection{Motivation}
Twitter invented the Snowflake concept because they were moving away from
\href{https://en.wikipedia.org/wiki/MySQL}{MySQL}\footnote{MySQL is an
open-source relational database management system. Its name is a
combination of
`My', the name of co-founder Michael Widenius's daughter My, and `SQL',
the
acronym for Structured Query Language.} and moved to
\href{https://en.wikipedia.org/wiki/Cassandra}{Cassandra}\footnote{Apache
Cassandra is
a free and open-source, distributed, wide-column store, NoSQL database
management system designed to handle large amounts of data across many
commodity servers, providing high availability with no single point of
failure.} because of scaling
reasons, so they needed a new way to generate unique Id's based on the fact
that Cassandra does and should not provide a sequential Id generation facility.
\section{Capacity}
A Snowflake Infrastructure is designed to withstand the following specifications with ease:
\begin{itemize}
\item 10k Id generations per second per process
\item Response rate of 2ms
\end{itemize}
In large data centers, machines generating Id's should not have to coordinate
with each other. That's what makes Snowflakes Concept so brilliant as computers
do not have to communicate with each other during or after the Id creation
process.
\section{Scheme}
Example of a Snowflake Id from
\href{https://discord.com/}{Discord}\footnote{Discord is a VoIP and instant
messaging social platform. Users have the ability to communicate with
voice
calls, video calls, text messaging, media and files in private chats or
as
part
of communities called `servers'.} we'll use throughout the sheet:
\code{555780371086704681} \\
\hn{}
A Snowflake Id consists of a 64-bit number (e.g.\ a \code{uint64}), depending
on
implementation they mostly are handled as strings to prevent integer overflows
in some languages. \\
\hn{}
Being a 64-bit integer makes the Snowflake Id's practical maximum
$2^{64} - 1$ (or $($FFFFFFFFFFFFFFFF$)_{16}$),
(see \nameref{snowflake_id_constraints}).
\section{Binary Breakdown}
A Snowflake Id is constructed by chaining binary data and then converting to
decimal. \\
\begin{figure}[H]
\large{\texttt{\color{cyan}000001111011011010000111010011011000011111
\color{red}00001 \color{green}00000
\color{gray}000000101001}}
\\
\caption{Binary Breakdown of the example
Id}\label{fig:breakdown_example}
\end{figure}
\begin{table}[h]
\begin{tabular}{l|c|l}
Field & Bits & Description
\\ \hline
Timestamp & \color{cyan}42 & Millisenconds since the
standard UNIX Epoch
\\ \hline
Internal worker Id & \color{red}5 & Usually incremented Id
of the
worker
\\ \hline
Internal process Id & \color{green}5 & Usually incremented Id
of the
process in
the worker
\\ \hline
Increment / Sequence & \color{gray}12 & For every Id generated
on
process, this number is incremented
\end{tabular}
\end{table}
\pagebreak
\section{Retrieving data from a Snowflake Id}
\subsection{Timestamp Field}
The timestamp field is the first field and has \code{42} bits in total. It
contains the UNIX Timestamp\footnote{Unix time is a date and time
representation widely used in computing. It measures time by the number
of
seconds or milliseconds that have elapsed since \texttt{00:00:00 UTC on
1
January 1970}, the beginning of the Unix epoch, less
adjustments made
due to
leap seconds.} when the Snowflake Id was generated. Implementations
like to
use
custom epochs to lower the timestamp at the beginning of use, and postpone the
`Year-2038' problem. Usually the first day of January in a Year is taken, (see \nameref{year_2038_problem}).
\begin{figure}[H]
\begin{lcverbatim}
long snowflake = 555780371086704681;
long customEpoch = 1420070400000;
// As corresponding example for the Snowflake Id,
// Discord's custom Epoch is used
long timestamp = (snowflake >> 22) + customEpoch;
\end{lcverbatim}
\caption{Example of timestamp retrieval in C with custom
epoch}\label{fig:r_timestamp_1}
\end{figure}
\begin{figure}[H]
\begin{lcverbatim}
long snowflake = 555780371086704681;
long timestamp = (snowflake >> 22);
\end{lcverbatim}
\caption{Example of timestamp retrieval in C without custom
epoch}\label{fig:r_timestamp_2}
\end{figure}
\subsection{Internal worker-/process Id Fields}
The `Internal worker Id' and `Internal process Id' fields are the second and
third fields and both hold 5 bits of data. \\
\hn{}
These fields hold simply what their names tell us, the former field holds the
unique Id of the worker machine that created the Snowflake, and the second
field holds the unique process Id living in that worker machine, both usually
incremented. \\
\begin{figure}[H]
\begin{lcverbatim}
long snowflake = 555780371086704681;
int workerId = (snowflake & 0x3E0000) >> 17;
\end{lcverbatim}
\caption{Example of worker Id retrieval in C}\label{fig:r_worker}
\end{figure}
\begin{figure}[H]
\begin{lcverbatim}
long snowflake = 555780371086704681;
int processId = (snowflake & 0x1F000) >> 12;
\end{lcverbatim}
\caption{Example of process Id retrieval in C}\label{fig:r_process}
\end{figure}
\pagebreak
\subsection{Increment Field}
The increment field, with consisting of 12 bits is the last chained data in the
Snowflake Id.\@ It is incremented for every Id that is generated on that
process.
\begin{figure}[H]
\begin{lcverbatim}
long snowflake = 555780371086704681;
int increment = snowflake & 0xFFF;
\end{lcverbatim}
\caption{Example of increment retrieval in C}\label{fig:r_inc}
\end{figure}
\subsection{Complete Implementation Example}
This is what a proper code retrieving all fields from a Snowflake Id may look
like: \\
\begin{figure}[H]
\begin{lcverbatim}
long snowflake = 555780371086704681;
long customEpoch = 1420070400000;
long timestamp = (snowflake >> 22) + customEpoch;
int workerId = (snowflake & 0x3E0000) >> 17;
int processId = (snowflake & 0x1F000) >> 12;
int increment = snowflake & 0xFFF;
\end{lcverbatim}
\caption{Example of proper retrieval implementation with custom epoch
in
C}\label{fig:r_complete_1}
\end{figure}
\begin{figure}[H]
\begin{lcverbatim}
long snowflake = 555780371086704681;
long timestamp = (snowflake >> 22);
int workerId = (snowflake & 0x3E0000) >> 17;
int processId = (snowflake & 0x1F000) >> 12;
int increment = snowflake & 0xFFF;
\end{lcverbatim}
\caption{Example of proper retrieval implementation without custom
epoch in
C}\label{fig:r_complete2}
\end{figure}
\pagebreak
\section{Generating Snowflake Id's}
\begin{figure}[H]
\centering
\resizebox{1\textwidth}{!}{%
\begin{circuitikz}
\tikzstyle{every node}=[font=\footnotesize]
\draw [ -Stealth] (9.75,10.75) -- (12.25,10.75);
\draw (8.5,10.75) circle (1cm) node {\Large User} ;
\draw (17.5,12.5) rectangle node {\normalsize Machine
1}
(20.75,11.5);
\draw (17.5,11.25) rectangle node {\normalsize
Machine 2}
(20.75,10.25);
\draw (17.5,10) rectangle node {\normalsize
Machine 3} (20.75,9);
\draw [, dashed] (12.5,10.25) rectangle node
{\normalsize Relay}
(14.75,11.25);
\draw (16.75,10.75) to[short, -*] (16.75,10.75);
\draw [](15,10.75) to[short] (16.75,10.75);
\draw [](16.75,10.75) to[short] (16.75,12);
\draw [](16.75,10.75) to[short] (16.75,9.5);
\draw [ -Stealth] (16.75,12) -- (17.25,12);
\draw [ -Stealth] (16.75,10.75) -- (17.25,10.75);
\draw [ -Stealth] (16.75,9.5) -- (17.25,9.5);
\end{circuitikz}
}%
\caption{Example Infrastructure of a Id generating
network}\label{fig:ex_infrastructure}
\end{figure}
Assuming we have a infrastructure as shown above, and a user is requesting a
registration that gets sent to a relay, which then relays the request of one of
our 3 machines (which generates a Id for that user), and let our Id generating
function look like this:
\begin{figure}[H]
\begin{lcverbatim}
static long increment_mask = -1 ^ (-1 << 12);
unsigned long snowflake(int workerId, int processId)
{
long timestamp = unix_time();
static int increment = -1;
increment++;
increment = increment & increment_mask;
// Mask to 12 bits to prevent overflow
return (timestamp << 22)
| (workerId << 17)
| (processId << 12)
| increment;
}
\end{lcverbatim}
\caption{Basic Example of a snowflake generating function in
C}\label{fig:ex_gen_function}
\end{figure}
\hn{}
As you can see, we have two arguments that can be passed into that function, a
\code{workerId} variable and a \code{processId} variable. Every worker machine
should have a different worker Id, while every process or thread of that worker
should have a different process Id (both usually incremented).\\
\hn{}
First, we are getting the current UNIX Timestamp. It is highly recommended to
use NTP\footnote{The Network Time Protocol (short \textbf{NTP}) is a networking
protocol for clock synchronization between computer systems over
packet-switched, variable-latency data networks.} to keep your system
clock
accurate. \\
\hn{}
Now we are defining a static integer to make a consistent incremental for each
call of the function, and increase it by 1 after initialisation. \\
\hn{}
At last, we are shifting the data into the right positions and return the final
Snowflake Id.\@
\pagebreak
\subsection{Practical Test}
I've let the first machine generate three snowflakes, two immediately after
each other and one delayed 2 seconds after the first two Id's.
\begin{figure}[H]
\begin{lcverbatim}
7046923580470329344
7046923580474523649
7046923588863131650
\end{lcverbatim}
\caption{The three Snowflake Id's that were
generated}\label{fig:output_1}
\end{figure}
\hn{}
Let's break them down. \\
\begin{figure}[H]
\large{\texttt{\color{cyan}11000011100101110110011101010010000010101
\color{red}00001 \color{green}00000
\color{gray}000000000000}}
\\
\caption{Binary Breakdown of the first Id}\label{fig:breakdown_1}
\end{figure}
\begin{figure}[H]
\large{\texttt{\color{cyan}11000011100101110110011101010010000010110
\color{red}00001 \color{green}00000
\color{gray}000000000001}}
\\
\caption{Binary Breakdown of the second Id}\label{fig:breakdown_2}
\end{figure}
\begin{figure}[H]
\large{\texttt{\color{cyan}11000011100101110110011101010101111100110
\color{red}00001 \color{green}00000
\color{gray}000000000010}}
\\
\caption{Binary Breakdown of the third Id}\label{fig:breakdown_3}
\end{figure}
\hn{}
As you can see, all Id's were generated on worker one and process zero (red and
green fields). \\
\hn{}
Also, the increment field properly incremented from one, to two, to three.
\\
\hn{}
Now, let's look at the timestamps (non-custom epoch):
\begin{figure}[H]
\begin{lcverbatim}
1680117507093
1680117507094
1680117509094
\end{lcverbatim}
\caption{The timestamps of the three Snowflake Id's that were
generated}\label{fig:breakdown_timestamps}
\end{figure}
As you can see, on the first two generations the timestamp only changed by
\code{1 ms} which is pure machine latency.
But looking at the difference from the second and third timestamp we see a
\code{2000 ms} difference, which is exactly what we wanted! \\
\hn{}
For a custom epoch, just subtract your epoch from the UNIX epoch and then
compile to a Snowflake.
\break{}
\section{The 2038 Problem}
\label{year_2038_problem}
The 2038 problem, also known as the Unix Y2K bug, refers to a limitation of the
Unix timestamp, which is used to represent time as a 32-bit signed integer, and
is set to overflow on \code{January 19, 2038}. This means that any software
relying on the Unix timestamp to represent dates beyond that point will
encounter errors or fail completely. \\
\hn{}
As we already know, a Snowflake is a 64-bit integer and a part of it consists
of a constantly growing UNIX Timestamp. Since the Snowflake is in fact able to
hold a 42-bits instead of just 32, the problem gets put off to \code{May 15,
2109} but, is still present. \\
\hn{}
One approach, is to use custom epochs which postpone the problem for a time. A
good choice for a custom epoch is a time before you started utilising Snowflake
Id's, since Snowflake Id's usually never require to have a timestamp that goes
back before the implementation time. \\
Of course, this is not the case if you already have and plan to migrate from an
old Id System, and want to convert Id's. \\
\hn{}
After all, those approaches only delay the problem, so its always good to have
it back in mind.
\section{Constraints}
\subsection{Timestamp Field}
The timestamp is a 42-bit integer, which means that it can hold a value from
\code{0} to \code{4398046511103} (or $2^{42} - 1$).
\subsection{Worker Id and Process Id Field}
The worker Id and process Id are both 5-bit integers, which means that they
can hold a value from \code{0} to \code{31}. \\
\hn{}
The worker Id is used to identify the machine that generated the Id, while the
process Id is used to identify the process or thread that generated the Id.\@
\subsection{Increment Field}
The increment is a 12-bit integer, which means that it can hold a value from
\code{0} to \code{4095}.
\subsection{Snowflake Id}
\label{snowflake_id_constraints}
The Snowflake Id is a 64-bit integer, which means that it can hold a value from
\code{0} to \code{18446744073709551615} (or $2^{64} - 1$).
\section{Sources}
\begin{small}
\begin{itemize}
\item
\url{https://github.com/twitter-archive/snowflake/tree/snowflake-2010#readme}
\item \url{https://en.wikipedia.org/wiki/Identifier}
\item \url{https://en.wikipedia.org/wiki/MySQL}
\item \url{https://en.wikipedia.org/wiki/Apache_Cassandra}
\item \url{https://en.wikipedia.org/wiki/Discord}
\item \url{https://en.wikipedia.org/wiki/Network_Time_Protocol}
\item \url{https://en.wikipedia.org/wiki/Unix_time}
\item \url{https://en.wikipedia.org/wiki/2038_problem}
\end{itemize}
\end{small}
\end{document}