data_engineering_cheatsheet/u.2.5.3 - cassandra at master · izner32/data_engineering_cheatsheet · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
// LESSON 1.1 - INTRO TO CASSANDRA
what is cassandra
    - highly scalable, high performacne, distributed nosql database
    - designed to handle huge amount of data across many commodity servers, providing hgih availability without a single point of failure
    - column oriented nosql database
    - fault tolerant meaning suppose there are 4 nodes in a cluster, here each node has a copy of same data, if one node is no longer serving then other three nodes can served as per request
    - supports acid property (atomicity, consistency, isolation and durability)
cassandra architecture: cassandra was designed to handle big data workload across multiple nodes without a single point of failure
    - in Cassandra, each node is independent and at the same time interconnected to other nodes. All the nodes in a cluster play the same role
    - every node in a cluster can accept, read, and write requests, regardless of where the data is actually located in the cluster
    - in the case of failure of one node, read/write requests can be served from other nodes in the network
components of cassandra
    node: place where data is stored
    data center: collection of related nodes
    cluster: component which contains one or more data centers
    commit log: crash-recovery mechanism, every write operation is written to the comit log
    mem-table: memory resident data structure. after commit log, the data will be written to the mem-table. Sometimes, for a single-column fmaily there wil be multiple mem-tables
    sstable: disk file to which the data is flushed from the mem-table when its contents reach a threshold value
    bloom filter: quick, nondeterministic, algorithms for testing whether an element is a member of a set. special kind of cache. bloom filters are accessed after every query
cql (cassandra query language)
    - used to access cassandra through its nodes
    - client can approach any of the nodes for their read-write operations, that node (coordinator) plays a proxy between the client and the nodes holding the data

    write operations
        write data -> commit log(inside memory disk) -> memtable(outside memory disk) -> flush to sstable(inside memory disk)
    read operations
        - cassandra gets values from the mem-table and checks the bloom filter to find the appropriate sstable which contains the required data
unfinished - data model

// LESSON 1.2 - CASSANDRA KEYSPACE: equivalent to database in rdbms
create keyspace
    CREATE KEYSPACE keyspace_name
    WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 3}; // WITH replicaton={'class':strategy name,'replication_factor': no_of_replications_on_different_nodes}
    [AND DURABLE_WRITES = false;] // default is set to true, just don't touch this at all
    strategies
        SimpleStrategy: used in the case of one data center
        NetworkTopologyStrategy: used in the case of more than one data centers
    replication factor: number of replicas of data places on different nodes
alter keyspace
    ALTER KEYSPACE keyspace_name
    WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 3}; // WITH replicaton={'class':strategy name,'replication_factor': no_of_replications_on_different_nodes}
    [WITH DURABLE_WRITES = false;]
drop keyspace
    DROP KEYSPACE keyspace_name

// UNFINISHED - INDEX - LESSON 1.3 - CASSANDRA TABLE INDEX
create table
    CREATE TABLE tablename(
        column1_name uuid PRIMARYKEY,  // column_name datatype constraint
        column2_name double,  // column_name datatype
        column3_name float
    )
alter table
    adding new column
        ALTER TABLE table_name
        ADD new_column_name text; // ADD new_column_name datatype;
drop table: remove the table as a whole
    DROP TABLE table_name
truncate table: all the rows of the table are all deleted permanently
    TRUNCATE table_name
create index

drop index

cassandra batch: used to execute multiple modification statements (insert, update, delete)
    BEGIN BATCH
        // enter multiple statements here
        INSERT INTO table_name(column_name1, column_name2) VALUES (1,2)
        UPDATE table_name SET column_name = 1 WHERE column_id = 1;
    APPLY BATCH

// LESSON 1.4 - CASSANDRA QUERY (CQL)
create data
    INSERT INTO table_name
    (column_name1, column_name2)
    VALUES (value_1, value_2);
read data
    SELECT column_name FROM table_name;
update data
    UPDATE keyspace_name.table_name
    SET column_name1 = value1,
        column_name2 = value2
    WHERE column_id = 1;
delete data
    DELETE FROM table_name
    WHERE column_name = value; // WHERE [condition]

// UNFINISHED - EXPLAIN EACH DATATYPE - LESSON 1.5 - CASSANDRA CQL TYPES
data types
    ascii
    bigint
    blob
    boolean
    counter
    double
    float
    frozen
    inet
    int
    list
    map
    set text
    timestamp
    timeuuid
    tuple
    uuid
    varchar
    varint