Data Modeling Process
Understand the end-to-end data modeling process in Apache Cassandra, including conceptual modeling, application workflow analysis, and logical and physical design. Learn to create efficient schemas based on query patterns, leveraging denormalization and partitioning for balanced data distribution and high performance. The lesson covers core principles for optimizing read/write efficiency and maintaining scalability in distributed environments.
Data modeling in RDBMS vs. Apache Cassandra
In a traditional RDBMS, data modeling is entity-driven and table-centric. Normalized tables hold the data, with foreign keys referencing related data in other tables. Query performance depends on the organization and structure of the tables and on the use of table joins. Referential integrity is enforced by the database.
In contrast, Cassandra’s data modeling is query-driven and query-centric. A table is designed to fulfill a query or a set of queries. Cassandra does not support table joins, so each query must access only a single table, which makes reads very fast. Cassandra tables are therefore denormalized and contain all the data (one or more entities) that a query requires. When multiple queries target the same entity, each backed by a separate table, the entity’s data is duplicated across those tables, as the sketch below illustrates.
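A minimal sketch of this pattern, using hypothetical table and column names: the same user entity is stored twice, with each table shaped to answer one query by its own key.

```sql
-- Hypothetical schema: the same user entity duplicated across two
-- tables, each keyed for a different query.

-- Query 1: fetch a user by username.
CREATE TABLE users_by_username (
    username text,
    email    text,
    age      int,
    PRIMARY KEY ((username))
);

-- Query 2: fetch a user by email. Same entity data, different key.
CREATE TABLE users_by_email (
    email    text,
    username text,
    age      int,
    PRIMARY KEY ((email))
);
```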
| Relational Databases | Apache Cassandra |
| --- | --- |
| Relational data modeling methodology | Cassandra data modeling methodology |
| Entity-driven | Query-driven |
| Table-centric | Query-centric |
| Table joins and referential integrity (RI) | Denormalization: no joins, no RI |
| PK (primary key) for uniqueness | PK for partitioning, uniqueness, and ordering |
| Often a SPOF (single point of failure) | Distributed architecture: no SPOF |
| ACID compliant | Governed by the CAP theorem |
Cassandra excels at high write throughput, providing nearly uniform efficiency for all write operations. Moreover, disk space is a far cheaper resource than CPU, memory, or network. Apache Cassandra therefore uses denormalization and data duplication, trading additional writes for more efficient reads, since read operations are typically costly and harder to optimize.
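The application itself performs those extra writes to keep duplicated tables consistent. One common approach, shown here against the hypothetical tables sketched earlier, is a logged batch that makes the duplicate inserts atomic:

```sql
-- One logical write becomes two physical writes, one per query
-- table; the logged batch ensures both are eventually applied.
BEGIN BATCH
    INSERT INTO users_by_username (username, email, age)
    VALUES ('alice', 'alice@example.com', 34);
    INSERT INTO users_by_email (email, username, age)
    VALUES ('alice@example.com', 'alice', 34);
APPLY BATCH;
```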
Apache Cassandra data modeling goals
To design a successful schema in Apache Cassandra, the following high-level goals must be kept in mind:
Even distribution of data across the cluster
Rows of a Cassandra table are partitioned and distributed across the nodes in the cluster based on the hash of the partition key. By spreading data evenly, each node in the cluster is responsible for an equal portion of the data, resulting in load balancing. This ensures optimal performance and prevents individual nodes from becoming overwhelmed with a disproportionately large amount of data. Additionally, even data distribution allows the workload to be distributed evenly, resulting in faster response times and increased throughput.
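The placement of each row can be inspected with CQL’s built-in token() function, which exposes the hash value that drives this distribution (shown here against the hypothetical users_by_username table from earlier):

```sql
-- Each partition key hashes to a token (Murmur3 by default);
-- the token determines which replica nodes own the partition.
SELECT username, token(username)
FROM users_by_username;
```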
Even data distribution also enables the system to scale seamlessly: as new nodes join the cluster, they take over an even share of the existing data and workload.
A table’s partition key plays a crucial role in achieving even data distribution across the cluster. Choosing a suitable partition key requires careful consideration of the data access patterns, query requirements, and cardinality of the data. A good practice is to select a partition key that provides a good distribution of values and avoids data skew, where certain partitions receive significantly more data than others.
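As an illustrative sketch using hypothetical sensor-data tables: keying time-series readings by sensor alone produces hot, ever-growing partitions, while adding a time bucket to the partition key spreads the same data evenly.

```sql
-- Skewed design: one unbounded partition per sensor; busy
-- sensors become hot spots.
CREATE TABLE readings_by_sensor (
    sensor_id    text,
    reading_time timestamp,
    value        double,
    PRIMARY KEY ((sensor_id), reading_time)
);

-- Better distribution: a day bucket in the partition key caps
-- partition size and spreads writes across many more partitions.
CREATE TABLE readings_by_sensor_day (
    sensor_id    text,
    day          date,
    reading_time timestamp,
    value        double,
    PRIMARY KEY ((sensor_id, day), reading_time)
);
```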
Minimum number of partitions accessed by a query
This goal is aimed at optimizing read operations. In Cassandra, each table’s data is distributed across the cluster nodes in partitions based on the partition key. Each partition represents a unit of data storage and can contain one or more rows. A query that reads from a single partition retrieves all of its data from one set of replica nodes, whereas a query spanning many partitions must contact more nodes and coordinate more work, making it slower and more expensive.
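Continuing the hypothetical sensor schema above, a query that supplies the full partition key touches exactly one partition:

```sql
-- Single-partition read: all matching rows are stored together
-- on the same replicas, so no cluster-wide scatter-gather occurs.
SELECT reading_time, value
FROM readings_by_sensor_day
WHERE sensor_id = 'sensor-42' AND day = '2024-06-01';
```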