Data Modeling Process
Understand the end-to-end data modeling process in Apache Cassandra, including conceptual modeling, application workflow analysis, and logical and physical design. Learn to create efficient schemas based on query patterns, leveraging denormalization and partitioning for balanced data distribution and high performance. The lesson covers core principles for optimizing read/write efficiency and maintaining scalability in distributed environments.
Data modeling in RDBMS vs. Apache Cassandra
In a traditional RDBMS, data modeling is entity-driven and table-centric. Normalized tables hold the data, with foreign keys referencing related data in other tables. Query performance depends on the organization and structure of the tables and on the use of table joins. Referential integrity is enforced by the database.
In contrast, Cassandra’s data modeling is query-driven and query-centric. A table is designed to fulfill a query or a set of queries. Cassandra does not support table joins, so each query must access only a single table, which makes reads very fast. Cassandra tables are therefore denormalized and contain all the data (one or more entities) that a query requires. When multiple queries target the same entity, each backed by a separate table, the entity’s data is duplicated across those tables, as the sketch below illustrates.
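A minimal sketch of this pattern, using hypothetical table and column names: the same user entity is stored twice, with each table shaped to answer one query by its own key.

```sql
-- Hypothetical schema: the same user entity duplicated across two
-- tables, each keyed for a different query.

-- Query 1: fetch a user by username.
CREATE TABLE users_by_username (
    username text,
    email    text,
    age      int,
    PRIMARY KEY ((username))
);

-- Query 2: fetch a user by email. Same entity data, different key.
CREATE TABLE users_by_email (
    email    text,
    username text,
    age      int,
    PRIMARY KEY ((email))
);
```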
| Relational Databases | Apache Cassandra |
| --- | --- |
| Relational data modeling methodology | Cassandra data modeling methodology |
| Entity-driven | Query-driven |
| Table-centric | Query-centric |
| Table joins and referential integrity (RI) | Denormalization: no joins, no RI |
| PK (primary key) for uniqueness | PK for partitioning, uniqueness, and ordering |
| Often a SPOF (single point of failure) | Distributed architecture: no SPOF |
| ACID compliant | Governed by the CAP theorem |
Cassandra excels at high write throughput, providing nearly uniform efficiency for all write operations. Moreover, disk space is a far cheaper resource than CPU, memory, or network. Apache Cassandra therefore uses denormalization and data duplication, trading additional writes for more efficient reads, since read operations are typically costly and harder to optimize.
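The application itself performs those extra writes to keep duplicated tables consistent. One common approach, shown here against the hypothetical tables sketched earlier, is a logged batch that makes the duplicate inserts atomic:

```sql
-- One logical write becomes two physical writes, one per query
-- table; the logged batch ensures both are eventually applied.
BEGIN BATCH
    INSERT INTO users_by_username (username, email, age)
    VALUES ('alice', 'alice@example.com', 34);
    INSERT INTO users_by_email (email, username, age)
    VALUES ('alice@example.com', 'alice', 34);
APPLY BATCH;
```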
Apache Cassandra data modeling goals
To design a successful schema in Apache Cassandra, the following high-level goals must be kept in mind:
Even distribution of data across the cluster
Rows of a Cassandra table are partitioned and distributed across the nodes in the cluster based on the hash of the partition key. By spreading data evenly, each node in the cluster is responsible for an equal portion of the data, resulting in load balancing. This ensures optimal performance and prevents individual nodes from becoming overwhelmed with a disproportionately large amount of data. Additionally, even data distribution allows the workload to be distributed evenly, resulting in faster response times and increased throughput.
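The placement of each row can be inspected with CQL’s built-in token() function, which exposes the hash value that drives this distribution (shown here against the hypothetical users_by_username table from earlier):

```sql
-- Each partition key hashes to a token (Murmur3 by default);
-- the token determines which replica nodes own the partition.
SELECT username, token(username)
FROM users_by_username;
```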
Even data distribution also enables the system to scale seamlessly: as new nodes join the cluster, they take over an even share of the existing data and workload.
A table’s partition key plays a crucial role in achieving even data distribution across the cluster. Choosing a suitable partition key requires careful consideration of the data access patterns, query requirements, and cardinality of the data. A good practice is to select a partition key that provides a good distribution of values and avoids data skew, where certain partitions receive significantly more data than others.
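As an illustrative sketch using hypothetical sensor-data tables: keying time-series readings by sensor alone produces hot, ever-growing partitions, while adding a time bucket to the partition key spreads the same data evenly.

```sql
-- Skewed design: one unbounded partition per sensor; busy
-- sensors become hot spots.
CREATE TABLE readings_by_sensor (
    sensor_id    text,
    reading_time timestamp,
    value        double,
    PRIMARY KEY ((sensor_id), reading_time)
);

-- Better distribution: a day bucket in the partition key caps
-- partition size and spreads writes across many more partitions.
CREATE TABLE readings_by_sensor_day (
    sensor_id    text,
    day          date,
    reading_time timestamp,
    value        double,
    PRIMARY KEY ((sensor_id, day), reading_time)
);
```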
Minimum number of partitions accessed by a query
This goal is aimed at optimizing read operations. In Cassandra, each table’s data is distributed across the cluster nodes in partitions based on the partition key. Each partition represents a unit of data storage and can contain one or more rows. A query that reads from a single partition retrieves all of its data from one set of replica nodes, whereas a query spanning many partitions must contact more nodes and coordinate more work, making it slower and more expensive.
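Continuing the hypothetical sensor schema above, a query that supplies the full partition key touches exactly one partition:

```sql
-- Single-partition read: all matching rows are stored together
-- on the same replicas, so no cluster-wide scatter-gather occurs.
SELECT reading_time, value
FROM readings_by_sensor_day
WHERE sensor_id = 'sensor-42' AND day = '2024-06-01';
```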