Introduction to Apache Cassandra
Cassandra has a peer-to-peer distributed architecture that is elegant and easy to set up and maintain. In Cassandra, all nodes are equal: there is no concept of a master node, and all nodes communicate with each other via a gossip protocol.
There is nothing a developer or administrator needs to code or configure to distribute data across a cluster. Data is transparently partitioned across all nodes in either a randomized or ordered fashion, with randomized partitioning being the default.
When creating a new Cassandra database (also called a keyspace), a user simply indicates via a single command which data centers and/or cloud providers will hold copies of the new database; everything from that point forward is automatically handled and maintained by Cassandra.
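As a sketch of that single command (the keyspace and data-center names here are hypothetical), creating a keyspace replicated across two data centers looks like this in CQL:

```cql
-- Hypothetical keyspace with three replicas in each of two data centers.
-- NetworkTopologyStrategy lets the replica count be set per data center.
CREATE KEYSPACE northwind
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc_east': 3,
    'dc_west': 3
  };
```

From this point on, Cassandra places and maintains the replicas automatically; no further placement logic is written by the developer.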
If one or more nodes responsible for a particular set of data are down, the data is simply written to another node, which holds it temporarily (a mechanism known as hinted handoff). Once the nodes come back online, they automatically bring themselves up to date from the nodes holding the data they maintain.
A user can request data from any node (which becomes that user's coordinator node), with the results of the query assembled from one or more nodes holding the necessary data. If a particular node holding the required data is down, Cassandra simply requests the data from another node holding a replicated copy.
While Cassandra is not a transactional database in the way that legacy RDBMSs offer ACID transactions, it does offer the "AID" portion of ACID, in that data written is atomic, isolated, and durable. The "C" of ACID does not apply to Cassandra, as there is no concept of referential integrity or foreign keys.
Because NoSQL databases like Cassandra do not support operations like SQL joins, data tends to be highly denormalized. While the resulting wide rows would normally be a problem for an RDBMS, Cassandra provides exceptional performance for rows with many thousands of columns.
Migrate a Relational Database into Cassandra (Part II – Northwind Planning)
In a relational database setting I can often simply normalize away and worry about which table I need to focus my indexing efforts on later when I’m working in the application. However, in NoSQL, non-relational database design, we often need to decide up front which entity most queries will be interested in and build everything else around that entity.
So…will it be “order” or “product”? Today I’ll decide that the key entity in this database is “order” – customers will be hitting this on a daily, per transaction basis whereas I can probably run my product reports offline.
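With "order" as the key entity, one possible physical table (all table and column names here are illustrative, not from the original article) keys orders so that a customer's activity for a given day lands in a single partition:

```cql
-- Orders partitioned by (customer, day): a customer's daily
-- transactions are a single-partition read.
CREATE TABLE orders_by_customer (
    customer_id uuid,
    order_date  date,
    order_id    timeuuid,
    product_id  uuid,
    quantity    int,
    unit_price  decimal,
    PRIMARY KEY ((customer_id, order_date), order_id)
) WITH CLUSTERING ORDER BY (order_id DESC);
```

Product-centric reporting would then be served offline or from separate, denormalized tables built for those queries.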
Cassandra From a Relational World
When you are designing the schema for your relational database, the primary thought on your mind is “What’s the best way to store this data?”. But with Cassandra, your dominant concern should be “How am I going to query this data?”
Basic Rules of Cassandra Data Modeling
Developers coming from a relational background usually carry over rules about relational modeling and try to apply them to Cassandra. To avoid wasting time on rules that don't really matter with Cassandra, keep the following in mind.
Writes in Cassandra aren’t free, but they’re awfully cheap. Cassandra is optimized for high write throughput, and almost all writes are equally efficient [1]. If you can perform extra writes to improve the efficiency of your read queries, it’s almost always a good tradeoff.
Denormalization and duplication of data is a fact of life with Cassandra. Don’t be afraid of it. Disk space is generally the cheapest resource (compared to CPU, memory, disk IOPs, or network), and Cassandra is architected around that fact.
Cassandra doesn't have JOINs, and you wouldn't really want to perform one across a distributed cluster anyway.
Rows are spread around the cluster based on a hash of the partition key, which is the first element of the PRIMARY KEY. So, the key to spreading data evenly is this: pick a good primary key.
Partitions are groups of rows that share the same partition key. When you issue a read query, you want to read rows from as few partitions as possible.
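A sketch of the distinction (schema is illustrative): the first element of the PRIMARY KEY is the partition key, and any remaining elements are clustering columns that order rows within a partition.

```cql
-- sensor_id is the partition key: its hash determines which nodes own the rows.
-- reading_time is a clustering column: it orders rows inside each partition.
CREATE TABLE readings_by_sensor (
    sensor_id    uuid,
    reading_time timestamp,
    value        double,
    PRIMARY KEY (sensor_id, reading_time)
);

-- An efficient query: restricted to a single partition.
SELECT reading_time, value
  FROM readings_by_sensor
 WHERE sensor_id = 123e4567-e89b-12d3-a456-426614174000
   AND reading_time > '2016-01-01';
```

Because all of a sensor's readings share one partition key, this query touches only that partition's replicas rather than scanning the cluster.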
The way to minimize partition reads is to model your data to fit your queries. Don’t model around relations. Don’t model around objects.
If you need different types of answers, you usually need different tables. This is how you optimize for reads. Remember, data duplication is okay. Many of your tables may repeat the same data.
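For example (a common sketch, not a schema from the source), looking up users by username and by email calls for two tables that repeat the same data, one per query:

```cql
-- Same user data, stored twice, each copy keyed for a different query.
CREATE TABLE users_by_username (
    username text PRIMARY KEY,
    email    text,
    age      int
);

CREATE TABLE users_by_email (
    email    text PRIMARY KEY,
    username text,
    age      int
);
```

The application writes to both tables on every user insert or update; the extra write is the price of two cheap single-partition reads.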
The DevOps of Cassandra Data Modeling
In Cassandra, it is best to start the data modeling process by defining your query patterns first.
An Advanced Cassandra Data Modeling Guide
Denormalize ALL THE THINGS: increase the number of writes to reduce and simplify reads. Most importantly, we did this without querying multiple tables and then merging and reconciling the results, by using a data model that duplicates data and stores our host information in a denormalized manner, contrary to the original, highly relational MySQL data model.
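A hedged sketch of what such a denormalized host table might look like (the source does not give its schema; names and columns here are invented for illustration):

```cql
-- Each row carries the full host record for a service, so a service
-- lookup is one partition read instead of a relational join.
CREATE TABLE hosts_by_service (
    service_name text,
    host_id      uuid,
    hostname     text,
    ip_address   inet,
    rack         text,
    PRIMARY KEY (service_name, host_id)
);
```

Every host is written once per service it belongs to, trading extra writes and disk for join-free reads.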
Advanced Data Modeling with Apache Cassandra
Data Modeling Steps:
- Conceptual Data Model (e.g., an ER diagram)
- Application Query Workflow
- Logical Data Model (combine steps 1 & 2)
- Physical Data Model (step 3 with CQL data types)
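As an illustration of the final step (the table is hypothetical, in keeping with the Northwind theme), the physical model is the logical model with concrete CQL data types assigned:

```cql
-- Logical model "products by category" made physical:
-- each attribute now has a CQL type, and the access pattern
-- (list products within a category) fixes the primary key.
CREATE TABLE products_by_category (
    category   text,     -- partition key, from the query workflow
    product_id uuid,     -- clustering column
    name       text,
    unit_price decimal,
    PRIMARY KEY (category, product_id)
);
```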
Eventually Consistent - Revisited
The CAP theorem states that of the three properties of shared-data systems (data consistency, system availability, and tolerance to network partitions), only two can be achieved at any given time.
An important observation is that in larger distributed-scale systems, network partitions are a given; therefore, consistency and availability cannot be achieved at the same time. This means that there are two choices on what to drop: relaxing consistency will allow the system to remain highly available under the partitionable conditions, whereas making consistency a priority means that under certain conditions the system will not be available.
[Hey Relational Developer, Let’s Go Crazy (Patrick McFadin, DataStax) | Cassandra Summit 2016](https://www.youtube.com/watch?v=KFCmxrmnkt8) - Slides