Cassandra is a distributed database from Apache that is highly scalable and designed to manage very large amounts of structured data. It provides high availability with no single point of failure.
The tutorial starts off with a basic introduction of Cassandra followed by its architecture, installation, and important classes and interfaces.
The open source Apache Cassandra NoSQL database has quickly become the preferred data management platform for cloud applications that need to scale and perform in distributed environments that consist of multiple data centers and/or clouds.
Cassandra’s masterless, shared nothing architecture provides enterprises with constant uptime for their transactional/operational database applications as well as a flexible data model capable of storing today’s modern datatypes and operational simplicity for easy database management.
The Certified Cassandra in DataStax Enterprise builds upon open source Cassandra to deliver a database for cloud applications.
Apache Cassandra is a highly scalable, high-performance distributed database designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. It is a type of NoSQL database. Let us first understand what a NoSQL database does.
A NoSQL database (sometimes called Not Only SQL) is a database that provides a mechanism to store and retrieve data other than the tabular relations used in relational databases. These databases are schema-free, support easy replication, have a simple API, are eventually consistent, and can handle huge amounts of data.
The primary objective of a NoSQL database is to have
-simplicity of design,
-horizontal scaling, and
-finer control over availability.
NoSQL databases use different data structures than relational databases, which makes some operations faster in NoSQL. The suitability of a given NoSQL database depends on the problem it must solve.
Besides Cassandra, we have the following NoSQL databases that are quite popular:
Apache HBase - HBase is an open source, non-relational, distributed database modeled after Google’s BigTable and is written in Java. It is developed as a part of Apache Hadoop project and runs on top of HDFS, providing BigTable-like capabilities for Hadoop.
MongoDB - MongoDB is a cross-platform, document-oriented database system that avoids the traditional table-based relational structure in favor of JSON-like documents with dynamic schemas, making the integration of data in certain types of applications easier and faster.
Benefits of Cassandra
Cassandra has become so popular because of its outstanding technical features. Given below are some of the features of Cassandra:
Elastic scalability - Cassandra is highly scalable; it lets you add more hardware to accommodate more customers and more data as required.
Always on architecture - Cassandra has no single point of failure and is continuously available for business-critical applications that cannot afford a failure.
Fast linear-scale performance - Cassandra is linearly scalable, i.e., it increases your throughput as you increase the number of nodes in the cluster. Therefore it maintains a quick response time.
Flexible data storage - Cassandra accommodates all possible data formats including: structured, semi-structured, and unstructured. It can dynamically accommodate changes to your data structures according to your need.
Easy data distribution - Cassandra provides the flexibility to distribute data where you need by replicating data across multiple data centers.
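Transaction support - Cassandra does not provide full ACID (Atomicity, Consistency, Isolation, Durability) transactions in the relational sense. Writes are atomic and isolated at the partition level and durable through the commit log, consistency is tunable per operation, and lightweight transactions cover compare-and-set cases.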
Fast writes - Cassandra was designed to run on cheap commodity hardware. It performs very fast writes and can store hundreds of terabytes of data without sacrificing read efficiency.
Check which version of Java is installed by running the following command:
$ java -version
Add the DataStax Distribution of Apache Cassandra 3.x repository to /etc/apt/sources.list.d/cassandra.sources.list:
$ echo "deb http://debian.datastax.com/datastax-ddc 3.version_number main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
Note: Be sure to specify the version number. For example: 3.2.
Optional: On Debian systems, to allow installation of the Oracle JVM instead of the OpenJDK JVM:
-In /etc/apt/sources.list, find the line that describes your source repository for Debian and add contrib non-free to the end of the line. For example:
deb http://some.debian.mirror/debian/ $distro main contrib non-free
-Save and close the file when you are done.
Add the DataStax repository key to your aptitude trusted keys.
$ curl -L https://debian.datastax.com/debian/repo_key | sudo apt-key add -
Install the latest package:
$ sudo apt-get update
$ sudo apt-get install datastax-ddc
This command automatically installs the Cassandra utilities such as sstablelevelreset, sstablemetadata, sstableofflinerelevel, sstablerepairedset, sstablesplit, token-generator. Each utility provides usage/help information; type help after entering the command.
Optional: Single-node cluster installations only.
-Because the Debian packages start the Cassandra service automatically, you do not need to start the service.
-Verify that DataStax Distribution of Apache Cassandra is running:
$ nodetool status
--  Address    Load      Tokens  Owns  Host ID                               Rack
UN  127.0.0.1  47.66 KB  256     100%  aaa1b7c1-6049-4a08-ad3e-3697a0e30e10  rack1
Because the Debian packages start the Cassandra service automatically, you must stop the server and clear the data:
Doing this removes the default cluster_name (Test Cluster) from the system table. All nodes must use the same cluster name.
$ sudo service cassandra stop
$ sudo rm -rf /var/lib/cassandra/data/system/*
Cassandra incorporates a number of architectural best practices that affect performance. None are unique to Cassandra, but Cassandra is the only NoSQL system that incorporates all of them.
Fully distributed: Every Cassandra machine handles a proportionate share of every activity in the system. There are no special cases like the HDFS namenode, MongoDB mongos, or the MySQL Fabric process that require special treatment. And with every node the same, Cassandra is far simpler to install and operate, which has long-term implications for troubleshooting. Even when everything works perfectly, master/slave designs have a bottleneck at the master. Cassandra leverages its masterless design to deliver lower latency as well as uninterrupted uptime.
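The idea that every node takes a proportionate share of the data can be illustrated with a toy token ring. This is only a sketch of consistent hashing under stated assumptions: the class name TokenRing, the node names, and the use of MD5 as the hash are illustrative choices, not Cassandra's actual partitioner or API.

```python
import bisect
import hashlib
from collections import Counter


class TokenRing:
    """Toy consistent-hashing ring (hypothetical; not Cassandra's partitioner)."""

    def __init__(self, nodes, vnodes=256):
        # Each node owns `vnodes` pseudo-random positions (tokens) on the ring.
        self._ring = sorted(
            (self._token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _token(value):
        # Assumption: MD5 stands in for the real hash function.
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def owner(self, key):
        # The first token at or after the key's hash owns the key,
        # wrapping around to the start of the ring.
        index = bisect.bisect_left(self._tokens, self._token(key))
        return self._ring[index % len(self._ring)][1]


ring = TokenRing(["node-a", "node-b", "node-c"])
load = Counter(ring.owner(f"row-{i}") for i in range(30_000))
print(load)  # per-node key counts; the shares should come out roughly equal
```

No node in this sketch plays a special role: any of them can be asked which node owns a key, which is the property the masterless design relies on.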
Log-structured storage engine: A log-structured engine that avoids overwrites to turn updates into sequential I/O is essential both on hard disks (HDD) and solid-state disks (SSD). On HDD, because the seek penalty is so high; on SSD, to avoid write amplification and disk failure. This is why you see MongoDB performance go through the floor as the dataset size exceeds RAM. Couchbase’s append-only B-trees avoid overwrites, but require several seeks when updating or inserting new documents and do not support durable writes without a large performance penalty.
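A minimal sketch of the "no overwrites" idea, assuming a toy in-memory list in place of a file on disk: an update is appended as a new record, reads take the newest record for a key, and a separate compaction step discards the superseded ones. The class AppendOnlyStore and its methods are hypothetical names for illustration, not part of Cassandra.

```python
class AppendOnlyStore:
    """Toy log-structured store: updates are appended, never rewritten in place."""

    def __init__(self):
        self._log = []    # stands in for a sequentially written file
        self._clock = 0   # logical timestamp, one tick per write

    def put(self, key, value):
        # An update is just another record at the tail of the log.
        self._clock += 1
        self._log.append((self._clock, key, value))

    def get(self, key):
        # Scan from newest to oldest; the most recent record wins.
        for _, k, value in reversed(self._log):
            if k == key:
                return value
        return None

    def compact(self):
        # Rewrite the log keeping only the newest record per key.
        newest = {}
        for record in self._log:
            newest[record[1]] = record
        self._log = sorted(newest.values())


store = AppendOnlyStore()
store.put("emp:104", "jane")
store.put("emp:130", "raj")
store.put("emp:104", "Gena")  # update: appended, the old record is untouched
print(store.get("emp:104"))
store.compact()
print(len(store._log))
```

Every write lands at the tail of the log, so the write pattern stays sequential regardless of which key is updated.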
Locally-managed storage: HBase has an integrated, log-structured storage engine, but relies on HDFS for replication instead of managing storage locally. This means HBase is architecturally incapable of supporting Cassandra-style optimizations like putting the commitlog on a separate disk, mixing SSD and HDD in a single cluster with appropriate data pinned to each, or incrementally pulling compacted sstables into the page cache.
Prepared statements: Five years ago, NoSQL systems were characterized by only allowing primary key lookups, and there was no query planning to speak of. Today, Cassandra and most other systems support indexes and increasingly complex queries. The Cassandra Query Language allows Cassandra to pre-parse and re-use query plans, reducing overhead. Others remain stuck with primitive JSON APIs or even raw Java Scanner objects. CQL also allows Cassandra to express more sophisticated operations like lightweight transactions with a minimal impact on clients, resulting in wide support across many languages. The closest alternative is Apache Phoenix, a Java-only SQL layer for HBase.
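The prepare-once, execute-many pattern can be sketched with a toy server that parses each distinct query string a single time and afterwards accepts only a statement ID plus bound values. ToyServer, its methods, and the "plan" (here just the number of bind markers) are assumptions made for illustration; this is not a real driver or the Cassandra protocol.

```python
import hashlib


class ToyServer:
    """Toy model of prepared statements (hypothetical; not a driver API)."""

    def __init__(self):
        self.parse_count = 0
        self._plans = {}  # statement id -> number of bind markers

    def prepare(self, query):
        statement_id = hashlib.md5(query.encode()).hexdigest()
        if statement_id not in self._plans:
            # The expensive parse step runs once per distinct query string.
            self.parse_count += 1
            self._plans[statement_id] = query.count("?")
        return statement_id

    def execute(self, statement_id, values):
        # Only the id and the bound values travel with each execution.
        if len(values) != self._plans[statement_id]:
            raise ValueError("wrong number of bound values")
        return (statement_id, tuple(values))


server = ToyServer()
query = "SELECT * FROM emp WHERE empid = ? AND deptid = ?"
statement = server.prepare(query)
results = [server.execute(statement, (empid, 15)) for empid in (104, 130)]
server.prepare(query)      # preparing the same text again reuses the plan
print(server.parse_count)
```

The saving comes from moving parsing out of the per-request path: repeated executions differ only in the values they bind.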
Cassandra does not support full-fledged SQL. It provides a command line interface called cqlsh to run Cassandra commands. Its data access language is very limited and supports retrieving data only by key, as shown below:
CREATE KEYSPACE sampledb WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
CREATE TABLE EMP (EmpID int, DeptID int, First_name varchar, Last_name varchar, PRIMARY KEY (empID, deptID));
INSERT INTO EMP (empID, deptID, first_name, last_name) VALUES (104, 15, 'jane', 'smith');
SELECT * FROM EMP WHERE empid = 104 AND deptid = 15;
SELECT * FROM EMP WHERE empID IN (130, 104) ORDER BY deptID DESC;
UPDATE EMP SET first_name = 'Gena' WHERE empid = 104 AND deptid = 15;
Cassandra allows searches only by the primary key or any indexes created. Searches by other columns will give an error.
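Why key-based lookups work while other searches fail can be seen in a toy model of a table with PRIMARY KEY (empid, deptid). The sketch below assumes rows are grouped by empid and then addressed by deptid; ToyTable is a hypothetical name, it models only the primary key (not secondary indexes), and it is not Cassandra's storage format.

```python
class ToyTable:
    """Toy model of PRIMARY KEY (empid, deptid); hypothetical, key lookups only."""

    KEY_COLUMNS = ("empid", "deptid")

    def __init__(self):
        self._partitions = {}  # empid -> {deptid: row}

    def insert(self, empid, deptid, **columns):
        row = {"empid": empid, "deptid": deptid, **columns}
        self._partitions.setdefault(empid, {})[deptid] = row

    def select(self, **where):
        # Without the leading key column there is nothing to look the row up by.
        unknown = set(where) - set(self.KEY_COLUMNS)
        if unknown or "empid" not in where:
            raise ValueError("query must be restricted by the primary key")
        partition = self._partitions.get(where["empid"], {})
        if "deptid" in where:
            row = partition.get(where["deptid"])
            return [row] if row else []
        return [partition[deptid] for deptid in sorted(partition)]


emp = ToyTable()
emp.insert(104, 15, first_name="jane", last_name="smith")
print(emp.select(empid=104, deptid=15))
try:
    emp.select(first_name="jane")
except ValueError as error:
    print("rejected:", error)
```

A lookup by key is a direct dictionary access here, whereas a search on first_name would have to scan every row, which is the kind of query the toy table refuses.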
-Node: Where you store your data. It is the basic infrastructure component of Cassandra.
-Data center: A collection of related nodes. A data center can be a physical data center or virtual data center. Different workloads should use separate data centers, either physical or virtual. Replication is set by data center. Using separate data centers prevents Cassandra transactions from being impacted by other workloads and keeps requests close to each other for lower latency. Depending on the replication factor, data can be written to multiple data centers. However, data centers should never span physical locations.
-Cluster: A cluster contains one or more data centers. It can span physical locations.
-Commit log: All data is written first to the commit log for durability. After all its data has been flushed to SSTables, it can be archived, deleted, or recycled.
-Table: A collection of ordered columns fetched by row. A row consists of columns and has a primary key. The first part of the key is a column name.
-SSTable: A sorted string table (SSTable) is an immutable data file to which Cassandra writes memtables periodically. SSTables are append only and stored on disk sequentially and maintained for each Cassandra table.
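How the commit log, memtable, and SSTables from the list above fit together on a write can be sketched as follows. This is a simplified in-memory model under stated assumptions: ToyNode, its attribute names, and the flush threshold are illustrative, and real flushing, recycling, and reads are considerably more involved.

```python
class ToyNode:
    """Toy write path: commit log, then memtable, then immutable SSTables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []  # append-only record of writes not yet flushed
        self.memtable = {}    # in-memory view of recent writes
        self.sstables = []    # immutable tuples sorted by key, oldest first
        self.memtable_limit = memtable_limit  # assumed flush threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # durability comes first
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable as a sorted, immutable table.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Newer SSTables take precedence over older ones.
        for table in reversed(self.sstables):
            for k, value in table:
                if k == key:
                    return value
        return None


node = ToyNode()
node.write("b", 1)
node.write("a", 2)  # reaches the limit, so the memtable is flushed
node.write("a", 3)  # newer value stays in the memtable and commit log
print(node.read("a"), node.read("b"))
```

Clearing the commit log in flush mirrors the statement above that the log can be archived, deleted, or recycled once its data is in SSTables.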
Cassandra is a key-value based NoSQL database that shows a lot of promise thanks to its unique ring-based architecture, lack of a single point of failure, and fast writes. It is highly scalable and ready for big data. But its limited query support and lack of joins and aggregates make it unsuitable for some big data applications. It is also limited by being supported by only a few of the vendors that provide Hadoop support. Its success will depend on how quickly the development community can build robust interfaces to other Hadoop ecosystem products like Flume, Hive, and HBase.