Wednesday, October 04, 2017

Enable Distributed Data Processing for Cassandra With Spark


Cassandra is a distributed database system that offers linearly scalable performance and high availability over a cluster of commodity servers. Like many other distributed storage systems, Cassandra relies on a distributed data model, or data partitioning, as its primary technique for achieving scalable performance and fault tolerance. In Hadoop, the distributed data model of HDFS brings another benefit: MapReduce, a distributed programming model that allows parallel data processing on the data partitions (data blocks). In the Hadoop ecosystem, this is commonly described as "bringing computation closer to the data."

Is a similar data processing model possible with Cassandra? That is, can we take advantage of Cassandra's distributed nature and apply data processing logic in parallel on each data partition? Yes, this is possible with the DataStax Spark-Cassandra Connector, which provides an RDD (Resilient Distributed Dataset) abstraction for data collections in Cassandra.
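As a rough illustration of that RDD abstraction, the sketch below reads a Cassandra table through the connector and runs a small piece of logic once per Spark partition. The keyspace `sensor_ks`, the table `readings`, and the contact point `127.0.0.1` are hypothetical placeholders, and the job is assumed to be launched with `spark-submit` with the connector on the classpath, so treat it as a starting point under those assumptions rather than a reference setup.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ // adds cassandraTable to SparkContext

object CassandraPartitionCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-partition-count")
      // Assumed contact point; replace with a node of your own cluster.
      .set("spark.cassandra.connection.host", "127.0.0.1")

    val sc = new SparkContext(conf)
    try {
      // Hypothetical keyspace and table names, used only for illustration.
      val readings = sc.cassandraTable("sensor_ks", "readings")

      // mapPartitions invokes the function once per Spark partition.
      // Here each partition simply reports how many rows it holds.
      val rowsPerPartition = readings
        .mapPartitions(rows => Iterator(rows.size))
        .collect()

      rowsPerPartition.zipWithIndex.foreach { case (count, idx) =>
        println(s"Spark partition $idx: $count rows")
      }
    } finally {
      sc.stop()
    }
  }
}
```

The connector builds each Spark partition from a group of Cassandra token ranges, so when Spark executors run on the same machines as the Cassandra nodes, the per-partition function can often be scheduled next to the data it reads. Any other per-partition logic can take the place of the row count.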


Fun With SQL: Functions in Postgres

In our previous Fun with SQL post on the Citus Data blog, we covered w...