Category: Apache Spark

Real-time Big Data Pipeline with Hadoop, Spark & Kafka

Real-time Big Data Pipeline with Hadoop, Spark & Kafka

Defined by 3Vs that are velocity, volume, and variety of the data, big data sits in the separate row from the regular data. Though big data was the buzzword since last few years for data analysis, the new fuss about big data analytics is to build up real-time big data pipeline. In a single sentence, … Continue reading Real-time Big Data Pipeline with Hadoop, Spark & Kafka

Advertisements
Spark: Programming with RDDs

Spark: Programming with RDDs

A RDD known as Resilient Distributed Dataset in Spark is simply an immutable distributed huge collection of objects sets. Each RDD is split into multiple partitions (a smaller units), which may be computed on different aspects of nodes of the cluster. RDDs can contain any type of languages such as Python, Java, or Scala objects, … Continue reading Spark: Programming with RDDs