Tag: Apache Spark

Comprehensive Introduction to Apache Spark, RDDs & Dataframes (using PySpark)

Introduction: Industry estimates suggest that we are creating more than 2.5 quintillion bytes of data every year. Think about that for a moment: 1 quintillion = 1 billion billion (10^18)! Can you imagine how many drives, CDs, or Blu-ray discs would be required to store them? It is difficult to imagine this scale of data …

Apache Flink – Flink vs Spark vs Hadoop

Here is a comprehensive table comparing the three most popular big data frameworks: Apache Flink, Apache Spark and Apache Hadoop.

Year of Origin: Apache Hadoop: 2005; Apache Spark: 2009; Apache Flink: 2009
Place of Origin: Apache Hadoop: MapReduce (Google), Hadoop (Yahoo); Apache Spark: University of California, Berkeley; Apache Flink: Technical University of Berlin
Data Processing Engine: Apache Hadoop: Batch; Apache Spark: Batch …

The Hadoop Module & High-level Architecture

The Apache Hadoop modules:
Hadoop Common: the common utilities that support the other Hadoop modules.
HDFS: the Hadoop Distributed File System, which provides unrestricted, high-speed access to application data.
Hadoop YARN: handles job scheduling and efficient management of cluster resources.
MapReduce: a highly efficient methodology for parallel processing of huge …