Author: LearnAnalytics

Spark: Programming with RDDs

An RDD (Resilient Distributed Dataset) in Spark is simply an immutable, distributed collection of objects. Each RDD is split into multiple partitions (smaller units), which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, … Continue reading Spark: Programming with RDDs
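
As a quick sketch of these ideas, the snippet below (Scala, assuming a live SparkContext named sc such as the one spark-shell provides; the HDFS path is a placeholder) builds an RDD from a local collection, checks its partitioning, and applies a couple of lazy transformations:

// Minimal RDD sketch, assuming a live SparkContext `sc` (e.g. from spark-shell).
val nums = sc.parallelize(1 to 100, numSlices = 4)   // distribute a local collection across 4 partitions
println(nums.getNumPartitions)                        // -> 4
println(nums.map(_ * 2).filter(_ % 3 == 0).count())   // transformations are lazy; count() triggers the job

// RDDs are immutable: transformations return new RDDs rather than modifying `nums`.
val lines = sc.textFile("hdfs:///data/sample.txt")    // placeholder path; each HDFS block becomes a partition
println(lines.first())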

Apache Spark Architecture

To understand how Spark runs, it is important to know the architecture of Spark. The following diagram and discussion will give you a clearer view of it. There are three ways Apache Spark can run: Standalone – the Hadoop cluster can be provisioned with all the resources statically, and Spark can … Continue reading Apache Spark Architecture
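
As a hedged illustration of how the run mode surfaces in application code, the sketch below sets the master URL on a SparkConf; the host name and port are placeholders, and which URL you pick depends on the cluster manager in use:

import org.apache.spark.{SparkConf, SparkContext}

// Choose one master URL depending on how the cluster runs (host/port are placeholders):
//   "local[*]"                 – run everything in a single JVM, useful for testing
//   "spark://master-host:7077" – Spark standalone cluster manager
//   "yarn"                     – let Hadoop YARN allocate executors
val conf = new SparkConf()
  .setAppName("architecture-demo")
  .setMaster("spark://master-host:7077")

val sc = new SparkContext(conf)
println(sc.master)   // confirm which cluster manager the driver connected to
sc.stop()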

Hadoop Multi Node Clusters

Installing Java: check the installed version with the java version command: $ java -version. The following output is presented: java version "1.7.0_71" Java(TM) SE Runtime Environment (build 1.7.0_71-b13) Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode). Creating a user account: a system user account should be created on both the master and slave systems to use the Hadoop installation: # useradd hadoop # … Continue reading Hadoop Multi Node Clusters

Hadoop: Features, Components, Cluster & Topology

Apache Hadoop is a framework used to develop data processing applications that are executed in a distributed computing environment. The post covers the components of Hadoop, the features of Hadoop, and network topology in Hadoop. Similar to data residing in the local file system of a personal computer, in Hadoop data resides in a distributed file system, which is called … Continue reading Hadoop: Features, Components, Cluster & Topology
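
To make the distributed-file-system idea concrete, here is a small sketch in Scala against the standard Hadoop FileSystem API; the namenode address is a placeholder for your cluster's configuration:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Point the client at the cluster's namenode (the address below is a placeholder).
val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://namenode-host:8020")

val fs = FileSystem.get(conf)

// List the top-level entries, much like `ls /` on a local file system,
// except the blocks behind each file are spread across the cluster's datanodes.
fs.listStatus(new Path("/")).foreach { status =>
  println(s"${if (status.isDirectory) "dir " else "file"}  ${status.getPath}")
}
fs.close()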

What is MapReduce? How it Works

MapReduce is a programming model suited to processing huge volumes of data. Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++. MapReduce programs are parallel in nature and are therefore very useful for performing large-scale data analysis across multiple machines in the cluster. MapReduce programs work in two phases: 1. … Continue reading What is MapReduce? How it Works
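
To illustrate the two phases, the sketch below is a word-count job written in Scala on top of the standard Hadoop MapReduce API (the post itself discusses Java and other languages; the class names and the input/output paths are placeholders supplied on the command line):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Phase 1 – map: emit (word, 1) for every word in the input split.
class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\W+").filter(_.nonEmpty).foreach { w =>
      word.set(w.toLowerCase); ctx.write(word, one)
    }
}

// Phase 2 – reduce: sum the counts that arrive grouped by word after the shuffle.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var total = 0
    val it = values.iterator()
    while (it.hasNext) total += it.next().get()
    ctx.write(key, new IntWritable(total))
  }
}

// Driver: wires the two phases together and submits the job to the cluster.
object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(classOf[TokenMapper])
    job.setMapperClass(classOf[TokenMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))    // input path (placeholder)
    FileOutputFormat.setOutputPath(job, new Path(args(1)))  // output path (placeholder)
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}

The map tasks run in parallel over input splits, the shuffle groups each word's counts by key, and the reduce tasks sum them.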

Big Data Testing: Functional & Performance

What is Big Data? Big data is a collection of large datasets that cannot be processed using traditional computing techniques. Testing these datasets requires a variety of tools, techniques, and frameworks. Big data relates to data creation, storage, retrieval, and analysis that is remarkable in terms of volume, variety, and velocity. Big Data Testing … Continue reading Big Data Testing: Functional & Performance

Using Materialized Views with Big Data SQL to Accelerate Performance

One of Big Data SQL’s key benefits is that it leverages the great performance capabilities of Oracle Database 12c. I thought it would be interesting to illustrate an example – and in this case we’ll review a performance optimization that has been around for quite a while and is used by thousands of customers: Materialized … Continue reading Using Materialized Views with Big Data SQL to Accelerate Performance