Spark: Programming with RDDs

An RDD (Resilient Distributed Dataset) in Spark is simply an immutable distributed collection of objects. Each RDD is split into multiple partitions (smaller units), which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.

RDDs can be created in two ways: by loading an external dataset, or by distributing a collection of objects (e.g., a list or set) in the driver program.

Every Spark program and shell session works as follows:

  • Create some input RDDs from external data.
  • Transform them to define new RDDs using transformations such as filter().
  • Ask Spark to persist() any intermediate RDDs that will need to be reused.
  • Launch actions such as count() and first() to kick off a parallel computation, which is then optimized and executed by Spark.

Creating RDDs

The simplest way to create RDDs is to take an existing collection in your program and pass it to the SparkContext’s parallelize() method, as shown in the examples:

parallelize() method in Python

lines = sc.parallelize(["pandas", "i like pandas"])

parallelize() method in Scala

val lines = sc.parallelize(List("pandas", "i like pandas"))

parallelize() method in Java

JavaRDD<String> lines = sc.parallelize(Arrays.asList("pandas", "i like pandas"));

RDD Operations

RDDs support two types of operations:

  • Transformations
  • Actions

Transformations: operations that return a new RDD.

For example, suppose we have a logfile, log.txt, with a number of messages, and we want to select only the error messages. We can use the filter() transformation. We’ll show a filter in all three of Spark’s language APIs (Application Programming Interfaces).

filter() transformation in Python:

inputRDD = sc.textFile("log.txt")

errorsRDD = inputRDD.filter(lambda x: "error" in x)

filter() transformation in Scala:

val inputRDD = sc.textFile("log.txt")

val errorsRDD = inputRDD.filter(line => line.contains("error"))

filter() transformation in Java

JavaRDD<String> inputRDD = sc.textFile("log.txt");

JavaRDD<String> errorsRDD = inputRDD.filter(
  new Function<String, Boolean>() {
    public Boolean call(String x) { return x.contains("error"); }
  });

The filter() operation does not mutate the existing input RDD. Instead, it returns a pointer to an entirely new RDD.
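Since Spark itself is not needed to see this behavior, here is a plain-Python analogy (not actual Spark code): filtering produces a new collection and leaves the original untouched, just as filter() on an RDD returns a new RDD rather than mutating the input.

```python
# Plain-Python analogy: filtering builds a NEW collection;
# the original is unchanged, just like an input RDD after filter().
input_lines = ["INFO: ok", "error: disk full", "INFO: done"]
error_lines = [x for x in input_lines if "error" in x]

print(input_lines)  # original list is unchanged
print(error_lines)  # ['error: disk full']
```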

Actions: operations that return a final value to the driver program or write data to an external storage system. Actions force the evaluation of the transformations required for the RDD they were called on, since they need to actually produce output.

Python error count using actions

print "Input had " + str(badLinesRDD.count()) + " concerning lines"

print "Here are 10 examples:"

for line in badLinesRDD.take(10):
    print line

Scala error count using actions

println("Input had " + badLinesRDD.count() + " concerning lines")

println("Here are 10 examples:")

badLinesRDD.take(10).foreach(println)

Java error count using actions

System.out.println("Input had " + badLinesRDD.count() + " concerning lines");

System.out.println("Here are 10 examples:");

for (String line : badLinesRDD.take(10)) {
  System.out.println(line);
}

In the code above, take() retrieves a small number of elements from the RDD to the driver program, where they are iterated over to write output. RDDs also have a collect() function that fetches the entire RDD to the driver.

Each time a new action is called, the entire RDD must be computed “from scratch”. To avoid this inefficiency, users can persist intermediate results.
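The idea of persisting intermediate results can be sketched with a plain-Python caching analogy (this only illustrates the concept; Spark’s persist()/cache() operate on RDD partitions across the cluster):

```python
from functools import lru_cache

# Analogy: without a cache the "expensive" computation would rerun for
# every action; with a cache it runs only the first time.
calls = []

@lru_cache(maxsize=None)
def expensive_transform(n):
    calls.append(n)      # record each time real work happens
    return n * n

expensive_transform(4)   # computed
expensive_transform(4)   # served from cache, not recomputed
print(len(calls))        # 1
```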

Lazy Evaluation

Lazy evaluation means that when we call a transformation on an RDD (for instance, calling map()), the operation is not immediately performed. Instead, Spark internally records metadata to indicate that this operation has been requested. Rather than thinking of an RDD as containing specific data, it is best to think of each RDD as consisting of instructions on how to compute the data, built up through transformations. Loading data into an RDD is lazily evaluated in the same way transformations are. So when we call sc.textFile(), the data is not loaded until it is necessary. As with transformations, the operation (in this case, reading the data) can occur multiple times.
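A plain-Python generator gives a rough analogy for this behavior (an illustration only, not Spark code): a generator records *how* to compute values but does no work until the results are actually needed, much as Spark records transformation metadata and only computes when an action runs.

```python
# Analogy for lazy evaluation: nothing is computed until a "result"
# is demanded, mirroring how transformations wait for an action.
log = []

def transform(lines):
    for line in lines:
        log.append("computed " + line)  # side effect marks real work
        yield line.upper()

pipeline = transform(["a", "b"])  # nothing computed yet
assert log == []                  # no work has happened

result = list(pipeline)           # the "action": forces evaluation
print(result)                     # ['A', 'B']
print(len(log))                   # 2
```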

Passing Functions to Spark

Python: In Python, we have three options for passing functions into Spark. First, we can pass functions as lambda expressions. Second, we can pass a function that is a member of an object, or that references fields of an object (though this ships the whole object along with the function). Third, we can extract the fields we need from the object into a local variable and pass that in instead.
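The three styles can be sketched with Python’s built-in filter() standing in for an RDD (the class and variable names here are illustrative, not Spark API):

```python
data = ["ok", "error!"]

# 1. A lambda expression:
matches1 = list(filter(lambda x: "error" in x, data))

# 2. A method of an object -- in real Spark, passing wf.matches would
#    ship the whole object to the workers, since it references self:
class WordFilter(object):
    def __init__(self, query):
        self.query = query
    def matches(self, line):
        return self.query in line

wf = WordFilter("error")
matches2 = list(filter(wf.matches, data))

# 3. Extract the needed field into a local variable first, so only
#    that small value (not the whole object) travels with the lambda:
query = wf.query
matches3 = list(filter(lambda line: query in line, data))

print(matches1, matches2, matches3)  # all ['error!']
```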

Scala: In Scala, we can pass in functions defined inline, references to methods, or static functions, as we do for Scala’s other functional APIs (Application Programming Interfaces).

Java: In Java, functions are specified as objects that implement one of Spark’s function interfaces from the org.apache.spark.api.java.function package.

Common Transformations and Actions

The two most common transformations you will likely be using are map() and filter(). The map() transformation takes in a function and applies it to each element in the RDD, with the result of the function becoming the new value of each element in the resulting RDD. The filter() transformation takes in a function and returns an RDD containing only the elements that pass the filter() function.
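Python’s built-in map() and filter() behave analogously on a small local collection, which makes the semantics easy to check (a plain-Python analogy, not Spark code):

```python
# Analogy: built-in map()/filter() mirror their RDD counterparts.
nums = [1, 2, 3, 4]
mapped = list(map(lambda x: x * x, nums))        # square each element
filtered = list(filter(lambda x: x != 1, nums))  # keep elements passing the test

print(mapped)    # [1, 4, 9, 16]
print(filtered)  # [2, 3, 4]
```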

Figure: mapped and filtered RDD from an input RDD

A basic example of map() that squares all of the numbers in an RDD:

Python squaring the values in an RDD

nums = sc.parallelize([1, 2, 3, 4])

squared = nums.map(lambda x: x * x).collect()

for num in squared:
    print "%i " % (num)

Sometimes we want to produce multiple output elements for each input element. The operation to do this is called flatMap().

A simple usage of flatMap() is splitting up an input string into words, as shown in Example:

flatMap() in Python, splitting lines into words

lines = sc.parallelize(["hello world", "hi"])

words = lines.flatMap(lambda line: line.split(" "))

words.first() # returns "hello"
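The flatten-after-map behavior can be reproduced with plain Python using itertools (an analogy only; the variable names echo the Spark example above):

```python
from itertools import chain

# Analogy for flatMap(): map each element to a list of words,
# then flatten all the lists into a single sequence.
lines = ["hello world", "hi"]
words = list(chain.from_iterable(line.split(" ") for line in lines))

print(words)  # ['hello', 'world', 'hi']
```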


The most common action on basic RDDs is reduce(), which takes a function that operates on two elements of the type in your RDD and returns a new element of the same type. Similar to reduce() is fold(), which takes a function with the same signature as reduce(), but in addition takes a “zero value” to be used for the initial call on each partition.

Both fold() and reduce() require that the return type of our result be the same as the type of the elements in the RDD we are operating on. This works well for operations like sum, but sometimes we want to return a different type.
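The semantics can be sketched with functools.reduce, including a rough picture of fold()’s per-partition zero value (the two “partitions” below are hypothetical, chosen just to illustrate the idea):

```python
from functools import reduce

# Analogy for RDD reduce(): combine elements pairwise with a
# two-argument function over elements of the same type.
nums = [1, 2, 3, 4]
total = reduce(lambda x, y: x + y, nums)  # 10

# Analogy for fold(): a zero value is used once per partition,
# then the per-partition results are combined.
partitions = [[1, 2], [3, 4]]  # hypothetical partitioning of nums
zero = 0
per_partition = [reduce(lambda x, y: x + y, part, zero)
                 for part in partitions]  # [3, 7]
folded = reduce(lambda x, y: x + y, per_partition, zero)

print(total, folded)  # 10 10
```

Note that zero must be the identity of the operation (0 for addition) so the result does not depend on how the data happens to be partitioned.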

