Resilient Distributed Dataset (RDD)

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant, immutable collection of elements that can be operated on in parallel.

The ability to always recompute an RDD is why RDDs are called “resilient.” When a machine holding RDD data fails, Spark recomputes the missing partitions by following the lineage graph, transparently to the user.

Spark keeps track of the set of dependencies between different RDDs, called the lineage graph.
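
For example, a minimal sketch (assuming sc is an existing JavaSparkContext; the file path and variable names are illustrative): each transformation below produces a new RDD that remembers its parent, and that chain of dependencies is the lineage graph.

JavaRDD<String> lines = sc.textFile("logs.txt");                  // root of the lineage
JavaRDD<String> errors = lines.filter(x -> x.contains("error"));  // child of lines
JavaRDD<String> fatal = errors.filter(x -> x.contains("fatal"));  // child of errors
// If a partition of fatal is lost, Spark replays textFile -> filter -> filter
// for just that partition.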

Keep in mind that collect() brings the entire dataset back to the driver, so it must fit in memory on that single machine; collect() shouldn’t be used on large datasets.
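
For example (a sketch assuming an existing JavaSparkContext sc), take(n) fetches only a fixed number of elements and is safe on datasets of any size:

JavaRDD<Integer> nums = sc.parallelize(java.util.Arrays.asList(1, 2, 3, 4, 5));
java.util.List<Integer> all = nums.collect(); // entire RDD on the driver; only for small data
java.util.List<Integer> few = nums.take(2);   // fetches just 2 elements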

It is important to note that each time we call a new action, the entire RDD must be computed “from scratch.” To avoid this inefficiency, users can persist intermediate results.

cache() is the same as calling persist() with the default storage level, MEMORY_ONLY.
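
A minimal sketch of persisting an intermediate RDD (assuming lines is an existing JavaRDD<String>); the first action materializes the cache, and later actions reuse it:

import org.apache.spark.storage.StorageLevel;

JavaRDD<String> errors = lines.filter(x -> x.contains("error"));
errors.persist(StorageLevel.MEMORY_ONLY());  // equivalent to errors.cache()
long total = errors.count();                                  // computes and caches errors
long fatal = errors.filter(x -> x.contains("fatal")).count(); // reuses the cached data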

Java Functions:

Standard Java Function Interfaces

Example:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

JavaRDD<String> errors = lines.filter(new Function<String, Boolean>() {
  public Boolean call(String x) { return x.contains("error"); }
});

or

class ContainsError implements Function<String, Boolean> {
  public Boolean call(String x) { return x.contains("error"); }
}
JavaRDD<String> errors = lines.filter(new ContainsError());

Using a constructor in a named class

class Contains implements Function<String, Boolean> {
  private String query;
  public Contains(String query) { this.query = query; }
  public Boolean call(String x) { return x.contains(query); }
}
JavaRDD<String> errors = lines.filter(new Contains("error"));
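
Since Function has a single abstract method, a Java 8 lambda can replace the anonymous or named classes above (assuming lines is a JavaRDD<String>):

JavaRDD<String> errors = lines.filter(x -> x.contains("error"));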

The basic properties of an RDD are:

  • RDDs are immutable
  • RDDs are lazily evaluated (see the sketch after this list)
  • RDDs are cacheable
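
A sketch of the lazy-evaluation point (assuming an existing JavaSparkContext sc; the path is illustrative): transformations only record what to do, and an action triggers the actual computation.

JavaRDD<String> lines = sc.textFile("logs.txt");                 // nothing is read yet
JavaRDD<String> errors = lines.filter(x -> x.contains("error")); // still nothing computed
long n = errors.count(); // the action triggers reading and filtering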

[Figure: Apache Spark RDD]

Characteristics of RDD
  • An RDD is an array of references to partition objects
  • A partition is the basic unit of parallelism; each partition holds a reference to a subset of the data
  • Partitions are assigned to cluster nodes with respect to data locality and/or to minimize data transfer
  • Before processing, each partition is loaded into memory (RAM)
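
For example (a sketch assuming an existing JavaSparkContext sc), the partition count can be set when creating an RDD and inspected afterwards:

JavaRDD<Integer> nums = sc.parallelize(java.util.Arrays.asList(1, 2, 3, 4, 5, 6), 3); // request 3 partitions
System.out.println(nums.getNumPartitions()); // 3; each partition can be processed in parallel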

Narrow & Wide Operations

Narrow Operation: RDD operations like map, union, and filter can operate on a single partition and map the data of that partition to a single resulting partition.

[Figure: Spark RDD Narrow Operations]
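
For instance (a sketch assuming nums is an existing JavaRDD<Integer>), both steps below are narrow: each output partition depends on exactly one input partition, so no shuffle is needed.

JavaRDD<Integer> doubled = nums.map(x -> x * 2);          // narrow: partition-to-partition
JavaRDD<Integer> evens = doubled.filter(x -> x % 2 == 0); // narrow: still no shuffle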

Wide Operation: RDD operations like groupByKey, distinct, and join may need to map data across the partitions of the new RDD. Operations that map data from one partition to many partitions are referred to as wide operations.

[Figure: Spark RDD Wide Operations]
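
A sketch of a wide operation (assuming an existing JavaSparkContext sc): groupByKey must shuffle all values with the same key into the same output partition.

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(
  Arrays.asList(new Tuple2<>("a", 1), new Tuple2<>("b", 2), new Tuple2<>("a", 3)));
JavaPairRDD<String, Iterable<Integer>> grouped = pairs.groupByKey(); // wide: data moves across partitions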
