Resilient Distributed Dataset (RDD)

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant, immutable collection of elements that can be operated on in parallel.

The ability to always recompute an RDD is why RDDs are called “resilient.” When a machine holding RDD data fails, Spark recomputes the missing partitions by following the lineage graph, transparently to the user.

Spark keeps track of the set of dependencies between different RDDs, called the lineage graph.
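
For example, a minimal sketch (assuming sc is an existing JavaSparkContext; the file path and variable names are illustrative): each transformation below produces a new RDD that remembers its parent, and that chain of dependencies is the lineage graph.

JavaRDD<String> lines = sc.textFile("logs.txt");                  // root of the lineage
JavaRDD<String> errors = lines.filter(x -> x.contains("error"));  // child of lines
JavaRDD<String> fatal = errors.filter(x -> x.contains("fatal"));  // child of errors
// If a partition of fatal is lost, Spark replays textFile -> filter -> filter
// for just that partition.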

Keep in mind that collect() brings the entire dataset back to the driver, so it must fit in memory on that single machine; collect() shouldn’t be used on large datasets.
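
For example (a sketch assuming an existing JavaSparkContext sc), take(n) fetches only a fixed number of elements and is safe on datasets of any size:

JavaRDD<Integer> nums = sc.parallelize(java.util.Arrays.asList(1, 2, 3, 4, 5));
java.util.List<Integer> all = nums.collect(); // entire RDD on the driver; only for small data
java.util.List<Integer> few = nums.take(2);   // fetches just 2 elements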

It is important to note that each time we call a new action, the entire RDD must be computed “from scratch.” To avoid this inefficiency, users can persist intermediate results.

cache() is the same as calling persist() with the default storage level, MEMORY_ONLY.
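
A minimal sketch of persisting an intermediate RDD (assuming lines is an existing JavaRDD<String>); the first action materializes the cache, and later actions reuse it:

import org.apache.spark.storage.StorageLevel;

JavaRDD<String> errors = lines.filter(x -> x.contains("error"));
errors.persist(StorageLevel.MEMORY_ONLY());  // equivalent to errors.cache()
long total = errors.count();                                  // computes and caches errors
long fatal = errors.filter(x -> x.contains("fatal")).count(); // reuses the cached data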

Java Functions:

Standard Java Function Interfaces

Example:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

JavaRDD<String> errors = lines.filter(new Function<String, Boolean>() {
  public Boolean call(String x) { return x.contains("error"); }
});

or

class ContainsError implements Function<String, Boolean> {
  public Boolean call(String x) { return x.contains("error"); }
}
JavaRDD<String> errors = lines.filter(new ContainsError());

Using a constructor in a named class

class Contains implements Function<String, Boolean> {
  private String query;
  public Contains(String query) { this.query = query; }
  public Boolean call(String x) { return x.contains(query); }
}
JavaRDD<String> errors = lines.filter(new Contains("error"));
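
Since Function has a single abstract method, a Java 8 lambda can replace the anonymous or named classes above (assuming lines is a JavaRDD<String>):

JavaRDD<String> errors = lines.filter(x -> x.contains("error"));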

The basic properties of an RDD are:

  • RDDs are immutable
  • RDDs are lazily evaluated (see the sketch after this list)
  • RDDs are cacheable
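
A sketch of the lazy-evaluation point (assuming an existing JavaSparkContext sc; the path is illustrative): transformations only record what to do, and an action triggers the actual computation.

JavaRDD<String> lines = sc.textFile("logs.txt");                 // nothing is read yet
JavaRDD<String> errors = lines.filter(x -> x.contains("error")); // still nothing computed
long n = errors.count(); // the action triggers reading and filtering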

[Figure: Apache Spark RDD]

Characteristics of RDD
  • An RDD is an array of references to partition objects
  • A partition is the basic unit of parallelism; each partition holds a reference to a subset of the data
  • Partitions are assigned to cluster nodes with respect to data locality and/or to minimize data transfer
  • Before processing, each partition is loaded into memory (RAM)
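
For example (a sketch assuming an existing JavaSparkContext sc), the partition count can be set when creating an RDD and inspected afterwards:

JavaRDD<Integer> nums = sc.parallelize(java.util.Arrays.asList(1, 2, 3, 4, 5, 6), 3); // request 3 partitions
System.out.println(nums.getNumPartitions()); // 3; each partition can be processed in parallel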

Narrow & Wide Operations

Narrow Operation: RDD operations like map, union, and filter can operate on a single partition and map the data of that partition to a single resulting partition.

[Figure: Spark RDD Narrow Operations]
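
For instance (a sketch assuming nums is an existing JavaRDD<Integer>), both steps below are narrow: each output partition depends on exactly one input partition, so no shuffle is needed.

JavaRDD<Integer> doubled = nums.map(x -> x * 2);          // narrow: partition-to-partition
JavaRDD<Integer> evens = doubled.filter(x -> x % 2 == 0); // narrow: still no shuffle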

Wide Operation: RDD operations like groupByKey, distinct, and join may need to map data across the partitions of the new RDD. Operations that map data from one partition to many partitions are referred to as wide operations.

[Figure: Spark RDD Wide Operations]
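
A sketch of a wide operation (assuming an existing JavaSparkContext sc): groupByKey must shuffle all values with the same key into the same output partition.

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(
  Arrays.asList(new Tuple2<>("a", 1), new Tuple2<>("b", 2), new Tuple2<>("a", 3)));
JavaPairRDD<String, Iterable<Integer>> grouped = pairs.groupByKey(); // wide: data moves across partitions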
