Spark: Algorithm Design for Big Data

Embarrassingly parallel problems

Problems that can easily be decomposed into many independent subproblems. For example:

  • Word Count
  • K-means
  • PageRank
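
For instance, Word Count decomposes into fully independent per-line work followed by one aggregation. A minimal PySpark sketch, assuming an existing SparkContext sc and a small hard-coded input in place of a real file:

lines = sc.parallelize(["to be or not to be", "to thine own self be true"])

counts = (lines
          .flatMap(lambda line: line.split())   # split each line independently
          .map(lambda word: (word, 1))          # one (word, 1) pair per occurrence
          .reduceByKey(lambda a, b: a + b))     # combine the counts per word

print(counts.collect())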

Classical Divide-and-Conquer

Divide the problem into 2 parts → recursively solve each part → combine the results.
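
As a toy illustration of this shape, a recursive sum over a Python list (plain Python, not Spark):

def dc_sum(xs):
    # Base case: nothing left to divide.
    if len(xs) <= 1:
        return xs[0] if xs else 0
    mid = len(xs) // 2
    # Divide into 2 parts, recursively solve each part, combine with +.
    return dc_sum(xs[:mid]) + dc_sum(xs[mid:])

print(dc_sum([1, 4, 3, 5, 6, 7, 0, 1]))  # 27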

D & C under big data systems

  • Divide the problem into $p$ partitions, where (ideally) $p$ is the number of executors in the system.
  • Solve the problem on each partition.
  • Combine the results together.

Example: sum(), reduce()
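
A minimal sketch of this pattern with reduce(), assuming an existing SparkContext sc: each executor reduces its own partition locally, and the driver combines the $p$ partial results.

rdd = sc.parallelize([1, 4, 3, 5, 6, 7, 0, 1], 4)   # p = 4 partitions

total = rdd.reduce(lambda a, b: a + b)   # per-partition reduce, then combine at the driver
print(total)       # 27
print(rdd.sum())   # same result via the built-in action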

Prefix Sums

Input: Sequence $x$ of $n$ elements, binary associative operator $+$

Output: Sequence $y$ of $n$ elements, with $y_k = x_1 + \dots + x_k$

Example: $x = [1, 4, 3, 5, 6, 7, 0, 1]$, $y = [1, 5, 8, 13, 19, 26, 26, 27]$

x = [1, 4, 3, 5, 6, 7, 0, 1]

rdd = sc.parallelize(x, 4).cache()

# Pass 1: compute the local sum of each partition.
def f(iterator):
    yield sum(iterator)

sums = rdd.mapPartitions(f).collect()
print(sums)

# Sequential prefix sum over the p partition sums (cheap: only p values at the driver).
for i in range(1, len(sums)):
    sums[i] += sums[i - 1]
print(sums)

# Pass 2: each partition starts from the total of all preceding partitions
# and emits a running sum for every element it holds.
def g(index, iterator):
    global sums
    if index == 0:
        s = 0
    else:
        s = sums[index - 1]
    for i in iterator:
        s += i
        yield s

prefix_sums = rdd.mapPartitionsWithIndex(g)
print(prefix_sums.collect())

Sample Sort

Step 1: Sampling

Collect a sample of $4p \ln(p/2)$ elements.

Step 2: Broadcast splitters

Pick every $(i \times 4\ln(p/2))$-th element of the sample as a splitter, then broadcast the splitters to all machines.

Step 3: Shuffling

Each machine partitions its local data using the splitters and sends each bucket to its target machine.

Step 4: Sort each partition

Each machine sorts all data received.
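
Putting the four steps together, here is a hedged PySpark sketch (assuming an existing SparkContext sc, a numeric RDD rdd, and p > 2 partitions so the sample-size formulas above stay positive; the helper name target is illustrative, not a Spark API):

import math
import bisect

p = rdd.getNumPartitions()

# Step 1: collect a sample of 4 p ln(p/2) elements.
sample_size = int(4 * p * math.log(p / 2))
sample = sorted(rdd.takeSample(withReplacement=True, num=sample_size))

# Step 2: keep every (4 ln(p/2))-th sample element as a splitter and broadcast.
step = int(4 * math.log(p / 2))
splitters = sc.broadcast(sample[step::step][:p - 1])

# Step 3: shuffling -- route each element to the partition chosen by the splitters.
def target(x):
    return bisect.bisect_left(splitters.value, x)

routed = rdd.map(lambda x: (target(x), x)).partitionBy(p, lambda key: key)

# Step 4: each machine sorts the data it received; collect() then returns a sorted list.
result = routed.mapPartitions(lambda it: sorted(v for _, v in it))
print(result.collect())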


Q: How to sample one element uniformly from $n$ elements stored on $p$ servers?

A:

  • First randomly sample a server (partition)
  • Ask that server to return an element chosen uniformly at random from its $n/p$ elements.
  • The probability of each element being sampled is $\frac{1}{p} \times \frac{p}{n} = \frac{1}{n}$.
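
A minimal sketch of this two-step procedure, assuming an existing SparkContext sc and an RDD rdd whose p partitions hold roughly n/p elements each (the helper pick_one is illustrative):

import random

p = rdd.getNumPartitions()

# Step 1: the master picks one partition (server) uniformly at random.
chosen = random.randrange(p)

# Step 2: only the chosen partition returns one element picked uniformly from its local data.
def pick_one(index, iterator):
    if index == chosen:
        yield random.choice(list(iterator))

print(rdd.mapPartitionsWithIndex(pick_one).collect())   # a one-element list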

Q: How to sample many elements at once?

A: Do each of the two steps above in batch mode

  • First sample $p \ln p$ servers with replacement (this can be done at the master node).
  • If a server is sampled $k$ times, ask that server to return $k$ samples (with replacement) from its local data.
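
A sketch of the batched version under the same assumptions (SparkContext sc, RDD rdd with roughly equal partitions; quota and local_samples are illustrative names):

import math
import random
from collections import Counter

p = rdd.getNumPartitions()
num_samples = int(p * math.log(p))

# Step 1 (at the master): sample p ln p partition indices with replacement and
# count how many samples each partition owes.
quota = Counter(random.randrange(p) for _ in range(num_samples))
quota_bc = sc.broadcast(dict(quota))

# Step 2: a partition sampled k times returns k samples (with replacement) from its local data.
def local_samples(index, iterator):
    k = quota_bc.value.get(index, 0)
    if k > 0:
        data = list(iterator)
        for _ in range(k):
            yield random.choice(data)

samples = rdd.mapPartitionsWithIndex(local_samples).collect()
print(len(samples), samples)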