Spark: Algorithm Design for Big Data

Embarrassingly parallel problems

Problems that can easily be decomposed into many independent subproblems. For example:

  • Word Count
  • K-means
  • PageRank
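
For instance, Word Count decomposes into fully independent per-line work followed by one aggregation. A minimal PySpark sketch, assuming an existing SparkContext sc and a small hard-coded input in place of a real file:

lines = sc.parallelize(["to be or not to be", "to thine own self be true"])

counts = (lines
          .flatMap(lambda line: line.split())   # split each line independently
          .map(lambda word: (word, 1))          # one (word, 1) pair per occurrence
          .reduceByKey(lambda a, b: a + b))     # combine the counts per word

print(counts.collect())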

Classical Divide-and-Conquer

Divide the problem into 2 parts → recursively solve each part → combine the results.
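
As a toy illustration of this shape, a recursive sum over a Python list (plain Python, not Spark):

def dc_sum(xs):
    # Base case: nothing left to divide.
    if len(xs) <= 1:
        return xs[0] if xs else 0
    mid = len(xs) // 2
    # Divide into 2 parts, recursively solve each part, combine with +.
    return dc_sum(xs[:mid]) + dc_sum(xs[mid:])

print(dc_sum([1, 4, 3, 5, 6, 7, 0, 1]))  # 27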

D & C under big data systems

  • Divide the problem into $p$ partitions, where (ideally) $p$ is the number of executors in the system.
  • Solve the problem on each partition.
  • Combine the results together.

Example: sum(), reduce()
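
A minimal sketch of this pattern with reduce(), assuming an existing SparkContext sc: each executor reduces its own partition locally, and the driver combines the $p$ partial results.

rdd = sc.parallelize([1, 4, 3, 5, 6, 7, 0, 1], 4)   # p = 4 partitions

total = rdd.reduce(lambda a, b: a + b)   # per-partition reduce, then combine at the driver
print(total)       # 27
print(rdd.sum())   # same result via the built-in action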

Prefix Sums

Input: Sequence $x$ of $n$ elements, binary associative operator $+$

Output: Sequence $y$ of $n$ elements, with $y_k = x_1 + \dots + x_k$

Example: $x = [1, 4, 3, 5, 6, 7, 0, 1]$, $y = [1, 5, 8, 13, 19, 26, 26, 27]$

x = [1, 4, 3, 5, 6, 7, 0, 1]

rdd = sc.parallelize(x, 4).cache()

# Pass 1: compute the local sum of each partition.
def f(iterator):
    yield sum(iterator)

sums = rdd.mapPartitions(f).collect()
print(sums)

# Sequential prefix sum over the p partition sums (cheap: only p values at the driver).
for i in range(1, len(sums)):
    sums[i] += sums[i - 1]
print(sums)

# Pass 2: each partition starts from the total of all preceding partitions
# and emits a running sum for every element it holds.
def g(index, iterator):
    global sums
    if index == 0:
        s = 0
    else:
        s = sums[index - 1]
    for i in iterator:
        s += i
        yield s

prefix_sums = rdd.mapPartitionsWithIndex(g)
print(prefix_sums.collect())

Sample Sort

Step 1: Sampling

Collect a sample of $4p \ln(p/2)$ elements.

Step 2: Broadcast splitters

Pick every $(i \times 4\ln(p/2))$-th element of the sample as a splitter, then broadcast the splitters to all machines.

Step 3: Shuffling

Each machine partitions its local data using the splitters and sends each bucket to its target machine.

Step 4: Sort each partition

Each machine sorts all data received.
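
Putting the four steps together, here is a hedged PySpark sketch (assuming an existing SparkContext sc, a numeric RDD rdd, and p > 2 partitions so the sample-size formulas above stay positive; the helper name target is illustrative, not a Spark API):

import math
import bisect

p = rdd.getNumPartitions()

# Step 1: collect a sample of 4 p ln(p/2) elements.
sample_size = int(4 * p * math.log(p / 2))
sample = sorted(rdd.takeSample(withReplacement=True, num=sample_size))

# Step 2: keep every (4 ln(p/2))-th sample element as a splitter and broadcast.
step = int(4 * math.log(p / 2))
splitters = sc.broadcast(sample[step::step][:p - 1])

# Step 3: shuffling -- route each element to the partition chosen by the splitters.
def target(x):
    return bisect.bisect_left(splitters.value, x)

routed = rdd.map(lambda x: (target(x), x)).partitionBy(p, lambda key: key)

# Step 4: each machine sorts the data it received; collect() then returns a sorted list.
result = routed.mapPartitions(lambda it: sorted(v for _, v in it))
print(result.collect())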


Q: How to sample one element uniformly from $n$ elements stored on $p$ servers?

A:

  • First randomly sample a server (partition)
  • Ask that server to return an element chosen uniformly at random from its $n/p$ elements.
  • The probability of each element being sampled is $\frac{1}{p} \times \frac{p}{n} = \frac{1}{n}$.
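
A minimal sketch of this two-step procedure, assuming an existing SparkContext sc and an RDD rdd whose p partitions hold roughly n/p elements each (the helper pick_one is illustrative):

import random

p = rdd.getNumPartitions()

# Step 1: the master picks one partition (server) uniformly at random.
chosen = random.randrange(p)

# Step 2: only the chosen partition returns one element picked uniformly from its local data.
def pick_one(index, iterator):
    if index == chosen:
        yield random.choice(list(iterator))

print(rdd.mapPartitionsWithIndex(pick_one).collect())   # a one-element list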

Q: How to sample many elements at once?

A: Do each of the two steps above in batch mode

  • First sample $p \ln p$ servers with replacement (this can be done at the master node).
  • If a server is sampled $k$ times, ask that server to return $k$ samples (with replacement) from its local data.
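
A sketch of the batched version under the same assumptions (SparkContext sc, RDD rdd with roughly equal partitions; quota and local_samples are illustrative names):

import math
import random
from collections import Counter

p = rdd.getNumPartitions()
num_samples = int(p * math.log(p))

# Step 1 (at the master): sample p ln p partition indices with replacement and
# count how many samples each partition owes.
quota = Counter(random.randrange(p) for _ in range(num_samples))
quota_bc = sc.broadcast(dict(quota))

# Step 2: a partition sampled k times returns k samples (with replacement) from its local data.
def local_samples(index, iterator):
    k = quota_bc.value.get(index, 0)
    if k > 0:
        data = list(iterator)
        for _ in range(k):
            yield random.choice(data)

samples = rdd.mapPartitionsWithIndex(local_samples).collect()
print(len(samples), samples)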