PySpark 08 - Spark Algorithms for Big Data
Spark: Algorithm Design for Big Data
Embarrassingly parallel problems
Problems that can be easily decomposed into many independent subproblems, for example (a Word Count sketch in PySpark follows the list):
- Word Count
- K-means
- PageRank
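A minimal Word Count sketch in PySpark. The app name and the tiny in-memory input are illustrative assumptions; in practice the lines would come from `sc.textFile(...)` over a large dataset:

```python
from pyspark.sql import SparkSession

# Build a SparkSession; in the pyspark shell, `spark` and `sc` already exist.
spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()
sc = spark.sparkContext

# Toy input standing in for a large text file.
lines = sc.parallelize(["to be or not to be", "that is the question"])

counts = (lines.flatMap(lambda line: line.split())   # each partition splits its lines independently
               .map(lambda w: (w, 1))                # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # only this step combines work across partitions

print(counts.collect())
```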
Classical Divide-and-Conquer
Divide problem into 2 parts → Recursively solve each part → Combine the results together.
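For contrast with the partition-based version below, a classical single-machine divide-and-conquer sum in plain Python (a sketch, not tied to Spark):

```python
def dc_sum(xs):
    # Divide the problem into 2 parts, recursively solve each, combine with +.
    if not xs:
        return 0
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return dc_sum(xs[:mid]) + dc_sum(xs[mid:])

print(dc_sum([1, 4, 3, 5, 6, 7, 0, 1]))   # 27
```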
D & C under big data systems
- Divide the problem into p partitions, where p is (ideally) the number of executors in the system.
- Solve the problem on each partition.
- Combine the results together.
Example: sum(), reduce()
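A sketch of this pattern in PySpark, assuming a SparkContext `sc` as in the Word Count example above; the partition count 8 is an arbitrary stand-in for the number of executors:

```python
# Each partition is summed locally by its executor; the partial sums are then combined.
rdd = sc.parallelize(range(1, 1001), numSlices=8)

total = rdd.reduce(lambda a, b: a + b)                    # partials combined at the driver
tree_total = rdd.treeReduce(lambda a, b: a + b, depth=2)  # partials combined in rounds

print(total, tree_total)   # 500500 500500
```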
Prefix Sums
Input: A sequence of n elements x_1, x_2, ..., x_n and a binary associative operator +
Output: A sequence of n elements y_1, y_2, ..., y_n, with y_i = x_1 + x_2 + ... + x_i
Example:
x = [1, 4, 3, 5, 6, 7, 0, 1]
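A distributed prefix-sum sketch in PySpark (SparkContext `sc` assumed as above): each partition computes its local total, the partition totals get an exclusive scan at the driver, and each partition then re-scans its data shifted by its offset:

```python
from itertools import accumulate

xs = sc.parallelize([1, 4, 3, 5, 6, 7, 0, 1], numSlices=4)

# Pass 1: total of each partition.
part_sums = xs.mapPartitionsWithIndex(lambda i, it: [(i, sum(it))]).collectAsMap()

# Exclusive prefix sums of the partition totals (tiny, so done at the driver).
offsets, running = {}, 0
for i in sorted(part_sums):
    offsets[i] = running
    running += part_sums[i]
b_offsets = sc.broadcast(offsets)

# Pass 2: local prefix sums, shifted by each partition's offset.
prefix = xs.mapPartitionsWithIndex(
    lambda i, it: (v + b_offsets.value[i] for v in accumulate(it)))

print(prefix.collect())   # [1, 5, 8, 13, 19, 26, 26, 27]
```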
Sample Sort
Step1: Sampling
Collect a random sample of s elements drawn from across all partitions.
Step2: Broadcast splitters
Sort the sample and pick every (s/p)-th element as a splitter, giving p - 1 splitters. Then broadcast them to all machines.
Step3: Shuffling
Each machine partitions its data using the splitters and sends each piece to its target machine.
Step4: Sort each partition
Each machine sorts all data received.
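Putting the four steps together, a sample-sort sketch over an RDD of integers (SparkContext `sc` assumed; the sample fraction 0.01 and 8 partitions are illustrative choices, and in practice `sortBy`/`sortByKey` already implement this idea):

```python
import random
from bisect import bisect_right

data = sc.parallelize([random.randrange(10**6) for _ in range(100_000)], numSlices=8)
p = data.getNumPartitions()

# Step 1: collect a small random sample of the keys.
sample = sorted(data.sample(False, 0.01).collect())

# Step 2: pick p - 1 evenly spaced splitters from the sorted sample and broadcast them.
step = max(1, len(sample) // p)
splitters = sample[step::step][:p - 1]
b_split = sc.broadcast(splitters)

# Step 3: shuffle each element to the partition its key falls into.
shuffled = (data.map(lambda v: (v, None))
                .partitionBy(p, lambda k: bisect_right(b_split.value, k)))

# Step 4: each machine sorts the data it received; partitions are already in global order.
result = shuffled.mapPartitions(lambda it: sorted(k for k, _ in it))
assert result.collect() == sorted(data.collect())
```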
Q: How to sample one element uniformly from n elements stored on p servers?
A:
- First randomly sample a server (partition), choosing server i with probability proportional to the number of elements n_i it holds, i.e., n_i / n.
- Ask that server to return an element chosen uniformly at random from its n_i elements.
- The probability of each element being sampled is (n_i / n) * (1 / n_i) = 1 / n (see the sketch below).
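A small pure-Python sketch of this two-step protocol; the server contents are made up, and `random.choices` with weights stands in for the size-proportional server draw:

```python
import random

servers = [[5, 8, 1], [9, 2], [7, 4, 6, 3]]   # hypothetical data spread over p = 3 servers
sizes = [len(srv) for srv in servers]         # n_i for each server
n = sum(sizes)

# Step 1: pick server i with probability n_i / n.
i = random.choices(range(len(servers)), weights=sizes, k=1)[0]

# Step 2: that server returns one of its elements uniformly at random.
element = random.choice(servers[i])

# P(any fixed element) = (n_i / n) * (1 / n_i) = 1 / n.
print(element)
```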
Q: How to sample many elements at once?
A: Do each of the two steps above in batch mode
- First sample s servers with replacement (this can be done at the master node), again choosing each server with probability proportional to its size.
- If a server is sampled k times, we ask that server to return k samples (with replacement) from its local data. A sketch follows this list.
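The batch version under the same toy setup: the master draws all server indices at once, then each server returns as many local samples as the number of times it was drawn (`s = 10` is an arbitrary sample size):

```python
import random
from collections import Counter

servers = [[5, 8, 1], [9, 2], [7, 4, 6, 3]]   # same hypothetical layout as above
sizes = [len(srv) for srv in servers]
s = 10                                        # number of samples wanted

# Master node: draw s server indices with replacement, weighted by server size.
draws = Counter(random.choices(range(len(servers)), weights=sizes, k=s))

# Each server i returns draws[i] samples, with replacement, from its local data.
batch = [random.choice(servers[i]) for i, k in draws.items() for _ in range(k)]

print(batch)   # s uniform-with-replacement samples from the combined data
```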