PySpark 08 - Spark Algorithm for Big Data
Spark: Algorithm Design for Big Data
Embarrassingly parallel problems
Problems that can be easily decomposed into many independent subproblems. For example (a Word Count sketch follows this list):
Word Count
K-means
PageRank
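Word Count is the canonical example: every line of input can be processed independently. A minimal PySpark sketch, assuming a SparkContext sc and a local file README.md:

lines = sc.textFile("README.md")
counts = (lines.flatMap(lambda line: line.split())   # one record per word
               .map(lambda word: (word, 1))          # pair each word with 1
               .reduceByKey(lambda a, b: a + b))     # sum the counts per word
print(counts.take(5))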
Classical Divide-and-Conquer
Divide problem into 2 parts → Recursively solve each part → Combine the results together.
D & C under big data systems
Divide problem into p partitions, where (ideally) p is the number of executors in the system.
Solve the problem on each partition.
Combine the results together ...
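As a sketch of this recipe, consider finding the maximum of an RDD: each partition is solved locally with mapPartitions, and the partial results are combined with reduce. A SparkContext sc is assumed; the data and partition count below are made up.

rdd = sc.parallelize(range(1000), 8)                # divide into 8 partitions

def partition_max(records):                         # solve each partition locally
    records = list(records)
    if records:                                     # guard against empty partitions
        yield max(records)

partial = rdd.mapPartitions(partition_max)
result = partial.reduce(lambda a, b: max(a, b))     # combine the results
print(result)                                       # 999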
PySpark 07 - Spark Job Scheduling
Spark: Job Scheduling
Operations on RDDs
Narrow Dependencies and Wide Dependencies
HDFS files: Input RDD, one partition for each block of the file.
Map: Transforms each record of the RDD.
Filter: Select a subset of records.
Union: Returns the union of two RDDs.
Join: Narrow or wide dependency, depending on whether the inputs are co-partitioned.
The scheduler examines the RDD’s lineage graph to build a DAG of stages. Stage boundaries are the shuffle operations; within a stage, execution is pipelined and parallel.
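A sketch of where a stage boundary falls, assuming a SparkContext sc: map and filter are narrow and get pipelined into one stage, while reduceByKey forces a shuffle and starts a new one.

pairs = (sc.parallelize(range(100))
           .map(lambda x: (x % 10, x))              # narrow: pipelined
           .filter(lambda kv: kv[1] > 5))           # narrow: same stage as map
counts = pairs.reduceByKey(lambda a, b: a + b)      # wide: shuffle, new stage
print(counts.toDebugString().decode())              # lineage shows the shuffle boundary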
Shuffle operations
Spark uses shuffles to im ...
PySpark 06 - Spark Partitions
RDDs are stored in partitions; when performing computations on RDDs, these partitions can be operated on in parallel. The programmer specifies the number of partitions for an RDD (a default value is used if unspecified). More partitions means more parallelism but also more overhead.
You get better parallelism when the partitions are balanced.
When RDDs are first created, the partitions are balanced.
However, partitions may get out of balance after ...
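A sketch of inspecting and rebalancing partitions, assuming a SparkContext sc: a selective filter can leave most partitions nearly empty, and repartition() redistributes the records via a shuffle.

rdd = sc.parallelize(range(1000), 8)
small = rdd.filter(lambda x: x < 100)               # keeps records from only one partition
print(small.getNumPartitions())                     # still 8
print(small.glom().map(len).collect())              # records per partition, now skewed
balanced = small.repartition(4)                     # shuffle into 4 balanced partitions
print(balanced.glom().map(len).collect())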
PySpark 05 - Spark SQL
DataFrames
Idea borrowed from pandas and R
A DataFrame is an RDD of Row objects.
The fields in a Row can be accessed like attributes.
Create a DataFrame
A DataFrame can be created with Row():
from pyspark.sql import Row
row = Row(name="Alice", age=11)
Elements in a Row can be accessed through row.name or row['name']. In some cases, row['name'] should be chosen to avoid conflicts with row’s own attributes.
row = Row(name="Alice", age=11, count= ...
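The truncated example above hints at the conflict: Row is a subclass of tuple, so a field named count collides with the built-in tuple.count method. A minimal sketch:

from pyspark.sql import Row
row = Row(name="Alice", age=11, count=1)
print(row.count)       # the tuple.count method, not the field value
print(row['count'])    # 1 -- indexing retrieves the field reliably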
PySpark 04 - Key Value Pairs
While most Spark operations work on RDDs containing any type of object, a few special operations are only available on RDDs of key-value pairs.
For example, the following code uses the reduceByKey operation on key-value pairs to count how many times each line of text occurs in a file:
lines = sc.textFile("README.md")
pairs = lines.map(lambda s: (s, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)
More examples of key-value pair operations
reduceByKey()
...
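The example under reduceByKey() is cut off above; a minimal sketch of a few common key-value pair operations, assuming a SparkContext sc (output order may vary across partitions):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.reduceByKey(lambda a, b: a + b).collect())   # e.g. [('a', 4), ('b', 2)]
print(pairs.groupByKey().mapValues(list).collect())      # e.g. [('a', [1, 3]), ('b', [2])]
print(pairs.sortByKey().collect())                       # [('a', 1), ('a', 3), ('b', 2)]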
PySpark 03 - Use take() instead of collect()
This note discusses a mistake that can be caused by “lazy execution” and global variables.
The following code is an implementation of the linear-select problem.
Problem:
Input: an array A of n numbers (unordered), and k.
Output: the k-th smallest number (counting from 0).
Algorithm:
x = A[0]
partition A into A[0…mid-1] < A[mid] = x < A[mid+1…n-1]
if mid = k then return x
if k < mid then A = A[0…mid-1]; if k > mid then A = A[mid+1…n-1], k = k ...
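A PySpark sketch of this algorithm, assuming distinct values and a pre-loaded RDD. Note the pivot is bound with a default argument (x=x): capturing the variable directly in a lazily evaluated filter is exactly the kind of mistake this note discusses.

def linear_select(rdd, k):
    while True:
        x = rdd.first()                               # pivot
        smaller = rdd.filter(lambda a, x=x: a < x)    # bind x now, not at evaluation
        mid = smaller.count()
        if k == mid:
            return x
        if k < mid:
            rdd = smaller
        else:
            rdd = rdd.filter(lambda a, x=x: a > x)
            k = k - mid - 1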
PySpark 02 - Closure and Persistence
1 Closure
A task’s closure is those variables and methods which must be visible for the executor to perform its computations on the RDD.
Functions that run on RDDs at executors
Any global variables used by those functions
The variables within the closure sent to each executor are copies.
This closure is serialized and sent to each executor from the driver when an action is invoked.
For example:
counter = 0
rdd = sc.parallelize(range(10))
def in ...
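The excerpt is cut off; a sketch of the full pitfall, following the counter example in the Spark programming guide:

counter = 0
rdd = sc.parallelize(range(10))

def increment_counter(x):
    global counter
    counter += x

rdd.foreach(increment_counter)
# Each executor increments its own copy of counter from the serialized
# closure; the driver's counter is never touched.
print("Counter value:", counter)    # prints 0, not 45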
PySpark 01 - RDD basics
1 RDDs and partitions
RDDs: Resilient Distributed Datasets.
A real or virtual file consisting of records.
Partitioned into partitions.
Created through deterministic transformations on “Data in persistent storage” or other RDDs.
Do not need to be materialized.
Users can control the persistence and the partitioning (e.g., by the key of the record) of RDDs.
Programmers specify the number of partitions for an RDD (a default value is used if not specified). More partitions means better paralle ...
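A minimal sketch of setting the partition count at creation time, assuming a SparkContext sc:

rdd = sc.parallelize(range(100), 4)     # request 4 partitions explicitly
print(rdd.getNumPartitions())           # 4
default = sc.parallelize(range(100))    # fall back to the default
print(default.getNumPartitions())       # depends on the cluster configuration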
Leading FastSpeech2 to Controllable Text-To-Speech
This is the final conclusion of my B.E. graduation work; the code is open source at: https://github.com/aucki6144/ctts
Introduction
Speech synthesis, the task of converting text into natural-sounding speech, is a central topic in AI, NLP, and speech processing. While recent advancements in deep learning have significantly improved the naturalness and robustness of speech synthesis, there are still challenges in controlling the nuances of synthesized sp ...
Data Mining and Knowledge Discovery
1 Association Rule Mining
1.1 Definitions
n-itemset: an itemset containing n items. Example: {B,C} is a 2-itemset.
support: the number of times an itemset appears in the dataset.
Large itemsets: itemsets with support ≥ a threshold
Association rule, Support and Confidence
Rule: X → Y . For example: {B,C} → E
Support: The support of a rule equals the support of {X, Y}
Confidence: Support({X, Y}) / Support(X), i.e., the fraction of occurrences of X that also contain Y.
We want to find association ru ...
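A toy sketch of these definitions in plain Python; the transactions below are made up:

transactions = [{"B", "C", "E"}, {"B", "C"}, {"A", "B", "C", "E"}, {"A", "E"}]

def support(itemset):
    # number of transactions that contain every item in the itemset
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"B", "C"}, {"E"}
print(support(X))                        # 3
print(support(X | Y))                    # 2 -- support of the rule {B,C} -> E
print(support(X | Y) / support(X))       # confidence: 2/3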