PySpark 08 - Spark Algorithm for Big Data
Spark: Algorithm Design for Big Data
Embarrassingly parallel problems
Problems that can be easily decomposed into many independent subproblems. For example (a Word Count sketch follows this list):
Word Count
K-means
PageRank
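Word Count is the canonical example: every line of input can be processed independently. A minimal PySpark sketch, assuming a SparkContext sc and a local file README.md:

lines = sc.textFile("README.md")
counts = (lines.flatMap(lambda line: line.split())   # one record per word
               .map(lambda word: (word, 1))          # pair each word with 1
               .reduceByKey(lambda a, b: a + b))     # sum the counts per word
print(counts.take(5))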
Classical Divide-and-Conquer
Divide problem into 2 parts → Recursively solve each part → Combine the results together.
D & C under big data systems
Divide problem into p partitions, where (ideally) p is the number of executors in the system.
Solve the problem on each partition.
Combine the results together ...
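As a sketch of this recipe, consider finding the maximum of an RDD: each partition is solved locally with mapPartitions, and the partial results are combined with reduce. A SparkContext sc is assumed; the data and partition count below are made up.

rdd = sc.parallelize(range(1000), 8)                # divide into 8 partitions

def partition_max(records):                         # solve each partition locally
    records = list(records)
    if records:                                     # guard against empty partitions
        yield max(records)

partial = rdd.mapPartitions(partition_max)
result = partial.reduce(lambda a, b: max(a, b))     # combine the results
print(result)                                       # 999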
PySpark 07 - Spark Job Scheduling
Spark: Job Scheduling
Operations on RDDs
Narrow Dependencies and Wide Dependencies
HDFS files: Input RDD, one partition for each block of the file.
Map: Transforms each record of the RDD.
Filter: Select a subset of records.
Union: Returns the union of two RDDs.
Join: Narrow or wide dependency, depending on whether the inputs are co-partitioned.
The scheduler examines the RDD’s lineage graph to build a DAG of stages. Stage boundaries are the shuffle operations; within a stage, execution is pipelined and parallel.
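A sketch of where a stage boundary falls, assuming a SparkContext sc: map and filter are narrow and get pipelined into one stage, while reduceByKey forces a shuffle and starts a new one.

pairs = (sc.parallelize(range(100))
           .map(lambda x: (x % 10, x))              # narrow: pipelined
           .filter(lambda kv: kv[1] > 5))           # narrow: same stage as map
counts = pairs.reduceByKey(lambda a, b: a + b)      # wide: shuffle, new stage
print(counts.toDebugString().decode())              # lineage shows the shuffle boundary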
Shuffle operations
Spark uses shuffles to im ...
PySpark 06 - Spark Partitions
RDDs are stored in partitions; when performing computations on RDDs, these partitions can be operated on in parallel. The programmer specifies the number of partitions for an RDD (a default value is used if unspecified). More partitions means more parallelism but also more overhead.
You get better parallelism when the partitions are balanced.
When RDDs are first created, the partitions are balanced.
However, partitions may get out of balance after ...
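A sketch of inspecting and rebalancing partitions, assuming a SparkContext sc: a selective filter can leave most partitions nearly empty, and repartition() redistributes the records via a shuffle.

rdd = sc.parallelize(range(1000), 8)
small = rdd.filter(lambda x: x < 100)               # keeps records from only one partition
print(small.getNumPartitions())                     # still 8
print(small.glom().map(len).collect())              # records per partition, now skewed
balanced = small.repartition(4)                     # shuffle into 4 balanced partitions
print(balanced.glom().map(len).collect())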
PySpark 05 - Spark SQL
DataFrames
Idea borrowed from pandas and R
A DataFrame is an RDD of Row objects.
The fields in a Row can be accessed like attributes.
Create a DataFrame
A DataFrame can be created with Row():
from pyspark.sql import Row
row = Row(name="Alice", age=11)
Elements in a Row can be accessed through row.name or row['name']. In some cases, row['name'] should be chosen to avoid conflicts with row’s own attributes.
row = Row(name="Alice", age=11, count= ...
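The truncated example above hints at the conflict: Row is a subclass of tuple, so a field named count collides with the built-in tuple.count method. A minimal sketch:

from pyspark.sql import Row
row = Row(name="Alice", age=11, count=1)
print(row.count)       # the tuple.count method, not the field value
print(row['count'])    # 1 -- indexing retrieves the field reliably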
PySpark 04 - Key Value Pairs
While most Spark operations work on RDDs containing any type of object, a few special operations are only available on RDDs of key-value pairs.
For example, the following code uses the reduceByKey operation on key-value pairs to count how many times each line of text occurs in a file:
lines = sc.textFile("README.md")
pairs = lines.map(lambda s: (s, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)
More examples of key-value pair operations
reduceByKey()
...
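The example under reduceByKey() is cut off above; a minimal sketch of a few common key-value pair operations, assuming a SparkContext sc (output order may vary across partitions):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.reduceByKey(lambda a, b: a + b).collect())   # e.g. [('a', 4), ('b', 2)]
print(pairs.groupByKey().mapValues(list).collect())      # e.g. [('a', [1, 3]), ('b', [2])]
print(pairs.sortByKey().collect())                       # [('a', 1), ('a', 3), ('b', 2)]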
PySpark 03 - Use take() instead of collect()
This note discusses a mistake that can be caused by “lazy execution” and global variables.
The following code is an implementation of the linear-select problem.
Problem:
Input: an array A of n numbers (unordered), and k.
Output: the k-th smallest number (counting from 0).
Algorithm:
x = A[0]
partition A into A[0…mid-1] < A[mid] = x < A[mid+1…n-1]
if mid = k then return x
if k < mid then A = A[0…mid-1]; if k > mid then A = A[mid+1…n-1], k = k ...
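A PySpark sketch of this algorithm, assuming distinct values and a pre-loaded RDD. Note the pivot is bound with a default argument (x=x): capturing the variable directly in a lazily evaluated filter is exactly the kind of mistake this note discusses.

def linear_select(rdd, k):
    while True:
        x = rdd.first()                               # pivot
        smaller = rdd.filter(lambda a, x=x: a < x)    # bind x now, not at evaluation
        mid = smaller.count()
        if k == mid:
            return x
        if k < mid:
            rdd = smaller
        else:
            rdd = rdd.filter(lambda a, x=x: a > x)
            k = k - mid - 1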
PySpark 02 - Closure and Persistence
1 Closure
A task’s closure is those variables and methods which must be visible for the executor to perform its computations on the RDD.
Functions that run on RDDs at executors
Any global variables used by those functions
The variables within the closure sent to each executor are copies.
This closure is serialized and sent to each executor from the driver when an action is invoked.
For example:
counter = 0
rdd = sc.parallelize(range(10))
def in ...
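The excerpt is cut off; a sketch of the full pitfall, following the counter example in the Spark programming guide:

counter = 0
rdd = sc.parallelize(range(10))

def increment_counter(x):
    global counter
    counter += x

rdd.foreach(increment_counter)
# Each executor increments its own copy of counter from the serialized
# closure; the driver's counter is never touched.
print("Counter value:", counter)    # prints 0, not 45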
PySpark 01 - RDD basics
1 RDDs and partitions
RDDs: Resilient Distributed Datasets.
A real or virtual file consisting of records.
Partitioned into partitions.
Created through deterministic transformations on “Data in persistent storage” or other RDDs.
Do not need to be materialized.
Users can control the persistence and the partitioning (e.g., by the key of the record) of RDDs.
Programmers specify the number of partitions for an RDD (a default value is used if not specified). More partitions means better paralle ...
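A minimal sketch of setting the partition count at creation time, assuming a SparkContext sc:

rdd = sc.parallelize(range(100), 4)     # request 4 partitions explicitly
print(rdd.getNumPartitions())           # 4
default = sc.parallelize(range(100))    # fall back to the default
print(default.getNumPartitions())       # depends on the cluster configuration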
Leading FastSpeech2 to Controllable Text-To-Speech
This is the final conclusion of my B.E. graduation work; the code is open source at: https://github.com/aucki6144/ctts
Introduction
Speech synthesis, the task of converting text into natural-sounding speech, is a central topic in AI, NLP, and speech processing. While recent advancements in deep learning have significantly improved the naturalness and robustness of speech synthesis, there are still challenges in controlling the nuances of synthesized sp ...
Data Mining and Knowledge Discovery
1 Association Rule Mining
1.1 Definitions
n-itemset: an itemset containing n items. Example: {B,C} is a 2-itemset.
support: the number of times an itemset appears in the dataset.
Large itemsets: itemsets with support ≥ a threshold
Association rule, Support and Confidence
Rule: X → Y . For example: {B,C} → E
Support: The support of a rule equals the support of {X, Y}
Confidence: Support({X, Y}) / Support(X), i.e., the fraction of occurrences of X that also contain Y.
We want to find association ru ...
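A toy sketch of these definitions in plain Python; the transactions below are made up:

transactions = [{"B", "C", "E"}, {"B", "C"}, {"A", "B", "C", "E"}, {"A", "E"}]

def support(itemset):
    # number of transactions that contain every item in the itemset
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"B", "C"}, {"E"}
print(support(X))                        # 3
print(support(X | Y))                    # 2 -- support of the rule {B,C} -> E
print(support(X | Y) / support(X))       # confidence: 2/3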