Data Mining and Knowledge Discovery

1 Association Rule Mining

1.1 Definitions

n-itemset: an itemset containing n items. Example: {B,C} is a 2-itemset.

support: the number of transactions in the dataset in which the itemset appears.

Large itemsets: itemsets with support ≥ a threshold.

Association rule, Support and Confidence

  • Rule: X → Y. For example: {B,C} → E
  • Support: The support of a rule X → Y equals the support of {X,Y} (i.e., X ∪ Y)
  • Confidence: Support({X,Y}) / Support(X)

In association rule mining, we want to find the association rules whose support and confidence both exceed their respective thresholds.

K-frequent N-itemset: an itemset with N items whose support is at least K.
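As a small illustration (not from the notes), the following sketch computes the support and confidence of the rule {B,C} → E on a hypothetical list of transactions.

```python
# Hypothetical transactions, used only to illustrate support and confidence.
transactions = [
    {"A", "B", "C", "E"},
    {"B", "C", "E"},
    {"B", "E"},
    {"A", "C"},
]

def support(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"B", "C"}, {"E"}
rule_support = support(X | Y, transactions)           # support of {B, C, E}
confidence = rule_support / support(X, transactions)  # Support(X ∪ Y) / Support(X)
print(rule_support, confidence)                       # 2 and 1.0 for this data
```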


1.2 Apriori

Two properties

  1. If an itemset S is large, then any proper subset of S must be large.
  2. If an itemset S is NOT large, then any proper superset of S must NOT be large.

Algorithm of the Apriori Method

04.png

Candidate Generation: Join and Prune

Join Step:

  • Input: $L_{k-1}$, a set of large (k−1)-itemsets

  • Output: $C_k$, a set of candidate k-itemsets

  • Algorithm:

    Select two large (k−1)-itemsets p and q that share the same first k−2 items (the same prefix)

    Insert into $C_k$ the itemset formed by p plus the last item of q, where $q.item_{k-1} > p.item_{k-1}$

Prune Step:

  • For all candidates $c \in C_k$, do:
    • For all (k−1)-subsets s of c, do:
      • If s is not large, delete c from $C_k$.

After candidate generation with the Join Step and the Prune Step, scan the database to obtain the count of each itemset in the candidate set, then produce $L_k$ (the candidates whose support meets the threshold).
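A minimal sketch (not from the notes) of the Join and Prune steps above, assuming each large (k−1)-itemset is represented as a sorted tuple of items:

```python
from itertools import combinations

def apriori_gen(large_prev):
    """Join + prune: generate candidate k-itemsets from large (k-1)-itemsets."""
    large_prev = set(large_prev)
    k = len(next(iter(large_prev))) + 1
    candidates = set()
    # Join step: combine p and q sharing the first k-2 items, with q's last item larger.
    for p in large_prev:
        for q in large_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # Prune step: drop any candidate with a (k-1)-subset that is not large.
    return {c for c in candidates
            if all(s in large_prev for s in combinations(c, k - 1))}

# Example: large 2-itemsets -> candidate 3-itemsets
print(apriori_gen({("B", "C"), ("B", "E"), ("C", "E")}))  # {('B', 'C', 'E')}
```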


1.3 FP-Tree

Disadvantages of the Apriori method: 1. It is costly to handle a large number of candidate sets. 2. It is tedious to repeatedly scan the whole database.

The FP-tree stores all the essential information from the database in a compact tree structure, so the database does not need to be scanned repeatedly.

FP-tree steps:

  1. Deduce the ordered frequent items (ordered by descending frequency). For items with the same frequency, the order is given by alphabetical order.
  2. Construct the FP-tree from the above data.
  3. From the FP-tree above, construct the FP-conditional tree for each item (or itemset).
  4. Determine the frequent patterns.

Example (Threshold=3)

Step1: Deduce the ordered frequent items. For items with the same frequency, the order is given by the alphabetical order.

01.png

Step2: Construct the FP-tree from the above data.

02.png

Step3: From the FP-tree above, construct the FP-conditional tree for each item (or itemset).

(For each node labelled “g”, trace the path from that node back to the root.)

03.png

Step4: Determine the frequent patterns.

20.png

Complexity of FP-tree:

It requires two scans of the transaction DB: one to collect the frequent items and one to construct the FP-tree.

Cost to insert one transaction: Number of frequent items in this transaction.

The size of the FP-tree is bounded by the overall occurrences of the frequent items in the database.

The height of the tree is bounded by the maximum number of frequent items in any transaction in the database
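A minimal sketch of the first database scan behind Step 1, assuming transactions are given as lists of items; the sample transactions and the threshold of 3 are illustrative.

```python
from collections import Counter

def ordered_frequent_items(transactions, threshold):
    """First DB scan: count items, keep those meeting the threshold, and rewrite
    each transaction with its frequent items ordered by descending frequency
    (ties broken alphabetically). Each rewritten transaction is then inserted
    into the FP-tree along a shared prefix path."""
    counts = Counter(item for t in transactions for item in t)
    frequent = {i: c for i, c in counts.items() if c >= threshold}
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}
    return [sorted((i for i in t if i in rank), key=rank.get) for t in transactions]

# Illustrative transactions only (not the ones shown in the figures).
txns = [["a", "b", "c"], ["a", "c", "d"], ["a", "c"], ["b", "d"]]
print(ordered_frequent_items(txns, threshold=3))  # [['a', 'c'], ['a', 'c'], ['a', 'c'], []]
```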


2 Clustering

2.1 K-means Clustering

Suppose we have n example feature vectors and we know that they fall into k compact clusters.

Let $m_i$ be the mean of the vectors in cluster $i$. We can say that $x$ is in cluster $i$ if the distance from $x$ to $m_i$ is the minimum of all the $k$ distances.

Procedure for finding k-means

  • Make initial guess for the means: $m_1, m_2, m_3, \ldots, m_k$

  • Until there is no change in any mean:

    • Assign each point to its nearest cluster (mean)
    • Calculate the new mean for each cluster
    • Replace all the means with new means
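A minimal sketch of the procedure above (the data and the choice of Euclidean distance are illustrative):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """X is an (n, d) array of feature vectors; initial means are k random examples."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest mean.
        labels = np.argmin(np.linalg.norm(X[:, None] - means[None, :], axis=2), axis=1)
        # Recompute each mean (keep the old one if a cluster becomes empty).
        new_means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else means[i]
                              for i in range(k)])
        if np.allclose(new_means, means):   # no change in any mean -> stop
            break
        means = new_means
    return means, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + 5])
means, labels = k_means(X, k=2)
```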

Initialization of k-means

Randomly choose $k$ of the examples.

The result of K-means clustering depends on the initial means and it frequently happens that suboptimal results are found. The standard solution is to try a number of different initial means.

Disadvantages of k-means

  • “Bad” initial guess.
  • The value of k must be chosen in advance, which is not user-friendly (the user cannot easily choose the best k for unknown data).

2.2 Sequential K-means Clustering

Update the means with one example at a time. Useful when the data is acquired over a period of time.

Procedure for sequential k-means clustering

  • Make initial guess for the means: $m_1, m_2, m_3, \ldots, m_k$
  • Set the counts $n_1, n_2, n_3, \ldots, n_k$ to zero
  • Until data is used up / interrupted:
    • Acquire the next example $x$
    • If $m_i$ is closest to $x$:
      • Increment $n_i$
      • Replace $m_i$ by $m_i + \frac{1}{n_i}(x - m_i)$
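A minimal sketch of the sequential update above, assuming the examples arrive as a stream of feature vectors:

```python
import numpy as np

def sequential_k_means(stream, initial_means):
    """Means are adjusted one example at a time, so the data never has to be stored."""
    means = np.array(initial_means, dtype=float)
    counts = np.zeros(len(means))
    for x in stream:
        i = np.argmin(np.linalg.norm(means - x, axis=1))  # closest mean
        counts[i] += 1
        means[i] += (x - means[i]) / counts[i]            # running average update
    return means

stream = np.random.randn(200, 2)  # example data acquired over a period of time
print(sequential_k_means(stream, initial_means=[[0, 0], [1, 1]]))
```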

2.3 Forgetful Sequential K-means Clustering

Let $a$ be a constant between 0 and 1.

Procedure for forgetful sequential k-means clustering

  • Make initial guess for the means: $m_1, m_2, m_3, \ldots, m_k$
  • If $m_i$ is closest to $x$, replace $m_i$ by $m_i + a(x - m_i)$

In this method, $m_i$ is a weighted average of the examples that were closest to $m_i$, where the weight decreases exponentially with the “age” of the example.

Let $m_0$ be the initial value of the mean vector and let $x_j$ be the $j$-th example out of the $n$ examples that were used to form $m_i$. Then:

$$m_n = (1-a)^n m_0 + a\sum_{k=1}^{n}(1-a)^{n-k}x_k$$
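A short derivation, not spelled out in the notes, assuming all $n$ examples were closest to $m_i$: each update replaces the mean by a blend of the old mean and the new example, and unrolling this recurrence $n$ times gives the closed form above.

$$m_j = m_{j-1} + a(x_j - m_{j-1}) = (1-a)\,m_{j-1} + a\,x_j \quad\Longrightarrow\quad m_n = (1-a)^n m_0 + a\sum_{k=1}^{n}(1-a)^{n-k}x_k$$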

2.4 Hierarchical Clustering

Two varieties of hierarchical clustering algorithms:

  • Agglomerative
  • Divisive

Hierarchical grouping can be represented by a two-dimensional diagram known as a dendrogram.

05.png

2.4a Agglomerative hierarchical clustering

Single Linkage

Also known as the nearest neighbor technique.

Distance between groups equals the distance of the closest pair of data (one from each group).

Complete Linkage

Distance between groups equals the distance of the most distant pair of data.

Group Average Linkage

Distance between groups equals the average of the distances between all pairs of records (one from each cluster).

Centroid Linkage

Distance between groups equals the distance between the mean vectors of the two clusters.

Median Linkage

Disadvantage of centroid clustering: when a large cluster is merged with a small one, the centroid of the combined cluster would be close to the large one, i.e., the characteristic properties of the small one are lost.

In median linkage, after two groups have been combined, the mid-point of the original two cluster centers is used as the new center of the newly combined group.
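A minimal sketch (assuming SciPy is available) that builds the merge hierarchy for each of the linkages above on illustrative random data:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.random.randn(10, 2)  # illustrative data only

# Each method name corresponds to one of the linkages described above.
for method in ["single", "complete", "average", "centroid", "median"]:
    Z = linkage(X, method=method)   # (n-1) x 4 table: which clusters merge, at what distance
    print(method, Z[-1])            # the final merge for each linkage

# The hierarchy can be drawn as a dendrogram (requires matplotlib):
# dendrogram(Z)
```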

2.4b Divisive Methods

Polythetic Approach (Based on Group Average Linkage)

For the first iteration, calculate the average distance between each point and the remaining points. Choose the point with the largest average distance to split off and form the initial division.

Example:

06.png

In the following iterations:

  • Calculate the distance between each point and group A / group B, as $Dist_A$ and $Dist_B$
  • Calculate the difference $Dist_B - Dist_A$
  • Choose the point with the largest distance difference and put it into A.

Example:

07.png

End the iteration when all differences are negative. The process then continues on each subgroup separately.

Monothetic Approach

It is usually used when the data consists of binary variables.

08.png

09.png

In this case, choose attribute A for dividing the data into two groups {2,3,4} and {1,5}


3 Classification

3.1 Decision Tree

ID3

Entropy is used to measure how informative a node is. Given a probability distribution $P = (p_1, p_2, \ldots, p_n)$, the Information (or the Entropy) of P is:

$$I(P) = -p_1 \log p_1 - p_2 \log p_2 - \cdots - p_n \log p_n$$

The Entropy is a way to measure the amount of information. The smaller the entropy is, the more information we have.

ID3 Impurity Measurement:

$$Gain(A,T) = Info(T) - Info(A,T)$$

C4.5

C4.5 Impurity Measurement:

$$Gain(A,T) = \frac{Info(T) - Info(A,T)}{SplitInfo(A)}, \quad \text{where } SplitInfo(A) = -\sum_{v\in A} p(v)\log p(v)$$

CART

CART Impurity Measurement:

$$Gini:\quad Info(P) = 1 - \sum_{j} p_j^2$$
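A minimal sketch of the impurity measures above, using log base 2 and a hypothetical dataset:

```python
import math
from collections import Counter

def entropy(labels):
    """I(P) = -sum p_j * log2(p_j) over the class distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """CART impurity: 1 - sum p_j^2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """ID3: Gain(A, T) = Info(T) - Info(A, T), where Info(A, T) is the
    weighted entropy of the partitions induced by `attribute`."""
    n = len(labels)
    info_a = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attribute] == value]
        info_a += len(subset) / n * entropy(subset)
    return entropy(labels) - info_a

# Illustrative data only (one hypothetical attribute, two classes).
rows = [{"windy": "yes"}, {"windy": "yes"}, {"windy": "no"}, {"windy": "no"}]
labels = ["play", "stay", "play", "play"]
print(information_gain(rows, labels, "windy"))  # ~0.311
```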

3.2 Naïve Bayesian Classifier

Bayes Rule:

$$P(A\mid B) = \frac{P(B\mid A)\,P(A)}{P(B)}, \quad \text{where } A, B \text{ are random variables}$$

Independent Assumption:

  • The attributes are independent of each other given the class:

$$P(X,Y,Z \mid A) = P(X\mid A)\times P(Y\mid A)\times P(Z\mid A)$$

Naïve Bayes

Naïve Bayes Classification is based on the independent assumption.

Race   Income  Child  Insurance
black  high    no     yes
white  high    yes    yes
white  low     yes    yes
white  low     yes    yes
black  low     no     no
black  low     no     no
black  low     no     no
white  low     no     no

Suppose we want to calculate $P(Yes \mid Race=white, Income=high, child=no)$:

$$
\begin{aligned}
P(Yes \mid Race=white, Income=high, child=no)
&= \frac{P(Race=white, Income=high, child=no \mid Yes)\,P(Yes)}{P(Race=white, Income=high, child=no)}\\
&= \frac{P(Race=white \mid Yes)\,P(Income=high \mid Yes)\,P(child=no \mid Yes)\,P(Yes)}{P(Race=white, Income=high, child=no)}\\
&= \frac{0.09375\times 0.5}{P(Race=white, Income=high, child=no)}
= \frac{0.046875}{P(Race=white, Income=high, child=no)}
\end{aligned}
$$

With the same method, we calculate:

$$P(No \mid Race=white, Income=high, child=no) = \frac{0}{P(Race=white, Income=high, child=no)}$$

(The numerator is 0 because $P(Income=high \mid No) = 0$ in the data.)

Therefore, we predict that the following new person will buy insurance:

Race Income Child Insurance
white high no ?
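A minimal sketch that reproduces the calculation above from the training table; since the denominator $P(Race=white, Income=high, child=no)$ is the same for both classes, it can be ignored when comparing them.

```python
# The training table from this section.
data = [
    ("black", "high", "no", "yes"),
    ("white", "high", "yes", "yes"),
    ("white", "low", "yes", "yes"),
    ("white", "low", "yes", "yes"),
    ("black", "low", "no", "no"),
    ("black", "low", "no", "no"),
    ("black", "low", "no", "no"),
    ("white", "low", "no", "no"),
]

def naive_bayes_score(race, income, child, label):
    """P(attributes | label) * P(label): the numerator of Bayes' rule,
    using the independence assumption for the attributes."""
    rows = [r for r in data if r[3] == label]
    prior = len(rows) / len(data)
    likelihood = 1.0
    for idx, value in ((0, race), (1, income), (2, child)):
        likelihood *= sum(1 for r in rows if r[idx] == value) / len(rows)
    return prior * likelihood

for label in ("yes", "no"):
    print(label, naive_bayes_score("white", "high", "no", label))
# yes -> 0.5 * (3/4)*(2/4)*(1/4) = 0.046875, no -> 0 (no high-income "no" rows)
```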

3.3 Bayesian Belief Network

A Bayesian belief network does not make the independence assumption: some attributes are dependent on other attributes. For example:

10.png

Conditionally independent

If $P(x \mid y, z) = P(x \mid z)$, x is conditionally independent of y given z. Also, if x is conditionally independent of y given z, then $P(x, y \mid z) = P(x \mid z)\times P(y \mid z)$.

In Bayesian Belief Network, a node is conditionally independent of its non-descendants if its parents are known.

Algorithm

Suppose there is a new person and we want to know whether he is likely to have Heart Disease:

Exercise Diet Heartburn Blood Pressure Chest Pain Heart Disease
? ? ? ? ? ?

$$
\begin{aligned}
P(HD=Yes) &= \sum_{x\in\{yes,no\}}\sum_{y\in\{healthy,unhealthy\}} P(HD=Yes \mid E=x, D=y)\times P(E=x, D=y)\\
&= \sum_{x\in\{yes,no\}}\sum_{y\in\{healthy,unhealthy\}} P(HD=Yes \mid E=x, D=y)\times P(E=x)\times P(D=y)\\
&= 0.25\times0.7\times0.25 + 0.45\times0.7\times0.75 + 0.55\times0.3\times0.25 + 0.75\times0.3\times0.75 = 0.49\\
P(HD=No) &= 1 - P(HD=Yes) = 1 - 0.49 = 0.51
\end{aligned}
$$


Suppose there is a new person and we want to know whether he is likely to have Heart Disease:

Exercise Diet Heartburn Blood Pressure Chest Pain Heart Disease
? ? ? High ? ?

$$
\begin{aligned}
P(BP=High) &= 0.49\times0.85 + 0.51\times0.2 = 0.5185\\
P(HD=Yes \mid BP=High) &= \frac{P(BP=High \mid HD=Yes)\,P(HD=Yes)}{P(BP=High)} = \frac{0.85\times0.49}{0.5185} = 0.8033\\
P(HD=No \mid BP=High) &= 1 - P(HD=Yes \mid BP=High) = 1 - 0.8033 = 0.1967
\end{aligned}
$$


Suppose there is a new person and we want to know whether he is likely to have Heart Disease:

Exercise Diet Heartburn Blood Pressure Chest Pain Heart Disease
Yes Healthy ? High ? ?

$$
\begin{aligned}
P(HD=Yes \mid E=Yes, D=Healthy, BP=High)
&= \frac{P(BP=High \mid E=Yes, D=Healthy, HD=Yes)}{P(BP=High \mid E=Yes, D=Healthy)}\times P(HD=Yes \mid D=Healthy, E=Yes)\\
&= \frac{P(BP=High \mid HD=Yes)\times P(HD=Yes \mid D=Healthy, E=Yes)}{\sum_{x\in\{yes,no\}} P(BP=High \mid HD=x)\times P(HD=x \mid E=Yes, D=Healthy)}\\
&= \frac{0.85\times0.25}{0.85\times0.25 + 0.2\times0.75} = 0.5862\\
P(HD=No \mid E=Yes, D=Healthy, BP=High) &= 1 - P(HD=Yes \mid E=Yes, D=Healthy, BP=High) = 1 - 0.5862 = 0.4138
\end{aligned}
$$
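A minimal sketch that reproduces the three queries above; the CPT values and the structure (Exercise and Diet → Heart Disease → Blood Pressure) are inferred from the worked examples.

```python
# CPT values taken from the worked examples above.
P_E_yes = 0.7
P_D_healthy = 0.25
P_HD_yes_given_E_D = {          # P(HD = Yes | Exercise, Diet)
    ("yes", "healthy"): 0.25,
    ("yes", "unhealthy"): 0.45,
    ("no", "healthy"): 0.55,
    ("no", "unhealthy"): 0.75,
}
P_BP_high_given_HD = {"yes": 0.85, "no": 0.2}

# Query 1: marginal P(HD = Yes), summing out Exercise and Diet.
p_hd_yes = sum(
    P_HD_yes_given_E_D[(e, d)]
    * (P_E_yes if e == "yes" else 1 - P_E_yes)
    * (P_D_healthy if d == "healthy" else 1 - P_D_healthy)
    for e in ("yes", "no") for d in ("healthy", "unhealthy")
)
print(p_hd_yes)  # 0.49

# Query 2: P(HD = Yes | BP = High) via Bayes' rule.
p_bp_high = p_hd_yes * P_BP_high_given_HD["yes"] + (1 - p_hd_yes) * P_BP_high_given_HD["no"]
print(P_BP_high_given_HD["yes"] * p_hd_yes / p_bp_high)  # ~0.8033

# Query 3: P(HD = Yes | E = Yes, D = Healthy, BP = High).
p_hd_yes_ed = P_HD_yes_given_E_D[("yes", "healthy")]
num = P_BP_high_given_HD["yes"] * p_hd_yes_ed
den = num + P_BP_high_given_HD["no"] * (1 - p_hd_yes_ed)
print(num / den)  # ~0.5862
```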

3.4 Nearest Neighbor Classifier

Algorithm

  • Step 1: Find the nearest neighbor
  • Step 2: Use the “label” of this neighbor
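A minimal sketch of the two steps above, using Euclidean distance on illustrative data:

```python
import numpy as np

def nearest_neighbor_classify(x, examples, labels):
    """Step 1: find the nearest stored example; Step 2: return its label."""
    examples = np.asarray(examples, dtype=float)
    i = np.argmin(np.linalg.norm(examples - np.asarray(x, dtype=float), axis=1))
    return labels[i]

# Illustrative data only.
examples = [[0, 0], [0, 1], [5, 5], [6, 5]]
labels = ["blue", "blue", "red", "red"]
print(nearest_neighbor_classify([4.5, 5.2], examples, labels))  # "red"
```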

3.5 Support Vector Machine (SVM)

Advantages:

  • Can be visualized.
  • Accurate when data is well partitioned.

11.png

In this case, the margin equals:

$$Margin = \frac{|(b+1)-(b-1)|}{\sqrt{\omega_1^2+\omega_2^2}} = \frac{2}{\sqrt{\omega_1^2+\omega_2^2}}$$

Therefore, maximizing the margin is equivalent to minimizing $\omega_1^2+\omega_2^2$, subject to $y(\omega_1 x_1+\omega_2 x_2+b)\geq 1$, where $y$ is the label of the point ($+1$ or $-1$).

The case just described is a 2-dimensional space, which can be divided into two parts by a line.

For an n-dimensional space where n ≥ 2, we use a hyperplane to divide the space into two parts.

Non-Linear Support Vector Machine:

  • Step 1: Transform the data into a higher dimensional space using a “nonlinear” mapping.
  • Step 2: Use the Linear Support Vector Machine in this high-dimensional space.
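A minimal sketch (assuming scikit-learn is available) of the two steps above: the RBF kernel implicitly performs the nonlinear mapping into a higher-dimensional space, where a linear separator is then found. The data is illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative, non-linearly separable data: label depends on distance from the origin.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

# kernel="rbf" corresponds to Steps 1 and 2; kernel="linear" would be the linear SVM.
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([[0.1, 0.1], [2.0, 2.0]]))  # expected: [0 1]
```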

4 Neural Network

Neural Network is a computing system made up of simple and highly interconnected processing elements.

Information processing occurs at many identical and simple processing elements called neurons (or also called units, cells or nodes). Interneuron connection strengths known as synaptic weights are used to store the knowledge.

Advantages of Neural Networks

  • Parallel Processing.
  • Fault Tolerance.

4.1 Activation Function

Threshold function

$$y= \begin{cases} 1, & \text{if } x\geq 0\\ 0, & \text{if } x<0 \end{cases}$$

Linear function

$$y=x$$

Rectified Linear Unit (ReLU)

$$y= \begin{cases} x, & \text{if } x\geq 0\\ 0, & \text{if } x<0 \end{cases}$$

Sigmoid

$$y= \frac{1}{1+e^{-net}}$$

tanh function

$$y= \frac{e^{2x}-1}{e^{2x}+1}$$
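A minimal sketch of the activation functions above, applied element-wise with NumPy:

```python
import numpy as np

# Sketches of the activation functions above.
def threshold(x): return np.where(x >= 0, 1.0, 0.0)
def linear(x):    return x
def relu(x):      return np.maximum(x, 0.0)
def sigmoid(x):   return 1.0 / (1.0 + np.exp(-x))
def tanh(x):      return (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)  # same as np.tanh(x)

x = np.array([-2.0, 0.0, 2.0])
for f in (threshold, linear, relu, sigmoid, tanh):
    print(f.__name__, f(x))
```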

4.2 Learning

Let $\alpha$ be the learning rate.

Learning is done by:

  • $\omega_i = \omega_i + \alpha(d-y)x_i$
  • where $d$ is the desired output and $y$ is the output of the neural network
  • $b = b + \alpha(d-y)$

When we processed all data points in the whole dataset, we say that we processed the dataset in one epoch. Data can be processed for multiple epochs.

Stopping condition:

The stopping condition could be specified with the following:

  • The no. of epochs is equal to a specified number.
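A minimal sketch of the learning rule above for a single neuron with a threshold activation; the AND-gate data, the learning rate, and the fixed number of epochs are illustrative choices.

```python
import numpy as np

def train_perceptron(X, d, alpha=0.1, epochs=20):
    """w_i <- w_i + alpha*(d - y)*x_i and b <- b + alpha*(d - y),
    applied to every example once per epoch."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):                              # stopping condition: fixed no. of epochs
        for x, target in zip(X, d):
            y = 1.0 if np.dot(w, x) + b >= 0 else 0.0    # threshold activation
            w += alpha * (target - y) * x
            b += alpha * (target - y)
    return w, b

# Illustrative AND-gate data.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0, 0, 0, 1], dtype=float)
w, b = train_perceptron(X, d)
print([(1.0 if np.dot(w, x) + b >= 0 else 0.0) for x in X])  # [0.0, 0.0, 0.0, 1.0]
```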

4.3 Multi-Layer Perceptron (MLP)

12.png

MLP can solve linearly separable problems as well as non-linearly separable problems.

MLP has been proven to be a universal approximator: it can model all types of functions $y=f(x)$.

If the total number of layers is very large, the neural network is called a “deep network”. The process of training a deep network and performing prediction with it is called “deep learning”.

Forward Propagation:

Based on the input and the current values of the weight and bias variables, compute the output $y$.

Backward Propagation

Based on the output $y$ and the desired output $d$, compute the error and update the variables.
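A minimal sketch (layer sizes, data, and the squared-error loss are illustrative assumptions) of one forward-propagation and one backward-propagation step for an MLP with a single hidden layer of sigmoid units:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # input (2) -> hidden (3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # hidden (3) -> output (1)
x, d = np.array([0.5, -0.2]), np.array([1.0])   # one example and its desired output
alpha = 0.1

# Forward propagation: compute the output y from the current weights and biases.
h = sigmoid(W1 @ x + b1)
y = sigmoid(W2 @ h + b2)

# Backward propagation: compute the error and update the variables.
delta2 = (y - d) * y * (1 - y)                  # output-layer error term
delta1 = (W2.T @ delta2) * h * (1 - h)          # hidden-layer error term
W2 -= alpha * np.outer(delta2, h); b2 -= alpha * delta2
W1 -= alpha * np.outer(delta1, x); b1 -= alpha * delta1
```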


5 Recurrent Neutral Network (RNN)

In a neural network, we assume that records in the table are “independent”. In some cases, the current record is related to the previous records in the table. The records are then “dependent”, and we want to capture this “dependency” in the model.

13.png

Limitation of RNN

  • It may “memorize” a lot of past events.
  • Due to its complex structure, it is more time-consuming for training.

5.1 Basic RNN

14.png

The Activation function of basic RNN is usually “tanh” or “ReLU”.

Similar to the neural network, the basic RNN model has two steps:

  • Input Forward Propagation.
  • Error Backward Propagation.

In the following, we focus on “Input Forward Propagation”. For RNN, “Error Backward Propagation” could be solved by an existing optimization tool.
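A minimal sketch of input forward propagation for a basic RNN with a tanh activation; the weight shapes and the input sequence are illustrative. The hidden state $h_t$ depends on both the current input $x_t$ and the previous state $h_{t-1}$, which is how the “dependency” between records is captured.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W_x = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights
W_h = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden (recurrent) weights
b = np.zeros(hidden_size)

def rnn_forward(xs):
    """h_t = tanh(W_x x_t + W_h h_{t-1} + b), starting from h_0 = 0."""
    h = np.zeros(hidden_size)
    states = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h + b)
        states.append(h)
    return states

sequence = rng.normal(size=(5, input_size))   # 5 "dependent" records
print(rnn_forward(sequence)[-1])              # final hidden state
```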

5.2 LSTM (Long Short-Term Memory)

The basic RNN model is too “simple”. It is not easy for the basic RNN model to converge.

LSTM can simulate the brain process:

Forget Feature : It can forget a portion of the internal state variable.

Input Feature : It could decide to input a portion of the input variable for the model; It could decide the strength of the input for the model.

Output Feature : It could decide to output a portion of the output for the model; It could decide the strength of the output for the model.

15.png

Similar to the “neural network”, the LSTM model could also have multiple layers and have multiple memory units in each layer.

5.3 GRU (Gated Recurrent Unit)

GRU is a variation of the traditional LSTM model. Its structure is similar to that of the traditional LSTM model, but “simpler”.

Advantage of GRU compared to traditional LSTM:

  • The training time is shorter.
  • It requires fewer data points to capture the properties of the data.

GRU model does not have an internal state variable to store our memory. It regards the “predicted” target attribute value of the previous record as a reference to store our memory.

Reset Feature : It could regard the “predicted” target attribute value of the previous record as a reference to store the memory.

Input Feature : It could decide the strength of the input for the model.

Output Feature : It could combine a portion of the predicted target attribute value of the previous record and a portion of the “processed” input variable.

16.png


6 Data Warehousing

Data warehousing is also called Online Analytical Processing (OLAP).

Views structure:

17.png

Selective Materialization Decision Problem is NP-hard

Algorithm (Greedy):

  • S = {top view}
  • For i = 1 to k do:
    • Select the view v not in S such that the benefit of selecting v (given S) is maximized.
    • Add v to S
  • The resulting S is the greedy selection (a code sketch follows)
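A minimal sketch of the greedy loop above; it assumes a user-supplied benefit(v, S) function that returns the benefit of selecting view v given the views already in S, and the view names and benefit values are hypothetical.

```python
def greedy_view_selection(views, k, benefit):
    """Pick k views, each time taking the one whose benefit given the
    current selection S is maximized."""
    S = {"top"}                      # S = {top view}
    for _ in range(k):
        candidates = [v for v in views if v not in S]
        if not candidates:
            break
        best = max(candidates, key=lambda v: benefit(v, S))
        S.add(best)                  # add v to S
    return S

# Hypothetical example: benefit measured as rows saved per view (ignoring S).
row_savings = {"a": 50, "b": 30, "c": 75, "d": 20}
print(greedy_view_selection(row_savings, k=2,
                            benefit=lambda v, S: row_savings[v]))  # {'top', 'c', 'a'}
```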

7 Web Databases

7.1 HITS Algorithm

HITS is a ranking algorithm which ranks “hubs” and “authorities”.

18.png

$$a(v) = \sum_{u\to v} h(u), \qquad h(v) = \sum_{v\to u} a(u)$$

HITS involves two major steps:

  • Step1: Sampling Step
  • Step2: Iteration Step

Sampling Step:

Collect a set of pages that are very relevant – called the base set.

Iteration Step

19.png

$$h = Ma, \quad a = M^{T}h. \quad \text{Therefore, } h = MM^{T}h \text{ and } a = M^{T}Ma$$
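A minimal sketch of the iteration step, with $M_{ij} = 1$ when page i links to page j; the small link graph and the normalization at each step are illustrative choices.

```python
import numpy as np

# M[i, j] = 1 if page i links to page j (illustrative 3-page graph).
M = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

h = np.ones(M.shape[0])
a = np.ones(M.shape[0])
for _ in range(50):                 # iterate a = M^T h, h = M a until stable
    a = M.T @ h
    h = M @ a
    a /= np.linalg.norm(a)          # normalize so the values do not blow up
    h /= np.linalg.norm(h)

print("authorities:", a, "hubs:", h)
```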

Rank Method:

  • Rank in descending order of hub only
  • Rank in descending order of authority only
  • Rank in descending order of the value computed from both hub and authority

7.2 PageRank Algorithm (Google)

The PageRank algorithm makes use of a stochastic (random-surfer) approach to rank the pages.
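The notes stop here; purely as an illustration, the following is a sketch of the standard random-surfer (power-iteration) formulation commonly associated with PageRank. The damping factor of 0.85 and the link graph are assumptions, not taken from the notes.

```python
import numpy as np

# links[i] = pages that page i links to (illustrative 3-page graph, no dangling pages).
links = {0: [1, 2], 1: [2], 2: [0]}
n, d = 3, 0.85                      # d is the damping factor (a common choice)

# Column-stochastic transition matrix: column i spreads page i's rank evenly.
P = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        P[j, i] = 1.0 / len(outs)

r = np.full(n, 1.0 / n)
for _ in range(100):                # power iteration on the random-surfer chain
    r = (1 - d) / n + d * (P @ r)

print(r)                            # ranks sum to (approximately) 1
```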