CSE5334 Data Mining

CSE4334/5334 Data Mining, Fall 2014
Lecture 15: Association Rule Mining (2)
Department of Computer Science and Engineering, University of Texas at Arlington
Chengkai Li
(Slides courtesy of Vipin Kumar and Jiawei Han)

Pattern Evaluation

Association rule algorithms tend to produce too many rules, many of which are uninteresting or redundant.
A rule is redundant if, for example, {A,B,C} → {D} and {A,B} → {D} have the same support and confidence.
Interestingness measures can be used to prune or rank the derived patterns.
In the original formulation of association rules, support and confidence are the only measures used.

Many measures have been proposed in the literature. Some are good for certain applications, but not for others.
What criteria should we use to determine whether a measure is good or bad?
What about Apriori-style support-based pruning? How does it affect these measures?

Computing Interestingness Measure

Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table.

Contingency table for X → Y:

          Y      ¬Y
  X      f11    f10    f1+
  ¬X     f01    f00    f0+
         f+1    f+0    |T|

f11: count of X and Y
f10: count of X and ¬Y
f01: count of ¬X and Y
f00: count of ¬X and ¬Y

These counts are used to define various measures: support, confidence, lift, Gini, J-measure, etc.
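As a minimal sketch of how the contingency-table counts feed the basic measures, the snippet below derives the marginals, support, and confidence from f11, f10, f01, and f00. The class name, method names, and the counts in the usage lines are assumptions made for illustration; they are not part of the original slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContingencyTable:
    """Counts for a rule X -> Y (hypothetical helper, not from the slides)."""
    f11: int  # transactions containing X and Y
    f10: int  # transactions containing X but not Y
    f01: int  # transactions containing Y but not X
    f00: int  # transactions containing neither X nor Y

    @property
    def total(self) -> int:
        """|T|: total number of transactions."""
        return self.f11 + self.f10 + self.f01 + self.f00

    @property
    def f1_plus(self) -> int:
        """Row marginal f1+: transactions containing X."""
        return self.f11 + self.f10

    @property
    def f_plus1(self) -> int:
        """Column marginal f+1: transactions containing Y."""
        return self.f11 + self.f01

    def support(self) -> float:
        """Support of X -> Y: fraction of transactions containing X and Y."""
        return self.f11 / self.total

    def confidence(self) -> float:
        """Confidence of X -> Y: estimate of P(Y | X)."""
        return self.f11 / self.f1_plus


# Illustrative counts only.
table = ContingencyTable(f11=15, f10=5, f01=75, f00=5)
print(table.total, table.support(), table.confidence())
```

Keeping the four cell counts as the only stored state means every other quantity (marginals, probabilities, and any further measure) is derived from them, which mirrors how the table is used on the slide.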

Drawback of Confidence

          Coffee   ¬Coffee
  Tea       15        5       20
  ¬Tea      75        5       80
            90       10      100

Association rule: Tea → Coffee
Confidence = P(Coffee|Tea) = 15/20 = 0.75, but P(Coffee) = 0.9.
Although confidence is high, the rule is misleading: P(Coffee|¬Tea) = 75/80 = 0.9375.

Statistical Independence

Population of 1000 students:
600 students know how to swim (S)
700 students know how to bike (B)
420 students know how to swim and bike (S,B)

P(S∧B) = 420/1000 = 0.42
P(S) × P(B) = 0.6 × 0.7 = 0.42

P(S∧B) = P(S) × P(B) => statistical independence
P(S∧B) > P(S) × P(B) => positively correlated
P(S∧B) < P(S) × P(B) => negatively correlated

Statistical-based Measures

Measures that take into account statistical dependence:

\[ \text{Lift} = \frac{\text{conf}(X \rightarrow Y)}{\text{sup}(Y)} = \frac{P(Y \mid X)}{P(Y)} \]

\[ \text{Interest factor} = \frac{P(X,Y)}{P(X)\,P(Y)} \]

  = 1: independent
  > 1: positively correlated
  < 1: negatively correlated

\[ \phi\text{-coefficient} = \frac{P(X,Y) - P(X)\,P(Y)}{\sqrt{P(X)[1 - P(X)]\,P(Y)[1 - P(Y)]}} \]

  = 0: independent
  > 0: positively correlated
  < 0: negatively correlated

Example: Lift/Interest

          Coffee   ¬Coffee
  Tea       15        5       20
  ¬Tea      75        5       80
            90       10      100

Association rule: Tea → Coffee
Confidence = P(Coffee|Tea) = 0.75, but P(Coffee) = 0.9.
Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)

Drawback of Lift & Interest

          Y      ¬Y
  X       10      0     10
  ¬X       0     90     90
          10     90    100

Lift = 0.1 / ((0.1)(0.1)) = 10

          Y      ¬Y
  X       90      0     90
  ¬X       0     10     10
          90     10    100

Lift = 0.9 / ((0.9)(0.9)) = 1.11

Statistical independence: if P(X,Y) = P(X)P(Y), then Lift = 1.

Example: φ-Coefficient

          Coffee   ¬Coffee
  Tea       15        5       20
  ¬Tea      75        5       80
            90       10      100

Association rule: Tea → Coffee

\[ \phi = \frac{0.15 - 0.9 \times 0.2}{\sqrt{0.9 \times 0.1 \times 0.2 \times 0.8}} = -0.25 \]

(< 0, therefore Tea and Coffee are negatively correlated)

Drawback of φ-Coefficient

          Y      ¬Y
  X       60     10     70
  ¬X      10     20     30
          70     30    100

\[ \phi = \frac{0.6 - 0.7 \times 0.7}{\sqrt{0.7 \times 0.3 \times 0.7 \times 0.3}} = 0.5238 \]

          Y      ¬Y
  X       20     10     30
  ¬X      10     60     70
          30     70    100

\[ \phi = \frac{0.2 - 0.3 \times 0.3}{\sqrt{0.7 \times 0.3 \times 0.7 \times 0.3}} = 0.5238 \]

The φ-coefficient is the same for both tables.
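The lift and φ-coefficient values in the examples above can be recomputed directly from the four cell counts of each contingency table. The following is a small sketch under that assumption; the function names `lift` and `phi_coefficient` and the dictionary of tables are hypothetical helpers introduced here, and the swim/bike counts are derived from the population figures given earlier (420 both, 180 swim only, 280 bike only, 120 neither).

```python
import math


def _probabilities(f11, f10, f01, f00):
    """Return (P(X,Y), P(X), P(Y)) estimated from contingency counts."""
    n = f11 + f10 + f01 + f00
    return f11 / n, (f11 + f10) / n, (f11 + f01) / n


def lift(f11, f10, f01, f00):
    """Lift (interest factor): P(X,Y) / (P(X) P(Y))."""
    p_xy, p_x, p_y = _probabilities(f11, f10, f01, f00)
    return p_xy / (p_x * p_y)


def phi_coefficient(f11, f10, f01, f00):
    """phi = (P(X,Y) - P(X)P(Y)) / sqrt(P(X)[1-P(X)] P(Y)[1-P(Y)])."""
    p_xy, p_x, p_y = _probabilities(f11, f10, f01, f00)
    denominator = math.sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))
    return (p_xy - p_x * p_y) / denominator


# Cell counts (f11, f10, f01, f00) taken from the tables in the slides.
tables = {
    "tea -> coffee": (15, 5, 75, 5),
    "lift drawback, table 1": (10, 0, 0, 90),
    "lift drawback, table 2": (90, 0, 0, 10),
    "phi drawback, table 1": (60, 10, 10, 20),
    "phi drawback, table 2": (20, 10, 10, 60),
    "swim / bike": (420, 180, 280, 120),
}

for name, counts in tables.items():
    print(f"{name}: lift={lift(*counts):.4f}, phi={phi_coefficient(*counts):.4f}")
```

Run on these counts, the sketch should reproduce the slide values up to rounding: a lift below 1 and a negative φ for Tea → Coffee, very different lifts for the two perfectly associated tables, identical φ for the two φ-drawback tables, and a lift of about 1 with φ of about 0 for the independent swim/bike population.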
