Data Mining: Concepts and Techniques (3rd ed.), Chapter 7: Advanced Frequent Pattern Mining
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
(c) 2011 Han, Kamber & Pei. All rights reserved.

Chapter 7 Outline
- Pattern Mining: A Road Map
- Pattern Mining in Multi-Level, Multi-Dimensional Space
  - Mining Multi-Level Association
  - Mining Multi-Dimensional Association
  - Mining Quantitative Association Rules
  - Mining Rare Patterns and Negative Patterns
- Constraint-Based Frequent Pattern Mining
- Mining High-Dimensional Data and Colossal Patterns
- Mining Compressed or Approximate Patterns
- Pattern Exploration and Application
- Summary

Research on Pattern Mining: A Road Map
(road-map figure)

Mining Multiple-Level Association Rules
- Items often form hierarchies
- Flexible support settings: items at the lower levels are expected to have lower support
- Exploration of shared multi-level mining (Agrawal & Srikant, VLDB'95; Han & Fu, VLDB'95)

Multi-Level Association: Flexible Support and Redundancy Filtering
- Flexible min-support thresholds: some items are more valuable but less frequent
  - Use non-uniform, group-based min-support
  - E.g., {diamond, watch, camera}: 0.05%; {bread, milk}: 5%
- Redundancy filtering: some rules may be redundant due to "ancestor" relationships between items
  - milk => wheat bread [support = 8%, confidence = 70%]
  - 2% milk => wheat bread [support = 2%, confidence = 72%]
  - The first rule is an ancestor of the second rule
  - A rule is redundant if its support is close to the "expected" value derived from its ancestor (see the sketch below)
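A minimal sketch of the ancestor-based redundancy check in Python. The 25% share of "2% milk" within "milk" and the tolerance are illustrative assumptions, not values from the slides.

```python
def expected_support(ancestor_support, item_share):
    """Expected support of a descendant rule: the ancestor rule's support
    scaled by the share of the specialized item (e.g., "2% milk") within
    its ancestor item (e.g., "milk")."""
    return ancestor_support * item_share

def is_redundant(rule_support, ancestor_support, item_share, tolerance=0.2):
    """A descendant rule is redundant if its observed support is close
    (within a relative tolerance) to the support expected from its ancestor."""
    expected = expected_support(ancestor_support, item_share)
    return abs(rule_support - expected) <= tolerance * expected

# milk => wheat bread has support 8%; 2% milk => wheat bread has support 2%.
# Assuming roughly a quarter of milk purchases are 2% milk, the second rule
# carries no extra information and is filtered out.
print(is_redundant(rule_support=0.02, ancestor_support=0.08, item_share=0.25))  # True
```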

Mining Multi-Dimensional Association
- Single-dimensional rules: buys(X, "milk") => buys(X, "bread")
- Multi-dimensional rules: 2 or more dimensions or predicates (see the encoding sketch below)
  - Inter-dimension association rules (no repeated predicates): age(X, "19-25") AND occupation(X, "student") => buys(X, "coke")
  - Hybrid-dimension association rules (repeated predicates): age(X, "19-25") AND buys(X, "popcorn") => buys(X, "coke")
- Categorical attributes: finite number of possible values, no ordering among values; handled with a data cube approach
- Quantitative attributes: numeric, with an implicit ordering among values; handled with discretization, clustering, and gradient approaches
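Inter-dimension rules can be mined by treating each (attribute, value) predicate as an item and counting predicate sets over the encoded transactions. A minimal sketch under that view; the attribute names and records are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

# Each customer record becomes a transaction of (attribute, value) "items",
# so inter-dimension predicate sets can be counted like ordinary itemsets.
records = [
    {"age": "19-25", "occupation": "student", "buys": "coke"},
    {"age": "19-25", "occupation": "student", "buys": "coke"},
    {"age": "26-35", "occupation": "engineer", "buys": "milk"},
]
transactions = [frozenset(r.items()) for r in records]

# Count all 2-predicate sets in one pass (a full miner would iterate by size).
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

min_sup = 2
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_sup}
print(frequent_pairs)   # e.g., (('age', '19-25'), ('buys', 'coke')) with count 2
```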

Mining Quantitative Associations
Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
- Static discretization based on predefined concept hierarchies (data cube methods)
- Dynamic discretization based on the data distribution (quantitative rules, e.g., Agrawal & Srikant, SIGMOD'96)
- Clustering: distance-based association (e.g., Yang & Miller, SIGMOD'97); one-dimensional clustering, then association
- Deviation analysis (such as Aumann & Lindell, KDD'99): Sex = female => Wage: mean = $7/hr (overall mean = $9)

Static Discretization of Quantitative Attributes
- Attributes are discretized prior to mining using a concept hierarchy: numeric values are replaced by ranges
- In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans
- A data cube is well suited for mining
  - The cells of an n-dimensional cuboid correspond to the predicate sets
  - Mining from data cubes can be much faster (see the counting sketch below)
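A minimal sketch of the cuboid view: after static discretization, each cell of a group-by over the chosen dimensions holds the support count of one predicate set, and lower cuboids are obtained by aggregation rather than extra table scans. The dimensions and rows are illustrative assumptions.

```python
from collections import Counter

# Discretized records: the numeric attribute (age) is already replaced by a range label.
rows = [
    ("19-25", "student", "coke"),
    ("19-25", "student", "coke"),
    ("26-35", "engineer", "milk"),
]

# Base cuboid over (age, occupation, buys): one cell per 3-predicate set.
base_cuboid = Counter(rows)

# A lower cuboid, e.g. (age, buys), is a roll-up of the base cuboid,
# so k-predicate supports come from cube cells instead of new scans.
age_buys_cuboid = Counter()
for (age, _occupation, buys), count in base_cuboid.items():
    age_buys_cuboid[(age, buys)] += count

print(base_cuboid[("19-25", "student", "coke")])  # 2
print(age_buys_cuboid[("19-25", "coke")])         # 2
```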

Quantitative Association Rules Based on Statistical Inference Theory [Aumann & Lindell, DMKD'03]
- Goal: finding extraordinary, and therefore interesting, phenomena, e.g., (Sex = female) => Wage: mean = $7/hr (overall mean = $9)
  - LHS: a subset of the population
  - RHS: an extraordinary behavior of this subset
- The rule is accepted only if a statistical test (e.g., a Z-test) confirms the inference with high confidence (see the sketch below)
- Subrule: highlights the extraordinary behavior of a subset of the population covered by the super rule
  - E.g., (Sex = female) AND (South = yes) => mean wage = $6.3/hr
- Two forms of rules
  - Categorical => quantitative rules, or quantitative => quantitative rules
  - E.g., Education in 14-18 (yrs) => mean wage = $11.64/hr
- Open problem: efficient methods for an LHS containing two or more quantitative attributes
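A minimal sketch of the acceptance test for such a rule: compare the subset mean against the overall mean with a one-sample Z-test. The subset size, wage standard deviation, and critical value are illustrative assumptions.

```python
import math

def z_test_mean(subset_mean, subset_n, population_mean, population_std, critical_z=1.96):
    """Return (accepted, z): the rule is accepted if the subset mean differs
    from the population mean beyond the two-sided critical Z value."""
    standard_error = population_std / math.sqrt(subset_n)
    z = (subset_mean - population_mean) / standard_error
    return abs(z) >= critical_z, z

# (Sex = female) => mean wage = $7/hr vs. overall mean = $9/hr.
# The subset size (400) and wage standard deviation ($5/hr) are made up for the sketch.
accepted, z = z_test_mean(subset_mean=7.0, subset_n=400,
                          population_mean=9.0, population_std=5.0)
print(accepted, round(z, 2))   # True -8.0
```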

Negative and Rare Patterns
- Rare patterns: very low support but interesting
  - E.g., buying Rolex watches
  - Mining approach: set individual-based or special group-based support thresholds for valuable items
- Negative patterns
  - Since it is unlikely that one buys a Ford Expedition (an SUV) and a Toyota Prius (a hybrid car) together, Ford Expedition and Toyota Prius are likely negatively correlated patterns
  - Negatively correlated patterns that are infrequent tend to be more interesting than those that are frequent

Defining Negatively Correlated Patterns (I)
- Definition 1 (support-based): if itemsets X and Y are both frequent but rarely occur together, i.e., sup(X ∪ Y) << s(X) * s(Y), then X and Y are negatively correlated
- Needle-package example: packages A and B each sell often but almost never together; whether s(A ∪ B) << s(A) * s(B) holds changes with the total number of transactions (e.g., 200 vs. 10^5)
- Where is the problem? Null transactions, i.e., the support-based definition is not null-invariant!

Defining Negatively Correlated Patterns (II)
- Definition 2 (negative itemset-based)
  - X is a negative itemset if (1) X = Ā ∪ B, where B is a set of positive items and Ā is a set of negative items, |Ā| >= 1, and (2) s(X) is at least a minimum threshold
  - An itemset X is negatively correlated if ...
  - This definition suffers from a similar null-invariance problem
- Definition 3 (Kulczynski measure-based): if itemsets X and Y are frequent but (P(X|Y) + P(Y|X))/2 <= ε, where ε is a negative-pattern threshold, then X and Y are negatively correlated
  - Ex.: for the same needle-package problem, no matter whether there are 200 or 10^5 transactions, with ε = 0.01 we have (P(A|B) + P(B|A))/2 = (0.01 + 0.01)/2 (see the sketch below)
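A minimal sketch of the Kulczynski-based check, which is null-invariant: adding null transactions changes the database size but not the two conditional probabilities. The counts echo the needle-package example; the helper functions are illustrative.

```python
def kulczynski(sup_xy, sup_x, sup_y):
    """Kulczynski measure: average of P(X|Y) and P(Y|X), computed from
    absolute support counts, so null transactions cannot affect it."""
    return 0.5 * (sup_xy / sup_x + sup_xy / sup_y)

def negatively_correlated(sup_xy, sup_x, sup_y, epsilon=0.01):
    return kulczynski(sup_xy, sup_x, sup_y) <= epsilon

# Needle packages A and B: each sold 100 times, together only once.
# The value is 0.01 whether the database holds 200 or 10**5 transactions.
print(kulczynski(sup_xy=1, sup_x=100, sup_y=100))             # 0.01
print(negatively_correlated(sup_xy=1, sup_x=100, sup_y=100))  # True
```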

Constraint-Based (Query-Directed) Mining
- Finding all the patterns in a database autonomously? Unrealistic! The patterns could be too many but not focused
- Data mining should be an interactive process: the user directs what is to be mined using a data mining query language (or a graphical user interface)
- Constraint-based mining
  - User flexibility: the user provides constraints on what is to be mined
  - Optimization: the system explores such constraints for efficient mining; constraint pushing is similar to pushing selections first in DB query processing
  - Note: we still find all the answers satisfying the constraints, not just some answers as in "heuristic search"

Constraints in Data Mining
- Knowledge type constraint: classification, association, etc.
- Data constraint, using SQL-like queries: find product pairs sold together in stores in Chicago this year
- Dimension/level constraint: in relevance to region, price, brand, customer category
- Rule (or pattern) constraint: small sales (price < $10) triggers big sales (sum > $200)
- Interestingness constraint: strong rules, with min_support >= 3% and min_confidence >= 60%

Meta-Rule Guided Mining
- A meta-rule can be given in rule form with partially instantiated predicates and constants: P1(X, Y) AND P2(X, W) => buys(X, "iPad")
- A resulting rule derived from it can be: age(X, "15-25") AND profession(X, "student") => buys(X, "iPad")
- In general, a meta-rule has the form P1 AND P2 AND ... AND Pl => Q1 AND Q2 AND ... AND Qr
- Method to find rules matching a meta-rule (a template-matching sketch follows)
  - Find frequent (l+r)-predicate sets (based on a min-support threshold)
  - Push constants deeply into the mining process when possible (see the remaining discussion on constraint-pushing techniques)
  - Use confidence, correlation, and other measures when possible
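A minimal sketch of checking whether a mined rule instantiates the meta-rule P1(X, Y) AND P2(X, W) => buys(X, "iPad"): the consequent and its constant are fixed, while P1 and P2 remain free predicate variables. The rule representation is an illustrative assumption.

```python
def matches_meta_rule(antecedent, consequent,
                      required_consequent=("buys", "X", "iPad"),
                      antecedent_len=2):
    """True if the rule has the fixed consequent and exactly `antecedent_len`
    antecedent predicates over the same variable X (predicate names are free)."""
    return (
        consequent == required_consequent
        and len(antecedent) == antecedent_len
        and all(var == "X" for _name, var, _value in antecedent)
    )

# age(X, "15-25") AND profession(X, "student") => buys(X, "iPad")
antecedent = [("age", "X", "15-25"), ("profession", "X", "student")]
consequent = ("buys", "X", "iPad")
print(matches_meta_rule(antecedent, consequent))   # True
```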

Constraint-Based Frequent Pattern Mining
- Pattern space pruning constraints
  - Anti-monotonic: if constraint c is violated, further mining of that pattern can be terminated
  - Monotonic: if c is satisfied, there is no need to check c again
  - Succinct: c must be satisfied, so one can start with the data sets satisfying c
  - Convertible: c is neither monotonic nor anti-monotonic, but it can be converted into one of them if the items in each transaction are properly ordered
- Data space pruning constraints
  - Data succinct: the data space can be pruned at the initial pattern mining step
  - Data anti-monotonic: if a transaction t does not satisfy c, t can be pruned from further mining

Pattern Space Pruning with Anti-Monotonicity Constraints
- A constraint C is anti-monotone if, whenever a super-pattern satisfies C, all of its sub-patterns do so too
- In other words, anti-monotonicity: if an itemset S violates the constraint, so does any of its supersets (see the pruning sketch below)
- Ex. 1: sum(S.price) <= v is anti-monotone
- Ex. 2: range(S.profit) <= 15 is anti-monotone
  - Itemset ab violates C, and so does every superset of ab
- Ex. 3: sum(S.price) >= v is not anti-monotone
- Ex. 4: the support count is anti-monotone: the core property used in Apriori
(Example TDB, min_sup = 2)
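A minimal sketch of pushing the anti-monotone constraint sum(S.price) <= v into level-wise candidate generation: once a candidate violates the constraint, none of its supersets is generated. Support counting is left out for brevity, and the prices and budget v are illustrative assumptions.

```python
from itertools import combinations

# Illustrative item prices and an anti-monotone constraint sum(S.price) <= v.
price = {"a": 10, "b": 20, "c": 30, "d": 40}
v = 60

def satisfies(itemset):
    return sum(price[i] for i in itemset) <= v

# Level-wise generation (simplified join): keep a candidate only if it
# satisfies the constraint; violators are pruned with all their supersets.
level = [frozenset([i]) for i in price if satisfies([i])]
satisfying = list(level)
while level:
    next_level = set()
    for x, y in combinations(level, 2):
        candidate = x | y
        if len(candidate) == len(x) + 1 and satisfies(candidate):
            next_level.add(candidate)
    level = list(next_level)
    satisfying.extend(level)

print(sorted(tuple(sorted(s)) for s in satisfying))
# [('a',), ('a', 'b'), ('a', 'b', 'c'), ('a', 'c'), ('a', 'd'), ('b',), ('b', 'c'), ('b', 'd'), ('c',), ('d',)]
```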

Pattern Space Pruning with Monotonicity Constraints
- A constraint C is monotone if, once a pattern satisfies C, we do not need to check C in subsequent mining
- Alternatively, monotonicity: if an itemset S satisfies the constraint, so does any of its supersets
- Ex. 1: sum(S.price) >= v is monotone
- Ex. 2: min(S.price) <= v is monotone
- Ex. 3: C: range(S.profit) >= 15
  - Itemset ab satisfies C, and so does every superset of ab
(Example TDB, min_sup = 2)

Data Space Pruning with Data Anti-monotonicity
- A constraint c is data anti-monotone if, when a pattern p cannot satisfy a transaction t under c, no superset of p can satisfy t under c either
- The key to data anti-monotonicity is recursive data reduction (see the sketch below)
- Ex. 1: sum(S.price) >= v is data anti-monotone
- Ex. 2: min(S.price) <= v is data anti-monotone
- Ex. 3: C: range(S.profit) >= 25 is data anti-monotone
  - Itemset {b, c}'s projected DB: T10': {d, f, h}, T20': {d, f, g, h}, T30': {d, f, g}
  - Since C cannot be satisfied in T10', T10' can be pruned
(Example TDB, min_sup = 2)
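A minimal sketch of recursive data reduction under the data anti-monotone constraint range(S.profit) >= 25: within the projected database of the current pattern, drop any transaction whose remaining items can no longer push the range up to 25. The per-item profits are illustrative assumptions chosen so that T10' gets pruned, not the profit table from the slides.

```python
# Illustrative per-item profits; constraint: range(S.profit) >= 25.
profit = {"a": 40, "b": 0, "c": -20, "d": -15, "f": -5, "g": 20, "h": -10}

def best_range(pattern, transaction_items):
    """Largest profit range reachable by extending `pattern` with items of this transaction."""
    values = [profit[i] for i in set(pattern) | set(transaction_items)]
    return max(values) - min(values)

def reduce_projected_db(pattern, projected_db, threshold=25):
    """Keep only transactions that could still satisfy range(S.profit) >= threshold."""
    return {tid: items for tid, items in projected_db.items()
            if best_range(pattern, items) >= threshold}

# Projected DB of the pattern {b, c}; T10' cannot reach a range of 25 and is dropped.
projected = {"T10'": {"d", "f", "h"}, "T20'": {"d", "f", "g", "h"}, "T30'": {"d", "f", "g"}}
print(reduce_projected_db({"b", "c"}, projected))   # T20' and T30' remain
```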

Pattern Space Pruning with Succinctness
- Succinctness: given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
- Idea: whether an itemset S satisfies constraint C can be determined from the selection of items alone, without looking at the transaction database (see the sketch after the figure summaries below)
- min(S.price) <= v is succinct
- sum(S.price) >= v is not succinct
- Optimization: if C is succinct, C is pre-counting pushable

Apriori + Constraint
(figure: a level-wise Apriori run over database D, with scans producing C1, L1, C2, L2, C3, L3, and the constraint Sum{S.price} < 5 checked on candidates)

Constrained Apriori: Push a Succinct Constraint Deep
(figure: the same Apriori run with the succinct constraint min{S.price} <= 1, "not immediately to be used")

Constrained FP-Growth: Push a Succinct Constraint Deep
(figure: FP-tree construction under min{S.price} <= 1; infrequent length-1 items are removed, only the 1-projected DB is built, and there is no need to project on items 2, 3, or 5)

Constrained FP-Growth: Push a Data Anti-monotonic Constraint Deep
(figure: FP-tree under min{S.price} <= 1; items that cannot contribute are removed from the data, leaving a single branch, so mining is done)

Constrained FP-Growth: Push a Data Anti-monotonic Constraint Deep
(figure: constraint range{S.price} > 25 with min_sup = 2; the B-projected DB and B's FP-tree are reduced by recursive data pruning, ending with the single branch bcdfg: 2)
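A minimal sketch of exploiting the succinct constraint min(S.price) <= v: the satisfying itemsets can be generated directly from A1, the items priced at most v, without consulting the transaction database first. The price table is an illustrative assumption.

```python
# Illustrative prices; succinct constraint: min(S.price) <= v.
price = {1: 10, 2: 30, 3: 50, 4: 5, 5: 80}
v = 10

# A1: the items that satisfy the constraint on their own.
A1 = {i for i, p in price.items() if p <= v}        # {1, 4}
others = set(price) - A1

def satisfies(itemset):
    """Succinctness: the test needs only the item prices, never the DB."""
    return bool(set(itemset) & A1)

# Every satisfying itemset is a non-empty subset of A1 unioned with any subset
# of the remaining items, so candidates can be generated instead of filtered.
print(sorted(A1), satisfies({1, 3}), satisfies({2, 3, 5}))   # [1, 4] True False
```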

Convertible Constraints: Ordering Data in Transactions
- Convert tough constraints into anti-monotone or monotone constraints by properly ordering the items
- Examine C: avg(S.profit) >= 25
  - Order the items in value-descending order
  - If an itemset afb violates C, so do afbh, afb*: the constraint becomes anti-monotone!
(Example TDB, min_sup = 2)

Strongly Convertible Constraints
- avg(X) >= 25 is convertible anti-monotone w.r.t. the item-value-descending order R: if an itemset af violates constraint C, so does every itemset with af as a prefix, such as afd
- avg(X) >= 25 is convertible monotone w.r.t. the item-value-ascending order R^-1: if an itemset d satisfies constraint C, so do the itemsets df and dfa, which have d as a prefix
- Thus, avg(X) >= 25 is strongly convertible (see the ordering sketch below)
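A minimal sketch of why the value-descending order R makes avg(S.profit) >= 25 anti-monotone: each appended item is no larger than the current minimum, so the prefix average can only fall, and the first violating prefix prunes all of its extensions. The profit values are illustrative assumptions.

```python
# Illustrative profits; convertible constraint: avg(S.profit) >= 25.
profit = {"a": 40, "f": 30, "g": 20, "d": 10, "b": 0, "h": -10, "c": -20}
threshold = 25

# Value-descending order R turns the constraint into an anti-monotone one over prefixes.
order = sorted(profit, key=profit.get, reverse=True)   # ['a', 'f', 'g', 'd', 'b', 'h', 'c']

prefix, kept = [], []
for item in order:
    candidate = prefix + [item]
    avg = sum(profit[i] for i in candidate) / len(candidate)
    if avg >= threshold:
        prefix = candidate
        kept.append((tuple(candidate), avg))
    else:
        # The average only decreases for longer prefixes in this order: prune and stop.
        break

print(kept)   # prefixes up to ('a', 'f', 'g', 'd') satisfy C; all longer ones are pruned
```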

Can Apriori Handle Convertible Constraints?
- A convertible constraint that is neither monotone, anti-monotone, nor succinct cannot be pushed deep into an Apriori mining algorithm
- Within the level-wise framework, no direct pruning based on the constraint can be made
  - Itemset df violates constraint C: avg(X) >= 25
  - Since adf satisfies C, Apriori needs df to assemble adf, so df cannot be pruned
- But the constraint can be pushed into the frequent-pattern growth framework!

Pattern Space Pruning with Convertible Constraints
- C: avg(X) >= 25, min_sup = 2
- List the items in every transaction in value-descending order R: C is convertible anti-monotone w.r.t. R
- Scan the TDB once and remove infrequent items
  - Item h is dropped; itemsets a and f are good
- Projection-based mining: impose an appropriate order on item projection
- Many tough constraints can be converted into (anti-)monotone constraints
(Example TDB, min_sup = 2)

Handling Multiple Constraints
- Different constraints may require different or even conflicting item orderings
- If there exists an order R such that both C1 and C2 are convertible w.r.t. R, there is no conflict between the two convertible constraints
- If there is a conflict in the item order
  - Try to satisfy one constraint first
  - Then use the order for the other constraint to mine frequent itemsets in the corresponding projected database

Constraint-Based Mining: A General Picture
(summary table)

What Constraints Are Convertible?
(summary table)

Mining Colossal Frequent Patterns
- F. Zhu, X. Yan, J. Han, P. S. Yu, and H. Cheng, "Mining Colossal Frequent Patterns by Core Pattern Fusion", ICDE'07
- We have many algorithms, but can we mine large (i.e., colossal) patterns, say of size around 50 to 100? Unfortunately, not!
- Why not? The curse of the "downward closure" property of frequent patterns
  - Downward closure: any sub-pattern of a frequent pattern is frequent
  - Example: if (a1, a2, ..., a100) is frequent, then a1, a2, ..., a100, (a1, a2), (a1, a3), ..., (a1, a100), (a1, a2, a3), ... are all frequent: about 2^100 frequent itemsets (see the count check after this list)
  - Whether we use breadth-first search (e.g., Apriori) or depth-first search (FP-growth), we have to examine that many patterns
  - Thus the downward closure property leads to explosion
- Closed/maximal patterns may partially alleviate the problem but do not really solve it: we often need to mine scattered large patterns!
- Let the minimum support threshold σ = 20
- There are frequent patterns of size 20; each is ...
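A quick arithmetic check of the explosion claim, counting the non-empty sub-patterns of a single size-100 frequent pattern.

```python
from math import comb

n = 100
# Every non-empty subset of a frequent 100-itemset is itself frequent.
subpatterns = sum(comb(n, k) for k in range(1, n + 1))   # = 2**100 - 1
print(subpatterns == 2**n - 1)   # True
print(f"{subpatterns:.3e}")      # about 1.268e+30
```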
