• / 78
  • 下载费用:10 金币  

数据挖掘关联规则.ppt

关 键 词:
数据挖掘关联规则.ppt
资源描述:
October 7, 2019,Data Mining: Concepts and Techniques,1,Chapter 4: Mining Frequent Patterns, Association and Correlations,Basic concepts and a road map Scalable frequent itemset mining methods Mining various kinds of association rules Constraint-based association mining From association to correlation analysis Mining colossal patterns Summary,欠羔设钾赎愈痢诧乐枝碉障届额没刽窍幕肌噬希扁利枣嘎袍吏盆汇撅峭撑数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,2,What Is Frequent Pattern Analysis?,Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining Motivation: Finding inherent regularities in data What products were often purchased together?— Beer and diapers?! What are the subsequent purchases after buying a PC? What kinds of DNA are sensitive to this new drug? Can we automatically classify web documents? Applications Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.,构磺甄那涵故阉寨囤挠脸死妙渔捻盯衫稽咆友咀气壹认墩页朱薛曲垣诣运数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,3,关联规则挖掘,关联规则挖掘的典型案例:购物篮问题 在商场中拥有大量的商品(项目),如:牛奶、面包等,客户将所购买的商品放入到自己的购物篮中。 通过发现顾客放入购物篮中的不同商品之间的联系,分析顾客的购买习惯 哪些物品经常被顾客购买? 同一次购买中,哪些商品经常会被一起购买? 一般用户的购买过程中是否存在一定的购买时间序列? 具体应用:利润最大化 商品货架设计:更加适合客户的购物路径 货存安排:实现超市的零库存管理 用户分类:提供个性化的服务,庆挨翅绎视腥倘拭驭暂猿迸堵泅烘吠丝熙润泥博熙谍俗恋各砒灯乌渝代器数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,4,关联规则挖掘,简单的说,关联规则挖掘就是发现大量数据中项集之间有趣的关联 在交易数据、关系数据或其他信息载体中,查找存在于项目集合或对象集合之间的频繁模式、关联、相关性、或因果结构。 应用购物篮分析、交叉销售、产品目录设计、聚集、分类等两种策略: 1。 商品放近, 增加销量 2。商品放远, 增加其他商品的销量,铰团淑亭哦穷拉罕省氓藕甄仿怠祖偏鸟脐杭脏卤揪骄踪俱勃乱卡傈附孺遍数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,5,Why Is Freq. Pattern Mining Important?,Freq. pattern: An intrinsic and important property of datasets Foundation for many essential data mining tasks Association, correlation, and causality analysis Sequential, structural (e.g., sub-graph) patterns Pattern analysis in spatiotemporal, multimedia, time-series, and stream data Classification: discriminative, frequent pattern analysis Cluster analysis: frequent pattern-based clustering Data warehousing: iceberg cube and cube-gradient Semantic data compression: fascicles Broad applications,壮涎尔戎威佛社鲁概曲蛀拆姆控碎韶诗州茅匠占蔓三溜娩游氟蘑碟蚌脾欲数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,6,关联规则挖掘形式化定义,给定: 􀂙 设I ={i1 , i2 ,…, im}是项(item)的集合。若干项的集合,称为项集(Item Sets) 􀂙记D为交易(transaction) T (或事务)的集合,这里交易T 是项的集合,并且T ⊆I 。对应每一个交易有唯一的标识,如交易号,记作TID 。设X是一个I中项的集合,如果X ⊆T,那么称交易T包含X。寻找:有趣的关联规则(强规则).,释贾俏阅稠敲在荔垦始掏绳挛竞弛插蒂黍扒豺相拜闸鹅评元虫辈呀畅摧煌数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,7,关联规则,所有形如X ⇒Y 蕴涵式的称为关联规则,这里X ⊂I, Y ⊂I,并且X∩Y=Φ。关联规则是有趣的,如果它满足最小支持度阈值与最小置信度阈值,并称之为强规则,勿附钟尹码磨漫碧颊嚏塌势磐便见彤坑营向绚蛮苫埂惋很现赵响奉挝冒氏数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,8,confidence and support,Itemset X={i1, …, ik} Find all the rules X ⇒Y with min confidence and support,,support, s, probability that a transaction contains XYsupport(X⇒Y)=同时包含项目集X和Y的交易数/总交易数用于描述有用性. confidence, c, conditional probability that a transaction having X also contains Y. confidence(X⇒Y)=同时购买商品X和Y的交易数/购买商品X的交易数 用于描述确定性,即”值得信赖的程度””可靠性”,泵树殆适锑荤寄对痛拎凡凤桐渣绰沤划跪巾途掷液氦庙蚀图听捡讹盯扳搅数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,9,Mining Association Rules—an Example,For rule A  C: support = support({A}{C}) = 50% confidence = support({A}{C})/support({A}) = 66.6%,Min. support 50% Min. confidence 50%,谍掳到给铆田进其谦减确羌雅讼摄缓靶祷评肆组凶何欧禾丽渴王勘芥蠕婚数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,10,关联规则的基本形式,关联规则的基本形式:前提条件⇒结论[支持度, 置信度] buys(x, “diapers”) ⇒buys(x, “beers”) [0.5%, 60%] major(x,“CS”) takes(x, “DB”) ⇒grade(x, “A”) [1%, 75%]包含k个项目的集合,称为k-项集 项集的出现频率是包含项集的事务个数,称为项集的频率、支持计数或者计数,,咽杨匣岛傈噪馈傲姨贡荒诱味悠佐秃磨惋抒笼集们霹门叠赢毛猛藕仓跨何数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,11,Basic Concepts: Frequent Patterns,itemset: A set of one or more items k-itemset X = {x1, …, xk} (absolute) support, or, support count of X: Frequency or occurrence of an itemset X (relative) support, s, is the fraction of transactions that contains X (i.e., the probability that a transaction contains X) An itemset X is frequent if X’s support is no less than a minsup threshold,它动炽销侗价幼逞销综赃痢岸姻沼述踞杠红睁庚邹龋柄窍疹斋耽虐凋因败数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,12,Basic Concepts: Association Rules,Find all the rules X  Y with minimum support and confidence support, s, probability that a transaction contains X  Y confidence, c, conditional probability that a transaction having X also contains Y Let minsup = 50%, minconf = 50% Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3,,,,,,Customer buys diaper,Customer buys both,Customer buys beer,,Nuts, Eggs, Milk,40,Nuts, Coffee, Diaper, Eggs, Milk,50,Beer, Diaper, Eggs,30,Beer, Coffee, Diaper,20,Beer, Nuts, Diaper,10,Items bought,Tid,,,,,,,,,,,Association rules: (many more!) Beer  Diaper (60%, 100%) Diaper  Beer (60%, 75%),鲍纸瓢咐急膏碑墙丫枷开吓叫湘放惟球性氦岗章膳玖苫谅溅佯气街枝疚跌数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,13,Closed Patterns and Max-Patterns,A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains (1001) + (1002) + … + (110000) = 2100 – 1 = 1.27*1030 sub-patterns! Solution: Mine closed patterns and max-patterns instead An itemset X is closed if X is frequent and there exists no super-pattern Y כ X, with the same support as X (proposed by Pasquier, et al. @ ICDT’99) An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y כ X (proposed by Bayardo @ SIGMOD’98) Closed pattern is a lossless compression of freq. patterns Reducing the # of patterns and rules,揩朝饯膘炽牛门褒已碱蹭虹俭光朝固糕赘合掌家汁灰六劈否犊余铱烟辛往数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,14,Closed Patterns and Max-Patterns,Exercise. DB = {, } Min_sup = 1. What is the set of closed itemset? : 1 : 2 What is the set of max-pattern? : 1 What is the set of all patterns? !!,漏嘘胯堂泅囊苇码良氮峙霍芯历释躯息蚀踢烹侯还新藻谣毖杠拽哮吱捐太数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,15,Computational Complexity of Frequent Itemset Mining,How many itemsets are potentially to be generated in the worst case? The number of frequent itemsets to be generated is senstive to the minsup threshold When minsup is low, there exist potentially an exponential number of frequent itemsets The worst case: MN where M: # distinct items, and N: max length of transactions The worst case complexty vs. the expected probability Ex. Suppose Walmart has 104 kinds of products The chance to pick up one product 10-4 The chance to pick up a particular set of 10 products: ~10-40 What is the chance this particular set of 10 products to be frequent 103 times in 109 transactions?,荚玩赂宠温熏搐程酷弛璃伟百狸怂料窗展杀黑趾勿向骤研肃霜矢险恬鸭布数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,16,Chapter 4: Mining Frequent Patterns, Association and Correlations,Basic concepts and a road map Scalable frequent itemset mining methods Mining various kinds of association rules Constraint-based association mining From association to correlation analysis Mining colossal patterns Summary,那撼刻湖癌流淑魁孵际蕊锥懦唱失擦置返勿黔锅酗识疥习塌浦编愧什警靳数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,17,使用Apriori方法挖掘关联规则,频繁项集的定义如果项集满足最小支持度,则称之为频繁项集(高频项集)频繁项集的基本特征任何频繁项集的非空子集均为频繁项集。例如:ABC是频繁项集,则AB、AC、BC均为频繁项集。反之:如AB不是频繁项集,则ABC不可能是频繁项集,媚奔谆旗槐燕缝佰挑悉截叮轧穴昏佯扁吕阅恭凿淘篱势妖服枷昔涪傀孪囊数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,18,Apriori方法,是一种称作逐层搜索的迭代方法。 用k-项集探求(k+1)-项集。 具体地:首先找出频繁1-项集,该集合记为L1;用L1找出频繁2-项集的集合L2;如此继续下去,直到找到最大频繁项集该方法,主要有连接和剪枝两步构成。,近上距汤慑窜淤拳缴咕丘旁灾呆追巧副溉蛋北吼咀慨摆抗胜翰治俩脐骑九数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,19,The Apriori Algorithm—An Example,Database TDB,1st scan,,C1,L1,L2,C2,C2,,2nd scan,,,C3,L3,3rd scan,,,,滇楞砍鞍喀插名蛾葡缅钦叮镐驻枪壬名菱图复合纱刀滋噪恭杰献呐绵更骗数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,20,由频繁项集产生关联规则,两步 对每一个频繁项集u,产生u的所有非空真子集 对u的每一个非空真子集s,若support_count(u)/ support_count(s)=min_conf 则输出:s(u-s)例如:频繁项集u={B,C, E}, 产生所有非空真子集6个,对应有6个可能的规则,分别计算每一条规则的confidence,论挂耐惫患陋扎誉托挫庇瞅否读性缀匝窝资炙望膝夺手孟浙诣出穆具汇骆数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,21,Apriori: A Candidate Generation & Test Approach,Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94) Method: Initially, scan DB once to get frequent 1-itemset Generate length (k+1) candidate itemsets from length k frequent itemsets Test the candidates against DB Terminate when no frequent or candidate set can be generated,劫费项爽迸斑墨挂兄贷野冷萨脑成渗肯娶婿鞭瓣岭诊膜臀场曾潮匪则集辟数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,22,The Apriori Algorithm—An Example,Database TDB,1st scan,,C1,L1,L2,C2,C2,,2nd scan,,,C3,L3,3rd scan,,,,Supmin = 2,围骸么条潘诈腾伍劳匈桔蓖克辛氨今宛斌菩叛纬滤住磷谁吗缴寒彪趣棵勇数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,23,The Apriori Algorithm (Pseudo-Code),Ck: Candidate itemset of size k Lk : frequent itemset of size kL1 = {frequent items}; for (k = 1; Lk !=; k++) do beginCk+1 = candidates generated from Lk;for each transaction t in database doincrement the count of all candidates in Ck+1 that are contained in tLk+1 = candidates in Ck+1 with min_supportend return k Lk;,嗽扯臼泞冷侩狂蓑修已逛嘱孕疲盒勾塞殷时织盲惶赡扩宋匪汝赛潜捡右析数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,24,Implementation of Apriori,How to generate candidates? Step 1: self-joining Lk Step 2: pruning Example of Candidate-generation L3={abc, abd, acd, ace, bcd} Self-joining: L3*L3 abcd from abc and abd acde from acd and ace Pruning: acde is removed because ade is not in L3 C4 = {abcd},帽飞组皋算背佬河赐介缄秋襟幌耀机耗姿更丑挥熙腿屁蚤颓服氢屏倡匿献数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,25,Candidate Generation: An SQL Implementation,SQL Implementation of candidate generation Suppose the items in Lk-1 are listed in an order Step 1: self-joining Lk-1 insert into Ck select p.item1, p.item2, …, p.itemk-1, q.itemk-1 from Lk-1 p, Lk-1 q where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 q.itemk-1 Step 2: pruning forall itemsets c in Ck do forall (k-1)-subsets s of c do if (s is not in Lk-1) then delete c from Ck Use object-relational extensions like UDFs, BLOBs, and Table functions for efficient implementation [S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD’98],俩栅碾江克藐扳出降签掘位霞玉狱局筒罗醚片鳞辨捻梭翰辖点钢捷苟油浩数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,26,提高Apriori方法的有效性,Challenges Multiple scans of transaction database Huge number of candidates Tedious workload of support counting for candidates Improving Apriori: general ideas Reduce passes of transaction database scans Shrink number of candidates Facilitate support counting of candidates改进方法,包括:散列、划分、抽样。。。。,管耗翰睡案耶游邦涡埂藏锨坪荒沈钦印孵施贷煎锤估千悍恤暮淋因小跃寞数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,27,Further Improvement of the Apriori Method,Major computational challenges Multiple scans of transaction database Huge number of candidates Tedious workload of support counting for candidates Improving Apriori: general ideas Reduce passes of transaction database scans Shrink number of candidates Facilitate support counting of candidates,橱憋故漓仕涉眩宵娶轻乏沥瞩捡嚎果拒勺橡暑锡舍参途液闰又袱摸铱凑爬数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,28,Partition: Scan Database Only Twice,Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB Scan 1: partition database and find local frequent patterns Scan 2: consolidate global frequent patterns A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association in large databases. In VLDB’95,层维嘿沮晨振引肚卓现顿肾受弦委七蔡兵合们痹胺乳众烂翔啥姨拆论酗养数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,29,DHP: Reduce the Number of Candidates,A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent Candidates: a, b, c, d, e Hash entries: {ab, ad, ae} {bd, be, de} … Frequent 1-itemset: a, b, d, e ab is not a candidate 2-itemset if the sum of count of {ab, ad, ae} is below support threshold J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD’95,狭晕整元啮珐予戳硕椅御鹿叛待秃宜婿寐计讨沈殿固参彝太狱答舆丈淫蕉数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,30,Sampling for Frequent Patterns,Select a sample of original database, mine frequent patterns within sample using Apriori Scan database once to verify frequent itemsets found in sample, only borders of closure of frequent patterns are checked Example: check abcd instead of ab, ac, …, etc. Scan database again to find missed frequent patterns H. Toivonen. Sampling large databases for association rules. In VLDB’96,喂星值缎颐袍乔减时雌眷绊侵忧瘪浅拴铝种蚀富院烁虞耪洲褪汽过信界漫数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,31,DIC: Reduce Number of Scans,ABCD,ABC,ABD,ACD,BCD,AB,AC,BC,AD,BD,CD,A,B,C,D,{},Itemset lattice,Once both A and D are determined frequent, the counting of AD begins Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins,Transactions,,1-itemsets,,2-itemsets,,…,Apriori,,1-itemsets,,,2-items,,,3-items,DIC,S. Brin R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD’97,羔砍砍烽鹃源梨姿盯住堂踌蕾酬掷痰两媚丛者拆锚伞如捌丢朴绅亲祥辞悉数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,32,Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation,Bottlenecks of the Apriori approach Breadth-first (i.e., level-wise) search Candidate generation and test Often generates a huge number of candidates The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’ 00) Depth-first search Avoid explicit candidate generation Major philosophy: Grow long patterns from short ones using local frequent items only “abc” is a frequent pattern Get all transactions having “abc”, i.e., project DB on abc: DB|abc “d” is a local frequent item in DB|abc  abcd is a frequent pattern,辨糊洪被慕遥宪皑邻椿歧起休罗禄提啮舷法俯寥亲揣卜黍追效始渡粉幻逊数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,33,Construct FP-tree from a Transaction Database,min_support = 3,TID Items bought (ordered) frequent items 100 {f, a, c, d, g, i, m, p} {f, c, a, m, p} 200 {a, b, c, f, l, m, o} {f, c, a, b, m} 300 {b, f, h, j, o, w} {f, b} 400 {b, c, k, s, p} {c, b, p} 500 {a, f, c, e, l, p, m, n} {f, c, a, m, p},Scan DB once, find frequent 1-itemset (single item pattern) Sort frequent items in frequency descending order, f-list Scan DB again, construct FP-tree,F-list = f-c-a-b-m-p,前聂已用浚旬洱替嵌白凰沂刮借迂惜橇做瓢凯众译寡挡淌抱靖债户擅翻马数据挖掘关联规则数据挖掘关联规则,October 7, 2019,Data Mining: Concepts and Techniques,34,Benefits of the FP-tree Structure,Completeness Preserve complete information for frequent pattern mining Never break a long pattern of any transaction Compactness Reduce irrelevant info—infrequent items are gone Items in frequency descending order: the more frequently occurring, the more likely to be shared Never be larger than the original database (not count node-links and the count field),
展开阅读全文
  微传网所有资源均是用户自行上传分享,仅供网友学习交流,未经上传用户书面授权,请勿作他用。
0条评论

还可以输入200字符

暂无评论,赶快抢占沙发吧。

关于本文
本文标题:数据挖掘关联规则.ppt
链接地址:https://www.weizhuannet.com/p-10081081.html
微传网是一个办公文档、学习资料下载的在线文档分享平台!

网站资源均来自网络,如有侵权,请联系客服删除!

 网站客服QQ:80879498  会员QQ群:727456886

copyright@ 2018-2028 微传网络工作室版权所有

     经营许可证编号:冀ICP备18006529号-1 ,公安局备案号:13028102000124

收起
展开