1、1,1,Data Mining:Concepts and Techniques(3rd ed.)Chapter 4,Jiawei Han,Micheline Kamber,and Jian PeiUniversity of Illinois at Urbana-Champaign&Simon Fraser University2011 Han,Kamber&Pei.All rights reserved.,3,Chapter 4:Data Warehousing and On-line Analytical Processing,Data Warehouse:Basic ConceptsDat
2、a Warehouse Modeling:Data Cube and OLAPData Warehouse Design and UsageData Warehouse ImplementationData Generalization by Attribute-Oriented InductionSummary,4,What is a Data Warehouse?,Defined in many different ways,but not rigorously.A decision support database that is maintained separately from t
3、he organizations operational databaseSupport information processing by providing a solid platform of consolidated,historical data for analysis.“A data warehouse is a subject-oriented,integrated,time-variant,and nonvolatile collection of data in support of managements decision-making process.”W.H.Inm
4、onData warehousing:The process of constructing and using data warehouses,5,Data WarehouseSubject-Oriented,Organized around major subjects,such as customer,product,salesFocusing on the modeling and analysis of data for decision makers,not on daily operations or transaction processingProvide a simple
5、and concise view around particular subject issues by excluding data that are not useful in the decision support process,6,Data WarehouseIntegrated,Constructed by integrating multiple,heterogeneous data sourcesrelational databases,flat files,on-line transaction recordsData cleaning and data integrati
6、on techniques are applied.Ensure consistency in naming conventions,encoding structures,attribute measures,etc.among different data sourcesE.g.,Hotel price:currency,tax,breakfast covered,etc.When data is moved to the warehouse,it is converted.,7,Data WarehouseTime Variant,The time horizon for the dat
7、a warehouse is significantly longer than that of operational systemsOperational database:current value dataData warehouse data:provide information from a historical perspective(e.g.,past 5-10 years)Every key structure in the data warehouseContains an element of time,explicitly or implicitlyBut the k
8、ey of operational data may or may not contain“time element”,8,Data WarehouseNonvolatile,A physically separate store of data transformed from the operational environmentOperational update of data does not occur in the data warehouse environmentDoes not require transaction processing,recovery,and conc
9、urrency control mechanismsRequires only two operations in data accessing:initial loading of data and access of data,9,OLTP vs.OLAP,10,Why a Separate Data Warehouse?,High performance for both systemsDBMS tuned for OLTP:access methods,indexing,concurrency control,recoveryWarehousetuned for OLAP:comple
10、x OLAP queries,multidimensional view,consolidationDifferent functions and different data:missing data:Decision support requires historical data which operational DBs do not typically maintaindata consolidation:DS requires consolidation(aggregation,summarization)of data from heterogeneous sourcesdata
11、 quality:different sources typically use inconsistent data representations,codes and formats which have to be reconciledNote:There are more and more systems which perform OLAP analysis directly on relational databases,11,Data Warehouse:A Multi-Tiered Architecture,DataWarehouse,OLAP Engine,AnalysisQu
12、eryReportsData mining,Monitor&Integrator,Metadata,Data Sources,Front-End Tools,Serve,Data Marts,Data Storage,OLAP Server,12,Three Data Warehouse Models,Enterprise warehousecollects all of the information about subjects spanning the entire organizationData Marta subset of corporate-wide data that is
13、of value to a specific groups of users.Its scope is confined to specific,selected groups,such as marketing data martIndependent vs.dependent(directly from warehouse)data martVirtual warehouseA set of views over operational databasesOnly some of the possible summary views may be materialized,13,Extra
14、ction,Transformation,and Loading(ETL),Data extractionget data from multiple,heterogeneous,and external sourcesData cleaningdetect errors in the data and rectify them when possibleData transformationconvert data from legacy or host format to warehouse formatLoadsort,summarize,consolidate,compute view
15、s,check integrity,and build indicies and partitionsRefreshpropagate the updates from the data sources to the warehouse,14,Metadata Repository,Meta data is the data defining warehouse objects.It stores:Description of the structure of the data warehouseschema,view,dimensions,hierarchies,derived data d
16、efn,data mart locations and contentsOperational meta-datadata lineage(history of migrated data and transformation path),currency of data(active,archived,or purged),monitoring information(warehouse usage statistics,error reports,audit trails)The algorithms used for summarizationThe mapping from opera
17、tional environment to the data warehouseData related to system performancewarehouse schema,view and derived data definitionsBusiness databusiness terms and definitions,ownership of data,charging policies,15,Chapter 4:Data Warehousing and On-line Analytical Processing,Data Warehouse:Basic ConceptsDat
18、a Warehouse Modeling:Data Cube and OLAPData Warehouse Design and UsageData Warehouse ImplementationData Generalization by Attribute-Oriented InductionSummary,16,From Tables and Spreadsheets to Data Cubes,A data warehouse is based on a multidimensional data model which views data in the form of a dat
19、a cubeA data cube,such as sales,allows data to be modeled and viewed in multiple dimensionsDimension tables,such as item(item_name,brand,type),or time(day,week,month,quarter,year)Fact table contains measures(such as dollars_sold)and keys to each of the related dimension tablesIn data warehousing lit
20、erature,an n-D base cube is called a base cuboid.The top most 0-D cuboid,which holds the highest-level of summarization,is called the apex cuboid.The lattice of cuboids forms a data cube.,17,Cube:A Lattice of Cuboids,time,item,time,item,location,time,item,location,supplier,18,Conceptual Modeling of
21、Data Warehouses,Modeling data warehouses:dimensions&measuresStar schema:A fact table in the middle connected to a set of dimension tables Snowflake schema:A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables,forming a shape similar to snow
22、flakeFact constellations:Multiple fact tables share dimension tables,viewed as a collection of stars,therefore called galaxy schema or fact constellation,19,Example of Star Schema,Sales Fact Table,time_key,item_key,branch_key,location_key,units_sold,dollars_sold,avg_sales,Measures,20,Example of Snow
23、flake Schema,Sales Fact Table,time_key,item_key,branch_key,location_key,units_sold,dollars_sold,avg_sales,Measures,21,Example of Fact Constellation,Sales Fact Table,time_key,item_key,branch_key,location_key,units_sold,dollars_sold,avg_sales,Measures,Shipping Fact Table,time_key,item_key,shipper_key,
24、from_location,to_location,dollars_cost,units_shipped,22,A Concept Hierarchy:Dimension(location),all,Europe,North_America,Mexico,Canada,Spain,Germany,Vancouver,M.Wind,L.Chan,.,.,.,.,.,.,all,region,office,country,Toronto,Frankfurt,city,23,Data Cube Measures:Three Categories,Distributive:if the result
25、derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioningE.g.,count(),sum(),min(),max()Algebraic:if it can be computed by an algebraic function with M arguments(where M is a bounded integer),each of which is obtai
26、ned by applying a distributive aggregate functionE.g.,avg(),min_N(),standard_deviation()Holistic:if there is no constant bound on the storage size needed to describe a subaggregate.E.g.,median(),mode(),rank(),24,View of Warehouses and Hierarchies,Specification of hierarchiesSchema hierarchyday month quarter;week yearSet_grouping hierarchy1.10 inexpensive,25,Multidimensional Data,Sales volume as a function of product,month,and region,