Data Mining and Data Warehousing

Table of Contents

  1. Data Mining Concepts and Architecture
  2. Data Preprocessing
  3. Association Rule Mining
  4. Classification Methods
  5. Clustering Techniques
  6. Outlier Detection
  7. Data Warehouse Architecture
  8. ETL Process
  9. OLAP vs OLTP
  10. Star Schema and Snowflake Schema
  11. Fact and Dimension Tables
  12. Data Marts
  13. Metadata
  14. Data Lake vs Data Warehouse
  15. Big Data Analytics Lifecycle

1. Data Mining Concepts and Architecture

Definition

Data Mining is the process of discovering meaningful patterns, correlations, anomalies, and trends from large datasets using techniques from statistics, machine learning, and database systems.

Key Characteristics

Data Mining Architecture

┌─────────────────────────────────────────────────┐
│                  USER INTERFACE                  │
├─────────────────────────────────────────────────┤
│              DATA MINING ENGINE                  │
│  (Pattern Evaluation, Classification,           │
│   Clustering, Association Rules)                 │
├─────────────────────────────────────────────────┤
│            DATABASE / DATA WAREHOUSE             │
│         (Data Cleaning & Integration)            │
├─────────────────────────────────────────────────┤
│              DATA SOURCES                        │
│   (Databases, Flat Files, Web, Sensors)         │
└─────────────────────────────────────────────────┘

Steps in KDD Process

  1. Data Cleaning — Remove noise and inconsistent data
  2. Data Integration — Combine data from multiple sources
  3. Data Selection — Choose relevant data for analysis
  4. Data Transformation — Transform data into appropriate forms
  5. Data Mining — Apply intelligent methods to extract patterns
  6. Pattern Evaluation — Identify truly interesting patterns
  7. Knowledge Presentation — Visualize and present knowledge

Types of Data Mining Tasks

| Task | Description | Example |
|------|-------------|---------|
| Classification | Predict categorical label | Spam detection |
| Regression | Predict continuous value | Stock price prediction |
| Clustering | Group similar data points | Customer segmentation |
| Association | Find co-occurrence rules | Market basket analysis |
| Anomaly Detection | Identify unusual patterns | Fraud detection |
| Summarization | Compact representation | Report generation |

2. Data Preprocessing

Real-world data is often incomplete, noisy, and inconsistent. Preprocessing improves data quality and mining results.

2.1 Data Cleaning

2.2 Data Integration

2.3 Data Reduction

Data reduction shrinks the data volume while producing the same (or nearly the same) analytical results.

| Technique | Description |
|-----------|-------------|
| Dimensionality Reduction | Remove irrelevant attributes (PCA, feature selection) |
| Numerosity Reduction | Replace data with smaller representations (parametric: regression; non-parametric: histograms, clustering, sampling) |
| Data Compression | Lossless or lossy compression techniques |

2.4 Data Transformation

Normalization:
- Min-Max: v' = (v - min) / (max - min)
- Z-Score: v' = (v - μ) / σ
- Decimal Scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1


3. Association Rule Mining

Definition

Association rule mining finds interesting relationships (associations) among items in transactional databases.

Key Concepts

3.1 Apriori Algorithm

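Principle: If an itemset is frequent, then all its subsets must also be frequent (the downward-closure, or anti-monotone, property). Equivalently, if an itemset is infrequent, none of its supersets can be frequent, so they can be pruned without counting.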

Steps:
1. Scan database to find frequent 1-itemsets (L₁)
2. Use L₁ to generate candidate 2-itemsets (C₂)
3. Scan database to find frequent 2-itemsets (L₂)
4. Repeat until no more frequent itemsets are found
5. Generate association rules from frequent itemsets

Example:

Transactions:
T1: {bread, milk}
T2: {bread, diaper, beer, eggs}
T3: {milk, diaper, beer, cola}
T4: {bread, milk, diaper, beer}
T5: {bread, milk, diaper, cola}

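Min Support = 60% (at least 3 of 5 transactions), Min Confidence = 80%
Frequent 1-itemsets: {bread}, {milk}, {diaper}, {beer}
Frequent 2-itemsets: {bread, milk}, {bread, diaper}, {milk, diaper}, {diaper, beer}
Frequent 3-itemsets: none ({bread, milk, diaper} appears in only 2 transactions)
Strong rules: {beer} → {diaper} (support 60%, confidence 3/3 = 100%)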

Limitations: Multiple database scans, expensive candidate generation
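
The steps above can be sketched in Python on the same five transactions. This is a minimal illustration, not an optimized implementation: the function names (`support`, `apriori`, `rules`) are chosen here for the example, and the database is rescanned for every candidate.

```python
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(transactions, min_support):
    """Return {frozenset: support} for all frequent itemsets."""
    items = {item for t in transactions for item in t}
    current = [frozenset([i]) for i in items]   # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Scan the database: keep the candidates that meet the threshold (L_k).
        level = {}
        for c in current:
            s = support(c, transactions)
            if s >= min_support:
                level[c] = s
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent

def rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) for rules meeting the threshold."""
    out = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = s / frequent[antecedent]
                if conf >= min_confidence:
                    out.append((antecedent, itemset - antecedent, conf))
    return out

frequent = apriori(transactions, min_support=0.6)
strong = rules(frequent, min_confidence=0.8)
```

With these thresholds the sketch finds the four frequent 1-itemsets and four frequent 2-itemsets listed in the example, and a single strong rule, {beer} → {diaper}.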

3.2 FP-Growth (Frequent Pattern Growth)

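Advantage over Apriori: No candidate generation, and only two database scans

Steps:
1. Scan the database once to find the frequent 1-itemsets and their counts
2. Scan the database a second time and compress it into an FP-Tree (Frequent Pattern Tree), inserting each transaction's frequent items in descending order of frequency
3. Mine frequent itemsets directly from the FP-Tree by recursively building conditional pattern bases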

FP-Tree Structure:
- Root node is null
- Each transaction maps to a path from the root; transactions with a common prefix share nodes
- Nodes store item names and counts
- Header table links nodes with same item
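
A small sketch of the tree construction and of one conditional pattern base may help here. It assumes the market-basket transactions from Section 3.1 and a minimum count of 3; the class and function names (`FPNode`, `build_fp_tree`, `conditional_pattern_base`) are illustrative, and the header table is kept as a plain list of nodes per item rather than a linked list.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item          # None for the root
        self.count = 0
        self.parent = parent
        self.children = {}        # item -> FPNode

def build_fp_tree(transactions, min_count):
    # Scan 1: count items and keep the frequent ones.
    counts = Counter(item for t in transactions for item in t)
    frequent = {i: c for i, c in counts.items() if c >= min_count}

    root = FPNode(None)
    header = {item: [] for item in frequent}   # item -> nodes holding that item

    # Scan 2: insert each transaction with items ordered by descending
    # frequency (ties broken alphabetically) so common prefixes share nodes.
    for t in transactions:
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header

def conditional_pattern_base(item, header):
    """Prefix paths that lead to `item`, each paired with that node's count."""
    base = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        base.append((path[::-1], node.count))
    return base

transactions = [
    ["bread", "milk"],
    ["bread", "diaper", "beer", "eggs"],
    ["milk", "diaper", "beer", "cola"],
    ["bread", "milk", "diaper", "beer"],
    ["bread", "milk", "diaper", "cola"],
]
root, header = build_fp_tree(transactions, min_count=3)
beer_base = conditional_pattern_base("beer", header)
```

Here `beer_base` holds the three prefix paths that end in beer; a full FP-Growth implementation would build a conditional FP-Tree from each such base and recurse.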


4. Classification Methods

Classification is the task of learning a target function that maps each attribute set to one of the predefined class labels.

4.1 Decision Trees

4.2 Bayesian Classification

4.3 k-Nearest Neighbors (k-NN)

4.4 Support Vector Machines (SVM) for Classification


5. Clustering Techniques

Clustering groups data objects so that objects in the same cluster are similar and objects in different clusters are dissimilar.

5.1 K-Means Clustering

Algorithm:
1. Choose k initial centroids
2. Assign each point to the nearest centroid
3. Recalculate centroids as mean of assigned points
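4. Repeat steps 2 and 3 until convergence (centroids no longer change)

Key Properties:
- Time complexity: O(n × k × t), where n = number of points, k = number of clusters, t = number of iterations (treating the number of dimensions as constant)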
- Sensitive to initial centroid selection
- Works well with spherical clusters
- K-Means++: Smart initialization to improve convergence
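
The four steps can be sketched in plain Python as follows. This is a minimal version under simplifying assumptions: the initial centroids are passed in explicitly (no K-Means++), points are tuples of floats, and the sample data is made up for the example.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm on a list of equal-length tuples."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
        # Step 4: stop when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
# Step 1: choose k = 2 initial centroids (here simply two of the points).
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
```

On this toy data the two groups are well separated, so the result does not depend much on the starting centroids; on real data a poor start can lead to a worse local optimum, which is what K-Means++ is meant to mitigate.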

5.2 Hierarchical Clustering

Agglomerative (Bottom-Up):
1. Each object starts as its own cluster
2. Merge closest pair of clusters
3. Repeat until all objects are in one cluster (or a stopping criterion, such as a target number of clusters, is met)

Divisive (Top-Down):
1. All objects start in one cluster
2. Split clusters recursively
3. Stop when desired number of clusters reached

Linkage Criteria:
| Method | Description |
|--------|-------------|
| Single Linkage | Minimum distance between clusters |
| Complete Linkage | Maximum distance between clusters |
| Average Linkage | Average distance between all pairs |
| Ward's Method | Minimizes total within-cluster variance |

Dendrogram: Tree diagram showing cluster hierarchy; cutting at different levels gives different numbers of clusters.
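
As an illustration of the agglomerative procedure, the sketch below merges clusters bottom-up and stops at a requested number of clusters, which corresponds to cutting the dendrogram at that level. It is a naive version written for readability (every pairwise distance is recomputed on each pass); the function name and the sample points are assumptions made for the example.

```python
import math

def agglomerative(points, num_clusters, linkage=min):
    """Merge clusters bottom-up until num_clusters remain.

    linkage=min gives single linkage, linkage=max gives complete linkage.
    """
    clusters = [[p] for p in points]            # each object starts as its own cluster
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(math.dist(a, b)
                            for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)          # merge the closest pair (j > i)
    return clusters

points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
two_clusters = agglomerative(points, num_clusters=2)
three_clusters = agglomerative(points, num_clusters=3)
```

Asking for three clusters stops one merge earlier than asking for two, which mirrors cutting the same dendrogram at a lower height.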

5.3 DBSCAN (Density-Based Spatial Clustering)

Key Concepts:
- ε (eps): Neighborhood radius
- MinPts: Minimum points to form a dense region
- Core Point: Has at least MinPts points (counting itself) within its ε radius
- Border Point: Within ε of a core point but not core itself
- Noise Point: Neither core nor border

Advantages: Discovers clusters of arbitrary shape, handles noise, doesn't require specifying k
Disadvantages: Sensitive to ε and MinPts parameters, struggles with varying densities
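
The core/border/noise definitions can be turned into a short sketch. The version below uses a brute-force neighborhood search (O(n²) overall) rather than a spatial index, labels noise as -1, and counts a point as part of its own ε-neighborhood; the sample points and parameter values are chosen only for illustration.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    def neighbors(i):
        # The eps-neighborhood includes the point itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE              # may later be relabeled as a border point
            continue
        cluster_id += 1                    # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id     # border point reached from a core point
            if labels[j] is not None:
                continue                   # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The two squares of four points each become clusters 0 and 1, and the isolated point at (50, 50) is labeled as noise.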

5.4 Comparison of Clustering Methods

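| Feature | K-Means | Hierarchical | DBSCAN |
|---------|---------|--------------|--------|
| Cluster Shape | Spherical | Any (depends on linkage) | Any |
| Number of Clusters | Must be specified | Not required | Not required |
| Handles Noise | No | No | Yes |
| Scalability | Good | Poor | Moderate |
| Time Complexity | O(nkt) | O(n² log n) | O(n log n) with a spatial index; O(n²) worst case |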

6. Outlier Detection

An outlier is a data object that deviates significantly from the normal objects.

Types of Outliers

Methods

| Method | Approach |
|--------|----------|
| Statistical | Assume the data follows a distribution; flag points beyond a threshold (e.g., absolute z-score > 3) |
| Distance-Based | A point is an outlier if the fraction of points within distance d of it is less than a threshold p |
| Density-Based | A point is an outlier if its local density differs significantly from that of its neighbors (LOF, Local Outlier Factor) |
| Clustering-Based | Points that do not belong to any cluster, or that fall in very small clusters, are outliers |
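
The statistical approach can be illustrated with a z-score check. This sketch assumes roughly normal data and uses the population standard deviation; the readings are made-up values with one obvious anomaly.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []                      # all values identical: nothing deviates
    return [x for x in values if abs(x - mu) / sigma > threshold]

readings = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
            11, 9, 10, 12, 10, 9, 11, 10, 10, 95]
outliers = zscore_outliers(readings)
```

Note that the outlier itself inflates the mean and standard deviation, so on very small samples no point can reach a z-score of 3; robust variants based on the median are often preferred in practice.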

7. Data Warehouse Architecture

Definition

A data warehouse is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of management's decisions.

Key Characteristics (Bill Inmon's Definition)

| Property | Description |
|----------|-------------|
| Subject-Oriented | Organized around major subjects (customer, sales, product) |
| Integrated | Data from multiple sources is cleaned and standardized |
| Non-Volatile | Once loaded, data is not updated or deleted by operational transactions; it is only loaded and read |
| Time-Variant | Data is stored with a time dimension for historical analysis |

Architecture Components

┌──────────────────────────────────────────────────────┐
│                     DATA SOURCES                     │
│  (OLTP Systems, ERP, CRM, Flat Files, External)      │
└──────────────────┬───────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────┐
│                      ETL LAYER                       │
│  (Extract → Transform → Load)                        │
└──────────────────┬───────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────┐
│               DATA WAREHOUSE DATABASE                │
│  (Central Repository - Star/Snowflake Schema)        │
├──────────────────────────────────────────────────────┤
│                 METADATA REPOSITORY                  │
│  (Data about data - definitions, mappings, rules)    │
└──────────────────┬───────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────┐
│             FRONT-END / ANALYTICS TOOLS              │
│  (OLAP, Data Mining, Reporting, Dashboards, BI)      │
└──────────────────────────────────────────────────────┘

Top-Down vs Bottom-Up Approaches


8. ETL Process

Extract

Transform

Load


9. OLAP vs OLTP

| Feature | OLTP | OLAP |
|---------|------|------|
| Purpose | Day-to-day operations | Decision support, analysis |
| Data | Current, detailed | Historical, summarized, multidimensional |
| Operations | Insert, update, delete, simple queries | Complex queries with aggregations |
| Users | Clerks, DBAs, online users | Managers, analysts, executives |
| Database Size | 100 MB – 100 GB | 100 GB – 1 TB+ |
| Response Time | Milliseconds | Seconds to minutes |
| Normalization | Highly normalized (3NF) | Denormalized (star/snowflake) |
| Concurrent Users | Thousands | Hundreds |
| Example | ATM transactions, order entry | Sales trend analysis, forecasting |

OLAP Operations

OLAP Types


10. Star Schema and Snowflake Schema

Star Schema

                 ┌───────────┐
                 │   TIME    │
                 │ DIMENSION │
                 └─────┬─────┘
                       │
┌───────────┐    ┌─────┴─────┐    ┌───────────┐
│  PRODUCT  ├────┤   SALES   ├────┤   STORE   │
│ DIMENSION │    │FACT TABLE │    │ DIMENSION │
└───────────┘    └─────┬─────┘    └───────────┘
                       │
                 ┌─────┴─────┐
                 │ CUSTOMER  │
                 │ DIMENSION │
                 └───────────┘

Advantages: Simpler queries, better query performance, easy to understand
Disadvantages: Data redundancy in dimension tables, more storage
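
To make the "simpler queries" point concrete, here is a small sketch using Python's built-in `sqlite3` module. The table and column names (`fact_sales`, `dim_product`, `dim_store`) and the sample rows are hypothetical; only two of the four dimensions from the diagram are modeled to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        quantity   INTEGER,
        amount     REAL
    );
    INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
    INSERT INTO dim_store   VALUES (1, 'Delhi'), (2, 'Mumbai');
    INSERT INTO fact_sales  VALUES (1, 1, 2, 2000.0), (1, 2, 1, 1000.0), (2, 1, 3, 450.0);
""")

# A typical star join: the fact table joins directly to each dimension (one hop),
# then measures are aggregated by dimension attributes.
rows = conn.execute("""
    SELECT p.category, s.city, SUM(f.amount) AS total_amount
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_store   AS s ON s.store_id   = f.store_id
    GROUP BY p.category, s.city
    ORDER BY p.category, s.city
""").fetchall()
```

In a snowflake design, `category` would typically live in its own table referenced from `dim_product`, so the same report would need one more join.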

Snowflake Schema

Advantages: Less data redundancy, saves storage space, better data integrity
Disadvantages: More complex queries (more joins), potentially slower performance

Comparison

| Feature | Star Schema | Snowflake Schema |
|---------|-------------|------------------|
| Normalization | Denormalized dimensions | Normalized dimensions |
| Query Complexity | Simple (fewer joins) | Complex (more joins) |
| Query Performance | Faster | Slower |
| Storage | More (redundancy) | Less (normalized) |
| Design Complexity | Simple | Complex |
| ETL Complexity | Lower | Higher |

11. Fact and Dimension Tables

Fact Table

- Contains foreign keys to dimension tables

Types of Fact Tables:
| Type | Description | Example |
|------|-------------|---------|
| Transaction | One row per transaction | Sales receipt |
| Periodic Snapshot | State at regular intervals | Monthly account balance |
| Accumulating Snapshot | Tracks process milestones | Order fulfillment pipeline |
| Factless | No measures, only foreign keys | Attendance tracking |

Types of Facts:
- Additive: Can be summed across all dimensions (sales amount)
- Semi-Additive: Can be summed across some dimensions but not others (an account balance can be summed across accounts, but not across time)
- Non-Additive: Cannot be summed (ratios, percentages)
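
A short sketch of why an account balance is semi-additive. The snapshot rows below are hypothetical periodic-snapshot data in the form (month, account, balance), and the helper names are made up for the example.

```python
# Hypothetical periodic-snapshot rows: (month, account, balance)
snapshots = [
    ("2024-01", "A", 100), ("2024-01", "B", 200),
    ("2024-02", "A", 150), ("2024-02", "B", 250),
]

def total_balance(rows, month):
    """Summing across accounts within a single month is meaningful."""
    return sum(balance for m, _, balance in rows if m == month)

def average_balance(rows, account):
    """Across time, use an average (or the latest value) instead of a sum."""
    values = [balance for _, a, balance in rows if a == account]
    return sum(values) / len(values)

january_total = total_balance(snapshots, "2024-01")
account_a_average = average_balance(snapshots, "A")
```

Adding account A's January and February balances (100 + 150) would give 250, a figure that was never actually held, which is why the time dimension is aggregated with an average or a last-value rule instead.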

Dimension Table

Common Dimensions:
- Time: Date, month, quarter, year, day of week
- Geography: Country, state, city, pin code
- Product: Category, subcategory, brand, SKU
- Customer: Name, age, gender, segment

Slowly Changing Dimensions (SCD):
| Type | Strategy | Use Case |
|------|----------|----------|
| Type 1 | Overwrite | Correcting errors |
| Type 2 | Add new row with effective dates | Tracking history |
| Type 3 | Add new column | Limited history (previous + current) |
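
A Type 2 change can be sketched as "expire the current row, then append a new one". The dimension below is a plain list of dictionaries, and the column names (`customer_id`, `city`, `start_date`, `end_date`) are assumptions for the example; a real warehouse would usually also carry a surrogate key and often a current-row flag.

```python
from datetime import date

def apply_scd2(rows, key, new_attrs, effective):
    """Close the current row for `key` and append a new current row."""
    for row in rows:
        if row["customer_id"] == key and row["end_date"] is None:
            if all(row[k] == v for k, v in new_attrs.items()):
                return rows                     # nothing changed, keep the row open
            row["end_date"] = effective         # expire the old version
            new_row = {**row, **new_attrs, "start_date": effective, "end_date": None}
            rows.append(new_row)
            return rows
    raise KeyError(key)

dim_customer = [
    {"customer_id": 1, "city": "Pune", "start_date": date(2020, 1, 1), "end_date": None},
]
apply_scd2(dim_customer, 1, {"city": "Delhi"}, effective=date(2024, 6, 1))
```

After the call the dimension holds two rows for customer 1: the Pune row closed on 2024-06-01 and an open Delhi row starting on that date. A Type 1 change would instead overwrite `city` in place and keep a single row.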


12. Data Marts

Definition

A data mart is a subset of a data warehouse focused on a specific business line, department, or subject area.

Types

| Type | Description |
|------|-------------|
| Dependent | Sourced directly from enterprise data warehouse |
| Independent | Built directly from operational systems (no warehouse) |
| Hybrid | Combination of warehouse and operational data |

Data Mart vs Data Warehouse

| Feature | Data Mart | Data Warehouse |
|---------|-----------|----------------|
| Scope | Departmental | Enterprise-wide |
| Size | Small (GB) | Large (TB) |
| Implementation Time | Weeks to months | Months to years |
| Cost | Lower | Higher |
| Users | Specific department | Entire organization |

13. Metadata

Definition

Metadata is "data about data" — it describes the structure, content, quality, and other characteristics of data.

Types of Metadata

| Type | Description | Examples |
|------|-------------|----------|
| Business Metadata | Business context and meaning | Definitions, business rules, ownership |
| Technical Metadata | Technical details of storage | Table names, column types, indexes, ETL mappings |
| Operational Metadata | Data processing information | Load timestamps, record counts, data lineage |

Metadata Repository


14. Data Lake vs Data Warehouse

| Feature | Data Warehouse | Data Lake |
|---------|----------------|-----------|
| Data Type | Mostly structured, processed data | Structured, semi-structured, unstructured |
| Schema | Schema-on-write | Schema-on-read |
| Storage Cost | Higher (proprietary) | Lower (commodity hardware, HDFS, S3) |
| Users | Business analysts | Data scientists, engineers |
| Processing | Optimized for SQL queries | Batch + real-time processing |
| Data Quality | High (cleaned, curated) | Raw (may include low-quality data) |
| Flexibility | Rigid schema | Highly flexible |
| Examples | Amazon Redshift, Snowflake | Hadoop HDFS, AWS S3, Azure Data Lake |

Modern Architecture: Data Lakehouse


15. Big Data Analytics Lifecycle

Phases

┌─────────────┐    ┌──────────────┐    ┌──────────────┐
│  Business   │───▶│  Data        │───▶│  Data        │
│  Problem    │    │  Preparation │    │  Exploration │
└─────────────┘    └──────────────┘    └──────┬───────┘
                                              │
┌─────────────┐    ┌──────────────┐    ┌──────▼───────┐
│  Communicate│◀───│  Validate &  │◀───│  Model       │
│  Results    │    │  Deploy      │    │  Building    │
└─────────────┘    └──────────────┘    └──────────────┘

Phase Details

| Phase | Activities |
|-------|------------|
| 1. Business Problem Definition | Define objectives, success criteria, scope |
| 2. Data Acquisition | Identify sources, collect data, assess quality |
| 3. Data Preparation | Clean, transform, integrate, feature engineering |
| 4. Data Exploration | EDA, statistical summaries, visualization |
| 5. Model Planning | Select algorithms, define evaluation metrics |
| 6. Model Building | Train models, tune hyperparameters, cross-validation |
| 7. Communicate Results | Visualizations, reports, dashboards, storytelling |
| 8. Operationalize | Deploy model, monitor performance, retrain as needed |

Analytics Types

| Type | Question | Example |
|------|----------|---------|
| Descriptive | What happened? | Sales reports |
| Diagnostic | Why did it happen? | Root cause analysis |
| Predictive | What will happen? | Demand forecasting |
| Prescriptive | What should we do? | Optimization recommendations |

Key Formulas Summary

| Concept | Formula |
|---------|---------|
| Support | P(A ∩ B) |
| Confidence | P(B\|A) = P(A ∩ B) / P(A) |
| Lift | P(A ∩ B) / (P(A) × P(B)) |
| Entropy | -Σ pᵢ log₂(pᵢ) |
| Information Gain | Entropy(S) - Σ(\|Sv\|/\|S\|) × Entropy(Sv) |
| Gini Index | 1 - Σ pᵢ² |
| Euclidean Distance | √(Σ(xᵢ - yᵢ)²) |
| Z-Score | (x - μ) / σ |
| Min-Max Normalization | (v - min) / (max - min) |
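
Several of these formulas are easy to check numerically. The sketch below implements entropy, Gini index, information gain, and min-max normalization directly from the table; the function names and the toy labels are chosen for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """-sum(p_i * log2(p_i)) over the class proportions p_i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """1 - sum(p_i ** 2) over the class proportions p_i."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, groups):
    """groups: the label lists S_v obtained by splitting `labels` on an attribute."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

labels = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]
gain = information_gain(labels, perfect_split)
scaled = min_max([10, 20, 30])
```

A 50/50 class mix gives the maximum two-class entropy of 1 bit and a Gini index of 0.5; a split that separates the classes perfectly recovers the full 1 bit as information gain.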

Exam Tips

- Understand SCD types (Type 2 is the most commonly asked)


Practice Questions

11 MCQs for Data Mining and Data Warehousing with detailed explanations.

Q1. Which of the following best describes the Data Mining step of the KDD process?

✅ Correct Answer: Option C

Explanation:
The correct answer is Option C: apply intelligent methods to extract patterns. Data Mining is step 5 of the KDD process, after Data Transformation and before Pattern Evaluation.

This concept is covered under Data Mining and Data Warehousing in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.


Q2. Which of the following best describes a fact table?

✅ Correct Answer: Option D

Explanation:
The correct answer is Option D: it contains foreign keys to dimension tables.


Q3. Which SCD (Slowly Changing Dimension) type is most commonly asked about?

✅ Correct Answer: Option D

Explanation:
The correct answer is Option D: Type 2, which tracks history by adding a new row with effective dates.


Q4. Which of the following best describes dimension tables in a star schema?

✅ Correct Answer: Option C

Explanation:
The correct answer is Option C: they are denormalized.


Q5. Which of the following best describes the Data Transformation step of the KDD process?

✅ Correct Answer: Option B

Explanation:
The correct answer is Option B: transform data into forms appropriate for mining. Data Transformation is step 4 of the KDD process, immediately before Data Mining.


Q6. Which of the following best describes the Pattern Evaluation step of the KDD process?

✅ Correct Answer: Option C

Explanation:
The correct answer is Option C: identify the truly interesting patterns. Pattern Evaluation is step 6 of the KDD process, immediately before Knowledge Presentation.


Q7. Which of the following best describes dimension tables in a snowflake schema?

✅ Correct Answer: Option A

Explanation:
The correct answer is Option A: they are normalized, that is, broken into sub-dimensions.


Q8. In decimal scaling normalization, v' = v / 10^j. Which of the following best describes j?

✅ Correct Answer: Option C

Explanation:
The correct answer is Option C: j is the smallest integer such that max(|v'|) < 1.


Q9. Which of the following best describes real-world data?

✅ Correct Answer: Option D

Explanation:
The correct answer is Option D: it is often incomplete, noisy, and inconsistent. Preprocessing improves data quality and mining results.


Q10. Which of the following best describes data mining?

✅ Correct Answer: Option A

Explanation:
The correct answer is Option A: the process of discovering meaningful patterns, correlations, anomalies, and trends from large datasets using techniques from statistics, machine learning, and database systems.


Q11. Which of the following best describes an outlier?

✅ Correct Answer: Option C

Explanation:
The correct answer is Option C: a data object that deviates significantly from the normal objects.
