Data Mining and Data Warehousing

Data Mining Concepts and Architecture
Data Preprocessing
Association Rule Mining
Classification Methods
Clustering Techniques
Outlier Detection
Data Warehouse Architecture
ETL Process
OLAP vs OLTP
Star Schema and Snowflake Schema
Fact and Dimension Tables
Data Marts
Metadata
Data Lake vs Data Warehouse
Big Data Analytics Lifecycle

1. Data Mining Concepts and Architecture

Definition

Data Mining is the process of discovering meaningful patterns, correlations, anomalies, and trends from large datasets using techniques from statistics, machine learning, and database systems.

Key Characteristics

Knowledge Discovery in Databases (KDD): Data mining is a core step in the KDD process
Automated/semi-automated exploration of large quantities of data
Goal: Extract actionable intelligence from raw data

Data Mining Architecture

┌─────────────────────────────────────────────────┐
│                  USER INTERFACE                  │
├─────────────────────────────────────────────────┤
│              DATA MINING ENGINE                  │
│  (Pattern Evaluation, Classification,           │
│   Clustering, Association Rules)                 │
├─────────────────────────────────────────────────┤
│            DATABASE / DATA WAREHOUSE             │
│         (Data Cleaning & Integration)            │
├─────────────────────────────────────────────────┤
│              DATA SOURCES                        │
│   (Databases, Flat Files, Web, Sensors)         │
└─────────────────────────────────────────────────┘

Steps in KDD Process

Data Cleaning — Remove noise and inconsistent data
Data Integration — Combine data from multiple sources
Data Selection — Choose relevant data for analysis
Data Transformation — Transform data into appropriate forms
Data Mining — Apply intelligent methods to extract patterns
Pattern Evaluation — Identify truly interesting patterns
Knowledge Presentation — Visualize and present knowledge

Types of Data Mining Tasks

Task	Description	Example
Classification	Predict categorical label	Spam detection
Regression	Predict continuous value	Stock price prediction
Clustering	Group similar data points	Customer segmentation
Association	Find co-occurrence rules	Market basket analysis
Anomaly Detection	Identify unusual patterns	Fraud detection
Summarization	Compact representation	Report generation

2. Data Preprocessing

Real-world data is often incomplete, noisy, and inconsistent. Preprocessing improves data quality and mining results.

2.1 Data Cleaning

Missing Values: Fill using mean/median/mode, regression, or ignore tuples
Noisy Data: Apply binning (smoothing by bin means/medians/bin boundaries), regression, or clustering
Inconsistent Data: Correct using domain knowledge, external references
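
As a rough illustration of two of the cleaning techniques above, the sketch below fills missing values with the mean and smooths noisy values by bin means using equal-frequency bins. The function names and the sample numbers are hypothetical, chosen only for this example.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort values, split them into equal-frequency bins, and replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sample
print(smooth_by_bin_means(prices, bin_size=3))
print(fill_missing_with_mean([10, None, 20, 30]))
```

Smoothing by bin medians or bin boundaries would follow the same loop with a different replacement value per bin.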

2.2 Data Integration

Combine data from multiple sources into a coherent data store
Schema Integration: Resolve entity identification problems (e.g., "customer_id" vs "cust_no")
Redundancy Detection: Correlated attributes may be redundant (e.g., age and date of birth)
Conflict Detection: Different sources may have different scales/formats

2.3 Data Reduction

Reduces data volume while maintaining analytical integrity.

Technique	Description
Dimensionality Reduction	Reduce the number of attributes (feature selection drops irrelevant ones; PCA projects the data onto fewer derived components)
Numerosity Reduction	Replace data with smaller representations (parametric: regression; non-parametric: histograms, clustering, sampling)
Data Compression	Lossless or lossy compression techniques

2.4 Data Transformation

Smoothing: Remove noise (binning, regression, clustering)
Aggregation: Summary operations (e.g., daily → monthly sales)
Generalization: Replace low-level data with higher-level concepts (e.g., street → city)
Normalization: Scale data to a specific range
Min-Max Normalization: v' = (v - min) / (max - min)
Z-Score Normalization: v' = (v - mean) / std_dev
Decimal Scaling: v' = v / 10^j where j is smallest integer such that max(|v'|) < 1
Attribute Construction: Create new attributes from existing ones
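
The three normalization formulas above can be sketched as follows. This is a minimal illustration, not a library implementation: the function names and the sample values are hypothetical, and the z-score uses the population standard deviation, which is an assumption because the notes do not say which variant is meant.

```python
import statistics


def min_max(values):
    """v' = (v - min) / (max - min); assumes max != min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """v' = (v - mean) / std_dev, using the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def decimal_scaling(values):
    """v' = v / 10^j with the smallest integer j such that max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


incomes = [200, 300, 400, 600, 1000]  # hypothetical attribute values
print(min_max(incomes))
print([round(z, 3) for z in z_score(incomes)])
print(decimal_scaling(incomes))
```

Note that the largest value here is exactly 1000, so j = 3 would give max(|v'|) = 1, which is not strictly below 1; the loop therefore stops at j = 4.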

3. Association Rule Mining

Definition

Finds interesting relationships (associations) among items in transactional databases.

Key Concepts

Itemset: A collection of items (e.g., {bread, butter})
Support: Fraction of transactions containing an itemset
Support(X) = (Number of transactions containing X) / (Total transactions)
Confidence: How often Y appears in the transactions that contain X
Confidence(X → Y) = Support(X ∪ Y) / Support(X)
Lift: How much more often X and Y occur together than would be expected if they were independent
Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
Lift > 1: Positive correlation; Lift = 1: Independent; Lift < 1: Negative correlation
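
To make the three measures concrete, the sketch below evaluates the rule {diaper} → {beer} on the five transactions used in the Apriori example of Section 3.1. The helper names are hypothetical.

```python
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y):
    """Support(X ∪ Y) / Support(X)."""
    return support(set(x) | set(y)) / support(x)


def lift(x, y):
    """Support(X ∪ Y) / (Support(X) × Support(Y))."""
    return support(set(x) | set(y)) / (support(x) * support(y))


x, y = {"diaper"}, {"beer"}
print("support   :", round(support(x | y), 2))
print("confidence:", round(confidence(x, y), 2))
print("lift      :", round(lift(x, y), 2))
```

Here {diaper, beer} occurs in 3 of 5 transactions, diaper in 4 and beer in 3, so the confidence is 0.6 / 0.8 = 0.75 and the lift is 0.6 / (0.8 × 0.6) = 1.25, a positive correlation.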

3.1 Apriori Algorithm

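Principle: If an itemset is frequent, then all its subsets must also be frequent. Equivalently, if an itemset is infrequent, every superset of it is also infrequent (anti-monotonicity of support), which lets Apriori prune candidates without counting them.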

Steps:
1. Scan database to find frequent 1-itemsets (L₁)
2. Use L₁ to generate candidate 2-itemsets (C₂)
3. Scan database to find frequent 2-itemsets (L₂)
4. Repeat until no more frequent itemsets are found
5. Generate association rules from frequent itemsets

Example:

Transactions:
T1: {bread, milk}
T2: {bread, diaper, beer, eggs}
T3: {milk, diaper, beer, cola}
T4: {bread, milk, diaper, beer}
T5: {bread, milk, diaper, cola}

Min Support = 60%, Min Confidence = 80%
Frequent 1-itemsets: {bread}, {milk}, {diaper}, {beer}
Frequent 2-itemsets: {bread, milk}, {bread, diaper}, {milk, diaper}, {diaper, beer}

Limitations: Multiple database scans, expensive candidate generation
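
The level-wise search can be sketched in a few lines. This is a simplified illustration of the Apriori steps (join, prune, count), not an optimized implementation; the function name and structure are assumptions. Run on the example transactions with a minimum support of 60%, it yields the four frequent 1-itemsets and four frequent 2-itemsets listed above.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frequent itemset: support} using level-wise candidate generation."""
    n = len(transactions)

    def frequent(candidates):
        # One pass over the data per level: count each candidate, keep the frequent ones.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}

    level = frequent({frozenset([item]) for t in transactions for item in t})
    all_frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = frequent(candidates)
        all_frequent.update(level)
        k += 1
    return all_frequent


transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
for itemset, sup in sorted(apriori(transactions, 0.6).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The only 3-itemset that survives pruning is {bread, milk, diaper}, and it appears in just 2 of 5 transactions, so the search stops at level 3.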

3.2 FP-Growth (Frequent Pattern Growth)

Advantage over Apriori: No candidate generation, fewer database scans

Steps:
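1. Scan the database once to find frequent 1-itemsets and their counts
2. Scan the database a second time to compress it into an FP-Tree (Frequent Pattern Tree), inserting each transaction's frequent items in descending order of frequency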
3. Mine frequent itemsets directly from FP-Tree by recursively building conditional pattern bases

FP-Tree Structure:
- Root node is null
- Each path from the root represents one or more transactions that share the same prefix of frequent items
- Nodes store item names and counts
- Header table links nodes with same item
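
A minimal sketch of FP-Tree construction is shown below; it covers only building the tree and header table, not the recursive mining step. The class and function names are hypothetical, and ties in item frequency are broken alphabetically, which is an arbitrary choice made for this example.

```python
from collections import Counter


class FPNode:
    def __init__(self, item, parent):
        self.item = item          # None for the root
        self.count = 0
        self.parent = parent
        self.children = {}        # item -> FPNode


def build_fp_tree(transactions, min_count):
    # First scan: count items and keep only the frequent ones.
    counts = Counter(item for t in transactions for item in t)
    frequent = {item: c for item, c in counts.items() if c >= min_count}

    root = FPNode(None, None)
    header = {item: [] for item in frequent}  # item -> all nodes holding that item

    # Second scan: insert each transaction's frequent items, most frequent first.
    for t in transactions:
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1      # shared prefixes accumulate counts
            node = child
    return root, header


transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
root, header = build_fp_tree(transactions, min_count=3)
for item, nodes in header.items():
    print(item, [n.count for n in nodes])
```

Summing the node counts reached through one header entry gives back that item's total frequency, which is what the mining step relies on when it builds conditional pattern bases.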

4. Classification Methods

Classification is the task of learning a target function that maps each attribute set to one of the predefined class labels.

4.1 Decision Trees

Structure: Internal nodes = attribute tests, Branches = test outcomes, Leaf nodes = class labels
Splitting Criteria:
Information Gain (ID3): Gain(S,A) = Entropy(S) - Σ(|Sv|/|S|) × Entropy(Sv)
Entropy: Entropy(S) = -Σ pᵢ log₂(pᵢ)
Gain Ratio (C4.5): GainRatio = Gain / SplitInfo (normalizes for bias toward multi-valued attributes)
Gini Index (CART): Gini(S) = 1 - Σ pᵢ²
Pruning: Pre-pruning (stop early) and Post-pruning (grow full tree then prune)
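
The splitting criteria can be computed directly from class counts. The sketch below implements entropy, Gini index, and information gain as defined above; the tiny dataset and the attribute names ("windy", "weekend", "play") are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy(S) = -Σ pᵢ log₂(pᵢ) over the class proportions in labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini(S) = 1 - Σ pᵢ²."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Gain(S, A) = Entropy(S) - Σ (|Sv| / |S|) × Entropy(Sv)."""
    labels = [r[target] for r in rows]
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


rows = [  # hypothetical training data
    {"windy": True, "weekend": True, "play": "no"},
    {"windy": True, "weekend": False, "play": "no"},
    {"windy": False, "weekend": True, "play": "yes"},
    {"windy": False, "weekend": False, "play": "yes"},
]
print("entropy:", entropy([r["play"] for r in rows]))
print("gini   :", gini([r["play"] for r in rows]))
print("gain(windy)  :", information_gain(rows, "windy", "play"))
print("gain(weekend):", information_gain(rows, "weekend", "play"))
```

In this toy data "windy" separates the classes perfectly (gain of 1 bit) while "weekend" carries no information (gain of 0), so ID3 would split on "windy".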

4.2 Bayesian Classification

Based on Bayes' Theorem: P(C|X) = P(X|C) × P(C) / P(X)
Naive Bayes: Assumes conditional independence of attributes given the class
P(C|X) ∝ P(C) × ∏ P(xᵢ|C)
Advantages: Fast, handles high-dimensional data, works well with small training sets
Disadvantages: Independence assumption rarely holds in practice
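
A minimal sketch of Naive Bayes for categorical attributes follows. It estimates P(C) and P(xᵢ|C) from raw counts and applies no smoothing, so an attribute value never seen with a class drives that class's score to zero. The function names and the weather-style data are hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    # feature_counts[label][attribute index][value] = count
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[label][i][value] += 1
    return class_counts, feature_counts


def predict(model, row):
    """Return the class maximizing P(C) × ∏ P(xᵢ|C), plus all unnormalized scores."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total                                   # prior P(C)
        for i, value in enumerate(row):
            score *= feature_counts[label][i][value] / count    # likelihood P(xᵢ|C)
        scores[label] = score
    return max(scores, key=scores.get), scores


rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool"), ("sunny", "cool")]
labels = ["no", "no", "yes", "yes", "yes"]  # hypothetical class labels
model = train_naive_bayes(rows, labels)
print(predict(model, ("rain", "mild")))
```

For the query ("rain", "mild") the score for "yes" is 3/5 × 2/3 × 1/3 = 2/15, while "no" scores 0 because "rain" never occurs with that class.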

4.3 k-Nearest Neighbors (k-NN)

Instance-based (lazy) learning — no explicit model building
Classify new instance based on majority vote of k nearest neighbors
Distance Metrics:
Euclidean: d(x,y) = √(Σ(xᵢ - yᵢ)²)
Manhattan: d(x,y) = Σ|xᵢ - yᵢ|
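Choosing k: Small k → noise sensitive; Large k → boundary smoothing. A common rule of thumb is k ≈ √n, where n is the number of training instances; in practice k is usually tuned by cross-validation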
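
The majority-vote idea can be sketched as below. The function name, the `metric` parameter, and the sample points are hypothetical; Euclidean distance comes from `math.dist` and Manhattan distance is written out by hand.

```python
from collections import Counter
from math import dist


def manhattan(x, y):
    """d(x, y) = Σ|xᵢ - yᵢ|."""
    return sum(abs(a - b) for a, b in zip(x, y))


def knn_predict(train, query, k, metric=dist):
    """Classify query by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda pair: metric(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


train = [  # hypothetical labeled points
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]
print(knn_predict(train, (2, 2), k=3))
print(knn_predict(train, (6, 5), k=3, metric=manhattan))
```

Because there is no training phase, all the work happens at query time: every prediction sorts the full training set, which is why k-NN is called a lazy learner.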

4.4 Support Vector Machines (SVM) for Classification

Finds the optimal hyperplane that maximizes the margin between classes
Margin: Distance between hyperplane and nearest data points (support vectors)
Linear SVM: w·x + b = 0 with margin = 2/||w||
Soft Margin: Allows some misclassification using slack variables; the C parameter controls the trade-off between a wide margin and training errors
Kernel Trick: Maps data to higher dimensions for non-linear separation
Linear: K(x,y) = x·y
Polynomial: K(x,y) = (x·y + c)^d
RBF (Gaussian): K(x,y) = exp(-γ||x-y||²)
Sigmoid: K(x,y) = tanh(αx·y + c)
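
The four kernels above are plain functions of two vectors, as the following sketch shows. The default parameter values (c, d, gamma, alpha) are arbitrary choices for illustration, not recommended settings.

```python
from math import exp, tanh


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, c=1.0, d=2):
    return (dot(x, y) + c) ** d


def rbf_kernel(x, y, gamma=0.5):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, alpha=0.1, c=0.0):
    return tanh(alpha * dot(x, y) + c)


x, y = (1.0, 2.0), (3.0, 4.0)
print(linear_kernel(x, y))
print(polynomial_kernel(x, y))
print(rbf_kernel(x, y))
print(sigmoid_kernel(x, y))
```

The RBF kernel equals 1 when the two points coincide and decays toward 0 as they move apart, with gamma controlling how quickly.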

5. Clustering Techniques

Clustering groups data objects so that objects in the same cluster are similar and objects in different clusters are dissimilar.

5.1 K-Means Clustering

Algorithm:
1. Choose k initial centroids
2. Assign each point to the nearest centroid
3. Recalculate centroids as mean of assigned points
4. Repeat until convergence (centroids don't change)

Key Properties:
- Time complexity: O(n × k × t) where t = iterations
- Sensitive to initial centroid selection
- Works well with spherical clusters
- K-Means++: Smart initialization to improve convergence
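
The four algorithm steps map onto a short loop. In this sketch the initial centroids are passed in explicitly so the run is deterministic; random or K-Means++ initialization is left out. The function name and sample points are hypothetical.

```python
from math import dist


def k_means(points, centroids, max_iter=100):
    """Assign points to the nearest centroid and recompute means until nothing changes."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each centroid as the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 4: stop when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]  # hypothetical data
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

Each iteration compares every point with every centroid, which is where the O(n × k × t) cost comes from.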

5.2 Hierarchical Clustering

Agglomerative (Bottom-Up):
1. Each object starts as its own cluster
2. Merge closest pair of clusters
3. Repeat until all objects in one cluster

Divisive (Top-Down):
1. All objects start in one cluster
2. Split clusters recursively
3. Stop when desired number of clusters reached

Linkage Criteria:
Method	Description
Single Linkage	Minimum distance between clusters
Complete Linkage	Maximum distance between clusters
Average Linkage	Average distance between all pairs
Ward's Method	Minimizes total within-cluster variance

Dendrogram: Tree diagram showing cluster hierarchy; cutting at different levels gives different numbers of clusters.
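
A naive agglomerative sketch is shown below. It recomputes all pairwise cluster distances on every merge, so it is far slower than practical implementations, and it stops at a requested number of clusters instead of building a full dendrogram. Passing `min` gives single linkage and `max` gives complete linkage; the function name and data are hypothetical.

```python
from math import dist


def agglomerative(points, num_clusters, linkage=min):
    """Merge the closest pair of clusters until num_clusters remain."""
    clusters = [[p] for p in points]          # each object starts as its own cluster
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Cluster distance under the chosen linkage criterion.
                d = linkage(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]  # hypothetical data
print(agglomerative(points, num_clusters=3))
print(agglomerative(points, num_clusters=2, linkage=max))
```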

5.3 DBSCAN (Density-Based Spatial Clustering)

Key Concepts:
- ε (eps): Neighborhood radius
- MinPts: Minimum points to form a dense region
- Core Point: Has ≥ MinPts within ε radius
- Border Point: Within ε of a core point but not core itself
- Noise Point: Neither core nor border

Advantages: Discovers clusters of arbitrary shape, handles noise, doesn't require specifying k
Disadvantages: Sensitive to ε and MinPts parameters, struggles with varying densities
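
The core, border, and noise definitions translate into the following simplified sketch. It uses a brute-force neighborhood search (quadratic time) and counts a point as part of its own ε-neighborhood, which is a common convention but an assumption here. Noise points are labeled -1.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks noise points."""
    labels = {}
    cluster_id = 0

    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if i in labels:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE            # may later be relabeled as a border point
            continue
        labels[i] = cluster_id           # i is a core point: start a new cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels.get(j) == NOISE:
                labels[j] = cluster_id   # border point reached from a core point
            if j in labels:
                continue
            labels[j] = cluster_id
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:   # j is also core: keep expanding
                queue.extend(j_nbrs)
        cluster_id += 1
    return [labels[i] for i in range(len(points))]


points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (5, 15)]                       # hypothetical data with one isolated point
print(dbscan(points, eps=1.5, min_pts=3))
```

The two dense squares become clusters 0 and 1, and the isolated point is reported as noise without the number of clusters ever being specified.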

5.4 Comparison of Clustering Methods

Feature	K-Means	Hierarchical	DBSCAN
Cluster Shape	Spherical	Any	Any
Number of Clusters	Required	Not required	Not required
Handles Noise	No	No	Yes
Scalability	Good	Poor	Moderate
Time Complexity	O(nkt)	O(n² log n) (O(n³) for a naive implementation)	O(n log n) with a spatial index, O(n²) otherwise

6. Outlier Detection

An outlier is a data object that deviates significantly from the normal objects.

Types of Outliers

Global Outlier: Deviates from entire dataset
Contextual Outlier: Deviates within a specific context
Collective Outlier: A collection of data objects deviates from the entire dataset

Methods

Method	Approach
Statistical	Assume data follows a distribution; flag points beyond threshold (e.g., z-score > 3)
Distance-Based	Outlier if fraction of points within distance d is less than threshold p
Density-Based	Outlier if local density significantly differs from neighbors (LOF - Local Outlier Factor)
Clustering-Based	Points not belonging to any cluster are outliers
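
The statistical approach in the table can be sketched with a z-score test. The function name, the threshold default, and the readings are hypothetical, and the population standard deviation is used as an assumption.

```python
import statistics


def z_score_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]


readings = [10, 12, 11, 9, 10, 11, 10, 9, 12, 10,
            11, 10, 9, 11, 10, 12, 9, 10, 11, 100]  # hypothetical sensor values
print(z_score_outliers(readings))
```

One caveat of this method is that the outlier itself inflates the mean and standard deviation; with very small samples a single extreme value can never reach a z-score of 3, so the test needs enough data to be meaningful.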

7. Data Warehouse Architecture

Definition

A data warehouse is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of management's decisions.

Key Characteristics (Bill Inmon's Definition)

Property	Description
Subject-Oriented	Organized around major subjects (customer, sales, product)
Integrated	Data from multiple sources is cleaned and standardized
Non-Volatile	Once loaded, data is not updated or deleted by day-to-day operations; it is only loaded and read
Time-Variant	Data is stored with a time dimension for historical analysis

Architecture Components

┌──────────────────────────────────────────────────────┐
│                  DATA SOURCES                         │
│  (OLTP Systems, ERP, CRM, Flat Files, External)     │
└──────────────────┬───────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────┐
│              ETL LAYER                                │
│  (Extract → Transform → Load)                        │
└──────────────────┬───────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────┐
│            DATA WAREHOUSE DATABASE                    │
│  (Central Repository - Star/Snowflake Schema)        │
├──────────────────────────────────────────────────────┤
│              METADATA REPOSITORY                      │
│  (Data about data - definitions, mappings, rules)    │
└──────────────────┬───────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────┐
│           FRONT-END / ANALYTICS TOOLS                 │
│  (OLAP, Data Mining, Reporting, Dashboards, BI)      │
└──────────────────────────────────────────────────────┘

Top-Down vs Bottom-Up Approaches

Top-Down (Inmon): Build enterprise data warehouse first, then data marts
Bottom-Up (Kimball): Build data marts first, integrate into data warehouse

8. ETL Process

Extract

Pull data from various source systems
Methods: Full extraction, incremental extraction (change data capture)
Handle different formats: relational, flat files, XML, JSON

Transform

Data Cleansing: Standardize formats, handle nulls, remove duplicates
Business Rules: Apply calculations, aggregations, derivations
Integration: Map source fields to warehouse schema
Surrogate Keys: Generate artificial keys for dimension tables
Slowly Changing Dimensions (SCD):
Type 1: Overwrite old value
Type 2: Add new row with versioning (most common)
Type 3: Add new column for previous value
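
A Type 2 change can be sketched as "expire the current row, append a new version". The column names used here (surrogate_key, customer_id, start_date, end_date, is_current) are hypothetical; real warehouses use their own naming and usually perform this step in SQL or an ETL tool.

```python
from datetime import date


def apply_scd_type2(dimension, natural_key, new_attrs, effective):
    """Expire the current row for natural_key (if it changed) and append a new version."""
    current = next((row for row in dimension
                    if row["customer_id"] == natural_key and row["is_current"]), None)
    if current is not None:
        if all(current[k] == v for k, v in new_attrs.items()):
            return dimension                   # no attribute changed: nothing to do
        current["end_date"] = effective        # close the old version
        current["is_current"] = False
    dimension.append({
        "surrogate_key": len(dimension) + 1,   # artificial key, new for every version
        "customer_id": natural_key,
        **new_attrs,
        "start_date": effective,
        "end_date": None,
        "is_current": True,
    })
    return dimension


customer_dim = []
apply_scd_type2(customer_dim, 101, {"city": "Pune"}, date(2022, 1, 1))
apply_scd_type2(customer_dim, 101, {"city": "Mumbai"}, date(2023, 6, 1))
for row in customer_dim:
    print(row)
```

Both versions of customer 101 remain in the dimension, so facts recorded before June 2023 can still be reported against Pune. A Type 1 change would instead overwrite the city in place.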

Load

Full Load: Replace all data (initial load)
Incremental Load: Load only changed data (daily/hourly)
Refresh vs Update: Refresh = complete replacement; Update = insert new + update changed

9. OLAP vs OLTP

Feature	OLTP	OLAP
Purpose	Day-to-day operations	Decision support, analysis
Data	Current, detailed	Historical, summarized, multidimensional
Operations	Insert, Update, Delete, Simple queries	Complex queries with aggregations
Users	Clerks, DBAs, online users	Managers, analysts, executives
Database Size	100 MB – 100 GB	100 GB – 1 TB+
Response Time	Milliseconds	Seconds to minutes
Normalization	Highly normalized (3NF)	Denormalized (star schema) or partially normalized (snowflake schema)
Concurrent Users	Thousands	Hundreds
Example	ATM transactions, order entry	Sales trend analysis, forecasting

OLAP Operations

Roll-up: Summarize data (e.g., city → state → country)
Drill-down: Go from summary to detail (e.g., year → quarter → month)
Slice: Fix a single value on one dimension to obtain a subcube with one fewer dimension (e.g., sales in 2023)
Dice: Select a subcube by restricting two or more dimensions (e.g., sales in 2023 for electronics in Mumbai)
Pivot/Rotate: Rotate the data axes for different perspectives
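
Roll-up, slice, and dice can be imitated on a small list of fact rows, as sketched below. The rows, amounts, and helper names are hypothetical; a real OLAP engine would run these operations against a cube or a star schema.

```python
from collections import defaultdict

sales = [  # hypothetical fact rows
    {"year": 2023, "city": "Mumbai", "state": "Maharashtra", "category": "electronics", "amount": 100},
    {"year": 2023, "city": "Pune", "state": "Maharashtra", "category": "electronics", "amount": 50},
    {"year": 2023, "city": "Mumbai", "state": "Maharashtra", "category": "clothing", "amount": 30},
    {"year": 2022, "city": "Mumbai", "state": "Maharashtra", "category": "electronics", "amount": 80},
    {"year": 2023, "city": "Bengaluru", "state": "Karnataka", "category": "electronics", "amount": 70},
]


def roll_up(rows, level):
    """Aggregate amount at the given level of a hierarchy (e.g. city, then state)."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[level]] += r["amount"]
    return dict(totals)


def slice_(rows, dimension, value):
    """Fix one dimension to a single value."""
    return [r for r in rows if r[dimension] == value]


def dice(rows, **criteria):
    """Restrict several dimensions at once."""
    return [r for r in rows if all(r[d] == v for d, v in criteria.items())]


print(roll_up(sales, "city"))    # detailed level
print(roll_up(sales, "state"))   # rolled up from city to state
print(len(slice_(sales, "year", 2023)))
print(dice(sales, year=2023, category="electronics", city="Mumbai"))
```

Drill-down is the reverse of the roll-up shown here: going from the state totals back to the city totals.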

OLAP Types

MOLAP (Multidimensional OLAP): Data stored in multidimensional cubes; fast query performance
ROLAP (Relational OLAP): Data stored in relational databases; more scalable
HOLAP (Hybrid OLAP): Combination of MOLAP and ROLAP

10. Star Schema and Snowflake Schema

Star Schema

Central fact table connected to multiple dimension tables
Dimension tables are denormalized
Resembles a star shape

         ┌──────────┐
         │  TIME    │
         │ DIMENSION│
         └────┬─────┘
              │
┌──────────┐  │  ┌──────────┐
│ PRODUCT  ├──┼──┤  SALES   │
│DIMENSION │  │  │ FACT     │
└──────────┘  │  │ TABLE    │
              │  └────┬─────┘
         ┌────┴─────┐ │
         │ STORE    │ │
         │DIMENSION │ │
         └──────────┘ │
              ┌────────┴──┐
              │ CUSTOMER  │
              │ DIMENSION │
              └───────────┘

Advantages: Simpler queries, better query performance, easy to understand
Disadvantages: Data redundancy in dimension tables, more storage
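
To show why star-schema queries stay simple, the sketch below builds a tiny star schema in an in-memory SQLite database and aggregates the fact table with one join per dimension. All table names, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    quantity    INTEGER,
    amount      REAL
);
INSERT INTO dim_product VALUES (1, 'Phone', 'Electronics'), (2, 'Shirt', 'Clothing');
INSERT INTO dim_store   VALUES (1, 'Mumbai'), (2, 'Delhi');
INSERT INTO fact_sales  VALUES (1, 1, 2, 1000.0), (2, 1, 5, 250.0), (1, 2, 1, 500.0);
""")

# One join per dimension: the fact table sits at the center of the star.
rows = conn.execute("""
SELECT p.category, s.city, SUM(f.amount)
FROM fact_sales f
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_store   s ON s.store_key   = f.store_key
GROUP BY p.category, s.city
ORDER BY p.category, s.city
""").fetchall()

for row in rows:
    print(row)
```

In a snowflake version, category would move into its own table referenced from dim_product, and the same report would need one more join.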

Snowflake Schema

Star schema with normalized dimension tables
Dimension tables are broken into sub-dimensions
Resembles a snowflake shape

Advantages: Less data redundancy, saves storage space, better data integrity
Disadvantages: More complex queries (more joins), potentially slower performance

Comparison

Feature	Star Schema	Snowflake Schema
Normalization	Denormalized dimensions	Normalized dimensions
Query Complexity	Simple (fewer joins)	Complex (more joins)
Query Performance	Faster	Slower
Storage	More (redundancy)	Less (normalized)
Design Complexity	Simple	Complex
ETL Complexity	Lower	Higher

11. Fact and Dimension Tables

Fact Table

Contains measures/metrics (quantitative data)
Contains foreign keys to dimension tables
Typically the largest table in the warehouse

Types of Fact Tables:
Type	Description	Example
Transaction	One row per transaction	Sales receipt
Periodic Snapshot	State at regular intervals	Monthly account balance
Accumulating Snapshot	Tracks process milestones	Order fulfillment pipeline
Factless	No measures, only foreign keys	Attendance tracking

Types of Facts:
- Additive: Can be summed across all dimensions (sales amount)
- Semi-Additive: Can be summed across some dimensions (account balance)
- Non-Additive: Cannot be summed (ratios, percentages)
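
The difference between additive and semi-additive facts shows up as soon as a balance is aggregated. In the sketch below, which uses hypothetical monthly snapshot rows, balances are summed across accounts within a month, but across months the latest balance is taken instead of a sum.

```python
from collections import defaultdict

# (account, month, balance): hypothetical periodic snapshot rows
balances = [
    ("A", "2023-01", 100), ("A", "2023-02", 150),
    ("B", "2023-01", 200), ("B", "2023-02", 180),
]


def total_by_month(rows):
    """Valid: a balance can be summed across the account dimension."""
    totals = defaultdict(int)
    for _, month, balance in rows:
        totals[month] += balance
    return dict(totals)


def closing_balance(rows):
    """Across time, keep the latest balance per account instead of summing."""
    latest = {}
    for account, month, balance in sorted(rows, key=lambda r: r[1]):
        latest[account] = balance
    return latest


print(total_by_month(balances))
print(closing_balance(balances))
```

Summing account A over both months would give 250, a figure that corresponds to no real balance, which is why balances are only semi-additive.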

Dimension Table

Contains descriptive attributes (context for facts)
Contains a primary key that links to fact table
Typically wide with many descriptive columns

Common Dimensions:
- Time: Date, month, quarter, year, day of week
- Geography: Country, state, city, pin code
- Product: Category, subcategory, brand, SKU
- Customer: Name, age, gender, segment

Slowly Changing Dimensions (SCD):
Type	Strategy	Use Case
Type 1	Overwrite	Correcting errors
Type 2	Add new row with effective dates	Tracking history
Type 3	Add new column	Limited history (previous + current)

12. Data Marts

Definition

A data mart is a subset of a data warehouse focused on a specific business line, department, or subject area.

Types

Type	Description
Dependent	Sourced directly from enterprise data warehouse
Independent	Built directly from operational systems (no warehouse)
Hybrid	Combination of warehouse and operational data

Data Mart vs Data Warehouse

Feature	Data Mart	Data Warehouse
Scope	Departmental	Enterprise-wide
Size	Small (GB)	Large (TB)
Implementation Time	Weeks to months	Months to years
Cost	Lower	Higher
Users	Specific department	Entire organization

13. Metadata

Definition

Metadata is "data about data" — it describes the structure, content, quality, and other characteristics of data.

Types of Metadata

Type	Description	Examples
Business Metadata	Business context and meaning	Definitions, business rules, ownership
Technical Metadata	Technical details of storage	Table names, column types, indexes, ETL mappings
Operational Metadata	Data processing information	Load timestamps, record counts, data lineage

Metadata Repository

Centralized store for all metadata
Supports data governance, impact analysis, data lineage tracking
Tools: Apache Atlas, Collibra, Informatica Metadata Manager

14. Data Lake vs Data Warehouse

Feature	Data Warehouse	Data Lake
Data Type	Primarily structured (many modern warehouses also accept semi-structured data such as JSON)	Structured, semi-structured, unstructured
Schema	Schema-on-write	Schema-on-read
Storage Cost	Higher (proprietary)	Lower (commodity hardware, HDFS, S3)
Users	Business analysts	Data scientists, engineers
Processing	Optimized for SQL queries	Batch + real-time processing
Data Quality	High (cleaned, curated)	Raw (may include low-quality data)
Flexibility	Rigid schema	Highly flexible
Examples	Amazon Redshift, Snowflake	Hadoop HDFS, AWS S3, Azure Data Lake

Modern Architecture: Data Lakehouse

Combines best of data lake (flexibility, cost) and data warehouse (ACID transactions, governance)
Technologies: Delta Lake, Apache Iceberg, Apache Hudi

15. Big Data Analytics Lifecycle

Phases

┌─────────────┐    ┌──────────────┐    ┌──────────────┐
│  Business    │───▶│  Data        │───▶│  Data        │
│  Problem     │    │  Preparation │    │  Exploration │
└─────────────┘    └──────────────┘    └──────┬───────┘
                                              │
┌─────────────┐    ┌──────────────┐    ┌──────▼───────┐
│  Communicate│◀───│  Validate &  │◀───│  Model       │
│  Results    │    │  Deploy      │    │  Building    │
└─────────────┘    └──────────────┘    └──────────────┘

Phase Details

Phase	Activities
1. Business Problem Definition	Define objectives, success criteria, scope
2. Data Acquisition	Identify sources, collect data, assess quality
3. Data Preparation	Clean, transform, integrate, feature engineering
4. Data Exploration	EDA, statistical summaries, visualization
5. Model Planning	Select algorithms, define evaluation metrics
6. Model Building	Train models, tune hyperparameters, cross-validation
7. Communicate Results	Visualizations, reports, dashboards, storytelling
8. Operationalize	Deploy model, monitor performance, retrain as needed

Analytics Types

Type	Question	Example
Descriptive	What happened?	Sales reports
Diagnostic	Why did it happen?	Root cause analysis
Predictive	What will happen?	Demand forecasting
Prescriptive	What should we do?	Optimization recommendations

Key Formulas Summary

Concept	Formula
Support	P(A ∩ B)
Confidence	P(B|A) = P(A ∩ B) / P(A)
Lift	P(A ∩ B) / (P(A) × P(B))
Entropy	-Σ pᵢ log₂(pᵢ)
Information Gain	Entropy(S) - Σ(|Sv|/|S|) × Entropy(Sv)
Gini Index	1 - Σ pᵢ²
Euclidean Distance	√(Σ(xᵢ - yᵢ)²)
Z-Score	(x - μ) / σ
Min-Max Normalization	(v - min) / (max - min)

Exam Tips

Understand the difference between classification and clustering (supervised vs unsupervised)
Know Apriori vs FP-Growth trade-offs
Be clear on star vs snowflake schema and when to use each
Remember OLTP vs OLAP characteristics
Understand SCD types (Type 2 is most commonly asked)
Know the data warehouse characteristics (subject-oriented, integrated, non-volatile, time-variant)
Understand ETL process and its importance

Practice Questions

11 MCQs for Data Mining and Data Warehousing with detailed explanations.

Q1. Which step of the KDD process applies intelligent methods to extract patterns?

A. Data Cleaning
B. Data Selection
C. Data Mining
D. Pattern Evaluation

✅ Correct Answer: Option C

Explanation:
The correct answer is Option C: Data Mining is the KDD step that applies intelligent methods to extract patterns.

This concept is covered under Data Mining and Data Warehousing in the CBDT Assistant Director Systems syllabus.

Why other options are incorrect:
- Option A: Data Cleaning removes noise and inconsistent data.
- Option B: Data Selection chooses the data relevant to the analysis.
- Option D: Pattern Evaluation identifies the truly interesting patterns after mining.

Q2. Which of the following contains foreign keys to dimension tables?

A. Dimension table
B. Data mart
C. Metadata repository
D. Fact table

✅ Correct Answer: Option D

Explanation:
The correct answer is Option D: a fact table holds the measures together with foreign keys to its dimension tables.

Why other options are incorrect:
- Option A: A dimension table holds descriptive attributes and the primary key that the fact table references.
- Option B: A data mart is a subset of a data warehouse, not a type of table.
- Option C: The metadata repository stores data about data, such as definitions and mappings.

Q3. Which Slowly Changing Dimension (SCD) strategy adds a new row with versioning to track history?

A. Type 1
B. Type 3
C. Full Load
D. Type 2

✅ Correct Answer: Option D

Explanation:
The correct answer is Option D: Type 2 adds a new row with versioning (effective dates) and is the most commonly asked SCD type.

Why other options are incorrect:
- Option A: Type 1 overwrites the old value, so no history is kept.
- Option B: Type 3 adds a new column for the previous value, which keeps only limited history.
- Option C: Full Load is a loading strategy in ETL, not an SCD type.

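Q4. In a star schema, dimension tables are:

A. Highly normalized (3NF), as in OLTP systems
B. Stored only in multidimensional cubes
C. Denormalized
D. Broken into sub-dimensions

✅ Correct Answer: Option C

Explanation:
The correct answer is Option C: in a star schema the dimension tables are denormalized. Breaking dimensions into sub-dimensions (Option D) describes the snowflake schema.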

Q5. Which step of the KDD process transforms data into forms appropriate for mining?

A. Data Integration
B. Data Transformation
C. Data Selection
D. Knowledge Presentation

✅ Correct Answer: Option B

Explanation:
The correct answer is Option B: Data Transformation converts data into appropriate forms before the mining step.

Why other options are incorrect:
- Option A: Data Integration combines data from multiple sources.
- Option C: Data Selection chooses the data relevant to the analysis.
- Option D: Knowledge Presentation visualizes and presents the mined knowledge.

Q6. Which step of the KDD process identifies the truly interesting patterns?

A. Data Mining
B. Knowledge Presentation
C. Pattern Evaluation
D. Data Integration

✅ Correct Answer: Option C

Explanation:
The correct answer is Option C: Pattern Evaluation identifies the truly interesting patterns among those produced by the Data Mining step.

Q7. In a snowflake schema, dimension tables are:

A. Broken into sub-dimensions (normalized)
B. Denormalized
C. Merged into the fact table
D. Stored only in multidimensional cubes

✅ Correct Answer: Option A

Explanation:
The correct answer is Option A: a snowflake schema normalizes its dimension tables by breaking them into sub-dimensions.

Why other options are incorrect:
- Option B: Denormalized dimension tables are the defining feature of the star schema.
- Option C: Dimension tables stay separate from the fact table and are referenced through foreign keys.
- Option D: Multidimensional cubes are the storage model of MOLAP, not a property of the snowflake schema.

Q8. In decimal scaling normalization, v' = v / 10^j, how is j chosen?

A. The number of attributes in the dataset
B. The largest integer such that max(|v'|) > 1
C. The smallest integer such that max(|v'|) < 1
D. The standard deviation of the attribute

✅ Correct Answer: Option C

Explanation:
The correct answer is Option C: j is the smallest integer such that max(|v'|) < 1.

Q9. Which of the following best describes real-world data before preprocessing?

A. Always stored in a star schema
B. Free of missing values
C. Already normalized to a fixed range
D. Often incomplete, noisy, and inconsistent

✅ Correct Answer: Option D

Explanation:
The correct answer is Option D: real-world data is often incomplete, noisy, and inconsistent, which is why preprocessing is needed to improve data quality and mining results.

Why other options are incorrect:
- Option A: A star schema is a warehouse design choice, not a property of raw data.
- Option B: Missing values are common, and handling them is part of data cleaning.
- Option C: Normalization is a transformation applied during preprocessing, not a property of raw data.

Q10. Which of the following best describes Data Mining?

A. The process of discovering meaningful patterns, correlations, anomalies, and trends from large datasets using techniques from statistics, machine learning, and database systems
B. The process of extracting, transforming, and loading data into a warehouse
C. A subset of a data warehouse focused on a specific department
D. Data about data

✅ Correct Answer: Option A

Explanation:
The correct answer is Option A: this is the definition of Data Mining given in Section 1. Option B describes ETL, Option C describes a data mart, and Option D describes metadata.

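Q11. Which of the following best describes an outlier?

A. A group of similar data points
B. A descriptive attribute in a dimension table
C. A data object that deviates significantly from the normal objects
D. A collection of items such as {bread, butter}

✅ Correct Answer: Option C

Explanation:
The correct answer is Option C: an outlier is a data object that deviates significantly from the normal objects. Option A describes a cluster and Option D describes an itemset.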
