Supervised Learning: Regression & Classification

Assignment Report — Introduction to Machine Learning (Nhập môn học máy)

Faculty of Information Technology, VNU-HCM University of Science

1. About The Project
2. Datasets
- Regression Dataset: Appliances Energy Prediction
- Classification Dataset: Room Occupancy Estimation
3. Part 1 — Regression
- Regression Models Implemented
- Regression Key Techniques
4. Part 2 — Classification
- Classification Models Implemented
- Classification Key Techniques
5. Repository Structure
6. Getting Started
7. Contributors
8. License & Acknowledgments

1. About The Project

This project is Assignment Report No. 1 for the Introduction to Machine Learning course. It provides from-scratch implementations of fundamental supervised learning algorithms, covering both regression and classification, applied to real-world IoT sensor datasets.

The project is organized into two interconnected parts:

Part 1: Regression	Part 2: Classification
Predict appliance energy consumption (Wh) from indoor sensor readings and outdoor weather data.	Estimate the number of occupants in a room (0–3 people) from multi-modal IoT sensor streams.
Implements OLS, Ridge, Lasso, Elastic Net, WLS, and Kernel methods from scratch.	Implements Perceptron, Logistic/Probit Regression, LDA, QDA, Naive Bayes, and Bayesian models from scratch.
Explores basis function expansion, regularization, and Bayesian linear regression.	Explores multiclass strategies (OvR, OvO), discriminant analysis, and probabilistic calibration.

Core Philosophy: Every algorithm in this project is implemented from scratch using NumPy only — no black-box sklearn estimators for the core models — to build a deep, principled understanding of machine learning fundamentals.

2. Datasets

Regression Dataset: Appliances Energy Prediction

Info	Details
Primary Source	Appliances Energy Prediction (Kaggle)
Original Custodian	UCI ML Repository (ID: 374)
Temporal Coverage	~4.5 months, sampled every 10 minutes
Dataset Size	19,735 records × 29 attributes
Target Variable	`Appliances` — appliance energy consumption (Wh)

The dataset records energy use of home appliances alongside temperature and humidity readings from a ZigBee wireless sensor network inside a low-energy house in Stambruges, near Mons, Belgium, merged with outdoor weather data from the nearest airport weather station. Two random noise variables (rv1, rv2) are included to test feature selection quality.

Key Feature Groups:

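Indoor Conditions: Temperature (T1–T5, T7–T9) and Relative Humidity (RH_1–RH_5, RH_7–RH_9) across 8 indoor zones (kitchen, living room, laundry, office, bathroom, ironing room, teen room, parents' room); T6 and RH_6 are measured outside the building and are grouped with the weather features.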
Outdoor / Weather: T_out, T6, RH_out, RH_6, Press_mm_hg, Windspeed, Visibility, Tdewpoint
Other Energy: lights (lighting energy in Wh)
Random Noise: rv1, rv2 (used for feature selection validation)
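
The feature groups above can be recovered from column names alone. The following is a minimal sketch, not the repository's select_feature_groups(); the function name group_columns and the returned dictionary layout are assumptions made for illustration only.

```python
import re

# Columns measured outside the house or taken from the weather station,
# as listed in the feature groups above.
OUTDOOR = {"T_out", "RH_out", "T6", "RH_6", "Press_mm_hg",
           "Windspeed", "Visibility", "Tdewpoint"}
NOISE = {"rv1", "rv2"}


def group_columns(columns):
    """Assign each column name to a feature group (hypothetical helper)."""
    groups = {"indoor": [], "outdoor": [], "lights": [], "noise": [], "other": []}
    for col in columns:
        if col in OUTDOOR:
            groups["outdoor"].append(col)
        elif re.fullmatch(r"(T|RH_)[1-9]", col):
            # Remaining T<k> / RH_<k> columns are indoor zones.
            groups["indoor"].append(col)
        elif col == "lights":
            groups["lights"].append(col)
        elif col in NOISE:
            groups["noise"].append(col)
        else:
            groups["other"].append(col)
    return groups


groups = group_columns(["T1", "RH_1", "T6", "RH_6", "T_out",
                        "lights", "rv1", "Appliances"])
```

Checking the outdoor set first keeps T6 and RH_6 out of the indoor group even though their names match the indoor pattern.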

Classification Dataset: Room Occupancy Estimation

Info	Details
Primary Source	Room Occupancy Estimation (Kaggle)
Original Custodian	UCI ML Repository (ID: 864)
Temporal Coverage	4 continuous days (starting 22/12/2017)
Environment	Standard lab room, 6m × 4.6m, equipped with 7 sensor nodes
Sampling Rate	Every 30 seconds
Dataset Size	10,129 records × 19 attributes
Target Variable	`Room_Occupancy_Count` — number of people present (0, 1, 2, or 3)

The dataset captures multi-modal environmental signals from IoT sensors for non-intrusive occupancy estimation — no cameras, no wearables.

Key Feature Groups:

Thermodynamic: S1_Temp, S2_Temp, S3_Temp, S4_Temp
Illuminance: S1_Light, S2_Light, S3_Light, S4_Light
Acoustic: S1_Sound, S2_Sound, S3_Sound, S4_Sound
Air Quality: S5_CO2, S5_CO2_Slope (rate of CO₂ change)
Motion (PIR): S6_PIR, S7_PIR

3. Part 1 — Regression

Regression Models Implemented

All regression models are implemented from scratch in code/Part1_Regression/models.py.

Model	Description
OLS (Ordinary Least Squares)	Closed-form normal equations: $\mathbf{w} = (\Phi^T\Phi)^{-1}\Phi^T\mathbf{y}$
Mini-Batch Gradient Descent	Iterative OLS optimization with Step Decay & Cosine Annealing learning rate schedules
WLS (Weighted Least Squares)	Observation-weighted regression for heteroscedastic noise; weights estimated from OLS residuals
Ridge Regression	L2-penalized closed-form solution; bias term excluded from regularization
Lasso Regression	L1-penalized via Coordinate Descent with warm-start for efficient $\lambda$ path search
Elastic Net	Combined L1+L2 penalty via Coordinate Descent
Kernel Ridge Regression (KRR)	Dual formulation with RBF and Polynomial kernels; solves $(K + \lambda I)\alpha = \mathbf{y}$
Gaussian Process Regression (GPR)	Full Bayesian non-parametric model; hyperparameters optimized via gradient ascent on log-marginal-likelihood (LML)
Bayesian Linear Regression	Analytically computes posterior $p(\mathbf{w} \mid \mathbf{t})$ and predictive distribution $\bar{f}^* \pm 2\sigma_N$ using Gaussian RBF basis functions
Robust Regression (IRLS + Huber)	Iteratively Reweighted Least Squares with Huber loss for outlier robustness
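
As an illustration of the closed-form entries in the table, here is a minimal NumPy sketch of ridge regression with the bias term excluded from the penalty. It is not the code in models.py; the function name ridge_fit and the convention that column 0 of the design matrix is the all-ones bias column are assumptions. Setting lam=0 reduces it to the OLS normal equations.

```python
import numpy as np


def ridge_fit(Phi, y, lam=1.0):
    """Solve (Phi^T Phi + lam * I') w = Phi^T y, where I' is the identity
    with its first diagonal entry zeroed so the bias is not penalized.
    Assumes column 0 of Phi is the all-ones bias column."""
    penalty = lam * np.eye(Phi.shape[1])
    penalty[0, 0] = 0.0  # leave the bias weight unregularized
    # np.linalg.solve is preferred over forming an explicit inverse.
    return np.linalg.solve(Phi.T @ Phi + penalty, Phi.T @ y)


# Synthetic data, for demonstration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Phi = np.column_stack([np.ones(len(X)), X])
true_w = np.array([5.0, 1.0, -2.0, 0.5])
y = Phi @ true_w + 0.1 * rng.normal(size=200)

w_ols = ridge_fit(Phi, y, lam=0.0)     # ordinary least squares
w_ridge = ridge_fit(Phi, y, lam=50.0)  # slopes shrink toward zero
```

With the intercept left unpenalized, increasing lam shrinks only the slope weights, so the fitted mean level of the target is not distorted by regularization.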

Regression Key Techniques

Basis Function Expansion: Four families implemented — Polynomial, RBF (Radial Basis Function), Sigmoid, and Natural Cubic Spline — all constructable via a unified make_design_matrix() API.
Feature Engineering & Selection:
- Interaction terms between feature groups (temperature × humidity, etc.)
- Forward Selection and Backward Elimination using ridge-penalized validation loss
- Feature group identification by name prefix (select_feature_groups())
Regularization & Hyperparameter Tuning:
- Time-Series K-Fold Cross-Validation (expanding window) to respect temporal ordering
- Warm-start grid search for Lasso $\lambda$
- Evidence Maximization (Empirical Bayes) for Bayesian hyperparameters $\alpha$ and $\beta$
Heteroscedasticity Testing: Breusch-Pagan test implemented from first principles.
Model Diagnostics:
- Bias–Variance decomposition via Bootstrap (200 resamples)
- Residual plots, Predicted vs. Actual plots, Learning curves
- Wilcoxon signed-rank test and paired t-test for statistical model comparison
Evaluation Metrics: MSE, RMSE, MAE, R²
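
The expanding-window cross-validation mentioned above can be sketched as follows. This is one common way to build such splits, not necessarily the one used in the notebooks; the function name expanding_window_splits and the equal-sized fold layout are assumptions.

```python
import numpy as np


def expanding_window_splits(n_samples, n_folds=5):
    """Yield (train_idx, val_idx) pairs in which every training index
    precedes every validation index, and the training window grows
    by one block per fold."""
    fold_size = n_samples // (n_folds + 1)
    if fold_size == 0:
        raise ValueError("not enough samples for the requested number of folds")
    for k in range(1, n_folds + 1):
        train_end = k * fold_size
        # The last fold absorbs any leftover samples at the end.
        val_end = n_samples if k == n_folds else train_end + fold_size
        yield np.arange(train_end), np.arange(train_end, val_end)


splits = list(expanding_window_splits(12, n_folds=3))
for train_idx, val_idx in splits:
    print(train_idx.tolist(), "->", val_idx.tolist())
```

Because validation blocks always lie after the training window, no future observation leaks into the model being evaluated, which matters for data sampled every 10 minutes.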

4. Part 2 — Classification

Classification Models Implemented

All classification models are implemented from scratch in code/Part2_Classification/models.py.

Model	Description
Perceptron	Original Rosenblatt Perceptron with error history tracking and early stopping
Logistic Regression	Gradient descent with L1/L2 regularization and class-balanced sample weighting
Binary Logistic Regression	Supports both GD and Newton-Raphson (Hessian-free Conjugate Gradient) optimization
Softmax Regression	Multiclass logistic regression with numerically stable Log-Sum-Exp trick
One-vs-Rest (OvR)	Meta-estimator wrapping any binary classifier for multiclass; probability normalization
One-vs-One (OvO)	Meta-estimator with majority voting over all pairwise classifiers
Linear Discriminant Analysis (LDA)	Pooled covariance, Fisher projection via generalized eigenvalue problem, `transform()` for dimensionality reduction
Quadratic Discriminant Analysis (QDA)	Class-specific covariance matrices, Mahalanobis distance scoring
Probit Regression	Vectorized gradient descent using Standard Normal CDF/PDF instead of sigmoid
Bayesian Logistic Regression	MAP estimation + Laplace approximation of posterior; predictive uncertainty via probit approximation
Kernel Logistic Regression	Dual formulation with RBF kernel trick for non-linearly separable problems (e.g., XOR)
Gaussian Naive Bayes	Class-conditional Gaussian assumption with log-likelihood for numerical stability
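
The Log-Sum-Exp trick listed for Softmax Regression can be illustrated with a short NumPy sketch. The function names below are assumptions for illustration and do not mirror the API in models.py.

```python
import numpy as np


def log_softmax(logits):
    """Row-wise log-softmax. Subtracting the row maximum before
    exponentiating prevents overflow without changing the result."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def softmax(logits):
    return np.exp(log_softmax(logits))


def cross_entropy(logits, labels):
    """Mean negative log-likelihood of integer class labels."""
    return -log_softmax(logits)[np.arange(len(labels)), labels].mean()


# Logits around 1000 would overflow a naive exp(); the shifted version is safe.
logits = np.array([[1000.0, 1001.0, 999.0],
                   [-5.0, 0.0, 5.0]])
probs = softmax(logits)
loss = cross_entropy(logits, np.array([1, 2]))
```

Computing the loss from log_softmax directly, rather than taking the log of softmax probabilities, also avoids log(0) when a class probability underflows.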

Classification Key Techniques

Fisher Ratio Feature Ranking: Vectorized computation of between-class vs. within-class variance ratio for feature importance, available in BaseDiscriminantAnalysis.
Multiclass Strategies: Full OvR and OvO implementations support any custom binary estimator instance, with proper probability normalization.
Noise Robustness Evaluation: inject_label_noise() flips a controlled proportion of training labels to benchmark model robustness under label corruption.
Bayesian Uncertainty Quantification: BayesianLogisticRegression computes the Laplace-approximated posterior covariance $\Sigma = A^{-1}$, enabling predictive standard deviation estimates $\sigma_a$ per data point.
Visualization Module (visualizations.py):
- Convergence comparison: GD vs. Newton-Raphson (loss vs. epochs + wall-clock time)
- LDA vs. QDA decision boundaries in 2D Fisher discriminant space with confidence contours
- Logistic vs. Probit KDE probability density comparison
- Decision boundaries in PCA space with confidence margins (P = 0.1 / 0.5 / 0.9)
- Bayesian decision boundary with uncertainty bands ($\mu_a \pm k \cdot \sigma_a$)
- Reliability diagrams (calibration curves)
- Noise robustness degradation curves
Evaluation Metrics: Accuracy, Precision, Recall, F1-Score, ROC-AUC, confusion matrix.
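
The Fisher ratio ranking described above can be sketched per feature as the between-class scatter divided by the within-class scatter. The exact definition used in BaseDiscriminantAnalysis is not shown here, so the formula, the function name fisher_ratio, and the eps guard are assumptions; this version loops over classes for readability rather than being fully vectorized.

```python
import numpy as np


def fisher_ratio(X, y, eps=1e-12):
    """Per-feature ratio of between-class to within-class scatter.
    Larger values indicate features that separate the classes better."""
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        between += len(Xc) * (mean_c - overall_mean) ** 2
        within += ((Xc - mean_c) ** 2).sum(axis=0)
    return between / (within + eps)  # eps guards against constant features


# Synthetic data, for demonstration only: one informative feature, one noise feature.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 100)
informative = y + 0.3 * rng.normal(size=300)
noise = rng.normal(size=300)
X = np.column_stack([informative, noise])

scores = fisher_ratio(X, y)
ranking = np.argsort(scores)[::-1]  # feature indices, best first
```

On this toy data the informative feature receives a much higher score than the pure-noise feature, which is the behaviour a ranking of sensor features relies on.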

5. Repository Structure

The repository is organized into four top-level areas: source code and notebooks (code/), datasets (data/), execution logs (logs/), and the LaTeX report (report/).

Supervised-Learning-Regression-Classification/
│
├── code/
│   ├── Part1_Regression/
│   │   ├── models.py                       # All regression algorithms & utilities
│   │   ├── visualizations.py               # Rich visualization utilities
│   │   ├── 01_eda_and_preprocessing.ipynb  # EDA, feature engineering & data pipeline
│   │   ├── 02_modeling.ipynb               # Core model training & evaluation
│   │   └── 03_advanced_bonus_experiments.ipynb  # Kernel Ridge, GPR, Bayesian LR, Bias-Variance
│   │
│   └── Part2_Classification/
│       ├── models.py                       # All classification algorithms
│       ├── visualizations.py               # Rich visualization utilities
│       ├── 01_eda_and_preprocessing.ipynb  # EDA, feature analysis & preprocessing pipeline
│       ├── 02_modeling.ipynb               # Core classifiers: Perceptron, LogReg, LDA, QDA, NB
│       └── 03_advanced_bonus_experiments.ipynb  # Bayesian LogReg, Kernel LR, noise robustness
│
├── data/
│   ├── raw/
│   │   ├── Energy_Use.csv                  # Raw appliance energy data
│   │   └── Room_Occupancy.csv              # Raw room occupancy data
│   ├── processed/
│   │   ├── Energy_Use_train.csv            # Regression training split
│   │   ├── Energy_Use_val.csv              # Regression validation split
│   │   ├── Energy_Use_test.csv             # Regression test split
│   │   ├── Room_Occupancy_train.csv        # Classification training split
│   │   ├── Room_Occupancy_val.csv          # Classification validation split
│   │   └── Room_Occupancy_test.csv         # Classification test split
│   └── README.md                           # Data catalog & provenance documentation
│
├── logs/
│   ├── logs_classification.json            # Execution logs and results for classification tests
│   └── logs_regression.json                # Execution logs and results for regression tests
│
├── report/
│   ├── README.md                           # Report build notes and usage
│   ├── output/                             # Compiled report outputs (PDF, artifacts)
│   └── src/
│       ├── main.tex                        # LaTeX main source
│       ├── chapters/                       # Chapter files
│       │   ├── 01_tong_quan.tex            # Chapter 1: Overview
│       │   ├── 02_hoi_quy.tex              # Chapter 2: Regression
│       │   ├── 03_phan_lop.tex             # Chapter 3: Classification
│       │   ├── 04_so_sanh.tex              # Chapter 4: Comparison
│       │   └── 05_tong_ket.tex             # Chapter 5: Conclusion
│       ├── graphics/                       # Figures and visual assets
│       ├── packages/
│       │   └── codespace.sty               # LaTeX styling package
│       ├── refs/
│       │   └── example.bib                 # BibTeX references
│       └── styles/
│           └── hcmus-report.cls            # HCMUS template class file
│
├── .gitignore
├── LICENSE
├── README.md
└── requirements.txt                        # Project dependencies

6. Getting Started

Prerequisites

Python: 3.9 or later
Package Manager: pip or conda
Git: to clone the repository

Installation

Step 1: Clone the repository

git clone https://github.com/ThanhChuong12/Supervised-Learning-Regression-Classification.git
cd Supervised-Learning-Regression-Classification

Step 2: Create a virtual environment

Option A — using venv (recommended for VS Code):

python -m venv venv

# Activate on Windows:
venv\Scripts\activate

# Activate on macOS/Linux:
source venv/bin/activate

Option B — using conda (recommended for Jupyter Lab):

conda create --name ml-env python=3.9
conda activate ml-env

Step 3: Install dependencies

pip install --upgrade pip
pip install -r requirements.txt

Step 4: Set up the data

The raw datasets must be downloaded from Kaggle and placed in data/raw/. You can use the Kaggle CLI:

# Regression dataset
kaggle datasets download -d sohommajumder21/appliances-energy-prediction-data-set -p ./data/raw/ --unzip

# Classification dataset
kaggle datasets download -d ruchikakumbhar/room-occupancy-estimation -p ./data/raw/ --unzip

After unzipping, rename the regression CSV to Energy_Use.csv and the classification CSV to Room_Occupancy.csv; these are the file names the preprocessing notebooks expect.

Note: Ensure your Kaggle API token (kaggle.json) is placed in ~/.kaggle/ (Linux/Mac) or C:\Users\<Username>\.kaggle\ (Windows).

Usage

The notebooks are designed to be executed in sequential order within each part to maintain the data pipeline:

Step 1: Launch Jupyter

jupyter notebook

(Or open the project folder in VS Code and select the virtual environment as the kernel.)

Step 2: Execute notebooks in order

For Part 1 — Regression:

code/Part1_Regression/01_eda_and_preprocessing.ipynb — EDA & feature engineering → produces processed CSVs
code/Part1_Regression/02_modeling.ipynb — Train & evaluate OLS, Ridge, Lasso, Elastic Net, WLS
code/Part1_Regression/03_advanced_bonus_experiments.ipynb — Kernel Ridge, GPR, Bayesian LR, Bias-Variance

For Part 2 — Classification:

code/Part2_Classification/01_eda_and_preprocessing.ipynb — EDA, Fisher ratio analysis & preprocessing
code/Part2_Classification/02_modeling.ipynb — Train & evaluate Perceptron, Logistic/Probit, LDA, QDA, GNB
code/Part2_Classification/03_advanced_bonus_experiments.ipynb — Bayesian LogReg, Kernel LR, OvR/OvO, noise robustness

Important: 01_eda_and_preprocessing.ipynb must be run first in each part, as it generates the train/val/test splits used by all subsequent notebooks.

7. Contributors

This project was developed by a team of 5 students from the Faculty of Information Technology, VNU-HCM University of Science.

Contributor	Student ID	Role	Main Responsibilities (Algorithms & Analysis)	Contribution
Lê Hà Thanh Chương	`23120195`	Project Lead	Project management; Classification EDA; LogReg (GD, IRLS, Multiclass), LDA/QDA; Probit Model, Laplace Approximation; VC Dimension analysis.	100%
Trà Văn Sỹ	`23120197`	ML Engineer	Perceptron, Regularized LogReg (L1/L2); Kernel LogReg, Gaussian Naive Bayes; Empirical VC Dimension & K-fold CV metrics visualization (ROC/AUC).	100%
Huỳnh Đức Thịnh	`23120199`	Data Scientist	Regression preprocessing & EDA; Linear Reg (OLS, Mini-batch GD); Full Bayesian Reg, Evidence Maximization; Learning curves & sensitivity analysis.	100%
Bùi Trung Hiếu	`23120257`	ML Researcher	Imputation & scaling; Ridge, Lasso, Feature Selection; Kernel Ridge Reg, GPR; Statistical tests (t-test, Wilcoxon, McNemar); Report aggregation.	100%
Lê Công Phúc	`23120330`	ML Analyst	Non-linear regression, Validation Curves, Ablation Study; Robust Regression; Bias-Variance Tradeoff, K-fold CV evaluation, decision boundaries.	100%

8. License & Acknowledgments

Academic Acknowledgments

This project is Assignment Report No. 1 for the Introduction to Machine Learning course at VNU-HCM University of Science.

The team sincerely thanks the course instructor for providing the theoretical foundations and guidance throughout the project:

Instructor: MSc. Lê Nhựt Nam

Data Attribution

Regression Dataset: Extracted from the study "Data driven prediction models of energy use of appliances in a low-energy house", published in Energy and Buildings (Vol. 140, April 2017) by researchers at the University of Mons (UMONS), Belgium. Hosted on UCI ML Repository (ID: 374). Licensed under CC BY 4.0.
Classification Dataset: Extracted from the study "Machine Learning-Based Occupancy Estimation Using Multivariate Sensor Nodes", published at IEEE Globecom Workshops 2018. Hosted on UCI ML Repository (ID: 864). Licensed under CC BY 4.0.

License

The source code of this project is distributed under the MIT License. You are free to use, copy, modify, merge, publish, and distribute this code, provided that the original copyright notice is retained. See the LICENSE file for full details.

Built with ❤️ by the ML Team | University of Science, VNU-HCM | 2026
