Machine Learning Basics
by Dr. Jane Smith
Supervised Learning
Introduction
Supervised learning is the most common type of machine learning, where the algorithm learns from labeled training data. The goal is to learn a mapping function that can predict the output for new, unseen data.
Types of Supervised Learning
- Classification: Predicting a discrete class label
- Regression: Predicting a continuous value
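The distinction can be sketched with scikit-learn on a toy dataset (the values here are invented for illustration): the same kind of input gets a discrete label in the classification case and a continuous value in the regression case.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

# Toy labeled data: one feature, five examples
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Classification: the target is a discrete class label (0 or 1)
y_class = np.array([0, 0, 0, 1, 1])
clf = LogisticRegression().fit(X, y_class)
pred_label = clf.predict([[4.5]])  # a class label

# Regression: the target is a continuous value
y_reg = np.array([1.1, 2.0, 2.9, 4.2, 5.1])
reg = LinearRegression().fit(X, y_reg)
pred_value = reg.predict([[4.5]])  # a real number
```

Both models learn a mapping from inputs to outputs; only the type of output differs.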
Classification
Classification algorithms are used when the output variable is a category, such as "spam" or "not spam", "disease" or "no disease".
Common Classification Algorithms:
- Logistic Regression
- Decision Trees
- Random Forest
- Support Vector Machines
- Neural Networks
- k-Nearest Neighbors (k-NN)
Example: Email Classification
Consider an email classification system that categorizes emails as "spam" or "not spam":
Features might include:
- Email length
- Number of capital letters
- Presence of certain keywords
- Sender's domain
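A minimal sketch of turning those features into numbers, using a hypothetical `email_features` helper with an invented keyword list and trusted-domain set:

```python
def email_features(text, sender_domain,
                   spam_keywords=("free", "winner", "urgent")):
    """Map a raw email to numeric features (illustrative only)."""
    return {
        "length": len(text),
        "capital_letters": sum(c.isupper() for c in text),
        "keyword_hits": sum(kw in text.lower() for kw in spam_keywords),
        "trusted_domain": int(sender_domain in {"company.com", "gmail.com"}),
    }

feats = email_features("FREE prize! Claim NOW", "promo.example")
```

A classifier would then be trained on rows of such feature dictionaries paired with "spam" / "not spam" labels.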
Python Data Processing Example
This snippet demonstrates data processing using pandas and numpy.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Create sample data
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [25, 30, 35, 28, 32],
    'salary': [50000, 60000, 70000, 55000, 65000],
    'department': ['IT', 'HR', 'Finance', 'IT', 'Marketing']
}

# Create DataFrame
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Data preprocessing
# 1. Handle missing values
df.fillna({'salary': df['salary'].mean()}, inplace=True)

# 2. Group by department and calculate mean salary
#    (done before scaling so the averages are in dollars, not z-scores)
dept_salary = df.groupby('department')['salary'].mean()
print("\nAverage salary by department:")
print(dept_salary)

# 3. Standardize numerical columns
scaler = StandardScaler()
numerical_cols = ['age', 'salary']
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

# 4. One-hot encode categorical columns
df_encoded = pd.get_dummies(df, columns=['department'])

print("\nProcessed DataFrame:")
print(df_encoded)
Typical preprocessing steps for supervised learning:
- Convert text to numerical features
- Handle missing values
- Normalize features
- Split data into training and testing sets

For text data specifically, common steps include:
- Tokenization (splitting text into words)
- Removing stop words
- Stemming or lemmatization
- Converting to numerical representations (TF-IDF, word embeddings)
Regression
Regression algorithms are used when the output variable is a real or continuous value, such as predicting house prices, stock prices, or temperature.
Common Regression Algorithms:
- Linear Regression
- Polynomial Regression
- Ridge Regression
- Lasso Regression
- Support Vector Regression (SVR)
- Decision Tree Regression
- Random Forest Regression
Example: House Price Prediction
For predicting house prices, features might include:
- Square footage
- Number of bedrooms
- Location (latitude, longitude)
- Age of the house
- Proximity to amenities
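A minimal sketch of such a model, fitting scikit-learn's LinearRegression on made-up listings (square footage, bedrooms, age in years):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical listings: [square_feet, bedrooms, age_years]
X = np.array([
    [1400, 3, 20],
    [1600, 3, 15],
    [1700, 4, 30],
    [1875, 4, 10],
    [1100, 2, 40],
])
y = np.array([245000, 312000, 279000, 308000, 199000])  # sale prices

model = LinearRegression().fit(X, y)
predicted = model.predict([[1500, 3, 25]])[0]  # a continuous price estimate
```

A real model would use far more data and features such as location and amenity proximity.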
Derivatives
The derivative of a function represents the rate of change of the function at any given point.
Definition
The derivative of a function f(x) with respect to x is defined as:

f'(x) = lim (h → 0) [f(x + h) - f(x)] / h

This limit represents the instantaneous rate of change of the function at the point x.
Common Derivatives
Here are some common derivatives:
- Power rule: d/dx [x^n] = n x^(n-1)
- Exponential: d/dx [e^x] = e^x
- Trigonometric: d/dx [sin x] = cos x, d/dx [cos x] = -sin x
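These rules can be checked numerically with a central-difference approximation (a small sketch, not part of any library API):

```python
import math

def central_diff(f, x, h=1e-5):
    """Approximate f'(x) with a symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Power rule: d/dx x^3 = 3x^2, so f'(2) should be 12
approx_power = central_diff(lambda x: x**3, 2.0)

# Exponential: d/dx e^x = e^x, so f'(1) should be e
approx_exp = central_diff(math.exp, 1.0)
```

The central difference has O(h^2) error, so both approximations agree with the analytic values to several decimal places.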
Applications
Derivatives have numerous applications in:
- Physics (velocity, acceleration)
- Economics (marginal cost, marginal revenue)
- Engineering (rates of change)
- Optimization problems
Model Evaluation Metrics
Classification Metrics:
- Accuracy
- Precision
- Recall
- F1 Score
- ROC-AUC
- Confusion Matrix
Regression Metrics:
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R-squared (R²)
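Most of these metrics are available in scikit-learn; a short sketch on made-up predictions:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, mean_squared_error)

# Classification metrics on hypothetical labels
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Regression metrics on hypothetical values
mse = mean_squared_error([3.0, 5.0, 2.5], [2.5, 5.0, 3.0])
rmse = mse ** 0.5  # RMSE is just the square root of MSE
```

Note the trade-off visible even here: precision is perfect (no false positives) while recall is lower (one positive was missed); F1 balances the two.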
Overfitting and Underfitting
Overfitting: The model performs well on training data but poorly on test data.
Underfitting: The model is too simple and fails to capture the patterns in either the training or the test data.
Techniques to prevent overfitting:
- Cross-validation
- Regularization (L1, L2)
- Dropout (for neural networks)
- Early stopping
- Data augmentation
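Two of these techniques, L2 regularization (Ridge) and cross-validation, can be combined in a few lines of scikit-learn (synthetic data with invented coefficients):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data: 100 samples, 5 features, known weights plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=100)

# Ridge adds an L2 penalty (alpha) that shrinks coefficients;
# 5-fold cross-validation scores the model on held-out folds
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5)
mean_r2 = scores.mean()  # average R-squared across folds
```

A large gap between training score and cross-validated score is a practical symptom of overfitting.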
SQL Queries Example
This snippet demonstrates various SQL queries for data manipulation.
-- Create tables
-- (departments is created first because employees has a foreign key to it)
CREATE TABLE departments (
    id INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(50) NOT NULL,
    location VARCHAR(100),
    manager_id INT,
    budget DECIMAL(12, 2)
);

CREATE TABLE employees (
    id INT PRIMARY KEY AUTO_INCREMENT,
    first_name VARCHAR(50) NOT NULL,
    last_name VARCHAR(50) NOT NULL,
    email VARCHAR(100) UNIQUE NOT NULL,
    department_id INT,
    salary DECIMAL(10, 2),
    hire_date DATE,
    is_active BOOLEAN DEFAULT TRUE,
    FOREIGN KEY (department_id) REFERENCES departments(id)
);

CREATE TABLE projects (
    id INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(100) NOT NULL,
    start_date DATE,
    end_date DATE,
    budget DECIMAL(12, 2),
    status ENUM('Planning', 'In Progress', 'Completed', 'On Hold') DEFAULT 'Planning'
);

CREATE TABLE employee_projects (
    employee_id INT,
    project_id INT,
    role VARCHAR(50),
    hours_worked INT DEFAULT 0,
    PRIMARY KEY (employee_id, project_id),
    FOREIGN KEY (employee_id) REFERENCES employees(id),
    FOREIGN KEY (project_id) REFERENCES projects(id)
);

-- Insert sample data
INSERT INTO departments (name, location, budget) VALUES
('Engineering', 'San Francisco', 500000.00),
('Marketing', 'New York', 300000.00),
('Sales', 'Chicago', 400000.00),
('HR', 'Remote', 200000.00);

INSERT INTO employees (first_name, last_name, email, department_id, salary, hire_date) VALUES
('John', 'Doe', 'john.doe@company.com', 1, 95000.00, '2022-01-15'),
('Jane', 'Smith', 'jane.smith@company.com', 1, 105000.00, '2021-03-20'),
('Mike', 'Johnson', 'mike.johnson@company.com', 2, 75000.00, '2022-06-10'),
('Sarah', 'Williams', 'sarah.williams@company.com', 3, 85000.00, '2020-11-05'),
('David', 'Brown', 'david.brown@company.com', 4, 65000.00, '2023-02-28');

-- Basic SELECT queries
-- 1. Select all employees
SELECT * FROM employees;

-- 2. Select specific columns
SELECT first_name, last_name, email, salary FROM employees;

-- 3. Filter with WHERE clause
SELECT * FROM employees WHERE salary > 80000;

-- 4. Multiple conditions
SELECT * FROM employees
WHERE department_id = 1 AND salary >= 90000
AND hire_date >= '2022-01-01';

-- JOIN queries
-- 5. Inner join with departments
SELECT
    e.first_name,
    e.last_name,
    e.salary,
    d.name AS department_name,
    d.location
FROM employees e
INNER JOIN departments d ON e.department_id = d.id;

-- 6. Left join to include all departments
SELECT
    d.name AS department_name,
    COUNT(e.id) AS employee_count,
    AVG(e.salary) AS average_salary
FROM departments d
LEFT JOIN employees e ON d.id = e.department_id
GROUP BY d.id, d.name;

-- Aggregate functions
-- 7. COUNT, AVG, SUM, MAX, MIN
SELECT
    COUNT(*) AS total_employees,
    AVG(salary) AS average_salary,
    MAX(salary) AS highest_salary,
    MIN(salary) AS lowest_salary,
    SUM(salary) AS total_payroll
FROM employees
WHERE is_active = TRUE;

-- 8. Group by with HAVING
SELECT
    department_id,
    COUNT(*) AS employee_count,
    AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id
HAVING COUNT(*) > 1
ORDER BY avg_salary DESC;

-- Subqueries
-- 9. Subquery in WHERE clause
SELECT first_name, last_name, salary
FROM employees
WHERE salary > (
    SELECT AVG(salary)
    FROM employees
);

-- 10. Subquery in FROM clause
SELECT dept_name, avg_salary
FROM (
    SELECT
        d.name AS dept_name,
        AVG(e.salary) AS avg_salary,
        COUNT(e.id) AS emp_count
    FROM departments d
    LEFT JOIN employees e ON d.id = e.department_id
    GROUP BY d.id, d.name
) AS dept_stats
WHERE emp_count > 0;

-- Window functions
-- 11. ROW_NUMBER, RANK, DENSE_RANK
SELECT
    first_name,
    last_name,
    salary,
    ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
    RANK() OVER (ORDER BY salary DESC) AS rank_num,
    DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank
FROM employees
ORDER BY salary DESC;

-- 12. LAG, LEAD functions
SELECT
    first_name,
    last_name,
    salary,
    LAG(salary, 1, 0) OVER (ORDER BY salary) AS prev_salary,
    LEAD(salary, 1, 0) OVER (ORDER BY salary) AS next_salary
FROM employees
ORDER BY salary;

-- 13. Window functions with PARTITION BY
SELECT
    e.first_name,
    e.last_name,
    e.salary,
    d.name AS department,
    AVG(e.salary) OVER (PARTITION BY e.department_id) AS dept_avg_salary,
    e.salary - AVG(e.salary) OVER (PARTITION BY e.department_id) AS salary_diff_from_avg
FROM employees e
JOIN departments d ON e.department_id = d.id
ORDER BY d.name, e.salary DESC;

-- CTE (Common Table Expression)
-- 14. Simple CTE
WITH high_earners AS (
    SELECT first_name, last_name, salary, department_id
    FROM employees
    WHERE salary > 80000
)
SELECT
    he.first_name,
    he.last_name,
    he.salary,
    d.name AS department
FROM high_earners he
JOIN departments d ON he.department_id = d.id;

-- 15. Recursive CTE
WITH RECURSIVE employee_hierarchy AS (
    -- anchor: the top earner in department 1
    -- (salary is selected so the recursive step can reference eh.salary)
    SELECT id, first_name, last_name, department_id, salary, 0 AS level
    FROM employees
    WHERE department_id = 1 AND salary = (
        SELECT MAX(salary)
        FROM employees
        WHERE department_id = 1
    )

    UNION ALL

    SELECT
        e.id,
        e.first_name,
        e.last_name,
        e.department_id,
        e.salary,
        eh.level + 1
    FROM employees e
    JOIN employee_hierarchy eh ON e.department_id = eh.department_id
    WHERE e.salary < eh.salary + 20000 AND eh.level < 3
)
SELECT * FROM employee_hierarchy;
The raw data for the house price prediction example above might be pulled with a query like this:

SELECT
    house_id,
    square_feet,
    bedrooms,
    bathrooms,
    price,
    year_built
FROM
    houses
WHERE
    price IS NOT NULL
    AND square_feet > 0
ORDER BY
    price DESC
LIMIT 10000;
Practical Considerations
When building supervised learning models:
- Feature Selection: Choose relevant features
- Data Quality: Ensure clean, consistent data
- Model Selection: Choose appropriate algorithm
- Hyperparameter Tuning: Optimize model parameters
- Cross-validation: Validate model performance
- Interpretability: Understand model decisions
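These considerations can be sketched end to end with scikit-learn: a held-out test set for honest evaluation and GridSearchCV for hyperparameter tuning (the synthetic dataset and the parameter grid are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic classification data stands in for a real, cleaned dataset
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Hold out a test set for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Hyperparameter tuning via cross-validation on the training set only
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
)
grid.fit(X_train, y_train)

# Evaluate the tuned model once on the untouched test set
test_accuracy = grid.score(X_test, y_test)
```

Keeping the test set out of the tuning loop is what makes the final accuracy an honest estimate of performance on unseen data.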