Machine Learning Basics

by Dr. Jane Smith

Supervised Learning

Introduction

Supervised learning is the most common type of machine learning, where the algorithm learns from labeled training data. The goal is to learn a mapping function that can predict the output for new, unseen data.

Types of Supervised Learning

  1. Classification: Predicting a discrete class label
  2. Regression: Predicting a continuous value

Classification

Classification algorithms are used when the output variable is a category, such as "spam" or "not spam", "disease" or "no disease".

Common Classification Algorithms:

  • Logistic Regression
  • Decision Trees
  • Random Forest
  • Support Vector Machines
  • Neural Networks
  • k-Nearest Neighbors (k-NN)
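As a minimal illustration of the classification workflow (using scikit-learn and synthetic data, not a dataset from this chapter), a classifier can be trained and scored like this:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic toy dataset: 200 samples, 4 features, 2 classes
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit a logistic regression classifier and measure test accuracy
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.2f}")
```

The same fit/predict/score pattern applies to any of the algorithms listed above.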

Example: Email Classification

Consider an email classification system that categorizes emails as "spam" or "not spam":

Features might include:

  • Email length
  • Number of capital letters
  • Presence of certain keywords
  • Sender's domain
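A feature extractor for these four features might look like the sketch below; the keyword list and the domain check are purely illustrative, not part of any real spam filter:

```python
import re

def email_features(email_text: str, sender_domain: str) -> list:
    """Map one email to a numeric feature vector (hypothetical feature set)."""
    spam_keywords = {"free", "winner", "urgent", "prize"}  # illustrative list
    words = re.findall(r"[a-zA-Z']+", email_text.lower())
    return [
        len(email_text),                              # email length
        sum(1 for c in email_text if c.isupper()),    # number of capital letters
        sum(1 for w in words if w in spam_keywords),  # spam-keyword hits
        1 if sender_domain.endswith(".com") else 0,   # crude domain flag
    ]

print(email_features("FREE prize! Claim your WINNER status now", "mail.example.biz"))
# → [40, 11, 3, 0]
```

Each email becomes a fixed-length numeric vector that a classifier can consume.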

Python Data Processing Example

This snippet demonstrates data processing using pandas and numpy.

PYTHON
    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Create sample data
    data = {
        'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
        'age': [25, 30, 35, 28, 32],
        'salary': [50000, 60000, 70000, 55000, 65000],
        'department': ['IT', 'HR', 'Finance', 'IT', 'Marketing']
    }

    # Create DataFrame
    df = pd.DataFrame(data)
    print("Original DataFrame:")
    print(df)

    # Data preprocessing
    # 1. Handle missing values
    df.fillna({'salary': df['salary'].mean()}, inplace=True)

    # 2. Group by department and calculate mean salary
    #    (done before scaling so the averages stay in dollars)
    dept_salary = df.groupby('department')['salary'].mean()
    print("\nAverage salary by department:")
    print(dept_salary)

    # 3. Standardize numerical columns
    scaler = StandardScaler()
    numerical_cols = ['age', 'salary']
    df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

    # 4. One-hot encode categorical columns
    df_encoded = pd.get_dummies(df, columns=['department'])

    print("\nProcessed DataFrame:")
    print(df_encoded)
Data preprocessing is crucial for email classification. We need to:

  • Convert text to numerical features
  • Handle missing values
  • Normalize features
  • Split data into training and testing sets

Text preprocessing of email data involves:

  • Tokenization (splitting text into words)
  • Removing stop words
  • Stemming or lemmatization
  • Converting to numerical representations (TF-IDF, word embeddings)
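These steps can be sketched with scikit-learn's TfidfVectorizer, which handles tokenization, stop-word removal, and TF-IDF weighting in one pass (the mini-corpus and labels below are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical mini-corpus; labels: 1 = spam, 0 = not spam
emails = [
    "Win a free prize now",
    "Meeting rescheduled to Monday",
    "Free money, claim your prize",
    "Project update attached",
]
labels = [1, 0, 1, 0]

# Tokenize, drop English stop words, and compute TF-IDF weights
vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
X = vectorizer.fit_transform(emails)

# Split the numerical features into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0
)
print(X.shape)  # (number of documents, vocabulary size)
```

The resulting sparse matrix can be fed directly to any of the classifiers listed earlier.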

Regression

Regression algorithms are used when the output variable is a real or continuous value, such as predicting house prices, stock prices, or temperature.

Common Regression Algorithms:
  • Linear Regression
  • Polynomial Regression
  • Ridge Regression
  • Lasso Regression
  • Support Vector Regression (SVR)
  • Decision Tree Regression
  • Random Forest Regression

Example: House Price Prediction

For predicting house prices, features might include:
  • Square footage
  • Number of bedrooms
  • Location (latitude, longitude)
  • Age of the house
  • Proximity to amenities
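A minimal regression sketch on made-up house data (the figures below are illustrative, not real listings):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: [square_feet, bedrooms, age_years]
X = np.array([
    [1400, 3, 20],
    [1600, 3, 15],
    [1700, 4, 10],
    [1875, 4, 5],
    [1100, 2, 30],
])
y = np.array([245000, 312000, 329000, 380000, 199000])  # prices in USD

# Fit ordinary least squares and predict a new house's price
model = LinearRegression().fit(X, y)
predicted = model.predict([[1500, 3, 12]])
print(f"Predicted price: ${predicted[0]:,.0f}")
```

Unlike the classifier, the output here is a continuous dollar value rather than a category.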

Derivatives

The derivative of a function represents the rate of change of the function at any given point.

Definition

The derivative of a function f(x) with respect to x is defined as:

f'(x) = lim[h → 0] (f(x + h) − f(x)) / h

This limit represents the instantaneous rate of change of the function at the point x.

Common Derivatives

Here are some common derivatives:

  1. Power rule: d/dx x^n = n·x^(n−1)
  2. Exponential: d/dx e^x = e^x
  3. Trigonometric: d/dx sin(x) = cos(x), d/dx cos(x) = −sin(x)

Applications

Derivatives have numerous applications in:

  • Physics (velocity, acceleration)
  • Economics (marginal cost, marginal revenue)
  • Engineering (rates of change)
  • Optimization problems

Understanding derivatives is crucial for optimizing regression models. When using gradient descent to minimize the mean squared error in linear regression, we need to compute derivatives of the loss function with respect to the model parameters. The derivative tells us the direction of steepest ascent, so we move in the opposite direction to minimize the error.
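This idea can be made concrete with a hand-rolled gradient descent for simple linear regression. The gradient expressions below follow from differentiating the mean squared error with respect to the weight w and the bias b:

```python
import numpy as np

# Fit y ≈ w*x + b by gradient descent on the mean squared error
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0  # true relationship: w = 2, b = 1

w, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    error = (w * x + b) - y
    grad_w = 2 * np.mean(error * x)  # dMSE/dw
    grad_b = 2 * np.mean(error)      # dMSE/db
    w -= lr * grad_w                 # step against the gradient
    b -= lr * grad_b                 # (steepest-descent direction)

print(f"w ≈ {w:.3f}, b ≈ {b:.3f}")  # approaches w = 2, b = 1
```

Each iteration moves the parameters opposite the gradient, which is exactly the "direction of steepest ascent" argument above.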

Model Evaluation Metrics

Classification Metrics:

  • Accuracy
  • Precision
  • Recall
  • F1 Score
  • ROC-AUC
  • Confusion Matrix

Regression Metrics:

  • Mean Absolute Error (MAE)
  • Mean Squared Error (MSE)
  • Root Mean Squared Error (RMSE)
  • R-squared (R²)
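scikit-learn's metrics module computes all of these; here is a quick sketch on hypothetical predictions:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_absolute_error,
                             mean_squared_error, r2_score)

# Classification example (hypothetical labels and predictions)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))

# Regression example (hypothetical targets and predictions)
y_true_r = [3.0, 5.0, 2.5, 7.0]
y_pred_r = [2.8, 5.3, 2.9, 6.8]
mse = mean_squared_error(y_true_r, y_pred_r)
print("MAE: ", mean_absolute_error(y_true_r, y_pred_r))
print("MSE: ", mse)
print("RMSE:", np.sqrt(mse))
print("R²:  ", r2_score(y_true_r, y_pred_r))
```

In the classification example, precision is 1.0 (no false positives) while recall is 0.75 (one positive was missed), which illustrates why a single metric rarely tells the whole story.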

Overfitting and Underfitting

Overfitting: Model performs well on training data but poorly on test data.

Underfitting: Model is too simple and captures neither training nor test data patterns.

Techniques to prevent overfitting:
  • Cross-validation
  • Regularization (L1, L2)
  • Dropout (for neural networks)
  • Early stopping
  • Data augmentation
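Cross-validation and L2 regularization can be combined in a few lines with scikit-learn; the dataset below is synthetic:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data (a stand-in for a real dataset)
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Ridge applies L2 regularization; alpha controls its strength
model = Ridge(alpha=1.0)

# 5-fold cross-validation estimates generalization, not just training fit
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("R² per fold:", np.round(scores, 3))
print("Mean R²:", round(scores.mean(), 3))
```

If the per-fold scores were high on some folds and low on others, or far below the training score, that would be a sign of overfitting.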

SQL Queries Example

This snippet demonstrates various SQL queries for data manipulation.

SQL
    -- Create tables
    -- (departments is created first so the employees foreign key can reference it)
    CREATE TABLE departments (
        id INT PRIMARY KEY AUTO_INCREMENT,
        name VARCHAR(50) NOT NULL,
        location VARCHAR(100),
        manager_id INT,
        budget DECIMAL(12, 2)
    );

    CREATE TABLE employees (
        id INT PRIMARY KEY AUTO_INCREMENT,
        first_name VARCHAR(50) NOT NULL,
        last_name VARCHAR(50) NOT NULL,
        email VARCHAR(100) UNIQUE NOT NULL,
        department_id INT,
        salary DECIMAL(10, 2),
        hire_date DATE,
        is_active BOOLEAN DEFAULT TRUE,
        FOREIGN KEY (department_id) REFERENCES departments(id)
    );

    CREATE TABLE projects (
        id INT PRIMARY KEY AUTO_INCREMENT,
        name VARCHAR(100) NOT NULL,
        start_date DATE,
        end_date DATE,
        budget DECIMAL(12, 2),
        status ENUM('Planning', 'In Progress', 'Completed', 'On Hold') DEFAULT 'Planning'
    );

    CREATE TABLE employee_projects (
        employee_id INT,
        project_id INT,
        role VARCHAR(50),
        hours_worked INT DEFAULT 0,
        PRIMARY KEY (employee_id, project_id),
        FOREIGN KEY (employee_id) REFERENCES employees(id),
        FOREIGN KEY (project_id) REFERENCES projects(id)
    );

    -- Insert sample data
    INSERT INTO departments (name, location, budget) VALUES
    ('Engineering', 'San Francisco', 500000.00),
    ('Marketing', 'New York', 300000.00),
    ('Sales', 'Chicago', 400000.00),
    ('HR', 'Remote', 200000.00);

    INSERT INTO employees (first_name, last_name, email, department_id, salary, hire_date) VALUES
    ('John', 'Doe', 'john.doe@company.com', 1, 95000.00, '2022-01-15'),
    ('Jane', 'Smith', 'jane.smith@company.com', 1, 105000.00, '2021-03-20'),
    ('Mike', 'Johnson', 'mike.johnson@company.com', 2, 75000.00, '2022-06-10'),
    ('Sarah', 'Williams', 'sarah.williams@company.com', 3, 85000.00, '2020-11-05'),
    ('David', 'Brown', 'david.brown@company.com', 4, 65000.00, '2023-02-28');

    -- Basic SELECT queries
    -- 1. Select all employees
    SELECT * FROM employees;

    -- 2. Select specific columns
    SELECT first_name, last_name, email, salary FROM employees;

    -- 3. Filter with WHERE clause
    SELECT * FROM employees WHERE salary > 80000;

    -- 4. Multiple conditions
    SELECT * FROM employees
    WHERE department_id = 1 AND salary >= 90000
    AND hire_date >= '2022-01-01';

    -- JOIN queries
    -- 5. Inner join with departments
    SELECT
        e.first_name,
        e.last_name,
        e.salary,
        d.name AS department_name,
        d.location
    FROM employees e
    INNER JOIN departments d ON e.department_id = d.id;

    -- 6. Left join to include all departments
    SELECT
        d.name AS department_name,
        COUNT(e.id) AS employee_count,
        AVG(e.salary) AS average_salary
    FROM departments d
    LEFT JOIN employees e ON d.id = e.department_id
    GROUP BY d.id, d.name;

    -- Aggregate functions
    -- 7. COUNT, AVG, SUM, MAX, MIN
    SELECT
        COUNT(*) AS total_employees,
        AVG(salary) AS average_salary,
        MAX(salary) AS highest_salary,
        MIN(salary) AS lowest_salary,
        SUM(salary) AS total_payroll
    FROM employees
    WHERE is_active = TRUE;

    -- 8. Group by with HAVING
    SELECT
        department_id,
        COUNT(*) AS employee_count,
        AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department_id
    HAVING COUNT(*) > 1
    ORDER BY avg_salary DESC;

    -- Subqueries
    -- 9. Subquery in WHERE clause
    SELECT first_name, last_name, salary
    FROM employees
    WHERE salary > (
        SELECT AVG(salary)
        FROM employees
    );

    -- 10. Subquery in FROM clause
    SELECT dept_name, avg_salary
    FROM (
        SELECT
            d.name AS dept_name,
            AVG(e.salary) AS avg_salary,
            COUNT(e.id) AS emp_count
        FROM departments d
        LEFT JOIN employees e ON d.id = e.department_id
        GROUP BY d.id, d.name
    ) AS dept_stats
    WHERE emp_count > 0;

    -- Window functions
    -- 11. ROW_NUMBER, RANK, DENSE_RANK
    SELECT
        first_name,
        last_name,
        salary,
        ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
        RANK() OVER (ORDER BY salary DESC) AS rank_num,
        DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank
    FROM employees
    ORDER BY salary DESC;

    -- 12. LAG, LEAD functions
    SELECT
        first_name,
        last_name,
        salary,
        LAG(salary, 1, 0) OVER (ORDER BY salary) AS prev_salary,
        LEAD(salary, 1, 0) OVER (ORDER BY salary) AS next_salary
    FROM employees
    ORDER BY salary;

    -- 13. Window functions with PARTITION BY
    SELECT
        e.first_name,
        e.last_name,
        e.salary,
        d.name AS department,
        AVG(e.salary) OVER (PARTITION BY e.department_id) AS dept_avg_salary,
        e.salary - AVG(e.salary) OVER (PARTITION BY e.department_id) AS salary_diff_from_avg
    FROM employees e
    JOIN departments d ON e.department_id = d.id
    ORDER BY d.name, e.salary DESC;

    -- CTE (Common Table Expression)
    -- 14. Simple CTE
    WITH high_earners AS (
        SELECT first_name, last_name, salary, department_id
        FROM employees
        WHERE salary > 80000
    )
    SELECT
        he.first_name,
        he.last_name,
        he.salary,
        d.name AS department
    FROM high_earners he
    JOIN departments d ON he.department_id = d.id;

    -- 15. Recursive CTE
    -- (salary must be selected in the anchor so the recursive step can reference eh.salary)
    WITH RECURSIVE employee_hierarchy AS (
        SELECT id, first_name, last_name, department_id, salary, 0 AS level
        FROM employees
        WHERE department_id = 1 AND salary = (
            SELECT MAX(salary)
            FROM employees
            WHERE department_id = 1
        )

        UNION ALL

        SELECT
            e.id,
            e.first_name,
            e.last_name,
            e.department_id,
            e.salary,
            eh.level + 1
        FROM employees e
        JOIN employee_hierarchy eh ON e.department_id = eh.department_id
        WHERE e.salary < eh.salary + 20000 AND eh.level < 3
    )
    SELECT * FROM employee_hierarchy;
When working with large datasets, SQL queries are essential for data extraction and preprocessing. For example, to extract training data from a database:
SQL

    SELECT
        house_id,
        square_feet,
        bedrooms,
        bathrooms,
        price,
        year_built
    FROM
        houses
    WHERE
        price IS NOT NULL
        AND square_feet > 0
    ORDER BY
        price DESC
    LIMIT 10000;
This query extracts relevant features for house price prediction, ensuring data quality by filtering out invalid entries.
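To run such a query from Python, the result can be loaded straight into a pandas DataFrame. The sketch below uses an in-memory SQLite table as a stand-in for a real database, with a simplified, hypothetical houses schema:

```python
import sqlite3
import pandas as pd

# In-memory SQLite database with a hypothetical houses table
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE houses (
    house_id INTEGER PRIMARY KEY,
    square_feet INTEGER,
    bedrooms INTEGER,
    price REAL)""")
conn.executemany(
    "INSERT INTO houses (square_feet, bedrooms, price) VALUES (?, ?, ?)",
    [(1400, 3, 245000), (1600, 3, 312000), (0, 2, 150000), (1875, 4, None)],
)

# Same filtering idea as the query above: drop invalid rows at extraction time
query = """
    SELECT house_id, square_feet, bedrooms, price
    FROM houses
    WHERE price IS NOT NULL AND square_feet > 0
    ORDER BY price DESC
"""
df = pd.read_sql_query(query, conn)
print(df)  # only the two valid rows survive the filters
```

Filtering in SQL keeps the invalid rows out of the training pipeline entirely, rather than cleaning them up after loading.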

Practical Considerations

When building supervised learning models:
  1. Feature Selection: Choose relevant features
  2. Data Quality: Ensure clean, consistent data
  3. Model Selection: Choose appropriate algorithm
  4. Hyperparameter Tuning: Optimize model parameters
  5. Cross-validation: Validate model performance
  6. Interpretability: Understand model decisions
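Steps 3–5 are often combined in practice via a pipeline with grid search; here is a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic classification data standing in for a real dataset
X, y = make_classification(n_samples=200, n_features=6, random_state=1)

# Pipeline bundles preprocessing and the model, so cross-validation
# refits the scaler inside each fold (no data leakage)
pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])

# Grid search tunes the SVM's C hyperparameter with 5-fold cross-validation
grid = GridSearchCV(pipe, {"svm__C": [0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print("Best C:", grid.best_params_["svm__C"])
print("Best CV accuracy:", round(grid.best_score_, 3))
```

Keeping the scaler inside the pipeline is the key design choice: scaling before splitting would let test-fold statistics leak into training.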

Conclusion

Supervised learning forms the foundation of many real-world ML applications. By understanding both classification and regression, along with proper evaluation techniques, we can build effective predictive models. In the next chapter, we'll explore unsupervised learning, where we work with unlabeled data to discover hidden patterns and structures.