Machine Learning Basics

by Dr. Jane Smith

Model Evaluation

Introduction

Model evaluation is the process of assessing how well a machine learning model performs on unseen data. It's crucial for understanding model performance, comparing different models, and ensuring the model will work well in production.

Why Model Evaluation Matters

  1. Performance Assessment: Understand how well the model works
  2. Model Selection: Choose the best model among alternatives
  3. Hyperparameter Tuning: Optimize model parameters
  4. Business Impact: Quantify the value of the model
  5. Risk Management: Identify potential failure modes

Train-Test Split

The simplest evaluation approach is splitting data into training and testing sets:

PYTHON

    from sklearn.model_selection import train_test_split
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)

    # Split data: 80% training, 20% testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    print(f"Training set size: {len(X_train)}")
    print(f"Test set size: {len(X_test)}")

Cross-Validation

Cross-validation provides more robust performance estimates by using multiple train-test splits.

K-Fold Cross-Validation

PYTHON

    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import RandomForestClassifier

    model = RandomForestClassifier(n_estimators=100, random_state=42)

    # 5-fold cross-validation
    scores = cross_val_score(model, X, y, cv=5)

    print(f"Cross-validation scores: {scores}")
    print(f"Mean score: {scores.mean():.3f}")
    print(f"Standard deviation: {scores.std():.3f}")

Stratified K-Fold

Stratified K-Fold maintains class distribution in each fold:

PYTHON

    from sklearn.model_selection import StratifiedKFold

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=skf)

Uncertainty Principle


The Heisenberg uncertainty principle is one of the most fundamental principles in quantum mechanics, stating that there is a fundamental limit to the precision with which certain pairs of physical properties of a particle can be known simultaneously.

Mathematical Formulation

The uncertainty principle is expressed as:

    Δx · Δp ≥ ħ / 2

where:

  • Δx is the uncertainty in position
  • Δp is the uncertainty in momentum
  • ħ is the reduced Planck constant

Physical Interpretation

This principle means that:

  1. The more precisely we know a particle's position, the less precisely we can know its momentum
  2. The more precisely we know a particle's momentum, the less precisely we can know its position
  3. This is not a limitation of our measurement tools, but a fundamental property of nature

The uncertainty principle has profound implications for our understanding of reality and the nature of measurement at the quantum scale.
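As a quick order-of-magnitude illustration of the bound Δx · Δp ≥ ħ / 2 (the 1 Å confinement length here is a hypothetical choice; the constant is the CODATA value):

```python
# Minimum momentum uncertainty for a particle confined to roughly
# one atomic radius (1 angstrom) -- an illustrative, made-up scenario
hbar = 1.054571817e-34  # reduced Planck constant, J*s (CODATA)

delta_x = 1e-10                     # position uncertainty in metres
delta_p_min = hbar / (2 * delta_x)  # lower bound from delta_x * delta_p >= hbar / 2

print(f"Minimum delta_p: {delta_p_min:.3e} kg*m/s")
```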

The uncertainty principle reminds us that evaluation metrics have inherent limitations. Just as we cannot perfectly measure both position and momentum, we cannot perfectly measure both model performance and generalization. There's always a trade-off between:
  • Training performance vs. test performance
  • Model complexity vs. interpretability
  • Bias vs. variance
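The training-vs-test trade-off can be observed directly by letting model complexity grow and watching the two accuracies diverge. A minimal sketch using a decision tree on the iris data from earlier (the depth values are arbitrary illustrations):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Compare training vs. test accuracy as the tree is allowed to grow deeper
for depth in (1, 3, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train={tree.score(X_train, y_train):.3f} "
          f"test={tree.score(X_test, y_test):.3f}")
```

A fully grown tree memorizes the training set (training accuracy 1.0) while test accuracy lags behind it.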

Classification Metrics

Confusion Matrix

The confusion matrix shows True Positives, True Negatives, False Positives, and False Negatives:
PYTHON

    from sklearn.metrics import confusion_matrix
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Fit the model before predicting on the held-out test set
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)

    sns.heatmap(cm, annot=True, fmt='d')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()

Accuracy

Accuracy measures the proportion of correct predictions:
PYTHON

    from sklearn.metrics import accuracy_score

    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.3f}")

Precision and Recall

Precision measures the accuracy of positive predictions; recall measures the ability to find all positive instances:
PYTHON

    from sklearn.metrics import precision_score, recall_score

    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')

    print(f"Precision: {precision:.3f}")
    print(f"Recall: {recall:.3f}")

F1-Score

F1-Score is the harmonic mean of precision and recall:
PYTHON

    from sklearn.metrics import f1_score

    f1 = f1_score(y_test, y_pred, average='weighted')
    print(f"F1-Score: {f1:.3f}")
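For a single positive class, the identity F1 = 2PR / (P + R) can be checked directly (note that for averaged multiclass scores the identity no longer holds term-for-term). The labels below are made up purely to exercise the formula:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical binary labels, only to verify the harmonic-mean identity
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
f1_manual = 2 * p * r / (p + r)  # harmonic mean of precision and recall

print(f"manual:  {f1_manual:.3f}")
print(f"sklearn: {f1_score(y_true, y_pred):.3f}")
```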

ROC and AUC

ROC curve shows the trade-off between true positive rate and false positive rate:
PYTHON

    from sklearn.metrics import roc_curve, auc
    import numpy as np

    # roc_curve expects a binary target; for the three-class iris data,
    # score class 0 against the rest as an illustration
    y_test_bin = (y_test == 0).astype(int)
    y_scores = model.predict_proba(X_test)[:, 0]

    # Calculate ROC curve
    fpr, tpr, thresholds = roc_curve(y_test_bin, y_scores)
    roc_auc = auc(fpr, tpr)

    plt.figure()
    plt.plot(fpr, tpr, color='darkorange', lw=2,
             label=f'ROC curve (area = {roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
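For genuinely multiclass problems, scikit-learn's roc_auc_score can average one-vs-rest AUCs directly rather than binarizing by hand. A minimal sketch on the iris data used throughout (the forest settings mirror the earlier cross-validation example):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# One-vs-rest: average the binary AUC of each class against the rest
auc_ovr = roc_auc_score(y_test, model.predict_proba(X_test), multi_class='ovr')
print(f"Multiclass AUC (OvR): {auc_ovr:.3f}")
```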

SQL Queries


SQL Queries Example

This snippet demonstrates various SQL queries for data manipulation.

SQL

    -- Create tables (departments first, since employees references it)
    CREATE TABLE departments (
        id INT PRIMARY KEY AUTO_INCREMENT,
        name VARCHAR(50) NOT NULL,
        location VARCHAR(100),
        manager_id INT,
        budget DECIMAL(12, 2)
    );

    CREATE TABLE employees (
        id INT PRIMARY KEY AUTO_INCREMENT,
        first_name VARCHAR(50) NOT NULL,
        last_name VARCHAR(50) NOT NULL,
        email VARCHAR(100) UNIQUE NOT NULL,
        department_id INT,
        salary DECIMAL(10, 2),
        hire_date DATE,
        is_active BOOLEAN DEFAULT TRUE,
        FOREIGN KEY (department_id) REFERENCES departments(id)
    );

    CREATE TABLE projects (
        id INT PRIMARY KEY AUTO_INCREMENT,
        name VARCHAR(100) NOT NULL,
        start_date DATE,
        end_date DATE,
        budget DECIMAL(12, 2),
        status ENUM('Planning', 'In Progress', 'Completed', 'On Hold') DEFAULT 'Planning'
    );

    CREATE TABLE employee_projects (
        employee_id INT,
        project_id INT,
        role VARCHAR(50),
        hours_worked INT DEFAULT 0,
        PRIMARY KEY (employee_id, project_id),
        FOREIGN KEY (employee_id) REFERENCES employees(id),
        FOREIGN KEY (project_id) REFERENCES projects(id)
    );

    -- Insert sample data
    INSERT INTO departments (name, location, budget) VALUES
    ('Engineering', 'San Francisco', 500000.00),
    ('Marketing', 'New York', 300000.00),
    ('Sales', 'Chicago', 400000.00),
    ('HR', 'Remote', 200000.00);

    INSERT INTO employees (first_name, last_name, email, department_id, salary, hire_date) VALUES
    ('John', 'Doe', 'john.doe@company.com', 1, 95000.00, '2022-01-15'),
    ('Jane', 'Smith', 'jane.smith@company.com', 1, 105000.00, '2021-03-20'),
    ('Mike', 'Johnson', 'mike.johnson@company.com', 2, 75000.00, '2022-06-10'),
    ('Sarah', 'Williams', 'sarah.williams@company.com', 3, 85000.00, '2020-11-05'),
    ('David', 'Brown', 'david.brown@company.com', 4, 65000.00, '2023-02-28');

    -- Basic SELECT queries
    -- 1. Select all employees
    SELECT * FROM employees;

    -- 2. Select specific columns
    SELECT first_name, last_name, email, salary FROM employees;

    -- 3. Filter with WHERE clause
    SELECT * FROM employees WHERE salary > 80000;

    -- 4. Multiple conditions
    SELECT * FROM employees
    WHERE department_id = 1 AND salary >= 90000
    AND hire_date >= '2022-01-01';

    -- JOIN queries
    -- 5. Inner join with departments
    SELECT
        e.first_name,
        e.last_name,
        e.salary,
        d.name AS department_name,
        d.location
    FROM employees e
    INNER JOIN departments d ON e.department_id = d.id;

    -- 6. Left join to include all departments
    SELECT
        d.name AS department_name,
        COUNT(e.id) AS employee_count,
        AVG(e.salary) AS average_salary
    FROM departments d
    LEFT JOIN employees e ON d.id = e.department_id
    GROUP BY d.id, d.name;

    -- Aggregate functions
    -- 7. COUNT, AVG, SUM, MAX, MIN
    SELECT
        COUNT(*) AS total_employees,
        AVG(salary) AS average_salary,
        MAX(salary) AS highest_salary,
        MIN(salary) AS lowest_salary,
        SUM(salary) AS total_payroll
    FROM employees
    WHERE is_active = TRUE;

    -- 8. GROUP BY with HAVING
    SELECT
        department_id,
        COUNT(*) AS employee_count,
        AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department_id
    HAVING COUNT(*) > 1
    ORDER BY avg_salary DESC;

    -- Subqueries
    -- 9. Subquery in WHERE clause
    SELECT first_name, last_name, salary
    FROM employees
    WHERE salary > (
        SELECT AVG(salary)
        FROM employees
    );

    -- 10. Subquery in FROM clause
    SELECT dept_name, avg_salary
    FROM (
        SELECT
            d.name AS dept_name,
            AVG(e.salary) AS avg_salary,
            COUNT(e.id) AS emp_count
        FROM departments d
        LEFT JOIN employees e ON d.id = e.department_id
        GROUP BY d.id, d.name
    ) AS dept_stats
    WHERE emp_count > 0;

    -- Window functions
    -- 11. ROW_NUMBER, RANK, DENSE_RANK
    SELECT
        first_name,
        last_name,
        salary,
        ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
        RANK() OVER (ORDER BY salary DESC) AS rank_num,
        DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank
    FROM employees
    ORDER BY salary DESC;

    -- 12. LAG, LEAD functions
    SELECT
        first_name,
        last_name,
        salary,
        LAG(salary, 1, 0) OVER (ORDER BY salary) AS prev_salary,
        LEAD(salary, 1, 0) OVER (ORDER BY salary) AS next_salary
    FROM employees
    ORDER BY salary;

    -- 13. Window functions with PARTITION BY
    SELECT
        e.first_name,
        e.last_name,
        e.salary,
        d.name AS department,
        AVG(e.salary) OVER (PARTITION BY e.department_id) AS dept_avg_salary,
        e.salary - AVG(e.salary) OVER (PARTITION BY e.department_id) AS salary_diff_from_avg
    FROM employees e
    JOIN departments d ON e.department_id = d.id
    ORDER BY d.name, e.salary DESC;

    -- CTE (Common Table Expression)
    -- 14. Simple CTE
    WITH high_earners AS (
        SELECT first_name, last_name, salary, department_id
        FROM employees
        WHERE salary > 80000
    )
    SELECT
        he.first_name,
        he.last_name,
        he.salary,
        d.name AS department
    FROM high_earners he
    JOIN departments d ON he.department_id = d.id;

    -- 15. Recursive CTE
    -- (salary is carried through the CTE so the recursive step can reference it)
    WITH RECURSIVE employee_hierarchy AS (
        SELECT id, first_name, last_name, department_id, salary, 0 AS level
        FROM employees
        WHERE department_id = 1 AND salary = (
            SELECT MAX(salary)
            FROM employees
            WHERE department_id = 1
        )

        UNION ALL

        SELECT
            e.id,
            e.first_name,
            e.last_name,
            e.department_id,
            e.salary,
            eh.level + 1
        FROM employees e
        JOIN employee_hierarchy eh ON e.department_id = eh.department_id
        WHERE e.salary < eh.salary + 20000 AND eh.level < 3
    )
    SELECT * FROM employee_hierarchy;
SQL queries can be used to evaluate model performance on large datasets stored in databases:
SQL

    -- Calculate accuracy for a classification model
    WITH predictions AS (
        SELECT
            actual_class,
            predicted_class,
            CASE WHEN actual_class = predicted_class THEN 1 ELSE 0 END AS correct
        FROM model_predictions
    )
    SELECT
        COUNT(*) AS total_predictions,
        SUM(correct) AS correct_predictions,
        SUM(correct) * 100.0 / COUNT(*) AS accuracy_percentage
    FROM predictions;

Regression Metrics

Mean Absolute Error (MAE)

MAE measures the average absolute difference between predicted and actual values:
PYTHON

    from sklearn.metrics import mean_absolute_error

    mae = mean_absolute_error(y_test, y_pred)
    print(f"MAE: {mae:.3f}")
Mean Squared Error (MSE)

MSE measures the average squared difference:
PYTHON

    from sklearn.metrics import mean_squared_error

    mse = mean_squared_error(y_test, y_pred)
    print(f"MSE: {mse:.3f}")
Root Mean Squared Error (RMSE)

RMSE is the square root of MSE, in the same units as the target:
PYTHON

    import numpy as np

    rmse = np.sqrt(mse)
    print(f"RMSE: {rmse:.3f}")
R-squared (R²)

R-squared measures the proportion of variance explained by the model:
PYTHON

    from sklearn.metrics import r2_score

    r2 = r2_score(y_test, y_pred)
    print(f"R-squared: {r2:.3f}")
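The four metrics above come from regression predictions, so it helps to exercise them end-to-end on an actual regression fit. This sketch uses synthetic data and a LinearRegression model purely as illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (sizes and noise level are arbitrary)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE is back in the target's units
r2 = r2_score(y_test, y_pred)
print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```

Note that RMSE is always at least as large as MAE, since squaring weights large errors more heavily.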

Python Data Processing


Python Data Processing Example

This snippet demonstrates data processing using pandas and numpy.

PYTHON

    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Create sample data
    data = {
        'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
        'age': [25, 30, 35, 28, 32],
        'salary': [50000, 60000, 70000, 55000, 65000],
        'department': ['IT', 'HR', 'Finance', 'IT', 'Marketing']
    }

    # Create DataFrame
    df = pd.DataFrame(data)
    print("Original DataFrame:")
    print(df)

    # Data preprocessing
    # 1. Handle missing values
    df.fillna({'salary': df['salary'].mean()}, inplace=True)

    # 2. Group by department and calculate mean salary
    #    (done before scaling so the averages stay in dollars)
    dept_salary = df.groupby('department')['salary'].mean()
    print("\nAverage salary by department:")
    print(dept_salary)

    # 3. Standardize numerical columns
    scaler = StandardScaler()
    numerical_cols = ['age', 'salary']
    df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

    # 4. One-hot encode categorical columns
    df_encoded = pd.get_dummies(df, columns=['department'])

    print("\nProcessed DataFrame:")
    print(df_encoded)
Proper data preprocessing is essential for accurate model evaluation:
PYTHON

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Create a pipeline for preprocessing and modeling; fitting the scaler
    # inside each CV fold avoids leaking test-fold statistics
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', RandomForestClassifier())
    ])

    # Fit and evaluate with proper preprocessing
    scores = cross_val_score(pipeline, X, y, cv=5)
    print(f"CV scores with preprocessing: {scores.mean():.3f}")

Handling Class Imbalance

When classes are imbalanced, accuracy can be misleading. Use:
  1. Alternative Metrics: Precision, Recall, F1-Score
  2. Resampling: Oversample minority class or undersample majority class
  3. Class Weights: Weight classes differently in the loss function
PYTHON

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.utils.class_weight import compute_class_weight

    # Calculate class weights
    classes = np.unique(y_train)
    class_weights = compute_class_weight(
        class_weight='balanced',
        classes=classes,
        y=y_train
    )

    # Use class weights in the model, keyed by the actual class labels
    model = RandomForestClassifier(
        class_weight=dict(zip(classes, class_weights))
    )
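Resampling (option 2 in the list above) can be sketched without extra libraries using sklearn.utils.resample; the 90/10 class split and random features here are hypothetical:

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical imbalanced dataset: 90 negatives, 10 positives
rng = np.random.RandomState(42)
X = rng.randn(100, 3)
y = np.array([0] * 90 + [1] * 10)

X_maj, X_min = X[y == 0], X[y == 1]

# Randomly oversample the minority class up to the majority size
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))
print(np.bincount(y_bal))  # both classes now have 90 samples
```

Oversampling should be applied only to the training split, never before the train-test split, or the evaluation will be optimistic.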

Hyperparameter Tuning

Grid Search

Grid search exhaustively evaluates every combination in a parameter grid:

PYTHON

    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import RandomForestClassifier

    param_grid = {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5, 10]
    }

    grid_search = GridSearchCV(
        RandomForestClassifier(),
        param_grid,
        cv=5,
        scoring='accuracy'
    )

    grid_search.fit(X_train, y_train)
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best score: {grid_search.best_score_:.3f}")
Randomized Search

Randomized search samples a fixed number of parameter settings, which scales better than an exhaustive grid:

PYTHON

    from sklearn.model_selection import RandomizedSearchCV
    from scipy.stats import randint

    param_dist = {
        'n_estimators': randint(50, 200),
        'max_depth': [None] + list(range(10, 21)),
        'min_samples_split': randint(2, 11)
    }

    random_search = RandomizedSearchCV(
        RandomForestClassifier(),
        param_distributions=param_dist,
        n_iter=50,
        cv=5,
        scoring='accuracy'
    )

    random_search.fit(X_train, y_train)
    print(f"Best parameters: {random_search.best_params_}")

Derivatives


The derivative of a function represents the rate of change of the function at any given point.

Definition

The derivative of a function f(x) with respect to x is defined as:

    f'(x) = lim_{h → 0} [f(x + h) − f(x)] / h

This limit represents the instantaneous rate of change of the function at the point x.

Common Derivatives

Here are some common derivatives:

  1. Power rule: d/dx xⁿ = n·xⁿ⁻¹
  2. Exponential: d/dx eˣ = eˣ
  3. Trigonometric: d/dx sin x = cos x, d/dx cos x = −sin x
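The rules above can be checked numerically with a finite-difference approximation of the limit definition; the test point x = 2 is an arbitrary choice:

```python
# Central-difference check of the power rule: d/dx x^3 = 3x^2
def derivative(f, x, h=1e-6):
    # symmetric difference quotient approximates f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

approx = derivative(lambda x: x**3, 2.0)
exact = 3 * 2.0**2
print(approx, exact)  # both close to 12.0
```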

Applications

Derivatives have numerous applications in:

  • Physics (velocity, acceleration)
  • Economics (marginal cost, marginal revenue)
  • Engineering (rates of change)
  • Optimization problems

Gradient-based optimization uses derivatives to find minima: each update moves a parameter a small step against the gradient, and the same idea underlies gradient-based hyperparameter search:
PYTHON

    # One gradient-descent update: step the parameter against the gradient
    def gradient_step(param, gradient, learning_rate=0.01):
        return param - learning_rate * gradient

Model Comparison

Statistical Tests

Use statistical tests to determine if performance differences are significant:
PYTHON

    from scipy.stats import ttest_rel

    # Compare two models using a paired t-test on fold-wise scores
    scores_model1 = [0.85, 0.87, 0.84, 0.86, 0.85]
    scores_model2 = [0.83, 0.85, 0.82, 0.84, 0.83]

    t_stat, p_value = ttest_rel(scores_model1, scores_model2)
    print(f"P-value: {p_value:.3f}")

    if p_value < 0.05:
        print("Models are significantly different")
    else:
        print("No significant difference found")

Bayesian Model Comparison

Bayesian methods provide probabilistic model comparison:
PYTHON

    import numpy as np
    from scipy.stats import norm

    def bayesian_model_comparison(scores1, scores2):
        # Simple Bayesian comparison using normal distributions
        mean1, std1 = np.mean(scores1), np.std(scores1)
        mean2, std2 = np.mean(scores2), np.std(scores2)

        # Probability that model1's score exceeds model2's
        diff_mean = mean1 - mean2
        diff_std = np.sqrt(std1**2 + std2**2)

        prob_better = 1 - norm.cdf(0, diff_mean, diff_std)
        return prob_better

    prob = bayesian_model_comparison(scores_model1, scores_model2)
    print(f"P(model1 better than model2): {prob:.3f}")