Machine Learning Basics
by Dr. Jane Smith
Model Evaluation
Introduction
Model evaluation is the process of assessing how well a machine learning model performs on unseen data. It's crucial for understanding model performance, comparing different models, and ensuring the model will work well in production.
Why Model Evaluation Matters
- Performance Assessment: Understand how well the model works
- Model Selection: Choose the best model among alternatives
- Hyperparameter Tuning: Optimize model parameters
- Business Impact: Quantify the value of the model
- Risk Management: Identify potential failure modes
Train-Test Split
The simplest evaluation approach is splitting data into training and testing sets:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Split data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
Cross-Validation
Cross-validation provides more robust performance estimates by using multiple train-test splits.
K-Fold Cross-Validation
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)

print(f"Cross-validation scores: {scores}")
print(f"Mean score: {scores.mean():.3f}")
print(f"Standard deviation: {scores.std():.3f}")
Stratified K-Fold
Stratified K-Fold maintains class distribution in each fold:
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)
Uncertainty Principle
The Heisenberg uncertainty principle is one of the most fundamental principles in quantum mechanics, stating that there is a fundamental limit to the precision with which certain pairs of physical properties of a particle can be known simultaneously.
Mathematical Formulation
The uncertainty principle is expressed as:

Δx · Δp ≥ ℏ / 2

where:
- Δx is the uncertainty in position
- Δp is the uncertainty in momentum
- ℏ is the reduced Planck constant
Physical Interpretation
This principle means that:
- The more precisely we know a particle's position, the less precisely we can know its momentum
- The more precisely we know a particle's momentum, the less precisely we can know its position
- This is not a limitation of our measurement tools, but a fundamental property of nature
The uncertainty principle has profound implications for our understanding of reality and the nature of measurement at the quantum scale.
Key Evaluation Trade-offs
- Training performance vs. test performance
- Model complexity vs. interpretability
- Bias vs. variance
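The bias-variance trade-off can be made concrete by comparing training and validation scores as model complexity grows. Below is a minimal sketch using scikit-learn's validation_curve on the iris data from earlier (the choice of a decision tree and of max_depth as the complexity knob is illustrative, not from the original text):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Vary tree depth: shallow trees underfit (high bias),
# deep trees tend to overfit (high variance).
depths = [1, 2, 4, 8, 16]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42), X, y,
    param_name="max_depth", param_range=depths, cv=5
)

for d, tr, va in zip(depths,
                     train_scores.mean(axis=1),
                     val_scores.mean(axis=1)):
    print(f"max_depth={d:2d}  train={tr:.3f}  validation={va:.3f}")
```

A widening gap between training and validation scores at larger depths signals variance; low scores on both signal bias.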
Classification Metrics
Confusion Matrix
The confusion matrix shows True Positives, True Negatives, False Positives, and False Negatives:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Fit the model before predicting on the held-out set
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
Accuracy
Accuracy measures the proportion of correct predictions:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
Precision and Recall
Precision measures the accuracy of positive predictions: precision = TP / (TP + FP). Recall measures the ability to find all positive instances: recall = TP / (TP + FN).
from sklearn.metrics import precision_score, recall_score

precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')

print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
F1-Score
F1-Score is the harmonic mean of precision and recall: F1 = 2 · (precision · recall) / (precision + recall).
from sklearn.metrics import f1_score

f1 = f1_score(y_test, y_pred, average='weighted')
print(f"F1-Score: {f1:.3f}")
ROC and AUC
ROC curve shows the trade-off between true positive rate and false positive rate:
from sklearn.metrics import roc_curve, auc
import numpy as np

# Note: roc_curve applies to binary classification. The iris data used
# above has three classes, so this snippet assumes y contains only two
# classes (e.g. a binarized version of the problem).
y_scores = model.predict_proba(X_test)[:, 1]

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
roc_auc = auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2,
         label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
SQL Queries
This snippet demonstrates various SQL queries for data manipulation.
-- Create tables (departments first, since employees references it)
CREATE TABLE departments (
    id INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(50) NOT NULL,
    location VARCHAR(100),
    manager_id INT,
    budget DECIMAL(12, 2)
);

CREATE TABLE employees (
    id INT PRIMARY KEY AUTO_INCREMENT,
    first_name VARCHAR(50) NOT NULL,
    last_name VARCHAR(50) NOT NULL,
    email VARCHAR(100) UNIQUE NOT NULL,
    department_id INT,
    salary DECIMAL(10, 2),
    hire_date DATE,
    is_active BOOLEAN DEFAULT TRUE,
    FOREIGN KEY (department_id) REFERENCES departments(id)
);

CREATE TABLE projects (
    id INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(100) NOT NULL,
    start_date DATE,
    end_date DATE,
    budget DECIMAL(12, 2),
    status ENUM('Planning', 'In Progress', 'Completed', 'On Hold') DEFAULT 'Planning'
);

CREATE TABLE employee_projects (
    employee_id INT,
    project_id INT,
    role VARCHAR(50),
    hours_worked INT DEFAULT 0,
    PRIMARY KEY (employee_id, project_id),
    FOREIGN KEY (employee_id) REFERENCES employees(id),
    FOREIGN KEY (project_id) REFERENCES projects(id)
);

-- Insert sample data
INSERT INTO departments (name, location, budget) VALUES
('Engineering', 'San Francisco', 500000.00),
('Marketing', 'New York', 300000.00),
('Sales', 'Chicago', 400000.00),
('HR', 'Remote', 200000.00);

INSERT INTO employees (first_name, last_name, email, department_id, salary, hire_date) VALUES
('John', 'Doe', 'john.doe@company.com', 1, 95000.00, '2022-01-15'),
('Jane', 'Smith', 'jane.smith@company.com', 1, 105000.00, '2021-03-20'),
('Mike', 'Johnson', 'mike.johnson@company.com', 2, 75000.00, '2022-06-10'),
('Sarah', 'Williams', 'sarah.williams@company.com', 3, 85000.00, '2020-11-05'),
('David', 'Brown', 'david.brown@company.com', 4, 65000.00, '2023-02-28');

-- Basic SELECT queries
-- 1. Select all employees
SELECT * FROM employees;

-- 2. Select specific columns
SELECT first_name, last_name, email, salary FROM employees;

-- 3. Filter with WHERE clause
SELECT * FROM employees WHERE salary > 80000;

-- 4. Multiple conditions
SELECT * FROM employees
WHERE department_id = 1 AND salary >= 90000
  AND hire_date >= '2022-01-01';

-- JOIN queries
-- 5. Inner join with departments
SELECT
    e.first_name,
    e.last_name,
    e.salary,
    d.name AS department_name,
    d.location
FROM employees e
INNER JOIN departments d ON e.department_id = d.id;

-- 6. Left join to include all departments
SELECT
    d.name AS department_name,
    COUNT(e.id) AS employee_count,
    AVG(e.salary) AS average_salary
FROM departments d
LEFT JOIN employees e ON d.id = e.department_id
GROUP BY d.id, d.name;

-- Aggregate functions
-- 7. COUNT, AVG, SUM, MAX, MIN
SELECT
    COUNT(*) AS total_employees,
    AVG(salary) AS average_salary,
    MAX(salary) AS highest_salary,
    MIN(salary) AS lowest_salary,
    SUM(salary) AS total_payroll
FROM employees
WHERE is_active = TRUE;

-- 8. GROUP BY with HAVING
SELECT
    department_id,
    COUNT(*) AS employee_count,
    AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id
HAVING COUNT(*) > 1
ORDER BY avg_salary DESC;

-- Subqueries
-- 9. Subquery in WHERE clause
SELECT first_name, last_name, salary
FROM employees
WHERE salary > (
    SELECT AVG(salary)
    FROM employees
);

-- 10. Subquery in FROM clause
SELECT dept_name, avg_salary
FROM (
    SELECT
        d.name AS dept_name,
        AVG(e.salary) AS avg_salary,
        COUNT(e.id) AS emp_count
    FROM departments d
    LEFT JOIN employees e ON d.id = e.department_id
    GROUP BY d.id, d.name
) AS dept_stats
WHERE emp_count > 0;

-- Window functions
-- 11. ROW_NUMBER, RANK, DENSE_RANK
SELECT
    first_name,
    last_name,
    salary,
    ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
    RANK() OVER (ORDER BY salary DESC) AS rank_num,
    DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank
FROM employees
ORDER BY salary DESC;

-- 12. LAG, LEAD functions
SELECT
    first_name,
    last_name,
    salary,
    LAG(salary, 1, 0) OVER (ORDER BY salary) AS prev_salary,
    LEAD(salary, 1, 0) OVER (ORDER BY salary) AS next_salary
FROM employees
ORDER BY salary;

-- 13. Window functions with PARTITION BY
SELECT
    e.first_name,
    e.last_name,
    e.salary,
    d.name AS department,
    AVG(e.salary) OVER (PARTITION BY e.department_id) AS dept_avg_salary,
    e.salary - AVG(e.salary) OVER (PARTITION BY e.department_id) AS salary_diff_from_avg
FROM employees e
JOIN departments d ON e.department_id = d.id
ORDER BY d.name, e.salary DESC;

-- CTE (Common Table Expression)
-- 14. Simple CTE
WITH high_earners AS (
    SELECT first_name, last_name, salary, department_id
    FROM employees
    WHERE salary > 80000
)
SELECT
    he.first_name,
    he.last_name,
    he.salary,
    d.name AS department
FROM high_earners he
JOIN departments d ON he.department_id = d.id;

-- 15. Recursive CTE (salary is selected in both members so the
--     recursive member can reference eh.salary)
WITH RECURSIVE employee_hierarchy AS (
    SELECT id, first_name, last_name, department_id, salary, 0 AS level
    FROM employees
    WHERE department_id = 1 AND salary = (
        SELECT MAX(salary)
        FROM employees
        WHERE department_id = 1
    )

    UNION ALL

    SELECT
        e.id,
        e.first_name,
        e.last_name,
        e.department_id,
        e.salary,
        eh.level + 1
    FROM employees e
    JOIN employee_hierarchy eh ON e.department_id = eh.department_id
    WHERE e.salary < eh.salary + 20000 AND eh.level < 3
)
SELECT * FROM employee_hierarchy;
-- Calculate accuracy for a classification model
WITH predictions AS (
    SELECT
        actual_class,
        predicted_class,
        CASE WHEN actual_class = predicted_class THEN 1 ELSE 0 END AS correct
    FROM model_predictions
)
SELECT
    COUNT(*) AS total_predictions,
    SUM(correct) AS correct_predictions,
    SUM(correct) * 100.0 / COUNT(*) AS accuracy_percentage
FROM predictions;
Regression Metrics
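The snippets in this section assume y_test and y_pred come from a regression model rather than the classifier used earlier. A minimal setup might look like the following (the diabetes dataset and LinearRegression are illustrative choices, not from the original text):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# A regression problem to evaluate against
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a simple regressor and produce predictions for the metrics below
reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)
```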
Mean Absolute Error (MAE)
MAE measures the average absolute difference between predicted and actual values:
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_test, y_pred)
print(f"MAE: {mae:.3f}")
Mean Squared Error (MSE)
MSE measures the average squared difference between predicted and actual values, penalizing large errors more heavily:

from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_pred)
print(f"MSE: {mse:.3f}")
Root Mean Squared Error (RMSE)
RMSE is the square root of MSE, expressed in the same units as the target:

rmse = np.sqrt(mse)
print(f"RMSE: {rmse:.3f}")
R-squared (R²)
R² measures the proportion of variance in the target that the model explains:

from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2:.3f}")
Python Data Processing
This snippet demonstrates data processing using pandas and numpy.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Create sample data
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [25, 30, 35, 28, 32],
    'salary': [50000, 60000, 70000, 55000, 65000],
    'department': ['IT', 'HR', 'Finance', 'IT', 'Marketing']
}

# Create DataFrame
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Data preprocessing
# 1. Handle missing values
df.fillna({'salary': df['salary'].mean()}, inplace=True)

# 2. Group by department and calculate mean salary
#    (done before scaling so the averages are in original units)
dept_salary = df.groupby('department')['salary'].mean()
print("\nAverage salary by department:")
print(dept_salary)

# 3. Standardize numerical columns
scaler = StandardScaler()
numerical_cols = ['age', 'salary']
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

# 4. One-hot encode categorical columns
df_encoded = pd.get_dummies(df, columns=['department'])

print("\nProcessed DataFrame:")
print(df_encoded)
Evaluation with Preprocessing Pipelines
A Pipeline bundles preprocessing with the model so that, during cross-validation, scaling is fit only on the training folds:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Create a pipeline for preprocessing and modeling
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier())
])

# Fit and evaluate with proper preprocessing
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"CV scores with preprocessing: {scores.mean():.3f}")
Handling Class Imbalance
When classes are imbalanced, accuracy can be misleading. Use:
- Alternative Metrics: Precision, Recall, F1-Score
- Resampling: Oversample minority class or undersample majority class
- Class Weights: Weight classes differently in the loss function
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Calculate class weights
class_weights = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)

# Use class weights in model
model = RandomForestClassifier(
    class_weight=dict(enumerate(class_weights))
)
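The resampling option listed above can be sketched with sklearn.utils.resample. This is a minimal illustration on synthetic data (the 90/10 split and three features are made up for the example), not a production recipe:

```python
import numpy as np
from sklearn.utils import resample

# Synthetic imbalanced labels: 90 negatives, 10 positives
rng = np.random.RandomState(42)
X = rng.randn(100, 3)
y = np.array([0] * 90 + [1] * 10)

X_min, y_min = X[y == 1], y[y == 1]
X_maj, y_maj = X[y == 0], y[y == 0]

# Oversample the minority class (with replacement) to match the majority
X_min_up, y_min_up = resample(
    X_min, y_min, replace=True, n_samples=len(y_maj), random_state=42
)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])
print(np.bincount(y_bal))  # 90 of each class
```

Oversampling should be applied only to the training split, never before the train-test split, or the test set will contain duplicated training points.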
Hyperparameter Tuning
Grid Search
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=5,
    scoring='accuracy'
)

grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
Random Search
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_dist = {
    'n_estimators': randint(50, 200),
    'max_depth': [None] + list(range(10, 21)),
    'min_samples_split': randint(2, 11)
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions=param_dist,
    n_iter=50,
    cv=5,
    scoring='accuracy'
)

random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best score: {random_search.best_score_:.3f}")
Derivatives
The derivative of a function represents the rate of change of the function at any given point.
Definition
The derivative of a function f(x) with respect to x is defined as:

f'(x) = lim (h → 0) [f(x + h) − f(x)] / h

This limit represents the instantaneous rate of change of the function at point x.
Common Derivatives
Here are some common derivatives:
- Power rule: d/dx [xⁿ] = n·xⁿ⁻¹
- Exponential: d/dx [eˣ] = eˣ
- Trigonometric: d/dx [sin x] = cos x, d/dx [cos x] = −sin x
Applications
Derivatives have numerous applications in:
- Physics (velocity, acceleration)
- Economics (marginal cost, marginal revenue)
- Engineering (rates of change)
- Optimization problems
# A single gradient-descent update: move a parameter against its gradient
def gradient_descent_step(param, gradient, learning_rate=0.01):
    return param - learning_rate * gradient
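The limit definition above can also be approximated numerically with a finite difference. The helper below is a hypothetical sketch (not from the original text), using the central-difference formula:

```python
def numerical_derivative(f, x, h=1e-6):
    # Central difference approximation of f'(x):
    # [f(x + h) - f(x - h)] / (2h)
    return (f(x + h) - f(x - h)) / (2 * h)

# d/dx x^2 at x = 3 should be close to 6
print(numerical_derivative(lambda x: x * x, 3.0))
```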
Model Comparison
Statistical Tests
Use statistical tests to determine if performance differences are significant:
from scipy.stats import ttest_rel

# Compare two models using paired t-test
scores_model1 = [0.85, 0.87, 0.84, 0.86, 0.85]
scores_model2 = [0.83, 0.85, 0.82, 0.84, 0.83]

t_stat, p_value = ttest_rel(scores_model1, scores_model2)
print(f"P-value: {p_value:.3f}")

if p_value < 0.05:
    print("Models are significantly different")
else:
    print("No significant difference found")
Bayesian Model Comparison
Bayesian methods provide probabilistic model comparison:
import numpy as np
from scipy.stats import norm

def bayesian_model_comparison(scores1, scores2):
    # Simple Bayesian comparison using normal distributions
    mean1, std1 = np.mean(scores1), np.std(scores1)
    mean2, std2 = np.mean(scores2), np.std(scores2)

    # Calculate probability that model1 is better
    diff_mean = mean1 - mean2
    diff_std = np.sqrt(std1**2 + std2**2)

    prob_better = 1 - norm.cdf(0, diff_mean, diff_std)
    return prob_better

prob = bayesian_model_comparison([0.85, 0.87, 0.84], [0.83, 0.85, 0.82])
print(f"Probability model 1 is better: {prob:.3f}")