Machine Learning Basics
by Dr. Jane Smith
Supervised Learning
Introduction
Supervised learning is the most common type of machine learning, where the algorithm learns from labeled training data. The goal is to learn a mapping function that can predict the output for new, unseen data.
Types of Supervised Learning
- Classification: Predicting a discrete class label
- Regression: Predicting a continuous value
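The distinction can be sketched with scikit-learn on a toy dataset (the values here are invented for illustration): the same kind of input gets a discrete label in the classification case and a continuous value in the regression case.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

# Toy labeled data: one feature, five examples
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Classification: the target is a discrete class label (0 or 1)
y_class = np.array([0, 0, 0, 1, 1])
clf = LogisticRegression().fit(X, y_class)
pred_label = clf.predict([[4.5]])  # a class label

# Regression: the target is a continuous value
y_reg = np.array([1.1, 2.0, 2.9, 4.2, 5.1])
reg = LinearRegression().fit(X, y_reg)
pred_value = reg.predict([[4.5]])  # a real number
```

Both models learn a mapping from inputs to outputs; only the type of output differs.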
Classification
Classification algorithms are used when the output variable is a category, such as "spam" or "not spam", "disease" or "no disease".
Common Classification Algorithms:
- Logistic Regression
- Decision Trees
- Random Forest
- Support Vector Machines
- Neural Networks
- k-Nearest Neighbors (k-NN)
Example: Email Classification
Consider an email classification system that categorizes emails as "spam" or "not spam":
Features might include:
- Email length
- Number of capital letters
- Presence of certain keywords
- Sender's domain
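A minimal sketch of turning those features into numbers, using a hypothetical `email_features` helper with an invented keyword list and trusted-domain set:

```python
def email_features(text, sender_domain,
                   spam_keywords=("free", "winner", "urgent")):
    """Map a raw email to numeric features (illustrative only)."""
    return {
        "length": len(text),
        "capital_letters": sum(c.isupper() for c in text),
        "keyword_hits": sum(kw in text.lower() for kw in spam_keywords),
        "trusted_domain": int(sender_domain in {"company.com", "gmail.com"}),
    }

feats = email_features("FREE prize! Claim NOW", "promo.example")
```

A classifier would then be trained on rows of such feature dictionaries paired with "spam" / "not spam" labels.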
Python Data Processing Example
This snippet demonstrates data processing using pandas and numpy.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Create sample data
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [25, 30, 35, 28, 32],
    'salary': [50000, 60000, 70000, 55000, 65000],
    'department': ['IT', 'HR', 'Finance', 'IT', 'Marketing']
}

# Create DataFrame
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Data preprocessing
# 1. Handle missing values
df.fillna({'salary': df['salary'].mean()}, inplace=True)

# 2. Group by department and calculate mean salary
#    (done before scaling so the averages are in dollars, not z-scores)
dept_salary = df.groupby('department')['salary'].mean()
print("\nAverage salary by department:")
print(dept_salary)

# 3. Standardize numerical columns
scaler = StandardScaler()
numerical_cols = ['age', 'salary']
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

# 4. One-hot encode categorical columns
df_encoded = pd.get_dummies(df, columns=['department'])

print("\nProcessed DataFrame:")
print(df_encoded)
Typical preprocessing steps for supervised learning:
- Convert text to numerical features
- Handle missing values
- Normalize features
- Split data into training and testing sets

For text data specifically, common steps include:
- Tokenization (splitting text into words)
- Removing stop words
- Stemming or lemmatization
- Converting to numerical representations (TF-IDF, word embeddings)
Regression
Regression algorithms are used when the output variable is a real or continuous value, such as predicting house prices, stock prices, or temperature.
Common Regression Algorithms:
- Linear Regression
- Polynomial Regression
- Ridge Regression
- Lasso Regression
- Support Vector Regression (SVR)
- Decision Tree Regression
- Random Forest Regression
Example: House Price Prediction
For predicting house prices, features might include:
- Square footage
- Number of bedrooms
- Location (latitude, longitude)
- Age of the house
- Proximity to amenities
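A minimal sketch of such a model, fitting scikit-learn's LinearRegression on made-up listings (square footage, bedrooms, age in years):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical listings: [square_feet, bedrooms, age_years]
X = np.array([
    [1400, 3, 20],
    [1600, 3, 15],
    [1700, 4, 30],
    [1875, 4, 10],
    [1100, 2, 40],
])
y = np.array([245000, 312000, 279000, 308000, 199000])  # sale prices

model = LinearRegression().fit(X, y)
predicted = model.predict([[1500, 3, 25]])[0]  # a continuous price estimate
```

A real model would use far more data and features such as location and amenity proximity.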
Derivatives
The derivative of a function represents the rate of change of the function at any given point.
Definition
The derivative of a function f(x) with respect to x is defined as:

f'(x) = lim (h → 0) [f(x + h) - f(x)] / h

This limit represents the instantaneous rate of change of the function at the point x.
Common Derivatives
Here are some common derivatives:
- Power rule: d/dx [x^n] = n x^(n-1)
- Exponential: d/dx [e^x] = e^x
- Trigonometric: d/dx [sin x] = cos x, d/dx [cos x] = -sin x
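These rules can be checked numerically with a central-difference approximation (a small sketch, not part of any library API):

```python
import math

def central_diff(f, x, h=1e-5):
    """Approximate f'(x) with a symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Power rule: d/dx x^3 = 3x^2, so f'(2) should be 12
approx_power = central_diff(lambda x: x**3, 2.0)

# Exponential: d/dx e^x = e^x, so f'(1) should be e
approx_exp = central_diff(math.exp, 1.0)
```

The central difference has O(h^2) error, so both approximations agree with the analytic values to several decimal places.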
Applications
Derivatives have numerous applications in:
- Physics (velocity, acceleration)
- Economics (marginal cost, marginal revenue)
- Engineering (rates of change)
- Optimization problems
Model Evaluation Metrics
Classification Metrics:
- Accuracy
- Precision
- Recall
- F1 Score
- ROC-AUC
- Confusion Matrix
Regression Metrics:
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R-squared (R²)
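Most of these metrics are available in scikit-learn; a short sketch on made-up predictions:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, mean_squared_error)

# Classification metrics on hypothetical labels
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Regression metrics on hypothetical values
mse = mean_squared_error([3.0, 5.0, 2.5], [2.5, 5.0, 3.0])
rmse = mse ** 0.5  # RMSE is just the square root of MSE
```

Note the trade-off visible even here: precision is perfect (no false positives) while recall is lower (one positive was missed); F1 balances the two.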
Overfitting and Underfitting
Overfitting: The model performs well on training data but poorly on test data.
Underfitting: The model is too simple and fails to capture the patterns in either the training or the test data.
Techniques to prevent overfitting:
- Cross-validation
- Regularization (L1, L2)
- Dropout (for neural networks)
- Early stopping
- Data augmentation
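Two of these techniques, L2 regularization (Ridge) and cross-validation, can be combined in a few lines of scikit-learn (synthetic data with invented coefficients):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data: 100 samples, 5 features, known weights plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=100)

# Ridge adds an L2 penalty (alpha) that shrinks coefficients;
# 5-fold cross-validation scores the model on held-out folds
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5)
mean_r2 = scores.mean()  # average R-squared across folds
```

A large gap between training score and cross-validated score is a practical symptom of overfitting.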
SQL Queries Example
This snippet demonstrates various SQL queries for data manipulation.
-- Create tables
-- (departments is created first because employees has a foreign key to it)
CREATE TABLE departments (
    id INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(50) NOT NULL,
    location VARCHAR(100),
    manager_id INT,
    budget DECIMAL(12, 2)
);

CREATE TABLE employees (
    id INT PRIMARY KEY AUTO_INCREMENT,
    first_name VARCHAR(50) NOT NULL,
    last_name VARCHAR(50) NOT NULL,
    email VARCHAR(100) UNIQUE NOT NULL,
    department_id INT,
    salary DECIMAL(10, 2),
    hire_date DATE,
    is_active BOOLEAN DEFAULT TRUE,
    FOREIGN KEY (department_id) REFERENCES departments(id)
);

CREATE TABLE projects (
    id INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(100) NOT NULL,
    start_date DATE,
    end_date DATE,
    budget DECIMAL(12, 2),
    status ENUM('Planning', 'In Progress', 'Completed', 'On Hold') DEFAULT 'Planning'
);

CREATE TABLE employee_projects (
    employee_id INT,
    project_id INT,
    role VARCHAR(50),
    hours_worked INT DEFAULT 0,
    PRIMARY KEY (employee_id, project_id),
    FOREIGN KEY (employee_id) REFERENCES employees(id),
    FOREIGN KEY (project_id) REFERENCES projects(id)
);

-- Insert sample data
INSERT INTO departments (name, location, budget) VALUES
('Engineering', 'San Francisco', 500000.00),
('Marketing', 'New York', 300000.00),
('Sales', 'Chicago', 400000.00),
('HR', 'Remote', 200000.00);

INSERT INTO employees (first_name, last_name, email, department_id, salary, hire_date) VALUES
('John', 'Doe', 'john.doe@company.com', 1, 95000.00, '2022-01-15'),
('Jane', 'Smith', 'jane.smith@company.com', 1, 105000.00, '2021-03-20'),
('Mike', 'Johnson', 'mike.johnson@company.com', 2, 75000.00, '2022-06-10'),
('Sarah', 'Williams', 'sarah.williams@company.com', 3, 85000.00, '2020-11-05'),
('David', 'Brown', 'david.brown@company.com', 4, 65000.00, '2023-02-28');

-- Basic SELECT queries
-- 1. Select all employees
SELECT * FROM employees;

-- 2. Select specific columns
SELECT first_name, last_name, email, salary FROM employees;

-- 3. Filter with WHERE clause
SELECT * FROM employees WHERE salary > 80000;

-- 4. Multiple conditions
SELECT * FROM employees
WHERE department_id = 1 AND salary >= 90000
AND hire_date >= '2022-01-01';

-- JOIN queries
-- 5. Inner join with departments
SELECT
    e.first_name,
    e.last_name,
    e.salary,
    d.name AS department_name,
    d.location
FROM employees e
INNER JOIN departments d ON e.department_id = d.id;

-- 6. Left join to include all departments
SELECT
    d.name AS department_name,
    COUNT(e.id) AS employee_count,
    AVG(e.salary) AS average_salary
FROM departments d
LEFT JOIN employees e ON d.id = e.department_id
GROUP BY d.id, d.name;

-- Aggregate functions
-- 7. COUNT, AVG, SUM, MAX, MIN
SELECT
    COUNT(*) AS total_employees,
    AVG(salary) AS average_salary,
    MAX(salary) AS highest_salary,
    MIN(salary) AS lowest_salary,
    SUM(salary) AS total_payroll
FROM employees
WHERE is_active = TRUE;

-- 8. Group by with HAVING
SELECT
    department_id,
    COUNT(*) AS employee_count,
    AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id
HAVING COUNT(*) > 1
ORDER BY avg_salary DESC;

-- Subqueries
-- 9. Subquery in WHERE clause
SELECT first_name, last_name, salary
FROM employees
WHERE salary > (
    SELECT AVG(salary)
    FROM employees
);

-- 10. Subquery in FROM clause
SELECT dept_name, avg_salary
FROM (
    SELECT
        d.name AS dept_name,
        AVG(e.salary) AS avg_salary,
        COUNT(e.id) AS emp_count
    FROM departments d
    LEFT JOIN employees e ON d.id = e.department_id
    GROUP BY d.id, d.name
) AS dept_stats
WHERE emp_count > 0;

-- Window functions
-- 11. ROW_NUMBER, RANK, DENSE_RANK
SELECT
    first_name,
    last_name,
    salary,
    ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
    RANK() OVER (ORDER BY salary DESC) AS rank_num,
    DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank
FROM employees
ORDER BY salary DESC;

-- 12. LAG, LEAD functions
SELECT
    first_name,
    last_name,
    salary,
    LAG(salary, 1, 0) OVER (ORDER BY salary) AS prev_salary,
    LEAD(salary, 1, 0) OVER (ORDER BY salary) AS next_salary
FROM employees
ORDER BY salary;

-- 13. Window functions with PARTITION BY
SELECT
    e.first_name,
    e.last_name,
    e.salary,
    d.name AS department,
    AVG(e.salary) OVER (PARTITION BY e.department_id) AS dept_avg_salary,
    e.salary - AVG(e.salary) OVER (PARTITION BY e.department_id) AS salary_diff_from_avg
FROM employees e
JOIN departments d ON e.department_id = d.id
ORDER BY d.name, e.salary DESC;

-- CTE (Common Table Expression)
-- 14. Simple CTE
WITH high_earners AS (
    SELECT first_name, last_name, salary, department_id
    FROM employees
    WHERE salary > 80000
)
SELECT
    he.first_name,
    he.last_name,
    he.salary,
    d.name AS department
FROM high_earners he
JOIN departments d ON he.department_id = d.id;

-- 15. Recursive CTE
WITH RECURSIVE employee_hierarchy AS (
    -- anchor: the top earner in department 1
    -- (salary is selected so the recursive step can reference eh.salary)
    SELECT id, first_name, last_name, department_id, salary, 0 AS level
    FROM employees
    WHERE department_id = 1 AND salary = (
        SELECT MAX(salary)
        FROM employees
        WHERE department_id = 1
    )

    UNION ALL

    SELECT
        e.id,
        e.first_name,
        e.last_name,
        e.department_id,
        e.salary,
        eh.level + 1
    FROM employees e
    JOIN employee_hierarchy eh ON e.department_id = eh.department_id
    WHERE e.salary < eh.salary + 20000 AND eh.level < 3
)
SELECT * FROM employee_hierarchy;
The raw data for the house price prediction example above might be pulled with a query like this:

SELECT
    house_id,
    square_feet,
    bedrooms,
    bathrooms,
    price,
    year_built
FROM
    houses
WHERE
    price IS NOT NULL
    AND square_feet > 0
ORDER BY
    price DESC
LIMIT 10000;
Practical Considerations
When building supervised learning models:
- Feature Selection: Choose relevant features
- Data Quality: Ensure clean, consistent data
- Model Selection: Choose appropriate algorithm
- Hyperparameter Tuning: Optimize model parameters
- Cross-validation: Validate model performance
- Interpretability: Understand model decisions
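These considerations can be sketched end to end with scikit-learn: a held-out test set for honest evaluation and GridSearchCV for hyperparameter tuning (the synthetic dataset and the parameter grid are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic classification data stands in for a real, cleaned dataset
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Hold out a test set for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Hyperparameter tuning via cross-validation on the training set only
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
)
grid.fit(X_train, y_train)

# Evaluate the tuned model once on the untouched test set
test_accuracy = grid.score(X_test, y_test)
```

Keeping the test set out of the tuning loop is what makes the final accuracy an honest estimate of performance on unseen data.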