# K-Nearest Neighbors (KNN)
K-Nearest Neighbors (KNN) is a simple, widely used algorithm for classification and regression.
KNN is a supervised learning method. Its core idea is to compute the distance between a query sample and every sample in the training set, find the K nearest training samples, and predict the query's category or value from the categories or values of those K neighbors.
### KNN Basic Principles
The basic principles of the KNN algorithm can be summarized into the following steps:
1. **Calculate Distance**: Calculate the distance between the sample to be classified and each sample in the training set. Common distance measurement methods include Euclidean distance, Manhattan distance, etc.
2. **Select K Nearest Neighbors**: Based on the calculated distances, select the K samples with the smallest distances.
3. **Vote or Average**: For classification problems, the category that appears most frequently among the K nearest neighbors is the category of the sample to be classified; for regression problems, the average value of the K nearest neighbors is the value of the sample to be classified.
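
The three steps above can be sketched directly in NumPy. This is a minimal illustration rather than a reference implementation: the function name `knn_predict`, its parameters, and the toy data are all hypothetical and chosen only for this example.

```python
import numpy as np
from collections import Counter


def knn_predict(X_train, y_train, x_query, k=3, metric="euclidean", task="classification"):
    """Predict the label or value of x_query from its k nearest training samples."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    diff = X_train - np.asarray(x_query, dtype=float)

    # Step 1: distance from the query to every training sample
    if metric == "euclidean":
        distances = np.sqrt((diff ** 2).sum(axis=1))
    elif metric == "manhattan":
        distances = np.abs(diff).sum(axis=1)
    else:
        raise ValueError(f"unsupported metric: {metric}")

    # Step 2: indices of the k smallest distances
    nearest = np.argsort(distances, kind="stable")[:k]

    # Step 3: majority vote (classification) or mean (regression)
    if task == "classification":
        return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
    return float(np.mean(y_train[nearest]))


# Toy data made up for illustration: two well-separated clusters
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

label = knn_predict(X_toy, y_toy, [1.1, 1.0], k=3)
value = knn_predict(X_toy, [1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [5.0, 5.0],
                    k=3, task="regression")
print(label, value)
```

The query near the first cluster receives that cluster's label, and the regression call returns the mean target of the three samples closest to `[5.0, 5.0]`.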
### KNN Characteristics
* **Simple and Easy to Understand**: The principle of the KNN algorithm is very simple and easy to understand and implement.
* **No Training Required**: KNN is a "lazy learning" algorithm that does not require an explicit training process; all calculations are performed at prediction time.
* **No Assumptions About Data Distribution**: KNN is a non-parametric method that makes no assumptions about the underlying data distribution, so it can be applied to many kinds of data as long as a meaningful distance between samples can be defined.
* **High Computational Complexity**: Since KNN needs to calculate the distance to all samples at prediction time, the computational complexity can be very high when the dataset is large.
### KNN Algorithm Advantages and Disadvantages
**Advantages**
* **Simple and Easy to Use**: The principle of the KNN algorithm is simple and easy to understand and implement.
* **No Training Required**: KNN does not require an explicit training process; all calculations are performed at prediction time.
* **Suitable for Multi-classification Problems**: KNN can easily handle multi-classification problems.
**Disadvantages**
* **High Computational Complexity**: KNN needs to calculate the distance to all samples at prediction time, which can be very computationally expensive when the dataset is large.
* **Sensitive to Noise**: KNN is relatively sensitive to noisy data, and noisy data may affect prediction results.
* **Need to Choose an Appropriate K Value**: The choice of K has a significant impact on model performance. A K that is too small makes predictions sensitive to individual noisy samples, while a K that is too large blurs the boundaries between classes.
* * *
## KNN Algorithm Implementation Steps
### 1. Import Necessary Libraries
First, we need to import some commonly used Python libraries, such as `numpy` for numerical computation, `matplotlib` for plotting, and `sklearn` for loading datasets and evaluating models.
## Example
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
```
### 2. Load Dataset
We use the `load_iris` function from `sklearn` to load the classic Iris dataset. This dataset contains 150 samples, each with 4 features, and the goal is to classify the samples into 3 categories.
## Example
```python
# Load Iris dataset
iris = datasets.load_iris()
X = iris.data[:, :2]  # Only take the first two features for visualization
y = iris.target
```
### 3. Data Preprocessing
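Before applying the KNN algorithm, features usually need to be standardized so that each one contributes equally to the distance calculation. This example skips that step, because both selected features are measured in centimeters on similar scales, and only splits the data into a training set (70%) and a test set (30%).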
## Example
```python
# Split the dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```
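
The split above leaves the features unscaled. For datasets whose features have very different ranges, standardization could look like the following sketch. It assumes scikit-learn's `StandardScaler`, which the original walkthrough does not use, and fits the scaler on the training set only so that no information from the test set leaks into preprocessing.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same data and split as in the walkthrough
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Learn mean and standard deviation from the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same training-set statistics to the test set
X_test_scaled = scaler.transform(X_test)

print(np.round(X_train_scaled.mean(axis=0), 6))  # approximately 0 per feature
print(np.round(X_train_scaled.std(axis=0), 6))   # approximately 1 per feature
```

The remaining steps of this walkthrough continue with the unscaled `X_train` and `X_test`.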
### 4. Train KNN Model
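Next, we use `KNeighborsClassifier` from `sklearn` to fit the KNN model. Because KNN is a lazy learner, `fit` mainly stores the training data for later neighbor searches. Here we choose K=3, which means each prediction is based on the 3 nearest neighbors.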
## Example
```python
# Create KNN model, set K value to 3
knn = KNeighborsClassifier(n_neighbors=3)

# Train the model
knn.fit(X_train, y_train)
```
### 5. Prediction and Evaluation
Use the trained model to make predictions on the test set and calculate the model accuracy.
## Example
```python
# Make predictions on the test set
y_pred = knn.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"KNN Model Accuracy: {accuracy:.4f}")
```
The output is as follows:
```
KNN Model Accuracy: 0.7556
```
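
A single accuracy number does not show which classes are being confused. As an optional sketch, assuming scikit-learn's `confusion_matrix` (not part of the original steps), the per-class breakdown can be inspected like this. The variable name `accuracy_from_cm` is introduced only for illustration.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same data, split, and model as in the walkthrough
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)

# Correct predictions lie on the diagonal, so accuracy = trace / total
accuracy_from_cm = cm.trace() / cm.sum()
print(f"Accuracy from confusion matrix: {accuracy_from_cm:.4f}")
```

Off-diagonal entries show which pairs of classes the model mixes up when only the first two features are used.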
### 6. Visualize KNN Classification Results
To more intuitively understand the classification effect of KNN, we can plot the data points and decision boundaries.
Here we use the first two features of the dataset as input features.
## Example
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = datasets.load_iris()
X = iris.data[:, :2]  # Only take the first two features for visualization
y = iris.target

# Split the dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create KNN model, set K value to 3
knn = KNeighborsClassifier(n_neighbors=3)

# Train the model
knn.fit(X_train, y_train)

# Make predictions on the test set
y_pred = knn.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"KNN Model Accuracy: {accuracy:.4f}")

# Plot decision boundaries and data points
h = 0.02  # Grid step size
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

# Create a two-dimensional grid covering the feature space
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Use KNN model to predict the category of each point in the grid
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot decision boundaries
plt.contourf(xx, yy, Z, alpha=0.8)

# Plot all data points (training and test), colored by true class
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', marker='o', s=50)
plt.title("KNN Demo")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
```
The resulting figure shows the decision regions for the three classes as filled contours, with the data points overlaid.
### 7. Adjust K Value
The choice of K value has an important impact on model performance.
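Usually, we select the best K value through cross-validation or by plotting accuracy against K. The example below reuses the training/test split from the earlier steps and plots test-set accuracy for K from 1 to 20.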
## Example
```python
# Try different K values and plot the accuracy changes
k_range = range(1, 21)
accuracies = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)

# Plot the relationship between K value and accuracy
plt.plot(k_range, accuracies, marker='o')
plt.title("Relationship Between K Value and Accuracy")
plt.xlabel("K Value")
plt.ylabel("Accuracy")
plt.show()
```
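
The loop above picks K by looking at test-set accuracy, which can make the final test score optimistic. The cross-validation approach mentioned earlier could be sketched as follows. It assumes scikit-learn's `cross_val_score` with 5 folds on the training set only; the names `k_values`, `cv_means`, and `best_k` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same data and split as in the walkthrough
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

k_values = list(range(1, 21))
cv_means = []

for k in k_values:
    # 5-fold cross-validation on the training set; the test set stays untouched
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    cv_means.append(scores.mean())

# Pick the K with the highest mean cross-validated accuracy
best_k = k_values[int(np.argmax(cv_means))]
print(f"Best K by 5-fold CV: {best_k} (mean accuracy {max(cv_means):.4f})")
```

Once `best_k` is chosen this way, the model can be refit on the full training set and evaluated once on the held-out test set.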
### 8. Use KNN for Regression Tasks
KNN can also be used for regression tasks (KNN Regression).
In regression tasks, KNN predicts the output by averaging the target values of the K nearest neighbors.
## Example
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Fix the random seed so the generated data is reproducible
np.random.seed(42)

# Generate sample data: a noisy sine curve on [0, 10)
X = np.random.rand(100, 1) * 10
y = np.sin(X).ravel() + 0.1 * np.random.randn(100)

# Split into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create KNN regression model
knn_reg = KNeighborsRegressor(n_neighbors=5)

# Train the model
knn_reg.fit(X_train, y_train)

# Make predictions on the test set
y_pred = knn_reg.predict(X_test)

# Visualize regression results
plt.scatter(X_test, y_test, color='red', label='True Values')
plt.scatter(X_test, y_pred, color='blue', label='Predicted Values')
plt.title("KNN Regression")
plt.xlabel("Feature")
plt.ylabel("Target")
plt.legend()
plt.show()
```
In the resulting plot, red points are the true values and blue points are the predicted values.
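
To make the averaging rule concrete, the sketch below checks that a `KNeighborsRegressor` prediction equals the mean target of the query's K nearest neighbors. It assumes the default uniform weighting and uses the estimator's `kneighbors` method to look up neighbor indices. The seeded generator and the query point `2.5` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic noisy sine data, similar in spirit to the example above
rng = np.random.default_rng(0)
X = rng.random((100, 1)) * 10
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(100)

knn_reg = KNeighborsRegressor(n_neighbors=5).fit(X, y)

# Look up the 5 nearest training samples for one query point
x_query = np.array([[2.5]])
_, neighbor_idx = knn_reg.kneighbors(x_query)

# Average their targets by hand and compare with the model's prediction
manual_prediction = y[neighbor_idx[0]].mean()
model_prediction = knn_reg.predict(x_query)[0]
print(manual_prediction, model_prediction)
```

With distance-based weighting (`weights="distance"`), the prediction would instead be a weighted average, so the two numbers would generally differ.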