# AI Computer Vision
It is often said that about 80% of the information humans take in comes through vision.
When you see a photo, you can immediately recognize how many people are in it, what they are doing, and whether the background is indoors or outdoors.
But for a computer, this photo is just a bunch of numbers: each pixel is represented by three values (red, green, and blue), nothing more.
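To make this concrete, here is a minimal sketch of what a tiny color image looks like to a program. The 2×2 image and its pixel values are made up for illustration; a real photo is the same kind of array, only much larger.

```python
import numpy as np

# A hypothetical 2x2 RGB image with shape (height, width, channels).
# Each pixel holds three 8-bit values: red, green, blue.
tiny_image = np.array([
    [[255, 0, 0], [0, 255, 0]],      # red pixel, green pixel
    [[0, 0, 255], [255, 255, 255]],  # blue pixel, white pixel
], dtype=np.uint8)

print(tiny_image.shape)  # (2, 2, 3)
print(tiny_image[0, 1])  # the three values of the top-right (green) pixel
print(tiny_image.size)   # 12 numbers in total
```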
Computer Vision (CV) is the technology that enables computers to understand images.
From face unlock on phones, to road-scene perception in self-driving cars, to lesion analysis in medical imaging, computer vision is now everywhere.
This module will take you from the basics of convolutional neural networks to the latest vision-language models.
> Learning Path: Convolutional Neural Network → Vision Transformer → Object Detection → Image Segmentation → Diffusion Model → CLIP → Vision-Language Model. Each step includes runnable code examples.
* * *
## Convolutional Neural Network (CNN)
The CNN is the foundational technology of computer vision. Loosely inspired by how the human visual cortex works, it extracts image features through local receptive fields.
### Convolution Operation Principle
The core idea of convolution is to slide a small window (the kernel) across the image and extract local features.
For example, a 3×3 kernel looks at 9 pixels of the image at a time, calculates their weighted sum, and produces one output value.
This process repeats as the kernel slides from left to right, top to bottom across the image, ultimately generating a "feature map."
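Written out as a formula, with symbols introduced here only for illustration ($I$ for the image, $K$ for a $k_h \times k_w$ kernel, $O$ for the feature map) and assuming no padding and a stride of 1, each output value is the weighted sum described above:

```latex
O(i, j) = \sum_{m=0}^{k_h - 1} \sum_{n=0}^{k_w - 1} I(i + m,\, j + n)\, K(m, n)
```

Strictly speaking this is cross-correlation, because the kernel is not flipped; it is the operation that deep learning libraries conventionally call convolution.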
## Example
    # ============================================
    # Implement the simplest convolution operation using NumPy
    # Demonstrate how convolution extracts edge features
    # ============================================
    import numpy as np


    def simple_convolution(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
        """
        Implements the most basic 2D convolution operation (no padding, stride 1).

        Parameters:
            image: Input image (H, W), single-channel grayscale
            kernel: Convolution kernel (kH, kW)
        Returns:
            Convolved feature map
        """
        # Get dimensions of image and kernel
        img_h, img_w = image.shape
        kernel_h, kernel_w = kernel.shape
        # Calculate output feature map dimensions
        # Output size = Input size - Kernel size + 1
        out_h = img_h - kernel_h + 1
        out_w = img_w - kernel_w + 1
        # Initialize output feature map
        output = np.zeros((out_h, out_w))
        # Slide the kernel for computation
        for i in range(out_h):
            for j in range(out_w):
                # Extract the local region corresponding to the kernel
                region = image[i:i + kernel_h, j:j + kernel_w]
                # Element-wise multiplication and sum. The kernel is not flipped, so this
                # is technically cross-correlation, which deep learning calls convolution
                output[i, j] = np.sum(region * kernel)
        return output


    # Create a simple test image: white square in the middle, black surroundings
    # Shape: 8×8 grayscale image
    test_image = np.array([
        [0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 1, 1, 1, 1, 0, 0],
        [0, 0, 1, 1, 1, 1, 0, 0],
        [0, 0, 1, 1, 1, 1, 0, 0],
        [0, 0, 1, 1, 1, 1, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0],
    ])
    print("Original image:")
    print(test_image)

    # Define an edge detection kernel (the horizontal-gradient Sobel operator)
    # It responds to left-right intensity changes, so it detects vertical edges
    edge_kernel = np.array([
        [-1, 0, 1],
        [-2, 0, 2],
        [-1, 0, 1],
    ])

    # Perform convolution
    feature_map = simple_convolution(test_image, edge_kernel)
    print("\nFeature map after convolution (vertical edges detected):")
    print(np.round(feature_map, 2))
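The listing above uses no padding and a stride of 1, so each dimension shrinks by the kernel size minus one. As a hedged extension, the commonly used general rule is `out = floor((in + 2 * padding - kernel) / stride) + 1`. The helper name `conv_output_size` is hypothetical and not part of the original example.

```python
def conv_output_size(in_size: int, kernel_size: int,
                     padding: int = 0, stride: int = 1) -> int:
    """Output size of a convolution along one spatial dimension."""
    return (in_size + 2 * padding - kernel_size) // stride + 1


print(conv_output_size(8, 3))                       # 6: the 8x8 image above gives a 6x6 map
print(conv_output_size(8, 3, padding=1))            # 8: padding of 1 keeps the size
print(conv_output_size(8, 3, padding=1, stride=2))  # 4: stride 2 roughly halves it
```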
The key point is that the kernel parameters are learned, not designed by hand.
During training, the model automatically adjusts the kernel values so it can extract features useful for the task.
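To illustrate what "learned" means, the following toy sketch starts from an all-zero kernel and recovers the vertical-edge kernel from the earlier example by plain gradient descent on a mean squared error. The random noise image, the learning rate, and the step count are assumptions chosen for this small setup; real CNNs learn many kernels at once through backpropagation in a framework.

```python
import numpy as np


def correlate2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Sliding-window weighted sum without padding (stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


rng = np.random.default_rng(0)
image = rng.standard_normal((12, 12))  # assumed training image: random noise

# The "unknown" kernel that produced the target feature map
true_kernel = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
target = correlate2d_valid(image, true_kernel)

kernel = np.zeros((3, 3))  # the model starts with no knowledge
learning_rate = 0.1        # assumed value for this toy problem
for step in range(300):
    error = correlate2d_valid(image, kernel) - target
    # Gradient of mean(error**2) w.r.t. each kernel entry:
    # correlate the image with the error map and scale by 2 / N
    grad = 2 * correlate2d_valid(image, error) / error.size
    kernel -= learning_rate * grad

print(np.round(kernel, 2))  # should be close to true_kernel
```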
### Pooling
Pooling shrinks the feature map and reduces computation while preserving the most important features.
The most common variant is max pooling: divide the feature map into small blocks and keep only the maximum value from each block.
## Example
    # ============================================
    # Implement max pooling operation
    # (reuses `np` and `feature_map` from the convolution example)
    # ============================================
    def max_pooling(feature_map: np.ndarray, pool_size: int = 2) -> np.ndarray:
        """
        Implements max pooling operation (non-overlapping windows).

        Parameters:
            feature_map: Input feature map (H, W)
            pool_size: Pooling window size, default 2×2
        Returns:
            Pooled feature map
        """
        h, w = feature_map.shape
        # Calculate output dimensions (incomplete edge windows are dropped)
        out_h = h // pool_size
        out_w = w // pool_size
        output = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                # Extract pooling window region
                region = feature_map[
                    i * pool_size:(i + 1) * pool_size,
                    j * pool_size:(j + 1) * pool_size
                ]
                # Take maximum value
                output[i, j] = np.max(region)
        return output


    # Pool the previous feature map
    print("Feature map before pooling:")
    print(feature_map)
    pooled = max_pooling(feature_map, pool_size=2)
    print("\nFeature map after 2×2 max pooling:")
    print(pooled)
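The explicit loops above are easy to follow but slow on large feature maps. As an alternative sketch, the same non-overlapping max pooling can be written with a reshape that splits each axis into "block index" and "position inside the block". The function name `max_pooling_vectorized` and the demo array are hypothetical.

```python
import numpy as np


def max_pooling_vectorized(feature_map: np.ndarray, pool_size: int = 2) -> np.ndarray:
    """Non-overlapping max pooling without Python loops."""
    h, w = feature_map.shape
    out_h, out_w = h // pool_size, w // pool_size
    # Drop trailing rows/columns that do not fill a complete window
    trimmed = feature_map[:out_h * pool_size, :out_w * pool_size]
    # Axes become (block row, row in block, block column, column in block)
    blocks = trimmed.reshape(out_h, pool_size, out_w, pool_size)
    # Reduce over the two "inside the block" axes
    return blocks.max(axis=(1, 3))


demo = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
])
print(max_pooling_vectorized(demo))
# [[4 2]
#  [2 8]]
```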
### Classic Architecture Comparison
Several milestone architectures mark the history of CNN development, each reflecting the design philosophy of its time.
| Architecture | Year | Core Innovation | Characteristics |
| --- | --- | --- | --- |
| AlexNet | 2012 | ReLU activation, Dropout, Data augmentation | Dawn of the deep learning era, first to significantly outperform traditional methods on ImageNet |
| VGG | 2014 | Uniform use of small 3×3 convolution kernels | Simple and elegant structure, easy to understand and implement |
| ResNet | 2015 | Residual connection (skip connection) | Eases the training of very deep networks by mitigating vanishing gradients and accuracy degradation; networks can reach hundreds of layers |
| EfficientNet | 2019 | Compound scaling method | Simultaneously scales depth, width, and resolution, extremely parameter-efficient |
### Residual Connection (Skip Connection)
The core innovation of ResNet is the residual connection, which addresses the problem of "the deeper the network, the harder it is to train."
A traditional stack of layers learns the mapping from input x to output y directly.
A residual block instead learns the difference between the output y and the input x (the residual).
Formula: y = F(x) + x, where F(x) is the residual to be learned.
> Intuition of residual connection: asking the network to learn "how much to modify on the existing basis" is much easier than asking it to "learn the complete mapping from scratch."
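The formula y = F(x) + x can be tried out in a few lines of NumPy. This is a toy sketch under stated assumptions: F is modeled as two small fully connected layers with a ReLU in between, whereas in ResNet F is a stack of convolution layers. The function names and weight shapes are hypothetical.

```python
import numpy as np


def residual_branch(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """F(x): the part of the block that has to be learned."""
    hidden = np.maximum(0.0, x @ w1)  # linear layer followed by ReLU
    return hidden @ w2


def residual_block(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """y = F(x) + x: the input is added back through the skip connection."""
    return residual_branch(x, w1, w2) + x


x = np.array([[1.0, -2.0, 0.5, 3.0]])

# With all-zero weights F(x) is 0, so the block simply passes x through.
zero_w = np.zeros((4, 4))
print(residual_block(x, zero_w, zero_w))  # same values as x

# With small weights the block makes a small modification to x.
rng = np.random.default_rng(0)
w1 = 0.01 * rng.standard_normal((4, 4))
w2 = 0.01 * rng.standard_normal((4, 4))
y = residual_block(x, w1, w2)
print(y - x)  # this difference is exactly F(x)
```

A block that starts near the identity only has to learn a correction, which is the intuition stated in the note above.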
## Example
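    # ============================================
    # Residual block for a CNN image classifier, using PyTorch
    # Combines convolution, batch normalization, and a residual connection
    # ============================================
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class ResidualBlock(nn.Module):
        """A simple residual block"""

        def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
            super().__init__()
            # First convolution layer
            self.conv1 = nn.Conv2d(
                in_channels, out_channels,
                kernel_size=3, stride=stride, padding=1, bias=False
            )
            self.bn1 = nn.BatchNorm2d(out_channels)
            # Second convolution layer
            # Assumption: the source listing is cut off after `stride=1`. The arguments
            # `padding=1, bias=False` and everything below follow the standard
            # ResNet basic-block layout and are not taken from the original.
            self.conv2 = nn.Conv2d(
                out_channels, out_channels,
                kernel_size=3, stride=1, padding=1, bias=False
            )
            self.bn2 = nn.BatchNorm2d(out_channels)
            # Use a 1x1 convolution on the skip path when the shape changes
            self.shortcut = nn.Sequential()
            if stride != 1 or in_channels != out_channels:
                self.shortcut = nn.Sequential(
                    nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                    nn.BatchNorm2d(out_channels),
                )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            out = F.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            out = out + self.shortcut(x)  # y = F(x) + x
            return F.relu(out)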