# AI Computer Vision

It is often said that about 80% of the information humans acquire comes through vision. When you see a photo, you can immediately recognize how many people are in it, what they are doing, and whether the background is indoors or outdoors. But for a computer, this photo is just an array of numbers: each pixel of a color image is represented by three values (red, green, and blue), nothing more.

Computer Vision (CV) is the technology that enables computers to understand images. From face unlock on phones, to road-scene recognition in self-driving cars, to lesion analysis in medical imaging, computer vision is now everywhere.

This module takes you from the basics of convolutional neural networks to the latest vision-language models.

> Learning Path: Convolutional Neural Network → Vision Transformer → Object Detection → Image Segmentation → Diffusion Model → CLIP → Vision-Language Model. Each step includes runnable code examples.

* * *

## Convolutional Neural Network (CNN)

The CNN is the foundational technology of computer vision. Loosely inspired by how the human visual cortex works, it extracts image features through local receptive fields.

### Convolution Operation Principle

The core idea of convolution is to slide a small window (the kernel) across the image and extract local features. For example, a 3×3 kernel looks at 9 pixels at a time, computes their weighted sum, and produces one output value. This process repeats as the kernel slides from left to right and top to bottom across the image, ultimately generating a "feature map."
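To make the weighted sum concrete, here is a minimal sketch of a single kernel position. The patch values are made up for illustration, and `num_window_positions` is a hypothetical helper that assumes no padding and a stride of 1.

```python
import numpy as np

# A 3x3 patch of grayscale pixel values (illustrative numbers only):
# dark on the left, bright on the right, i.e. a vertical edge.
patch = np.array([
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
])

# A 3x3 kernel: the 9 weights applied to the 9 pixels under the window.
kernel = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
])

# One output value = sum of the 9 element-wise products.
# Only the right column contributes: 1*1 + 1*2 + 1*1.
single_output = int(np.sum(patch * kernel))
print("Output for this window:", single_output)


def num_window_positions(image_size: int, kernel_size: int) -> int:
    """Positions the kernel can occupy along one axis (no padding, stride 1)."""
    return image_size - kernel_size + 1


# An 8-pixel-wide image and a 3-wide kernel give 6 positions per axis,
# so an 8x8 image yields a 6x6 feature map.
print("Positions per axis:", num_window_positions(8, 3))
```

Each position of the sliding window contributes exactly one such value to the feature map.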
## Example

```python
# ============================================
# Implement the simplest convolution operation using NumPy
# Demonstrate how convolution extracts edge features
# ============================================
import numpy as np


def simple_convolution(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """
    Implements the most basic 2D convolution operation (no padding, stride 1).

    Like the convolution layers in deep learning frameworks, this slides the
    kernel without flipping it (strictly speaking, cross-correlation).

    Parameters:
        image: Input image (H, W), single-channel grayscale
        kernel: Convolution kernel (kH, kW)

    Returns:
        Convolved feature map (H - kH + 1, W - kW + 1)
    """
    # Get dimensions of image and kernel
    img_h, img_w = image.shape
    kernel_h, kernel_w = kernel.shape

    # Calculate output feature map dimensions
    # Output size = Input size - Kernel size + 1
    out_h = img_h - kernel_h + 1
    out_w = img_w - kernel_w + 1

    # Initialize output feature map
    output = np.zeros((out_h, out_w))

    # Slide the kernel for computation
    for i in range(out_h):
        for j in range(out_w):
            # Extract the local region corresponding to the kernel
            region = image[i:i + kernel_h, j:j + kernel_w]
            # Element-wise multiplication and sum (this is the convolution operation)
            output[i, j] = np.sum(region * kernel)

    return output


# Create a simple test image: white square in the middle, black surroundings
# Shape: 8×8 grayscale image
test_image = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
])

print("Original image:")
print(test_image)

# Define an edge detection kernel (the Sobel operator for the horizontal gradient)
# This kernel responds to vertical edges
edge_kernel = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
])

# Perform convolution
feature_map = simple_convolution(test_image, edge_kernel)

print("\nFeature map after convolution (vertical edges detected):")
print(np.round(feature_map, 2))
```

The key to convolution is that the kernel parameters are learned, not manually designed.
During training, the model automatically adjusts the kernel values so that they extract features useful for the task.

### Pooling

Pooling compresses the feature map, reducing computation while preserving the most important features. The most commonly used variant is max pooling: divide the feature map into small blocks and keep only the maximum value from each block.

## Example

```python
# ============================================
# Implement max pooling operation
# ============================================
import numpy as np


def max_pooling(feature_map: np.ndarray, pool_size: int = 2) -> np.ndarray:
    """
    Implements max pooling operation

    Parameters:
        feature_map: Input feature map (H, W)
        pool_size: Pooling window size, default 2×2

    Returns:
        Pooled feature map (H // pool_size, W // pool_size)
    """
    h, w = feature_map.shape

    # Calculate output dimensions
    # (rows/columns that do not fill a whole window are dropped)
    out_h = h // pool_size
    out_w = w // pool_size

    output = np.zeros((out_h, out_w))

    for i in range(out_h):
        for j in range(out_w):
            # Extract pooling window region
            region = feature_map[
                i * pool_size:(i + 1) * pool_size,
                j * pool_size:(j + 1) * pool_size
            ]
            # Take maximum value
            output[i, j] = np.max(region)

    return output


# Pool the previous feature map
# (feature_map comes from the convolution example above)
print("Feature map before pooling:")
print(feature_map)

pooled = max_pooling(feature_map, pool_size=2)

print("\nFeature map after 2×2 max pooling:")
print(pooled)
```

### Classic Architecture Comparison

Several milestone architectures mark the history of CNN development, each reflecting the design philosophy of its time.
| Architecture | Year | Core Innovation | Characteristics |
| --- | --- | --- | --- |
| AlexNet | 2012 | ReLU activation, Dropout, data augmentation | Ushered in the deep learning era; first deep CNN to significantly outperform traditional methods on ImageNet |
| VGG | 2014 | Uniform use of small 3×3 convolution kernels | Simple and elegant structure, easy to understand and implement |
| ResNet | 2015 | Residual connection (Skip Connection) | Mitigates the degradation and vanishing-gradient problems in deep networks; networks can reach hundreds of layers |
| EfficientNet | 2019 | Compound scaling method | Scales depth, width, and resolution together; highly parameter-efficient |

### Residual Connection (Skip Connection)

The core innovation of ResNet is the residual connection, which addresses the problem that "the deeper the network, the harder it is to train."

A traditional network layer learns the mapping from input x to output y directly. A residual block instead learns the difference between output y and input x (the residual).

Formula: y = F(x) + x, where F(x) is the residual to be learned.

> Intuition of residual connection: asking the network to learn "how much to modify on the existing basis" is much easier than asking it to "learn the complete mapping from scratch."

## Example

```python
# ============================================
# Implement a complete CNN image classifier using PyTorch
# Includes convolution, pooling, and residual connections
# ============================================
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualBlock(nn.Module):
    """A simple residual block"""

    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        # First convolution layer
        self.conv1 = nn.Conv2d(
            in_channels, out_channels,
            kernel_size=3, stride=stride, padding=1, bias=False
        )
        self.bn1 = nn.BatchNorm2d(out_channels)
        # Second convolution layer
        self.conv2 = nn.Conv2d(
            out_channels, out_channels,
            kernel_size=3, stride=1
```
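As a framework-free check of the formula y = F(x) + x, the sketch below uses plain NumPy. The names `residual_branch` and `residual_block`, the choice of a single linear map followed by ReLU for F, and the weight values are all assumptions made for illustration; they are not the layers ResNet actually uses.

```python
import numpy as np


def residual_branch(x: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Hypothetical F(x): one linear map followed by ReLU."""
    return np.maximum(x @ weight, 0.0)


def residual_block(x: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """y = F(x) + x: the skip connection adds the input back to the branch output."""
    return residual_branch(x, weight) + x


x = np.array([[1.0, -2.0, 3.0]])

# With all-zero weights the branch outputs zeros, so the block is the identity:
# the input passes through unchanged.
zero_weight = np.zeros((3, 3))
y_identity = residual_block(x, zero_weight)
print("Zero branch:  ", y_identity)

# With small weights the block only nudges the input,
# i.e. it learns "how much to modify" rather than the whole mapping.
small_weight = 0.1 * np.eye(3)
y_small = residual_block(x, small_weight)
print("Small branch: ", y_small)
```

Because a zero branch leaves the input untouched, stacking more such blocks cannot make the starting point worse than the identity, which is one way to read the intuition quoted above.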
