CSE463
Computer Vision: Fundamentals and Applications
Lecture 1
Introduction to Computer Vision
Computer Vision is a specialized field within artificial intelligence (AI) aimed at teaching
machines to "see" and interpret the world through visual data. By processing digital images or
videos, computers can gain insights into scenes, objects, and actions—enabling applications
across various industries. Unlike human vision, which is inherently biological, computer vision
relies on digital data and mathematical models to achieve similar outcomes, interpreting images
using a combination of pixel analysis, pattern recognition, and statistical models.
Image Acquisition
Image acquisition is the process of capturing visual information from the physical world using
cameras, sensors, or scanners and converting it into a digital format.
● Types of Sensors: Standard RGB cameras, infrared cameras, LiDAR, and depth sensors
are commonly used for different applications, from security surveillance to autonomous
vehicles.
● Data Formats: Images can be 2D, like standard photographs, or 3D point clouds
generated by depth-sensing cameras.
● Challenges: The quality of acquired images depends on factors such as lighting, camera
resolution, and environmental conditions, all of which impact later stages of computer
vision processing.
Preprocessing
Preprocessing involves transforming or enhancing images to prepare them for more advanced
analysis.
● Techniques:
a. Denoising: Reduces noise in images (often using filters like Gaussian or median
filters).
b. Contrast Enhancement: Techniques like histogram equalization improve contrast,
making features more distinct.
c. Scaling and Cropping: Adjusts the image size, focusing on relevant portions.
d. Normalization: Scales pixel values to a consistent range (e.g., 0-1) for better
model performance.
Proper preprocessing can make feature extraction more accurate by improving image
clarity and removing artifacts.
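As a concrete illustration of these techniques, here is a minimal preprocessing sketch using OpenCV and NumPy (both assumed to be installed; "input.jpg" is a placeholder file name, and the parameter values are illustrative):

```python
import cv2
import numpy as np

# Load an image (placeholder path) and convert to grayscale.
img = cv2.imread("input.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Denoising: Gaussian blur and median filter.
denoised = cv2.GaussianBlur(gray, (5, 5), sigmaX=1.0)
despeckled = cv2.medianBlur(gray, 5)

# Contrast enhancement: histogram equalization.
equalized = cv2.equalizeHist(gray)

# Scaling and cropping: resize to 256x256, then take a central crop.
resized = cv2.resize(gray, (256, 256))
cropped = resized[64:192, 64:192]

# Normalization: scale pixel values to the 0-1 range.
normalized = gray.astype(np.float32) / 255.0
```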
Labeling and Annotation
Labeling and annotation are essential steps in data preparation, particularly for training
supervised machine learning models. They involve adding information to images, such as object
categories, bounding boxes, or pixel-level masks, to provide labeled data that guides algorithms
in recognizing patterns.
● Labeling: This is the process of assigning labels to images or objects within images,
usually at a high level (e.g., labeling an image as "cat" or "dog").
● Annotations: Annotations are more detailed and often involve marking regions of
interest within an image. This can include bounding boxes, polygons, or pixel-wise
masks, depending on the type of computer vision task.
Types of Annotations
1. Classification Labels
Classification labels are used for image classification tasks, where the goal is to assign a
single category to an entire image. Example: labeling images as "car," "bicycle," or
"pedestrian" in a dataset of street scenes.
2. Bounding Boxes
Bounding boxes are used for object detection tasks to locate and identify objects in an
image. A bounding box is a rectangular outline drawn around the object of interest.
Example: marking cars in traffic images to help a model detect and locate vehicles.
3. Semantic Segmentation
Semantic segmentation labels classify each pixel in an image into a category (e.g., sky,
road, car). Each pixel is labeled, providing fine-grained object delineation. Example:
segmenting a road scene where each pixel is assigned to classes like "road," "vehicle," or
"pedestrian."
4. Instance Segmentation
Instance segmentation goes beyond semantic segmentation by distinguishing between
multiple instances of the same object class: each instance is labeled separately, even if
the instances belong to the same category. Example: differentiating multiple people in a
crowd, with each person labeled as a unique instance.
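The exact storage format for labels and annotations depends on the dataset and the annotation tool. The record below is a hypothetical example (the field names are illustrative, not a specific standard) showing how an image-level label, bounding boxes with instance IDs, and a segmentation-mask reference might be stored together:

```python
# Hypothetical annotation record for a single street-scene image.
annotation = {
    "image_file": "frame_0001.jpg",          # placeholder file name
    "label": "street_scene",                 # image-level classification label
    "objects": [
        {
            "category": "car",
            "bbox": [120, 45, 310, 210],     # [x_min, y_min, x_max, y_max] in pixels
            "instance_id": 1,                # distinguishes instances of the same class
        },
        {
            "category": "pedestrian",
            "bbox": [400, 80, 450, 220],
            "instance_id": 2,
        },
    ],
    "segmentation_mask": "frame_0001_mask.png",  # per-pixel class labels stored as an image
}
```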
Feature Extraction
Feature extraction identifies distinct points, edges, textures, or other characteristics within an
image that represent useful information.
● Key Techniques:
a. Edge Detection: Algorithms like Canny and Sobel detect boundaries between
different regions.
b. Corner Detection: Harris and Shi-Tomasi corner detectors find interest points and
are often used in object tracking.
c. Descriptors: SIFT (Scale-Invariant Feature Transform) and ORB (Oriented FAST
and Rotated BRIEF) provide unique, robust representations of image regions.
● Role in Machine Vision: Feature extraction simplifies complex visual data into
recognizable patterns, making it easier for algorithms to understand scenes, recognize
objects, and match images.
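A short sketch of these techniques using OpenCV (assumed installed; "input.jpg" is a placeholder, and the thresholds and feature counts are illustrative):

```python
import cv2

# Load a grayscale image (placeholder path).
gray = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

# Edge detection: Canny with lower/upper hysteresis thresholds.
edges = cv2.Canny(gray, 100, 200)

# Corner detection: Shi-Tomasi "good features to track".
corners = cv2.goodFeaturesToTrack(gray, maxCorners=100, qualityLevel=0.01, minDistance=10)

# Descriptors: ORB keypoints and binary descriptors.
orb = cv2.ORB_create(nfeatures=500)
keypoints, descriptors = orb.detectAndCompute(gray, None)

print(len(keypoints), "ORB keypoints detected")
```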
Interpretation
This is the final stage where extracted features are used to make sense of the visual data. It
often involves machine learning or deep learning to analyze, classify, or predict based on the
image data. Some potential tasks involve:
1. Classification: Identifies objects or scenes (e.g., cat vs. dog or indoor vs. outdoor).
2. Object Detection: Locates and labels multiple objects within an image (e.g., YOLO, SSD
models).
3. Segmentation: Divides images into meaningful parts or objects, like foreground and
background segmentation.
4. Image Captioning: Generates descriptive captions for images, typically using a
combination of convolutional neural networks (CNNs) and recurrent neural networks
(RNNs).
Interpretation is where AI models make high-level decisions about the image, often relying on
large datasets and trained neural networks to achieve high accuracy.
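As one possible illustration of the classification task, the sketch below runs a pretrained ResNet-18 classifier from torchvision (assuming torchvision 0.13 or later is installed; "example.jpg" is a placeholder):

```python
import torch
from PIL import Image
from torchvision import models

# Load a pretrained classifier and its matching preprocessing transform.
weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights)
model.eval()
preprocess = weights.transforms()

# Classify a single image (placeholder path).
img = Image.open("example.jpg").convert("RGB")
batch = preprocess(img).unsqueeze(0)          # add a batch dimension
with torch.no_grad():
    logits = model(batch)
class_index = logits.argmax(dim=1).item()
print("Predicted class:", weights.meta["categories"][class_index])
```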
The differences between human vision and computer vision stem from how each processes
visual information. Human vision is biological and involves complex brain functions, whereas
computer vision is digital and relies on algorithms and mathematical models. Here’s a
comparison to highlight what humans see versus what computers see:
Color Perception
● Human vision: The human eye perceives colors through three types of cones sensitive to
red, green, and blue wavelengths. Humans are also capable of recognizing millions of
colors and adjusting perception based on lighting and surrounding context.
● Computer vision: Computers use numerical values for color, typically representing each
pixel in RGB values (e.g., (255, 0, 0) for red). Computers don't inherently adjust for
lighting or context unless programmed to do so.

Depth and 3D Understanding
● Human vision: Through binocular vision (two eyes) and visual cues (e.g., size,
perspective), humans perceive depth and can understand spatial relationships in 3D.
● Computer vision: Most computer vision systems process 2D images and lack inherent
depth perception. To simulate depth, computers may rely on techniques like stereo vision
(using two cameras) or additional sensors (e.g., LiDAR) to create a 3D model.

Object Recognition and Contextual Awareness
● Human vision: The human brain uses context to recognize objects, even with partial
occlusion, low lighting, or unusual orientations. Humans also use previous experiences
and knowledge to interpret unfamiliar scenes.
● Computer vision: Without training on specific data, computers cannot recognize objects
or make sense of context. Object recognition relies on algorithms and extensive labeled
data to detect patterns, and even then, performance may drop if objects are partially
obscured or in an unexpected setting.

Adaptability to Lighting and Environment Changes
● Human vision: Human vision can adapt to varying light conditions, thanks to the brain's
ability to compensate for shadows, brightness, and reflections.
● Computer vision: Computers often struggle with different lighting conditions. Models
trained on images with consistent lighting may fail in varying environments unless
explicitly trained to handle these variations or equipped with techniques like histogram
equalization.
Computer vision still faces numerous challenges due to the complexity of real-world
environments, including changing lighting conditions, partial occlusion, unusual object
orientations, and unfamiliar settings, as discussed in the comparison above.
Exercises
1. What is computer vision, and how does it differ from human vision?
2. Describe the main stages of a computer vision pipeline and the purpose of each stage.
3. What are some real-world applications of computer vision, and how do they benefit from
this technology?
4. Explain the difference between object detection, image segmentation, and image
classification in computer vision.
5. What challenges do computer vision systems face in real-world environments?
CSE463
Computer Vision: Fundamentals and Applications
Lecture 2
Image Formation and Filters
Perspective Projection
In perspective projection, objects appear smaller as they move further away from the camera,
and lines that are parallel in the 3D world converge in the 2D image, typically towards a
“vanishing point.” This principle explains why nearby objects appear large while distant objects
appear small, and it is crucial for a realistic representation of depth in images.
The simplest model of image formation is the pinhole camera model, which provides a
conceptual framework for understanding how light from a scene is captured through a small
aperture (the "pinhole") and projected onto an image plane (the camera sensor).
How It Works:
1. Light rays from the 3D objects in the scene pass through the pinhole and hit the image
plane.
2. Each light ray corresponds to a specific point in the scene and is projected onto a point
on the image plane.
3. The resulting image on the image plane is inverted, meaning that objects higher in the
scene appear lower on the image plane (and left and right are flipped), while objects
farther away appear smaller.
4. The size of the image depends on the distances from the pinhole to the scene and to the
image plane.
The pinhole camera model is a simple approximation, but it provides the basis for more
sophisticated camera models that include lens effects like distortion and focus.
For accurate image formation and interpretation, a camera's internal and external properties
must be understood. These properties are captured in the intrinsic and extrinsic parameters of
the camera.
Intrinsic Parameters
These are the internal properties of the camera that affect how it captures the scene.
● Focal Length: The distance between the camera's lens and the image plane. It
determines the magnification and the field of view (FOV).
● Principal Point: The point on the image plane where the optical axis intersects (usually
near the center of the image).
● Pixel Aspect Ratio: The ratio of the width to the height of a pixel in the camera sensor.
This parameter is used to account for non-square pixels.
● Skew: A measure of non-orthogonality of the image axes (often assumed to be zero in
most cameras).
These parameters are typically represented in a camera matrix K, which is used to transform
3D coordinates into 2D image coordinates.
Extrinsic Parameters
These parameters describe the position and orientation of the camera in the world.
● Rotation Matrix (R): A 3x3 matrix that describes the camera’s orientation in 3D space.
● Translation Vector (T): A 3x1 vector that describes the camera’s position in 3D space
relative to the world coordinate system.
Extrinsic parameters define how the camera is positioned relative to the world and are critical for
reconstructing 3D scenes from images.
Projection Models
The projection from 3D space onto 2D space can be modeled using different projection
techniques, such as perspective projection and orthographic projection.
Perspective Projection:
● Most common in real-world cameras and is responsible for the phenomenon where
objects appear smaller as they get farther away from the camera (i.e., the vanishing
point).
● In perspective projection, light rays converge towards a single point (the camera's focal
point or pinhole).
● The transformation from 3D world coordinates (X, Y, Z) to 2D image coordinates (x,y) is
a nonlinear operation and involves both intrinsic and extrinsic parameters.
In homogeneous coordinates this can be written as

s · [x, y, 1]^T = K · [R | T] · [X, Y, Z, 1]^T

Where:
● K is the 3×3 intrinsic camera matrix (focal length, principal point, skew),
● R and T are the extrinsic rotation matrix and translation vector, and
● s is a scale factor removed by dividing by the third coordinate (the perspective divide).
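A minimal numerical sketch of this projection (the intrinsic values, pose, and 3D point below are made-up illustrative numbers):

```python
import numpy as np

# Hypothetical intrinsics: focal length 800 px, principal point (320, 240), zero skew.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Hypothetical extrinsics: camera aligned with the world axes, shifted back 5 units.
R = np.eye(3)
T = np.array([[0.0], [0.0], [5.0]])

# A 3D world point in homogeneous coordinates.
X_world = np.array([[1.0], [0.5], [10.0], [1.0]])

# Perspective projection: x = K [R | T] X, followed by the perspective divide.
P = K @ np.hstack([R, T])          # 3x4 projection matrix
x_hom = P @ X_world                # homogeneous image coordinates
x, y = x_hom[:2, 0] / x_hom[2, 0]  # divide by the third coordinate
print(f"Image coordinates: ({x:.1f}, {y:.1f})")
```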
Orthographic Projection:
● Assumes parallel projection where objects appear the same size regardless of their
distance from the camera.
● It is often used for technical drawings or engineering applications but not for real-world
photography, as it doesn’t capture depth perception.
Image Formation:
○ Image formation models describe the physics of how images are formed on the
camera sensor.
○ Light and Aperture: Light enters the camera through the aperture, which controls the
amount of light hitting the sensor. The aperture and lens focus the incoming light,
creating an image on the sensor.
○ Focal Length and Depth of Field: The focal length determines the magnification of the
image, while the depth of field affects the range of distances at which objects appear
sharply in focus. Adjusting these parameters changes the scene's perspective and
focus.
Types of Filters:
Linear Filters:
● Gaussian Filter: A smoothing filter used to reduce noise by averaging pixel values in a
local region, creating a blurring effect. It’s widely used as a preprocessing step in
computer vision tasks.
● Box Filter: An averaging filter that replaces each pixel with the mean of its surrounding
pixels, which causes blurring. A 3×3 box filter is given as follows:

(1/9) ×
1 1 1
1 1 1
1 1 1
● Sobel Filter: An edge-detection filter that calculates the gradient of image intensity,
highlighting regions with rapid intensity change, which correspond to edges.
Non-Linear Filters:
● Median Filter: A noise-reduction filter that replaces each pixel with the median value of
neighboring pixels. It is effective at removing "salt-and-pepper" noise without blurring
edges.
● Bilateral Filter: This filter smooths the image while preserving edges, by combining both
spatial and intensity information, making it useful in preserving details in high-frequency
areas.
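A short sketch applying these filters with OpenCV (assumed installed; "noisy.jpg" is a placeholder, and the kernel sizes and sigma values are illustrative):

```python
import cv2

img = cv2.imread("noisy.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Linear filters
gaussian = cv2.GaussianBlur(img, (5, 5), sigmaX=1.5)   # smoothing / noise reduction
box = cv2.blur(img, (3, 3))                            # 3x3 box (mean) filter
sobel_x = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)   # horizontal gradient (edges)

# Non-linear filters
median = cv2.medianBlur(img, 5)                        # removes salt-and-pepper noise
bilateral = cv2.bilateralFilter(img, d=9, sigmaColor=75, sigmaSpace=75)  # edge-preserving
```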
Example (video): https://www.youtube.com/watch?v=yb2tPt0QVPY
Applications of Filters:
Filtering helps prepare images for higher-level tasks by enhancing specific features or reducing
irrelevant data.
Exercises
1. What is the pinhole camera model, and how does it explain the projection of a 3D world
onto a 2D image plane?
2. Describe the difference between intrinsic and extrinsic camera parameters. Why are
both necessary for accurate image projection?
3. What is perspective projection, and how does it affect the appearance of objects as they
move farther from the camera?
4. Compare and contrast orthographic projection with perspective projection. In what
scenarios might each be preferred?
5. In terms of image formation, explain how light enters through the aperture and is focused
onto the image sensor. What role does the lens play in this process?
6. What is the purpose of using a Gaussian filter in image processing? How does it work to
reduce noise in an image?
7. Explain the difference between linear and non-linear filters, and provide examples of
each. How do these filters affect images?
8. Write down the matrix representation of a 3×3 box filter and apply it to the image given
below.
9.
10. The image on the left shows a noisy image. What filter can be used to revert it to its original
form?
Output Size Calculation
Input Size (H_in, W_in): The height and width of the original input image (before filtering).
Filter Size (K_h, K_w): The height and width of the filter (kernel) being applied to the image.
Stride (S): The number of pixels the filter moves (or "slides") horizontally or vertically in each
step.
Assuming an image size of 10×10 and a filter size of 3×3 with a stride of 2:

Output Size = ((H_in − K_h) / S) + 1 = ((10 − 3) / 2) + 1 = 4.5

Since we can't have fractional pixels, we need to floor the value: Output Size = 4 × 4.
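The same calculation as a small helper function (a sketch; integer division performs the flooring):

```python
def output_size(input_size: int, filter_size: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a filter slid over an image: floor((N - F + 2P) / S) + 1."""
    return (input_size - filter_size + 2 * padding) // stride + 1

print(output_size(10, 3, 2))   # 4 -> a 10x10 image with a 3x3 filter and stride 2 gives 4x4
print(output_size(5, 3, 1))    # 3 -> the 5x5 Gaussian example below gives a 3x3 output
```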
Gaussian Filter
● Imagine you have an image of size 5×5 and a Gaussian filter of size 3×3 with a stride of 1.
(In this example the input image holds the values 1 through 25, written row by row.)

Input image (5×5):
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25

Gaussian Kernel (weights are normalized by dividing by 16):
1 2 1
2 4 2
1 2 1

Simplified formula for a square image: Output Size = ((N − F + 2P) / S) + 1, where N = 5 (input),
F = 3 (filter), P = 0 (padding), S = 1 (stride), giving a 3×3 output.
We slide the kernel across the image, calculate the weighted sum for each 3×3 patch, and
normalize the result by dividing by 16.
(a) For position (0, 0): Place the kernel over the top-left 3×3 patch, calculate the result
(weighted sum, then normalize), and put it in the (0, 0) position:

= (1/16) × (1×1 + 2×2 + 3×1 + 6×2 + 7×4 + 8×2 + 11×1 + 12×2 + 13×1) = 112 / 16 = 7

(b) For position (0, 1): Slide the filter one pixel to the right to cover the next sub-image
(stride = 1):

= (1/16) × (2×1 + 3×2 + 4×1 + 7×2 + 8×4 + 9×2 + 12×1 + 13×2 + 14×1) = 128 / 16 = 8

Partially completed 3×3 output:
7 8 ?
? ? ?
? ? ?
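For reference, a NumPy sketch of this convolution, assuming the 5×5 input is the matrix of values 1 through 25 written row by row (which matches the weighted sums above):

```python
import numpy as np

# 5x5 input image (values 1..25, row by row) and the 3x3 Gaussian kernel.
image = np.arange(1, 26, dtype=np.float64).reshape(5, 5)
kernel = np.array([[1, 2, 1],
                   [2, 4, 2],
                   [1, 2, 1]], dtype=np.float64) / 16.0

# Valid convolution (no padding), stride 1: the output is 3x3.
out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        patch = image[i:i + 3, j:j + 3]
        out[i, j] = np.sum(patch * kernel)

print(out)
# [[ 7.  8.  9.]
#  [12. 13. 14.]
#  [17. 18. 19.]]
```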
CSE463
Computer Vision: Fundamentals and Applications
Lecture 3
Light and Binocular Vision
Biological Vision
Biological vision refers to the mechanisms through which living organisms, particularly humans,
perceive and interpret visual information from their surroundings. The study of biological vision
provides insights into how the human visual system functions and often inspires advancements
in computer vision and image processing.
1. The Eye:
○ Photoreceptors:
■ Rods: Responsible for vision in low-light (scotopic) conditions; sensitive
to intensity but not color.
■ Cones: Active in bright-light (photopic) conditions; responsible for color
vision (red, green, and blue cones).
○ Optic Nerve: Transmits visual information from the retina to the brain.
2. Visual Processing in the Brain:
○ Primary Visual Cortex (V1): Processes basic visual features like edges,
orientation, and motion.
○ Higher Visual Areas: Combine features to recognize shapes, objects, and
scenes.
Visible Light
Visible light is a part of the electromagnetic spectrum that the human eye can detect, typically
ranging from wavelengths of approximately 400 to 700 nanometers. Each wavelength
corresponds to a specific color that humans perceive, from violet (shorter wavelengths) to red
(longer wavelengths). Cameras and imaging devices capture this range of light to create color
images, which are then processed and represented in different color spaces.
Color Image
A color image represents visible light using combinations of primary colors. Most digital color
images are stored in three separate channels corresponding to red, green, and blue light
intensities. By combining these three channels, we can produce a wide range of colors that
closely match human color perception.
Color Spaces
RGB is the most common color space, representing colors by their red, green, and blue (RGB)
components. Each color is defined by a combination of these three values, ranging from 0 to 1
(normalized) or 0 to 255 in 8-bit representation.
● Primary Colors:
○ Red (1,0,0): Maximum red, no green, no blue.
○ Green (0,1,0): Maximum green, no red, no blue.
○ Blue (0,0,1): Maximum blue, no red, no green.
RGB is often visualized as a color cube where each axis corresponds to one of the RGB
values. The color at any point inside the cube is a mix of these three colors.
Drawbacks of RGB:
● Channel Correlation: The RGB channels are highly correlated, meaning changes in
one channel often affect perceived brightness and color, which can complicate
color-based tasks like segmentation.
● Non-perceptual: RGB is not aligned with human color perception, making it difficult to
manipulate colors in a way that corresponds to how we intuitively see and perceive
them.
Despite these drawbacks, RGB remains the default color space for most imaging devices and
digital displays due to its straightforward representation.
HSV (Hue, Saturation, Value) is an intuitive color space that aligns more closely with how
humans perceive colors. It is useful for color-based image processing and editing tasks.
● Hue (H): Represents the color type, ranging from 0 to 1. Hue is an angular value, often
visualized on a color wheel.
● Saturation (S): Represents the intensity of the color, with 0 being grayscale and 1 being
fully saturated color.
● Value (V): Represents the brightness, with 0 being black and 1 being full brightness.
● H (S=1, V=1): Represents the pure color tone at maximum saturation and brightness.
● S (H=1, V=1): Maximum color intensity.
● V (H=1, S=0): Represents grayscale brightness.
The HSV color space is ideal for color-based segmentation, filtering, and detection, as colors
can be manipulated independently of brightness and saturation.
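For example, here is a hue-based segmentation sketch with OpenCV (assumed installed; "scene.jpg" is a placeholder and the threshold values are illustrative, not tuned for any particular dataset). Note that OpenCV stores hue in the range 0–179 rather than 0–1:

```python
import cv2
import numpy as np

img = cv2.imread("scene.jpg")                  # placeholder path; OpenCV loads as BGR
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)     # H in [0, 179], S and V in [0, 255]

# Keep only strongly saturated, reasonably bright green-ish pixels.
lower = np.array([40, 80, 80], dtype=np.uint8)
upper = np.array([80, 255, 255], dtype=np.uint8)
mask = cv2.inRange(hsv, lower, upper)

segmented = cv2.bitwise_and(img, img, mask=mask)
```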
YCbCr is widely used in image and video compression due to its efficient representation of color
and brightness. It separates luminance (Y) from chrominance (Cb and Cr) components.
In digital imaging, the Y channel handles most of the intensity information, while Cb and Cr
channels represent color differences. This separation allows for efficient compression by
reducing the resolution or bit-depth of the chrominance channels without significantly affecting
perceived image quality.
YCbCr is widely used in television and digital video standards due to its fast computation and
compatibility with compression algorithms.
The L*a*b* color space (also known as CIELAB) is designed to be perceptually uniform. It is
based on human vision, with L* representing lightness and a*, b* representing color-opponent
dimensions (green–red and blue-yellow).
L*a*b* color space is used in applications where color accuracy and perceptual uniformity are
essential, such as color correction, editing, and comparison.
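A short sketch converting an image into these color spaces with OpenCV (assumed installed; "photo.jpg" is a placeholder). Note that OpenCV's constant is COLOR_BGR2YCrCb, with the chroma channels ordered Cr, Cb:

```python
import cv2

img = cv2.imread("photo.jpg")                   # placeholder path; BGR order in OpenCV

ycrcb = cv2.cvtColor(img, cv2.COLOR_BGR2YCrCb)  # luma + chroma (Cr, Cb)
lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)      # perceptually uniform L*a*b*

# Example use of the luma/chroma separation: equalize only the luma channel,
# leaving the chroma (and therefore the colors) untouched.
y, cr, cb = cv2.split(ycrcb)
y_eq = cv2.equalizeHist(y)
enhanced = cv2.cvtColor(cv2.merge([y_eq, cr, cb]), cv2.COLOR_YCrCb2BGR)
```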
Exercises
1. What is visible light, and how does it relate to the concept of color in digital imaging?
2. Describe the RGB color space. Why is it the default color space in most digital devices,
and what are its drawbacks?
3. Explain how the HSV color space differs from RGB. Why is HSV considered more
intuitive for certain color-based applications?
4. What are the primary components of the YCbCr color space, and why is it commonly
used in video compression?
5. Describe the purpose of each component in the L*a*b* color space and explain why it’s
useful for tasks requiring perceptual uniformity.
6. How does separating luma (Y) from chrominance (Cb, Cr) in YCbCr enable more
efficient compression?
7. What kind of tasks are best suited for the HSV color space, and why? Provide an
example of an application where HSV would be more useful than RGB.
8. What does it mean for a color space to be 'perceptually uniform,' and which color space
is designed with this property in mind?
9. If you were processing an image to adjust its brightness independently of color, which
color space might offer the most straightforward approach?
10. Explain how the different axes in the RGB color cube represent color combinations.
What color would you get if R=1, G=0, and B=0?
CSE463
Computer Vision: Fundamentals and Applications
Lecture 4
Point Operators and Filtering
Point Operators
Point operators are basic image processing transformations that apply adjustments to each
individual pixel independently, without considering neighboring pixels. They modify pixel values
based on a specific formula or parameter, such as brightness or contrast, to adjust the
appearance of an image.
Pixel Transforms
Pixel transforms are the building blocks of point operators, allowing the adjustment of brightness
and contrast. Two commonly used point processes are multiplication and addition with a
constant:

g(x) = a · f(x) + b

where a > 0 and b are often called the gain and bias parameters; the gain controls contrast and
the bias controls brightness.
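A minimal sketch of the gain/bias transform on a NumPy image (the gain and bias values are chosen arbitrarily for illustration):

```python
import numpy as np

def adjust(image: np.ndarray, gain: float = 1.2, bias: float = 10.0) -> np.ndarray:
    """Apply g(x) = a * f(x) + b per pixel and clip back to the valid 8-bit range."""
    out = gain * image.astype(np.float32) + bias
    return np.clip(out, 0, 255).astype(np.uint8)

image = np.full((4, 4), 100, dtype=np.uint8)   # a flat gray test image
print(adjust(image)[0, 0])                     # 1.2 * 100 + 10 = 130
```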
Linear operators use the superposition principle, meaning the transformation applied to a
combination of inputs equals the sum of the transformations applied to individual inputs. A
simple example is adjusting the brightness using a multiplicative gain.
● Linear Blend Operator: This is often used in image blending and cross-dissolving,
where two images are blended based on an alpha value, α, that varies from 0 to 1. For
instance:

g(x) = (1 − α) · f0(x) + α · f1(x)

As α goes from 0 to 1, the output cross-dissolves from the first image, f0, to the second,
f1 (see the sketch below).
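A sketch of the cross-dissolve on two placeholder frames:

```python
import numpy as np

def blend(img0: np.ndarray, img1: np.ndarray, alpha: float) -> np.ndarray:
    """Linear blend g = (1 - alpha) * f0 + alpha * f1, for alpha in [0, 1]."""
    out = (1.0 - alpha) * img0.astype(np.float32) + alpha * img1.astype(np.float32)
    return np.clip(out, 0, 255).astype(np.uint8)

img0 = np.zeros((100, 100, 3), dtype=np.uint8)       # all-black frame
img1 = np.full((100, 100, 3), 255, dtype=np.uint8)   # all-white frame
halfway = blend(img0, img1, 0.5)                     # uniform mid-gray (about 127)
```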
Gamma Correction
Gamma correction is a critical step in image processing and display technology. It addresses the
nonlinear relationship between the intensity values of image pixels and their perceived
brightness. The goal is to ensure that images appear natural and consistent across different
devices and lighting conditions.
Human vision does not perceive brightness linearly. For example, doubling the intensity of light
does not double the perceived brightness. Similarly:
● Most display devices (like monitors and TVs) do not render brightness linearly due to
hardware constraints.
● Camera sensors capture light linearly, which does not align with how humans perceive
brightness.
Gamma correction adjusts the pixel intensity values to align with the nonlinear perception of
brightness by the human eye and the nonlinear response of display devices.
The Gamma Function
The gamma function is a simple power law relating stored pixel values to light intensity: a stored
value V is decoded for display as V^γ, while a linear intensity is encoded for storage as V^(1/γ),
with γ ≈ 2.2 for typical displays.
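A small sketch of gamma encoding and decoding on normalized pixel values, assuming the common γ ≈ 2.2 (real standards such as sRGB add a short linear segment near zero, which is ignored here):

```python
import numpy as np

GAMMA = 2.2

def encode_gamma(linear: np.ndarray) -> np.ndarray:
    """Linear light (0..1) -> gamma-encoded values for storage/display."""
    return np.power(np.clip(linear, 0.0, 1.0), 1.0 / GAMMA)

def decode_gamma(encoded: np.ndarray) -> np.ndarray:
    """Gamma-encoded values (0..1) -> linear light, for processing in linear space."""
    return np.power(np.clip(encoded, 0.0, 1.0), GAMMA)

linear = np.array([0.0, 0.25, 0.5, 1.0])
print(encode_gamma(linear))   # mid-tones are lifted: 0.5 -> about 0.73
```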
Color Transforms
Color transforms modify an image's color properties, adjusting individual channels to alter
brightness and balance, or to convert between color spaces.
1. Understanding Color Channels: Color images consist of correlated signals (RGB
channels) due to the interaction of light, sensors, and human perception.
2. Brightness Adjustment:
○ Adding a constant to each RGB channel uniformly increases brightness but may
alter the color balance.
○ Chromaticity Coordinates: Adjustments using chromaticity coordinates help
maintain perceptual color qualities without affecting hue or saturation.
3. Color Balancing: Corrects lighting discrepancies (e.g., yellowish hue from incandescent
lighting) by scaling each channel separately or applying a 3×3 color twist matrix for
complex color transformations.
4. Color Spaces:
○ RGB: Common in digital displays and image files.
○ YCbCr: Often used in video compression for efficient storage.
○ HSV: An intuitive color space for tasks involving hue manipulation.
○ L*a*b*: Designed for color accuracy and perceptual uniformity in color-critical
applications.
Image Matting and Compositing
Image matting and compositing are techniques to extract and seamlessly blend objects from
one image into another.
○ Over Operator: Introduced by Porter and Duff (1984), the Over operator defines a way to
blend a foreground object over a background using alpha values.
○ Formula: C = α · F + (1 − α) · B (for a non-premultiplied alpha matte), where F is the
foreground color, B is the background color, and α is the foreground opacity at each
pixel. Combining the RGB and alpha channels in this way blends images while avoiding
harsh edges, ensuring smooth transitions.
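A per-pixel sketch of the over operator with a non-premultiplied alpha matte (array shapes and values are illustrative):

```python
import numpy as np

def composite_over(fg: np.ndarray, bg: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """C = alpha * F + (1 - alpha) * B, with alpha in [0, 1] per pixel."""
    a = alpha[..., None]                      # broadcast alpha over the color channels
    out = a * fg.astype(np.float32) + (1.0 - a) * bg.astype(np.float32)
    return np.clip(out, 0, 255).astype(np.uint8)

fg = np.full((2, 2, 3), 200, dtype=np.uint8)
bg = np.full((2, 2, 3), 50, dtype=np.uint8)
alpha = np.array([[1.0, 0.5],
                  [0.0, 0.25]])
print(composite_over(fg, bg, alpha)[0, 1])    # 0.5 * 200 + 0.5 * 50 = 125
```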
Image Filtering
Image filtering adjusts pixel values based on neighboring pixels, making it essential for tasks
like noise reduction, blurring, sharpening, and edge detection.
Question: When stitching together two images taken with different exposures so that their RGB
values match along the seam, is it necessary to undo the gamma in the color values first?
Answer:
Yes, it is necessary to undo the gamma in the color values to achieve accurate exposure
matching along the seam when stitching images taken with different exposures.
In image stitching, particularly when dealing with two images taken with different exposures, it's
crucial to ensure that the RGB values align along the seam for a smooth transition. This process
often involves adjusting the brightness and color balance between the images. Regarding
gamma correction, it depends on how the images were captured and stored:
1. Gamma Correction: Most images are encoded with gamma correction to adjust for the
nonlinear response of display devices and human vision. This means that the pixel
values in the image are stored in a gamma-corrected space, which is typically darker in
the shadows and lighter in the highlights than a linear color space.
2. Undoing Gamma: To align the two images for stitching, you often need to perform
operations like brightness adjustment, blending, or histogram matching. These
operations should ideally be done in a linear color space, where the RGB values
represent the actual intensities of light.
If the images you're stitching have gamma correction applied, it’s advisable to undo the
gamma correction before performing the adjustments. This is because adjusting RGB
values in a gamma-corrected color space might lead to inaccurate results, especially
when manipulating pixel intensities to match exposure levels. By working in the linear
space, you ensure that the color and brightness adjustments are applied correctly, and
then you can apply the gamma correction back to the result after the adjustment.
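A sketch of that workflow (the γ ≈ 2.2 value and the 0.8 exposure factor are illustrative assumptions):

```python
import numpy as np

GAMMA = 2.2

def match_exposure(encoded_img: np.ndarray, exposure_gain: float) -> np.ndarray:
    """Undo gamma, scale intensities in linear space, then re-apply gamma."""
    linear = np.power(encoded_img.astype(np.float32) / 255.0, GAMMA)   # to linear light
    linear = np.clip(linear * exposure_gain, 0.0, 1.0)                 # exposure adjustment
    return (np.power(linear, 1.0 / GAMMA) * 255.0).astype(np.uint8)    # back to gamma space

# Example: darken the brighter image by an illustrative factor of 0.8 before blending.
img = np.full((4, 4, 3), 180, dtype=np.uint8)
adjusted = match_exposure(img, 0.8)
```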
Question: How would you adjust the RGB values in an image so that a sample "white color"
(Rw, Gw, Bw) is mapped to a neutral white, and does this require a full 3×3 color twist matrix?
Answers:
To adjust the RGB values in an image so that a sample "white color" (Rw, Gw, Bw) appears
neutral, you need to scale the RGB channels in a way that the white point becomes a neutral
white (where all channels are equal, typically RGB = 1 or RGB = (1, 1, 1) in normalized space).
The general approach is to compute a scaling factor for each color channel based on the ratio of
the target white color to the observed white color in the image. This scaling should correct for
any color casts without significantly affecting the exposure.
In this case, the transformation involves simple per-channel scaling of the RGB values. Since
you're adjusting the RGB channels independently (multiplying each channel by a constant
factor), a full 3x3 color twist matrix is not necessary for this operation. A full 3x3 matrix would be
used for more complex color transformations that mix the RGB channels, but here, you're
adjusting each channel independently. So, the correct transformation in this case is a simple
per-channel scaling, which is a diagonal scaling operation.
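A per-channel scaling sketch of this white-balance adjustment (the sampled white values are made up):

```python
import numpy as np

def white_balance(img: np.ndarray, observed_white: np.ndarray) -> np.ndarray:
    """Scale each channel so the sampled white (Rw, Gw, Bw) maps to neutral white."""
    scale = 255.0 / observed_white.astype(np.float32)    # diagonal, per-channel gains
    balanced = img.astype(np.float32) * scale            # broadcasts over H x W x 3
    return np.clip(balanced, 0, 255).astype(np.uint8)

# Example: an image with a warm cast whose sampled "white" patch reads (250, 240, 200).
observed_white = np.array([250.0, 240.0, 200.0])
img = (np.ones((2, 2, 3), dtype=np.float32) * observed_white).astype(np.uint8)
print(white_balance(img, observed_white)[0, 0])          # -> [255 255 255]
```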