CSE463
Computer Vision: Fundamentals and Applications
Lecture 1
Introduction to Computer Vision

What is Computer Vision?

Computer Vision is a specialized field within artificial intelligence (AI) aimed at teaching
machines to "see" and interpret the world through visual data. By processing digital images or
videos, computers can gain insights into scenes, objects, and actions—enabling applications
across various industries. Unlike human vision, which is inherently biological, computer vision
relies on digital data and mathematical models to achieve similar outcomes, interpreting images
using a combination of pixel analysis, pattern recognition, and statistical models.

Here are some of the key aspects of computer vision:

Image Acquisition
Image acquisition is the process of capturing visual information from the physical world using
cameras, sensors, or scanners and converting it into a digital format.
●​ Types of Sensors: Standard RGB cameras, infrared cameras, LiDAR, and depth sensors
are commonly used for different applications, from security surveillance to autonomous
vehicles.
●​ Data Formats: Images can be 2D, like standard photographs, or 3D point clouds
generated by depth-sensing cameras.
●​ Challenges: The quality of acquired images depends on factors such as lighting, camera
resolution, and environmental conditions, all of which impact later stages of computer
vision processing.

Preprocessing
Preprocessing involves transforming or enhancing images to prepare them for more advanced
analysis.
●​ Techniques:
a.​ Denoising: Reduces noise in images (often using filters like Gaussian or median
filters).
b.​ Contrast Enhancement: Techniques like histogram equalization improve contrast,
making features more distinct.
c.​ Scaling and Cropping: Adjusts the image size, focusing on relevant portions.
d.​ Normalization: Scales pixel values to a consistent range (e.g., 0-1) for better
model performance.
Proper preprocessing can make feature extraction more accurate by improving image
clarity and removing artifacts.
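As a rough, non-authoritative illustration of these steps, the sketch below chains denoising, histogram equalization, resizing, and normalization with OpenCV and NumPy. The file name and parameter choices (kernel size, target resolution) are assumptions for the example, not values prescribed by the lecture.

```python
# A minimal preprocessing sketch (assumed input file "input.jpg").
import cv2
import numpy as np

img = cv2.imread("input.jpg")                      # load a BGR image from disk
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)       # work on a single channel

denoised = cv2.GaussianBlur(gray, (5, 5), 0)       # denoising with a Gaussian filter
equalized = cv2.equalizeHist(denoised)             # contrast enhancement (histogram equalization)
resized = cv2.resize(equalized, (224, 224))        # scaling to a fixed size
normalized = resized.astype(np.float32) / 255.0    # normalization to the 0-1 range

print(normalized.shape, normalized.min(), normalized.max())
```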

Labeling and Annotations

Labeling and annotation are essential steps in data preparation, particularly for training
supervised machine learning models. They involve adding information to images, such as object
categories, bounding boxes, or pixel-level masks, to provide labeled data that guides algorithms
in recognizing patterns.

●​ Labeling: This is the process of assigning labels to images or objects within images,
usually at a high level (e.g., labeling an image as "cat" or "dog").
●​ Annotations: Annotations are more detailed and often involve marking regions of
interest within an image. This can include bounding boxes, polygons, or pixel-wise
masks, depending on the type of computer vision task.

Types of Annotations
1. Classification Labels
Used for image classification tasks, where the goal is to classify an entire image.
Example: Labeling images as "car," "bicycle," or "pedestrian" in a dataset of street
scenes.
2.​ Bounding Boxes
They are used for object detection tasks to locate and identify objects in an image. A
bounding box is a rectangular outline drawn around the object of interest. Example:
Marking cars in traffic images to help a model detect and locate vehicles.
3. Semantic Segmentation
Used to classify each pixel in an image into a category (e.g., sky, road, car). Each
pixel is labeled, providing fine-grained object delineation. Example: Segmenting a road scene
where each pixel is assigned to classes like "road," "vehicle," or "pedestrian."
4.​ Instance Segmentation
It goes beyond semantic segmentation by distinguishing between multiple instances of
the same object class. Labels each instance of an object separately, even if they belong
to the same category. Example: Differentiating multiple people in a crowd, with each
person labeled as a unique instance.
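To make these annotation types concrete, here is a small, hypothetical annotation record in Python. The field names loosely follow a COCO-style convention (bounding boxes as [x, y, width, height] in pixels) and are illustrative only; real datasets define their own schemas.

```python
# Hypothetical annotation record for a single street-scene image.
annotation = {
    "image_id": 42,
    "file_name": "street_001.jpg",
    "classification_label": "street scene",            # whole-image label
    "objects": [                                        # object-detection annotations
        {"category": "car",        "bbox": [34, 120, 200, 90]},
        {"category": "pedestrian", "bbox": [400, 100, 45, 130]},
    ],
    "segmentation_mask": "street_001_mask.png",         # per-pixel class IDs (semantic segmentation)
    "instance_ids": [1, 2],                              # separates individual objects (instance segmentation)
}
print(len(annotation["objects"]), "annotated objects")
```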

Feature Extraction
Feature extraction identifies distinct points, edges, textures, or other characteristics within an
image that represent useful information.
●​ Key Techniques:
a.​ Edge Detection: Algorithms like Canny and Sobel detect boundaries between
different regions.
b.​ Corner Detection: Harris and Shi-Tomasi corner detectors find interest points and
are often used in object tracking.
c.​ Descriptors: SIFT (Scale-Invariant Feature Transform) and ORB (Oriented FAST
and Rotated BRIEF) provide unique, robust representations of image regions.
●​ Role in Machine Vision: Feature extraction simplifies complex visual data into
recognizable patterns, making it easier for algorithms to understand scenes, recognize
objects, and match images.
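A minimal sketch of two of these techniques with OpenCV, assuming an image file named input.jpg and example threshold/parameter values:

```python
# Edge detection (Canny) and ORB keypoints/descriptors with OpenCV.
import cv2

gray = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

edges = cv2.Canny(gray, 100, 200)            # Canny edge map (low/high thresholds are assumptions)

orb = cv2.ORB_create(nfeatures=500)          # ORB detector + descriptor
keypoints, descriptors = orb.detectAndCompute(gray, None)

print("edge pixels:", int((edges > 0).sum()))
print("keypoints:", len(keypoints),
      "descriptor shape:", None if descriptors is None else descriptors.shape)
```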

Interpretation
This is the final stage where extracted features are used to make sense of the visual data. It
often involves machine learning or deep learning to analyze, classify, or predict based on the
image data. Some potential tasks involve:
1.​ Classification: Identifies objects or scenes (e.g., cat vs. dog or indoor vs. outdoor).
2.​ Object Detection: Locates and labels multiple objects within an image (e.g., YOLO, SSD
models).
3.​ Segmentation: Divides images into meaningful parts or objects, like foreground and
background segmentation.
4.​ Image Captioning: Generates descriptive captions for images, typically using a
combination of convolutional neural networks (CNNs) and recurrent neural networks
(RNNs).
Interpretation is where AI models make high-level decisions about the image, often relying on
large datasets and trained neural networks to achieve high accuracy.​
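As a hedged illustration of the interpretation stage, the sketch below classifies an image with an ImageNet-pretrained CNN via torchvision (version 0.13 or newer assumed); the image path and model choice are assumptions, not part of the lecture.

```python
# A minimal image-classification sketch with a pretrained CNN.
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

model = models.resnet18(weights="IMAGENET1K_V1").eval()   # ImageNet-pretrained classifier
img = Image.open("input.jpg").convert("RGB")              # assumed input image

with torch.no_grad():
    logits = model(preprocess(img).unsqueeze(0))           # add a batch dimension
print("predicted ImageNet class index:", int(logits.argmax(dim=1)))
```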

Q. How humans see vs how computers see?

The differences between human vision and computer vision stem from how each processes
visual information. Human vision is biological and involves complex brain functions, whereas
computer vision is digital and relies on algorithms and mathematical models. Here’s a
comparison to highlight what humans see versus what computers see:
Perception vs. Pixels
● Human: Humans perceive scenes holistically, automatically grouping objects and backgrounds, noticing depth, color, and movement, and interpreting emotions or intentions.
● Computer: Computers analyze images as grids of pixels, each with a specific color and intensity. Without further processing, computers don’t inherently understand objects, depth, or emotions.

Color Perception
● Human: The human eye perceives colors through three types of cones sensitive to red, green, and blue wavelengths. Humans are also capable of recognizing millions of colors and adjusting perception based on lighting and surrounding context.
● Computer: Computers use numerical values for color, typically representing each pixel in RGB values (e.g., (255, 0, 0) for red). Computers don’t inherently adjust for lighting or context unless programmed to do so.

Depth and 3D Understanding
● Human: Through binocular vision (two eyes) and visual cues (e.g., size, perspective), humans perceive depth and can understand spatial relationships in 3D.
● Computer: Most computer vision systems process 2D images and lack inherent depth perception. To simulate depth, computers may rely on techniques like stereo vision (using two cameras) or additional sensors (e.g., LiDAR) to create a 3D model.

Object Recognition and Contextual Awareness
● Human: The human brain uses context to recognize objects, even with partial occlusion, low lighting, or unusual orientations. Humans also use previous experiences and knowledge to interpret unfamiliar scenes.
● Computer: Without training on specific data, computers cannot recognize objects or make sense of context. Object recognition relies on algorithms and extensive labeled data to detect patterns, and even then, performance may drop if objects are partially obscured or in an unexpected setting.

Adaptability to Lighting and Environment Changes
● Human: Human vision can adapt to varying light conditions, thanks to the brain’s ability to compensate for shadows, brightness, and reflections.
● Computer: Computers often struggle with different lighting conditions. Models trained on images with consistent lighting may fail in varying environments unless explicitly trained to handle these variations or equipped with techniques like histogram equalization.

Semantic Understanding
● Human: Humans instinctively understand relationships between objects in a scene, like knowing that a person is holding an object or that a car should be on the road.
● Computer: Computers rely on algorithms and pre-labeled data to understand such relationships. Even with advanced models, they may struggle with complex relationships or unusual scenes without extensive training.

Applications of Computer Vision


1.​ Autonomous Vehicles:
a.​ Object Detection: Identifies and locates other vehicles, pedestrians, road signs,
and obstacles.
b.​ Lane Detection: Tracks road lanes to ensure safe lane-keeping and assists in
navigation.
c.​ Depth Estimation: Using stereo vision or LiDAR to estimate distances, ensuring
safe braking and obstacle avoidance.
2.​ Facial Recognition in Security:
a.​ Face Detection: Identifies faces in images or videos, often used for surveillance
or entry control.
b.​ Identity Verification: Compares detected faces to a database for identity
matching.
c.​ Expression Analysis: Analyzes facial expressions for sentiment analysis or
behavioral studies.
3.​ Medical Imaging:
a.​ Disease Detection: Identifies abnormal growths or irregularities in X-rays, MRIs,
and CT scans, assisting in early diagnosis.
b.​ Image Segmentation: Differentiates between organs, tissues, or abnormalities in
medical scans.
c.​ Tumor Localization: Helps in precisely locating tumors or other areas of concern
in images.
4.​ Augmented Reality and Robotics:
a.​ Object Recognition: Identifies objects in a robot’s field of view, crucial for
interaction and navigation.
b.​ Scene Understanding: Analyzes surroundings to adapt interactions or make
decisions.
c.​ 3D Reconstruction: Builds 3D models from multiple images, used in applications
ranging from entertainment to surgical planning.

Key Technical Challenges

Computer vision still faces numerous challenges due to the complexity of real-world
environments:

1. Lighting Variability
Variations in lighting can drastically affect an image's appearance, making feature
detection and object recognition more challenging.
2.​ Object Occlusion
When objects overlap or partially block each other, it’s harder for algorithms to recognize
and interpret individual items.
3.​ Viewpoint Variability
Objects may appear differently from various angles, requiring sophisticated models to
generalize well.
4.​ Generalization and Transfer Learning
Models trained on specific datasets may not perform well in new environments,
necessitating robust learning techniques that generalize across different contexts.

Exercises

1.​ What is computer vision, and how does it differ from human vision?
2.​ Describe the main stages of a computer vision pipeline and the purpose of each stage.
3.​ What are some real-world applications of computer vision, and how do they benefit from
this technology?
4.​ Explain the difference between object detection, image segmentation, and image
classification in computer vision.
5.​ What challenges do computer vision systems face in real-world environments?

CSE463
Computer Vision: Fundamentals and Applications
Lecture 2
Image Formation and Filters

Geometry of Image Formation


The geometry of image formation studies the process by which 3D objects in the world are
captured and represented on a 2D image plane. This field is foundational in understanding how
cameras perceive depth, scale, and spatial relationships in a scene.

Perspective Projection

In perspective projection, objects appear smaller as they move further away from the camera,
and lines that are parallel in the 3D world converge in the 2D image, typically towards a
“vanishing point.” This principle explains why nearby objects appear large while distant objects
appear small, and it is crucial for a realistic representation of depth in images.

Camera Image Formation


Camera image formation refers to the process of capturing a 3D scene from the world and
projecting it onto a 2D image plane, which is a critical aspect of understanding how images are
formed in computer vision and photogrammetry. This process involves several physical
principles and geometrical concepts that ensure a 3D scene is represented correctly on a 2D
plane.

Pin-Hole Camera Model

The simplest model of image formation is the pinhole camera model, which provides a
conceptual framework for understanding how light from a scene is captured through a small
aperture (the "pinhole") and projected onto an image plane (the camera sensor).

Pinhole Camera Model Components:

●​ Scene (3D world): A real-world scene consisting of objects in three-dimensional space.


●​ Camera: The device that captures the scene, consisting of a lens and an image plane.
●​ Pinhole: A small aperture through which light passes.
●​ Image plane: A 2D surface (typically a digital sensor or film) where the scene is
projected.

How It Works:

1.​ Light rays from the 3D objects in the scene pass through the pinhole and hit the image
plane.
2.​ Each light ray corresponds to a specific point in the scene and is projected onto a point
on the image plane.
3.​ The resulting image on the image plane is inverted, meaning that objects higher in the
scene appear lower on the image plane, and objects farther away appear smaller.
4.​ The size of the image depends on the distance between the scene, the pinhole, and the
image plane.

The pinhole camera model is a simple approximation, but it provides the basis for more
sophisticated camera models that include lens effects like distortion and focus.

Camera Calibration Parameters

For accurate image formation and interpretation, a camera's internal and external properties
must be understood. These properties are captured in the intrinsic and extrinsic parameters of
the camera.

Intrinsic Parameters (Camera Intrinsics):

These are the internal properties of the camera that affect how it captures the scene.

●​ Focal Length: The distance between the camera's lens and the image plane. It
determines the magnification and the field of view (FOV).
●​ Principal Point: The point on the image plane where the optical axis intersects (usually
near the center of the image).
●​ Pixel Aspect Ratio: The ratio of the width to the height of a pixel in the camera sensor.
This parameter is used to account for non-square pixels.
●​ Skew: A measure of non-orthogonality of the image axes (often assumed to be zero in
most cameras).

These parameters are typically represented in a camera matrix K, which is used to transform
3D coordinates into 2D image coordinates.

Extrinsic Parameters (Camera Extrinsic):

These parameters describe the position and orientation of the camera in the world.

●​ Rotation Matrix (R): A 3x3 matrix that describes the camera’s orientation in 3D space.
●​ Translation Vector (T): A 3x1 vector that describes the camera’s position in 3D space
relative to the world coordinate system.

Extrinsic parameters define how the camera is positioned relative to the world and are critical for
reconstructing 3D scenes from images.

Projection Models

The projection from 3D space onto 2D space can be modeled using different projection
techniques, such as perspective projection and orthographic projection.

Perspective Projection:

●​ Most common in real-world cameras and is responsible for the phenomenon where
objects appear smaller as they get farther away from the camera (i.e., the vanishing
point).
●​ In perspective projection, light rays converge towards a single point (the camera's focal
point or pinhole).
●​ The transformation from 3D world coordinates (X, Y, Z) to 2D image coordinates (x,y) is
a nonlinear operation and involves both intrinsic and extrinsic parameters.

The mathematical formulation for perspective projection is as follows (in homogeneous coordinates):

s [x, y, 1]^T = K [R | T] [X, Y, Z, 1]^T

Where:

● [R | T] is the extrinsic matrix (rotation and translation),
● K is the intrinsic camera matrix,
● (X, Y, Z) are the 3D coordinates of a point in the world,
● (x, y) are the corresponding 2D image coordinates,
● s is an arbitrary scale factor (the projective depth).
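A small NumPy sketch of this projection; the focal length, principal point, and camera pose below are made-up values chosen only to illustrate the K [R | T] pipeline.

```python
# Project a 3D world point to pixel coordinates: s [x, y, 1]^T = K [R | T] [X, Y, Z, 1]^T
import numpy as np

K = np.array([[800.0,   0.0, 320.0],     # fx, skew, cx  (assumed values)
              [  0.0, 800.0, 240.0],     # fy, cy
              [  0.0,   0.0,   1.0]])    # intrinsic matrix

R = np.eye(3)                            # camera aligned with the world axes (assumed)
T = np.array([[0.0], [0.0], [5.0]])      # camera translated 5 units along Z (assumed)

X_world = np.array([[1.0], [0.5], [10.0], [1.0]])   # homogeneous 3D point (X, Y, Z, 1)

P = K @ np.hstack([R, T])                # 3x4 projection matrix
x_h = P @ X_world                        # homogeneous image point
x, y = (x_h[:2] / x_h[2]).ravel()        # divide by the third (scale) coordinate
print(f"pixel coordinates: ({x:.1f}, {y:.1f})")
```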

Orthographic Projection:

●​ Assumes parallel projection where objects appear the same size regardless of their
distance from the camera.
●​ It is often used for technical drawings or engineering applications but not for real-world
photography, as it doesn’t capture depth perception.

In orthographic projection, the transformation from 3D world coordinates to 2D coordinates is
linear, and the depth dimension is ignored.

Image Formation:
○​ Image formation models describe the physics of how images are formed on the
camera sensor.
○​ Light and Aperture: Light enters the camera through the aperture, which controls the
amount of light hitting the sensor. The aperture and lens focus the incoming light,
creating an image on the sensor.
○​ Focal Length and Depth of Field: The focal length determines the magnification of the
image, while the depth of field affects the range of distances at which objects appear
sharply in focus. Adjusting these parameters changes the scene's perspective and
focus.

Image Filtering (2D Convolution)


Image filtering is a technique applied to images to enhance or preprocess them for analysis by
modifying pixel values in systematic ways. Filters can remove noise, enhance edges, or
sharpen an image, depending on the filter type and parameters. Filters are also known as
kernels and can have a shape of 1x1, 3x3, 5x5, 7x7, and so on.

Types of Filters:

Linear Filters:

●​ Gaussian Filter: A smoothing filter used to reduce noise by averaging pixel values in a
local region, creating a blurring effect. It’s widely used as a preprocessing step in
computer vision tasks.
● Box Filter: Averages the pixels in a surrounding neighborhood, which causes
blurring. A 3x3 box filter is given as follows:

(1/9) × [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

●​ Sobel Filter: An edge-detection filter that calculates the gradient of image intensity,
highlighting regions with rapid intensity change, which correspond to edges.

Non-Linear Filters:

●​ Median Filter: A noise-reduction filter that replaces each pixel with the median value of
neighboring pixels. It is effective at removing "salt-and-pepper" noise without blurring
edges.
●​ Bilateral Filter: This filter smooths the image while preserving edges, by combining both
spatial and intensity information, making it useful in preserving details in high-frequency
areas.
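The sketch below applies the filters above with OpenCV; the kernel sizes and sigma values are illustrative assumptions.

```python
# Applying the linear and non-linear filters discussed above.
import cv2

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)   # assumed input image

gaussian  = cv2.GaussianBlur(img, (5, 5), sigmaX=1.0)         # linear: smoothing / denoising
box       = cv2.blur(img, (3, 3))                             # linear: 3x3 box (mean) filter
sobel_x   = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)         # linear: horizontal gradient (edges)
median    = cv2.medianBlur(img, 5)                            # non-linear: removes salt-and-pepper noise
bilateral = cv2.bilateralFilter(img, d=9, sigmaColor=75, sigmaSpace=75)  # non-linear: edge-preserving smoothing

print(gaussian.shape, sobel_x.dtype)
```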

Sliding of a Filter/Kernel Example

Example 1, for a 3x3 kernel, with explanation:


https://www.songho.ca/dsp/convolution/convolution2d_example.html​

Example 2: https://www.youtube.com/watch?v=yb2tPt0QVPY


Applications of Filters:

Filters are widely used in image processing tasks such as:

1. Noise reduction (Gaussian, median filters)
2. Edge detection (Sobel filter)
3. Image sharpening or smoothing

Filtering helps prepare images for higher-level tasks by enhancing specific features or reducing
irrelevant data.

Exercises
1.​ What is the pinhole camera model, and how does it explain the projection of a 3D world
onto a 2D image plane?
2.​ Describe the difference between intrinsic and extrinsic camera parameters. Why are
both necessary for accurate image projection?
3.​ What is perspective projection, and how does it affect the appearance of objects as they
move farther from the camera?
4.​ Compare and contrast orthographic projection with perspective projection. In what
scenarios might each be preferred?
5.​ In terms of image formation, explain how light enters through the aperture and is focused
onto the image sensor. What role does the lens play in this process?
6.​ What is the purpose of using a Gaussian filter in image processing? How does it work to
reduce noise in an image?
7.​ Explain the difference between linear and non-linear filters, and provide examples of
each. How do these filters affect images?
8. Write down the matrix representation for a 3x3 box filter, and apply it to the image
given below:

a.​ b.

9.

10. The image on the left shows a noisy image. What filter can be used to revert it to its original
form?

Image Filtering Mathematical Examples-


Output Size Calculation

The output size (H_out, W_out) for a convolution (with no padding) is given by:

H_out = floor((H_in - K_h) / S) + 1
W_out = floor((W_in - K_w) / S) + 1

Input Size (H_in, W_in): The height and width of the original input image (before filtering).

Filter Size (K_h, K_w): The height and width of the filter (kernel) being applied to the image.

Stride (S): The number of pixels the filter moves (or "slides") horizontally or vertically in each
step.

Assuming an image size of 10×10 and a filter size of 3×3 with a stride of 2:

(10 - 3) / 2 + 1 = 4.5

Since we can't have fractional pixels, we need to floor the value: Output Size = 4 × 4.
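A tiny helper, assuming no padding, that mirrors the formula above:

```python
# Output-size calculation for a convolution with no padding.
def conv_output_size(h_in, w_in, k_h, k_w, stride):
    h_out = (h_in - k_h) // stride + 1   # floor division handles fractional results
    w_out = (w_in - k_w) // stride + 1
    return h_out, w_out

print(conv_output_size(10, 10, 3, 3, 2))   # -> (4, 4), matching the example above
```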

Gaussian Filter-

● The Gaussian kernel is defined mathematically as:

G(x, y) = (1 / (2πσ²)) · exp(−(x² + y²) / (2σ²))

➔ σ is the standard deviation controlling the extent of smoothing.

➔ For example, with σ = 1, a 3x3 kernel (after rounding and normalization) might look like:

(1/16) × [[1, 2, 1], [2, 4, 2], [1, 2, 1]]

● Imagine you have an image of size 5x5 and a filter of size 3x3 with a stride of 1.

Sample Image Matrix:

 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25

Gaussian Kernel (normalized by dividing by 16):

1 2 1
2 4 2
1 2 1

Output Image Size-

Simplified formula for a square image = ((N - F + 2P)/S) + 1, where N=5 (input), F=3 (filter),
P=0 (padding), S=1 (stride)

So, Output size = ((5-3+0)/1) + 1 = 3x3

Steps for Each Position:

We slide the kernel across the image, calculate the weighted sum for each 3×3 patch, and
normalize the result by dividing by 16.

(a) For Position (0, 0):

● Extract the sub-image:

 1  2  3
 6  7  8
11 12 13

● Perform element-wise multiplication with the kernel and sum:

1*1 + 2*2 + 3*1 + 6*2 + 7*4 + 8*2 + 11*1 + 12*2 + 13*1 = 112

● Calculate the result (normalize by dividing by 16) and put it in the (0, 0) position:

112 / 16 = 7

(b) For position (0, 1): Slide the filter one column to the right to cover the next sub-image (stride = 1).

● Extract the sub-image:

 2  3  4
 7  8  9
12 13 14

● Repeat the weighted sum and normalization steps:

= (1/16) * (2*1 + 3*2 + 4*1 + 7*2 + 8*4 + 9*2 + 12*1 + 13*2 + 14*1) = 128/16 = 8

Continue this for all positions and put the results in the filtered image matrix:

7 8 ?
? ? ?
? ? ?
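The worked example can be cross-checked with a short SciPy sketch (a "valid"-mode correlation of the 5x5 image with the 3x3 kernel, normalized by 16); this is a verification aid, not part of the original notes.

```python
# Reproduces the worked example: first output row is [7, 8, 9].
import numpy as np
from scipy.signal import correlate2d

image = np.arange(1, 26, dtype=float).reshape(5, 5)   # the 1..25 sample image
kernel = np.array([[1, 2, 1],
                   [2, 4, 2],
                   [1, 2, 1]], dtype=float)

filtered = correlate2d(image, kernel, mode="valid") / 16.0
print(filtered)
```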

CSE463
Computer Vision: Fundamentals and Applications
Lecture 3
Light and Binocular Vision

Biological Vision
Biological vision refers to the mechanisms through which living organisms, particularly humans,
perceive and interpret visual information from their surroundings. The study of biological vision
provides insights into how the human visual system functions and often inspires advancements
in computer vision and image processing.

Key Components of the Human Visual System

1.​ Eye Structure:


○​ Cornea: The transparent outer layer that focuses light onto the retina.
○​ Lens: Fine-tunes the focus of light onto the retina for near and far objects
(accommodation).
○​ Retina: A light-sensitive layer at the back of the eye that contains
photoreceptors.

○​ Photoreceptors:
■​ Rods: Responsible for vision in low-light (scotopic) conditions; sensitive
to intensity but not color.
■​ Cones: Active in bright-light (photopic) conditions; responsible for color
vision (red, green, and blue cones).
○​ Optic Nerve: Transmits visual information from the retina to the brain.
2.​ Visual Processing in the Brain:
○​ Primary Visual Cortex (V1): Processes basic visual features like edges,
orientation, and motion.
○​ Higher Visual Areas: Combine features to recognize shapes, objects, and
scenes.

The Visual Pathway

1.​ Light Detection:


○​ Light enters through the pupil, is focused by the cornea and lens, and forms an
image on the retina.
○​ Photoreceptors in the retina convert light into electrical signals.
2.​ Signal Transmission:
○​ The signals pass through retinal ganglion cells and are transmitted to the brain
via the optic nerve.
3.​ Neural Processing:
○​ The brain interprets these signals to detect patterns, depth, motion, and colors.

Visible Light

Visible light is a part of the electromagnetic spectrum that the human eye can detect, typically
ranging from wavelengths of approximately 400 to 700 nanometers. Each wavelength
corresponds to a specific color that humans perceive, from violet (shorter wavelengths) to red
(longer wavelengths). Cameras and imaging devices capture this range of light to create color
images, which are then processed and represented in different color spaces.

Color Image

A color image represents visible light using combinations of primary colors. Most digital color
images are stored in three separate channels corresponding to red, green, and blue light
intensities. By combining these three channels, we can produce a wide range of colors that
closely match human color perception.

Color Spaces

A color space is a way of representing colors in a structured format, allowing images to be
processed and analyzed in various applications like object detection, image compression, and
enhancement. Each color space has unique properties and is suitable for different tasks. Some
of the most common color spaces include RGB, HSV, YCbCr, and L*a*b*.

RGB Color Space



RGB is the most common color space, representing colors by their red, green, and blue (RGB)
components. Each color is defined by a combination of these three values, ranging from 0 to 1
(normalized) or 0 to 255 in 8-bit representation.

●​ Primary Colors:
○​ Red (1,0,0): Maximum red, no green, no blue.
○​ Green (0,1,0): Maximum green, no red, no blue.
○​ Blue (0,0,1): Maximum blue, no red, no green.

RGB is often visualized as a color cube where each axis corresponds to one of the RGB
values. The color at any point inside the cube is a mix of these three colors.

Drawbacks of RGB:

●​ Channel Correlation: The RGB channels are highly correlated, meaning changes in
one channel often affect perceived brightness and color, which can complicate
color-based tasks like segmentation.
●​ Non-perceptual: RGB is not aligned with human color perception, making it difficult to
manipulate colors in a way that corresponds to how we intuitively see and perceive
them.

Despite these drawbacks, RGB remains the default color space for most imaging devices and
digital displays due to its straightforward representation.

HSV Color Space



HSV (Hue, Saturation, Value) is an intuitive color space that aligns more closely with how
humans perceive colors. It is useful for color-based image processing and editing tasks.

●​ Hue (H): Represents the color type, ranging from 0 to 1. Hue is an angular value, often
visualized on a color wheel.
●​ Saturation (S): Represents the intensity of the color, with 0 being grayscale and 1 being
fully saturated color.
●​ Value (V): Represents the brightness, with 0 being black and 1 being full brightness.

Primary HSV Colors:

●​ H (S=1, V=1): Represents the pure color tone at maximum saturation and brightness.
●​ S (H=1, V=1): Maximum color intensity.
●​ V (H=1, S=0): Represents grayscale brightness.

The HSV color space is ideal for color-based segmentation, filtering, and detection, as colors
can be manipulated independently of brightness and saturation.

YCbCr Color Space



YCbCr is widely used in image and video compression due to its efficient representation of color
and brightness. It separates luminance (Y) from chrominance (Cb and Cr) components.

●​ Y: Represents the luma or brightness component.


●​ Cb: Blue-difference chroma component.
●​ Cr: Red-difference chroma component.

In digital imaging, the Y channel handles most of the intensity information, while Cb and Cr
channels represent color differences. This separation allows for efficient compression by
reducing the resolution or bit-depth of the chrominance channels without significantly affecting
perceived image quality.

Primary YCbCr Values:

●​ Y (Cb=0.5, Cr=0.5): Luma component at mid-level chrominance.


●​ Cb (Y=0.5, Cr=0.5): Blue-difference component.
●​ Cr (Y=0.5, Cb=0.5): Red-difference component.

YCbCr is widely used in television and digital video standards due to its fast computation and
compatibility with compression algorithms.

L*a*b* Color Space



The L*a*b* color space (also known as CIELAB) is designed to be perceptually uniform. It is
based on human vision, with L* representing lightness and a*, b* representing color-opponent
dimensions (green–red and blue-yellow).

●​ L*: Lightness, ranging from 0 (black) to 100 (white).


●​ a*: Green–red component, where positive values indicate red and negative values
indicate green.
●​ b*: Blue–yellow component, where positive values indicate yellow and negative values
indicate blue.

L*a*b* color space is used in applications where color accuracy and perceptual uniformity are
essential, such as color correction, editing, and comparison.

Color Space    Properties                          Common Uses
RGB            Device-friendly, non-perceptual     Display, general image storage
HSV            Intuitive, perceptual attributes    Color detection, segmentation
YCbCr          Efficient for compression           Video, TV broadcasting
L*a*b*         Perceptually uniform, accurate      Color correction, analysis
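As a brief illustration, OpenCV can convert between these color spaces in one call each; note that OpenCV loads images in BGR order and names the second space YCrCb. The file name is an assumption.

```python
# Converting one BGR image into the color spaces discussed above.
import cv2

bgr = cv2.imread("input.jpg")                    # OpenCV reads images as BGR, not RGB

hsv   = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)     # hue, saturation, value
ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)   # luma plus chroma (OpenCV channel order: Y, Cr, Cb)
lab   = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)     # CIELAB (L*, a*, b*)

print(hsv.shape, ycrcb.shape, lab.shape)
```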

Exercises
1.​ What is visible light, and how does it relate to the concept of color in digital imaging?
2.​ Describe the RGB color space. Why is it the default color space in most digital devices,
and what are its drawbacks?
3.​ Explain how the HSV color space differs from RGB. Why is HSV considered more
intuitive for certain color-based applications?
4.​ What are the primary components of the YCbCr color space, and why is it commonly
used in video compression?
5.​ Describe the purpose of each component in the L*a*b* color space and explain why it’s
useful for tasks requiring perceptual uniformity.
6.​ How does separating luma (Y) from chrominance (Cb, Cr) in YCbCr enable more
efficient compression?
7.​ What kind of tasks are best suited for the HSV color space, and why? Provide an
example of an application where HSV would be more useful than RGB.
8.​ What does it mean for a color space to be 'perceptually uniform,' and which color space
is designed with this property in mind?
9.​ If you were processing an image to adjust its brightness independently of color, which
color space might offer the most straightforward approach?
10.​Explain how the different axes in the RGB color cube represent color combinations.
What color would you get if R=1, G=0, and B=0?

Problem 1: Calculate the number of bits needed to represent 2 minutes of video composed of
image frames with a resolution of 1920×1080 and a frame rate of 30 fps, for:

i) a grey-scale video

ii) a color video

iii) a black & white video



CSE463
Computer Vision: Fundamentals and Applications
Lecture 4
Point Operators and Filtering

Point Operators

Point operators are basic image processing transformations that apply adjustments to each
individual pixel independently, without considering neighboring pixels. They modify pixel values
based on a specific formula or parameter, such as brightness or contrast, to adjust the
appearance of an image.

Pixel Transforms

Pixel transforms are the building blocks of point operators, allowing the adjustment of brightness
and contrast. Two commonly used point processes are multiplication and addition with a
constant, g(x) = a·f(x) + b, where a > 0 and b are often called the gain and bias parameters.

1.​ Brightness and Contrast Adjustment:


○​ Multiplicative Transformation: Adjusts the contrast by scaling pixel values. The
contrast gain factor controls how much contrast is increased or decreased.

○​ Additive Transformation: Adjusts brightness by adding a bias value b to each


pixel, shifting its intensity uniformly across the image.

2.​ Spatially Varying Gain and Bias:


○​ For effects like vignetting, the gain and bias values can vary across the image
rather than being constant. This allows for more localized adjustments, often
used for creative or artistic effects.
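A minimal NumPy sketch of the gain/bias point operator g(x) = a·f(x) + b; the gain and bias values are example choices.

```python
# Contrast (gain) and brightness (bias) adjustment applied independently to each pixel.
import numpy as np

def adjust_contrast_brightness(image, gain=1.2, bias=20):
    out = gain * image.astype(np.float32) + bias     # g(x) = a*f(x) + b
    return np.clip(out, 0, 255).astype(np.uint8)     # keep values in the valid 8-bit range

img = np.random.randint(0, 256, (4, 4), dtype=np.uint8)   # stand-in image
print(adjust_contrast_brightness(img, gain=1.5, bias=10))
```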

Linear Operations and Image Blending

Linear operators use the superposition principle, meaning the transformation applied to a
combination of inputs equals the sum of the transformations applied to individual inputs. A
simple example is adjusting the brightness using a multiplicative gain.

●​ Linear Blend Operator: This is often used in image blending and cross-dissolving,
where two images are blended based on an alpha value, α, that varies from 0 to 1. For
instance:

○​ If α = 0, only the background image is visible.


○​ If α = 1, only the foreground image is visible.
○​ When 0 < α < 1, the images blend proportionally to α.
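A short sketch of the linear blend with OpenCV's addWeighted; the file names and the alpha value are assumptions, and both images are assumed to have the same size.

```python
# Linear blend (cross-dissolve): out = (1 - alpha) * background + alpha * foreground.
import cv2

background = cv2.imread("background.jpg")
foreground = cv2.imread("foreground.jpg")

alpha = 0.3   # 0 -> only background, 1 -> only foreground
blended = cv2.addWeighted(foreground, alpha, background, 1.0 - alpha, 0.0)
cv2.imwrite("blended.jpg", blended)
```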

Non-Linear Transformations: Gamma Correction

Gamma correction is a critical step in image processing and display technology. It addresses the
nonlinear relationship between the intensity values of image pixels and their perceived
brightness. The goal is to ensure that images appear natural and consistent across different
devices and lighting conditions.

1. Why is Gamma Correction Needed?

Human vision does not perceive brightness linearly. For example, doubling the intensity of light
does not double the perceived brightness. Similarly:

●​ Most display devices (like monitors and TVs) do not render brightness linearly due to
hardware constraints.
●​ Camera sensors capture light linearly, which does not align with how humans perceive
brightness.

Gamma correction adjusts the pixel intensity values to align with the nonlinear perception of
brightness by the human eye and the nonlinear response of display devices.

2. Gamma Function

The transformation is mathematically represented as:

V_out = V_in^(1/γ), where V_in is the input intensity normalized to [0, 1] and γ (gamma) controls the shape of the curve.

For practical purposes:

●​ Gamma > 1: Makes darker regions brighter.


●​ Gamma < 1: Makes brighter regions darker.
●​ Gamma ≈ 2.2: Common standard used in digital displays to compensate for their default
nonlinear response.
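A small sketch of gamma correction under the convention used above (V_out = V_in^(1/γ) on intensities normalized to [0, 1]); the test values are illustrative.

```python
# Gamma correction on an 8-bit image.
import numpy as np

def gamma_correct(image_u8, gamma=2.2):
    v = image_u8.astype(np.float32) / 255.0          # normalize to [0, 1]
    corrected = np.power(v, 1.0 / gamma)             # gamma > 1 brightens dark regions
    return (corrected * 255.0).astype(np.uint8)

img = np.array([[0, 64, 128, 255]], dtype=np.uint8)
print(gamma_correct(img))    # dark values move up noticeably, 255 stays 255
```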

Color Transforms

Color transforms modify an image's color properties, adjusting individual channels to alter
brightness and balance, or to convert between color spaces.

1.​ Understanding Color Channels: Color images consist of correlated signals (RGB
channels) due to the interaction of light, sensors, and human perception.
2.​ Brightness Adjustment:
○​ Adding a constant to each RGB channel uniformly increases brightness but may
alter the color balance.
○​ Chromaticity Coordinates: Adjustments using chromaticity coordinates help
maintain perceptual color qualities without affecting hue or saturation.
3.​ Color Balancing: Corrects lighting discrepancies (e.g., yellowish hue from incandescent
lighting) by scaling each channel separately or applying a 3×3 color twist matrix for
complex color transformations.
4.​ Color Spaces:
○​ RGB: Common in digital displays and image files.
○​ YCbCr: Often used in video compression for efficient storage.
○​ HSV: An intuitive color space for tasks involving hue manipulation.
○​ L*a*b*: Designed for color accuracy and perceptual uniformity in color-critical
applications.

Image Matting and Compositing

Image matting and compositing are techniques to extract and seamlessly blend objects from
one image into another.

1.​ Alpha Matting:


○​ Uses an alpha matte, a grayscale image, where each pixel indicates
transparency level.
○ Alpha Channel (α): Controls transparency, with
○ α = 1: Fully opaque (inside the object)
○ α = 0: Fully transparent (outside the object)
○ 0 < α < 1: Partial transparency on object boundaries, giving smooth transitions on edges and avoiding visual artifacts (e.g., "jaggies").

2.​ Compositing Formula – The Over Operator:

○​ Introduced by Porter and Duff (1984), the Over operator defines a way to blend
a foreground object over a background using alpha values.
○ Formula: C = α·F + (1 − α)·B, where F and B are the foreground and background colors and α is the foreground's alpha. This combines the RGB and alpha channels to blend images while avoiding harsh edges, ensuring smooth transitions.
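A minimal NumPy sketch of the over operator, assuming float RGB images in [0, 1], a single-channel alpha matte, and an opaque background:

```python
# The "over" operator: C = alpha * F + (1 - alpha) * B, applied per pixel.
import numpy as np

def composite_over(foreground, background, alpha):
    a = alpha[..., np.newaxis]                 # broadcast the matte over the RGB channels
    return a * foreground + (1.0 - a) * background

F = np.ones((2, 2, 3)) * [1.0, 0.0, 0.0]       # red foreground
B = np.ones((2, 2, 3)) * [0.0, 0.0, 1.0]       # blue background
alpha = np.array([[1.0, 0.5],
                  [0.0, 0.25]])                # per-pixel transparency
print(composite_over(F, B, alpha))
```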

Image Filtering

(*Read section 3.2 pg 119-122 of Richard Szeliski)

Image filtering adjusts pixel values based on neighboring pixels, making it essential for tasks
like noise reduction, blurring, sharpening, and edge detection.

1.​ Linear Filtering:

○​ Linear filters apply a weighted sum of neighborhood pixel values to compute


each output pixel.
○​ Kernel or Mask: The weights for neighboring pixels, also called filter coefficients,
determine the filter's effect.

2.​ Correlation and Convolution:



○​ Correlation: Applies the kernel directly across the image.


○​ Convolution: Similar to correlation, but with reversed kernel offsets, making it
shift-invariant.
○​ Both correlation and convolution can be represented in matrix form, where each
image is flattened into a vector, enabling computational efficiency.
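The difference between correlation and convolution is easy to see numerically: convolution flips the kernel before taking the weighted sum. The SciPy sketch below uses an asymmetric (Sobel-like) kernel so the two results differ; the values are illustrative.

```python
# Correlation vs. convolution on a small example.
import numpy as np
from scipy.signal import correlate2d, convolve2d

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)

corr = correlate2d(image, kernel, mode="valid")
conv = convolve2d(image, kernel, mode="valid")
print(np.allclose(corr, -conv))    # True here: flipping this particular kernel negates it
```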

Exercises (from the book)

Answer:

Yes, it is necessary to undo the gamma in the color values to achieve accurate exposure
matching along the seam when stitching images taken with different exposures.

In image stitching, particularly when dealing with two images taken with different exposures, it's
crucial to ensure that the RGB values align along the seam for a smooth transition. This process
often involves adjusting the brightness and color balance between the images. Regarding
gamma correction, it depends on how the images were captured and stored:

1.​ Gamma Correction: Most images are encoded with gamma correction to adjust for the
nonlinear response of display devices and human vision. This means that the pixel
values in the image are stored in a gamma-corrected space, which is typically darker in
the shadows and lighter in the highlights than a linear color space.
2.​ Undoing Gamma: To align the two images for stitching, you often need to perform
operations like brightness adjustment, blending, or histogram matching. These
operations should ideally be done in a linear color space, where the RGB values
represent the actual intensities of light.​
If the images you're stitching have gamma correction applied, it’s advisable to undo the
gamma correction before performing the adjustments. This is because adjusting RGB
values in a gamma-corrected color space might lead to inaccurate results, especially
when manipulating pixel intensities to match exposure levels. By working in the linear
space, you ensure that the color and brightness adjustments are applied correctly, and
then you can apply the gamma correction back to the result after the adjustment.

Answers:

1. Adjusting the RGB Values to Make the White Color Neutral

To adjust the RGB values in an image so that a sample "white color" (Rw, Gw, Bw) appears
neutral, you need to scale the RGB channels in a way that the white point becomes a neutral
white (where all channels are equal, typically RGB = 1 or RGB = (1, 1, 1) in normalized space).

The general approach is to compute a scaling factor for each color channel based on the ratio of
the target white color to the observed white color in the image. This scaling should correct for
any color casts without significantly affecting the exposure.

2. Simple Scaling vs 3x3 Color Twist Matrix

In this case, the transformation involves simple per-channel scaling of the RGB values. Since
you're adjusting the RGB channels independently (multiplying each channel by a constant
factor), a full 3x3 color twist matrix is not necessary for this operation. A full 3x3 matrix would be
used for more complex color transformations that mix the RGB channels, but here, you're
adjusting each channel independently. So, the correct transformation in this case is a simple
per-channel scaling, which is a diagonal scaling operation.
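A hedged sketch of this per-channel (diagonal) scaling in NumPy; the sampled white values are assumptions for illustration.

```python
# White balancing by per-channel scaling: map the observed white (Rw, Gw, Bw) to neutral white.
import numpy as np

def white_balance(image, white_sample):
    rw, gw, bw = white_sample                          # observed white in the image, in [0, 1]
    scale = np.array([1.0 / rw, 1.0 / gw, 1.0 / bw])   # one gain per channel (diagonal scaling)
    return np.clip(image * scale, 0.0, 1.0)

img = np.random.rand(4, 4, 3)                          # stand-in RGB image in [0, 1]
balanced = white_balance(img, white_sample=(0.9, 0.8, 0.7))
print(balanced.shape)
```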

Exercise (not from book)


1.​ What is gamma correction, and why is it important in digital image processing?
2.​ Explain why gamma correction is necessary to align digital images with human
perception of brightness.
3.​ Describe how the value of gamma affects the brightness and contrast in an image.
4.​ Given an image pixel with an original intensity value of 0.5, calculate the
gamma-corrected intensity value for gamma = 2.2 and gamma = 0.5.
5.​ If the gamma value is set to greater than 1, what impact does it have on the darker
regions of an image? Why?
6.​ Describe the process of forward and inverse gamma correction. Why are both types of
correction necessary?
7.​ Why do digital displays use a gamma value of approximately 2.2 as the standard?
8.​ How does gamma correction ensure consistency of brightness across different devices?
9.​ What are the challenges associated with gamma correction when displaying images on
multiple devices with varying gamma responses?
10.​How does gamma correction help in reducing data size in video compression?
11.​Explain how gamma correction improves the clarity and realism of images, especially in
areas with very high or low brightness levels.
12.​What are some real-world applications where gamma correction is crucial, and why?
13.​How would an image look if it is displayed without gamma correction on a digital screen?
14.​In what way does gamma correction bridge the gap between linear intensity captured by
cameras and nonlinear human visual perception?
