Audio Enhancement: A Computer Vision Approach
Ramin Anushiravani
Electrical And Computer Engineering Department
University Of Illinois at Urbana-Champaign
Urbana, IL
Abstract
Many audio enhancement applications can be simplified with some
user interface. The purpose of this project is to remove an unwanted
noise, mimicked by the user, from an arbitrary recording using
object detection techniques.
1 Introduction
Imagine having a recording of a lecture you attended, and someone’s cellphone rang
in the middle of your recording. Or say you are recording a live concert, and there
is too much screaming in the background. There are no easy automated ways of
recognizing these unwanted sounds as actual noise. Your best shot at solving
this problem is to remove all time-frequency bins corresponding to that unwanted
noise, assuming you know some basic signal processing. You might also be able
to come up with some probabilistic models to decompose your sound into a mixture
of sounds, where one of the components would hopefully correspond to your unwanted
noise [1]. An alternative solution is to employ source separation techniques to
separate the noise from the desired signal [2]. This problem can be greatly simplified
(at least mathematically) by some user interface. If we know approximately what
the unwanted noise sounds like, then we might be able to search for the
most likely match in the noisy recording.
2 Motivation
In the field of audio processing, sound is usually visualized using spectrograms.
A time-domain representation is not very informative about the content of the
signal, since it only shows the signal amplitude versus time. Spectrograms,
however, show a time-frequency representation of a sound; they can be derived
using the Short-Time Fourier Transform (STFT). The STFT is essentially the DFT of
the signal over overlapping frames that are aligned next to each other. The intensity
values are then depicted using a colormap, which is called the spectrogram of that
sound [3]. There are numerous alternatives to STFTs for visualizing sounds, each
optimized for a certain application, e.g. the LP spectrogram [4] and the cochleagram
[5], among many others that can be found in spectral analysis textbooks. An example
of these representations is given in Figure 1.
Figure 1: Left: spectrogram (top) and time-domain waveform (bottom). Middle: LP spectrogram.
Right: cochleagram.
Inspired by these different methods of visualization, this project aims
to remove a specific noise. The user is asked to mimic a noise from the noisy
recording. Computer vision approaches are then applied to detect the noise in the
spectrogram. More specifically, we are going to look at sound as if it were an
image and apply object detection techniques to detect a noise object in the sound
image. The image is then resynthesized, i.e. converted back to the sound domain, by
converting the picture back to a spectrogram, which is then converted back to the time
domain using overlap-add and the inverse STFT [6]. With this introduction, now we
can look at the removal of unwanted noise as if we are trying to detect cats in an
image as shown in Figure 2.
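As a sketch of this pipeline's transform pair, the STFT and its overlap-add inverse can be written in a few lines of numpy. This is an illustrative Python version, not the project's MATLAB code; the 1024-sample Hanning window with 25% overlap matches the parameters reported in Section 3.1.

```python
import numpy as np

def stft(x, win_len=1024, hop=768):
    """STFT: windowed DFTs of overlapping frames (25% overlap for hop=768)."""
    win = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[k*hop : k*hop + win_len] * win for k in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T          # shape: (freq bins, time frames)

def istft(X, win_len=1024, hop=768):
    """Inverse STFT by overlap-add: inverse DFT of each frame, windowed again
    for synthesis, overlap-added, then normalized by the accumulated squared
    window so the round trip is (nearly) exact."""
    frames = np.fft.irfft(X.T, n=win_len, axis=1)
    win = np.hanning(win_len)
    n = (frames.shape[0] - 1) * hop + win_len
    y, norm = np.zeros(n), np.zeros(n)
    for k, frame in enumerate(frames):
        y[k*hop : k*hop + win_len] += frame * win
        norm[k*hop : k*hop + win_len] += win ** 2
    return y / np.maximum(norm, 1e-12)

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
X = stft(x)        # 513 frequency bins x 20 frames
y = istft(X)       # reconstructs x (away from the very first and last samples)
```

Dividing by the accumulated squared window makes the analysis-synthesis round trip exact wherever the window sum is nonzero, which is what lets us edit the spectrogram and still resynthesize a clean waveform.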
3 Preprocessing
Since we want to treat a sound as an image and then synthesize it back to sound, we
need to do some preprocessing to make sure the sounds are easily visible in time
and frequency.
Figure 2: Cat Detection vs Noise Detection
3.1 From Sound Samples to Image Pixels
When visualizing a sound using a spectrogram, people try different colormaps to
make it easier to see the sound. This could mean a good colormap or different
intensity levels, e.g. log values. For example, the spectrogram on the left of Figure 3
hides many of the time-frequency bins and is not a good candidate for this project.
A better visualization is shown next to it.
Figure 3: A bad choice of colormap is shown on the left. A better representation is shown on the
right.
In order to save the STFT of a sound using our own colormap in MATLAB, we
need to save the figures, which then look like the one shown in Figure
4. In order to extract only the spectrogram of the sound (and not the title and white
areas around it), the following can be done:
ind_y = argmax_i ( Σ_{i=1}^{w} image′ / w ) > ( Σ_{i=1}^{w} image′ / (αw) )   (1)
where ind_y is a two-element vector containing the corresponding start and end y-
positions of the spectrogram in Figure 4, w is the width of the image, and the (′)
operator corresponds to taking the gradient of the image with respect to the x and
y positions. α is a threshold factor greater than one for determining the major peaks
in the mean gradient. Basically, we take the derivative of the image with respect
to the x and y positions and then take the average of that over the width to get
the y-positions. The same procedure can be done over the transpose of the image
and a sum over the height of the image to extract the start and end x-positions.
Figure 4: Spectrogram saved in MATLAB
I chose a window size of 1024 samples using a Hanning window, with 25% overlap
to construct the STFTs and used overlap-add for taking the inverse STFT back to
time domain (sound domain). I chose the hot colormap with the intensity values
to the power of 0.35 as my colormap.
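The cropping rule of equation (1) can be sketched as follows; this is one plausible numpy reading of it, where rows whose mean gradient exceeds a 1/α fraction of the strongest row are taken as the plot boundary. The `crop_axis` helper, the α value, and the toy canvas are illustrative assumptions.

```python
import numpy as np

def crop_axis(img, alpha=3.0):
    """Find start/end rows of the plot area: rows whose mean absolute
    vertical gradient exceeds 1/alpha of the strongest row mean (alpha > 1)."""
    grad = np.abs(np.diff(img.astype(float), axis=0))   # image' w.r.t. y
    row_mean = grad.mean(axis=1)                        # average over the width
    peaks = np.flatnonzero(row_mean > row_mean.max() / alpha)
    return peaks[0], peaks[-1]                          # ind_y: start and end

# Toy "saved figure": a white canvas with a dark plot area in rows 20..59,
# columns 10..69, standing in for the spectrogram with its white margins.
canvas = np.full((100, 80), 255.0)
canvas[20:60, 10:70] = 0.0
y0, y1 = crop_axis(canvas)        # rows where the plot area begins/ends
x0, x1 = crop_axis(canvas.T)      # same procedure on the transpose for x
```

Running the same helper on the transpose, as the text describes, recovers the start and end x-positions without any extra code.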
3.2 Object Extraction
The noise mimicked by the user will have some background noise and many time-
frequency bins that do not correspond to the mimicked noise, as shown in Figure
5. In order to suppress the background noise, a very strong spectral subtraction
algorithm is applied [7]. This does not necessarily sound good, since
some of the mimicked noise components are also removed. However, it makes
sure that the noise object is as distinguishable as it can be from the irrelevant
components in the noisy recording. In order to remove the unrelated frequency
components, a very strong threshold is defined that keeps only the time-
frequency bins corresponding to the mimicked noise components.
Pseudocode for removing irrelevant frequencies:
noise_object(i, j) = noise_object(i, j)   if noise_object(i, j) > threshold
                     0                    otherwise
Figure 5: Spectrogram saved in MATLAB
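The thresholding rule above is a one-line `np.where` in numpy. This is a minimal sketch; `keep_strong_bins`, the toy matrix, and the threshold value are illustrative.

```python
import numpy as np

def keep_strong_bins(noise_obj, threshold):
    """Zero out time-frequency bins at or below the threshold, keeping only
    the components that plausibly belong to the mimicked noise."""
    return np.where(noise_obj > threshold, noise_obj, 0.0)

# Toy magnitude spectrogram of the mimicked noise: strong bins survive,
# weak background bins are set to zero.
S = np.array([[0.9, 0.1],
              [0.05, 0.7]])
cleaned = keep_strong_bins(S, threshold=0.5)
```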
4 Object Recognition
Common object recognition algorithms follow these steps:
Pseudocode for object recognition
1- Scan the image with a fixed window at different scales.
2- Extract Histogram of Oriented Gradients (HOG) features from each patch.
3- Score each patch by comparing it to the object's HOG features.
4- Perform Non-Maximum Suppression.
In this project, we have one important advantage over generic object detection
in an image: time is the x-axis. We can assume
that the user's mimicked noise is as long as the actual noise in the noisy recording
(it is fine if the noise repeats at different time positions). Since we also have
harmonics, as long as the user approximately provides us
with the fundamental frequency of the noise, we are also provided with the
height of the scanning window. That is, after detecting the object, we can go ahead
and remove the harmonics as well. As a result, we do not have to scan the image
at different scales.
4.1 Scanning the Image
Scanning the image with overlaps can be a very time-consuming task depending on the
implementation, and it can also greatly affect the accuracy of the algorithm. I
tried multiple ways of scanning the image spectrogram, listed below.
1- At each position in the image, extract four windows with 50% overlap, two on top
and two on the bottom.
2- Extract windows in a row from the image with 12.5% overlap throughout the
whole image.
Figure 6 shows the result from each of these methods for the case of having two noise objects.
Figure 6: The left figure shows scanning the image with 50% overlap, and the one on the right shows
scanning of the image with 12.5% overlap, both for the case of having two similar noise objects at
different time and frequency positions (different x and y center pixel values)
Either of these procedures can be a very time-consuming task depending on how long
the signal is. One way to speed up the process is by cascading classifiers, a concept
somewhat similar to the one used in the Viola-Jones face detection algorithm. This
can be done by first convolving the image with the noise object, which can be sped
up using the 2D Fourier Transform. This tells us approximately where the desired
noise is located in the 2D plane. This is basically a weak classifier that rejects
many of the patches that do not include the desired noise.
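This convolution-based weak classifier might be sketched as follows. It is an illustrative numpy/scipy version; the response threshold `alpha` and the toy spectrogram are assumptions, not values from the paper.

```python
import numpy as np
from scipy.signal import fftconvolve

def candidate_mask(spec, template, alpha=0.8):
    """Weak classifier: correlate the spectrogram with the noise template
    via FFT-based convolution, then keep only high-response locations."""
    # flipping the template turns convolution into cross-correlation
    response = fftconvolve(spec, template[::-1, ::-1], mode='same')
    return response > alpha * response.max()

rng = np.random.default_rng(0)
spec = rng.random((64, 64)) * 0.1       # low-level background
template = np.ones((8, 8))              # crude noise-object template
spec[20:28, 30:38] += 1.0               # plant one noise object
mask = candidate_mask(spec, template)   # True only near the planted object
```

Only the handful of positions where the template nearly fully overlaps the planted object survive the threshold, so the expensive HOG scoring later only needs to run on those candidates.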
There is another way of evaluating the object recognition algorithm that is
vectorized and does not involve scanning the image. This is discussed at the end
of the Classification subsection.
Figure 7: Response of the image to the noise object. The subplot on the bottom is the thresholded
response of the image to the noise object.
4.2 HOG Features
HOG features are descriptors that capture the edge orientations of an image within
cells of a defined size, and they are robust to illumination changes and small local
deformations. HOG features are mainly known for object detection applications in
computer vision. Since they require very careful tuning and normalization, I used an
outside library, VLFeat [8], to compute HOG features. In this project I used a cell
size of 8 by 8 to extract the HOG features of a grayscale image (instead of RGB, for
easier implementation).
After extracting the HOG features from the noise object and all the patches extracted
Figure 8: HOG features of the user's mimicked noise
from the image, we need to classify every patch as either a noise object or a non-noise
patch that belongs to the clean signal.
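For intuition about what VLFeat computes, a stripped-down HOG (per-cell orientation histograms weighted by gradient magnitude, without VLFeat's block normalization) can be sketched as:

```python
import numpy as np

def hog_features(img, cell=8, n_bins=9):
    """Simplified HOG: for each cell x cell block, histogram the unsigned
    gradient orientations, weighted by gradient magnitude. (A real HOG
    implementation such as VLFeat's adds careful block normalization.)"""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180          # unsigned orientation
    H, W = img.shape
    feats = np.zeros((H // cell, W // cell, n_bins))
    for i in range(H // cell):
        for j in range(W // cell):
            sl = np.s_[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            bins = (ang[sl] / (180 / n_bins)).astype(int) % n_bins
            np.add.at(feats[i, j], bins.ravel(), mag[sl].ravel())
    return feats

# A horizontal intensity ramp has purely horizontal gradients, so all of the
# energy lands in the 0-degree orientation bin of every cell.
img = np.tile(np.arange(32, dtype=float), (32, 1))
F = hog_features(img)          # shape (4, 4, 9)
```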
4.3 Classification
In order to classify each patch of the image, I used two different methods.
1- K-Nearest Neighbor: vectorize all the HOG features of the image into one big
matrix. The error function used in K-NN is a Euclidean distance measure.
error = vec(noise_hog)² − vec(image_hog)²   (2)
This error function made a lot of misclassifications, so I propose the following
error function to get better accuracy.
2- The modified error function is as follows:
error = ‖noise_hog − image_hog‖₂ / ‖noise_hog‖₂   (3)
The latter error function seems to give much better accuracy in localizing the noise
object. The resulting objects for the case of 50% overlap are shown in Figure 9; the
score on top shows the value of the latter error function. The resulting object
for the 12.5% overlap is shown on the right of Figure 9. As you can see, one object
might be detected multiple times, so in order to detect multiple objects we can do
the following:
Figure 9: 50% overlap windowing on the left, 12.5% overlap in a row on the right
Figure 10: NMS input on the left. NMS output from scanning method 1 in the middle. NMS output
from scanning method 2 on the right.
Detecting Multiple Objects
1- Detect the most likely object inside the image (the noisy recording's STFT in the
pixel domain).
2- Remove that object from the image (leaving a blank space, to ensure it is not
detected in the next round).
3- Detect the second object and iterate. Since we have the advantage of the user
interface, we know how many objects to look for and, as a result, how many times
to iterate.
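The detect, blank, and iterate procedure can be sketched as follows. For brevity this toy version matches raw pixel patches with the normalized error of equation (3) instead of HOG features; the helper name and toy image are illustrative.

```python
import numpy as np

def detect_objects(image, template, n_objects):
    """Iteratively find the best-matching patch (lowest normalized error,
    as in equation 3, but on raw pixels here instead of HOG features),
    then blank it out so the next iteration finds a different object."""
    img = image.copy()
    h, w = template.shape
    t_norm = np.linalg.norm(template)
    found = []
    for _ in range(n_objects):
        best, best_err = None, np.inf
        for i in range(img.shape[0] - h + 1):
            for j in range(img.shape[1] - w + 1):
                err = np.linalg.norm(img[i:i+h, j:j+w] - template) / t_norm
                if err < best_err:
                    best, best_err = (i, j), err
        found.append(best)
        i, j = best
        img[i:i+h, j:j+w] = 0.0        # blank space: not detected again
    return found

image = np.zeros((32, 32))
template = np.full((4, 4), 1.0)
image[2:6, 3:7] = 1.0                  # object 1
image[20:24, 10:14] = 1.0              # object 2
hits = detect_objects(image, template, n_objects=2)
```

Because the user tells us how many objects to expect, the loop count is known in advance and no stopping criterion is needed.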
4.4 Non-Maximum Suppression (NMS)
The purpose of NMS is to check whether the objects found in the image overlap. If
they do, we pick the one with the highest score; if they do not overlap much, we
pick both. The amount of overlap between each patch and the resulting object is
shown in Figure 9; the plots on its diagonal are the overlap of each patch with
itself. I then extracted the object with the highest overlap (as it already has the
highest score). These results were improved with the 12.5% overlap. I added another
rule in addition to the ones mentioned above: I decreased the size of the window
and then checked the amount of overlap for every other patch in the image. The
resulting object and its mask are shown in Figure 10.
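A minimal overlap-based NMS in the spirit of this section might look like the following sketch; the box format and the 0.5 overlap threshold are assumed values, not ones from the paper.

```python
import numpy as np

def overlap(a, b):
    """Intersection-over-union of two boxes given as (y, x, h, w)."""
    ay, ax, ah, aw = a
    by, bx, bh, bw = b
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter = iy * ix
    return inter / (ah * aw + bh * bw - inter)

def nms(boxes, scores, thresh=0.5):
    """Keep the highest-scoring box; drop any box overlapping it by more
    than `thresh`; repeat on the remaining boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order):
        i = order[0]
        keep.append(i)
        order = order[1:][[overlap(boxes[i], boxes[j]) <= thresh
                           for j in order[1:]]]
    return keep

# Two near-duplicate detections of one object plus one distinct detection:
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (30, 30, 10, 10)]
kept = nms(boxes, np.array([0.9, 0.8, 0.7]))   # duplicate at index 1 is dropped
```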
4.5 Vectorized Method for Object Recognition
Another error function is the sum of squared errors:
error = Σ_{i=1}^{#patches} ( vec(noise_hog) − vec(image_hog(i)) )²   (4)
      = Σ_{i=1}^{#patches} [ vec(noise_hog)² + vec(image_hog(i))² − 2 vec(image_hog(i)) · vec(noise_hog) ]   (5)
Figure 11: Left: removing the whole object. Middle: subtracting the template. Right: ideal case of
subtraction
The first term in equation 5 can be evaluated by squaring the noise object and
summing over all its elements. The last term in equation 5 can be calculated by
convolving the noise object with the image, as shown in Figure 7. The middle term
can be calculated using an integral image [9], which gives us the summation over
each patch.
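The three terms of equation (5) map directly onto code: a scalar template energy, an integral-image lookup for the per-patch energy, and one FFT-based convolution for the cross term. This is an illustrative sketch, not the project's MATLAB implementation, and it works on raw pixels rather than HOG features.

```python
import numpy as np
from scipy.signal import fftconvolve

def ssd_map(image, template):
    """Sum-of-squared-differences of every patch against the template,
    expanded as in equations (4)-(5): a constant term, a per-patch energy
    term from an integral image, and a cross term from FFT convolution."""
    h, w = template.shape
    t_energy = np.sum(template ** 2)                      # first term
    # integral image of squared pixels -> each patch's energy (middle term)
    ii = np.cumsum(np.cumsum(image ** 2, axis=0), axis=1)
    ii = np.pad(ii, ((1, 0), (1, 0)))
    patch_energy = ii[h:, w:] - ii[:-h, w:] - ii[h:, :-w] + ii[:-h, :-w]
    # cross-correlation via convolution with the flipped template (last term)
    cross = fftconvolve(image, template[::-1, ::-1], mode='valid')
    return t_energy + patch_energy - 2 * cross

rng = np.random.default_rng(2)
image = rng.random((40, 40))
template = image[5:13, 9:17].copy()     # the true object: SSD ~ 0 there
ssd = ssd_map(image, template)
loc = np.unravel_index(np.argmin(ssd), ssd.shape)
```

Every candidate position is scored at once, with no explicit scanning loop, which is the point of the vectorized formulation.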
5 Results
When resynthesizing the sound, we can either multiply the mask, as shown in
Figure 12, with the spectrogram of the sound and get rid of the whole object, or
we can subtract only the noise template within the mask from the signal. Note
that, due to spectral subtraction, there might be some residue left when subtracting
only the noise template. The ideal case is shown on the right side of Figure 11.
Ideally, we would hope to subtract all of the noise pixels (time-frequency bins)
without subtracting any of the clean signal pixels.
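Both resynthesis options reduce to elementwise operations on the magnitude spectrogram. This is a toy sketch; the helper name and the clipping of subtracted magnitudes to zero are added assumptions.

```python
import numpy as np

def remove_object(spec, mask, template=None):
    """Either zero out the whole object region (mask multiply) or subtract
    only the noise template inside the mask, leaving the rest of each bin."""
    if template is None:
        return spec * (1 - mask)                        # remove the whole object
    return np.clip(spec - mask * template, 0, None)     # subtract the template

spec = np.array([[1.0, 0.8],
                 [0.2, 0.5]])
mask = np.array([[1, 0],
                 [0, 0]])                # only bin (0, 0) belongs to the noise
template = np.array([[0.6, 0.0],
                     [0.0, 0.0]])
full = remove_object(spec, mask)              # bin (0, 0) zeroed entirely
part = remove_object(spec, mask, template)    # bin (0, 0) keeps a 0.4 residue
```

The residue in `part` illustrates why template subtraction can leave audible traces when the mimicked template does not match the true noise exactly.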
In conclusion, the object detection method seems to be effective in removing the
noise from the noisy recording for the cases of one noise object and two
noise objects. For the case of three noise objects, there are some misdetections,
which might be resolved with a more robust NMS algorithm. However, if the
clean signal and the noise are similar in their time-frequency representation, it is
likely that the algorithm will fail to detect the noise object, e.g. when suppressing
an interfering speaker in a lecture recording.
For future work, I suggest looking into predicting the most likely pixels inside the
blank space in the noisy recording (where the noise was located before removing
it), i.e. predicting the missing time-frequency bins in the signal, or completing an
incomplete image. In addition, when localizing a deformed object (when the user
cannot mimic the noise accurately), it would be necessary to look into techniques
that are robust to deformation of the example object given by the user. This project
could also be applied to real-time sound classification, e.g. Is there screaming
in this sound? Is someone asking for help?
References
[1] Wilson, K.W., Raj, B., Smaragdis, P., Divakaran, A.: Speech denoising using
non-negative matrix factorization with priors. In: ICASSP, pp. 4029-4032 (2008)
[2] Minje Kim and Paris Smaragdis, ”Single Channel Source Separation Using
Smooth Nonnegative Matrix Factorization with Markov Random Fields,” IEEE
Workshop for Machine Learning in Signal Processing (MLSP), Southampton,
UK, September 2013
[3] Smith, Julius. "The Short-Time Fourier Transform (STFT)." CCRMA, Stanford
University, 2005
[4] Slaney, Malcolm. "The History and Future of CASA." In: Speech Separation by
Humans and Machines, Kluwer, 2005
[5] "Acoustic Analysis of Sound: Spectral Analysis." Department of Linguistics,
Macquarie University
[6] Smith, Julius O. "Overlap-Add Synthesis." CCRMA, Stanford University, 2005
[7] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square
error short-time spectral amplitude estimator," IEEE Trans. Acoustics, Speech, and
Signal Processing, vol. 32, pp. 1109-1121, Dec. 1984
[8] A. Vedaldi and B. Fulkerson, VLFeat: An Open and Portable Library of Computer
Vision Algorithms, 2008, http://www.vlfeat.org/
[9] Badgerati. "Computer Vision: The Integral Image." Computer Science Source,
3 Sept. 2010
9

More Related Content

PDF
3D Audio playback for single channel audio using visual cues
Ramin Anushiravani
 
PDF
Sound Source Localization with microphone arrays
Ramin Anushiravani
 
PDF
A Novel Uncertainty Parameter SR ( Signal to Residual Spectrum Ratio ) Evalua...
sipij
 
PPT
07 frequency domain DIP
babak danyal
 
PDF
Speech Enhancement Using Spectral Flatness Measure Based Spectral Subtraction
IOSRJVSP
 
PPTX
Frequency domain methods
thanhhoang2012
 
PPTX
Non-essentiality of Correlation between Image and Depth Map in Free Viewpoin...
Norishige Fukushima
 
PDF
In2414961500
IJERA Editor
 
3D Audio playback for single channel audio using visual cues
Ramin Anushiravani
 
Sound Source Localization with microphone arrays
Ramin Anushiravani
 
A Novel Uncertainty Parameter SR ( Signal to Residual Spectrum Ratio ) Evalua...
sipij
 
07 frequency domain DIP
babak danyal
 
Speech Enhancement Using Spectral Flatness Measure Based Spectral Subtraction
IOSRJVSP
 
Frequency domain methods
thanhhoang2012
 
Non-essentiality of Correlation between Image and Depth Map in Free Viewpoin...
Norishige Fukushima
 
In2414961500
IJERA Editor
 

What's hot (20)

PDF
Speech Processing in Stressing Co-Channel Interference Using the Wigner Distr...
CSCJournals
 
PPT
Enhancement in frequency domain
Ashish Kumar
 
PPT
Image Denoising Using Wavelet
Asim Qureshi
 
PDF
Improving the Efficiency of Spectral Subtraction Method by Combining it with ...
IJORCS
 
PDF
Lecture 10
Wael Sharba
 
PPT
Lec 07 image enhancement in frequency domain i
Ali Hassan
 
PDF
Reduced Ordering Based Approach to Impulsive Noise Suppression in Color Images
IDES Editor
 
PDF
Speech Enhancement Based on Spectral Subtraction Involving Magnitude and Phas...
IRJET Journal
 
PDF
Frequency Domain Filtering of Digital Images
Upendra Pratap Singh
 
PDF
04 1 - frequency domain filtering fundamentals
cpshah01
 
PPT
Frequency Domain Image Enhancement Techniques
Diwaker Pant
 
PPT
08 frequency domain filtering DIP
babak danyal
 
PDF
Robust Sound Field Reproduction against Listener’s Movement Utilizing Image ...
奈良先端大 情報科学研究科
 
PDF
K31074076
IJERA Editor
 
PPTX
Curved Wavelet Transform For Image Denoising using MATLAB.
Nikhil Kumar
 
PDF
APPRAISAL AND ANALOGY OF MODIFIED DE-NOISING AND LOCAL ADAPTIVE WAVELET IMAGE...
International Journal of Technical Research & Application
 
PDF
129966864160453838[1]
威華 王
 
PPTX
Comparison between Blur Transfer and Blur Re-Generation in Depth Image Based ...
Norishige Fukushima
 
PPTX
Filtering in frequency domain
GowriLatha1
 
PDF
Microphone arrays
drmaninderpal
 
Speech Processing in Stressing Co-Channel Interference Using the Wigner Distr...
CSCJournals
 
Enhancement in frequency domain
Ashish Kumar
 
Image Denoising Using Wavelet
Asim Qureshi
 
Improving the Efficiency of Spectral Subtraction Method by Combining it with ...
IJORCS
 
Lecture 10
Wael Sharba
 
Lec 07 image enhancement in frequency domain i
Ali Hassan
 
Reduced Ordering Based Approach to Impulsive Noise Suppression in Color Images
IDES Editor
 
Speech Enhancement Based on Spectral Subtraction Involving Magnitude and Phas...
IRJET Journal
 
Frequency Domain Filtering of Digital Images
Upendra Pratap Singh
 
04 1 - frequency domain filtering fundamentals
cpshah01
 
Frequency Domain Image Enhancement Techniques
Diwaker Pant
 
08 frequency domain filtering DIP
babak danyal
 
Robust Sound Field Reproduction against Listener’s Movement Utilizing Image ...
奈良先端大 情報科学研究科
 
K31074076
IJERA Editor
 
Curved Wavelet Transform For Image Denoising using MATLAB.
Nikhil Kumar
 
APPRAISAL AND ANALOGY OF MODIFIED DE-NOISING AND LOCAL ADAPTIVE WAVELET IMAGE...
International Journal of Technical Research & Application
 
129966864160453838[1]
威華 王
 
Comparison between Blur Transfer and Blur Re-Generation in Depth Image Based ...
Norishige Fukushima
 
Filtering in frequency domain
GowriLatha1
 
Microphone arrays
drmaninderpal
 
Ad

Viewers also liked (8)

PPTX
3D Spatial Response
Ramin Anushiravani
 
PDF
Techfest jan17
Ramin Anushiravani
 
PPTX
example based audio editing
Ramin Anushiravani
 
PPTX
3D audio
Ramin Anushiravani
 
PDF
Poster cs543
Ramin Anushiravani
 
PPTX
recommender_systems
Ramin Anushiravani
 
PPTX
Beamforming and microphone arrays
Ramin Anushiravani
 
PDF
Data Science - Part XIII - Hidden Markov Models
Derek Kane
 
3D Spatial Response
Ramin Anushiravani
 
Techfest jan17
Ramin Anushiravani
 
example based audio editing
Ramin Anushiravani
 
Poster cs543
Ramin Anushiravani
 
recommender_systems
Ramin Anushiravani
 
Beamforming and microphone arrays
Ramin Anushiravani
 
Data Science - Part XIII - Hidden Markov Models
Derek Kane
 
Ad

Similar to A computer vision approach to speech enhancement (20)

PDF
Review of Use of Nonlocal Spectral – Spatial Structured Sparse Representation...
IJERA Editor
 
PDF
Paper id 24201464
IJRAT
 
PDF
Highly Adaptive Image Restoration In Compressive Sensing Applications Using S...
IJARIDEA Journal
 
PPTX
Chap6 image restoration
ShardaSalunkhe1
 
PPTX
Tausif (2)
tausif2
 
PDF
50120140504008
IAEME Publication
 
PDF
Audio Noise Removal – The State of the Art
ijceronline
 
PDF
Audio Noise Removal – The State of the Art
ijceronline
 
PDF
Ch5_Restoration (1).pdf
AlaaElhaddad3
 
PDF
Advance in Image and Audio Restoration and their Assessments: A Review
IJCSES Journal
 
PDF
ADVANCE IN IMAGE AND AUDIO RESTORATION AND THEIR ASSESSMENTS: A REVIEW
IJCSES Journal
 
PPT
Digital Image Processing_ ch3 enhancement freq-domain
Malik obeisat
 
PDF
Image Processing
Tuyen Pham
 
PDF
Log polar coordinates
Oğul Göçmen
 
PDF
Frequency Image Processing
Suhas Deshpande
 
DOC
Paper on image processing
Saloni Bhatia
 
DOCX
The method of comparing two audio files
Minh Anh Nguyen
 
PDF
GRUPO 4 : new algorithm for image noise reduction
viisonartificial2012
 
PPT
Image Enhancement in Frequency Domain (2).ppt
BULE HORA UNIVERSITY(DESALE CHALI)
 
PDF
75 78
Editor IJARCET
 
Review of Use of Nonlocal Spectral – Spatial Structured Sparse Representation...
IJERA Editor
 
Paper id 24201464
IJRAT
 
Highly Adaptive Image Restoration In Compressive Sensing Applications Using S...
IJARIDEA Journal
 
Chap6 image restoration
ShardaSalunkhe1
 
Tausif (2)
tausif2
 
50120140504008
IAEME Publication
 
Audio Noise Removal – The State of the Art
ijceronline
 
Audio Noise Removal – The State of the Art
ijceronline
 
Ch5_Restoration (1).pdf
AlaaElhaddad3
 
Advance in Image and Audio Restoration and their Assessments: A Review
IJCSES Journal
 
ADVANCE IN IMAGE AND AUDIO RESTORATION AND THEIR ASSESSMENTS: A REVIEW
IJCSES Journal
 
Digital Image Processing_ ch3 enhancement freq-domain
Malik obeisat
 
Image Processing
Tuyen Pham
 
Log polar coordinates
Oğul Göçmen
 
Frequency Image Processing
Suhas Deshpande
 
Paper on image processing
Saloni Bhatia
 
The method of comparing two audio files
Minh Anh Nguyen
 
GRUPO 4 : new algorithm for image noise reduction
viisonartificial2012
 
Image Enhancement in Frequency Domain (2).ppt
BULE HORA UNIVERSITY(DESALE CHALI)
 

Recently uploaded (20)

PDF
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
PDF
Zero Carbon Building Performance standard
BassemOsman1
 
PDF
Zero carbon Building Design Guidelines V4
BassemOsman1
 
PDF
Software Testing Tools - names and explanation
shruti533256
 
PPTX
business incubation centre aaaaaaaaaaaaaa
hodeeesite4
 
PDF
settlement FOR FOUNDATION ENGINEERS.pdf
Endalkazene
 
PDF
Biodegradable Plastics: Innovations and Market Potential (www.kiu.ac.ug)
publication11
 
PPTX
Introduction of deep learning in cse.pptx
fizarcse
 
PPTX
Civil Engineering Practices_BY Sh.JP Mishra 23.09.pptx
bineetmishra1990
 
PPTX
Information Retrieval and Extraction - Module 7
premSankar19
 
PDF
The Effect of Artifact Removal from EEG Signals on the Detection of Epileptic...
Partho Prosad
 
PDF
EVS+PRESENTATIONS EVS+PRESENTATIONS like
saiyedaqib429
 
DOCX
SAR - EEEfdfdsdasdsdasdasdasdasdasdasdasda.docx
Kanimozhi676285
 
PPTX
22PCOAM21 Data Quality Session 3 Data Quality.pptx
Guru Nanak Technical Institutions
 
PPTX
Module2 Data Base Design- ER and NF.pptx
gomathisankariv2
 
PPTX
22PCOAM21 Session 1 Data Management.pptx
Guru Nanak Technical Institutions
 
PDF
Introduction to Data Science: data science process
ShivarkarSandip
 
PPTX
MT Chapter 1.pptx- Magnetic particle testing
ABCAnyBodyCanRelax
 
PDF
July 2025: Top 10 Read Articles Advanced Information Technology
ijait
 
PPT
Ppt for engineering students application on field effect
lakshmi.ec
 
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
Zero Carbon Building Performance standard
BassemOsman1
 
Zero carbon Building Design Guidelines V4
BassemOsman1
 
Software Testing Tools - names and explanation
shruti533256
 
business incubation centre aaaaaaaaaaaaaa
hodeeesite4
 
settlement FOR FOUNDATION ENGINEERS.pdf
Endalkazene
 
Biodegradable Plastics: Innovations and Market Potential (www.kiu.ac.ug)
publication11
 
Introduction of deep learning in cse.pptx
fizarcse
 
Civil Engineering Practices_BY Sh.JP Mishra 23.09.pptx
bineetmishra1990
 
Information Retrieval and Extraction - Module 7
premSankar19
 
The Effect of Artifact Removal from EEG Signals on the Detection of Epileptic...
Partho Prosad
 
EVS+PRESENTATIONS EVS+PRESENTATIONS like
saiyedaqib429
 
SAR - EEEfdfdsdasdsdasdasdasdasdasdasdasda.docx
Kanimozhi676285
 
22PCOAM21 Data Quality Session 3 Data Quality.pptx
Guru Nanak Technical Institutions
 
Module2 Data Base Design- ER and NF.pptx
gomathisankariv2
 
22PCOAM21 Session 1 Data Management.pptx
Guru Nanak Technical Institutions
 
Introduction to Data Science: data science process
ShivarkarSandip
 
MT Chapter 1.pptx- Magnetic particle testing
ABCAnyBodyCanRelax
 
July 2025: Top 10 Read Articles Advanced Information Technology
ijait
 
Ppt for engineering students application on field effect
lakshmi.ec
 

A computer vision approach to speech enhancement

  • 1. Audio Enhancment: A Computer Vision Approach Ramin Anushiravani Electrical And Computer Engineering Department University Of Illinois at Urbana-Champaign Urbana, IL Abstract Many audio enhancment applications can be simplified with some user interface. The purpose of this project is to remove a desired noise that is mimicked by a user from an arbitrary recording using object detection techniques. 1 Introduction Imagine having a recording of a lecture you attended, and someone’s cellphone rang in the middle of your recording. Or say you are recording a live concert, and there is too much screaming in the background. There are no easy automated ways of recognizing these unwanted noises as an actual noise. Your best shot at solving this problem is to remove all time-frequency bins corresponding to that unwanted noise assumming you know some basic Signal Processing. You might also be able to come up with some probablistic models to decompose your sound to a mixtures of sounds where one of the mixtures would hopefully correspond to your unwanted noise [1]. An alternative solution is to employ source separation techniques to sep- arate the noise from the desired signal [2]. This problem can be greatly simplified ( at least mathematically) by some user interface. If we know approximately how the unwanted noise sounds like then we might be able to search the signal for the most likely match in the noisy recording.
  • 2. 2 Motivation In the field of Audio Processing, sound is usually visualized using Spectrograms. A time domain representation is not very informative of what the context of the signal shows, since it only shows the singal amplitude versus time. Spectrograms, however, show a time-frequency representation of a sound and they can be derived using Short Time Fourier Transform (STFT). STFT is basically DFT of the signal at overlapping frame that are aligned next to each other. The intensity values are then depicted using a colormap, which is called the spectrogram of that sound [3]. There are numorous alternatives to STFTs for visualizing sounds each optimized for a certain application e.g. LP Spectrogram [4], and Cochleagram[5] among many others that can be found in many Spectral Analysis textbooks. An example of these representations is given in Figure 1. Figure 1: (Left:Top-Spectrogram,Bottom-Time Domain)-(Middle-LP Spectrogram)-(Right Chochleagram) Being inspired by these different methods of visualization, this project is aiming to remove a specific noise. The user is asked to mimick a noise from the noisy recording. Computer vision approaches is then applied to detect the noise in the spectrogram. More specifically, we are going to look at sound as if it was an image and apply object detection techniques to detect a noise object in the sound image. The image is then resysnthesize, converted to sound domain, by converting the picture back to spectrogram. The spectrogram is then converted back to time domain using overlap-add and inverse STFT [6]. With this introduction, now we can look at the removal of unwanted noise as if we are trying to detect cats in an image as shown in Figure 2. 3 Preprocessing Since we want to treat sounds as an image and then synthesize it back to sound, we would need to do some preprocessing to make sure there are easily visible in time and frequency. 2
  • 3. Figure 2: Cat Detection vs Noise Detection 3.1 From Sound Samples to Image Pixels When visualizing a sound using spectrogram; people try different colormaps to make it easier to visualize the sound. This could be a good colormap or different intensity levels, e.g. log values. For example, the spectrogram on the left of Figure 3 is hiding lots of the time-frequency bins and is not a good candidate for this project. A better visualization is shown next to it. Figure 3: A bad chioce colormap is shown on the left. The better representation is shown on the right. In order to save the STFT of a sound using our own colormap on Matlab, we would need to save the figures, which would then look like the one shown in Figure 4. In order to extract only the spectrogram of the sound (and not the title and white areas around it), the following can be done. indy = argmaxi( w i=1 image /w) > ( w i=1 image /αw) (1) Where indy is a 2 elements vector with the corresponding start and end y- position of the spectrogram in Figure 4. w is the width of the image and the (’) operator corresponds to taking the gradient of the image with respect to the x and y positions. α is a threshold factor bigger than one for determining the major peaks in the mean gradient. Basically, we take the derivative of the image with respect to the x and y position and then take the avarage of that over the width to get the y position. The same procedure can be done over the transpose of the image and a sum over the height of the image to extract the start and end x position. I 3
  • 4. Figure 4: Spectrogram saved on Matlab chose a window size of 1024 samples using a Hanning window, with 25% overlap to construct the STFTs and used overlap-add for taking the inverse STFT back to time domain (sound domain). I chose the hot colormap with the intensity values to the power of 0.35 as my colormap. 3.2 Object Extraction The noise mimicked by the user will have some background noise and many time- frequency bins that do not correspond to the mimicked noise as shown in Figure 5. In order to supress the background noise, a very strong Spectral Subtraction algorithm is applied [7]. Even though, this does not necessarily sounds good, since the mimicked noise components are also going to be removed. However, it makes sure that the noise object is as distinguished as it can be from the non-relavant components in the noisy recording. In order to remove the unrelated frequency components, a very strong threshold is defined that would only keep the time- frequency bins corresponding to the mimicked noise components. Pseudo code for removing irrelavent frequencies noise(i, j)object = noise(i, j)object if noise(i, j)object > threshold 0 else Figure 5: Spectrogram saved on Matlab 4 Object Recognition Common object recognition algorithms follow these steps, 4
  • 5. Pseudo code for object recognition 1- Scan the image with a fixed window at different scales. 2- Extract Histogram of Gradients (HOG) features from each patch. 3- Score each patch by comparing it to the object HOG features. 4- Perform Non-Maximum Suppression. In this project, we have one important advantage with respect to object detection in an image. We have the advantage of having time as the x-axis. We can assume that the user’s mimicked noise is as long as the actual noise in the noisy recording (it’s okay that the noise is repeating at different time postitions). Since we have the advantage of having harmonics, as long as the user approximately provides us with the fundamental frequency in the noise, then we are also provided with the height of the scanning window. That is, after detecting the object, we can go ahead and remove the harmonics as well. As a result, we do not have to scan the image at different scales. 4.1 Scanning the Image Scanning the image with overlaps can be a very time consuming task given the implementation and can also affect the accuracy of the algorithm greatly. I have tried multiple ways for scanning the image spectrogram, listed below. 1- At each position in the image, extract four windows with 50% overlap, two on top andtwoonthebottom. 2- Extract windows in a row from an image with 12.5% overlap throughout the whole image. Figure 6 shows the result from each of these methods for the case of having two noise objects. Either of these procedures could be a very time consum- Figure 6: The left figure shows scanning the image with 50% overlap and the one on the right shows scanning of the image with 12.5% overlap, both for the case of having two similiar noise objects at different times and frequency positions (different x and y center pixel values) ming task based on how long the signal is. 
One way to speed up the process is to cascade classifiers, a concept somewhat similar to the one used in the Viola-Jones face detection algorithm. This can be done by first convolving the image with the noise object, which can itself be sped up using the 2D Fourier Transform. The response tells us approximately where the desired noise is located in the 2D plane, and serves as a weak classifier that rejects most of the patches that do not contain the desired noise. There is also a vectorized way of evaluating the object recognition score that does not involve scanning the image at all; it is discussed in subsection 4.5.
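This weak classifier can be sketched as below (a Python illustration with hypothetical names; the direct correlation shown here is what the 2D-FFT product computes faster in practice):

```python
def cross_correlate(image, template):
    """Valid-mode 2D cross-correlation of the spectrogram image with the
    noise-object template; peaks mark likely noise locations."""
    H, W = len(image), len(image[0])
    h, w = len(template), len(template[0])
    out = []
    for i in range(H - h + 1):
        row = []
        for j in range(W - w + 1):
            row.append(sum(image[i + a][j + b] * template[a][b]
                           for a in range(h) for b in range(w)))
        out.append(row)
    return out

def weak_classifier(image, template, threshold):
    """Keep only positions whose correlation response exceeds the
    threshold; the expensive HOG scoring then visits only these
    candidate patches."""
    resp = cross_correlate(image, template)
    return [(i, j) for i, row in enumerate(resp)
            for j, v in enumerate(row) if v > threshold]
```

A thresholded response like the one in Figure 7 is exactly what this candidate list represents.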
Figure 7: Response of the image to the noise object. The bottom subplot is the thresholded response.

4.2 HOG Features

HOG features are descriptors that capture the edge orientations of an image within cells of a fixed size, and they are robust to local photometric and geometric changes. HOG features are best known for object detection applications in computer vision. Since they require very careful tuning and normalization, I used an outside library, VLFeat [8], to compute the HOG features. In this project I used a cell size of 8 by 8 pixels and extracted the HOG features of a grayscale image (instead of RGB, for an easier implementation). After extracting the HOG features from the noise object and from all the patches extracted from the image, we need to classify every patch as either a noise object or a non-noise patch belonging to the clean signal.

Figure 8: HOG features of the user's mimicked noise

4.3 Classification

To classify each patch of the image, I used two different error functions.
1- K-Nearest Neighbor: vectorize all the HOG features of the image into one big matrix and use a Euclidean distance measure,

error = ||vec(noise_hog) − vec(image_hog)||_2   (2)

This error function produced many misclassifications, so I propose the following error function to get better accuracy.
2- The modified error function is

error = ||noise_hog − image_hog||_2 / ||noise_hog||_2   (3)

The latter error function gives much better accuracy in localizing the noise object. The resulting objects for the case of 50% overlap are shown in Figure 9; the score on top of each patch shows the value of the latter error function. The resulting object for the 12.5% overlap is shown on the right of Figure 9. As you can see, one object might be detected multiple times, so in order to detect multiple objects we can do the following:
Figure 9: 50% overlap windowing on the left, 12.5% overlap in a row on the right

Figure 10: NMS input on the left, NMS output from scanning method 1 in the middle, and from scanning method 2 on the right

Detecting Multiple Objects
1- Detect the most likely object in the image (the noisy recording's STFT in the pixel domain).
2- Remove that object from the image (leaving a blank space, to ensure it is not detected in the next round).
3- Detect the next object, and iterate.
Since we have the advantage of the user interface, we know how many objects to look for and, as a result, how many times to iterate.

4.4 Non-Maximum Suppression (NMS)

The purpose of NMS is to check whether the objects found in the image overlap. If they do, we pick the one with the highest score; if they do not overlap much, we pick both. Figure 9 shows the amount of overlap between each patch and the resulting object; the plots on the diagonal of Figure 9 are the overlap of each patch with itself. I then extracted the object with the highest overlap (as it already has the highest score). These results improved with the 12.5% overlap. In addition to the rules mentioned above, I decreased the size of the window and then checked the amount of overlap for every other patch in the image. The resulting object and its mask are shown in Figure 10.

4.5 Vectorized Method for Object Recognition

Another error function is the sum of squared errors,

error = Σ_{i=1}^{#patches} (vec(noise_hog) − vec(image_hog(i)))²   (4)
Figure 11: Left: removing the whole object. Middle: subtracting the template. Right: ideal case of subtraction

Expanding equation 4 gives

error = Σ_{i=1}^{#patches} [ vec(noise_hog)² + vec(image_hog(i))² − 2 · vec(image_hog(i)) · vec(noise_hog) ]   (5)

The first term in equation 5 can be evaluated by squaring the noise object and summing over all its elements. The last term can be calculated by convolving the noise object with the image, as shown in Figure 7. The middle term can be calculated using an integral image [9], which gives us the sum over each patch.

5 Results

When resynthesizing the sound, we can either multiply the mask shown in Figure 12 with the spectrogram of the sound and remove the whole object, or subtract only the noise template within the mask from the signal. Note that, due to Spectral Subtraction, there may be some residue left when subtracting only the noise template. The ideal case is shown on the right side of Figure 11: ideally, we would subtract all of the noise pixels (time-frequency bins) without subtracting any of the clean signal pixels.

In conclusion, the object detection method is effective in removing the noise from the noisy recording for the cases of one and two noise objects. For the case of three noise objects, there are some misdetections, which might be resolved with a more robust NMS algorithm. However, if the clean signal and the noise are similar in their time-frequency representation, the algorithm is likely to fail in detecting the noise object, e.g. when suppressing interfering speech in a lecture recording. For future work, I suggest looking into predicting the most likely pixels inside the blank space in the noisy recording (where the noise was located before it was removed), i.e. predicting the missing time-frequency bins in the signal, or equivalently completing an incomplete image.
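The scoring machinery of subsections 4.3 and 4.5 can be sketched as follows (a Python illustration with hypothetical names; the project itself used Matlab): the normalized error of equation 3 for one patch, and a summed-area table for evaluating the middle term of equation 5 over all patches, applied to the element-wise squared HOG image.

```python
import math

def vec(hog):
    """Flatten a 2D HOG feature map into one vector."""
    return [v for row in hog for v in row]

def normalized_error(noise_hog, patch_hog):
    """Modified error of equation 3: Euclidean distance between the
    vectorized HOG features, normalized by the l2 norm of the noise
    object's features so scores are comparable across patches."""
    diff = math.sqrt(sum((a - b) ** 2
                         for a, b in zip(vec(noise_hog), vec(patch_hog))))
    return diff / math.sqrt(sum(v ** 2 for v in vec(noise_hog)))

def integral_image(img):
    """Summed-area table with a zero first row and column:
    I[i][j] = sum of img over rows < i and cols < j."""
    H, W = len(img), len(img[0])
    I = [[0.0] * (W + 1) for _ in range(H + 1)]
    for i in range(H):
        for j in range(W):
            I[i + 1][j + 1] = img[i][j] + I[i][j + 1] + I[i + 1][j] - I[i][j]
    return I

def patch_sum(I, top, left, h, w):
    """O(1) sum of the h-by-w patch at (top, left). Applied to the
    squared HOG image, this supplies the middle term of equation 5
    for every patch without rescanning the image."""
    return (I[top + h][left + w] - I[top][left + w]
            - I[top + h][left] + I[top][left])
```

In the vectorized form, the first term of equation 5 is a single constant, the last term is the correlation response of Figure 7, and patch_sum over the squared HOG image provides the middle term.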
In addition, when localizing a deformed object (when the user cannot mimic the noise accurately), it would be necessary to look into techniques that are robust to deformations of the example object given by the user. This project could also be applied to real-time sound classification, e.g. is there screaming in this sound? Is someone asking for help?
References

[1] Wilson, K.W., Raj, B., Smaragdis, P., Divakaran, A.: Speech denoising using nonnegative matrix factorization with priors. In: ICASSP, pp. 4029-4032 (2008)
[2] Kim, M., Smaragdis, P.: Single channel source separation using smooth nonnegative matrix factorization with Markov random fields. In: IEEE Workshop on Machine Learning for Signal Processing (MLSP), Southampton, UK, September 2013
[3] Smith, J.O.: The Short-Time Fourier Transform (STFT). CCRMA, Stanford University, 2005
[4] Slaney, M.: The History and Future of CASA. Ee.columbia.edu. Web.
[5] Department of Linguistics: Acoustic Analysis of Sound: Spectral Analysis. Macquarie University
[6] Smith, J.O.: Overlap-Add Synthesis. CCRMA, Stanford University, 2005
[7] Ephraim, Y., Malah, D.: Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 32, pp. 1109-1121, Dec. 1984
[8] Vedaldi, A., Fulkerson, B.: VLFeat: An Open and Portable Library of Computer Vision Algorithms, 2008, http://www.vlfeat.org/
[9] Badgerati: Computer Vision - The Integral Image. Computer Science Source, 3 Sept. 2010