Applied Machine Learning for
Search Engine Relevance
Charles H Martin, PhD
Relevance as a Linear Regression
r = X†w + e
x: (tf-idf) bag-of-words vector
r: relevance score (i.e. +1/-1)
w: weight vector
w = X†r / X†X   (Moore-Penrose pseudoinverse)
x = one query model*; form X from data (i.e. a group of queries)
Solve as a numerical minimization
(i.e. iterative methods like SOR, CG, etc.)
min ‖X†w − r‖₂   (‖w‖₂ : the 2-norm of w)
*Actually we will model and predict pairwise
relations and not exact rank ...stay tuned.
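A minimal numpy sketch of this least-squares solve; the data shapes and random values are illustrative assumptions (rows of X are tf-idf document vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 20))               # rows: tf-idf document vectors
r = rng.choice([-1.0, 1.0], size=100)   # relevance labels (+1/-1)

# weight vector via the Moore-Penrose pseudoinverse
w = np.linalg.pinv(X) @ r

# identical to the numerical-minimization route
w_lstsq, *_ = np.linalg.lstsq(X, r, rcond=None)
assert np.allclose(w, w_lstsq)
```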
Relevance as a Linear Regression:
Tikhonov Regularization
w = (X†X)⁻¹ X†r
Problem: the inverse may not exist (numerical instabilities, poles)
Solution: add a constant a to the diagonal of (X†X)
w = (X†X + aI)⁻¹ X†r
a: a single, adjustable smoothing parameter
Equivalent minimization problem:
min ‖X†w − r‖² + a‖w‖²
More generally: form (something like) X†X + G†G + aI,
which is a self-adjoint, bounded operator =>
min ‖X†w − r‖² + a‖Gw‖²   i.e. G chosen to avoid over-fitting
The Representer Theorem Revisited:
Kernels and Green's Functions
f(x) = Σᵢ aᵢ R(x, xᵢ)    R := Kernel
Problem: estimate a function f(x) from training data (xᵢ)
Solution: solve a general minimization problem
min Loss[(f(xᵢ), yᵢ)] + a‖Gf‖²
Machine Learning Methods for Estimating Operator Equations (Steinke & Schölkopf 2006)
min Loss[(f(xᵢ), yᵢ)] + a αᵀKα    Kᵢⱼ = R(xᵢ, xⱼ)
Equivalent to: given a linear regularization operator ( G: H -> L₂(x) ),
where K is an integral operator: (Kf)(y) = ∫ R(x,y) f(x) dx,
K is the Green's Function for (G†G), or G = (K^(1/2))†
in Dirac notation: R(x,y) = <y|(G†G)⁻¹|x>
f(x) = Σᵢ aᵢ R(x, xᵢ) + Σᵤ bᵤ φᵤ(x) ;  the φᵤ span the null space of G
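A small kernel-ridge sketch of the representer-theorem solution f(x) = Σᵢ aᵢ R(x, xᵢ); the RBF kernel, its bandwidth, and the toy data are assumptions for illustration:

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    # R(x, x') = exp(-gamma * ||x - x'||^2), an assumed kernel choice
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
Xtr = rng.random((50, 3))
ytr = np.sin(Xtr.sum(axis=1))

a = 0.1                                   # regularization weight
K = rbf(Xtr, Xtr)                         # K_ij = R(x_i, x_j)
alpha = np.linalg.solve(K + a * np.eye(len(K)), ytr)

def f(Xnew):
    return rbf(Xnew, Xtr) @ alpha         # f(x) = sum_i alpha_i R(x, x_i)
```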
Personalized Relevance Algorithms:
eSelf Personality Subspace
[Diagram: while a user reads pages (p) on a music site, the learned
personality traits (q) are updated (Likes cars: 0.4, Sports cars: 0.0 => 0.3,
Rock-n-roll, Hard rock), and a matching ad (a used sports car ad) is
presented to the user.]
Compute personality traits during the user's visit to the web site
q values = stored, learned "personality traits"
Provide relevance rankings (for pages or ads) which include personality traits
Personalized Relevance Algorithms:
eSelf Personality Subspace
model: L [p, q] = [h, u]
where L is a square matrix
h: history (observed outputs)
p: output nodes (observables): web pages, classified ads, …
q: hidden nodes (not observed): individualized personality traits
u: user segmentation
Personalized Search:
Effective Regression Problem
[p, q](t) = (Leff[q(t-1)])⁻¹ • [h, u](t)   on each time step (t)

[ PLP  PLQ ] [ p ]   [ h ]
[ QLP  QLQ ] [ q ] = [ u ]

Formal solution:
PLP p + PLQ q = h
QLP p + QLQ q = 0
=>  Leff = PLP − PLQ (QLQ)⁻¹ QLP ;  Leff p = h ;  p = (Leff[q, u])⁻¹ h
Adapts on each visit, finding relevant pages p(t) based on the links L and the
learned personality traits q(t-1)
Regularization of PLP achieved with a "Green's Function / Resolvent Operator",
i.e. G†G ~= PLQ (QLQ)⁻¹ QLP
Equivalent to a Gaussian Process on a Graph, and/or Bayesian Linear Regression
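A numeric sanity check of the effective-operator reduction (a minimal sketch; the random, well-conditioned L is an assumption): eliminating the hidden q-block via the Schur complement reproduces the p-block of the full solve.

```python
import numpy as np

rng = np.random.default_rng(0)
n_p, n_q = 4, 3
L = rng.random((n_p + n_q, n_p + n_q)) + 5 * np.eye(n_p + n_q)
h = rng.random(n_p)

PLP, PLQ = L[:n_p, :n_p], L[:n_p, n_p:]
QLP, QLQ = L[n_p:, :n_p], L[n_p:, n_p:]

# full solve of [PLP PLQ; QLP QLQ] [p; q] = [h; 0]
full = np.linalg.solve(L, np.concatenate([h, np.zeros(n_q)]))

# effective solve in p-space only: Leff p = h
Leff = PLP - PLQ @ np.linalg.solve(QLQ, QLP)
p = np.linalg.solve(Leff, h)
assert np.allclose(p, full[:n_p])
```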
Related Dimensional Noise Reductions:
Rank (k) Approximations of a Matrix
Latent Semantic Analysis (LSA)
(Truncated) Singular Value Decomposition (SVD):
diagonalize the density operator D = A†A
retain a subset of (k) eigenvalues/vectors
Equivalent relations for the SVD:
optimal rank-(k) approximation X s.t. min ‖D − X‖₂²
Decomposition: A = UΣV† ,  A†A = V (Σ†Σ) V†
Block-partitioned density operator:  [ PDP  PDQ ; QDP  QDQ ]
Can generalize to various noise models: i.e. VLSI*, PLSA**
VLSI* provides a rank-(k) approximation for any query q: min E[ ‖qᵀ(D − X)‖₂² ]
*Variable Latent Semantic Indexing (Yahoo! Research Labs)
http://www.cs.cornell.edu/people/adg/html/papers/vlsi.pdf
PLSA** provides a rank-(k) approximation over classes (z): min DKL[ P ‖ P(data) ]
P = UΣV† ,  P(d,w) = Σz P(d|z) P(z) P(w|z) ,  DKL = Kullback–Leibler divergence
**Probabilistic Latent Semantic Indexing (Recommind, Inc)
http://www.cs.brown.edu/~th/papers/Hofmann-UAI99.pdf
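A short truncated-SVD sketch of the rank-(k) approximation; the matrix size and k are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((200, 50))                 # e.g. a document-term matrix
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ak = (U[:, :k] * s[:k]) @ Vt[:k]          # optimal rank-k approximation of A

D = A.T @ A                               # density operator
Dk = (Vt[:k].T * s[:k] ** 2) @ Vt[:k]     # its rank-k spectral truncation
```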
Personalized Relevance:
Mobile Services and Advertising
France Telecom: Last Inch Relevance Engine
[Diagram: given the time and location context, the engine suggests a mobile
service: play game, send msg, play song, …]
KA for Comm Services
• Based on the Empirical Bayesian score and a Suggestion mapping table,
a decision is made to suggest one or more possible Comm services
• Based on Business Intelligence (BI) data mining and/or Pattern
Recognition algorithms (i.e. supervised or unsupervised learning),
we compute statistical scores indicating who are the most likely
people to Call, or to send an SMS, MMS, or E-Mail
[Diagram: events q plus personal context (e.g. Sunday mornings) map to a
contextual comm service; the learned trait "On Sunday morning, most likely
to call Mom" yields suggestions for the user (Call [who], SMS [who],
MMS [who]) over candidates Mom (5), Bob (3), Phone company (1).]
Comm/Call Patterns
[Diagram: calls to different #'s broken out by period-of-day (POD), location
(LOC), and day of week; e.g. calling one contact is more likely at a given
POD than another, and some (contact, POD) pairs never occur, p( · , · ) = 0.]
Bayesian Score Estimation
To estimate p(call|POD)
frequency:
p(call|POD) = (# of times the user called someone at that POD) / (total # of events at that POD)
Bayesian:
p(call|POD) = p(POD|call) p(call) / Σq p(POD|q) p(q)
where q = call, sms, mms, or email
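A sketch of the Bayesian score; the count table is an invented example, not data from the slides:

```python
import numpy as np

choices = ["call", "sms", "mms", "email"]
# counts[q, pod]: how often choice q was observed at each period-of-day (POD)
counts = np.array([[2, 0, 1],
                   [1, 2, 0],
                   [0, 1, 3],
                   [1, 0, 0]], dtype=float)

p_q = counts.sum(axis=1) / counts.sum()                  # p(q)
p_pod_q = counts / counts.sum(axis=1, keepdims=True)     # p(POD|q)

pod, call = 0, choices.index("call")
p_call_pod = p_pod_q[call, pod] * p_q[call] / (p_pod_q[:, pod] * p_q).sum()
print("p(call|POD) =", p_call_pod)
```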
i.e. Bayesian Choice Estimator
• We seek to know the probability of a "call" (choice) at a given POD.
• We "borrow information" from other PODs, assuming this is less
biased, to improve our statisticalestimate
5 days
3 PODs
3 choices
f( | 1) = 2/5
p( | 1) = (2/5)(3/15) .
(2/5)(3/15)+(2/5)(3/15)+(1/5)(11/15)
= 6/23 ~ 1/4
1
2
3
frequency estimator
Bayesianchoice estimator
Note: the Bayesianestimate is
significantly lower because we now
expect we might see a at POD 1
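Reproducing the arithmetic of the worked example:

```python
freq = 2 / 5                                   # frequency estimator
num = (2/5) * (3/15)
den = (2/5) * (3/15) + (2/5) * (3/15) + (1/5) * (11/15)
print(freq, num / den)                         # 0.4 vs 6/23 ~ 1/4
```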
Incorporating Feedback
• It is not enough to simply recognize call patterns in the Event Facts; it
is also necessary to incorporate feedback into our suggestion scores
• p( c | user, pod, loc, facts, feedback ) = ?
[Diagram: Event Facts feed the Suggestions; user feedback grades them on a
scale from random / irrelevant / poor to good.]
A: Simply factorize:
p( c | user, pod, facts, feedback ) =
p( c | user, pod, facts ) p( c | user, pod, feedback )
Evaluate the probabilities independently,
perhaps using different Bayesian models
Personalized Relevance:
Empirical Bayesian Models
Closed-form models:
Correct a sample estimate (mean m, variance σ²) with a
weighted average of the sample + the complete dataset:
m̂ = B m_sample + (1 − B) m_segment
B: shrinkage factor (weighs the individual sample against the user segment)
Can rank-order mobile services (play game, send msg, play song)
based on their estimated likelihoods (m̂, σ̂)
Personalized Relevance:
Empirical Bayesian Models
What is Empirical Bayes modeling?
specify Likelihood L(y|) and Prior () distributions
estimatethe posterior () = L(y|) () L(y|) ()
 L(y|) ()d (marginal)
CombinesBayesianismand frequentism:
Approximatesmarginal using (or posterior) using point estimate(MLE), MonteCarlo, etc.
Estimatesmarginal using empirical data
Uses empirical data to infer prior, plug into likelihood to make predictions
Note: Special case of Effective OperatorRegression:
P space ~ Q space ; PLQ = I ; u  0
Q-space defines prior information
Empirical Bayesian Methods:
Poisson Gamma Model
Likelihood L(y| ) = Poisson distribution ( y e- )/y!
ConjugatePrior ( a,b) = Gamma distribution ( a-1 b a e-b a )/ G(a) ;  > 0
posterior(k) L(y|) () = (( y e- )/y!) (( a-1 b a e-b a )/ G(a) )
y+a-1 e-(1+(1/b))
also a Gamma distribution(a’,b’)
a’ = y + a ; b’ = (1+1/b)-1
Take MLE estimate of Marginal = mean (m) of the posterior (ab)
Obtaina,b from the mean (m = ab) and variance (ab2) of complete data
FinalPoint estimate E(y)= a’b’ for a sample is a weighted averageof
sample mean y=my and prior mean m
E(y) = ( my + a ) (1+1/b)-1
E(y) = (b/1+b) my + (1/1+b) m
Linear Personality Matrix
Linear (or non-linear) matrix transformation:  M s = a
s: suggestions (s1 = call, s2 = sms, s3 = mms, s4 = email)
a: observed actions (i.e. calls)
Over time, we can estimate M_{a,s} = prob( a | s ):
i.e. for a given time and location, count how many times we suggested a call
but the user chose an email instead
We can then solve for prob( s ) using a computational linear solver:  s = M⁻¹a
Notice: the personality matrix may or may not mix suggestions across events,
and can include semantic information
Obviously we would like M to be diagonal… or as close as possible!
Can we devise an algorithm that will learn to give "optimal" suggestions?
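A sketch of the linear personality model; the suggestion/action counts are an invented example:

```python
import numpy as np

# counts[a, s]: times action a followed suggestion s (call, sms, mms, email)
counts = np.array([[8, 1, 0, 1],
                   [1, 6, 1, 0],
                   [0, 1, 7, 2],
                   [1, 2, 2, 7]], dtype=float)
M = counts / counts.sum(axis=0, keepdims=True)   # M_{a,s} = prob(a|s)

a_obs = np.array([0.4, 0.2, 0.2, 0.2])           # observed action distribution
s = np.linalg.solve(M, a_obs)                    # inferred suggestion mix
print(s)
```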
Matrices for Pattern Recognition
(Statistical Factor Analysis)
[Table: rows enumerate choices (Call on Mon @ pod 1, Call on Mon @ pod 2,
Call on Mon @ pod 3, …, SMS on Tue @ pod 1, …); columns are weeks 1 2 3 4 5 ….]
We can apply Computational Linear Algebra to remove noise and find patterns in data.
Called Factor Analysis by statisticians, Singular Value Decomposition (SVD) by engineers;
implemented in Oracle Data Mining (ODM) as Non-Negative Matrix Factorization.
1. Enumerate all choices
2. Count the # of times each choice is made each week
3. Form the weekly choice density matrix AᵀA
4. All weekly patterns are collapsed into the density matrix AᵀA; they can be
detected using spectral analysis (i.e. the principal eigenvalues separate
real weekly patterns from pure noise), as in the sketch below
Similar to Latent Dirichlet Allocation (LDA), but much simpler to implement.
Suitable when the number (#) of choices is not too large and the patterns are weekly.
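A small sketch of steps 3-4 with synthetic counts: a repeated weekly pattern shows up as one dominant eigenvalue of AᵀA, with the rest of the spectrum as noise.

```python
import numpy as np

rng = np.random.default_rng(0)
weeks, n_choices = 8, 30
pattern = rng.random(n_choices)                  # the repeated weekly habit
A = np.outer(np.ones(weeks), pattern) \
    + 0.1 * rng.random((weeks, n_choices))       # counts = pattern + noise

evals = np.linalg.eigvalsh(A.T @ A)[::-1]
print(evals[:4])    # one dominant eigenvalue, then a noise floor
```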
Search Engine Relevance: Listing on Shopping.com
Which 5 items to list at the bottom of the page?
Statistical Machine Learning:
Support Vector Machines (SVM)
From Regression to Classification: Maximum Margin Solutions
Classification := find the line that separates
the points with the maximum margin
(margin width 2 / ‖w‖₂)
min ½‖w‖₂²   subject to the constraints:
"above" :  w·xᵢ − b >= +1 − ξᵢ
"below" :  w·xᵢ − b <= −1 + ξᵢ
Simple minimization (regression) becomes a convex optimization (classification),
perhaps within some slack (i.e. min ½‖w‖₂² + C Σᵢ ξᵢ)
SVM Light: Multivariate Rank Constraints
Multivariate Classification:
min ½‖w‖₂² + Cξ   s.t.
for all y′:  wᵀΨ(x,y) − wᵀΨ(x,y′) >= Δ(y,y′) − ξ
let Ψ(x,y′) = Σᵢ y′ᵢ xᵢ be a linear function
sgn(wᵀx) maps docs x = (x₁, …, xₙ) to relevance scores y = (+1/-1, …)
learn weights (w) s.t. the maximizer y′ of wᵀΨ(x,y′)
is correct for the training set
(within a single slack variable ξ)
Δ(y,y′) is a multivariate loss function (i.e. 1 − AveragePrecision(y,y′))
Ψ(x,y′) is a linear discriminant function (i.e. a sum over ordered pairs, Σᵢ Σⱼ yᵢⱼ (xᵢ − xⱼ))
SVMlight Ranking SVMs
SVMperf : ROC Area, F1 Score, Precision/Recall
SVMmap : Mean Average Precision ( warning: buggy ! )
SVMrank : Ordinal Regression
Standard classification on pairwise differences:
min ½‖w‖₂² + C Σ ξᵢⱼₖ   s.t.
for all queries qₖ (later, may not be query-specific in SVMstruct)
and doc pairs dᵢ, dⱼ :  wᵀΨ(qₖ,dᵢ) − wᵀΨ(qₖ,dⱼ) >= 1 − ξᵢⱼₖ
ΔROCArea = 1 − # swapped pairs; enforces a directed ordering
[Example: 8 documents with relevance labels (1 0 0 0 0 1 1 0), scored under
two orderings, (8 7 6 5 4 3 2 1) and (1 2 3 4 5 6 7 8):
    MAP 0.56, ROC Area 0.47  vs.  MAP 0.51, ROC Area 0.53
so the two metrics can disagree about which ranking is better.]
A Support Vector Method for Optimizing Average Precision (Joachims et al. 2007)
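A minimal sketch of the pairwise-difference reduction (an SVMrank-style transform, shown here with scikit-learn rather than SVMlight); the features and hidden utility are synthetic assumptions:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.random((40, 5))                            # docs for one query
scores = X @ np.array([1.0, -0.5, 2.0, 0.0, 0.3])  # hidden true utility

pairs, labels = [], []
for i in range(len(X)):
    for j in range(len(X)):
        if scores[i] > scores[j]:                  # d_i should rank above d_j
            pairs.append(X[i] - X[j]); labels.append(+1)
            pairs.append(X[j] - X[i]); labels.append(-1)

clf = LinearSVC(C=1.0).fit(np.array(pairs), np.array(labels))
ranking = np.argsort(-(X @ clf.coef_.ravel()))     # learned document ordering
```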
Large Scale, Linear SVMs
• Solving the Primal
– Conjugate Gradient
– Joachims: cutting plane algorithm
– Niyogi
• Handling Large Numbers of Constraints
• Cutting Plane Algorithm
• Open Source Implementations:
– LibSVM
– SVMLight
Search Engine Relevance: Listing on Shopping.com
A ranking SVM consistently improves the Shopping.com <click rank> by 12%
Various Sparse Matrix Problems:
Google PageRank algorithm
M a = a    rank a series of web pages by simulating user
browsing patterns (a) based on a probabilistic
model (M) of page links
Pattern Recognition, Inference
L p = h    estimate unknown probabilities (p) based on
historical observations (h) and a probability
model (L) of links between hidden nodes
Quantum Chemistry
H ψ = E ψ    compute the color of dyes and pigments given empirical
information on related molecules, and/or by solving
massive eigenvalue problems
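A power-iteration sketch of M a = a; the random link matrix and the damping factor are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
links = (rng.random((n, n)) < 0.4).astype(float)   # who links to whom
links[:, links.sum(axis=0) == 0] = 1.0             # patch dangling pages
M = links / links.sum(axis=0, keepdims=True)       # column-stochastic

d = 0.85
G = d * M + (1 - d) / n                            # damped browsing model

a = np.ones(n) / n
for _ in range(100):
    a = G @ a                                      # converges to M a = a
print(a)
```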
Quantum Chemistry:
the electronic structure eigenproblem
Solve a massive eigenvalue problem (dimension 10⁹-10¹²):
H ψ(π, σ, …) = E ψ(π, σ, …)
H: energy matrix
ψ: quantum state eigenvector
π, σ, …: electrons
Methods can have general applicability:
Davidson method for dominant eigenvalues / eigenvectors
Motivation for Personalization Technology:
from solving and understanding the conceptual foundations
of semi-empirical models (noiseless dimensional reduction)
Relations between Quantum Mechanics and
Probabilistic Language Models
• Quantum states ψ resemble the states (strings, words, phrases)
in probabilistic language models (HMMs, SCFGs), except:
ψ is a sum* of strings of electrons:
ψ(π, σ) = 0.1 |π₁ π₂ σ₁ σ₂| + 0.2 |π₂ π₃ σ₁ σ₂| + …
• The Energy Matrix H is known exactly, but large. Models of H can be
inferred from empirical data to simplify computations.
• Energies ~= Log[Probabilities], un-normalized
*Not just a single string!
Ab initio (from first principles):
solve the entire H ψ(π, σ) = E ψ(π, σ) … approximately
OR
Semi-empirical:
assume the (π, σ) electrons are statistically independent:
ψ(π, σ) = p(π) q(σ)
Treat the π-electrons explicitly, ignore σ (hidden):
PHP p(π) = E p(π)    (a much smaller problem)
Parameterize the PHP matrix => Heff with empirical data, using a small set
of molecules, then apply it to others (dyes, pigments)
Dimensional Reduction in Quantum Chemistry:
where do semi-empirical Hamiltonians come from?
Effective Hamiltonians:
Semi-Empirical Pi-Electron Methods
Heff [] p() =  p()
PHP PHQ p = E p
QHP QHQ q q
Heff [] = (PHP – PHQ (E-QHQ)-1 QHP)
PHP p + PHQ q = E p
QHP p + QHQ q = E q
=>
implicit/ hidden
Final Heff can be solved iteratively (as with eSelf Leff),
or perturbatively in various forms
Solution is formally exact =>
Dimensional Reduction / “Renormalization”
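A numeric sketch of the partitioned eigenproblem solved self-consistently; the random symmetric H (with the hidden Q-block shifted up in energy so the iteration is well-behaved) is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_p = 8, 3
H = rng.random((n, n)); H = (H + H.T) / 2
H[n_p:, n_p:] += 5 * np.eye(n - n_p)     # push hidden (Q) space up in energy

PHP, PHQ = H[:n_p, :n_p], H[:n_p, n_p:]
QHP, QHQ = H[n_p:, :n_p], H[n_p:, n_p:]

E = np.linalg.eigvalsh(PHP)[0]           # initial guess for the lowest root
for _ in range(50):
    Heff = PHP + PHQ @ np.linalg.solve(E * np.eye(n - n_p) - QHQ, QHP)
    E = np.linalg.eigvalsh(Heff)[0]      # iterate to self-consistency

print(E, np.linalg.eigvalsh(H)[0])       # compare with the full lowest eigenvalue
```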
Graphical Methods
Decompose Heff into effective interactions Vij between π electrons
(expand (E − QHQ)⁻¹ in an infinite series and remove the E dependence)
Represent the terms diagrammatically: ~300 diagrams to evaluate
Precompile using symbolic manipulation:
~35 MB executable; 8-10 hours to compile
run time: 3-4 hours / parameter
Effective Hamiltonians:
Numerical Calculations
V_CC :  π-only 16 eV ;  effective 11.5 eV ;  empirical 11-12 eV
Compute ab initio values of the empirical parameters:
can test all the basic assumptions of semi-empirical theory,
"from first principles"
Also provides highly accurate eigenvalue spectra
Augment commercial packages (i.e. Fujitsu MOPAC) to model the
spectroscopy of photoactive proteins