Applied Machine Learning for
Search Engine Relevance
Charles H Martin, PhD
Relevance as a Linear Regression
r = X†w + e
x: (tf-idf) bag-of-words vector
r: relevance score (i.e. +1/-1)
w: weight vector
w = X†r / X†X   (Moore-Penrose pseudoinverse)
x = one query model*; form X from data (i.e. a group of queries)
Solve as a numerical minimization
(i.e. iterative methods like SOR, CG, etc.)
min ‖X†w − r‖₂   (‖w‖₂ : the 2-norm of w)
*Actually we will model and predict pairwise
relations and not exact rank ...stay tuned.
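A minimal numpy sketch of this least-squares solve; the data shapes and random values are illustrative assumptions (rows of X are tf-idf document vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 20))               # rows: tf-idf document vectors
r = rng.choice([-1.0, 1.0], size=100)   # relevance labels (+1/-1)

# weight vector via the Moore-Penrose pseudoinverse
w = np.linalg.pinv(X) @ r

# identical to the numerical-minimization route
w_lstsq, *_ = np.linalg.lstsq(X, r, rcond=None)
assert np.allclose(w, w_lstsq)
```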
Relevance as a Linear Regression:
Tikhonov Regularization
w = (X†X)⁻¹ X†r
Problem: the inverse may not exist (numerical instabilities, poles)
Solution: add a constant a to the diagonal of (X†X)
w = (X†X + aI)⁻¹ X†r
a: a single, adjustable smoothing parameter
Equivalent minimization problem:
min ‖X†w − r‖² + a‖w‖²
More generally: form (something like) X†X + G†G + aI,
which is a self-adjoint, bounded operator =>
min ‖X†w − r‖² + a‖Gw‖²   i.e. G chosen to avoid over-fitting
The Representer Theorem Revisited:
Kernels and Green's Functions
f(x) = Σᵢ aᵢ R(x, xᵢ)    R := Kernel
Problem: estimate a function f(x) from training data (xᵢ)
Solution: solve a general minimization problem
min Loss[(f(xᵢ), yᵢ)] + a‖Gf‖²
Machine Learning Methods for Estimating Operator Equations (Steinke & Schölkopf 2006)
min Loss[(f(xᵢ), yᵢ)] + a αᵀKα    Kᵢⱼ = R(xᵢ, xⱼ)
Equivalent to: given a linear regularization operator ( G: H -> L₂(x) ),
where K is an integral operator: (Kf)(y) = ∫ R(x,y) f(x) dx,
K is the Green's Function for (G†G), or G = (K^(1/2))†
in Dirac notation: R(x,y) = <y|(G†G)⁻¹|x>
f(x) = Σᵢ aᵢ R(x, xᵢ) + Σᵤ bᵤ φᵤ(x) ;  the φᵤ span the null space of G
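A small kernel-ridge sketch of the representer-theorem solution f(x) = Σᵢ aᵢ R(x, xᵢ); the RBF kernel, its bandwidth, and the toy data are assumptions for illustration:

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    # R(x, x') = exp(-gamma * ||x - x'||^2), an assumed kernel choice
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
Xtr = rng.random((50, 3))
ytr = np.sin(Xtr.sum(axis=1))

a = 0.1                                   # regularization weight
K = rbf(Xtr, Xtr)                         # K_ij = R(x_i, x_j)
alpha = np.linalg.solve(K + a * np.eye(len(K)), ytr)

def f(Xnew):
    return rbf(Xnew, Xtr) @ alpha         # f(x) = sum_i alpha_i R(x, x_i)
```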
Personalized Relevance Algorithms:
eSelf Personality Subspace
[Diagram: while a user reads pages (p) on a music site, the learned
personality traits (q) are updated (Likes cars: 0.4, Sports cars: 0.0 => 0.3,
Rock-n-roll, Hard rock), and a matching ad (a used sports car ad) is
presented to the user.]
Compute personality traits during the user's visit to the web site
q values = stored, learned "personality traits"
Provide relevance rankings (for pages or ads) which include personality traits
Personalized Relevance Algorithms:
eSelf Personality Subspace
model: L [p, q] = [h, u]
where L is a square matrix
h: history (observed outputs)
p: output nodes (observables): web pages, classified ads, …
q: hidden nodes (not observed): individualized personality traits
u: user segmentation
Personalized Search:
Effective Regression Problem
[p, q](t) = (Leff[q(t-1)])⁻¹ • [h, u](t)   on each time step (t)

[ PLP  PLQ ] [ p ]   [ h ]
[ QLP  QLQ ] [ q ] = [ u ]

Formal solution:
PLP p + PLQ q = h
QLP p + QLQ q = 0
=>  Leff = PLP − PLQ (QLQ)⁻¹ QLP ;  Leff p = h ;  p = (Leff[q, u])⁻¹ h
Adapts on each visit, finding relevant pages p(t) based on the links L and the
learned personality traits q(t-1)
Regularization of PLP achieved with a "Green's Function / Resolvent Operator",
i.e. G†G ~= PLQ (QLQ)⁻¹ QLP
Equivalent to a Gaussian Process on a Graph, and/or Bayesian Linear Regression
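A numeric sanity check of the effective-operator reduction (a minimal sketch; the random, well-conditioned L is an assumption): eliminating the hidden q-block via the Schur complement reproduces the p-block of the full solve.

```python
import numpy as np

rng = np.random.default_rng(0)
n_p, n_q = 4, 3
L = rng.random((n_p + n_q, n_p + n_q)) + 5 * np.eye(n_p + n_q)
h = rng.random(n_p)

PLP, PLQ = L[:n_p, :n_p], L[:n_p, n_p:]
QLP, QLQ = L[n_p:, :n_p], L[n_p:, n_p:]

# full solve of [PLP PLQ; QLP QLQ] [p; q] = [h; 0]
full = np.linalg.solve(L, np.concatenate([h, np.zeros(n_q)]))

# effective solve in p-space only: Leff p = h
Leff = PLP - PLQ @ np.linalg.solve(QLQ, QLP)
p = np.linalg.solve(Leff, h)
assert np.allclose(p, full[:n_p])
```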
Related Dimensional Noise Reductions:
Rank (k) Approximations of a Matrix
Latent Semantic Analysis (LSA)
(Truncated) Singular Value Decomposition (SVD):
diagonalize the density operator D = A†A
retain a subset of (k) eigenvalues/vectors
Equivalent relations for the SVD:
optimal rank-(k) approximation X s.t. min ‖D − X‖₂²
Decomposition: A = UΣV† ,  A†A = V (Σ†Σ) V†
Block-partitioned density operator:  [ PDP  PDQ ; QDP  QDQ ]
Can generalize to various noise models: i.e. VLSI*, PLSA**
VLSI* provides a rank-(k) approximation for any query q: min E[ ‖qᵀ(D − X)‖₂² ]
*Variable Latent Semantic Indexing (Yahoo! Research Labs)
http://www.cs.cornell.edu/people/adg/html/papers/vlsi.pdf
PLSA** provides a rank-(k) approximation over classes (z): min DKL[ P ‖ P(data) ]
P = UΣV† ,  P(d,w) = Σz P(d|z) P(z) P(w|z) ,  DKL = Kullback–Leibler divergence
**Probabilistic Latent Semantic Indexing (Recommind, Inc)
http://www.cs.brown.edu/~th/papers/Hofmann-UAI99.pdf
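A short truncated-SVD sketch of the rank-(k) approximation; the matrix size and k are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((200, 50))                 # e.g. a document-term matrix
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ak = (U[:, :k] * s[:k]) @ Vt[:k]          # optimal rank-k approximation of A

D = A.T @ A                               # density operator
Dk = (Vt[:k].T * s[:k] ** 2) @ Vt[:k]     # its rank-k spectral truncation
```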
Personalized Relevance:
Mobile Services and Advertising
France Telecom: Last Inch Relevance Engine
[Diagram: given the time and location context, the engine suggests a mobile
service: play game, send msg, play song, …]
KA for Comm Services
• Based on the Empirical Bayesian score and a Suggestion mapping table,
a decision is made to suggest one or more possible Comm services
• Based on Business Intelligence (BI) data mining and/or Pattern
Recognition algorithms (i.e. supervised or unsupervised learning),
we compute statistical scores indicating who are the most likely
people to Call, or to send an SMS, MMS, or E-Mail
[Diagram: events q plus personal context (e.g. Sunday mornings) map to a
contextual comm service; the learned trait "On Sunday morning, most likely
to call Mom" yields suggestions for the user (Call [who], SMS [who],
MMS [who]) over candidates Mom (5), Bob (3), Phone company (1).]
Comm/Call Patterns
[Diagram: calls to different #'s broken out by period-of-day (POD), location
(LOC), and day of week; e.g. calling one contact is more likely at a given
POD than another, and some (contact, POD) pairs never occur, p( · , · ) = 0.]
Bayesian Score Estimation
To estimate p(call|POD)
frequency:
p(call|POD) = (# of times the user called someone at that POD) / (total # of events at that POD)
Bayesian:
p(call|POD) = p(POD|call) p(call) / Σq p(POD|q) p(q)
where q = call, sms, mms, or email
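A sketch of the Bayesian score; the count table is an invented example, not data from the slides:

```python
import numpy as np

choices = ["call", "sms", "mms", "email"]
# counts[q, pod]: how often choice q was observed at each period-of-day (POD)
counts = np.array([[2, 0, 1],
                   [1, 2, 0],
                   [0, 1, 3],
                   [1, 0, 0]], dtype=float)

p_q = counts.sum(axis=1) / counts.sum()                  # p(q)
p_pod_q = counts / counts.sum(axis=1, keepdims=True)     # p(POD|q)

pod, call = 0, choices.index("call")
p_call_pod = p_pod_q[call, pod] * p_q[call] / (p_pod_q[:, pod] * p_q).sum()
print("p(call|POD) =", p_call_pod)
```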
i.e. Bayesian Choice Estimator
• We seek to know the probability of a "call" (choice) at a given POD.
• We "borrow information" from other PODs, assuming this is less
biased, to improve our statisticalestimate
5 days
3 PODs
3 choices
f( | 1) = 2/5
p( | 1) = (2/5)(3/15) .
(2/5)(3/15)+(2/5)(3/15)+(1/5)(11/15)
= 6/23 ~ 1/4
1
2
3
frequency estimator
Bayesianchoice estimator
Note: the Bayesianestimate is
significantly lower because we now
expect we might see a at POD 1
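Reproducing the arithmetic of the worked example:

```python
freq = 2 / 5                                   # frequency estimator
num = (2/5) * (3/15)
den = (2/5) * (3/15) + (2/5) * (3/15) + (1/5) * (11/15)
print(freq, num / den)                         # 0.4 vs 6/23 ~ 1/4
```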
Incorporating Feedback
• It is not enough to simply recognize call patterns in the Event Facts; it
is also necessary to incorporate feedback into our suggestion scores
• p( c | user, pod, loc, facts, feedback ) = ?
[Diagram: Event Facts feed the Suggestions; user feedback grades them on a
scale from random / irrelevant / poor to good.]
A: Simply factorize:
p( c | user, pod, facts, feedback ) =
p( c | user, pod, facts ) p( c | user, pod, feedback )
Evaluate the probabilities independently,
perhaps using different Bayesian models
Personalized Relevance:
Empirical Bayesian Models
Closed-form models:
Correct a sample estimate (mean m, variance σ²) with a
weighted average of the sample + the complete dataset:
m̂ = B m_sample + (1 − B) m_segment
B: shrinkage factor (weighs the individual sample against the user segment)
Can rank-order mobile services (play game, send msg, play song)
based on their estimated likelihoods (m̂, σ̂)
Personalized Relevance:
Empirical Bayesian Models
What is Empirical Bayes modeling?
specify Likelihood L(y|) and Prior () distributions
estimatethe posterior () = L(y|) () L(y|) ()
 L(y|) ()d (marginal)
CombinesBayesianismand frequentism:
Approximatesmarginal using (or posterior) using point estimate(MLE), MonteCarlo, etc.
Estimatesmarginal using empirical data
Uses empirical data to infer prior, plug into likelihood to make predictions
Note: Special case of Effective OperatorRegression:
P space ~ Q space ; PLQ = I ; u  0
Q-space defines prior information
Empirical Bayesian Methods:
Poisson Gamma Model
Likelihood L(y| ) = Poisson distribution ( y e- )/y!
ConjugatePrior ( a,b) = Gamma distribution ( a-1 b a e-b a )/ G(a) ;  > 0
posterior(k) L(y|) () = (( y e- )/y!) (( a-1 b a e-b a )/ G(a) )
y+a-1 e-(1+(1/b))
also a Gamma distribution(a’,b’)
a’ = y + a ; b’ = (1+1/b)-1
Take MLE estimate of Marginal = mean (m) of the posterior (ab)
Obtaina,b from the mean (m = ab) and variance (ab2) of complete data
FinalPoint estimate E(y)= a’b’ for a sample is a weighted averageof
sample mean y=my and prior mean m
E(y) = ( my + a ) (1+1/b)-1
E(y) = (b/1+b) my + (1/1+b) m
Linear Personality Matrix
Linear (or non-linear) matrix transformation:  M s = a
s: suggestions (s1 = call, s2 = sms, s3 = mms, s4 = email)
a: observed actions (i.e. calls)
Over time, we can estimate M_{a,s} = prob( a | s ):
i.e. for a given time and location, count how many times we suggested a call
but the user chose an email instead
We can then solve for prob( s ) using a computational linear solver:  s = M⁻¹a
Notice: the personality matrix may or may not mix suggestions across events,
and can include semantic information
Obviously we would like M to be diagonal… or as close as possible!
Can we devise an algorithm that will learn to give "optimal" suggestions?
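A sketch of the linear personality model; the suggestion/action counts are an invented example:

```python
import numpy as np

# counts[a, s]: times action a followed suggestion s (call, sms, mms, email)
counts = np.array([[8, 1, 0, 1],
                   [1, 6, 1, 0],
                   [0, 1, 7, 2],
                   [1, 2, 2, 7]], dtype=float)
M = counts / counts.sum(axis=0, keepdims=True)   # M_{a,s} = prob(a|s)

a_obs = np.array([0.4, 0.2, 0.2, 0.2])           # observed action distribution
s = np.linalg.solve(M, a_obs)                    # inferred suggestion mix
print(s)
```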
Matrices for Pattern Recognition
(Statistical Factor Analysis)
[Table: rows enumerate choices (Call on Mon @ pod 1, Call on Mon @ pod 2,
Call on Mon @ pod 3, …, SMS on Tue @ pod 1, …); columns are weeks 1 2 3 4 5 ….]
We can apply Computational Linear Algebra to remove noise and find patterns in data.
Called Factor Analysis by statisticians, Singular Value Decomposition (SVD) by engineers;
implemented in Oracle Data Mining (ODM) as Non-Negative Matrix Factorization.
1. Enumerate all choices
2. Count the # of times each choice is made each week
3. Form the weekly choice density matrix AᵀA
4. All weekly patterns are collapsed into the density matrix AᵀA; they can be
detected using spectral analysis (i.e. the principal eigenvalues separate
real weekly patterns from pure noise), as in the sketch below
Similar to Latent Dirichlet Allocation (LDA), but much simpler to implement.
Suitable when the number (#) of choices is not too large and the patterns are weekly.
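A small sketch of steps 3-4 with synthetic counts: a repeated weekly pattern shows up as one dominant eigenvalue of AᵀA, with the rest of the spectrum as noise.

```python
import numpy as np

rng = np.random.default_rng(0)
weeks, n_choices = 8, 30
pattern = rng.random(n_choices)                  # the repeated weekly habit
A = np.outer(np.ones(weeks), pattern) \
    + 0.1 * rng.random((weeks, n_choices))       # counts = pattern + noise

evals = np.linalg.eigvalsh(A.T @ A)[::-1]
print(evals[:4])    # one dominant eigenvalue, then a noise floor
```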
Search Engine Relevance: Listing on Shopping.com
Which 5 items to list at the bottom of the page?
Statistical Machine Learning:
Support Vector Machines (SVM)
From Regression to Classification: Maximum Margin Solutions
Classification := find the line that separates
the points with the maximum margin
(margin width 2 / ‖w‖₂)
min ½‖w‖₂²   subject to the constraints:
"above" :  w·xᵢ − b >= +1 − ξᵢ
"below" :  w·xᵢ − b <= −1 + ξᵢ
Simple minimization (regression) becomes a convex optimization (classification),
perhaps within some slack (i.e. min ½‖w‖₂² + C Σᵢ ξᵢ)
SVM Light: Multivariate Rank Constraints
Multivariate Classification:
min ½‖w‖₂² + Cξ   s.t.
for all y′:  wᵀΨ(x,y) − wᵀΨ(x,y′) >= Δ(y,y′) − ξ
let Ψ(x,y′) = Σᵢ y′ᵢ xᵢ be a linear function
sgn(wᵀx) maps docs x = (x₁, …, xₙ) to relevance scores y = (+1/-1, …)
learn weights (w) s.t. the maximizer y′ of wᵀΨ(x,y′)
is correct for the training set
(within a single slack variable ξ)
Δ(y,y′) is a multivariate loss function (i.e. 1 − AveragePrecision(y,y′))
Ψ(x,y′) is a linear discriminant function (i.e. a sum over ordered pairs, Σᵢ Σⱼ yᵢⱼ (xᵢ − xⱼ))
SVMlight Ranking SVMs
SVMperf : ROC Area, F1 Score, Precision/Recall
SVMmap : Mean Average Precision ( warning: buggy ! )
SVMrank : Ordinal Regression
Standard classification on pairwise differences:
min ½‖w‖₂² + C Σ ξᵢⱼₖ   s.t.
for all queries qₖ (later, may not be query-specific in SVMstruct)
and doc pairs dᵢ, dⱼ :  wᵀΨ(qₖ,dᵢ) − wᵀΨ(qₖ,dⱼ) >= 1 − ξᵢⱼₖ
ΔROCArea = 1 − # swapped pairs; enforces a directed ordering
[Example: 8 documents with relevance labels (1 0 0 0 0 1 1 0), scored under
two orderings, (8 7 6 5 4 3 2 1) and (1 2 3 4 5 6 7 8):
    MAP 0.56, ROC Area 0.47  vs.  MAP 0.51, ROC Area 0.53
so the two metrics can disagree about which ranking is better.]
A Support Vector Method for Optimizing Average Precision (Joachims et al. 2007)
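A minimal sketch of the pairwise-difference reduction (an SVMrank-style transform, shown here with scikit-learn rather than SVMlight); the features and hidden utility are synthetic assumptions:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.random((40, 5))                            # docs for one query
scores = X @ np.array([1.0, -0.5, 2.0, 0.0, 0.3])  # hidden true utility

pairs, labels = [], []
for i in range(len(X)):
    for j in range(len(X)):
        if scores[i] > scores[j]:                  # d_i should rank above d_j
            pairs.append(X[i] - X[j]); labels.append(+1)
            pairs.append(X[j] - X[i]); labels.append(-1)

clf = LinearSVC(C=1.0).fit(np.array(pairs), np.array(labels))
ranking = np.argsort(-(X @ clf.coef_.ravel()))     # learned document ordering
```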
Large Scale, Linear SVMs
• Solving the Primal
– Conjugate Gradient
– Joachims: cutting plane algorithm
– Niyogi
• Handling Large Numbers of Constraints
• Cutting Plane Algorithm
• Open Source Implementations:
– LibSVM
– SVMLight
Search Engine Relevance: Listing on Shopping.com
A ranking SVM consistently improves the Shopping.com <click rank> by 12%
Various Sparse Matrix Problems:
Google PageRank algorithm
M a = a    rank a series of web pages by simulating user
browsing patterns (a) based on a probabilistic
model (M) of page links
Pattern Recognition, Inference
L p = h    estimate unknown probabilities (p) based on
historical observations (h) and a probability
model (L) of links between hidden nodes
Quantum Chemistry
H ψ = E ψ    compute the color of dyes and pigments given empirical
information on related molecules, and/or by solving
massive eigenvalue problems
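A power-iteration sketch of M a = a; the random link matrix and the damping factor are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
links = (rng.random((n, n)) < 0.4).astype(float)   # who links to whom
links[:, links.sum(axis=0) == 0] = 1.0             # patch dangling pages
M = links / links.sum(axis=0, keepdims=True)       # column-stochastic

d = 0.85
G = d * M + (1 - d) / n                            # damped browsing model

a = np.ones(n) / n
for _ in range(100):
    a = G @ a                                      # converges to M a = a
print(a)
```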
Quantum Chemistry:
the electronic structure eigenproblem
Solve a massive eigenvalue problem (dimension 10⁹-10¹²):
H ψ(π, σ, …) = E ψ(π, σ, …)
H: energy matrix
ψ: quantum state eigenvector
π, σ, …: electrons
Methods can have general applicability:
Davidson method for dominant eigenvalues / eigenvectors
Motivation for Personalization Technology:
from solving and understanding the conceptual foundations
of semi-empirical models (noiseless dimensional reduction)
Relations between Quantum Mechanics and
Probabilistic Language Models
• Quantum states ψ resemble the states (strings, words, phrases)
in probabilistic language models (HMMs, SCFGs), except:
ψ is a sum* of strings of electrons:
ψ(π, σ) = 0.1 |π₁ π₂ σ₁ σ₂| + 0.2 |π₂ π₃ σ₁ σ₂| + …
• The Energy Matrix H is known exactly, but large. Models of H can be
inferred from empirical data to simplify computations.
• Energies ~= Log[Probabilities], un-normalized
*Not just a single string!
Ab initio (from first principles):
solve the entire H ψ(π, σ) = E ψ(π, σ) … approximately
OR
Semi-empirical:
assume the (π, σ) electrons are statistically independent:
ψ(π, σ) = p(π) q(σ)
Treat the π-electrons explicitly, ignore σ (hidden):
PHP p(π) = E p(π)    (a much smaller problem)
Parameterize the PHP matrix => Heff with empirical data, using a small set
of molecules, then apply it to others (dyes, pigments)
Dimensional Reduction in Quantum Chemistry:
where do semi-empirical Hamiltonians come from?
Effective Hamiltonians:
Semi-Empirical Pi-Electron Methods
Heff [] p() =  p()
PHP PHQ p = E p
QHP QHQ q q
Heff [] = (PHP – PHQ (E-QHQ)-1 QHP)
PHP p + PHQ q = E p
QHP p + QHQ q = E q
=>
implicit/ hidden
Final Heff can be solved iteratively (as with eSelf Leff),
or perturbatively in various forms
Solution is formally exact =>
Dimensional Reduction / “Renormalization”
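A numeric sketch of the partitioned eigenproblem solved self-consistently; the random symmetric H (with the hidden Q-block shifted up in energy so the iteration is well-behaved) is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_p = 8, 3
H = rng.random((n, n)); H = (H + H.T) / 2
H[n_p:, n_p:] += 5 * np.eye(n - n_p)     # push hidden (Q) space up in energy

PHP, PHQ = H[:n_p, :n_p], H[:n_p, n_p:]
QHP, QHQ = H[n_p:, :n_p], H[n_p:, n_p:]

E = np.linalg.eigvalsh(PHP)[0]           # initial guess for the lowest root
for _ in range(50):
    Heff = PHP + PHQ @ np.linalg.solve(E * np.eye(n - n_p) - QHQ, QHP)
    E = np.linalg.eigvalsh(Heff)[0]      # iterate to self-consistency

print(E, np.linalg.eigvalsh(H)[0])       # compare with the full lowest eigenvalue
```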
Graphical Methods
Decompose Heff into effective interactions Vij between π electrons
(expand (E − QHQ)⁻¹ in an infinite series and remove the E dependence)
Represent the terms diagrammatically: ~300 diagrams to evaluate
Precompile using symbolic manipulation:
~35 MB executable; 8-10 hours to compile
run time: 3-4 hours / parameter
Effective Hamiltonians:
Numerical Calculations
V_CC :  π-only 16 eV ;  effective 11.5 eV ;  empirical 11-12 eV
Compute ab initio values of the empirical parameters:
can test all the basic assumptions of semi-empirical theory,
"from first principles"
Also provides highly accurate eigenvalue spectra
Augment commercial packages (i.e. Fujitsu MOPAC) to model the
spectroscopy of photoactive proteins