Opportunities and Challenges of Web Search and Mining   Lee-Feng Chien ( 簡立峰 ) Academia Sinica & National Taiwan University
Outline
  - Web SE
  - Inside SE
  - Google’s Business Models
  - Google’s Impacts
  - Recent Development
  - Next-Generation WSE
  - Web Mining
WSE = Google Globalization!
WSE = Google
Problems of WSE
  - Inside WSE: fast, coverage, accuracy
  - Business: profitable, models, competition
  - Impacts: Web computing, knowledge windows, new paradigm of civilization
I.  Some Must-Know   Statistics
Online Language Populations Source: Global Reach (global-reach.biz/globstats)
Top Ten Languages in the Web
  - More and more non-English users!
  - Source: Internet World Stats
  TOP TEN LANGUAGES IN THE INTERNET
  Language | Internet Users, by Language | Average Penetration | World Population Estimate for Language | Language as % of Total Internet Users
  English | 287,369,520 | 26.2 % | 1,098,654,265 | 35.9 %
  Chinese | 105,484,112 | 8.0 % | 1,321,669,200 | 13.2 %
  Japanese | 66,548,060 | 52.1 % | 127,853,600 | 8.3 %
  German | 54,035,201 | 56.3 % | 95,893,300 | 6.8 %
  Spanish | 53,670,063 | 13.9 % | 386,413,200 | 6.7 %
  French | 35,034,269 | 9.3 % | 375,164,185 | 4.4 %
  Korean | 30,670,000 | 41.0 % | 74,730,000 | 3.8 %
  Italian | 28,610,000 | 49.3 % | 57,987,100 | 3.6 %
  Portuguese | 23,058,254 | 10.3 % | 224,664,100 | 2.9 %
  Dutch | 13,657,170 | 56.6 % | 24,125,950 | 1.7 %
  TOP TEN LANGUAGES | 698,353,773 | 18.4 % | 3,787,154,900 | 87.3 %
  Rest of the Languages | 101,686,725 | 3.9 % | 2,602,992,587 | 12.7 %
  WORLD TOTAL | 800,040,498 | 12.5 % | 6,390,147,487 | 100.0 %
Web Content
  - More and more non-English pages
  - Source: Network Wizards, Jan 99 Internet Domain Survey
Web Users and Pages (5 years ago)
  - Challenge of scalability!
  - Chinese users: 110M, including 87M (CN), 4.9M (HK), 8.8M (TW), 2.14M (SG), and others
  - Source: Global Reach, 2004
Number of Chinese Web Pages
  - 573,000,000 pages
  - Scalability problem!
Number of Web Pages
  - The world’s largest search engine? Google: 4,285,199,774 pages (4.28 billion Web pages, 880 million images, and other documents)
  - [Chart: Billions of Textual Documents Indexed, as of Sept 2, 2003. Key: GG = Google, ATW = AllTheWeb, INK = Inktomi, TMA = Teoma, AV = AltaVista. Source: Search Engine Watch]
The top 10 Internet trends 2004 predicted by eOneNet.com
  1. The world Internet population will continue to grow at an exponential rate, with China taking the lead in Asia with more than 100 million Internet users.
  2. Broadband Internet penetration will continue to grow, with China and the US in the lead, each with an expected growth rate exceeding 30%.
  3. Online retail sales will still be led by the US, with expected revenue exceeding US$80 billion.
  4. Paid search will account for the biggest share of online ad spending. With the successful paid search business models of Google and Overture, more search engines will offer paid search advertising.
  5. Spam will increase at least 20% despite the new US anti-spam law. US legislators will be forced to consider amending the anti-spam law from an opt-out law to an opt-in law.
  6. Ads placed in opt-in email newsletters will increase 25% as legitimate marketers find this an easier way to comply with the anti-spam law and a better way of targeting customers.
  7. Rich media will continue to be hot. More than 25% of online ads served will contain rich media content.
  8. 20% more small businesses will develop their own websites or use the Internet as a sales and marketing channel.
  9. Online entertainment will grow at a rapid pace, with more sites offering video and digital music download services.
  10. The Internet boom will revive, with more Internet companies going for IPOs both in the US and in Asia, kicked off in particular by the most anticipated Google IPO in spring.
II.  Inside WSE
Components
  - Crawler/Spider
  - Index Server
  - Query Server
  - Document Delivery
Architecture
  - [Figure: search engine architecture. Labels: Browser, Web, SE (x3), Index (x3), Spider, Indexer, Archive, Log; annotations: 1B queries/day, quality results, spam, freshness, 5B pages, scalable; steps (1) to (5)]
Spider
  - Gets all pages from the Web by traversing links
  - Challenges: performance (e.g., pages per PC), coverage, currency, spam filtering, the hidden Web
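The Web traversal that the Spider slide refers to can be sketched as a breadth-first walk over links. This is only an illustration under stated assumptions: `TOY_WEB` and `toy_fetch_links` are hypothetical stand-ins for real page fetching, and the sketch ignores the coverage, currency, and spam challenges listed on the slide.

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=1000):
    """Breadth-first traversal; returns URLs in the order they were fetched."""
    frontier = deque(seed_urls)   # URLs waiting to be fetched
    seen = set(seed_urls)         # URLs already queued, to avoid refetching
    fetched = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        fetched.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched

# Hypothetical in-memory "Web": page -> outgoing links (no network access).
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

def toy_fetch_links(url):
    return TOY_WEB.get(url, [])

order = crawl(["a"], toy_fetch_links)
print(order)
```

A real spider would replace `toy_fetch_links` with an HTTP fetch plus link extraction and would spread the frontier over many machines, which is where the pages-per-PC performance question comes from.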
Index Server
  - Indexes occurrences of all words in the pages
  - Data cleanliness
  - Challenges: space overhead (pages per PC), incremental indexing, scalability and distributed processing, multiple languages
System Anatomy
Data Structure
  - Lexicon: fits in memory; kept in two different forms
  - Hit list: accounts for most of the space; uses 2 bytes per hit to save space
  - Forward index: barrels are sorted by wordID; inside a barrel, entries are sorted by docID
  - Inverted index: same content as the forward index, but sorted by wordID; each doc list is sorted by docID
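To make the relation between the lexicon, the forward index, and the inverted index concrete, here is a minimal in-memory sketch. It assumes whitespace tokenization and plain Python dictionaries; the function names are illustrative, and the barrels and compact 2-byte hit encoding mentioned on the slide are not modeled.

```python
from collections import defaultdict

def build_forward_index(docs):
    """Return (lexicon, forward): word -> wordID, and docID -> [(wordID, position)]."""
    lexicon = {}
    forward = {}
    for doc_id in sorted(docs):
        hits = []
        for pos, word in enumerate(docs[doc_id].lower().split()):
            word_id = lexicon.setdefault(word, len(lexicon))
            hits.append((word_id, pos))
        forward[doc_id] = hits
    return lexicon, forward

def invert(forward):
    """Same hits as the forward index, regrouped by wordID; doc lists sorted by docID."""
    postings = defaultdict(dict)
    for doc_id, hits in forward.items():
        for word_id, pos in hits:
            postings[word_id].setdefault(doc_id, []).append(pos)
    return {w: sorted(d.items()) for w, d in sorted(postings.items())}

docs = {1: "web search engine", 2: "web mining", 3: "search the web"}
lexicon, forward = build_forward_index(docs)
inverted = invert(forward)
web_postings = inverted[lexicon["web"]]
print(web_postings)
```

Answering a query then amounts to looking up each query word's wordID in the lexicon and intersecting the corresponding doc lists.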
Query Server
  - Searches for relevant URLs by looking up the indices
  - Challenges: speed (queries per second), functions supported, localization
PageRank
PageRank (Cont.)
  Let $B_u$ be the set of pages that point to $u$, let $N_u$ be the number of links from $u$, and let $c$ be a factor used for normalization. Then a simplified version of PageRank is:
  $$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$
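The simplified PageRank, in which each page v passes R(v)/N_v to every page it links to and c rescales the result, can be approximated by repeated iteration. The sketch below is a toy illustration, not a production ranker: it assumes every page has at least one out-link and that all link targets appear as keys, and it treats c as the factor that keeps the ranks summing to 1. The function name and the three-page graph are made up for the example.

```python
def simplified_pagerank(links, iterations=100):
    """links: page -> list of pages it links to (every page needs >= 1 out-link)."""
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v in pages:
            share = rank[v] / len(links[v])      # R(v) / N_v
            for u in links[v]:                   # v is in B_u for each such u
                new_rank[u] += share
        total = sum(new_rank.values())
        # Normalization plays the role of the factor c.
        rank = {p: r / total for p, r in new_rank.items()}
    return rank

# Toy graph: a -> b, a -> c, b -> c, c -> a
ranks = simplified_pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
print(ranks)
```

On this graph the ranks settle near a = 0.4, b = 0.2, c = 0.4: page c collects rank from both a and b, and passes all of it back to a.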
Search Functions
  - Phrase search, e.g. "petite galerie"
  - Truncation, e.g. librar*, wom*n
  - Constraining search, e.g. title:"The Wall Street Journal"
  - Proximity search, e.g. gold near silver
  - Boolean, e.g. +noir +film -"pinot noir"
  - Parentheses and nested Boolean, e.g. silver and not (gold or platinum)
  - Limit search, e.g. limit by date range
  - Capitalization, e.g. turkey vs. Turkey
  - Ranking fields and refine search
  - LiveTopics
  - Translate service
  - Other
Document Delivery
  - Bandwidth bottleneck
  - Presentation
  - Caching: queries, search results
  - Akamai model
III.  Business
What is Google?
  - Specialized web search engine
  - Founded in 1998 by two graduate students at Stanford University (Larry Page and Sergey Brin)
  - Provides a comprehensive, relevant, and easy-to-use web search and browsing service (free)
  - Google’s features: fast, unbiased, and accurate results; access to over 4 billion web pages and over 800 million images (most important: valid web pages)
Company Facts
  - Employees: 1,300+
  - Languages spoken: 34
  - Worldwide offices: 21 (mostly in US & Europe)
  - Annual revenues: $900m
Google Revenue (an e-business)
  - ½ from selling relevant text-based ads (sponsored links near search results)
  - ½ from licensing its search technology to companies like Yahoo
  - Source: Eric Schmidt interview, PCWorld.com (January 30, 2002)
Sources of Revenue
  - AdWords (150,000 advertisers): “sponsored links” ads with cost-per-click pricing; advertisers pay only when people click on the link. Advertising is extremely cheap and effective, e.g. Edmunds.com spent “$250,000 a month in advertising” because every $1 spent generated $1.70.
  - Google Search Appliance: an integrated hardware/software solution that extends the power of Google to corporate intranets and web servers. Customers include Cisco Systems, Sony, Procter & Gamble, Sun Microsystems, etc.
Challenges (cont.)
  - Easy entry into the search engine industry
  - Lack of customer lock-in (vs. Microsoft); Google will focus on creating services that draw customers in voluntarily
  - Large, well-known competitors are focusing on in-house search technology (Yahoo, Microsoft, AOL, eBay, Amazon)
  - Customers are becoming competitors (Yahoo, AOL)
Competitors: eBay and Amazon
  - eBay (www.ebay.com), e-commerce: a Web-based marketplace in which a community of buyers and sellers is brought together to browse, buy, and sell various items. Business revenue: fees charged on proceeds (5% on $0.01-$25, 2.5% on $25-$1,000, 1.25% over $1,000)
  - Amazon (www.amazon.com), e-commerce: a customer-centric company that sells a range of products that it purchases from manufacturers and distributors
Competitors: Microsoft and Yahoo
  - Microsoft is developing its own search engine: it can “lasso” users into its search engine through its operating system, and it has the “brainiacs” to implement top-of-the-line search engine technology
  - Yahoo was a customer of Google (and may now become Google’s biggest competitor): it offers placement under sponsored links and within actual results (“unethical”)
IV.  Impacts
Impacts   Web Computing  Knowledge Windows  New Web OS
Web Computing   Faster than local search  Very-large scale of computing systems  Realize global users’ behaviors  Acquire global information sources
Web Computing
  - Local disk or global disk?
  - Personal information management?
  - Gmail
  - Photo search
Knowledge Windows   Windows of Information Search  Alliance with online databases  Windows of Personal Knowledge Management  Knowledge Windows
New Web OS Merged with Linux OS Software download from end-users  Information Service OS
V.  New Gen. of WSE
Advanced Google
  - Is Google good enough? Example queries: “Takano”, “Takano NII”, “Takano NII Japan”
  - More about Google services: http://www.google.com/options/
New Features in Google
  - Google Labs: http://labs.google.com/
  - Google Desktop Search: searches text, Web, Word, Excel, PowerPoint, Outlook, AOL Instant Messenger
  - Google SMS: searches phone book, dictionary, product prices, …
  - Google Print: searches books
 
Other Search Tools
  - A9.com (by Amazon): bookmark, history, discover, diary; books, movies, …
  - Clusty.com (by Vivisimo): clustering engine
  - Snap.com (by Idealab): sorting by popularity, satisfaction, Web popularity, Web satisfaction, domain, …
  - Alexa.com (by Amazon): average user review ratings, …
  - Others: Yahoo, AskJeeves, AOL Search, HotBot, MSN, Netscape, Lycos, Altavista, LookSmart, Gigablast, Overture, About, FindWhat, Teoma, InformSearch, …
Clusty.com
Example on Vivisimo
Vivisimo  (cont.)
New Directions
  - Personalization
  - Photo search, email search & filtering
  - Information extraction, e.g. scholar search
  - Information agent
  - Deep Web search
VI.  Web Mining
Web Search/Information Retrieval
  - [Diagram labels: Millions of Users, Web Search Engine, Information Seeking]
Improving Search via Mining
  - [Diagram labels: Millions of Users; Web texts, images, logs, …; Search Engine; Knowledge Discovery]
Valuable Web Resources
  - [Diagram labels: Millions of Users; Web logs, texts, images, …; Knowledge Discovery]
  - Resources: hyperlinks, anchor texts, search-result pages, query logs, query session logs, click-stream logs, Deep Web, …
Discovered Knowledge
  - [Diagram labels: Millions of Users; Web logs, texts, images, …; Knowledge Discovery]
  - Users’ preferences/needs: topic, location, timing, …
  - Authority/popularity: site, file, people, company, product
  - Clusters/associations/relations: site, page, people, company, product, query
Web Mining for IR
  - [Diagram labels: Millions of Users; Web logs, texts, images, …; Knowledge Discovery]
  - Tasks: search, classification, clustering, cross-language IR, information extraction, text mining, filtering
CS 276 / LING 239I: Information Retrieval and Web Mining (Prabhakar Raghavan and Hinrich Schütze)
  - Course description: Basic and advanced techniques for text-based information systems: efficient text indexing; Boolean, vector space, and probabilistic retrieval models; evaluation and interface issues; Web search including crawling, link-based algorithms, and Web metadata; text/Web clustering, classification, wrapper, information extraction, and collaborative filtering systems; text mining. Projects can be chosen from diverse topics in information retrieval.
Computational Linguistics, 29, Issue 3, September 2003.
Research at Web Knowledge Discovery Lab: the Live series
  - LiveTrans: online translation of unknown queries via the Web (SIGIR’04, ACL’04, JCDL’04; ACM Trans. on Information Systems, 2004)
  - LiveClassifier: training classifiers and classifying short text via the Web (WWW’04, IJCNLP’04; ACM Trans. on ALIP, 2004)
  - LiveCluster: generating taxonomy from terms or documents (CIKM’04; ACM Trans. on Information Systems, 2004)
LiveTrans:  Cross-language Web Search
LiveClassifier : Classifying search results into user-defined classification tree
LiveClassifier  :  Paper Title Categorization Note: no labeled training data
LiveCluster :  Taxonomy Generation
Terms Clustering
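Query Clustering
  - [Figure: dendrogram of query terms, cut into five clusters (1-5)]
  - Employment: 勞委會 (Council of Labor Affairs), 職訓局 (Bureau of Employment and Vocational Training), 就業 (employment), 青輔會 (National Youth Commission), 自傳 (autobiography), 徵才 (recruiting), 人力資源 (human resources), 104人力銀行 (104 Job Bank), 人力銀行 (job bank), 找工作 (job hunting), 履歷表 (résumé), 求職 (job seeking), 求才 (talent seeking)
  - Fortune-telling: 占卜 (divination), 塔羅牌 (tarot cards), 算命 (fortune-telling), 紫微斗數 (Zi Wei Dou Shu astrology), 命理 (fate reading), 姓名學 (name divination), 心理測驗 (psychological test), 星座 (zodiac signs), 愛情 (love)
  - Airlines: eva, 長榮航空 (EVA airline), 長榮 (EVA), 航空公司 (airline), 航空 (airway), 華航 (China airline), 中華航空 (China airline)
  - Pirated software: 補帖, 大補帖 (pirated software compilations), 泡麵 (instant noodles), dbt
  - Martial-arts fiction: 武俠 (martial arts), 金庸 (Jin Yong), 武俠小說 (martial-arts novels), 黃易 (Huang Yi), 作家 (writer)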
 
Outline
  - Translating Unknown Queries (SIGIR’04)
  - Training Text Classifiers (WWW’04)
  - Generating Taxonomy/Topic Hierarchies (TOIS’04)
Translating Unknown Queries
  - Anchor-text mining: probabilistic modeling (ACM TALIP’02); transitive translation (ACM TOIS’04)
  - Search-result page mining: translation extraction & selection (JCDL’04); CLIR & other applications (SIGIR’04, ACL’04)
  - Note: first work dealing with online translation
Introduction (cont.)
  - Bottleneck of CLIR service: real queries are often short, contain out-of-dictionary terms, and might have local variations (e.g., proper nouns, new terminologies, …)
  - Need for a powerful query translation engine with an up-to-date dictionary
  English Terminologies | Chinese Translation
  Bill Gates | 比爾蓋茲
  Clinton | 柯林頓 / 克林頓
  SARS | 嚴重急性呼吸道症候群 / 非典 / 沙士
  Digital library | 數位圖書館 / 數字圖書館
  Banff | 班夫 / 班芙
  Ishikawa | 石川県
  NII Japan | 国立情報学研究所
  Louvre museum | 羅浮宮
Web Mining of Query Translations
  - Different problems for different resources
  - [Diagram labels: Source Term, Term Translation, Target Translations, Web Mining, Anchor-Text Mining, Search-Result Mining, OOD (out-of-dictionary); example: Yahoo <-> 雅虎]
Anchor Text (Yahoo <-> 雅虎)
  - Applies to most languages
  - Translation candidates are likely to appear in the same anchor-text set
Search-Result Page (National Palace Museum vs. 故宮博物院)
  - Chinese pages often have a mixed-language characteristic
Problems
  - Term extraction
  - Translation selection and noise reduction
  - Language pairs with limited corpora
  - Processing speed
  - Data cleanliness (language identification)
  - Language independence
Term Extraction: SCPCD
Term Selection: Probabilistic Inference Model
  - Integrates anchor texts and link structures into a probabilistic inference model
  - Based on co-occurrence and page authority
  - [Diagram labels: Page Authority, Co-occurrence, Page Rank]
Observation of Anchor Text
  - Source term (Ts) => translation (Tt), e.g. 雅虎 => Yahoo
  - [Figure, built up over four slides: anchor texts pointing to www.yahoo.com and www.yahoo.com.tw. Anchor-text labels: “Yahoo”, “雅虎”, “- in USA”, “Taiwan -”, “台灣 -” (Taiwan), “搜尋引擎” (search engine); in-link counts shown: #in-link = 187 and #in-link = 21; annotations: Source Query, Translation Candidates, Anchor-Text Set, Page Authority, Co-occurrence]
Search-Result Mining
  - [Diagram labels: Source Query, Search Engine, Web Pages, Term Extraction, Term Selection, Target Translations]
  - Term extraction uses the PAT-tree-based term extraction method [Chien, SIGIR ’97]
Term Selection
  - How should the candidate translations T_1, T_2, …, T_n of a query S be ranked?
  - Co-occurrence: S and T_i frequently co-occur in the same pages (not necessarily true for synonyms and antonyms)
  - Context similarity: the result pages of S and T_i contain similar co-occurring context terms, which are used as feature vectors
Chi-Square Test
  - A statistical method for co-occurrence analysis [Gale & Church ’91]
  - a: number of pages containing both terms s and t
  - b: number of pages containing term s but not t
  - c: number of pages containing term t but not s
  - d: number of pages containing neither term s nor t
  - N: the total number of pages, i.e., N = a + b + c + d
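The scoring formula itself did not survive in the transcript. As a hedged illustration, the sketch below uses the standard Pearson chi-square statistic for a 2x2 contingency table built from the counts a, b, c, d defined on the slide, χ² = N(ad − bc)² / ((a+b)(a+c)(b+d)(c+d)); whether the authors used exactly this form is an assumption. The function name is illustrative.

```python
def chi_square(a, b, c, d):
    """Pearson chi-square for a 2x2 co-occurrence table.

    a: pages with both s and t      b: pages with s but not t
    c: pages with t but not s       d: pages with neither
    """
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no association evidence
    return n * (a * d - b * c) ** 2 / denom

# Terms that always appear together score high; independent terms score 0.
strong = chi_square(10, 0, 0, 10)
independent = chi_square(5, 5, 5, 5)
print(strong, independent)
```

In a search-result setting the four counts would typically come from page-count queries such as "s AND t", "s", and "t", with d obtained from an assumed total collection size.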
Context Vector Analysis
  - Uses co-occurring context terms as feature vectors
  - Similarity measure: cosine measure
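A minimal sketch of the cosine measure over context vectors follows. The vectors are sparse dictionaries mapping context terms to weights; the counts in the example are invented for illustration, and the helper names are not from the slides.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (term -> weight)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up context-term counts for a source term and two candidate translations.
source_context = {"network": 3, "router": 2, "system": 1}
candidates = {
    "Cisco": {"network": 2, "router": 2, "switch": 1},
    "system": {"system": 3, "operating": 2},
}
scores = {t: cosine(source_context, vec) for t, vec in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Unlike raw co-occurrence, this comparison can rank a candidate highly even when it rarely appears on the same page as the source term, as long as the two share surrounding vocabulary.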
Indirect Association Problem
  - [Fig. 4. An illustration showing the indirect association problem, in which the dashed arrows indicate possible indirect association errors. Node labels: s, t, s1, t1; terms: 思科 (Cisco), Cisco, 系統 (system), system]
Competitive Linking Algorithm
  - [Fig. 6. An illustration showing a bipartite graph generated by using Algorithm 2. Node labels: s, t1, t2, St1, St2; terms: 思科 (Cisco), Cisco, system, 系統 (system), 資訊 (information), 網路 (network), 電腦 (computer)]
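The deck's Algorithm 2 is not reproduced in the transcript, so the following is only a sketch of the general competitive-linking idea: sort all source-target pairs by association score and greedily accept a pair only if neither side has been linked yet. The scores are invented to show how this one-to-one constraint can suppress an indirect association.

```python
def best_per_source(scores):
    """Naive selection: each source term takes its highest-scoring target."""
    best = {}
    for (s, t), value in scores.items():
        if s not in best or value > scores[(s, best[s])]:
            best[s] = t
    return best

def competitive_linking(scores):
    """Greedy one-to-one linking over pairs sorted by descending score."""
    links = {}
    used_targets = set()
    for (s, t), _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        if s not in links and t not in used_targets:
            links[s] = t
            used_targets.add(t)
    return links

# Made-up association scores; ("系統", "Cisco") is an indirect association.
scores = {
    ("思科", "Cisco"): 0.90,
    ("思科", "system"): 0.40,
    ("系統", "Cisco"): 0.85,
    ("系統", "system"): 0.80,
}
naive = best_per_source(scores)
linked = competitive_linking(scores)
print(naive, linked)
```

Here the naive choice maps 系統 (system) to "Cisco", while the competitive version gives "Cisco" to 思科 first and leaves "system" for 系統.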
Combined Method
  - Takes advantage of both methods: anchor-text-based (higher precision) and search-result-based (higher coverage)
  - R_m(s, t): the rank of candidate t for source term s under method m
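The combination formula is missing from the transcript. One plausible reading, given that the slide defines a per-method rank R_m(s, t), is a weighted sum of reciprocal ranks; the sketch below implements that reading as an assumption, with hypothetical method names and equal weights.

```python
def combine_rankings(rankings, weights):
    """rankings: method -> candidates ordered best first.

    Scores each candidate by sum over methods of weight / rank (assumed form),
    and returns candidates sorted by descending combined score.
    """
    scores = {}
    for method, ranked in rankings.items():
        for rank, candidate in enumerate(ranked, start=1):
            scores[candidate] = scores.get(candidate, 0.0) + weights[method] / rank
    return sorted(scores, key=lambda c: (-scores[c], c))

# Hypothetical ranked outputs of the three methods for one source term.
rankings = {
    "anchor_text": ["Yahoo"],
    "chi_square": ["portal", "Yahoo"],
    "context_vector": ["Yahoo", "portal"],
}
weights = {"anchor_text": 1.0, "chi_square": 1.0, "context_vector": 1.0}
combined = combine_rankings(rankings, weights)
print(combined)
```

A candidate missing from one method's list simply contributes nothing for that method, which lets the high-coverage search-result methods fill in when the anchor-text set has no match.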
Experiments: Performance on Query Translation
  - Test bed: real query terms from the log of the Dreamer search engine in Taiwan; 228,566 unique terms collected over a period of 3 months in 1998
  - Random-query test set: 50 query terms in Chinese, randomly selected from the top 20,000 queries in the log; 40 of them were out-of-dictionary
Random Query Test Set
  - Many query terms didn’t appear in anchor-text sets (coverage)
  Table 2. Coverage and top 1~5 inclusion rates obtained with the four different methods for the random-query set (CV = context vector analysis, χ² = chi-square test, AT = anchor text).
  Method | Top-1 | Top-3 | Top-5 | Coverage
  CV | 40.0% | 54.0% | 54.0% | 68%
  χ² | 36.0% | 50.0% | 52.0% | 68%
  AT | 20.0% | 32.0% | 32.0% | 32%
  Combined | 44.0% | 64.0% | 66.0% | 72%
Other Experiments
  - 430 popular Chinese queries: 67.4% top-1 inclusion rate
  - Common terms: 100 common nouns and 100 common verbs randomly selected from a general-purpose Chinese dictionary
Transitive Translation
  Top-n inclusion rates obtained with different models.
  Model | Top1 | Top2 | Top3 | Top4 | Top5
  Direct Translation | 35.7% | 43.0% | 46.9% | 49.6% | 51.2%
  Indirect Translation | 44.2% | 55.1% | 58.0% | 59.7% | 60.5%
  Transitive Translation | 49.2% | 58.1% | 60.9% | 61.6% | 62.0%
Transitive Translation Model
Chinese-Japanese Translation
  Table VI. Top-n inclusion rates obtained with different models for translating traditional Chinese queries into Japanese.
  Model | Top1 | Top2 | Top3 | Top4 | Top5
  Direct | 10.5% | 12.8% | 14.3% | 15.1% | 15.1%
  Indirect | 40.2% | 49.4% | 56.6% | 58.6% | 59.6%
  Transitive | 42.9% | 51.4% | 58.6% | 61.3% | 61.9%
  Table VII. Selected source query terms in traditional Chinese and their translations in English, simplified Chinese and Japanese respectively, which were extracted by our transitive translation model.
  Source terms (Traditional Chinese) | English | Simplified Chinese | Japanese
  新力 | Sony | 索尼 | ソニー
  耐吉 | Nike | 耐克 | ナイキ
  史丹佛 | Stanford | 斯坦福 | スタンフォード
  雪梨 | Sydney | 悉尼 | シドニー
  網際網路 | internet | 互联网 | インターネット
  網路 | network | 网络 | ネットワーク
  首頁 | homepage | 主页 | ホームページ
  電腦 | computer | 计算机 | コンピューター
  資料庫 | database | 数据库 | データベース
  資訊 | information | 信息 | インフォメーション
Translation Lexicons with Regional Variations
  - [Figure 1: Examples of search-result pages in different Chinese regions that were obtained via the English query words “George Bush” from Google: (a) Taiwan, (b) Mainland China, (c) Hong Kong]
Summary
  - A work dealing with live translation of unknown queries
  - Anchor-text-based: high precision for high-frequency terms; effective for proper nouns in multiple languages; not applicable if the anchor-text set is not large enough
  - Search-result-based: exploits rich Web resources; high coverage for the English-Chinese language pair
LiveCluster:  Generating Taxonomy from terms or documents
Taxonomy Generation from Terms
Hierarchical Query Clustering
The Steps
  1. Feature extraction: use co-occurring seed terms extracted from the top retrieved pages
  2. Term vector: each query term is assigned a term vector that records the co-occurring feature terms and their frequency values in the retrieved documents
  3. Term similarity: tf*idf-based cosine measure
  4. Hierarchical term clustering: cluster popular query terms in the log into initial categories; query terms with similar features are grouped into clusters
Feature Extraction
  - Use co-occurring seed terms extracted from the top retrieved pages
  - Example: snippets retrieved for the query term “nude”:
      Creative Nude Photography Network -- Fine Art Nude and ... The Creative Nude and Erotic Photography Network is the number one net portal to the best in fine art nude and erotic photography! Over 100 CNPN Member Sites ...
      Nude Places ... to be naked. Walking in the forest, cruising the lake in open boats, swimming, picnicking and nude photography are all enjoyed in the nude. 60 minutes $39.95. ...
      A Brave Nude World ... A Brave Nude World! Warning: This site contains links to fine art nude & erotic photography. If you are under 18 or do not wish to view this material, You can ...
  - Co-occurring feature terms:
      term | tf/df
      photography | 2/2
      art | 3/2
      … | …
      naked | 1/1
      erotic photography | 3/2
Term Weighting
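The weighting formula on this slide was an image and is not in the transcript. As a stand-in, the sketch below applies the common tf × log(N/df) weighting to query-term feature vectors, where N is the number of query terms and df counts how many query terms co-occur with a feature; the exact variant used in the deck is an assumption, and the function name and toy counts are illustrative.

```python
import math

def tfidf_vectors(term_features):
    """term_features: query term -> {feature term: co-occurrence frequency}.

    Returns query term -> {feature term: tf * log(N / df)}.
    """
    n = len(term_features)
    df = {}
    for features in term_features.values():
        for feature in features:
            df[feature] = df.get(feature, 0) + 1
    return {
        query: {f: tf * math.log(n / df[f]) for f, tf in features.items()}
        for query, features in term_features.items()
    }

# Toy example: feature "a" occurs with every query term, so its weight is 0.
vectors = tfidf_vectors({"q1": {"a": 2, "b": 1}, "q2": {"a": 1}})
print(vectors)
```

Features shared by every query term get weight zero, so only distinguishing features contribute to the cosine similarity between term vectors.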
Extraction of Basic Feature Terms
  - Compared the performance of different features: randomly selected, high-frequency, and seed terms
  - Seed terms are popular queries not affected by ephemeral trends, e.g., “movie”, “basketball”, “mutual fund”
  - They are more expressive and distinguishable in describing a particular category
  - Two logs were compared, and the 9,709 overlapping top query terms were extracted as feature terms
Task I: Query Clustering (Cont.)
  1. Feature extraction: use co-occurring seed terms extracted from the top retrieved pages
  2. Term vector: each query term is assigned a term vector that records the co-occurring feature terms and their frequency values in the retrieved documents
  3. Term similarity: TF*IDF-based cosine measure
  4. Hierarchical term clustering: cluster popular query terms in the log into initial categories; query terms with similar features are grouped into clusters
Term Similarity
Hierarchical Term Clustering
  Agglomerative hierarchical clustering (AHC):
  1. Compute the similarity between all pairs of clusters: estimate the similarity between all pairs of their composed terms and use the lowest term similarity value as the cluster similarity value (complete linkage method)
  2. Merge the two most similar (closest) clusters
  3. Update the cluster vector of the new cluster
  4. Repeat steps 2 and 3 until only a single cluster remains
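A compact sketch of agglomerative clustering with complete linkage follows. It recomputes cluster similarity directly from pairwise term similarities instead of maintaining an explicit cluster vector, which is a simplification of the step described on the slide; the similarity table is made up for the example.

```python
def complete_linkage_ahc(items, sim):
    """Return the merge history as (cluster_a, cluster_b, similarity) tuples."""
    clusters = [(item,) for item in items]
    merges = []

    def cluster_sim(a, b):
        # Complete linkage: the lowest pairwise term similarity.
        return min(sim(x, y) for x in a for y in b)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cluster_sim(clusters[i], clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        merges.append((clusters[i], clusters[j], s))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Made-up pairwise similarities; unlisted pairs default to 0.1.
SIM = {frozenset(("job", "resume")): 0.9, frozenset(("tarot", "horoscope")): 0.8}

def toy_sim(x, y):
    return SIM.get(frozenset((x, y)), 0.1)

merges = complete_linkage_ahc(["job", "resume", "tarot", "horoscope"], toy_sim)
for a, b, s in merges:
    print(a, "+", b, "at similarity", s)
```

The merge history is the hierarchy: cutting it at a similarity threshold (for instance between 0.8 and 0.1 here) yields the flat clusters.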
 
Clustering Results
  - [Figure: dendrogram of the clustered query terms, cut into five clusters (1-5): employment terms (e.g. 求職, job seeking), fortune-telling terms (e.g. 算命, fortune-telling), airline terms (e.g. 長榮航空, EVA airline), pirated-software terms (e.g. 大補帖), and martial-arts fiction terms (e.g. 武俠小說, martial-arts novels)]
Cluster Partition
Quality Function
Quality Function  (Cont.)
Quality Function  (Cont.)
Preliminary Experiment
  - Test queries: two sets, the top 1k queries and 1k random queries; each test query has been manually assigned to its corresponding classes
  - Evaluation metric: F-measure
Evaluation: F-Measure
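The slide's formula is not in the transcript. A common way to score a clustering against manually assigned classes is the class-weighted best-match F-measure, sketched below as an assumption about what was used: for each class, take the best F = 2PR/(P+R) over all clusters, then average weighted by class size. Names and data are illustrative.

```python
def clustering_f_measure(classes, clusters):
    """classes: label -> set of items; clusters: list of sets of items."""
    n = sum(len(members) for members in classes.values())
    total = 0.0
    for members in classes.values():
        best = 0.0
        for cluster in clusters:
            overlap = len(members & cluster)
            if overlap == 0:
                continue
            precision = overlap / len(cluster)
            recall = overlap / len(members)
            best = max(best, 2 * precision * recall / (precision + recall))
        total += len(members) / n * best
    return total

classes = {"jobs": {"job", "resume", "hire"}, "fortune": {"tarot", "horoscope"}}
clusters = [{"job", "resume"}, {"hire", "tarot", "horoscope"}]
score = clustering_f_measure(classes, clusters)
print(round(score, 3))
```

A clustering that reproduces the classes exactly scores 1.0; the example, with one misplaced term, scores 0.8.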
Obtained F-Measures
 
Results of Hierarchical Structure Generation

Web Search And Mining (Ntuim)

  • 1. Opportunities and Challenges of Web Search and Mining Lee-Feng Chien ( 簡立峰 ) Academia Sinica & National Taiwan University
  • 2. Outline Web SE Inside SE Google’s Business Models Google’s Impacts Recent Development Next-Generation WSE Web Mining
  • 3. WSE = Google Globalization!
  • 5. Problems of WSE Inside WSE . Fast . Coverage . Accuracy
  • 6. Problems of WSE Inside WSE . Fast . Coverage . Accuracy Business . Profitable . Models . Competitions
  • 7. Problems of WSE Inside WSE . Fast . Coverage . Accuracy Business . Profitable . Models . Competition Impacts . Web Computing . Knowledge Windows . New Paradigm of Civilization
  • 8. I. Some Must-Know Statistics
  • 9. Online Language Populations Source: Global Reach (global-reach.biz/globstats)
  • 10. Top Ten Languages in the Web Source: Internet World Stats More and more non-English users! 100.0 % 6,390,147,487 12.5 % 800,040,498 WORLD TOTAL 12.7 % 2,602,992,587 3.9 % 101,686,725 Rest of the Languages 87.3 % 3,787,154,900 18.4 % 698,353,773 TOP TEN LANGUAGES 1.7 % 24,125,950 56.6 % 13,657,170 Dutch 2.9 % 224,664,100 10.3 % 23,058,254 Portuguese 3.6 % 57,987,100 49.3 % 28,610,000 Italian 3.8 % 74,730,000 41.0 % 30,670,000 Korean 4.4 % 375,164,185 9.3 % 35,034,269 French 6.7 % 386,413,200 13.9 % 53,670,063 Spanish 6.8 % 95,893,300 56.3 % 54,035,201 German 8.3 % 127,853,600 52.1 % 66,548,060 Japanese 13.2 % 1,321,669,200 8.0 % 105,484,112 Chinese 35.9 % 1,098,654,265 26.2 % 287,369,520 English Language as % of Total Internet Users World Population Estimate for Language Average Penetration Internet Users, by Language TOP TEN LANGUAGES IN THE INTERNET
  • 11. Web Content Source: Network Wizards Jan 99 Internet Domain Survey More and more non-English pages
  • 12. Web Users and Pages (5 years ago) Challenge of Scalability ! Chinese Users: 110M Including 87M (CN), 4.9M (HK), 8.8M (TW), 2.14M (SG), and others. Source: Global Reach, 2004
  • 13. Number of Chinese Web Pages 573,000,000 pages Scalability Problem !
  • 14. Number of Web Pages The world’s largest search engine ? 4,285,199,774 pages (Google) 4.28 billion Web pages, 880 million images, and other documents Billions Of Textual Documents Indexed As of Sept 2, 2003 KEY: GG=Google, ATW=AllTheWeb, INK=Inktomi, TMA=Teoma, AV=AltaVista. Source: Search Engine Watch
  • 15. The top 10 Internet trends 2004 predicted by eOneNet.com 1.    World Internet population will continue to grow at an exponential rate, with China taking the lead in Asia having more than 100 million Internet users. 2.    Broadband Internet penetration will continue to grow with China and US in the lead with an expected growth rate exceeding 30% each. 3.    Online retail sales will still be led by the US with an expected revenue exceeding US$80 billion. 4.    Paid search will account for the biggest online ad spending. With the successful paid search business models of Google and Overture, more search engines will offer paid search advertising.
  • 16. The top 10 Internet trends 2004 predicted by eOneNet.com 5.    Spams will increase at least 20% despite the new US anti-spam law. The US legislators will be forced to consider amending the anti-spam law from an opt-out law to an opt-in law. 6.    Ads placed in opt-in email newsletters will increase 25% as legitimate marketers find this is the easier way to comply with the anti-spam law and a better way of targeting customers. 7.    Rich media will continue to be hot. More than 25% of online ads served will contain rich media contents.
  • 17. The top 10 Internet trends 2004 predicted by eOneNet.com 8.    20% more small businesses will develop their own websites or use the Internet as a sales and marketing channel. 9.    Entertainment online will be grow at a rapid pace, with more sites offering videos and digital music download services. 10.    The Internet boom will revive with more Internet companies going for IPO both in the US and in Asia, in particular kicked off by the most anticipated Google IPO in Spring.
  • 18. II. Inside WSE
  • 19. Components Crawler/Spider Index Server Query Server Document Delivery
  • 20. Architecture SE SE SE Browser Web 1B queries/day Quality results Log .Spam . Freshness 5B pages Scalable Index Index Index Spider Indexer Archive (1) (2) (3) (4) (5)
  • 21. Spider Get all Pages from the Web Web Traverse Challenges Performance, e.g., #Pages/Per PC Coverage Currency Spam Filtering Hidden Web
  • 22. Index Server Index occurrences of all words in the pages Data Cleanness Challenges Space Overhead,#pages/PC Incremental Scalability & Distributed Processing Multiple Languages
  • 24. Data Structure Lexicon: fits in memory; two different forms. Hit list: accounts for most of the space; uses 2 bytes per hit to save space. Forward index: barrels are sorted by wordID; inside a barrel, entries are sorted by docID. Inverted index: same content as the forward index, but sorted by wordID; each doc list is sorted by docID
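The relationship between the forward index and the inverted index can be illustrated with a small sketch. This is a toy model only: it uses plain words instead of wordIDs, keeps everything in memory, and ignores barrels and the compact hit encoding. The function names and sample documents are hypothetical.

```python
from collections import defaultdict

def build_forward_index(docs):
    """Forward index: docID -> list of (word, position) hits, in document order."""
    forward = {}
    for doc_id, text in docs.items():
        forward[doc_id] = [(word, pos) for pos, word in enumerate(text.lower().split())]
    return forward

def invert(forward):
    """Inverted index: word -> doc list sorted by docID, each entry holding the hit positions."""
    inverted = defaultdict(lambda: defaultdict(list))
    for doc_id in sorted(forward):
        for word, pos in forward[doc_id]:
            inverted[word][doc_id].append(pos)
    # Same content as the forward index, regrouped by word; doc lists sorted by docID.
    return {word: sorted(postings.items()) for word, postings in sorted(inverted.items())}

docs = {2: "web search engine", 1: "web mining and web search"}
index = invert(build_forward_index(docs))
print(index["web"])
```

Answering a one-word query then becomes a single dictionary lookup followed by a scan of an already sorted doc list.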
  • 25. Query Server Search Relevant URLs for queries via looking up indices Challenges Speed, check #queries/Per Sec Functions supported Localization
  • 27. PageRank (Cont.) Let u be a web page, let $B_u$ be the set of pages that point to u, let $N_u$ be the number of links from u, and let c be a factor used for normalization. Then a simplified version of PageRank is: $R(u) = c \sum_{v \in B_u} R(v) / N_v$
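The simplified PageRank definition can be evaluated by repeated substitution. The sketch below is a minimal illustration, not Google's implementation: it assumes every page has at least one out-link and that all link targets are themselves keys of the graph, and it picks the normalization factor c so that the ranks sum to 1 after each pass. The three-page graph is hypothetical.

```python
def simplified_pagerank(links, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v, targets in links.items():
            share = rank[v] / len(targets)  # R(v) / N_v
            for u in targets:               # v belongs to B_u for every target u
                new_rank[u] += share
        total = sum(new_rank.values())
        # The normalization factor c keeps the ranks summing to 1.
        rank = {p: r / total for p, r in new_rank.items()}
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = simplified_pagerank(graph)
print(ranks)
```

On this graph the ranks settle near A = 0.4, B = 0.2, C = 0.4: page B receives only half of A's rank, while C collects the other half plus all of B's.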
  • 28. Search Functions Phrase search, e.g. "petite galerie" Truncation, e.g. librar*, wom*n Constraining search, e.g. title:"The Wall Street Journal" Proximity search, e.g. gold near silver Boolean, e.g. +noir +film -"pinot noir" Parentheses and Nested Boolean, e.g. silver and not (gold or platinum) Limit search, e.g. limit by date range Capitalization, e.g. turkey vs. Turkey Ranking fields and refine search LiveTopics Translate Service Other
  • 29. Document Delivery Bottleneck of Bandwidth Presentation Caching Queries, Search Results Akamai Model
  • 31. What is Google? Specialized web search engine Founded in 1998 by 2 graduate students at Stanford University (Larry Page and Sergey Brin) Provides a comprehensive, relevant, and easy-to-use web search and browsing service (free) Google’s features : fast, unbiased, and accurate results, allows access to over 4 billion web pages, and over 800 million images (most important; valid web pages)
  • 32. Company Facts Employees: 1,300+ Languages spoken: 34 Worldwide Offices: 21 (Mostly in US & Europe) Annual Revenues: $900m
  • 33. Google Revenue Revenue—(an e-business): ½ from selling relevant text-based ads (sponsored links near search results) ½ from licensing its search technology to companies like Yahoo Source: Eric Schmidt Interview, PCWorld.com (January 30, 2002)
  • 34. Sources of Revenue Adwords (150,000 advertisers) “sponsored links” ad cost-per-click pricing; only when people click on the link -- Advertisement is extremely cheap and effective i.e. Edmunds.com spent “$250,000 a month in advertising“ because $1 spent generated $1.70. Google Search Appliance an integrated hardware/software solution that extends the power of Google to corporate intranets and web servers -- Customers include: Cisco Systems, Sony, Procter & Gamble, Sun Microsystems, etc
  • 35. Challenges (cont.) Easy entry into the Search Engine Industry Lack of customer lock-in (vs. Microsoft); Google will focus on creating services to voluntarily draw in customers Large, well-known competitors are focusing on in-house search technology (Yahoo, Microsoft, AOL, eBay, Amazon) Customers are becoming competitors (Yahoo, AOL)
  • 36. Competitors: eBay and Amazon eBay ( www.ebay.com ) E-commerce Web-based marketplace in which a community of buyers and sellers is brought together to browse, buy and sell various items -- Business revenue: fees charged on proceeds: 5% for $0.01-$25, 2.5% for $25-$1000, 1.25% over $1000 Amazon ( www.amazon.com ) E-commerce A customer-centric company that sells a range of products that it purchases from manufacturers and distributors
  • 37. Competitors: Microsoft and Yahoo Microsoft is developing its own search engine -- Can “lasso” users into its search engine through its operating system -- Has the “brainiacs” to implement top-of-the-line search engine technology Yahoo was a customer of Google (may now become Google’s biggest competitor) -- Offers placement under sponsored links and within actual results (“unethical”)
  • 39. Impacts Web Computing Knowledge Windows New Web OS
  • 40. Web Computing Faster than local search Very-large scale of computing systems Realize global users’ behaviors Acquire global information sources
  • 41. Web Computing Local disc or global disc? Personal information management? Gmail Photo search
  • 42. Knowledge Windows Windows of Information Search Alliance with online databases Windows of Personal Knowledge Management Knowledge Windows
  • 43. New Web OS Merged with Linux OS Software download from end-users Information Service OS
  • 44. V. New Gen. of WSE
  • 45. Advanced Google Is Google good enough? “Takano” “Takano NII” “Takano NII Japan” More about Google Services http://www.google.com/options/
  • 46. New Features in Google Google Labs: http://labs.google.com/ Google Desktop Search Searching text, Web, Word, Excel, PowerPoint, Outlook, AOL Instant Messenger Google SMS Searching phone book, dictionary, product prices, … Google Print Searching books
  • 48. Other Search Tools A9.com (by Amazon) Bookmark, history, discover, diary Books, movies, … Clusty.com (by Vivisimo) Clustering engine Snap.com (by Idealab) Sorting by popularity, satisfaction, Web popularity, Web satisfaction, domain, … Alexa.com (by Amazon) Average user review ratings, … Others: Yahoo, AskJeeves, AOL Search, HotBot, MSN, Netscape, Lycos, Altavista, LookSmart, Gigablast, Overture, About, FindWhat, Teoma, InformSearch, …
  • 52. New Directions Personalization Photo search, email search & filtering Information Extraction EX: Scholar search Information Agent Deep Web Search
  • 53. VI. Web Mining
  • 54. Web Search/Information Retrieval Millions of Users Web Search Engine Information Seeking
  • 55. Improving Search via Mining Millions of Users Web texts, images, logs … Search Engine Knowledge Discovery
  • 56. Valuable Web Resources Web logs, texts, images , … Knowledge Discovery Millions of Users Hyper Links Anchor Texts Search Result Pages Query Logs Query Session Logs Clicked Stream Logs Deep Web, ….
  • 57. Discovered Knowledge Web logs, texts, images , … Knowledge Discovery Millions of Users Users’ Preferences/Need: Topic, Location, Timing, … Authority/Popularity: Site, File, People, Company, Product Clusters/Associations/ Relations: Site, Page, People, Company, Product, Query
  • 58. Web Mining for IR Web logs, texts, images , … Knowledge Discovery Millions of Users Search Classification Clustering Cross-language IR Information Extraction Text mining Filtering
  • 59. CS 276 / LING 239I Information Retrieval and Web Mining Prabhakar Raghavan and Hinrich Schütze Course Description: Basic and advanced techniques for text-based information systems: efficient text indexing; Boolean, vector space, and probabilistic retrieval models; evaluation and interface issues; Web search including crawling, link-based algorithms, and Web metadata ; text/Web clustering , classification , wrapper , information extraction , and collaborative filtering systems; text mining . Projects can be chosen from diverse topics in information retrieval.
  • 60. Computational Linguistics, 29 , Issue 3, September 2003 .
  • 61. Research at Web Knowledge Discovery Lab
  • 62. Research at Web Knowledge Discovery Lab Live series LiveTrans SIGIR’04, ACL’04, JCDL’04 ACM Trans. on Information Systems, 2004 Online translation of unknown queries via the Web LiveClassifier WWW’04, IJCNLP’04 ACM Trans. on ALIP, 2004 Training classifiers and classifying short text via the Web
  • 63. Research at Web Knowledge Discovery Lab LiveCluster CIKM’04 ACM Trans. on Information Systems, 2004 Generating taxonomy from terms or documents
  • 65. LiveClassifier : Classifying search results into user-defined classification tree
  • 66. LiveClassifier : Paper Title Categorization Note: no labeled training data
  • 67. LiveCluster : Taxonomy Generation
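  • 69. Query Clustering [dendrogram of query terms, cut into five clusters] Cluster: 勞委會 (Council of Labor Affairs), 職訓局 (vocational training bureau), 就業 (employment), 青輔會 (National Youth Commission), 自傳 (autobiography), 徵才 (recruiting), 人力資源 (human resources), 104 人力銀行 (104 Job Bank), 人力銀行 (job bank), 找工作 (job hunting), 履歷表 (résumé), 求職 (job seeking), 求才 (talent seeking). Cluster: 占卜 (divination), 塔羅牌 (tarot cards), 算命 (fortune telling), 紫微斗數 (Zi Wei Dou Shu astrology), 命理 (fate analysis), 姓名學 (name divination), 心理測驗 (psychological test), 星座 (zodiac signs), 愛情 (love). Cluster: eva, 長榮航空 (EVA airline), 長榮 (EVA), 航空公司 (airline), 航空 (airway), 華航 (China airline), 中華航空 (China airline). Cluster: 補帖, 大補帖 (pirated-software collections), 泡麵 (instant noodles), dbt. Cluster: 武俠 (martial arts), 金庸 (Jin Yong), 武俠小說 (martial-arts novels), 黃易 (Huang Yi), 作家 (writer)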
  • 71. Outline Translating Unknown Queries (SIGIR’04) Training Text Classifiers (WWW’04) Generating Taxonomy/Topic Hierarchies (TOIS’04)
  • 72. Translating Unknown Queries Anchor Text Mining Probabilistic Modeling (ACM TALIP’02) Transitive Translation (ACM TOIS’04) Search-Result Page Mining Translation Extraction & Selection (JCDL’04) CLIR & Other Applications (SIGIR’04, ACL’04) Note: First work dealing with online translation
  • 73. Introduction (cont.) Bottleneck of CLIR service Real queries are often short Many terms are out-of-dictionary and might have local variations Ex: proper nouns, new terminologies, … Need for a powerful query translation engine Up-to-date dictionary (English terminology: Chinese translation) Bill Gates: 比爾蓋茲; Clinton: 柯林頓 / 克林頓; SARS: 嚴重急性呼吸道症候群 / 非典 / 沙士; Digital library: 數位圖書館 / 數字圖書館; Banff: 班夫 / 班芙; Ishikawa: 石川県; NII Japan: 国立情報学研究所; louvre museum: 羅浮宮
  • 74. Web Mining of Query Translations Different problems for different resources Source Term Target Translations Term Translation Web Mining Anchor-Text Mining Search-Result Mining OOD Yahoo <-> 雅虎
  • 75. Anchor Text (Yahoo <-> 雅虎 ) Applies to most languages Translation candidates are likely to appear in the same anchor-text-set
  • 76. Search Result Page (National Palace Museum vs. 故宮博物院 ) Mixed-language characteristic in Chinese pages
  • 77. Problems Term extraction Translation selection & noise reduction Language pairs with limited corpora Processing speed Data cleanness (language identification) Language independence
  • 79. Term Selection: Probabilistic Inference Model Integrating anchor texts and link structures into a probabilistic inference model Based on co-occurrence & page authority [diagram labels: Page Authority, Co-occurrence, Page Rank]
  • 80. Observation of Anchor Text Source Term(Ts) Translation(Tt) 雅虎 => Yahoo
  • 81. Observation of Anchor Text [diagram] Source query: Yahoo. Anchor texts containing “Yahoo” (“- in USA”, “Taiwan -”) link to www.yahoo.com and www.yahoo.com.tw
  • 82. Observation of Anchor Text [diagram] Anchor-text set: the anchor texts linking to www.yahoo.com and www.yahoo.com.tw also contain 雅虎, 台灣 (Taiwan) and 搜尋引擎 (search engine). Translation candidates for Yahoo: co-occurring terms such as 雅虎
  • 83. Observation of Anchor Text [diagram] Candidates are scored by co-occurrence and page authority; the linking pages shown have #in-link = 187 and #in-link = 21
  • 84. Search Result Mining Term Extraction Source Query Target Translations Web Pages Search Engine PAT-tree based term extraction method [Chien, SIGIR ‘97] Term Selection
  • 85. Term Selection How to decide the ranking? Co-occurrence: S and T_i frequently co-occur in the same pages (not necessarily true for synonyms and antonyms) Context: the result pages of S and T_i contain similar co-occurring context terms, which are used as feature vectors [diagram: query S linked to candidates T_1, T_2, ..., T_n]
  • 86. Chi-Square Test Chi-Square Test: a statistical method for co-occurrence analysis [Gale & Church ‘91] a : # of pages containing both terms s and t b : # of pages containing term s but not t c : # of pages containing term t but not s d : # of pages containing neither term s nor t N : the total number of pages, i.e., N = a + b + c + d
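The formula itself did not survive on the slide. As a hedged sketch, the standard chi-square statistic for a 2x2 contingency table, N(ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d)), can be computed from the four page counts defined above. Treating this as the scoring function is an assumption, and the counts in the example are made up.

```python
def chi_square(a, b, c, d):
    """Chi-square association score for terms s and t.

    a: pages with both s and t, b: pages with s only,
    c: pages with t only, d: pages with neither term.
    """
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no association evidence
    return n * (a * d - b * c) ** 2 / denom

# Hypothetical page counts of the kind a search engine could report.
print(chi_square(a=120, b=30, c=40, d=9810))
```

In practice a, b and c can be obtained from three queries (s AND t, s, t), and d follows from an estimate of the total number of pages N.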
  • 87. Context Vector Analysis Context Vector Analysis: co-occurring context terms as feature vectors Similarity measure: cosine measure
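A minimal sketch of the cosine measure over sparse context vectors follows. Each vector maps a context term to a weight; how the weights are derived (raw counts, tf*idf, or something else) is left open here, and the example vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical context vectors for a source term and a candidate translation.
source_context = {"museum": 3, "taipei": 2, "art": 1}
candidate_context = {"museum": 2, "art": 2, "ticket": 1}
print(cosine(source_context, candidate_context))
```

Unlike the chi-square score, this comparison does not require the two terms to appear on the same pages, only in similar contexts.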
  • 88. Indirect Association Problem Fig. 4. An illustration showing the indirect association problem, in which the dashed arrows indicate possible indirect association errors. [figure nodes: source side 思科 (Cisco), 系統 (system); target side Cisco, system]
  • 89. Competitive Linking Algorithm Fig. 6. An illustration showing a bipartite graph generated by using Algorithm 2. [figure nodes: 思科 (Cisco), 系統 (system), 資訊 (information), 網路 (network), 電腦 (computer); Cisco, system]
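Algorithm 2 is not reproduced on the slide. The sketch below shows only the general idea behind competitive linking as it is usually described: link the highest-scoring pair first, then remove both terms from further consideration, so that an indirect association cannot claim a target that a stronger direct association has already taken. The scores are invented for illustration.

```python
def competitive_linking(scores):
    """Greedy one-to-one linking.

    scores: dict mapping (source_term, target_term) -> association score.
    Returns the chosen links in the order they were made.
    """
    linked_sources, linked_targets, links = set(), set(), []
    for (s, t), _score in sorted(scores.items(), key=lambda item: -item[1]):
        if s not in linked_sources and t not in linked_targets:
            links.append((s, t))
            linked_sources.add(s)
            linked_targets.add(t)
    return links

# Hypothetical scores: 系統 (system) is strongly but only indirectly associated with "Cisco".
scores = {
    ("思科", "Cisco"): 0.90,
    ("系統", "Cisco"): 0.85,
    ("系統", "system"): 0.80,
    ("思科", "system"): 0.60,
}
print(competitive_linking(scores))
```

Because 思科 claims "Cisco" first, the tempting pair (系統, Cisco) is blocked and 系統 falls through to "system".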
  • 90. Combined Method To take advantage of both methods Anchor-text-based: higher precision Search-result-based: higher coverage R_m(s,t): the rank of candidate t for source term s under method m
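The combination formula is missing from the slide, so the following is only one plausible reading: a weighted sum of reciprocal ranks over the methods. The weights, method labels and example ranks are all assumptions made for illustration.

```python
def combined_score(ranks, weights):
    """Weighted reciprocal-rank combination (an assumed form, not taken from the slide).

    ranks: dict method -> rank of the candidate under that method (1 = best),
           or None when the method did not propose the candidate.
    weights: dict method -> weight of that method.
    """
    return sum(weights[m] / r for m, r in ranks.items() if r is not None)

weights = {"AT": 1.0, "CV": 1.0, "chi2": 1.0}  # hypothetical equal weights
candidates = {
    "Yahoo": {"AT": 1, "CV": 2, "chi2": 1},
    "portal": {"AT": None, "CV": 1, "chi2": 3},
}
ranking = sorted(candidates, key=lambda t: combined_score(candidates[t], weights), reverse=True)
print(ranking)
```

A candidate missing from the anchor-text sets still receives a score from the search-result methods, which is how a combination of this kind can raise coverage.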
  • 91. Experiments Performance on Query Translation Test Bed: real query terms from the Dreamer search engine log in Taiwan 228,566 unique terms, during a period of 3 months in 1998 Random-query test set : 50 query terms in Chinese, randomly selected from the top 20,000 queries in the log 40 of them were out-of-dictionary
  • 92. Random Query Test Set Many query terms didn’t appear in anchor-text sets (coverage) Table 2. Coverage and top 1~5 inclusion rates obtained with the four different methods for the random-query set.
      Method     Top-1   Top-3   Top-5   Coverage
      CV         40.0%   54.0%   54.0%   68%
      χ²         36.0%   50.0%   52.0%   68%
      AT         20.0%   32.0%   32.0%   32%
      Combined   44.0%   64.0%   66.0%   72%
  • 93. Other Experiments 430 popular Chinese queries, 67.4% top-1 inclusion rate Common terms: randomly selected 100 common nouns and 100 common verbs from general-purpose Chinese dictionary
  • 94. Transitive Translation Top-n inclusion rates obtained with different models. Model Top1 Top2 Top3 Top4 Top5 Direct Translation 35.7% 43.0% 46.9% 49.6% 51.2% Indirect Translation 44.2% 55.1% 58.0% 59.7% 60.5% Transitive Translation 49.2% 58.1% 60.9% 61.6% 62.0%
  • 96. Chinese-Japanese Translation Table VI. Top-n inclusion rates obtained with different models for translating traditional Chinese queries into Japanese.
      Model        Top1    Top2    Top3    Top4    Top5
      Direct       10.5%   12.8%   14.3%   15.1%   15.1%
      Indirect     40.2%   49.4%   56.6%   58.6%   59.6%
      Transitive   42.9%   51.4%   58.6%   61.3%   61.9%
    Table VII. Selected source query terms in traditional Chinese and their translations in English, simplified Chinese and Japanese respectively, which were extracted by our transitive translation model.
      Source term (Traditional Chinese)   English       Simplified Chinese   Japanese
      新力                                 Sony          索尼                  ソニー
      耐吉                                 Nike          耐克                  ナイキ
      史丹佛                               Stanford      斯坦福                スタンフォード
      雪梨                                 Sydney        悉尼                  シドニー
      網際網路                             internet      互联网                インターネット
      網路                                 network       网络                  ネットワーク
      首頁                                 homepage      主页                  ホームページ
      電腦                                 computer      计算机                コンピューター
      資料庫                               database      数据库                データベース
      資訊                                 information   信息                  インフォメーション
  • 97. Translation Lexicons with Regional Variations (a) Taiwan (b) Mainland China (c) Hong Kong Figure 1: Examples of search-result pages in different Chinese regions that were obtained via the English query words “George Bush” from Google.
  • 98. Summary A work dealing with live translation of unknown queries Anchor-text-based High precision for high-frequency terms Effective for proper nouns in multiple languages Not applicable if the anchor-text set is too small Search-result-based Exploits rich Web resources High coverage for the English-Chinese language pair
  • 99. LiveCluster: Generating Taxonomy from terms or documents
  • 102. The Steps Feature Extraction Use co-occurred seed terms extracted from retrieved top pages Term Vector Each query term is assigned a term vector Record the co-occurred feature terms and their frequency values in the retrieved documents. Term Similarity tf*idf-based Cosine measurement Hierarchical Term Clustering Cluster popular query terms in the log into initial categories Query terms with similar features are grouped into clusters.
  • 103. Feature Extraction Use co-occurred seed terms extracted from retrieved top pages Example: search-result snippets retrieved for the query term “nude”: (1) Creative Nude Photography Network -- Fine Art Nude and ... ... The Creative Nude and Erotic Photography Network is the number one net portal to the best in fine art nude and erotic photography ! Over 100 CNPN Member Sites ... (2) Nude Places ... to be naked . Walking in the forest, cruising the lake in open boats, swimming, picnicking and nude photography are all enjoyed in the nude . 60 minutes $39.95. ... (3) A Brave Nude World ... A Brave Nude World! Warning: This site contains links to fine art nude & erotic photography . If you are under 18 or do not wish to view this material, You can ... Co-occurred feature terms (term: tf/df): photography: 2/2; art: 3/2; …; naked: 1/1; erotic photography: 3/2
  • 105. Extraction of Basic Feature Terms Performance of different features: randomly selected, hi-frequency, and seed terms Popular queries not affected by ephemeral trends, e.g., “movie”, “basketball”, “mutual fund”, etc. More expressive and distinguishable in describing a particular category Two logs were compared, and the 9,709 top query terms that appear in both were extracted as feature terms
  • 106. Task I: Query Clustering (Cont.) Feature Extraction Use co-occurred seed terms extracted from retrieved top pages Term Vector Each query term is assigned a term vector Record the co-occurred feature terms and their frequency values in the retrieved documents. Term Similarity TF *IDF-based Cosine measurement Hierarchical Term Clustering Cluster popular query terms in the log into initial categories Query terms with similar features are grouped into clusters.
  • 108. Hierarchical Term Clustering Agglomerative hierarchical clustering (AHC) 1. Compute the similarity between all pairs of clusters: estimate the similarity between all pairs of composed terms and use the lowest term similarity value as the cluster similarity value (complete linkage method) 2. Merge the most similar (closest) two clusters 3. Update the cluster vector of the new cluster 4. Repeat steps 2 and 3 until only a single cluster remains
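The clustering loop can be sketched in a few lines. This is a naive illustration under stated assumptions, not the LiveCluster implementation: it recomputes complete-linkage similarities from the individual term vectors on every pass instead of maintaining a cluster vector, uses cosine similarity on raw weights, and the term vectors (with English stand-ins for query terms) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse {feature: weight} vectors."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def complete_link(c1, c2, vectors):
    """Cluster similarity = lowest similarity over all cross-cluster term pairs."""
    return min(cosine(vectors[x], vectors[y]) for x in c1 for y in c2)

def ahc(vectors):
    """Return the merge history as (cluster_a, cluster_b, similarity) tuples."""
    clusters = [(term,) for term in vectors]  # start with one cluster per term
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = complete_link(clusters[i], clusters[j], vectors)
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        sim, i, j = best
        merges.append((clusters[i], clusters[j], sim))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

vectors = {
    "job": {"resume": 2, "hire": 1},
    "career": {"resume": 1, "hire": 1},
    "tarot": {"fortune": 2},
    "horoscope": {"fortune": 1, "love": 1},
}
for a, b, sim in ahc(vectors):
    print(a, b, round(sim, 3))
```

The merge history is the dendrogram; cutting it at a similarity threshold yields flat clusters. The loop is roughly cubic in the number of terms, which is acceptable only for small examples.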
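  • 110. Clustering Results [dendrogram of query terms, cut into five clusters] Cluster: 勞委會 (Council of Labor Affairs), 職訓局 (vocational training bureau), 就業 (employment), 青輔會 (National Youth Commission), 自傳 (autobiography), 徵才 (recruiting), 人力資源 (human resources), 104 人力銀行 (104 Job Bank), 人力銀行 (job bank), 找工作 (job hunting), 履歷表 (résumé), 求職 (job seeking), 求才 (talent seeking). Cluster: 占卜 (divination), 塔羅牌 (tarot cards), 算命 (fortune telling), 紫微斗數 (Zi Wei Dou Shu astrology), 命理 (fate analysis), 姓名學 (name divination), 心理測驗 (psychological test), 星座 (zodiac signs), 愛情 (love). Cluster: eva, 長榮航空 (EVA airline), 長榮 (EVA), 航空公司 (airline), 航空 (airway), 華航 (China airline), 中華航空 (China airline). Cluster: 補帖, 大補帖 (pirated-software collections), 泡麵 (instant noodles), dbt. Cluster: 武俠 (martial arts), 金庸 (Jin Yong), 武俠小說 (martial-arts novels), 黃易 (Huang Yi), 作家 (writer)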
  • 113. Quality Function (Cont.)
  • 114. Quality Function (Cont.)
  • 115. Preliminary Experiment Test queries Two sets: top 1k queries and random 1k queries Each test query has been manually assigned to its corresponding classes Evaluation metrics F-Measure
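The slide does not say which F-measure variant was used. One common choice for evaluating clusters against manually assigned classes, shown here purely as an assumption, scores each class by its best-matching cluster and averages the scores weighted by class size. The query IDs and class names are hypothetical.

```python
def f_measure(class_members, cluster_members):
    """Harmonic mean of precision and recall of one cluster with respect to one class."""
    overlap = len(class_members & cluster_members)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster_members)
    recall = overlap / len(class_members)
    return 2 * precision * recall / (precision + recall)

def overall_f(classes, clusters):
    """Class-size-weighted average of each class's best F over all clusters."""
    total = sum(len(members) for members in classes.values())
    return sum(
        len(members) / total * max(f_measure(members, c) for c in clusters)
        for members in classes.values()
    )

classes = {"jobs": {"q1", "q2", "q3"}, "airline": {"q4", "q5"}}  # manual labels
clusters = [{"q1", "q2"}, {"q3", "q4", "q5"}]                     # system output
print(overall_f(classes, clusters))
```

Here one job query ends up in the airline cluster, which lowers recall for one class and precision for the other, giving an overall score of about 0.8.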
  • 119. Results of Hierarchical Structure Generation