Intelligent Crawling and Indexing
using Lucene


                 By
           Shiva Thatipelli
      Mohammad Zubair (Advisor)

      Contents
    Searching
   Indexing
   Lucene
   Indexing with Lucene
   Indexing Static and Dynamic Pages
   Extracting and Indexing Dynamic Pages
   Implementation
   Screens
Searching
   Looking up words in an index
   Factors Affecting Search
   Precision – How well the system can
    filter
   Speed
   Single and multiple phrase queries, results
    ranking, sorting, wildcard queries, and
    range query support
Indexing
   Sequential search does not scale
   An index speeds up selection
   An index is a special data structure that
    allows rapid searching.
   Different Index Implementations
        - B Trees
        - Hash Map
Search Process
   [Diagram: Docs pass through the Indexing API into the Index;
    a Query against the Index returns Hits, which refer to Docs]
Lucene

   High-performance, full-featured text
    search engine library
   Written 100% in pure Java
   Easy to use yet powerful API
   An Apache Jakarta project with strong open
    source community support.
Why Lucene?
   Open source (Not proprietary)
   Easy to use, good documentation
   Interoperable - e.g., an index generated by Java
    can be used by VB, ASP, or Perl applications
   Powerful And Highly Scalable
   Index Format
       Designed for interoperability
       Well Documented
       Resides on File System, RAM, custom store
Why Lucene? Contd…
   Algorithms
       Efficient, fast and optimized
   Incremental Indexing
   Boolean Query, Fuzzy Query, Range Query,
    Multi Phrase Query, Wildcard Query etc.
   Content Tagging - Documents as a collection
    of terms
   Heterogeneous documents - Useful when a
    different set of metadata is present for different
    MIME types
Indexing With Lucene
   What type of documents can be
    indexed?
       Any document from which text can be
        fetched and extracted over the net with a
        URL
   Uses Inverted Index
     - The index stores statistics about
    terms in order to make term-based
    search more efficient.
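The idea of an inverted index can be sketched in a few lines of plain Java. This is only a toy illustration under simple assumptions (lowercased terms split on non-alphanumeric characters, everything held in memory); it is not Lucene's actual index format, and the class and method names are made up for this example.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy inverted index for illustration; this is not Lucene's index format.
class InvertedIndexSketch {
    // term -> (document id -> occurrences of the term in that document)
    private final Map<String, Map<Integer, Integer>> postings = new TreeMap<>();

    // Split the text into lowercase terms and record each occurrence.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (term.isEmpty()) {
                continue; // split() yields an empty first token when text starts with a separator
            }
            postings.computeIfAbsent(term, t -> new TreeMap<>()).merge(docId, 1, Integer::sum);
        }
    }

    // Ids of the documents that contain the term: a map lookup, not a scan of every document.
    Set<Integer> search(String term) {
        Map<Integer, Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<Integer>emptySet() : hits.keySet();
    }

    // Statistic: number of documents that contain the term.
    int docFrequency(String term) {
        return search(term).size();
    }

    // Statistic: number of times the term occurs in one document.
    int termFrequency(String term, int docId) {
        Map<Integer, Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? 0 : hits.getOrDefault(docId, 0);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Lucene is a text search library");
        index.add(2, "An inverted index makes search fast");
        index.add(3, "Lucene uses an inverted index");
        System.out.println(index.search("lucene"));            // prints [1, 3]
        System.out.println(index.docFrequency("index"));       // prints 2
        System.out.println(index.termFrequency("search", 2));  // prints 1
    }
}
```

Per-term statistics such as document frequency and term frequency are the kind of information a search library can use for ranking results.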
Indexing With Lucene Contd…
   [Diagram: text is extracted from HTML, XLS, WORD and PDF
    documents, each through its own Parser; the extracted text
    goes to the Analyzer and then into the Index]
Indexing Static and Dynamic
Pages
   Static pages are HTML, XLS, Word, and PDF documents
    on the web that search engines like Google and Yahoo
    can easily crawl and index.
   Static pages on the internet can be passed to Lucene,
    indexed, and searched using their direct URLs.
   Dynamic pages are generated from submitted parameters,
    such as search results pages. These database hidden
    pages cannot be indexed using direct URLs.
   To index dynamic pages, we need the parameters that
    users submitted to generate those pages.
Extracting and Indexing Dynamic
Pages
   Extracting dynamic web pages, also called database
    hidden pages, requires some input from which to
    generate the URLs.
   To get the input parameters, we used Apache access
    logs, which record each user request as a URL.
   A sample entry in the Apache access log:
    127.0.0.1 - - [31/Aug/2005:18:44:03 -0400] "GET
    /archon/servlet/search?formname=simple&fulltext=maly&group=subject&sort=title
    HTTP/1.1" 200 9560
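A log entry of this shape can be split into its fields with a regular expression. The sketch below assumes the entry follows the Apache Common Log Format, as the sample does; the class name, the field order of the returned array, and the choice to return null for non-matching lines are assumptions made for this example, not taken from the original IndexByLog.java.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits one Common Log Format entry into its fields.
class AccessLogLineSketch {
    // host ident user [timestamp] "method url protocol" status bytes
    private static final Pattern COMMON_LOG = Pattern.compile(
            "(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\d+|-)");

    // Returns {host, timestamp, method, requestUrl, protocol, status, bytes},
    // or null when the line does not look like a Common Log Format entry.
    static String[] parse(String line) {
        Matcher m = COMMON_LOG.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        return new String[] {
            m.group(1), m.group(4), m.group(5), m.group(6), m.group(7), m.group(8), m.group(9)
        };
    }

    public static void main(String[] args) {
        // The sample entry from the slide, joined onto one line.
        String line = "127.0.0.1 - - [31/Aug/2005:18:44:03 -0400] "
                + "\"GET /archon/servlet/search?formname=simple&fulltext=maly"
                + "&group=subject&sort=title HTTP/1.1\" 200 9560";
        String[] fields = parse(line);
        System.out.println("host       = " + fields[0]);
        System.out.println("timestamp  = " + fields[1]);
        System.out.println("method     = " + fields[2]);
        System.out.println("requestUrl = " + fields[3]);
        System.out.println("status     = " + fields[5]);
    }
}
```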
Extracting and Indexing Dynamic
Pages Contd...
   The entry records the IP address of the requesting computer, the date
    and time of the request, the HTTP method, the request URL, the HTTP
    version, and the HTTP status code.
   The request URL carries all the input parameters, in this case
    formname=simple, fulltext=maly, group=subject, and sort=title.
   The results page is dynamic and depends on the parameters
    passed.
   A full URL like
    http://archon.cs.odu.edu:8066/archon/servlet/searc
    can be generated from the request URL by prepending the website
    address.
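The two steps described here, reading the parameters out of the request URL and joining the request URL to the website address, can be sketched as follows. The class and method names are hypothetical, the site address is the host and port visible in the slide's example URL, and UTF-8 decoding of parameter values is an assumption.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns a logged request URL into a fetchable full URL.
class DynamicUrlSketch {
    // Join the website address and the request URL with exactly one slash between them.
    static String fullUrl(String siteAddress, String requestUrl) {
        String base = siteAddress.endsWith("/")
                ? siteAddress.substring(0, siteAddress.length() - 1)
                : siteAddress;
        String path = requestUrl.startsWith("/") ? requestUrl : "/" + requestUrl;
        return base + path;
    }

    // Parameters after the '?', in the order they appear in the request URL.
    static Map<String, String> queryParameters(String requestUrl) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = requestUrl.indexOf('?');
        if (q < 0) {
            return params; // no query string at all
        }
        for (String pair : requestUrl.substring(q + 1).split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(name), decode(value));
        }
        return params;
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8"); // assumption: parameters are UTF-8 encoded
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String requestUrl =
                "/archon/servlet/search?formname=simple&fulltext=maly&group=subject&sort=title";
        System.out.println(queryParameters(requestUrl));
        // prints {formname=simple, fulltext=maly, group=subject, sort=title}
        System.out.println(fullUrl("http://archon.cs.odu.edu:8066", requestUrl));
    }
}
```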
Indexing Dynamic Pages…
   [Flow chart: Apache Logs -> Parse and generate URL ->
    Results page (could be any file type) -> Analyzer -> Index]
Implementation
   The above flow chart describes how
    Apache logs are parsed and URLs are
    generated
   It shows how the results pages are
    fetched from the URLs and their text
    extracted
   The results page is sent for analysis,
    and Lucene then generates the index
    used for future searches.
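The flow can be sketched end to end without Lucene or a network connection. Everything here is an assumption for illustration: the class and method names are hypothetical, only successful (status 200) GET requests are kept, tag stripping stands in for a real parser, a simple term-to-URL map stands in for the Lucene index, and the page fetcher is passed in as a function so a stub can replace real HTTP access.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the flow: logs -> URLs -> results pages -> analysis -> index.
class LogIndexPipelineSketch {
    // Assumption: only GET requests answered with status 200 are worth re-fetching.
    private static final Pattern REQUEST = Pattern.compile("\"GET (\\S+) HTTP/[0-9.]+\" 200 ");

    // Step 1: parse the log and generate distinct full URLs.
    static Set<String> urlsFromLog(List<String> logLines, String siteAddress) {
        Set<String> urls = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher m = REQUEST.matcher(line);
            if (m.find()) {
                urls.add(siteAddress + m.group(1));
            }
        }
        return urls;
    }

    // Step 2: crude text extraction; a real parser would depend on the file type.
    static String extractText(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Step 3: analyze the text into terms and index them (term -> URLs containing it).
    static Map<String, Set<String>> buildIndex(Set<String> urls, Function<String, String> fetcher) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (String url : urls) {
            String text = extractText(fetcher.apply(url));
            for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new LinkedHashSet<>()).add(url);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // First entry is shortened from the slide's sample; the 404 entry is made up.
        List<String> log = Arrays.asList(
                "127.0.0.1 - - [31/Aug/2005:18:44:03 -0400] "
                        + "\"GET /archon/servlet/search?formname=simple&fulltext=maly HTTP/1.1\" 200 9560",
                "127.0.0.1 - - [31/Aug/2005:18:45:10 -0400] \"GET /missing HTTP/1.1\" 404 120");
        Set<String> urls = urlsFromLog(log, "http://localhost");
        // Stub fetcher returning a made-up results page instead of doing HTTP.
        Function<String, String> fetcher = url -> "<html><h1>Results</h1><p>Maly papers</p></html>";
        System.out.println(buildIndex(urls, fetcher).keySet()); // prints [maly, papers, results]
    }
}
```

Passing the fetcher in as a function keeps the sketch testable; in a real run it would be replaced by code that downloads the page at each URL, and the map by calls to the Lucene indexing API.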
Demo
   Results:
   Hardware Environment
       Dedicated machine for indexing: No, but nominal usage at
        time of indexing
       CPU: Intel x86 P4 2.8 GHz
       RAM: 512 DDR
       Drive configuration: IDE 7200 rpm
   Software Environment
       Lucene version: 1.4
       Java version: 1..2
       OS version: Windows 2000
       Apache web server version: 1.3 to 2.0
       Location of index: local
Create Index
IndexByLog.java reads the access logs on the local computer, generates
the URLs, fetches the results pages from those URLs, extracts their text,
indexes it, and stores the index in the LuceneIndex folder.
Files extraction and Index
Creation
Searching at the prompt
Searching on the web
Results on the web
Conclusion
   It is very easy to implement efficient and
    powerful search engines using Lucene
   Lucene can be used to index dynamic pages
    and database hidden pages
   Web server access logs can be used to
    generate URLs, and Java with the Lucene API
    can be used to fetch and index database
    hidden pages.
   There are some security risks, since the index
    can reveal which users ran which searches,
    along with other sensitive information.
Questions?
