A real-time Web Analytics System

                           Mahesh Patwardhan
             Digital and New Media Consultant
Contents
1.   Introduction
2.   The Requirements
3.   The Architecture
4.   The Reports
5.   The Implementation
6.   Conclusion
Introduction
   This document describes the implementation of a real-time
    web log capture and reporting system.

   The system was developed to report traffic parameters such as
    page views, visits, and unique visitors in real time.

   It was designed and built to replace a batch-process system
    that generated reports in a deferred mode.

   It was also built to allow real-time monitoring of, and action on,
    the various online services.
Requirements
        ◦ Shortcomings of the existing system
           It generated reports from the previous day’s logs, not in real time.
           It could not be scaled up.
           It was not equipped to handle heavy traffic.
           It had no scope for adding new services.
           It had no scope for adding or editing logs.

        ◦ The new system was required to provide
           Real-time web log capture from web servers at geographically dispersed locations
           A robust web log data warehouse
           Extensive real-time reports from the web logs

        ◦ The advantages of the new system:
           Data can be accessed in real time
           The process can be scaled up to handle more traffic
           A new service can be added or an existing one deleted; the change is available
            from the very next day
           Logs can be added and modified

…Requirements

◦ The system was required to capture, collate, and aggregate the web logs
  that accumulate on the web-app servers.

◦ The aggregates needed to be produced in near-real time.

◦ A multi-layer architecture needed to be deployed, consisting of
   a layer of capture agents, one on every web-app server
   a layer of collation server applications that collate data from the capture agents
   a layer of computation servers that aggregate data at high speed

◦ This multi-layer architecture would aggregate data into industry-standard
  RDBMS tables, which could then be queried and viewed through user
  interface screens.

◦ The aggregate tables were to be updated in near-real time.
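
To make the idea of near-real-time aggregate tables concrete, here is a minimal sketch in Python using the standard sqlite3 module. The slides do not name the RDBMS or the schema, so the table name, columns, minute-level bucket and helper functions below are all assumptions chosen for illustration.

```python
import sqlite3

# Hypothetical aggregate table: one row per (service, minute bucket, page).
SCHEMA = """
CREATE TABLE IF NOT EXISTS pageviews_by_minute (
    service   TEXT    NOT NULL,
    minute    TEXT    NOT NULL,
    page      TEXT    NOT NULL,
    pageviews INTEGER NOT NULL,
    PRIMARY KEY (service, minute, page)
)
"""


def open_store(path=":memory:"):
    """Open the database and create the aggregate table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_pageviews(conn, service, minute, page, count):
    """Fold a freshly computed count into the matching aggregate row."""
    with conn:  # one transaction per update
        cur = conn.execute(
            "UPDATE pageviews_by_minute SET pageviews = pageviews + ? "
            "WHERE service = ? AND minute = ? AND page = ?",
            (count, service, minute, page),
        )
        if cur.rowcount == 0:  # no row yet for this bucket: create it
            conn.execute(
                "INSERT INTO pageviews_by_minute "
                "(service, minute, page, pageviews) VALUES (?, ?, ?, ?)",
                (service, minute, page, count),
            )


def pageviews_for_service(conn, service):
    """Query a reporting screen might run: page views per minute."""
    rows = conn.execute(
        "SELECT minute, SUM(pageviews) FROM pageviews_by_minute "
        "WHERE service = ? GROUP BY minute ORDER BY minute",
        (service,),
    )
    return rows.fetchall()
```

Because each update only increments an existing row, the reporting query always reads a small, pre-aggregated table rather than raw logs, which is what lets the reports stay close to real time.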
Architecture
…Architecture
    ◦ The architecture has four layers, backed by a database server
         Collation clients (L1)
         Collation servers (L2)
         Computation servers (L3)
         Reporting server (L4)
         A database server that stores the aggregated results

    ◦ By design, the architecture is fully scalable in the
      first three layers (L1, L2, L3).

    ◦ All the layers communicate with each other over TCP/IP.
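
As an illustration of two layers talking over TCP/IP, the sketch below shows a capture agent (L1) pushing log lines to a collation server (L2). The slides do not describe the wire format, so the newline-delimited framing, the class names and the send_log_lines helper are assumptions, not the system's actual protocol.

```python
import socket
import socketserver
import threading


class CollationHandler(socketserver.StreamRequestHandler):
    """Collation-server side: read newline-delimited log lines from one agent."""

    def handle(self):
        for raw in self.rfile:  # ends when the agent closes the connection
            line = raw.decode("utf-8").rstrip("\n")
            with self.server.lock:
                self.server.received.append(line)


class CollationServer(socketserver.ThreadingTCPServer):
    """Hypothetical L2 server that simply collects lines in memory."""

    allow_reuse_address = True
    daemon_threads = True

    def __init__(self, address):
        super().__init__(address, CollationHandler)
        self.received = []
        self.lock = threading.Lock()


def send_log_lines(address, lines):
    """Capture-agent side: open a TCP connection and ship a batch of lines."""
    with socket.create_connection(address) as sock:
        payload = "".join(line + "\n" for line in lines)
        sock.sendall(payload.encode("utf-8"))
```

A real agent would keep the connection open and stream lines as they are written; the batch call here only keeps the example short.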


…Architecture
   Each collation client in L1 connects to one collation server in L2.
    ◦ A maximum of 30 collation clients can connect to one collation server.
    ◦ Primary/back-up fail-over is provided: if a collation server fails,
      the clients connected to it automatically shift to other servers
      in the cluster.

   Computation is distributed across the computation servers (L3) by
    service.
    ◦ The computation required for a service is handled by that service's
      computation server.
    ◦ Primary/back-up fail-over is not available in this layer.
    ◦ If required, the computation for a single service can be spread over
      several servers (for example, two servers can perform the
      computations for a service like e-mail).

   The computed (aggregated) information is stored in a database, which
    is read by the L4 (Reporting) layer.
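
The connection rules above can be sketched in a few lines of Python. Only the limit of 30 clients per collation server and the route-by-service rule come from the slides; the least-loaded placement policy, the CRC-based split of one service across two servers, and every name in the code are assumptions made for illustration.

```python
import zlib

MAX_CLIENTS_PER_SERVER = 30  # limit stated in the slides


class CollationCluster:
    """Tracks which collation clients (L1) are attached to which server (L2)."""

    def __init__(self, server_names):
        # live server name -> set of attached client ids
        self.clients = {name: set() for name in server_names}

    def connect(self, client_id):
        """Attach a client to the least-loaded live server with spare capacity."""
        candidates = [
            name for name, attached in self.clients.items()
            if len(attached) < MAX_CLIENTS_PER_SERVER
        ]
        if not candidates:
            raise RuntimeError("no collation server has spare capacity")
        target = min(candidates, key=lambda name: (len(self.clients[name]), name))
        self.clients[target].add(client_id)
        return target

    def fail(self, server_name):
        """Drop a failed server and move its clients to the survivors."""
        orphans = self.clients.pop(server_name)
        return {client_id: self.connect(client_id) for client_id in sorted(orphans)}


def computation_server_for(service, key, routing):
    """Pick the L3 server for a record: by service, then by a stable hash of key.

    routing maps a service name to its list of computation servers; key is
    whatever the records are partitioned on (a visitor id is assumed here).
    """
    servers = routing[service]
    return servers[zlib.crc32(key.encode("utf-8")) % len(servers)]
```

With a routing table such as {"e-mail": ["comp-1", "comp-2"], "news": ["comp-3"]}, all news records go to comp-3, while e-mail records are split between two servers. Note that fail() raises if the surviving servers run out of capacity, so a real deployment would need headroom below the 30-client limit.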
Reports

◦   Hits by time
◦   Page Views by time, by pages
◦   Visits by time, by page
◦   Unique visitors by time, by page
◦   Return frequency
◦   Return visits
◦   Visiting frequency by visitor
◦   Average time spent
◦   Average time spent by page
◦   Referrer by domains, URL
…Reports

◦   Search engines
◦   Search engine keywords
◦   Keywords by search engine
◦   Browser type, version, OS
◦   Parameter analysis
◦   Country-, state- and city-wise reports
◦   Top pages by country
◦   By ISP
◦   Top entry pages
◦   Top exit pages
◦   Path reporting (across service)
◦   Directory filter based reporting
◦   Fall-out reports
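
The first few reports (page views, visits and unique visitors by time) can be derived from the same stream of requests. The sketch below is one possible computation, not the system's own: the slides do not define a visit, so the 30-minute inactivity timeout, the hourly bucket and the (timestamp, visitor_id, page) record shape are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed visit boundary, not from the slides


def summarize(requests):
    """Return page views, visits and unique visitors per hour.

    requests is an iterable of (timestamp, visitor_id, page) tuples.
    A visit starts when a visitor is seen for the first time or returns
    after more than SESSION_TIMEOUT of inactivity.
    """
    pageviews = defaultdict(int)
    visits = defaultdict(int)
    visitors = defaultdict(set)
    last_seen = {}  # visitor_id -> timestamp of that visitor's previous request

    for ts, visitor, page in sorted(requests):  # process in time order
        hour = ts.replace(minute=0, second=0, microsecond=0)
        pageviews[hour] += 1
        visitors[hour].add(visitor)
        previous = last_seen.get(visitor)
        if previous is None or ts - previous > SESSION_TIMEOUT:
            visits[hour] += 1  # counted in the hour the visit begins
        last_seen[visitor] = ts

    return {
        hour: {
            "pageviews": pageviews[hour],
            "visits": visits[hour],
            "unique_visitors": len(visitors[hour]),
        }
        for hour in pageviews
    }
```

The three measures differ even on a tiny input: one visitor who loads two pages, leaves for 45 minutes and loads a third contributes three page views, two visits and one unique visitor.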
Implementation
   The solution was implemented incrementally. Deliverables
    were planned for each increment based on the specified
    requirements. There were five development cycles,
    detailed below.

   Incremental cycle 1
    ◦   Setting up the framework for real-time log capture
    ◦   Health monitoring system
    ◦   Hits by time
    ◦   Page Views by time, by pages
…Implementation
   Incremental cycle 2
    ◦   Visits by time, by page
    ◦   Unique visitors by time, by page
    ◦   Return frequency
    ◦   Return visits
    ◦   Visiting frequency by visitor
    ◦   Average time spent
    ◦   Average time spent by page
   Incremental cycle 3
    ◦   Referrer by domains, URL
    ◦   Search engines
    ◦   Search engine keywords
    ◦   Keywords by search engine
    ◦   Browser type, version, OS
    ◦   Parameter analysis
…Implementation

◦ Incremental cycle 4
     Country-, state- and city-wise reports
     Top pages by country
     By ISP
     Top entry pages
     Top exit pages
     Path reporting (across service)

◦ Incremental cycle 5
   Directory filter based reporting
   Fall-out reports

◦ The deliverables in each cycle required elements of every layer to be
  developed, implemented, tested and deployed. For instance, a few
  database tables of the final aggregate schema had to be designed in
  the first cycle itself, along with the corresponding reports.
Conclusion

◦ This document described the implementation of a real-time web log capture and
  reporting system.

◦ The system was developed to report traffic parameters such as page views,
  visits, and unique visitors in real time.

◦ It was designed and built to replace a batch-process system that
  generated reports in a deferred mode and did not allow real-time
  monitoring of, and action on, the various online services.

◦ The architecture consists of four layers: the Collation client
  agents, the Collation layer, the Computation layer and the Reporting layer.

◦ The new system overcomes the shortcomings of the existing system, which was
  not scalable and provided reports only in a deferred mode, through a highly
  scalable architecture that provides reports in real time.
