©2016 AKAMAI | FASTER FORWARD™
Tracking Performance of the Web with HTTP Archive
6/13/2018
Paul Calvano
Senior Web Performance Architect
@paulcalvano
pacalvan@akamai.com
2
Paul Calvano
@paulcalvano
Akamai
About Me
● Web Performance Architect @ Akamai
● HTTP Archive / BigQuery Addict :)
● Working on #WebPerf since 2000
● https://paulcalvano.com
● @paulcalvano on Twitter
3
httparchive.org
4
https://www.igvita.com/2013/06/20/http-archive-bigquery-web-performance-answers/
5
How it Works
● Alexa’s top 500,000 websites
○ Home pages
○ Desktop and emulated mobile
○ Increasing to 1,000,000 soon!
● Powered by WebPageTest
○ Records HAR trace
○ Executes custom metrics
○ Records Lighthouse audits
● httparchive.org
○ Trends and stats
○ Discussion forum
● BigQuery and Cloud Storage
○ Queryable database
○ Raw HARs
6
HTTP Archive Pipeline
[Diagram: the HTTP Archive pipeline. Labels recovered from the figure: 230, 500K, GCS, BQ, Biweekly]
7
https://httparchive.org/reports/state-of-the-web
8
https://httparchive.org/reports/state-of-the-web#pctHttps
9
https://httparchive.org/reports/loading-speed
10
The tip of the databerg
discuss.httparchive.org
Curated stats/trends
Raw data
11
A Peek Inside the Databerg…
Dataset | Description | Size | Rows
summary_pages | Summary of all desktop and mobile pages | ~340 MB | Desktop: ~460K, Mobile: ~450K
summary_requests | Summary of all HTTP requests for desktop and mobile | ~45 GB | Desktop: ~48 million, Mobile: ~44 million
pages | JSON-encoded parent document HAR data | ~5 GB | Desktop: ~460K, Mobile: ~450K
requests | JSON-encoded subresource HAR data | ~290 GB | Desktop: ~48 million, Mobile: ~44 million
response_bodies | JSON-encoded response bodies for textual subresources | ~915 GB | Desktop: ~18 million, Mobile: ~14 million
lighthouse | JSON-encoded Lighthouse report (mobile only) | ~140 GB | Mobile: ~450K
* Row and size stats are based on the 5/15/18 run
12
Who uses the
HTTP Archive?
● Scholars
● Community
● Industry
13
HTTP Archive referenced in research papers
goo.gl/kxgzM1
● "... In this article we utilize the httparchive.org [9] publicly available dataset of captured web performance metrics ..."
○ Desktop and mobile web page comparison: characteristics, trends, and implications. IEEE Communications Magazine (Volume: 52, Issue: 9, September 2014)
● "... Recent stats from httparchive.org show that the top 300K URLs in the world need on average 38(!) TCP connections to display the site ..."
○ HTTP2 explained. Computer Communication Review 44.3 (2014): 120-128.
● "... We make extensive use of the [...] data available at HTTP Archive to expose the characteristics of 3rd Party assets embedded into the top 16,000 Alexa webpages ..."
○ Are 3rd Parties Slowing Down the Mobile Web? Proceedings of the Eighth Wireless of the Students, by the Students, and for the Students Workshop. ACM, 2016.
14
The web community uses HTTP Archive to answer questions about the state of the web
goo.gl/oJSIjm
goo.gl/PvVHhL
15
Remember when the average page size exceeded Doom?
goo.gl/YPdLJu
16
Industrial tools use HTTP Archive data for calibration
goo.gl/Lnp1ne
goo.gl/e8kXq9
17
HTTP Archive is used for emerging Internet Standards
bit.ly/2JohsOc
bit.ly/2JsxN0m
18
Case Study #1
Compression
1. Updating Akamai Gzip Defaults
2. Brotli Compression
19
Last Mile Acceleration (LMA)
● Akamai feature to gzip compress content at the CDN edge
○ Helps out when origins do not compress certain resources
● Compression is based on HTTP Content Type
● Old defaults were not sufficient and usually required updating…
● We updated the defaults a few years ago, using HTTP Archive data
20
Querying the HTTP Archive for Compression Stats
bit.ly/2y1fKNI
-- Count requests per MIME type and how many were served with gzip or deflate.
SELECT
  mimeType,
  COUNT(*) total,
  SUM(IF(resp_content_encoding = "gzip", 1, 0)) gzip,
  SUM(IF(resp_content_encoding = "deflate", 1, 0)) deflate,
  SUM(IF(resp_content_encoding IN ("gzip", "deflate"), 0, 1)) NoCompression,
  -- Fraction (0 to 1) of requests served with gzip or deflate
  ROUND(
    SUM(IF(resp_content_encoding IN ("gzip", "deflate"), 1, 0)) / COUNT(*),
    2
  ) CompressedPercentage
FROM `httparchive.summary_requests.2018_05_15_desktop`
GROUP BY mimeType
HAVING total > 1000  -- skip rarely seen MIME types
ORDER BY gzip DESC
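For readers who want to check the aggregation logic without a BigQuery project, here is a minimal Python sketch of what the query above computes. The sample rows and the function name `compression_stats` are hypothetical; only the column names `mimeType` and `resp_content_encoding` come from the query.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for summary_requests records.
SAMPLE_REQUESTS = [
    {"mimeType": "text/javascript", "resp_content_encoding": "gzip"},
    {"mimeType": "text/javascript", "resp_content_encoding": "gzip"},
    {"mimeType": "text/javascript", "resp_content_encoding": ""},
    {"mimeType": "application/json", "resp_content_encoding": "deflate"},
    {"mimeType": "application/json", "resp_content_encoding": ""},
    {"mimeType": "image/png", "resp_content_encoding": ""},
]

COMPRESSED = {"gzip", "deflate"}


def compression_stats(requests, min_total=0):
    """Group requests by MIME type and count encodings, like the GROUP BY."""
    counts = defaultdict(
        lambda: {"total": 0, "gzip": 0, "deflate": 0, "no_compression": 0}
    )
    for row in requests:
        bucket = counts[row["mimeType"]]
        encoding = row["resp_content_encoding"]
        bucket["total"] += 1
        if encoding in COMPRESSED:
            bucket[encoding] += 1
        else:
            bucket["no_compression"] += 1

    results = []
    for mime_type, bucket in counts.items():
        # Keep only groups with more than min_total rows (the HAVING filter).
        if bucket["total"] <= min_total:
            continue
        compressed = bucket["gzip"] + bucket["deflate"]
        results.append({
            "mimeType": mime_type,
            **bucket,
            # A fraction between 0 and 1, rounded to two decimals.
            "compressed_pct": round(compressed / bucket["total"], 2),
        })
    # MIME types with the most gzip responses first (ORDER BY gzip DESC).
    return sorted(results, key=lambda r: r["gzip"], reverse=True)


if __name__ == "__main__":
    for row in compression_stats(SAMPLE_REQUESTS):
        print(row)
```

Note that the query's CompressedPercentage column, like `compressed_pct` here, is a fraction rather than a value out of 100.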
21
Some New LMA Defaults
● text/javascript
● font/ttf
● application/javascript
● text/xml
● application/json
● application/xml
● ...
Many Content-Types did not match the original defaults!
22
Modern Set of Last Mile Acceleration Defaults
23
What About Brotli?
● New compression algorithm developed by Google researchers
● 5% - 25% Reduction over Gzip Compression
● Supported by most browsers
● Let’s extend the previous query to include Brotli compression
24
Compression Stats - gzip and brotli
bit.ly/2Mcawl1
-- Same breakdown as the previous query, extended with Brotli ("br").
SELECT
  mimeType,
  COUNT(*) total,
  SUM(IF(resp_content_encoding = "gzip", 1, 0)) gzip,
  SUM(IF(resp_content_encoding = "br", 1, 0)) brotli,
  SUM(IF(resp_content_encoding = "deflate", 1, 0)) deflate,
  SUM(IF(resp_content_encoding IN ("gzip", "deflate", "br"), 0, 1)) NoCompression,
  -- Fraction (0 to 1) of requests served with any of the three encodings
  ROUND(
    SUM(IF(resp_content_encoding IN ("gzip", "deflate", "br"), 1, 0)) / COUNT(*),
    2
  ) CompressedPercentage,
  -- Fraction (0 to 1) of requests served with Brotli
  ROUND(
    SUM(IF(resp_content_encoding = "br", 1, 0)) / COUNT(*),
    2
  ) BrotliCompressedPercentage
FROM `httparchive.summary_requests.2018_05_15_desktop`
GROUP BY mimeType
HAVING total > 1000
ORDER BY brotli DESC
25
Examining Compression By Content Type - With Brotli
Brotli use has been growing, but where is it most prevalent?
26
Brotli Usage: Mostly JavaScript, CSS and HTML Resources
When we exclude Google and Facebook content, the bulk of Brotli encoded content is JS and CSS
27
Compression Level = Overhead
Most byte savings are obtained by using the highest compression level.
Data on compression speeds from quixdb.github.io/squash-benchmark/#results
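The size-versus-time trade-off can be observed locally. The sketch below uses Python's built-in `zlib` module (DEFLATE, the algorithm behind gzip) as a stand-in, since Brotli bindings are not part of the standard library. The payload is made up, and absolute numbers will vary by machine and input, so treat this as an illustration under those assumptions rather than a benchmark.

```python
import time
import zlib

# Hypothetical payload: repetitive text, roughly like a JavaScript file.
PAYLOAD = b"".join(
    b"function handler%d(event) { return event.target.value + %d; }\n" % (i, i)
    for i in range(5000)
)


def measure(level):
    """Compress PAYLOAD at the given zlib level; return (size, seconds)."""
    start = time.perf_counter()
    compressed = zlib.compress(PAYLOAD, level)
    elapsed = time.perf_counter() - start
    # Round-trip check so the numbers describe a valid compressed stream.
    assert zlib.decompress(compressed) == PAYLOAD
    return len(compressed), elapsed


if __name__ == "__main__":
    print(f"original: {len(PAYLOAD)} bytes")
    for level in (1, 6, 9):
        size, seconds = measure(level)
        print(f"level {level}: {size} bytes in {seconds * 1000:.2f} ms")
```

Compressing once offline and caching the result, as the next slide describes, moves this cost out of the request path.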
28
Resource Optimizer: Automated Brotli Compression at the Edge
● Automatically compresses CSS and JS with Brotli
● Resources are compressed offline and then cached
● Brotli compression level 11, without the overhead!
29
Case Study #2
Server Technologies
1. Akamai Varnish Connector
2. Security Vulnerability Research
30
Akamai Varnish Connector
developer.akamai.com/connector-for-varnish/ (free!)
31
How Many Akamai Customers Use Varnish at the Origin?
● HTTP Archive data helped determine which Akamai customers were using Varnish at the origin.
● Akamai Product Management was able to discuss desirable functionality with existing customers.
32
Investigating Security Threats - 0 Day Vulnerability
● HTTP Archive
○ Investigate other sites that contain similar characteristics
■ Server, Via, Url Regex Patterns, Other Headers
○ Export a list of sites that appear vulnerable
○ Cross Reference with Akamai Account Data
■ Notify 24x7 Security Contacts
■ Help customers proactively protect themselves
● No CVE or Historical Attack Patterns
○ Targeted attack observed and mitigated (Kona Managed Security)
○ Akamai WAF rules prepared
○ Vendor notified
● Example: 0 Day Vulnerability on an Ecommerce App Server
33
Another 0-Day: How Can We Identify Sites Running Drupal?
Drupal announced that a “highly critical” security release would be happening within a week.
The expectation was that this would give sites time to prepare for an emergency security patch before 0-day exploits began…
https://www.drupal.org/sa-core-2018-002
34
Identifying Sites Running Drupal with HTTP Archive
First Try:
● WHERE url LIKE "%drupal.js%"
● Found 97 sites using Akamai and Drupal
Second Try:
● Expires header = 'Sun, 19 Nov 1978 05:00:00 GMT'
○ https://www.ostraining.com/blog/drupal/5-ways-drupal/
● Found more sites using Akamai and Drupal
○ ~26K total requests with this expires header!
What Did We Do With This Data?
● Customer Outreach (Are you aware and prepared to patch?)
● Prepared WAF rules for those not able to apply patches immediately.
35
Investigating Security Threats - CryptoCurrency Miners
● Do any of my customers have cryptocurrency miners?
○ Are they aware?
○ Do they know how it got there?
https://discuss.httparchive.org/t/the-performance-impact-of-cryptocurrency-mining-on-the-web/1126/
36
Now Easier with Wappalyzer!
● Wappalyzer is a cross-platform utility that uncovers technologies used on websites.
● Integrated into HTTP Archive since April 2018
https://discuss.httparchive.org/t/using-wappalyzer-to-analyze-cpu-times-across-js-frameworks/1336/
37
Investigating Security Threats - What Domains are Serving Malware?
● Akamai’s ETP Service
○ Millions of malicious domains and IP addresses.
○ Are any of my customers serving 3rd party content from known malware hosts?
● HTTP Archive parsed against the ETP DB
○ Notified accounts if they served content to the HTTP Archive from known malware hosts
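The cross-reference described above amounts to matching each request's hostname against a blocklist. Here is a minimal Python sketch of that idea; the blocklist contents, record fields and function name are all hypothetical, and the real ETP database and its lookup mechanism are not shown in the slides.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist standing in for a threat-intelligence feed.
KNOWN_BAD_HOSTS = {"cdn.badhost.example"}

# Hypothetical request records: the page tested and a resource it loaded.
SAMPLE_THIRD_PARTY_REQUESTS = [
    {"page": "https://shop.example/", "url": "https://shop.example/app.js"},
    {"page": "https://shop.example/",
     "url": "https://cdn.badhost.example/tracker.js"},
    {"page": "https://news.example/",
     "url": "https://static.goodcdn.example/lib.js"},
]


def flagged_pages(requests, bad_hosts):
    """Map each page to the blocklisted third-party hosts it loaded from."""
    flagged = {}
    for request in requests:
        host = urlsplit(request["url"]).hostname
        page_host = urlsplit(request["page"]).hostname
        # Only third-party hosts are of interest here.
        if host and host != page_host and host in bad_hosts:
            flagged.setdefault(request["page"], set()).add(host)
    return flagged


if __name__ == "__main__":
    for page, hosts in flagged_pages(
        SAMPLE_THIRD_PARTY_REQUESTS, KNOWN_BAD_HOSTS
    ).items():
        print(page, sorted(hosts))
```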
38
Case Study #3
Third Party Research
1. How Do 3rd Parties Influence Render Time?
2. Researching a specific 3rd party
39
Do Third Parties Impact Load Time?
https://discuss.httparchive.org/t/analyzing-3rd-party-performance-via-http-archive-crux/1359
● CrUX = Chrome User Experience Report
● JOINed with HTTP Archive data for Alexa ranks
● Load times are faster for sites with less third-party content.
40
Which 3rd Party Content Types Load Before Render Start?
bit.ly/2sN2TJZ
-- For third-party requests on ranked sites (rank below 100,000), count per
-- MIME type how many started before vs. after the page's render start.
SELECT
  mimeType,
  COUNT(*) num_requests,
  -- renderStart is divided by 1000 to convert milliseconds to seconds
  SUM(IF(req.startedDateTime < (pages.startedDateTime + (renderStart / 1000)), 1, 0)) BeforeRenderStart,
  SUM(IF(req.startedDateTime < (pages.startedDateTime + (renderStart / 1000)), 0, 1)) AfterRenderStart
FROM `httparchive.summary_requests.2017_09_01_desktop` req
JOIN (
  SELECT rank, NET.HOST(url) hostname, url, pageid, startedDateTime, renderStart
  FROM `httparchive.summary_pages.2017_09_01_desktop`
) pages ON pages.pageid = req.pageid
-- A request counts as third party when its host differs from the page's host
WHERE NET.HOST(req.url) != pages.hostname AND rank > 0 AND rank < 100000
GROUP BY mimeType
HAVING num_requests > 1000
ORDER BY BeforeRenderStart DESC
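The before/after classification in the query above can be traced on a small example. This Python sketch assumes, as the query's division by 1000 suggests, that `startedDateTime` is in epoch seconds and `renderStart` is in milliseconds; the sample page, requests and function name are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical page record: render starts 1.5 s after the page starts.
SAMPLE_PAGE = {
    "url": "https://www.example.com/",
    "startedDateTime": 1000,   # epoch seconds (assumed unit)
    "renderStart": 1500,       # milliseconds (assumed unit)
}

# Hypothetical requests made while loading SAMPLE_PAGE.
SAMPLE_PAGE_REQUESTS = [
    {"url": "https://www.example.com/app.js",
     "mimeType": "application/javascript", "startedDateTime": 1000},
    {"url": "https://cdn.thirdparty.example/lib.js",
     "mimeType": "application/javascript", "startedDateTime": 1001},
    {"url": "https://ads.thirdparty.example/ad.js",
     "mimeType": "application/javascript", "startedDateTime": 1003},
    {"url": "https://fonts.thirdparty.example/f.woff2",
     "mimeType": "font/woff2", "startedDateTime": 1001},
]


def tally_by_mime_type(page, requests):
    """Count third-party requests per MIME type, before vs. after render."""
    page_host = urlsplit(page["url"]).hostname
    # Convert renderStart to seconds before adding it to the page start time.
    render_at = page["startedDateTime"] + page["renderStart"] / 1000
    tally = defaultdict(lambda: {"before": 0, "after": 0})
    for request in requests:
        # Skip first-party requests (same host as the page).
        if urlsplit(request["url"]).hostname == page_host:
            continue
        key = "before" if request["startedDateTime"] < render_at else "after"
        tally[request["mimeType"]][key] += 1
    return dict(tally)


if __name__ == "__main__":
    for mime_type, counts in tally_by_mime_type(
        SAMPLE_PAGE, SAMPLE_PAGE_REQUESTS
    ).items():
        print(mime_type, counts)
```

Like the query, this compares full hostnames, so a site's own subdomains (for example a static asset host) are counted as third parties.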
41
What 3rd Party Content Loads Before RenderStart?
discuss.httparchive.org/t/which-3rd-party-content-loads-before-render-start/1084
42
Which 3rd Parties Load Before Render Start?
bit.ly/2glVquX
-- Same comparison as the previous query, broken down by third-party host
-- as well as MIME type.
SELECT
  NET.HOST(req.url) thirdparty,
  mimeType,
  COUNT(*) num_requests,
  SUM(IF(req.startedDateTime < (pages.startedDateTime + (renderStart / 1000)), 1, 0)) BeforeRenderStart,
  SUM(IF(req.startedDateTime < (pages.startedDateTime + (renderStart / 1000)), 0, 1)) AfterRenderStart
FROM `httparchive.summary_requests.2017_09_01_desktop` req
JOIN (
  SELECT rank, NET.HOST(url) hostname, url, pageid, startedDateTime, renderStart
  FROM `httparchive.summary_pages.2017_09_01_desktop`
) pages ON pages.pageid = req.pageid
WHERE NET.HOST(req.url) != pages.hostname AND rank > 0 AND rank < 100000
GROUP BY thirdparty, mimeType
HAVING num_requests > 100
ORDER BY BeforeRenderStart DESC
43
What 3rd Party Domains Load Before RenderStart?
44
Which Websites Load <3rd Party> Before vs After Render Time?
bit.ly/2LuD1Jq
-- For one third party (here ensighten.com), list each ranked site with the
-- number of its requests that started before vs. after render start.
SELECT
  rank,
  site,
  SUM(IF(req.startedDateTime < (pages.startedDateTime + (renderStart / 1000)), 1, 0)) BeforeRenderStart,
  SUM(IF(req.startedDateTime < (pages.startedDateTime + (renderStart / 1000)), 0, 1)) AfterRenderStart
FROM `httparchive.summary_requests.2017_09_01_desktop` req
INNER JOIN (
  SELECT rank, NET.HOST(url) site, pageid, startedDateTime, renderStart
  FROM `httparchive.summary_pages.2017_09_01_desktop`
) pages ON pages.pageid = req.pageid
WHERE NET.HOST(req.url) LIKE "%ensighten.com%" AND rank > 0
GROUP BY rank, site
-- Only sites with at least one request after render start are returned
HAVING AfterRenderStart > 0
ORDER BY rank ASC
45
Which Sites Load a Specific 3rd Party Before RenderStart?
● This query outputs a summary containing the following information:
○ Which sites use the 3rd party?
○ How many resources are served by it?
○ How many of them are loaded before/after the page renders?
● Results can help sites learn from each other’s best practices
○ Even across industries!
46
Getting Started
1. Google Cloud project
2. BigQuery config
47
Up and running - Create a Google Cloud Project
bit.ly/httparchive-bigquery
● console.cloud.google.com/projectcreate
48
Up and running - Add the HTTP Archive Tables
● bigquery.cloud.google.com
bit.ly/httparchive-bigquery
49
Up and running - Start Exploring!
http://bit.ly/httparchive-bigquery
● Detailed Setup Instructions:
○ bit.ly/httparchive-bigquery
● BigQuery Standard SQL Documentation
○ cloud.google.com/bigquery/docs/reference/standard-sql/
● HTTP Archive Discussion Forum
○ discuss.httparchive.org
50
Continuing the Discussion…
discuss.httparchive.org
51
HTTP Archive Sponsors
Contact team@httparchive.org if interested
52
Thanks!
Chat: bit.ly/ha-slack
Contribute: github.com/HTTPArchive
Collaborate: discuss.httparchive.org
@paulcalvano
pacalvan@akamai.com

Fluent 2018: Tracking Performance of the Web with HTTP Archive
