Skip to content
This repository was archived by the owner on Jun 1, 2021. It is now read-only.

joelpurra/har-dulcify

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

169 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Extract data from HTTP Archive (HAR) files, quite possibly downloaded by har-heedless, for some aggregate analysis. You might want to use har-portent, which runs both downloads multiple dataset variations using har-heedless and then analyzes them with har-dulcify in a single step.

⚠️ This project has been archived

No future updates are planned. Feel free to continue using it, but expect no support.

Usage

# TODO: describe relevant scripts.
# Start with src/one-shot/*.sh

$ tree src/
# src/
# ├── aggregate
# │   ├── all.sh
# │   ├── analysis.sh
# │   ├── merge.sh
# │   ├── prepare.sh
# │   └── prepare2.sh
# ├── classification
# │   ├── basic.sh
# │   ├── disconnect
# │   │   ├── add.sh
# │   │   ├── analysis.sh
# │   │   └── prepare-service-list.sh
# │   └── public-suffix
# │       ├── add.sh
# │       └── prepare-list.sh
# ├── domains
# │   └── latest
# │       ├── all.sh
# │       └── single.sh
# ├── extract
# │   ├── errors
# │   │   ├── all.sh
# │   │   ├── failed-page-loads.sh
# │   │   ├── page.sh
# │   │   └── successful-page-loads.sh
# │   └── request
# │       ├── expand-parts.sh
# │       └── parts.sh
# ├── multiset
# │   ├── download-retries.sh
# │   ├── non-failed.classification.disconnect.coverage.sh
# │   ├── non-failed.classification.domain-scope.coverage.sh
# │   ├── non-failed.classification.secure.coverage.sh
# │   ├── non-failed.disconnect.categories.coverage.external.sh
# │   ├── non-failed.disconnect.counts.sh
# │   ├── non-failed.disconnect.domains.coverage.external.google.sh
# │   ├── non-failed.disconnect.domains.coverage.external.sh
# │   ├── non-failed.disconnect.organizations.coverage.external.sh
# │   ├── non-failed.mime-types.groups.coverage.external.sh
# │   ├── non-failed.mime-types.groups.coverage.internal.sh
# │   ├── non-failed.mime-types.groups.coverage.origin.sh
# │   ├── non-failed.public-suffix.coverage.external.sh
# │   ├── non-failed.requests.counts.sh
# │   ├── origin-redirects.sh
# │   ├── ratio-buckets.sh
# │   └── request-status.codes.coverage.origin.sh
# ├── one-shot
# │   ├── aggregate.sh
# │   ├── all.sh
# │   ├── data.sh
# │   ├── multiset.sh
# │   ├── preparations.sh
# │   └── questions.sh
# ├── questions
# │   ├── disconnect.categories.organizations.sh
# │   ├── google-gtm-ga-dc.aggregate.sh
# │   ├── google-gtm-ga-dc.sh
# │   ├── origin-redirects.aggregate.sh
# │   ├── origin-redirects.sh
# │   ├── ratio-buckets.aggregate.analysis.sh
# │   ├── ratio-buckets.aggregate.sh
# │   └── ratio-buckets.sh
# └── util
#     ├── array-of-objects-to-csv.sh
#     ├── array-of-objects-to-tsv.sh
#     ├── cat-path.sh
#     ├── clean-csv-sorted-header.sh
#     ├── clean-tsv-sorted-header.sh
#     ├── concat.sh
#     ├── dataset-foreach.sh
#     ├── dataset-query.sh
#     ├── malformed-har.sh
#     ├── parallel-chunks.sh
#     ├── parallel-n-2.sh
#     ├── prepare-alexa-domain-lists.sh
#     ├── prepare-domain-lists.sh
#     ├── prepare-zone-file-domain-lists.sh
#     ├── reduce-merge-deep-add.sh
#     ├── structure.sh
#     ├── take.sh
#     ├── to-array.sh
#     └── unwrap-array.sh
#
# 13 directories, 69 files

Original purpose

Photo of Joel Purra presenting his master's thesis, named Swedes Online: You Are More Tracked Than You Think

Built as a component in Joel Purra's master's thesis research, where downloading lots of front pages in the .se top level domain zone was required to analyze their content and use of internal/external resources.

Citations

If you use, like, reference, or base work on the thesis report Swedes Online: You Are More Tracked Than You Think, the IEEE LCN 2016 paper Third-party Tracking on the Web: A Swedish Perspective, open source code, or open data, please add at least on of the following two citations with a link to the project website: https://joelpurra.com/projects/masters-thesis/

Master's thesis citation:

Joel Purra. 2015. Swedes Online: You Are More Tracked Than You Think. Master's thesis. Linköping University (LiU), Linköping, Sweden. https://joelpurra.com/projects/masters-thesis/

IEEE LCN 2016 paper citation:

J. Purra, N. Carlsson, Third-party Tracking on the Web: A Swedish Perspective, Proc. IEEE Conference on Local Computer Networks (LCN), Dubai, UAE, Nov. 2016. https://joelpurra.com/projects/masters-thesis/


Copyright (c) 2014, 2015, 2016, 2017 Joel Purra. Released under GNU General Public License version 3.0 (GPL-3.0).

About

Extract data from HTTP Archive (HAR) files, quite possibly downloaded by har-heedless, for some aggregate analysis.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages