2014-09-02: WARCMerge: Merging Multiple WARC files into a single WARC file

WARCMerge is a new tool for organizing WARC files. As the name suggests, it merges multiple WARC files into a single one. In web archiving, WARC files can be generated by well-known web crawlers such as Heritrix and the Wget command, or by more recent tools such as WARCreate/WAIL and Webrecorder.io, which were developed to support personal web archiving. WARC files contain records not only for HTTP responses and metadata elements but also for all of the original HTTP requests. With those WARC files, any replay tool (e.g., the Wayback Machine) can be used to reconstruct and display the original web pages. Note that a single WARC file may contain records from different web sites; in other words, multiple web sites can be archived in the same WARC file. This Python program runs in three different modes. In the first mode, the program sequentially reads records one by one from different WARC files and combines them into a new file in whic...
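
To make the idea of sequential merging concrete, here is a minimal sketch of reading records one by one from several WARC inputs and writing them to a single output. It is not WARCMerge's actual code: the function names `iter_warc_records` and `merge_warcs` are hypothetical, it assumes uncompressed WARC input, and it relies only on the `Content-Length` header that every WARC record carries.

```python
import io


def iter_warc_records(stream):
    """Yield the raw bytes (header + block) of each record in an
    uncompressed WARC stream opened in binary mode."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        if line in (b"\r\n", b"\n"):
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)

        header_lines = [line]
        content_length = None
        while True:
            line = stream.readline()
            if not line:
                raise ValueError("input ended inside a record header")
            header_lines.append(line)
            if line in (b"\r\n", b"\n"):
                break  # blank line ends the header
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                content_length = int(value.strip())

        if content_length is None:
            raise ValueError("record has no Content-Length header")
        block = stream.read(content_length)
        if len(block) != content_length:
            raise ValueError("input ended inside a record block")
        yield b"".join(header_lines) + block


def merge_warcs(inputs, output):
    """Copy every record from each input stream, in order, into output.

    Returns the number of records written."""
    count = 0
    for stream in inputs:
        for record in iter_warc_records(stream):
            # Each record is followed by two CRLFs in the WARC format.
            output.write(record + b"\r\n\r\n")
            count += 1
    return count
```

With files on disk, the same functions could be called as `merge_warcs([open(p, "rb") for p in paths], open("merged.warc", "wb"))`, where `paths` is a placeholder list of input file names. Because the records are copied unchanged, records from different web sites simply sit side by side in the merged file.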