Thanks for this post, David. Regarding the caveat: that problem clearly exists because, in essence, what happens is an abstract form of screen scraping. But I am rather confident that the problem is tractable within the proposed approach:
- A quality-control mechanism at the end of the server's crawling process can detect that something is wrong with the Trace at hand: when an instruction in a Trace is no longer valid, the associated call to the headless browser API will result in an error message. This error can be intercepted and indicates that the Trace has to be re-recorded (see the sketch after this list).
- One can imagine acting automatically upon such an error condition: there could be an alerting mechanism from the server-side crawler environment to the shared repository when a Trace no longer works. If the repository is, say, GitHub, this could be done by automatically posting an Issue.
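
To make this concrete, here is a minimal sketch of both steps, assuming a hypothetical JSON Trace format, Selenium-driven headless Chrome for the replay, and the GitHub Issues REST API for the alert. The Trace file, repository name, and helper functions are illustrative assumptions, not part of any actual Tracer implementation:

import json

import requests
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By


def replay_trace(trace_path: str) -> None:
    """Replay each recorded instruction; any failed instruction raises."""
    with open(trace_path) as f:
        # assumed format: {"url": ..., "steps": [{"action": ..., "selector": ...}, ...]}
        trace = json.load(f)

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(trace["url"])
        for step in trace["steps"]:
            if step["action"] == "click":
                # Raises NoSuchElementException if the recorded selector is stale.
                driver.find_element(By.CSS_SELECTOR, step["selector"]).click()
            # ... other recorded actions (type, scroll, ...) would go here
    finally:
        driver.quit()


def report_stale_trace(trace_path: str, error: Exception, repo: str, token: str) -> None:
    """Quality-control step: open a GitHub Issue asking for the Trace to be re-recorded."""
    requests.post(
        f"https://api.github.com/repos/{repo}/issues",
        headers={"Authorization": f"token {token}"},
        json={
            "title": f"Trace {trace_path} no longer replays",
            "body": f"Headless-browser replay failed with: {error}",
        },
        timeout=30,
    )


if __name__ == "__main__":
    # Crawler-side quality control after attempting one Trace.
    try:
        replay_trace("traces/example-portal.json")    # hypothetical Trace file
    except (WebDriverException, KeyError) as exc:
        report_stale_trace(
            "traces/example-portal.json",
            exc,
            repo="example-org/traces",                # hypothetical shared repository
            token="<github-token>",
        )

In practice one would probably also deduplicate such Issues so that a broken Trace is reported only once rather than on every crawl.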
May 24, 2018, 11:09:40 AM
Posted to How Far Is Far Enough?

