deduplication


deduplication (de-duplication, AKA deduping/de-duping) is the process of comparing responses (sometimes posts) to see whether they are exactly or essentially the same, then keeping only the earliest or most canonical version, perhaps while keeping track of alternative URLs such as syndicated copies.

How to deduplicate responses

Replies and other responses are often duplicated in different places, e.g. via backfeed of POSSEd replies by Bridgy. Ideally, recipients should try to de-dupe webmention sources, preferring an original post (see below). Getting this perfect is hard, but getting close is pretty easy (see one IRC discussion and another) by both:

  1. Preferring original replies
  2. Comparing an incoming reply (etc) to existing replies based on:
    • u-uid
    • u-url
    • u-syndication (also compare to u-url, and vice versa)
    • other u-in-reply-to links in the incoming reply
    • rel=alternate / rel=canonical
    • full text, after stripping HTML tags and probably ignoring whitespace differences
    • text prefix, after also stripping leading @username, RT/MT, trailing ..., etc.
    • edit distance, longest common subsequence, or other fuzzy match
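
The comparisons in step 2 can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: it assumes each reply has already been parsed into a simplified dict with hypothetical keys "uid", "url", "syndication" (lists of URL strings) and "content" (an HTML string). The function names, the 0.9 similarity threshold, and the 20-character minimum for prefix matches are all assumptions chosen for the example. It does not cover in-reply-to links or rel=alternate / rel=canonical.

```python
import re
from difflib import SequenceMatcher
from html import unescape


def identifying_urls(reply):
    """Collect uid, url, and syndication URLs so each can match any of the others."""
    urls = set()
    for prop in ("uid", "url", "syndication"):
        for value in reply.get(prop, []):
            urls.add(value.rstrip("/"))  # ignore trailing-slash differences
    return urls


def normalize_text(html):
    """Strip HTML tags, collapse whitespace, and drop common silo decorations."""
    text = unescape(re.sub(r"<[^>]+>", " ", html))
    text = re.sub(r"\s+", " ", text).strip()
    # Drop leading RT/MT markers and @username mentions.
    text = re.sub(r"^(?:(?:RT|MT)\b:?\s*|@\w+:?\s*)+", "", text)
    # Drop a trailing ellipsis left by truncation.
    text = re.sub(r"\s*(?:\.{3,}|…)\s*$", "", text)
    return text.lower()


def is_duplicate(incoming, existing, threshold=0.9, min_prefix=20):
    """Return True if two replies look like copies of the same response."""
    # 1. Any shared uid / url / syndication URL is treated as a match.
    if identifying_urls(incoming) & identifying_urls(existing):
        return True
    a = normalize_text(incoming.get("content", ""))
    b = normalize_text(existing.get("content", ""))
    if not a or not b:
        return False
    # 2. Exact match on normalized full text.
    if a == b:
        return True
    # 3. Prefix match, for copies truncated by a silo.
    shorter, longer = sorted((a, b), key=len)
    if len(shorter) >= min_prefix and longer.startswith(shorter):
        return True
    # 4. Fuzzy match as a last resort.
    return SequenceMatcher(None, a, b).ratio() >= threshold


# Example: an original reply and a backfed copy of its POSSEd version.
original = {
    "url": ["https://alice.example/replies/1"],
    "syndication": ["https://silo.example/alice/status/1"],
    "content": "<p>Great write-up, thanks for sharing this!</p>",
}
backfed = {
    "url": ["https://silo.example/alice/status/1"],
    "content": "@bob Great write-up, thanks for sharing this!",
}
print(is_duplicate(backfed, original))
```

Here the backfed copy matches because its url equals the original's syndication URL. When a match is found, the recipient would then keep whichever copy is the original reply (step 1), for example the one that lists the other as its syndication.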

Response challenges

Examples / challenges for de-duping (use these as source material to check any de-duping approaches / algorithms)

IndieWeb Examples

Kyle Mahan

Kyle Mahan has de-duplicated comments on his site since at least 2015-06.

Aaron Parecki

Aaron Parecki has de-duplicated comments on his site since 2017-09-01, with a partially working implementation since ~2016.

Silo Examples

Twitter

  • Twitter: ~24hr(?) dedupe. In their web create UI, if you enter the same text as one of your tweets from roughly the past 24 hours and attempt to "Tweet", Twitter won't post it and will instead show the error message "You have already sent this Tweet.". The 24-hour window is an educated guess: only intervals of minutes and of years have been tested.
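
A time-window check like the one observed above could be sketched as follows. This is not Twitter's implementation, which is not documented here. The class name, method name, and 24-hour default are all hypothetical, and the sketch matches on exact text only.

```python
from datetime import datetime, timedelta


class RecentPostFilter:
    """Reject a post whose exact text was already accepted within a time window."""

    def __init__(self, window=timedelta(hours=24)):  # 24h is a guess, per the text above
        self.window = window
        self._last_posted = {}  # text -> datetime of the most recent accepted post

    def try_post(self, text, now):
        """Return True and record the post, or False if it is a recent duplicate."""
        previous = self._last_posted.get(text)
        if previous is not None and now - previous < self.window:
            return False
        self._last_posted[text] = now
        return True


posts = RecentPostFilter()
start = datetime(2020, 1, 1, 12, 0)
print(posts.try_post("hello world", start))                          # first post
print(posts.try_post("hello world", start + timedelta(minutes=5)))   # minutes later
print(posts.try_post("hello world", start + timedelta(days=365)))    # a year later
```

With these inputs the first and third calls are accepted and the second is rejected, mirroring the "minutes" and "years" cases described above.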

See Also