Invalidating Copyright
Infringement Claims with
Python and Fuzzy
Hashing
Joe T. Sylve, M.S.

Managing Partner
504ENSICS Labs
Background
• Client was being sued for Copyright Infringement
• Client’s lawyer wanted two questions answered
• Does the code contain any open source or GPL code?
• When was the code in question written?

• Code was written in PHP (web-based application)
• Code had absolutely no comments
• No copyright headers
• No dates of any kind

www.504ensics.com
Goal
• If it can be proven that the code contains open
source or GPL code with restrictive licenses, then
the claim is invalid
• If it can be proven that the copyrighted code on file
was written after the author’s claimed “creation
date”, the copyright is invalid

Is code original?
• No comments or headers that would imply
authorship
• Code didn’t look familiar
• Code was kind of crappy

Step 1 – Acquire Samples
• Wrote Python script to download all projects
written in PHP from GitHub
• Scraped from search feature
• Limited to 50 pages of search

• Got something like 10GB of compressed code
• ~100,000 files

Step 2 – Compare Code
• Three Options
• Manual Verification
• Grad Students, Interns, etc.

• Cryptographic Hashing
• MD5, SHA-1, etc.

• “Fuzzy” Hashing
• ssdeep, sdhash
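The cryptographic-hashing option can be sketched as an exact-match index. This is a minimal illustration, not the script used in the case: the function names and the directory layout (one folder of downloaded corpus files, one folder of code in question) are assumptions. It only finds byte-for-byte identical files, which is why the deck moves on to fuzzy hashing.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path):
    """Return the SHA-1 hex digest of a file's raw bytes."""
    return hashlib.sha1(Path(path).read_bytes()).hexdigest()


def build_index(corpus_dir):
    """Map each digest to the corpus .php files that produce it."""
    index = {}
    for path in sorted(Path(corpus_dir).rglob("*.php")):
        index.setdefault(sha1_of_file(path), []).append(path)
    return index


def exact_matches(suspect_dir, index):
    """Return {suspect file: [identical corpus files]} for exact copies only."""
    matches = {}
    for path in sorted(Path(suspect_dir).rglob("*.php")):
        hits = index.get(sha1_of_file(path))
        if hits:
            matches[path] = hits
    return matches
```

A suspect file that differs from a corpus file by even one character will not appear in the result.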

Fuzzy Hashing
• Vassil says I have to call it “Approximate Matching”
• sdhash
• Vassil Roussev & Candace Quates
• Free, Open Source
• Awesome

• Traditional hashing
• If a single bit of the input changes, the whole hash
changes

• Fuzzy Hashing
• Compares files and gives similarity index
• Can find “similar” files
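The contrast between the two kinds of hashing can be shown with the standard library alone. In this sketch `difflib.SequenceMatcher` is only a stand-in for a similarity score; it is not the algorithm that ssdeep or sdhash use, and the two PHP snippets are made-up examples.

```python
import hashlib
from difflib import SequenceMatcher

# Two made-up snippets that differ only in a variable name.
original = "<?php function plus_one($x) { return $x + 1; } ?>"
modified = "<?php function plus_one($y) { return $y + 1; } ?>"


def sha1_hex(text):
    """Cryptographic digest: any change to the input gives an unrelated digest."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def similarity(a, b):
    """Stand-in similarity index from 0 to 100 (NOT a real fuzzy hash)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


digest_a = sha1_hex(original)
digest_b = sha1_hex(modified)

print(digest_a == digest_b)            # the digests no longer match
print(similarity(original, modified))  # the score stays high
```

A real fuzzy hash computes a compact digest per file so that large collections can be compared without keeping the full text around; the sketch skips that step and compares the text directly.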
When was code written?
• We can invalidate copyright if the sample on file
was written after the claimed authorship date
• No comments or dates of any kind in the code!
• No access to developer’s workstation to do
traditional forensics
• ???

PHP
• Web-based language
• Updated reasonably frequently
• New Features added often
• Goal
• Determine which features were used in the code
• Correlate features with PHP release date
• Code couldn’t have been written before this date

Step 1 – Function Use
• Programmer can create own functions or use ones
available in the language
• Example
• function plus_one($x) { return $x + 1; }

• Python script to find all function declarations and
calls
• Ignore declared functions
• Left with a list of language “features” used
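A rough sketch of this step, assuming regular expressions are good enough for the job. The patterns, the `CONSTRUCTS` set, and the function name `builtin_calls` are illustrative choices, not the original script. The sketch does not strip strings or comments, and it treats `new Foo(` as a call, so its output would still need manual review.

```python
import re

# "function name(" introduces a user-declared function.
DECLARATION = re.compile(r"\bfunction\s+&?\s*([A-Za-z_][A-Za-z0-9_]*)\s*\(")

# "name(" not preceded by $, ->, :: or another identifier character.
CALL = re.compile(r"(?<![\$>:\w])([A-Za-z_][A-Za-z0-9_]*)\s*\(")

# Language constructs that look like calls but are not functions.
CONSTRUCTS = {
    "if", "elseif", "while", "for", "foreach", "switch", "catch",
    "function", "array", "list", "isset", "unset", "empty", "echo",
    "print", "return", "exit", "die", "include", "include_once",
    "require", "require_once", "and", "or", "xor",
}


def builtin_calls(php_source):
    """Return called names that are neither declared in the source nor constructs."""
    # PHP function names are case-insensitive, so compare in lower case.
    declared = {name.lower() for name in DECLARATION.findall(php_source)}
    called = {name.lower() for name in CALL.findall(php_source)}
    return called - declared - CONSTRUCTS


sample = """<?php
function plus_one($x) { return $x + 1; }
if (strlen($name) > 0) {
    $parts = str_split($name);
    echo json_encode(plus_one(count($parts)));
}
"""

print(sorted(builtin_calls(sample)))
```

On the sample, `plus_one` is dropped because it is declared in the file, leaving only the built-in functions.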

Step 2 – Version Detection
• PHP comes with auto-generated documentation
about each built-in function
• Documentation says in which version each function
first became available
• Write Python script to scrape PHP documentation
• Correlate functions with PHP versions
• We only care about the function with the newest
version
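Once functions have been mapped to versions, picking the newest one is a small comparison. In this sketch the `FIRST_VERSION` table is hand-entered for illustration rather than scraped, and its values should be checked against the PHP manual before being relied on. Versions are compared as integer tuples, because comparing them as strings would rank "5.10.0" below "5.9.0".

```python
# Illustrative, hand-entered table: function name -> first PHP version.
# A real run would fill this from the scraped documentation.
FIRST_VERSION = {
    "strlen": "4.0.0",
    "count": "4.0.0",
    "str_split": "5.0.0",
    "date_default_timezone_set": "5.1.0",
    "json_encode": "5.2.0",
}


def version_key(version):
    """Turn '5.1.5' into (5, 1, 5) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def newest_requirement(functions, table=FIRST_VERSION):
    """Return (function, version) with the newest first-available version.

    Functions missing from the table are skipped; returns None if none are known.
    """
    known = [(version_key(table[name]), name) for name in functions if name in table]
    if not known:
        return None
    key, name = max(known)
    return name, ".".join(str(part) for part in key)


print(newest_requirement({"strlen", "str_split", "json_encode"}))
```

Only the maximum matters for dating: every other function in the code was already available by that version.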

Step 3 – Date the code
• PHP has an archive of release notes on their
website
• Contains release versions and dates
• Python script scrapes release notes for the PHP
version of interest and gives us the release date
• Reasonably, the code couldn’t have been written
before that date
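The final comparison can be sketched as follows. The only table entry is the version and release date quoted in this deck; a real run would fill `RELEASE_DATES` from the scraped release notes. The claimed creation date used in the example is hypothetical, since the deck does not state it, and the function names are illustrative.

```python
from datetime import date, datetime

# Version -> release date, in the "17-Aug-2006" style used in this deck.
RELEASE_DATES = {
    "5.1.5": "17-Aug-2006",
}


def release_date(version, table=RELEASE_DATES):
    """Parse the release date for a PHP version (assumes English month names)."""
    return datetime.strptime(table[version], "%d-%b-%Y").date()


def contradicts_claimed_date(version, claimed_creation):
    """True if the required PHP version was released after the claimed date."""
    return release_date(version) > claimed_creation


# Hypothetical claimed creation date, for illustration only.
claimed = date(2005, 1, 1)
print(release_date("5.1.5"), contradicts_claimed_date("5.1.5", claimed))
```

The result is a lower bound only: it shows the code could not have been written before the release date, not when it was actually written.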

Step 4 – Profit
• Win!
• Code in question used features first available in
PHP 5.1.5
• Release date 17-Aug-2006
• This was after the claimed creation date

Conclusion
• Sometimes you can’t depend solely on existing
tools
• Learn to program even if you’re not a
“programmer”
• PHP sucks
• Fuzzy Hashing and Python are cool

