GitHub + F/OSS => 1 million SPDX
Large-scale license transparency using open data, open standards and F/OSS
http://triplecheck.net http://searchcode.com
Speaker 
Slide #2 
Nuno Brito
• Free/open source contributor since 2005
• Wrote 100k lines of F/OSS code in the last 12 months
• SPDX contributor, co-founder of TripleCheck
Around the web
http://nunobrito.eu
Transparency 
Slide #3 
Take some source code as example 
Who developed the code? 
Which licenses are applicable? 
Was the code copied from somewhere else?
Size 
Slide #4 
A problem of scale 
Open licenses? > 300 types to choose from
> 5 million F/OSS projects 
> 100 million source code files
Practice 
Slide #5 
Applying licenses
• Burden on developer (do correctly, do enough)
• Expressed differently (difficult to understand)
• Scaling obstacles (scarce automation)
Transparency?
What to do?
Slide #6
Ideally, we'd have tooling that is...
a) Reachable 
b) Cooperative 
c) Free 
Choose two. (sad reality)
Choose three 
Slide #7 
Choose building blocks based on: 
a) Open standards 
b) Open data 
c) Reachable tools 
Learn, write, improve. 
Share.
Standards 
Slide #8 
SPDX: Open standard for software licensing
• Standardizes license description
• Defines IDs for license terms
• http://spdx.org
Pros: Good docs, straightforward, getting better
Cons: Slow adoption, scarce tooling
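To make the "standardized license description" concrete, here is a minimal sketch that renders one file entry using SPDX tag/value field names. The function name, the sample path and the sample copyright holder are assumptions for illustration; a full SPDX document carries more fields than shown here.

```python
import hashlib


def spdx_file_entry(name: str, content: bytes, license_id: str, copyright_text: str) -> str:
    """Render one file entry using SPDX tag/value field names (partial sketch)."""
    sha1 = hashlib.sha1(content).hexdigest()
    lines = [
        f"FileName: {name}",
        f"FileChecksum: SHA1: {sha1}",
        # license_id is expected to be a short identifier such as "MIT" or "Apache-2.0"
        f"LicenseInfoInFile: {license_id}",
        f"FileCopyrightText: <text>{copyright_text}</text>",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical sample file, not taken from the talk.
    print(spdx_file_entry("./src/hello.c", b"int main(void) { return 0; }\n",
                          "MIT", "Copyright 2014 Example Author"))
```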
Open data 
Slide #9 
GitHub: Targeting open data repositories
• API suited for intensive access
• Social coding
• Largest open source code collection
Pros: Reachable, diverse
Cons: Repositories processed one-by-one
Tooling 
Slide #10 
Custom-built tools for software licenses
• Large-scale repository data-mining
• Find applicable licenses inside content
• Share millions of SPDX documents
Pros: Learn by doing, modularized, single language
Cons: Built from scratch, needs consolidation
Step 1 
Slide #11 
Desktop tool/engine to discover licenses
• SPDX format as storage medium
• Identify copyright and 18 license types
• Written in Java, released in Feb. 2014 under the EUPL
http://spdx.org/tools/community/triplecheck-reporter
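As a rough idea of what "identify copyright and license types" can involve, the sketch below matches copyright lines with a regular expression and licenses with fixed trigger phrases. This is not the triplecheck-reporter engine (which is written in Java); the phrase table, the regex and the function name are all assumptions, and only two licenses are covered.

```python
import re

# Hypothetical trigger phrases; a real engine would use far richer rules.
LICENSE_PHRASES = {
    "Apache-2.0": "licensed under the apache license, version 2.0",
    "MIT": "permission is hereby granted, free of charge",
}

# Matches lines such as "Copyright (c) 2014 Jane Doe".
COPYRIGHT_RE = re.compile(r"copyright\s+(?:\(c\)\s*)?(\d{4})\s+(.+)", re.IGNORECASE)


def scan(text: str):
    """Return (license ids, copyright statements) found in a source text."""
    lowered = " ".join(text.lower().split())  # collapse whitespace before matching
    licenses = sorted(lic for lic, phrase in LICENSE_PHRASES.items() if phrase in lowered)
    copyrights = [m.group(0).strip() for m in COPYRIGHT_RE.finditer(text)]
    return licenses, copyrights


if __name__ == "__main__":
    header = ("/* Copyright (c) 2014 Jane Doe\n"
              " * Permission is hereby granted, free of charge, to any person\n"
              " */")
    print(scan(header))
```

Phrase matching like this only catches headers that keep the phrase on one line; it is meant to show the shape of the problem, not a complete detector.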
Desktop 
Slide #12
File detail 
Slide #13
SPDX file 
Slide #14
Customize 
Slide #15
Details 
Slide #16 
Under the hood
• 147 file extensions, 18 license types
• LOC, hashes (SHA1, MD5, SHA256, SSDEEP)
• Command line supported (Jenkins, cron)
• Fast: 40k files/minute (Pentium IV)
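The per-file measurements listed above can be sketched with the Python standard library. SSDEEP is left out because it needs a third-party library, and "LOC" is taken here to mean non-blank lines, which is an assumption: the slide does not say how lines are counted.

```python
import hashlib


def file_metrics(content: bytes) -> dict:
    """Compute line count and cryptographic hashes for one file's bytes."""
    non_blank = [line for line in content.splitlines() if line.strip()]
    return {
        "loc": len(non_blank),  # assumption: LOC = non-blank lines
        "md5": hashlib.md5(content).hexdigest(),
        "sha1": hashlib.sha1(content).hexdigest(),
        "sha256": hashlib.sha256(content).hexdigest(),
    }


if __name__ == "__main__":
    for key, value in file_metrics(b"int x;\n\nint y;\n").items():
        print(key, value)
```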
Step 2 
Discovering repositories with gitFinder
Create a list of projects online to use as components.
Get basic licensing information from each project.
• Write a text file listing each GitHub user (~7 million)
• For each user, find repositories that are not forks (~10M)
• Split the repositories by language (197)
• For each language's list of repositories, download the code
Slide #17
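The filtering and splitting steps on this slide can be sketched as follows. This is not gitFinder's code: the function name is made up, and the input is assumed to be repository records shaped like GitHub API repository objects (with `fork`, `language` and `full_name` fields). The sample records are invented.

```python
from collections import defaultdict


def group_original_repos(repos):
    """Drop forks, then group the remaining repository names by language."""
    by_language = defaultdict(list)
    for repo in repos:
        if repo.get("fork"):
            continue  # forks are skipped, as on the slide
        language = repo.get("language") or "Unknown"  # language may be missing
        by_language[language].append(repo["full_name"])
    return dict(by_language)


if __name__ == "__main__":
    # Invented sample data for illustration.
    sample = [
        {"full_name": "alice/parser", "fork": False, "language": "Java"},
        {"full_name": "bob/parser", "fork": True, "language": "Java"},
        {"full_name": "carol/notes", "fork": False, "language": None},
    ]
    print(group_original_repos(sample))
```

Each resulting language list would then feed the download step.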
Performance 
Slide #18 
~70k repositories/day
• Single machine (i7, 8 GB RAM, CentOS)
• 9 parallel threads
• Resume/recover supported
• Released in Jun. 2014
https://github.com/triplecheck/gitfinder
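At ~70k repositories/day, the ~10M non-forked repositories mentioned earlier would take roughly 143 days on one machine, which is why resume support matters. The sketch below shows one way to combine 9 worker threads with a resume log. It is an assumption about the general approach, not gitFinder's implementation; `process_all`, `worker` and the log file format are hypothetical.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor


def process_all(repos, worker, done_path, threads=9):
    """Run worker(repo) in parallel, skipping repos already listed in done_path."""
    done = set()
    if os.path.exists(done_path):
        with open(done_path, encoding="utf-8") as fh:
            done = {line.strip() for line in fh if line.strip()}
    pending = [repo for repo in repos if repo not in done]
    lock = threading.Lock()

    def run(repo):
        worker(repo)
        # Record completion only after the work succeeded, one writer at a time.
        with lock:
            with open(done_path, "a", encoding="utf-8") as fh:
                fh.write(repo + "\n")

    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(run, pending))  # consume results so exceptions surface
    return len(pending)
```

Calling `process_all` again with the same log file processes only the repositories that did not finish in the earlier run.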
Output 
Slide #19
Storage? 
https://what-if.xkcd.com/29/ (CC BY-NC 2.5)
Slide #20
Storage 
BigZip: 100+ million files in a single download
Slide #21
• Flat file, zip compression (per entry)
• Fast, simple, portable. Indexed search
https://github.com/triplecheck/big
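The idea of a flat file with per-entry compression and an index can be sketched like this. It is not the actual BigZip on-disk format, which the slides do not describe; the function names are made up, the index is an in-memory dict, and zlib (DEFLATE) stands in for "zip compression".

```python
import io
import zlib


def append_entry(archive, index, name, data):
    """Compress one entry on its own and append it to the flat file."""
    blob = zlib.compress(data)
    archive.seek(0, io.SEEK_END)
    offset = archive.tell()
    archive.write(blob)
    index[name] = (offset, len(blob))  # where the entry starts and how long it is


def read_entry(archive, index, name):
    """Use the index to seek straight to one entry and decompress only that entry."""
    offset, size = index[name]
    archive.seek(offset)
    return zlib.decompress(archive.read(size))


if __name__ == "__main__":
    archive, index = io.BytesIO(), {}
    append_entry(archive, index, "a/README", b"hello\n" * 100)
    append_entry(archive, index, "b/main.c", b"int main(void) { return 0; }\n")
    print(read_entry(archive, index, "b/main.c"))
```

Because every entry is compressed independently, a single file can be read without unpacking the rest of the archive.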
How it looks 
Slide #22
Step 3 
Slide #23 
SPDX search engine
• One-click SPDX creation from open data
• Visualize license and copyright data
• Visit at http://searchcode.com/spdx
Example 
Slide #24 
Using the original URL...
• https://github.com/iuly/europa_kernel/
=>
• https://spdxhub.com/iuly/europa_kernel/
Example 
Slide #25
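The URL rewrite from the example slides amounts to swapping the host and keeping the user/repository path. A minimal sketch, with `to_spdxhub` as a hypothetical helper name:

```python
from urllib.parse import urlsplit, urlunsplit


def to_spdxhub(github_url: str) -> str:
    """Map a GitHub repository URL to the same path on spdxhub.com."""
    parts = urlsplit(github_url)
    if parts.netloc.lower() != "github.com":
        raise ValueError("expected a github.com URL: " + github_url)
    # Keep scheme and path, drop any query string or fragment.
    return urlunsplit((parts.scheme, "spdxhub.com", parts.path, "", ""))


if __name__ == "__main__":
    print(to_spdxhub("https://github.com/iuly/europa_kernel/"))
```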
SPDX-1M 
“Do It Yourself” kit. Generate 1 million SPDX
Slide #26
• https://github.com/triplecheck/diy
• 1.2 million open source projects
• “Arduino” for software license detection
9 GB worth of SPDX? Grab:
http://triplecheck.net/public/storage/spdx.big
Screenshots 
Slide #27
Next step? 
Slide #28 
F2F: pinpointing non-original code
• Decompose code into blocks
• Tokenize/anonymize data
• Find code matches across knowledge base
ETA in Dec. 2014
https://github.com/triplecheck/f2f
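The three steps on this slide can be illustrated with a small token-window fingerprinting sketch. F2F itself was still unreleased at the time of the talk, so this is only a guess at the general technique: the keyword list, the window size of 8 tokens and all function names are assumptions.

```python
import hashlib
import re

# Tiny, hypothetical keyword list; everything else alphabetic becomes "ID".
KEYWORDS = {"if", "else", "for", "while", "return", "int", "void"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def anonymize(code: str):
    """Tokenize code and replace identifiers and numbers with placeholders."""
    tokens = []
    for tok in TOKEN_RE.findall(code):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("ID")
        elif tok[0].isdigit():
            tokens.append("NUM")
        else:
            tokens.append(tok)
    return tokens


def block_fingerprints(code: str, size: int = 8):
    """Hash every window of `size` anonymized tokens (one window if shorter)."""
    tokens = anonymize(code)
    prints = set()
    for i in range(max(len(tokens) - size + 1, 1)):
        window = " ".join(tokens[i:i + size])
        prints.add(hashlib.sha1(window.encode("utf-8")).hexdigest())
    return prints


def overlap(candidate: str, known: str, size: int = 8) -> float:
    """Fraction of the candidate's blocks that also occur in the known code."""
    fa = block_fingerprints(candidate, size)
    fb = block_fingerprints(known, size)
    return len(fa & fb) / len(fa)
```

Because names are anonymized, `int add(int a, int b) { return a + b; }` and the same function with renamed identifiers produce identical fingerprints, so simple renaming does not hide a copy.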
Preview 
Slide #29
Conclusion 
Slide #30 
What is now available to everyone
• Desktop tooling / detection engine
• Extraction of open data at scale
• Search engine for SPDX
Questions? 
Slide #31 
http://spdx.org
http://searchcode.com/spdx
http://github.com/triplecheck
Interesting stuff?
Let us know: @nn81 @boyte #linuxcon
http://xkcd.com/1118/
Backup slides 
Slide #32
Engine 
Slide #33
License DB 
Slide #34
Components 
Slide #35
Exporting 
Slide #36

2014 10-14: GitHub plus FOSS == 1 million SPDX
