Summarizing Software API Usage Examples
Using Clustering Techniques
N. Katirtzis (1, 2), T. Diamantopoulos (3), and C. Sutton (2)
(1) Hotels.com
(2) School of Informatics, University of Edinburgh, Edinburgh, UK
(3) Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, Thessaloniki, Greece
21st International Conference on Fundamental Approaches to
Software Engineering (FASE), 2018
CLustering for Api Mining of Snippets
CLAMS
An approach for mining API usage examples from client code that
lies between snippet and sequence mining methods, which ensures
lower complexity and thus could apply more readily to other
languages.
Sample Snippet
AccessToken accessToken;
String oauthToken;
String oAuthVerifier;
Twitter twitter;
try {
    accessToken = twitter.getOAuthAccessToken(oauthToken, oAuthVerifier);
    // Do something with accessToken
} catch (TwitterException e) {
    e.printStackTrace();
}
Outline
1 Background
Introduction
Problem Statement
Related Work
2 Methodology
The Concept
System Overview
Deploying to New Languages
3 Evaluation
Evaluation Framework
Evaluation Results
Introduction
• Third-party libraries are used heavily throughout the software development life cycle (SDLC).
• The APIs of these libraries often lack proper documentation.
• Creating API usage examples by hand is time-consuming.
Problem Statement
The Problem
Given a corpus of client code, automatically identify a set of
patterns that characterize how an API is typically used (API usage
mining).
Related Work
• Systems that Output API Call Sequences
Twitter.setOAuthConsumer
Twitter.setOAuthAccessToken
• Systems that Output Source Code Snippets
String mConsumerKey;
Twitter mTwitter;
AccessToken mAccessToken;
String mSecretKey;
if (mAccessToken != null) {
    mTwitter.setOAuthConsumer(mConsumerKey, mSecretKey);
    mTwitter.setOAuthAccessToken(mAccessToken);
}
Related Work
Systems that Output API Call Sequences
• MAPO (frequent sequence mining, clustering)
• UP-Miner (clustering)
• PAM (probabilistic modeling)
Disadvantages
• API call sequences do not always describe important
information like method arguments and control flow.
• The output cannot be directly included in one’s code.
Related Work
Systems that Output Source Code Snippets
• eXoaDocs (clustering, program slicing)
• APIMiner (program slicing, association rules)
• Buse and Weimer (path-sensitive data flow analysis,
clustering, pattern abstraction)
Disadvantages
• Rely on detailed semantic analysis and semantic features
which can make them more difficult to deploy to new
languages.
• Limited source code summarization capabilities.
The Concept
1 Cluster a large set of usage examples based on their API calls.
2 Generate summarized versions for the top snippets of each
cluster.
3 Select the most representative snippet from each cluster,
using a tree edit distance metric on the ASTs.
4 Rank the snippets in descending order of support.
System Overview
Input ⇒ A set of client files and the API of the library
Output ⇒ A set of source code snippets
Figure: CLAMS architecture diagram.
Components: Preprocessing Module (API Call Extractor, AST Extractor), Clustering Module (Clustering Preprocessor, Clustering Engine, Clustering Postprocessor), Snippet Generator, Snippet Selector, Ranker.
Labels on the data flow: Client Files, API, ARFF, API call sequences, distance matrix, clustered sequences, most representative sequences, ASTs, summarized snippets, most representative snippets, ranked snippets.
Preprocessing Module
API Call Extractor
String mConsumerKey;
Twitter mTwitter;
AccessToken mAccessToken;
String mSecretKey;
if (mAccessToken != null) {
    mTwitter.setOAuthConsumer(mConsumerKey, mSecretKey);
    mTwitter.setOAuthAccessToken(mAccessToken);
}
⇒
Twitter.setOAuthConsumer
Twitter.setOAuthAccessToken
AST Extractor
String mConsumerKey;
Twitter mTwitter;
AccessToken mAccessToken;
String mSecretKey;
if (mAccessToken != null) {
    mTwitter.setOAuthConsumer(mConsumerKey, mSecretKey);
    mTwitter.setOAuthAccessToken(mAccessToken);
}
⇒
<unit ... language="Java" filename="test.java">
  <decl_stmt><decl><type><name>String</name></type><name>mConsumerKey</name></decl>;</decl_stmt>...
  <if>if<condition>...</condition>
    <then>
      <block>{
        ...
      }</block>
    </then>
  </if>
</unit>
Clustering Module
Clustering Preprocessor
• We cluster at the sequence level. The two snippets below differ in structure but yield the same API call sequence:
editor.putString("", tkn.getToken());
editor.putString("", tkn.getTokenSecret());
(a)
if (token != null) {
    editor.putString("", token.getToken());
    editor.putString("", token.getTokenSecret());
}
(b)
• Clustering uses a distance matrix based on the Longest Common Subsequence (LCS) between any two API call sequences:
\[ \mathrm{LCS\_dist}(S_1, S_2) = 1 - 2 \cdot \frac{|\mathrm{LCS}(S_1, S_2)|}{|S_1| + |S_2|} \tag{1} \]
where |S1| and |S2| are the lengths of S1 and S2, and
|LCS (S1, S2)| is the length of their LCS.
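Equation (1) can be sketched in a few lines of Java. This is only an illustration, not the CLAMS implementation: the class and method names (`LcsDistance`, `lcsLength`, `distance`) are hypothetical, the LCS length uses the textbook dynamic-programming recurrence, and returning 0 for two empty sequences is an assumption the slides do not cover.

```java
import java.util.List;

/** Hypothetical sketch of the LCS-based distance of Equation (1). */
final class LcsDistance {

    /** Length of the longest common subsequence (standard dynamic programming). */
    static int lcsLength(List<String> s1, List<String> s2) {
        int[][] dp = new int[s1.size() + 1][s2.size() + 1];
        for (int i = 1; i <= s1.size(); i++) {
            for (int j = 1; j <= s2.size(); j++) {
                if (s1.get(i - 1).equals(s2.get(j - 1))) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[s1.size()][s2.size()];
    }

    /** LCS_dist(S1, S2) = 1 - 2 * |LCS(S1, S2)| / (|S1| + |S2|). */
    static double distance(List<String> s1, List<String> s2) {
        int total = s1.size() + s2.size();
        if (total == 0) {
            return 0.0; // assumption: two empty sequences count as identical
        }
        return 1.0 - 2.0 * lcsLength(s1, s2) / total;
    }

    public static void main(String[] args) {
        List<String> a = List.of("Twitter.setOAuthConsumer", "Twitter.setOAuthAccessToken");
        List<String> b = List.of("Twitter.setOAuthConsumer", "Twitter.getOAuthAccessToken",
                "Twitter.setOAuthAccessToken");
        // |LCS| = 2 and |S1| + |S2| = 5, so the distance is 1 - 4/5.
        System.out.println(distance(a, b));
    }
}
```

Identical sequences get distance 0 and sequences with no common call get distance 1, so the values can be placed directly in the distance matrix.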
Clustering Module
Clustering Engine/Postprocessor
• We explore two clustering algorithms:
1 k-medoids by Bauckhage.
2 HDBSCAN by McInnes et al.
• We then select multiple snippets for each cluster. This retains
source code structure information, which is useful later when
selecting a single representative snippet.
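To make the k-medoids option concrete, here is a minimal sketch of the general alternating scheme (assign each item to its nearest medoid, then re-pick each cluster's medoid) on a precomputed distance matrix. It is not Bauckhage's implementation and not the CLAMS code: the class name `KMedoidsSketch`, the caller-supplied initial medoids, and the iteration cap are all assumptions made for illustration.

```java
import java.util.Arrays;

/** Hypothetical k-medoids sketch over a precomputed distance matrix. */
final class KMedoidsSketch {

    /**
     * Returns a cluster index (0..k-1) for each of the n items.
     * Initial medoids come from the caller (an assumption, for determinism).
     */
    static int[] cluster(double[][] dist, int[] initialMedoids, int maxIterations) {
        int n = dist.length;
        int k = initialMedoids.length;
        int[] medoids = initialMedoids.clone();
        int[] labels = new int[n];
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each item to its closest medoid.
            for (int i = 0; i < n; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (dist[i][medoids[c]] < dist[i][medoids[best]]) {
                        best = c;
                    }
                }
                labels[i] = best;
            }
            // Update step: in each cluster, pick the member with the
            // smallest total distance to the other members.
            int[] updated = medoids.clone();
            for (int c = 0; c < k; c++) {
                double bestCost = Double.POSITIVE_INFINITY;
                for (int candidate = 0; candidate < n; candidate++) {
                    if (labels[candidate] != c) {
                        continue;
                    }
                    double cost = 0.0;
                    for (int other = 0; other < n; other++) {
                        if (labels[other] == c) {
                            cost += dist[candidate][other];
                        }
                    }
                    if (cost < bestCost) {
                        bestCost = cost;
                        updated[c] = candidate;
                    }
                }
            }
            if (Arrays.equals(updated, medoids)) {
                break; // medoids are stable, so the labels are final
            }
            medoids = updated;
        }
        return labels;
    }

    public static void main(String[] args) {
        // Made-up distances: items 0 and 1 are close, items 2 and 3 are close.
        double[][] dist = {
            {0.0, 0.1, 0.9, 0.9},
            {0.1, 0.0, 0.9, 0.9},
            {0.9, 0.9, 0.0, 0.1},
            {0.9, 0.9, 0.1, 0.0},
        };
        System.out.println(Arrays.toString(cluster(dist, new int[] {0, 2}, 100)));
    }
}
```

Working from a distance matrix rather than coordinates is what lets the LCS-based distance be plugged in unchanged.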
Snippet Generator - Summarizer
Summarizer Input
if (t.getCreatedAt().getTime() + 1000 < mTime) {
    breakPaging = 'y';
    //TODO
} else {
    userName = t.getFromUser().toLowerCase();
    JUser user = userMap.get(userName);
    if (user == null) {
        user = new JUser(userName).init(t);
        userMap.put(userName, user);
    }
}
Step 1: Preprocess comments and literals
if (t.getCreatedAt().getTime() + number < mTime) {
    breakPaging = char;
} else {
    userName = t.getFromUser().toLowerCase();
    JUser user = userMap.get(userName);
    if (user == null) {
        user = new JUser(userName).init(t);
        userMap.put(userName, user);
    }
}
Step 2: Identify API statements
if (t.getCreatedAt().getTime() + number < mTime) {
    breakPaging = char;
} else {
    userName = t.getFromUser().toLowerCase();
    JUser user = userMap.get(userName);
    if (user == null) {
        user = new JUser(userName).init(t);
        userMap.put(userName, user);
    }
}
Step 3: Retrieve local scope variables
if (t.getCreatedAt().getTime() + number < mTime) {
    breakPaging = char;
} else {
    userName = t.getFromUser().toLowerCase();
    JUser user = userMap.get(userName);
    if (user == null) {
        user = new JUser(userName).init(t);
        userMap.put(userName, user);
    }
}
Step 4: Remove non-API statements
if (t.getCreatedAt().getTime() + number < mTime) {
} else {
    userName = t.getFromUser().toLowerCase();
}
Step 5: Filtering variables
if (t.getCreatedAt().getTime() + number < mTime) {
} else {
    userName = t.getFromUser().toLowerCase();
}
Step 6: Add declaration statements and comments
long mTime;
Tweet t;
String userName;
if (t.getCreatedAt().getTime() + number < mTime) {
    // Do something
} else {
    userName = t.getFromUser().toLowerCase();
    // Do something with userName
}
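Step 1 (preprocessing comments and literals) could be approximated with regular expressions, as in the sketch below. This is an assumption-laden illustration rather than the summarizer itself, which the slides describe as working on statements and control flow: the class `LiteralNormalizer` is hypothetical, and the patterns are deliberately naive (for example, a `//` inside a string literal would be mistaken for a comment, and hexadecimal literals are not handled).

```java
/** Hypothetical, regex-based approximation of summarizer Step 1. */
final class LiteralNormalizer {

    /** Drops comments and replaces char and numeric literals with placeholders. */
    static String normalize(String code) {
        String s = code.replaceAll("(?s)/\\*.*?\\*/", "");    // block comments
        s = s.replaceAll("//[^\\n]*", "");                    // line comments
        s = s.replaceAll("'(\\\\.|[^'\\\\])'", "char");       // char literals, e.g. 'y'
        s = s.replaceAll("\\b\\d+(\\.\\d+)?[fFdDlL]?\\b", "number"); // numeric literals
        s = s.replaceAll("(?m)^[ \\t]*\\r?\\n", "");          // lines left empty
        return s;
    }

    public static void main(String[] args) {
        String input = String.join("\n",
                "if (t.getCreatedAt().getTime() + 1000 < mTime) {",
                "    breakPaging = 'y';",
                "    //TODO",
                "}");
        System.out.println(normalize(input));
    }
}
```

Char literals are replaced before numbers so that a literal such as `'1'` becomes `char` rather than `'number'`.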
Snippet Selector
Goal
Select the most representative snippet of each cluster.
Concept
• Create a matrix for each cluster, which contains the distance
between any two top snippets of the cluster.
• Use the APTED algorithm to compute the tree edit distance
between any two snippets.
• Select the snippet with the minimum sum of distances in each
cluster’s matrix.
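The selection rule in the last bullet amounts to picking the row with the smallest sum. The sketch below assumes the pairwise tree edit distances have already been computed (for instance with APTED) and stored in a symmetric matrix; the class name `SnippetSelectorSketch` and the sample distances are hypothetical, and breaking ties by lowest index is an assumption.

```java
/** Hypothetical sketch: choose the snippet with the minimum sum of distances. */
final class SnippetSelectorSketch {

    /** Returns the index of the row with the smallest sum (ties: lowest index). */
    static int selectRepresentative(double[][] dist) {
        if (dist.length == 0) {
            throw new IllegalArgumentException("cluster has no snippets");
        }
        int best = 0;
        double bestSum = Double.POSITIVE_INFINITY;
        for (int i = 0; i < dist.length; i++) {
            double sum = 0.0;
            for (double d : dist[i]) {
                sum += d;
            }
            if (sum < bestSum) {
                bestSum = sum;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up tree edit distances between three top snippets of one cluster.
        double[][] dist = {
            {0, 2, 6},
            {2, 0, 3},
            {6, 3, 0},
        };
        // Row sums are 8, 5 and 9, so snippet 1 is selected.
        System.out.println(selectRepresentative(dist));
    }
}
```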
Ranker
Goal
Rank the snippets in descending order of support.
Concept
A client file supports a snippet if the API call sequence of the
snippet is a subsequence of the sequence of that file.
Example
The snippet with API call sequence:
[twitter4j.Status.getUser, twitter4j.Status.getText]
is supported by a client file with sequence:
[twitter4j.Paging.<init>, twitter4j.Status.getUser,
twitter4j.Status.getId, twitter4j.Status.getText,
twitter4j.Status.getUser].
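The support definition above maps onto a simple subsequence test. The following sketch counts how many client files support each snippet and sorts the snippets in descending order of that count; the names `RankerSketch`, `isSubsequence`, `support`, and `rank` are hypothetical, and representing snippets and files as lists of call names is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of support-based ranking. */
final class RankerSketch {

    /** True if every call of the snippet appears in the file, in the same order. */
    static boolean isSubsequence(List<String> snippet, List<String> file) {
        int matched = 0;
        for (String call : file) {
            if (matched < snippet.size() && snippet.get(matched).equals(call)) {
                matched++;
            }
        }
        return matched == snippet.size();
    }

    /** Number of client files that support the snippet. */
    static int support(List<String> snippet, List<List<String>> clientFiles) {
        int count = 0;
        for (List<String> file : clientFiles) {
            if (isSubsequence(snippet, file)) {
                count++;
            }
        }
        return count;
    }

    /** Returns a copy of the snippets sorted by descending support. */
    static List<List<String>> rank(List<List<String>> snippets, List<List<String>> clientFiles) {
        List<List<String>> ranked = new ArrayList<>(snippets);
        ranked.sort(Comparator.comparingInt(
                (List<String> s) -> support(s, clientFiles)).reversed());
        return ranked;
    }

    public static void main(String[] args) {
        List<String> snippet = List.of("twitter4j.Status.getUser", "twitter4j.Status.getText");
        List<String> file = List.of("twitter4j.Paging.<init>", "twitter4j.Status.getUser",
                "twitter4j.Status.getId", "twitter4j.Status.getText", "twitter4j.Status.getUser");
        System.out.println(isSubsequence(snippet, file)); // the example from the slide
    }
}
```

The calls need not be contiguous in the client file, only in the same relative order, which is why the slide's example counts as supported.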
Deploying to New Languages
Our methodology can be easily deployed to additional
programming languages.
Table: Main concept on which each CLAMS module is based.
Module                 Main Concept
Preprocessing Module   AST
Clustering Module      API Call Sequence
Snippet Generator      Statement/Control Flow
Snippet Selector       AST
Ranker                 API Call Sequence
Evaluation Framework
Dataset
Table: Summary of the evaluation dataset.
Project             Package Name        Client LOC   Example LOC
Apache Camel        org.apache.camel       141,454        15,256
Drools              org.drools             187,809        15,390
Restlet Framework   org.restlet            208,395        41,078
Twitter4j           twitter4j               96,020         6,560
Project Wonder      com.webobjects         375,064        37,181
Apache Wicket       org.apache.wicket      564,418        33,025
Evaluation Framework
Research Questions
RQ1: How much more concise, readable, and precise with respect to handwritten examples are the snippets after summarization?
RQ2: Do more powerful clustering techniques, that cluster similar rather than identical sequences, lead to snippets that more closely match handwritten examples?
RQ3: Does our tool mine more diverse patterns than other existing approaches?
RQ4: Do snippets match handwritten examples more than API call sequences?
Figure: Research Questions (RQs) to be evaluated.
Evaluation Results - RQ1
RQ1: How much more concise, readable, and precise with
respect to handwritten examples are the snippets after
summarization?
Figure: (a) average readability, and (b) average PLOCs of the snippets, for each library, with (NaiveSum) and without (NaiveNoSum) summarization.
Axes: library (Apache Camel, Drools, Restlet Framework, Twitter4j, Project Wonder, Apache Wicket) versus average readability in (a) and average physical lines in (b).
Evaluation Results - RQ1
RQ1: How much more concise, readable, and precise with
respect to handwritten examples are the snippets after
summarization?
Figure: Precision at top k, with (NaiveSum) or without (NaiveNoSum) summarization using the top 50 mined snippets.
Axes: top k versus precision.
Evaluation Results - RQ2
RQ2: Do more powerful clustering techniques, that cluster
similar rather than identical sequences, lead to snippets that
more closely match handwritten examples?
Figure: Average interpolated snippet precision versus API coverage for three clustering algorithms, using the top 100 mined snippets.
Axes: number of API methods covered versus snippet precision. Series: NaiveSum, HDBSCANSum, KMedoidsSum. Annotated points: A1, A2, A3.
Evaluation Results - RQ3
RQ3: Does our tool mine more diverse patterns than other
existing approaches?
Figure: Coverage in API methods achieved by CLAMS, MAPO, and UP-Miner on average, at top k, using the top 100 examples.
Axes: top k versus API methods covered.
Evaluation Results - RQ4
RQ4: Do source code snippets match handwritten examples
more than API call sequences?
Figure: Additional information revealed when mining snippets instead of sequences.
Axes: library (Apache Camel, Drools, Restlet Framework, Twitter4j, Project Wonder, Apache Wicket) versus average number of common tokens. Series: sequence-tokens, additional snippet-tokens.
Evaluation Results - RQ4
RQ4: Do source code snippets match handwritten examples
more than API call sequences?
AccessToken accessToken;
String oauthToken;
String oAuthVerifier;
Twitter twitter;
try {
    accessToken = twitter.getOAuthAccessToken(oauthToken, oAuthVerifier);
    // Do something with accessToken
} catch (TwitterException e) {
    e.printStackTrace();
}
Figure: Example snippet matched to handwritten example.
Sequence-tokens are encircled and additional snippet-tokens are
highlighted in bold.
Applying CLAMS to the industry
• We conducted a pilot user survey with a team of Java developers
at Hotels.com and received encouraging feedback:
Developer 1: The system generates clear and concise
snippets which would be easy to follow and useful when
using an API you are unfamiliar with.
Developer 2: The system is great and I think it would be
very useful particularly in discovering clients of our APIs!
Applying CLAMS to the industry
• We applied CLAMS at Hotels.com and developed an internal
method-based search engine which can be used for API
documentation purposes.
Resources
CLAMS Website
https://mast-group.github.io/clams/
CLAMS User Survey at Hotels.com
https://mast-group.github.io/clams/user-survey/
CLAMS Source Code
https://github.com/mast-group/clams

More Related Content

What's hot (20)

PDF
Applications of Machine Learning and Metaheuristic Search to Security Testing
Lionel Briand
 
PPTX
Final training course
Noor Dhiya
 
PPT
Mutual Exclusion Election (Distributed computing)
Sri Prasanna
 
PDF
Search-driven String Constraint Solving for Vulnerability Detection
Lionel Briand
 
PPTX
Isola 12 presentation
Iakovos Ouranos
 
PDF
Open Problems in Automatically Refactoring Legacy Java Software to use New Fe...
Raffi Khatchadourian
 
PPTX
Heterogeneous Defect Prediction (

ESEC/FSE 2015)
Sung Kim
 
DOC
ECET 375 Invent Yourself/newtonhelp.com
lechenau125
 
PDF
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
IRJET Journal
 
PDF
Model checking
Richard Ashworth
 
PPT
Inside LoLA - Experiences from building a state space tool for place transiti...
Universität Rostock
 
PDF
Analyzing Changes in Software Systems From ChangeDistiller to FMDiff
Martin Pinzger
 
PDF
Producer consumer-problems
Richard Ashworth
 
PDF
A Search-based Testing Approach for XML Injection Vulnerabilities in Web Appl...
Lionel Briand
 
PDF
Chapter 11d coordination agreement
AbDul ThaYyal
 
PPTX
A survey of distributed deadlock detection algorithms
anaykh1992
 
PDF
Python for Machine Learning
Student
 
PDF
Automated Repair of Feature Interaction Failures in Automated Driving Systems
Lionel Briand
 
PPT
Thesis F. Redaelli UIC Slides EN
Marco Santambrogio
 
PDF
Generator of pseudorandom sequences
Venkata Sai Kalyan Routhu
 
Applications of Machine Learning and Metaheuristic Search to Security Testing
Lionel Briand
 
Final training course
Noor Dhiya
 
Mutual Exclusion Election (Distributed computing)
Sri Prasanna
 
Search-driven String Constraint Solving for Vulnerability Detection
Lionel Briand
 
Isola 12 presentation
Iakovos Ouranos
 
Open Problems in Automatically Refactoring Legacy Java Software to use New Fe...
Raffi Khatchadourian
 
Heterogeneous Defect Prediction (

ESEC/FSE 2015)
Sung Kim
 
ECET 375 Invent Yourself/newtonhelp.com
lechenau125
 
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
IRJET Journal
 
Model checking
Richard Ashworth
 
Inside LoLA - Experiences from building a state space tool for place transiti...
Universität Rostock
 
Analyzing Changes in Software Systems From ChangeDistiller to FMDiff
Martin Pinzger
 
Producer consumer-problems
Richard Ashworth
 
A Search-based Testing Approach for XML Injection Vulnerabilities in Web Appl...
Lionel Briand
 
Chapter 11d coordination agreement
AbDul ThaYyal
 
A survey of distributed deadlock detection algorithms
anaykh1992
 
Python for Machine Learning
Student
 
Automated Repair of Feature Interaction Failures in Automated Driving Systems
Lionel Briand
 
Thesis F. Redaelli UIC Slides EN
Marco Santambrogio
 
Generator of pseudorandom sequences
Venkata Sai Kalyan Routhu
 

Similar to Summarizing Software API Usage Examples Using Clustering Techniques (20)

PPTX
Architectures, Frameworks and Infrastructure
harendra_pathak
 
PDF
Construction Techniques For Domain Specific Languages
ThoughtWorks
 
PDF
PRIME OOPSLA12 paper
Eran Yahav
 
PPTX
Mining Code Examples with Descriptive Text from Software Artifacts
Preetha Chatterjee
 
PDF
Spec-first API Design for Speed and Safety
Atlassian
 
PPTX
Spring Test Framework
GlobalLogic Ukraine
 
PPTX
A Conceptual Dependency Graph Based Keyword Extraction Model for Source Code...
Nakul Sharma
 
PPT
Slides
Videoguy
 
PPTX
API Documentation Workshop tcworld India 2015
Tom Johnson
 
PPTX
Towards a Generic Cloud-based Modeling Environment
ljuracz
 
PDF
API Docs Made Right / RAML - Swagger rant
Vladimir Shulyak
 
PDF
Enabling White-Box Reuse in a Pure Composition Language
elliando dias
 
PDF
Tooling for the JavaScript Era
martinlippert
 
PPTX
From Pilot to Product - Morning@Lohika
Ivan Verhun
 
PDF
Java Programming
Tracy Clark
 
PDF
INTERFACE, by apidays - Building an Accessible API Spec
apidays
 
PDF
The Final Frontier
jClarity
 
PDF
Training Semester Report, Api Types of Apps
RamanTayal4
 
PDF
Generating docs from APIs
jamiehannaford
 
Architectures, Frameworks and Infrastructure
harendra_pathak
 
Construction Techniques For Domain Specific Languages
ThoughtWorks
 
PRIME OOPSLA12 paper
Eran Yahav
 
Mining Code Examples with Descriptive Text from Software Artifacts
Preetha Chatterjee
 
Spec-first API Design for Speed and Safety
Atlassian
 
Spring Test Framework
GlobalLogic Ukraine
 
A Conceptual Dependency Graph Based Keyword Extraction Model for Source Code...
Nakul Sharma
 
Slides
Videoguy
 
API Documentation Workshop tcworld India 2015
Tom Johnson
 
Towards a Generic Cloud-based Modeling Environment
ljuracz
 
API Docs Made Right / RAML - Swagger rant
Vladimir Shulyak
 
Enabling White-Box Reuse in a Pure Composition Language
elliando dias
 
Tooling for the JavaScript Era
martinlippert
 
From Pilot to Product - Morning@Lohika
Ivan Verhun
 
Java Programming
Tracy Clark
 
INTERFACE, by apidays - Building an Accessible API Spec
apidays
 
The Final Frontier
jClarity
 
Training Semester Report, Api Types of Apps
RamanTayal4
 
Generating docs from APIs
jamiehannaford
 
Ad

Recently uploaded (20)

PPTX
Transforming Mining & Engineering Operations with Odoo ERP | Streamline Proje...
SatishKumar2651
 
PPTX
Tally_Basic_Operations_Presentation.pptx
AditiBansal54083
 
PDF
Open Chain Q2 Steering Committee Meeting - 2025-06-25
Shane Coughlan
 
PPTX
Why Businesses Are Switching to Open Source Alternatives to Crystal Reports.pptx
Varsha Nayak
 
PDF
vMix Pro 28.0.0.42 Download vMix Registration key Bundle
kulindacore
 
PPTX
Change Common Properties in IBM SPSS Statistics Version 31.pptx
Version 1 Analytics
 
PPTX
Empowering Asian Contributions: The Rise of Regional User Groups in Open Sour...
Shane Coughlan
 
PDF
IDM Crack with Internet Download Manager 6.42 Build 43 with Patch Latest 2025
bashirkhan333g
 
PDF
iTop VPN With Crack Lifetime Activation Key-CODE
utfefguu
 
PDF
[Solution] Why Choose the VeryPDF DRM Protector Custom-Built Solution for You...
Lingwen1998
 
PDF
Unlock Efficiency with Insurance Policy Administration Systems
Insurance Tech Services
 
PDF
The 5 Reasons for IT Maintenance - Arna Softech
Arna Softech
 
PDF
TheFutureIsDynamic-BoxLang witch Luis Majano.pdf
Ortus Solutions, Corp
 
PPTX
Hardware(Central Processing Unit ) CU and ALU
RizwanaKalsoom2
 
PDF
4K Video Downloader Plus Pro Crack for MacOS New Download 2025
bashirkhan333g
 
PDF
Download Canva Pro 2025 PC Crack Full Latest Version
bashirkhan333g
 
PPTX
ChiSquare Procedure in IBM SPSS Statistics Version 31.pptx
Version 1 Analytics
 
PPTX
Agentic Automation Journey Session 1/5: Context Grounding and Autopilot for E...
klpathrudu
 
PPTX
Homogeneity of Variance Test Options IBM SPSS Statistics Version 31.pptx
Version 1 Analytics
 
PPTX
AEM User Group: India Chapter Kickoff Meeting
jennaf3
 
Transforming Mining & Engineering Operations with Odoo ERP | Streamline Proje...
SatishKumar2651
 
Tally_Basic_Operations_Presentation.pptx
AditiBansal54083
 
Open Chain Q2 Steering Committee Meeting - 2025-06-25
Shane Coughlan
 
Why Businesses Are Switching to Open Source Alternatives to Crystal Reports.pptx
Varsha Nayak
 
vMix Pro 28.0.0.42 Download vMix Registration key Bundle
kulindacore
 
Change Common Properties in IBM SPSS Statistics Version 31.pptx
Version 1 Analytics
 
Empowering Asian Contributions: The Rise of Regional User Groups in Open Sour...
Shane Coughlan
 
IDM Crack with Internet Download Manager 6.42 Build 43 with Patch Latest 2025
bashirkhan333g
 
iTop VPN With Crack Lifetime Activation Key-CODE
utfefguu
 
[Solution] Why Choose the VeryPDF DRM Protector Custom-Built Solution for You...
Lingwen1998
 
Unlock Efficiency with Insurance Policy Administration Systems
Insurance Tech Services
 
The 5 Reasons for IT Maintenance - Arna Softech
Arna Softech
 
TheFutureIsDynamic-BoxLang witch Luis Majano.pdf
Ortus Solutions, Corp
 
Hardware(Central Processing Unit ) CU and ALU
RizwanaKalsoom2
 
4K Video Downloader Plus Pro Crack for MacOS New Download 2025
bashirkhan333g
 
Download Canva Pro 2025 PC Crack Full Latest Version
bashirkhan333g
 
ChiSquare Procedure in IBM SPSS Statistics Version 31.pptx
Version 1 Analytics
 
Agentic Automation Journey Session 1/5: Context Grounding and Autopilot for E...
klpathrudu
 
Homogeneity of Variance Test Options IBM SPSS Statistics Version 31.pptx
Version 1 Analytics
 
AEM User Group: India Chapter Kickoff Meeting
jennaf3
 
Ad

Summarizing Software API Usage Examples Using Clustering Techniques

  • 1. Summarizing Software API Usage Examples Using Clustering Techniques N. Katirtzis1,2 T. Diamantopoulos3 and C. Sutton2 1Hotels.com 2School of Informatics University of Edinburgh, Edinburgh, UK 3Electrical and Computer Engineering Department Aristotle University of Thessaloniki, Thessaloniki, Greece 21st International Conference on Fundamental Approaches to Software Engineering (FASE), 2018
  • 2. CLustering for Api Mining of Snippets CLAMS An approach for mining API usage examples from client code that lies between snippet and sequence mining methods, which ensures lower complexity and thus could apply more readily to other languages. Sample Snippet AccessToken accessToken; String oauthToken; String oAuthVerifier; Twitter twitter; try { accessToken = twitter.getOAuthAccessToken(oauthToken,oAuthVerifier); // Do something with accessToken } catch (TwitterException e) { e.printStackTrace(); }
  • 3. Outline 1 Background Introduction Problem Statement Related Work 2 Methodology The Concept System Overview Deploying to New Languages 3 Evaluation Evaluation Framework Evaluation Results
  • 4. Outline 1 Background Introduction Problem Statement Related Work 2 Methodology The Concept System Overview Deploying to New Languages 3 Evaluation Evaluation Framework Evaluation Results
  • 5. Introduction • Third-party libraries are used heavily during SDLC. • Lack of proper documentation for the APIs of the libraries. • Creating API usage examples is time-consuming.
  • 6. Outline 1 Background Introduction Problem Statement Related Work 2 Methodology The Concept System Overview Deploying to New Languages 3 Evaluation Evaluation Framework Evaluation Results
  • 7. Problem Statement The Problem Automatically identify a set of patterns that characterize how an API is typically used from a corpus of client code (API usage mining).
  • 8. Outline 1 Background Introduction Problem Statement Related Work 2 Methodology The Concept System Overview Deploying to New Languages 3 Evaluation Evaluation Framework Evaluation Results
  • 9. Related Work • Systems that Output API Call Sequences Twitter.setOAuthConsumer Twitter.setOAuthAccessToken • Systems that Output Source Code Snippets String mConsumerKey; Twitter mTwitter; AccessToken mAccessToken; String mSecretKey; if (mAccessToken != null) { mTwitter.setOAuthConsumer(mConsumerKey, mSecretKey); mTwitter.setOAuthAccessToken(mAccessToken); }
  • 10. Related Work Systems that Output API Call Sequences • MAPO (frequent sequence mining, clustering) • UP-Miner (clustering) • PAM (probabilistic modeling) Disadvantages • API call sequences do not always describe important information like method arguments and control flow. • The output cannot be directly included in one’s code.
  • 11. Related Work Systems that Output Source Code Snippets • eXoaDocs (clustering, program slicing) • APIMiner (program slicing, association rules) • Buse and Weimer (path-sensitive data flow analysis, clustering, pattern abstraction) Disadvantages • Rely on detailed semantic analysis and semantic features which can make them more difficult to deploy to new languages. • Limited source code summarization capabilities.
  • 12. Outline 1 Background Introduction Problem Statement Related Work 2 Methodology The Concept System Overview Deploying to New Languages 3 Evaluation Evaluation Framework Evaluation Results
  • 13. The Concept 1 Cluster a large set of usage examples based on their API calls. 2 Generate summarized versions for the top snippets of each cluster. 3 Select the most representative snippet from each cluster, using a tree edit distance metric on the ASTs. 4 Rank the snippets in descending order of support.
  • 14. Outline 1 Background Introduction Problem Statement Related Work 2 Methodology The Concept System Overview Deploying to New Languages 3 Evaluation Evaluation Framework Evaluation Results
  • 15. System Overview JAVA JAVA JAVA JAVA JAVA JAVA Clustering Preprocessor AST Extractor Snippet Generator Snippet Selector Ranker JAVA JAVA JAVA JAVA JAVA JAVA API Call Extractor ARFF JAVA JAVA JAVA JAVA JAVA JAVA Client Files Snippets Clustering Engine Clustering Postprocessor Clustering ModulePreprocessing Module API API call sequences distance matrix clustered sequences most representative sequences ASTs summarized snippets most representative snippets ranked snippets Input ⇒ A set of Client Files API of the library Output ⇒ A set of source code snippets
  • 16. Preprocessing Module API Call Extractor String mConsumerKey; Twitter mTwitter; AccessToken mAccessToken; String mSecretKey; if (mAccessToken != null) { mTwitter.setOAuthConsumer(mConsumerKey, mSecretKey); mTwitter.setOAuthAccessToken(mAccessToken); } ⇒ Twitter.setOAuthConsumer Twitter.setOAuthAccessToken AST Extractor String mConsumerKey; Twitter mTwitter; AccessToken mAccessToken; String mSecretKey; if (mAccessToken != null) { mTwitter.setOAuthConsumer(mConsumerKey, mSecretKey); mTwitter.setOAuthAccessToken(mAccessToken); } ⇒ <unit ... language="Java" filename="test.java"> <decl_stmt><decl><type><name>String</name></ type><name>mConsumerKey</name></decl>; </decl_stmt>... <if>if<condition>...</condition> <then> <block>{ ... }</block> </then> </if> </unit>
  • 17. Clustering Module Clustering Preprocessor • We cluster at sequence level: editor.putString("", tkn.getToken()); editor.putString("", tkn.getTokenSecret()); (a) if (token != null) { editor.putString("", token.getToken()); editor.putString("", token.getTokenSecret()); } (b) • Using the distance matrix which is based on the Longest Common Subsequence (LCS) between any two API call sequences: LCS dist (S1, S2) = 1 − 2 · |LCS (S1, S2)| |S1| + |S2| (1) where |S1| and |S2| are the lengths of S1 and S2, and |LCS (S1, S2)| is the length of their LCS.
  • 18. Clustering Module Clustering Engine/Postprocessor • Explores two different clustering algorithms: 1 k-medoids by Bauckhage. 2 HDBSCAN by McInnes et al. • We then select multiple snippets for each cluster, this way retaining source code structure information, which shall be useful for selecting a single snippet.
  • 19. Snippet Generator - Summarizer if (t.getCreatedAt().getTime() + number < mTime) { breakPaging = char; } else { userName = t.getFromUser().toLowerCase(); JUser user = userMap.get(userName); if (user == null) { user = new JUser(userName).init(t); userMap.put(userName, user); } } Step 1: Preprocess comments and literals if (t.getCreatedAt().getTime() + number < mTime) { breakPaging = char; } else { userName = t.getFromUser().toLowerCase(); JUser user = userMap.get(userName); if (user == null) { user = new JUser(userName).init(t); userMap.put(userName, user); } } Step 3: Retrieve local scope variables if (t.getCreatedAt().getTime() + number < mTime) { } else { userName = t.getFromUser().toLowerCase(); } long mTime; Tweet t; String userName; if (t.getCreatedAt().getTime() + number < mTime) { // Do something } else { userName = t.getFromUser().toLowerCase(); // Do something with userName } Step 6: Add declaration statements and comments Summarizer Input if (t.getCreatedAt().getTime() + 1000 < mTime) { breakPaging = 'y'; //TODO } else { userName = t.getFromUser().toLowerCase(); JUser user = userMap.get(userName); if (user == null) { user = new JUser(userName).init(t); userMap.put(userName, user); } } Step 2: Identify API statements if (t.getCreatedAt().getTime() + number < mTime) { breakPaging = char; } else { userName = t.getFromUser().toLowerCase(); JUser user = userMap.get(userName); if (user == null) { user = new JUser(userName).init(t); userMap.put(userName, user); } } Step 4: Remove non-API statements if (t.getCreatedAt().getTime() + number < mTime) { } else { userName = t.getFromUser().toLowerCase(); } Step 5: Filtering variables
  • 20. Snippet Selector Goal Select the most representative snippet of each cluster. Concept • Create a matrix for each cluster, which contains the distance between any two top snippets of the cluster. • Use the APTED algorithm to compute the tree edit distance between any two snippets. • Select the snippet with the minimum sum of distances in each cluster’s matrix.
  • 21. Ranker Goal Rank the snippets in descending order of support. Concept A client file supports a snippet if the API call sequence of the snippet is a subsequence of the sequence of that file. Example The snippet with API call sequence: [twitter4j.Status.getUser, twitter4j.Status.getText] is supported by a client file with sequence: [twitter4j.Paging.<init>, twitter4j.Status.getUser, twitter4j.Status.getId, twitter4j.Status.getText, twitter4j. Status.getUser].
  • 22. Outline 1 Background Introduction Problem Statement Related Work 2 Methodology The Concept System Overview Deploying to New Languages 3 Evaluation Evaluation Framework Evaluation Results
  • 23. Deploying to New Languages Our methodology can be easily deployed to additional programming languages. Table: Concept on which the CLAMS modules are based on. Module Main Concept Preprocessing Module AST Clustering Module API Call Sequence Snippet Generator Statement/Control Flow Snippet Selector AST Ranker API Call Sequence
  • 24. Outline 1 Background Introduction Problem Statement Related Work 2 Methodology The Concept System Overview Deploying to New Languages 3 Evaluation Evaluation Framework Evaluation Results
  • 25. Evaluation Framework Dataset Table: Summary of the evaluation dataset. Project Package Name Client LOC Example LOC Apache Camel org.apache.camel 141,454 15,256 Drools org.drools 187,809 15,390 Restlet Framework org.restlet 208,395 41,078 Twitter4j twitter4j 96,020 6,560 Project Wonder com.webobjects 375,064 37,181 Apache Wicket org.apache.wicket 564,418 33,025
  • 26. Evaluation Framework Research Questions RQ1: How much more concise, readable, and precise with re- spect to handwritten examples are the snippets after summarization? RQ2: Do more powerful clustering techniques, that cluster similar rather than identical sequences, lead to snippets that more closely match handwritten examples? RQ3: Does our tool mine more diverse patterns than other existing approaches? RQ4: Do snippets match handwritten examples more than API call sequences? Figure: Research Questions (RQs) to be evaluated.
  • 28. Evaluation Results - RQ1 RQ1: How much more concise, readable, and precise with respect to handwritten examples are the snippets after summarization? Figure: (a) average readability, and (b) average physical lines of code (PLOCs) of the snippets, for each library, with (NaiveSum) and without (NaiveNoSum) summarization. [Bar charts per library; y-axes: average readability (a), average physical lines (b).]
  • 29. Evaluation Results - RQ1 RQ1: How much more concise, readable, and precise with respect to handwritten examples are the snippets after summarization? Figure: Precision at top k, with (NaiveSum) or without (NaiveNoSum) summarization, using the top 50 mined snippets. [Line plot; x-axis: top k; y-axis: precision.]
  • 30. Evaluation Results - RQ2 RQ2: Do more powerful clustering techniques, that cluster similar rather than identical sequences, lead to snippets that more closely match handwritten examples? Figure: Average interpolated snippet precision versus API coverage for three clustering algorithms (NaiveSum, HDBSCANSum, KMedoidsSum), using the top 100 mined snippets. [Line plot; x-axis: number of API methods covered; y-axis: snippet precision; plot annotations: A1, A2, A3.]
  • 31. Evaluation Results - RQ3 RQ3: Does our tool mine more diverse patterns than other existing approaches? Figure: Coverage in API methods achieved by CLAMS, MAPO, and UP-Miner on average, at top k, using the top 100 examples. [Line plot; x-axis: top k; y-axis: API methods covered.]
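    One plausible way to compute the coverage metric plotted for RQ3 is sketched below. It assumes that coverage at top k means the number of distinct API methods appearing in the k highest-ranked examples; the slides do not spell out the exact definition, and the class `CoverageAtK` is hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: coverage under the assumed definition stated above.
class CoverageAtK {

    // Counts distinct API methods across the first k ranked examples.
    // Each example is represented by its list of API method names.
    static int coverageAtK(List<List<String>> rankedExamples, int k) {
        Set<String> covered = new HashSet<>();
        int limit = Math.min(k, rankedExamples.size());
        for (int i = 0; i < limit; i++) {
            covered.addAll(rankedExamples.get(i));
        }
        return covered.size();
    }
}
```

    Under this reading, a tool that keeps returning variations of the same pattern gains little coverage as k grows, whereas a tool that mines diverse patterns keeps adding new methods.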
  • 32. Evaluation Results - RQ4 RQ4: Do source code snippets match handwritten examples more than API call sequences? Figure: Additional information revealed when mining snippets instead of sequences. [Bar chart per library; y-axis: average number of common tokens; series: sequence-tokens, additional snippet-tokens.]
  • 33. Evaluation Results - RQ4 RQ4: Do source code snippets match handwritten examples more than API call sequences?
    AccessToken accessToken;
    String oauthToken;
    String oAuthVerifier;
    Twitter twitter;
    try {
        accessToken = twitter.getOAuthAccessToken(oauthToken, oAuthVerifier);
        // Do something with accessToken
    } catch (TwitterException e) {
        e.printStackTrace();
    }
    Figure: Example snippet matched to handwritten example. Sequence-tokens are encircled and additional snippet-tokens are highlighted in bold.
  • 34. Applying CLAMS to the industry • We conducted a pilot user survey with a team of Java developers at Hotels.com and received encouraging feedback. Developer 1: "The system generates clear and concise snippets which would be easy to follow and useful when using an API you are unfamiliar with." Developer 2: "The system is great and I think it would be very useful particularly in discovering clients of our APIs!"
  • 35. Applying CLAMS to the industry • We applied CLAMS at Hotels.com and developed an internal method-based search engine which can be used for API documentation purposes.
  • 36. Resources CLAMS Website: https://mast-group.github.io/clams/ CLAMS User Survey at Hotels.com: https://mast-group.github.io/clams/user-survey/ CLAMS Source Code: https://github.com/mast-group/clams