Machine Learning on Code - SF meetup

“Software is eating the world”

Machine Learning on Source Code
Francesc Campoy

VP of Product & DevRel
source{d}
Machine Learning for Large Scale Code Analysis
@francesc | #MLonCode

Agenda
● Machine Learning on Source Code
● Research
● Use Cases
● The Future

Field of Machine Learning where the input data is source code.
MLonCode

Requires:
● Lots of data
● Really, lots and lots of data
● Fancy ML Algorithms
● A little bit of luck
Related Fields:
● Data Mining
● Natural Language Processing
● Graph Based Machine Learning

The datasets of ML on Code
● GH Archive: https://www.gharchive.org
● Public Git Archive: https://pga.sourced.tech

Announcement: blog.sourced.tech/post/announcing-pga
Public Git Archive

blog.sourced.tech/post/pga_history/

'112', '97', '99', '107', '97', '103',
'101', '32', '109', '97', '105', '110',
'10', '10', '105', '109', '112', '111',
'114', '116', '32', '40', '10', '9',
'34', '102', '109', '116', '34', '10',
'41', '10', '10', '102', '117', '110',
'99', '32', '109', '97', '105', '110',
'40', '41', '32', '123', '10', '9',
'102', '109', '116', '46', '80', '114',
'105', '110', '116', '108', '110', '40',
'34', '72', '101', '108', '108', '111',
'44', '32', '112', '108', '97', '121',
'103', '114', '111', '117', '110', '100',
'34', '41', '10', '125', '10'
package main
import "fmt"
func main() {
	fmt.Println("Hello, Copenhagen")
}
What is Source Code

package package
IDENT main
;
import import
STRING "fmt"
;
func func
IDENT main
(
)
{
IDENT fmt
.
IDENT Println
(
STRING "Hello, Denver"
)
;
}
;
What is Source Code
package main
import "fmt"
func main() {
}

What is Source Code
package main
import "fmt"
func main() {
}

What is Source Code
● A sequence of bytes
● A sequence of tokens
● An abstract syntax tree
● A graph (e.g. Control Flow Graph)
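The first three views can be reproduced with Go's standard library alone. The following is a minimal sketch, not the tooling used in the talk: the helper names `tokens` and `countNodes` and the file name `main.go` are made up for illustration, and the sample program is the "Hello, playground" one from the byte listing.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/scanner"
	"go/token"
	"strings"
)

const src = `package main

import "fmt"

func main() {
	fmt.Println("Hello, playground")
}
`

// tokens returns the lexical tokens of src as "KIND literal" strings.
func tokens(src string) []string {
	fset := token.NewFileSet()
	file := fset.AddFile("main.go", fset.Base(), len(src))
	var s scanner.Scanner
	s.Init(file, []byte(src), nil, 0)
	var out []string
	for {
		_, tok, lit := s.Scan()
		if tok == token.EOF {
			return out
		}
		// Operators have no literal and automatic semicolons carry "\n",
		// so trim the surrounding whitespace.
		out = append(out, strings.TrimSpace(fmt.Sprintf("%s %s", tok, lit)))
	}
}

// countNodes parses src and counts the nodes of its abstract syntax tree.
func countNodes(src string) (int, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "main.go", src, 0)
	if err != nil {
		return 0, err
	}
	n := 0
	ast.Inspect(f, func(node ast.Node) bool {
		if node != nil {
			n++
		}
		return true
	})
	return n, nil
}

func main() {
	fmt.Println([]byte(src)[:12]) // view 1: a sequence of bytes
	for _, t := range tokens(src) { // view 2: a sequence of tokens
		fmt.Println(t)
	}
	n, err := countNodes(src) // view 3: an abstract syntax tree
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("AST nodes:", n)
}
```

The graph view (for example a control flow graph) needs more machinery than the standard library offers, so it is left out of this sketch.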

Tasks
● Language Classification
● File Parsing
● Token Extraction
● History Analysis
● Reference Resolution
Tools
● enry
● babelfish
● libuast & XPath selectors
● go-git
● kythe.io
Analyzing Code

source{d} engine
github.com/src-d/engine

babelfish
gitbase
Jupyter
Demo time!

Challenge #3
Learning from Source Code

Neural Networks
Basically fancy linear regression machines
Given an input of a constant length,
they predict an output of constant length.
Example:
MNIST:
Input: images with 28x28 px
Output: a digit from 0 to 9

MNIST
Output: [~0, ~0, ~0, ~0, ~0, ~0, ~0, ~0, ~1, ~0]
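Reading a digit out of a fixed-length output like this one usually means picking the position with the highest score. A minimal sketch, assuming ten scores ordered from digit 0 to digit 9; the numbers below are invented for illustration and `argmax` is a hypothetical helper name.

```go
package main

import "fmt"

// argmax returns the index of the largest value in scores, or -1 if scores is empty.
func argmax(scores []float64) int {
	best := -1
	for i, s := range scores {
		if best == -1 || s > scores[best] {
			best = i
		}
	}
	return best
}

func main() {
	// Hypothetical network output: one score per digit, 0 through 9.
	output := []float64{0.01, 0.0, 0.02, 0.0, 0.01, 0.0, 0.0, 0.01, 0.94, 0.01}
	fmt.Println("predicted digit:", argmax(output))
}
```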

MLonCode: Predict the next token
for
i
:=
0
;
i
<
10
;
i
++
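To make the task concrete without a neural network, here is a much simpler stand-in: a bigram counter that predicts the token seen most often after the previous one. This is not the model from the talk; the type `bigramModel` and the tiny training set are assumptions made for illustration.

```go
package main

import "fmt"

// bigramModel counts, for each token, how often each other token follows it.
type bigramModel map[string]map[string]int

// train builds a bigram model from token sequences.
func train(seqs [][]string) bigramModel {
	m := bigramModel{}
	for _, seq := range seqs {
		for i := 0; i+1 < len(seq); i++ {
			if m[seq[i]] == nil {
				m[seq[i]] = map[string]int{}
			}
			m[seq[i]][seq[i+1]]++
		}
	}
	return m
}

// predict returns the token seen most often after prev, or "" if prev is unknown.
// Ties are broken alphabetically so the result is deterministic.
func (m bigramModel) predict(prev string) string {
	best, bestCount := "", 0
	for tok, n := range m[prev] {
		if n > bestCount || (n == bestCount && tok < best) {
			best, bestCount = tok, n
		}
	}
	return best
}

func main() {
	// Hypothetical training data: three tokenized loop headers.
	m := train([][]string{
		{"for", "i", ":=", "0", ";", "i", "<", "10", ";", "i", "++"},
		{"for", "j", ":=", "0", ";", "j", "<", "n", ";", "j", "++"},
		{"for", "i", ":=", "1", ";", "i", "<", "n", ";", "i", "++"},
	})
	fmt.Println("after ':=' expect:", m.predict(":="))
	fmt.Println("after 'for' expect:", m.predict("for"))
}
```

A model like this only looks one token back, which is exactly the limitation that sequence models are meant to lift.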

Recurrent Neural Networks
Can process sequences of variable length.
Uses its own output as a new input.
Example:
Natural Language Translation:
Input: “bonjour, les gaufres”
Output: “hi, waffles”

MLonCode: Code Generation
charRNN: Given n characters, predict the next one
Trained over the Go standard library
Achieved 61% accuracy on predictions.
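The setup "given n characters, predict the next one" can be sketched as a sliding window over the text, with accuracy being the share of windows where the guess matches the real next character. This is only the data preparation and scoring, not the RNN itself; `example`, `windows`, `accuracy`, and the always-space predictor are hypothetical names introduced here.

```go
package main

import "fmt"

// example pairs a context of n characters with the character that follows it.
type example struct {
	context string
	next    rune
}

// windows slides a window of n characters over text and returns every
// (context, next character) pair. It returns nil when n is not positive.
func windows(text string, n int) []example {
	if n <= 0 {
		return nil
	}
	runes := []rune(text)
	var out []example
	for i := 0; i+n < len(runes); i++ {
		out = append(out, example{string(runes[i : i+n]), runes[i+n]})
	}
	return out
}

// accuracy is the fraction of examples for which predict returns the true next character.
func accuracy(examples []example, predict func(context string) rune) float64 {
	if len(examples) == 0 {
		return 0
	}
	correct := 0
	for _, ex := range examples {
		if predict(ex.context) == ex.next {
			correct++
		}
	}
	return float64(correct) / float64(len(examples))
}

func main() {
	exs := windows("if err != nil {", 4)
	// Stand-in predictor that always guesses a space; a trained model would go here.
	alwaysSpace := func(string) rune { return ' ' }
	fmt.Printf("%d examples, accuracy %.2f\n", len(exs), accuracy(exs, alwaysSpace))
}
```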

Before training
r t,
kp0t@pp kpktp 0p000 xS%%%?ttk?^@p0rk^@%ppp@ta#p^@ #pp}}%p^@?P%^@@k#%@P}}ta S?@}^@t%@% %%aNt i
^@SSt@@ pyikkp?%y ?t k L P0L t% ^@i%yy ^@p i? %L%LL tyLP?a ?L@Ly?tkk^@ @^@ykk^@i#P^@iL@??@%1tt%^@tPTta L
^@LL%% %i1::yyy^@^@t tP @?@a#Patt 1^@@ k^@k ? yt%L1^@tP%k1?k? % ^@i ^@ta1?1taktt1P?a^@^@Pkt?#^@t^@##1?##
#^@t11#:^@%??t%1^@a 1?a at1P ^@^@Pt #%^@^@ ^@aaak^@#a#?P1Pa^@tt%?^@kt?#akP ?#^@i%%aa ^@1%t tt?a?%
t^@k^@^@k^@ a : ^@1 P# % ^@^@#t% :% kkP ^@#?P: t^@a
?%##?kkPaP^@ #a k?t?? ai?i%PPk taP% P^@ k^@iiai#?^@# #t ?# P?P^@ i^@ttPt #
1%11 ti a^@k P^@k ^@kt %^@%y?#a a#% @? kt ^@t%k? ^@PtttkL tkLa1 ^@iaay?p1% Pta tt ik?ty
k^@kpt%^@tktpkryyp^@?pP# %kt?ki? i @t^@k^@%#P} ?at}akP##Pa11%^@i% ^@?ia ia%##%tki %
}i%%%}} a ay^@%yt }%t ^@tU%a% t}yi^@ ^@ @t yt%? aP @% ^@??^@%? ^@??k#%
kk#%t?a: P}^@t :#^@#1t^@#: w^@P#%w:Pt t # t%aa%i@ak@@^@ka@^@a # y}^@# ^@? % tP i?
?tk ktPPt a tpprrpt? a^@ pP pt %p ? k? ^@^@ kP^@%%?tk a Pt^@#
tP? P kkP1L1tP a%? t1P%PPti^@?%ytk %#%%t?@?^@ty^@iyk%1#^@@^@1#t a t@P^@^@ P@^@1P^@%%#@P:^@%^@ t
1:#P#@LtL#@L L1 %%dt??^@L ^@iBt yTk%p ^@i

After one epoch (dataset seen once)
if testingValuesIntering() {
t.SetCaterCleen(time.SewsallSetrive(true)
if weq := nil {
t.Errorf("eshould: wont %v", touts anverals, prc.Strnared, error
}
t, err := ntr.Soare(cueper(err, err)
if err != nil {
t.Errorf("preveth dime il resetests:%d; want %#',", tr.test.Into
}
if err != nil {
return
}
if err == nel {
t.Errorf("LoconserrSathe foot %q::%q: %s;%want %d", d, err)
},
defarenContateFule(temt.Canses)
}
if err != nil {
return err
}
// Treters and restives of the sesconse stmpeletatareservet
// This no to the result digares wheckader. Constate bytes alleal

After two epochs
if !ok {
t.Errorf("%d: %v not %v", i, err)
}
if !ot.Close()
if enr != nil {
t.Fatal(err)
}
if !ers != nil {
t.Fatal(err)
}
if err != nil {
t.Fatal(err)
}
if err != nil {
t.Errorf("error %q: %s not %v", i, err)
}
return nil
}

After many epochs
if got := t.struct(); !ok {
t.Fatalf("Got %q: %q, %v, want %q", test, true
}
if !strings.Connig(t) {
t.Fatalf("Got %q: %q", want %q", t, err)
}
if !ot {
t.Errorf("%s < %v", x, y)
}
if !ok {
t.Errorf("%d <= %d", err)
}
if !stricgs(); !ot {
t.Errorf("!(%d <= %v", x, e)
}
}
if !ot != nil {
return ""
}

Learning to Represent Programs with Graphs
from, err := os.Open("a.txt")
if err != nil {
	log.Fatal(err)
}
defer from.Close()
to, err := os.Open("b.txt")
if err != nil {
	log.Fatal(err)
}
defer ???.Close()
io.Copy(to, from)
Miltiadis Allamanis, Marc Brockschmidt, Mahmoud Khademi
https://arxiv.org/abs/1711.00740
The VARMISUSE Task:
Given a program and a gap in it,
predict what variable is missing.
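Before a model can choose a variable for the gap, something has to list the candidates. One rough way to do that in Go is to parse the program and collect every name introduced with `:=`. This sketch is an assumption about how candidates might be gathered, not the paper's method (which builds a program graph), and it uses the snippet with the gap already filled in so that it parses. The file name `copy.go` and the function `candidates` are hypothetical.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

const src = `package main

import (
	"io"
	"log"
	"os"
)

func main() {
	from, err := os.Open("a.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer from.Close()
	to, err := os.Open("b.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer to.Close()
	io.Copy(to, from)
}
`

// candidates returns the names introduced by := in src, in order of first declaration.
func candidates(src string) ([]string, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "copy.go", src, 0)
	if err != nil {
		return nil, err
	}
	seen := map[string]bool{}
	var names []string
	ast.Inspect(f, func(n ast.Node) bool {
		assign, ok := n.(*ast.AssignStmt)
		if !ok || assign.Tok != token.DEFINE {
			return true
		}
		for _, lhs := range assign.Lhs {
			if id, ok := lhs.(*ast.Ident); ok && id.Name != "_" && !seen[id.Name] {
				seen[id.Name] = true
				names = append(names, id.Name)
			}
		}
		return true
	})
	return names, nil
}

func main() {
	names, err := candidates(src)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("candidates for the gap:", names)
}
```

A VARMISUSE model would then rank these candidates for each gap; the ranking is the learned part and is not shown here.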

code2vec: Learning Distributed Representations of Code
Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav
https://arxiv.org/abs/1803.09473 | https://code2vec.org/

Much more research
github.com/src-d/awesome-machine-learning-on-source-code

Challenge #4
What can we build?

Predictable vs Predicted
Output: [~0, ~0, ~0, ~0, ~0, ~0, ~0, ~0, ~1, ~0]

A Go PR
An attention model for code reviews.

Can you see the mistake?
Prediction vs Expectation
for i := 0; i < 10; i-- {
	if i%2 == 0 {
		fmt.Println("where's the mistake?")
	}
}

VARMISUSE
from, err := os.Open("a.txt")
if err != nil {
	log.Fatal(err)
}
defer from.Close()
to, err := os.Open("b.txt")
if err != nil {
	log.Fatal(err)
}
defer from.Close()
io.Copy(to, from)

VARMISUSE
from, err := os.Open("a.txt")
if err != nil {
	log.Fatal(err)
}
defer from.Close()
to, err := os.Open("b.txt")
if err != nil {
	log.Fatal(err)
}
defer from.Close() ← s/from/to/
io.Copy(to, from)

Is this a good name?
func XXX(list []string, text string) bool {
	for _, s := range list {
		if s == text {
			return true
		}
	}
	return false
}
Suggestions:
● Contains
● Has
func XXX(list []string, text string) int {
	for i, s := range list {
		if s == text {
			return i
		}
	}
	return -1
}
Suggestions:
● Find
● Index
code2vec: Learning Distributed Representations of Code

Splitting millions of identifiers with Deep Learning
https://blog.sourced.tech/post/idsplit/
isthisCorrect? → is this correct? → isThisCorrect?

Demo time!
learning Go
code2vec.org
neural splitter

source: WOCinTech
Assisted code review! src-d/lookout

putting everything together
lookout

Coming up soon:
● Automated Style Guide Enforcement
● Bug Prediction
Coming … later:
● Automated Code Review
● Code Generation: from unit tests, specification, natural language description.
● Natural Analysis: code description and conversational analysis.
● Education
And so much more

Want to know more?
● sourced.tech (psst, we’re hiring)
● bit.ly/awesome-mloncode
● francesc@sourced.tech
● come say hi, I have stickers
