How to build your own translator in 15 minutes
Neural Machine Translation in practice
Bartek Rozkrut
2040.io
Why is it so important?
40 billion USD / year industry
Language is a huge barrier for many people
Provide unlimited access to knowledge
Scale NLP problems
AIMeetup #4: Neural-machine-translation
Why build your own translator?
1. Private / sensitive data
2. Huge amounts of data, e.g. e-mail translation (cost)
3. Off-line / off-cloud / on-premise
4. Custom domain-specific translation / vocabulary
Neural Machine Translation – example workflow
1. Download Parallel Corpus files
2. Append all corpus files (source + target) in same order
3. Split TRAIN / VAL set
4. Tokenization
5. Preprocess
6. Train
7. Release model (CPU compatible)
8. Translate!
9. REPEAT! 
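
Step 2 is the easiest one to get wrong: if the source and target files are appended in a different order, every sentence pair after that point is misaligned. A minimal sketch of that step follows, assuming each corpus is stored as a pair of line-aligned files named `PREFIX.pl` and `PREFIX.en`. The naming scheme and the `merge_corpora` helper are hypothetical, not something OPUS or OpenNMT prescribes.

```shell
# Hypothetical helper: append several corpora into src.txt / tgt.txt.
# Assumes each corpus prefix P has two line-aligned files, P.pl and P.en.
merge_corpora() {
  : > src.txt
  : > tgt.txt
  for prefix in "$@"; do
    src_lines=$(wc -l < "$prefix.pl") || return 1
    tgt_lines=$(wc -l < "$prefix.en") || return 1
    # A parallel corpus is only usable if line N means the same on both sides.
    if [ "$src_lines" -ne "$tgt_lines" ]; then
      echo "line count mismatch in $prefix: $src_lines vs $tgt_lines" >&2
      return 1
    fi
    # Same iteration order for both sides keeps the pairs aligned.
    cat "$prefix.pl" >> src.txt
    cat "$prefix.en" >> tgt.txt
  done
}
```

Called as `merge_corpora corpus1 corpus2`, it stops at the first corpus whose two sides differ in length instead of silently producing a shifted training set.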
Parallel Corpus – public data
HTTP://OPUS.LINGFIL.UU.SE
Parallel Corpus (source file – PL, EUROPARL)
1. Tytuł: Admirał NATO potrzebuje przyjaciół.
2. Dziękuję.
3. Naprawdę potrzebuję...
4. Ten program stał się katalizatorem. Następnego dnia setki osób chciały mnie dodać do znajomych. Indonezyjczycy i Finowie Pisali: "Admirale, słyszeliśmy, że potrzebuje pan znajomych, a tak przy okazji, co to jest NATO?"
Parallel Corpus (target file – EN, EUROPARL)
1. The headline was: NATO Admiral Needs Friends.
2. Thank you.
3. Which I do.
4. And the story was a catalyst, and the next morning I had hundreds of Facebook friend requests from Indonesians and Finns, mostly saying, "Admiral, we heard you need a friend, and oh, by the way, what is NATO?"
Vocabulary
1. Word level
2. Sub-word level (e.g. Byte Pair Encoding)
3. Character level
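
To make the word-level option concrete: a word-level vocabulary is usually just the N most frequent tokens in the training data, and everything else is treated as unknown. The sketch below only illustrates that idea. `build_vocab` is a hypothetical helper, it splits on whitespace only, and it is not how OpenNMT builds its vocabulary internally.

```shell
# Hypothetical helper: print the N most frequent whitespace-separated tokens,
# one per line, most frequent first (ties broken by byte order).
# Usage: build_vocab FILE N
build_vocab() (
  export LC_ALL=C   # byte-wise comparison keeps the ordering deterministic
  tr -s '[:space:]' '\n' < "$1" |
    grep -v '^$' |
    sort |
    uniq -c |
    sort -k1,1nr -k2,2 |
    awk -v n="$2" 'NR <= n { print $2 }'
)
```

For example, `build_vocab train-src.txt.tok 100000` would list a 100k-word vocabulary for the source side. Sub-word and character-level vocabularies shrink this list by splitting rare words into smaller units, at the cost of longer sequences.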
BLEU
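
For reference, the standard definition of the metric is sketched below. The symbols are the usual ones from the BLEU literature, not taken from the slides: p_n is the modified n-gram precision, w_n the weight for each n-gram order (commonly 1/N with N = 4), c the total length of the system output and r the reference length.

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```

The result lies between 0 and 1 and is often reported multiplied by 100, which is the scale a figure such as "20 BLEU" refers to.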
HTTP://OPENNMT.NET/
OPENNMT – DECEMBER 2016
HTTPS://GOOGLE.GITHUB.IO/SEQ2SEQ/
GOOGLE’S SEQ2SEQ – MARCH 2017
Our experience from PL=>EN training
1. 100k vocabulary (word-level)
2. Bidirectional LSTM, 2 layers, RNN size 500
3. 5M sentences from public data sources
4. ~20 BLEU
OpenNMT – run Docker container
Run CPU-based interactive session with command:
sudo docker run -it 2040/opennmt bash
Run GPU-based interactive session with command:
sudo nvidia-docker run -it 2040/opennmt bash
OpenNMT – split parallel corpus
# 90% of the lines go to training, the rest to validation
split -l $(( $(wc -l < src.txt) * 9 / 10 )) src.txt
mv xaa train-src.txt
mv xab val-src.txt
split -l $(( $(wc -l < tgt.txt) * 9 / 10 )) tgt.txt
mv xaa train-tgt.txt
mv xab val-tgt.txt
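
The split above takes the last tenth of each file as validation data. When several corpora were appended one after another, that tail may come from a single source. One possible variant, sketched below, shuffles the sentence pairs together before splitting. `shuffle_and_split` is a hypothetical helper; it assumes GNU `shuf` is installed and that no sentence contains a tab character.

```shell
# Hypothetical helper: shuffle sentence pairs together, then split 90/10.
# Assumes GNU shuf is available and that no sentence contains a tab.
# Usage: shuffle_and_split SRC_FILE TGT_FILE
shuffle_and_split() {
  src=$1
  tgt=$2
  total=$(wc -l < "$src") || return 1
  train=$(( total * 9 / 10 ))
  # Glue each source line to its translation so both move together.
  paste "$src" "$tgt" | shuf > pairs.tsv
  head -n "$train" pairs.tsv | cut -f1 > train-src.txt
  head -n "$train" pairs.tsv | cut -f2 > train-tgt.txt
  tail -n "+$(( train + 1 ))" pairs.tsv | cut -f1 > val-src.txt
  tail -n "+$(( train + 1 ))" pairs.tsv | cut -f2 > val-tgt.txt
  rm pairs.tsv
}
```

Running `shuffle_and_split src.txt tgt.txt` produces the same four file names as the commands above, so the later steps do not change.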
OpenNMT – preprocess parallel corpus
th tools/tokenize.lua -joiner_annotate -mode aggressive < train-src.txt > train-src.txt.tok
th tools/tokenize.lua -joiner_annotate -mode aggressive < train-tgt.txt > train-tgt.txt.tok
th tools/tokenize.lua -joiner_annotate -mode aggressive < val-src.txt > val-src.txt.tok
th tools/tokenize.lua -joiner_annotate -mode aggressive < val-tgt.txt > val-tgt.txt.tok
th preprocess.lua -train_src train-src.txt.tok -train_tgt train-tgt.txt.tok -valid_src val-src.txt.tok -valid_tgt val-tgt.txt.tok -save_data _data
OpenNMT – train && release && translate
th train.lua -data _data-train.t7 -layers 2 -rnn_size 500 -brnn -save_model model -gpuid 1
th tools/release_model.lua -model model.t7 -gpuid 1
th translate.lua -model model.t7 -src val-src.txt.tok -output file-tgt.tok -gpuid 1
Best hyperparameters from 250k GPU hours (thanks, Google)
HTTPS://ARXIV.ORG/ABS/1703.03906
Other applications
1. Image to text
2. OCR (e.g. Tesseract OCR v4.0 – LSTM)
3. Lip reading
4. Simple Q&A
5. Chatbots
HTTP://WEB.STANFORD.EDU/CLASS/CS224N/
SLIDES USED WITH PERMISSION FROM RICHARD SOCHER
Thanks!
Bartek Rozkrut
bartek@2040.io
