"Full Stack" Data Science with R
for Startups: Production-Ready
with Open Source Tools
#rstats #SoCalDS17 #IDEAS17
Oct 22, 2017
Ajay Gopal
#SoCalDS17 #IDEAS17 | #rstats | @aj2z @SelfScore
Me: (Data) Scientist, Technologist, Entrepreneur
Ajay Gopal, PhD
: ajzz : @aj2z
2017: Chief Data Scientist, SelfScore Inc
#FinTech #ML #Underwriting #Risk #rstats
2016: VP, Data Science & Growth, CARD.com
#FinTech #MktgAutomation #BehavEcon #rstats
2012: Postdoc / Staff Researcher, UCLA
#BioInformatics #GraphTheory #StatMech #Python
2005: PhD, Univ of Chicago
#SurfacePhysics #BioPhysics #StatMech #Matlab
SelfScore: Financial Education & Inclusion
SelfScore
Industry: FinTech Alt-Lending Startup, Menlo Park, CA
What we do: Use ML models with alternative financial signals to help deserving but underserved populations gain access to fair credit, starting with international students (2 products in market)
Differentiator: Measure a borrower’s potential instead of history (e.g., without SSN / FICO)
Team: ~30 (4 in Data Science + You?)
Funding: Series B, founded in 2013
This talk
... was born on Twitter
For Startups + New Teams
1) Evolving Data Science needs
2) What’s “Full Stack” DS?
3) Why use R (or Python)?
4) Cloud R-based DS Stack
- Sample Infra
- Open Source tools
-------------------------
5) Production Mindset
6) Buy or Build?
Data Science (VC) Expectations Evolve
Innovation Vertical + Optimization Laterally
[Diagram: Data Science as its own vertical (IP, AI, Innovation, R&D) that also works laterally across Operations, Finance, Compliance, Technology, Product, CX, Demand Gen and Growth, through infra / process automation, product optimization, and ad / comms optimization.]
Considerations:
● Disruptive if relying on resources from other verticals
● More ad-hoc work
● R&D timelines not predictable
● Faster cadence for analytics
Solution:
● “Full Stack” Infra & Teams!
● Tools & Training for others to self-serve
Data Science in Modern (Gen-AI) Startups
The “Full Stack” Analogy
Layer | Technology | Function
UX | Email (SendGrid), SMS (Twilio), Push (SNS, Firebase), messaging frameworks | Multi-Channel Engagement
Front End | HTML/CSS, JS (Node, React), Bootstrap, iOS, Android, Ionic, Cordova | Optimal Service Delivery
APIs | Restify, Django, Rails, ASP.NET, Lambda | Platform-agnostic function & information availability
Back End | PHP, JS, Python, Ruby, ORMs, CI, Git | Business Logic
Data Store | MySQL, PostgreSQL, MongoDB, Redis, Memcached etc. | Identities, Attribs, Relations
Devops | Puppet, Chef, Ansible, AWS EC2, Docker, ECS/GCE, Heroku | Scalable Services & Contingencies
Goal: Scalable, Engaging, Valuable Web Service
“Full Stack” Data Science
Layer | Technology | Function
UX | httr, curl - API interactions for Email, SMS, Push, Slack, or via CI tool | Multi-Channel Engagement
Front End | shiny, HTML, CSV, rook, googlesheets, htmlwidgets, shinyapps.io, Dropbox | Optimal Service Delivery
APIs | Generic: rapache, opencpu, plumber; ML: h2o/steam, Domino Data Lab | Platform-agnostic function & information availability
Back End | Your internal pkgs, RServer, CI, Git, Chron, (most R packages), sparkR | Business Logic
Data Store | DBI, RMySQL, RPostgreSQL, Redis, Hadoop, Kinesis (AWR), Spark etc. | Identities, Attribs, Relations
Devops | rocker, EMIs, ECS, GCE, other cloud tools | Scalable Services & Contingencies
Goal: Scalable, Timely, Intelligence/Economic Services
R is Sufficient For All Key Stack Functions
1) Retrieve Data
- Ad / Marketing
- Sales
- Transaction
- 3rd Party / Behavioral
2) Process (ETL)
- Fetch, clean up, store
3) Analyze
- Cross-Connectivity
- Aggregation & Features
- Algorithms
4) Predict
- Models in batch
- In-memory modeling
- REST APIs
5) Inform
- Customers (Services & API)
- Partners
Eg: Marketing, fulfillment
- Internal Stakeholders
Eg: Reporting / Dashboards
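The five stages above chain naturally into a single pipeline. The following is only a minimal sketch, written in Python for brevity (the talk's own stack does this in R). Every function name and all of the toy data are hypothetical, and the "model" is a deliberately trivial placeholder rather than anything from the talk.

```python
from statistics import mean


def retrieve():
    """Stage 1: stand-in for pulling ad, sales, or transaction rows."""
    return [
        {"channel": "search", "spend": "120.5", "sales": "3"},
        {"channel": "social", "spend": "80.0", "sales": "1"},
        {"channel": "search", "spend": "99.5", "sales": "2"},
        {"channel": "display", "spend": None, "sales": "4"},  # incomplete row
    ]


def process(rows):
    """Stage 2 (ETL): drop incomplete rows and cast strings to numbers."""
    return [
        {"channel": r["channel"], "spend": float(r["spend"]), "sales": int(r["sales"])}
        for r in rows
        if r["spend"] is not None and r["sales"] is not None
    ]


def analyze(rows):
    """Stage 3: aggregate one simple feature, cost per sale by channel."""
    totals = {}
    for r in rows:
        spend, sales = totals.get(r["channel"], (0.0, 0))
        totals[r["channel"]] = (spend + r["spend"], sales + r["sales"])
    return {ch: spend / sales for ch, (spend, sales) in totals.items() if sales > 0}


def predict(features):
    """Stage 4: placeholder 'model' that flags channels costlier than average."""
    baseline = mean(features.values())
    return {ch: cost > baseline for ch, cost in features.items()}


def inform(predictions):
    """Stage 5: render a plain-text report for stakeholders."""
    return [f"{ch}: {'review' if flagged else 'ok'}" for ch, flagged in sorted(predictions.items())]


def run_pipeline():
    return inform(predict(analyze(process(retrieve()))))
```

Keeping each stage a plain function makes it easy to schedule the stages separately or swap one out (for example, replacing `predict` with a call to a real model service) without touching the others.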
“Full Stack” Data Science with R
Layer | Technology | Function
UX | httr, curl - API interactions for Email, SMS, Push, Slack, or via CI tool | Multi-Channel Engagement
Front End | shiny, HTML, CSV, rook, googlesheets, htmlwidgets, shinyapps.io, Dropbox | Optimal Service Delivery
APIs | Generic: rapache, opencpu, plumber; ML: h2o/steam, Domino, Lambda | Platform-agnostic function & information availability
Back End | Your internal pkgs, RServer, CI, Git, H2O, (most R packages), Spark | Business Logic
Data Store | DBI, RMySQL, RPostgreSQL, Redis, Hadoop, Kinesis (AWR), SparkR etc. | Identities, Attribs, Relations
Devops | rocker, EMIs, ECS, GCE, other cloud tools, Domino Data Lab, Azure | Scalable Services & Contingencies
Goal: Scalable, Timely, Intelligence/Economic Services
R is great for startups!
Top Drivers for Startups
1. Instant Reactive Web Visualizations via Shiny (Zero front-end dev)
2. Low barrier for cross-training
3. Fantastic IDE (RStudio): single-point access to stack
4. Large ecosystem of packages (modeling + viz + utils)
5. Great client libraries for ML frameworks
6. Statistically Trained Prospects (Python / Pandas odds good too)
Detractors
- Fewer hard-core devs
- Only handful of dev shops; no serious bandwidth for hire
- Memory mgmt (still?)
So how do we build an R-based stack?
Data Science Should Be This Easy
AUTOMATION
Data Science IDE
Interactive Dashboards
Predictive Models & APIs
Alerts, Notifications, Files
So how do we build this in the cloud?
Assembly of Cloud Container Services
1) Bastion - to connect to external world
(small, low memory, public IP)
2) Scheduler - do things triggered by time & events
(medium, run CI tools, invoke compute slaves)
3) Workers - heavy feature computations
(highmem, multi core, stateless)
4) Storage - DBs, pipelines & message queues
(distributed storage services or internal clusters)
5) Modeler - H2O Cluster, MLLib, Sci-Kit etc
(multi-node cluster, available on demand)
6) Reporter - API Service / Shiny server
(medium, autoscaled containers)
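One way to keep these six roles explicit is to describe them as data before provisioning anything. The sketch below is an illustration only, in Python: the `Service` fields and helper names are hypothetical. The slide states a public IP only for the bastion and on-demand availability only for the modeler, so treating every other service as private and always-on is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    role: str
    size: str
    public_ip: bool = False   # assumption: private unless the slide says otherwise
    on_demand: bool = False   # assumption: always-on unless the slide says otherwise


STACK = [
    Service("bastion", "connects to the external world", "small", public_ip=True),
    Service("scheduler", "time- and event-triggered jobs (CI tools)", "medium"),
    Service("workers", "heavy feature computations, stateless", "highmem"),
    Service("storage", "DBs, pipelines and message queues", "distributed"),
    Service("modeler", "ML cluster such as H2O", "multi-node", on_demand=True),
    Service("reporter", "API service / Shiny server", "medium"),
]


def public_entry_points(stack):
    """Names of services reachable from outside the private network."""
    return [s.name for s in stack if s.public_ip]


def always_on(stack):
    """Names of services that are expected to run continuously."""
    return [s.name for s in stack if not s.on_demand]
```

A description like this can be checked in review (for instance, asserting that `public_entry_points(STACK)` contains only the bastion) before it is translated into container or cloud configuration.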
Sample AWS Infra
Choice of Tools
Sample Production Workflows
“Staging” Shiny App
1. Git commit app to “Dev” branch
2. Jenkins syncs repo on commit
3. Sync triggers next Jenkins job, which creates Docker container
4. Next job: AWS CLI tools deploy Docker container to ECS
5. “Dev” Shiny app live on staging
6. API call to notify Slack channel
SEM Cost Forecaster
1. Rscript fetches AdWords spend & internal sales data every 5 minutes.
2. Rscript runs existing anomaly detection & forecast model
3. When check fails, API calls from R to SMS (e.g., Twilio) and Email (e.g., SendGrid).
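The check-then-alert step of the SEM Cost Forecaster can be sketched as follows. This is a Python illustration under stated assumptions, not the talk's R code: the slide does not say which anomaly detection method is used, so a simple z-score test stands in for it, and the notifiers are plain callables standing in for the SMS and email API calls. All names here are hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it lies more than `threshold` standard deviations
    from the mean of `history` (z-score test, a stand-in method)."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    spread = stdev(history)
    if spread == 0:
        return latest != history[0]
    return abs(latest - mean(history)) / spread > threshold


def check_spend(history, latest, notifiers, threshold=3.0):
    """Run the check and fan an alert out to every notifier when it fails."""
    if not is_anomalous(history, latest, threshold):
        return False
    message = f"SEM spend alert: latest={latest:.2f}, recent mean={mean(history):.2f}"
    for notify in notifiers:
        notify(message)  # e.g. a wrapper around an SMS or email API call
    return True
```

Passing the notifiers in as arguments keeps the check testable without network access: in a test, `notifiers` can simply be `[sent.append]` for some list `sent`, while the scheduled job supplies the real SMS and email wrappers.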
Building Full-Stack Data Science Teams
People
- Data / Backend Engineer
- Data Scientist
- Modeller / Statistician
- Product Manager
- Devops Engineer
Team Output
- EDA / ad-hoc
- Scheduled Reporting
- Batch Predictions
- Stream Processing
- Real-Time Prediction APIs
Our “product” is scalable, actionable intelligence
… let’s adopt good software development practices
The Production Mindset for Data Scientists
BetteR habits:
1. Write inline and offline tests for your code (testthat, checkmate)
2. Generate informational logs so you can debug later (futile.logger)
3. Add version control (GitHub)
4. Save business logic as functions in a package (selfscoRe)
5. Add examples (Rmd)
6. Write documentation (Rmd)
7. Create a web service (Shiny apps)
8. Put the service in a Docker container
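Habits 1, 2 and 4 fit together in a few lines. The sketch below is a Python analogue of what the listed R packages provide (inline input checks, informational logging, an offline unit test). The `debt_to_income` function is a hypothetical example of "business logic" and is not taken from the talk or from any SelfScore package.

```python
import logging

logger = logging.getLogger("scoring")


def debt_to_income(monthly_debt, monthly_income):
    """Business logic kept in one importable function instead of a loose script."""
    # Inline checks: fail fast on bad input.
    if monthly_income <= 0:
        raise ValueError("monthly_income must be positive")
    if monthly_debt < 0:
        raise ValueError("monthly_debt must be non-negative")
    ratio = monthly_debt / monthly_income
    # Informational log so a later debugging session can trace the call.
    logger.info("debt_to_income debt=%s income=%s ratio=%.3f",
                monthly_debt, monthly_income, ratio)
    return ratio


def test_debt_to_income():
    """Offline test, runnable under a test runner or called directly."""
    assert debt_to_income(500, 2000) == 0.25
    try:
        debt_to_income(500, 0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for zero income")
```

Once logic lives in a tested, logged function like this, wrapping it in a web service and a container (habits 7 and 8) becomes a packaging step rather than a rewrite.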
Should we buy or build?
VS
Should my company buy the infra? Should my team build it?
Buy vs Build Considerations
BUY / RENT
- If no dev/tech in-house
- If time-to-market is key
requires:
- Custom Integrations
- Higher Cost Tolerance
- Niche engagements
BUILD
- If compliance is a major factor (HIPAA, PCI)
- If cost control is key
- If full control of features is required
requires:
- In-house talent
- Longer time-to-market?
- Ongoing maintenance
In Summary
- Data Science is Vertical + Lateral!
- Colocate data sources
- Containerize services in the cloud
- Use R’s Rich Ecosystem (or something easy to cross-train other verticals on)
Thank You!
Hiring Sr “Full Stack” Data Scientist
*ML Models
Img Credits: http://daemon.co.za/2014/04/what-does-full-stack-mean
