Machine Learning with Microsoft Azure
#msdevcon
Dmitry Petukhov,
ML/DS Preacher, Coffee Addicted &&
Machine Intelligence Researcher @ OpenWay
R for Fun Prototyping
Developer PC: code, data, RAM, result
IDE: RStudio and/or Visual Studio
Runtime: CRAN and/or Microsoft R Open
Flexibility, Distributed, Scalable (horizontal, vertical), Fault-tolerant, Reliable, OSS-based, BigData-ready, LSML, Secure
R for full cycle development
CRISP-DM
[Diagram: data flow through the modeling cycle]
Data Flow: Training Dataset, Test Dataset
Feature Selection: Feature Selection, Feature Scaling (Normalization), Dimension Reduction
Training ML algorithm, Cross-validation
Model evaluation: evaluate model quality measures (ROC, RMSE, F-score, etc.)
Final Model, Final Model Evaluation
Share results, Revision
Source: http://0xCode.in/azure-ml-for-data-scientist
This work is licensed under a Creative Commons Attribution 4.0 International License
Step 1: read data
# 1. from local file system
library(data.table)
dt <- fread("data/transactions.csv")
# > Read 6849346 rows and 6 (of 6) columns from 0.299 GB file in 00:00:31
# 2. from Web
dt <- fread("https://raw.githubusercontent.com/greggles/mcc-codes/master/mcc_codes.csv",
            sep = ",", stringsAsFactors = FALSE, header = TRUE, colClasses = list(character = 2))
# >   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
# >                                  Dload  Upload   Total   Spent    Left  Speed
# > 100 14872  100 14872    0     0  29744      0 --:--:-- --:--:-- --:--:-- 31710
# 3. from Azure Blob Storage
library(AzureSMR)
sc <- createAzureContext(tenantID = "{TID}", clientID = "{CID}", authKey = "{KEY}")
sc
azureGetBlob(sc,
storageAccount = "contestsdata",
container = "financial",
blob = "transactions.csv",
type = "text")
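The `colClasses` argument in the web example matters because merchant category codes are identifiers, not numbers. The following is a minimal base-R sketch of that point. It uses a made-up two-row sample instead of the real mcc_codes.csv, so the column names `mcc` and `description` and the sample rows are assumptions for illustration only.

```r
# Hypothetical sample: MCC codes are identifiers and may start with a zero.
csv_text <- "mcc,description
0742,Veterinary Services
5411,Grocery Stores"

# Default type guessing turns the code column into integers.
as_number <- read.csv(text = csv_text)

# Forcing the column to character keeps each code exactly as written.
as_text <- read.csv(text = csv_text, colClasses = c(mcc = "character"),
                    stringsAsFactors = FALSE)

as_number$mcc  # the leading zero is lost
as_text$mcc    # codes are preserved as strings
```

The same idea applies to `fread()`: declaring identifier columns as character up front avoids silent changes to the key used for later joins.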
Step 1: read data
# 4. from MS SQL Server
library(RODBC) # Provides database connectivity
connectionString <- "Driver={ODBC Driver 13 for SQL Server};Server=tcp:msdevcon.database.windows.net,1433;Database=TransDb;Uid=..."
trans.conn <- odbcDriverConnect(connectionString) # open RODBC connection
sqlSave(trans.conn, mcc.raw, "MCC2", addPK = TRUE) # save data to table
mccFromDb <- sqlQuery(trans.conn, "SELECT * FROM MCC2 WHERE edited_description LIKE '%For Visa Only%'") # get data
head(mccFromDb)
#> rownames code edited_description combined_description
#> 1 978 9700 Automated Referral Service ( For Visa Only) Automated Referral Service ( For Visa Only)
#> 2 979 9701 Visa Credential Service ( For Visa Only) Visa Credential Service ( For Visa Only)
#> 3 980 9702 GCAS Emergency Services ( For Visa Only) GCAS Emergency Services ( For Visa Only)
#> 4 981 9950 Intra – Company Purchases ( For Visa Only) Intra – Company Purchases ( For Visa Only)
close(trans.conn)
# * Excel, HDFS, Amazon S3, REST-services as data sources
Step 2: preprocessing data
library(dplyr)
# { "0 10:23:26" "1 10:19:29" "1 10:20:56" } > { 0, 1, 1 }
getDay <- function(x) { strsplit(x, split = " ")[[1]][1] }
trans <- trans.raw %>%
  # remove invalid rows: keep only rows with a known, non-zero amount
  filter(
    !is.na(amount) & amount != 0
  ) %>%
  # transform data
  mutate(
    OperationType = factor(ifelse(amount > 0, "income", "withdraw")),
    TransDay = as.numeric(sapply(tr_datetime, getDay)),
    Amount = abs(amount)
  ) %>%
  # remove redundant columns
  select(
    -c(tr_datetime, amount, term_id)
  ) %>%
  # set column names
  rename(
    CustomerId = customer_id, MCC = mcc_code, TransType = tr_type
  ) %>%
  # sort
  arrange(
    TransDay, Amount
  )
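Two details of this preprocessing step are easy to check in isolation: extracting the day number from the timestamp, and the rule for a valid row. The sketch below uses base R only and made-up sample values. The `sub()` call is a vectorized alternative to the slide's `getDay` helper, offered as one option rather than the author's method. The row rule assumes a transaction is valid only when its amount is both present and non-zero.

```r
# Sample timestamps in the "<day number> <hh:mm:ss>" form shown on the slide.
tr_datetime <- c("0 10:23:26", "1 10:19:29", "1 10:20:56")

# Vectorized day extraction: drop everything from the first space onward,
# then convert the remaining day number.
trans_day <- as.numeric(sub(" .*$", "", tr_datetime))
trans_day

# Row filter: a row is valid only if the amount is present AND non-zero.
# Hypothetical amounts, including one zero and one missing value.
amount <- c(-120.5, 0, NA, 300)
keep <- !is.na(amount) & amount != 0
amount[keep]
```

Combining the two conditions with `|` instead of `&` would let zero amounts through, because `!is.na(0)` is already `TRUE`.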
Step 3: feature engineering
# calculate stats
library(dplyr)
customers.stats <- trans.x %>%
mutate(LogAmount = log(Amount)) %>%
group_by(CustomerId, OperationType, Gender) %>%
filter(n() > 30) %>%
summarize(
Min = min(LogAmount),
P1 = quantile(LogAmount, probs = c(.01)),
Q1 = quantile(LogAmount, probs = c(.25)),
Mean = mean(LogAmount),
Q3 = quantile(LogAmount, probs = c(.75)),
P99 = quantile(LogAmount, probs = c(.99)),
Max = max(LogAmount),
Total = sum(Amount),
Count = n(),
StandDev = sd(LogAmount)
) %>%
ungroup()
# shape from long to wide table form
library(reshape2)
x <- dcast(customers.stats, CustomerId + Gender ~ OperationType, value.var = "Mean", fun.aggregate = mean)
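The group-then-reshape pattern can be tried on a toy table without dplyr or reshape2. This is a minimal base-R sketch under stated assumptions: the six transactions are invented, and `log10` replaces the slide's natural `log` only so the numbers are easy to read.

```r
# Hypothetical toy transactions; column names follow the slide.
trans <- data.frame(
  CustomerId    = c(1, 1, 1, 2, 2, 2),
  OperationType = c("income", "withdraw", "withdraw",
                    "income", "income", "withdraw"),
  Amount        = c(1000, 10, 1000, 10, 1000, 10)
)
# log10 is used here for readable values; the slide uses the natural log.
trans$LogAmount <- log10(trans$Amount)

# Mean log-amount per customer and operation type, laid out wide:
# one row per customer, one column per operation type.
wide <- tapply(trans$LogAmount,
               list(trans$CustomerId, trans$OperationType),
               mean)
wide
```

The result has the same shape as the `dcast()` output on the slide: each customer becomes one row with an `income` and a `withdraw` column, which is the form a model or a scatter plot needs.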
Step 3: feature engineering
library(ggplot2)
ggplot(x, aes(x = income, y = withdraw)) +
geom_point(alpha = 0.25, colour = "darkblue") + facet_grid(. ~ Gender) +
xlab("Income, rub") + ylab("Withdraw, rub")
Step 4: training ML-model
library(ROCR) # provides prediction() and performance()
# train model
model <- glm(formula = gender ~ ., family = binomial(link = "logit"), data = dt.train)
# score model
p <- predict(model, newdata = dt.test, type = "response")
pr <- prediction(p, dt.test$gender)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
# evaluate model
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
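To make the AUC figure less of a black box, it can be reproduced by hand. The sketch below is not the ROCR implementation. It is a base-R version of the rank-based (Mann-Whitney) formula, and the function name `auc_by_rank`, the scores and the labels are all illustrative assumptions.

```r
# AUC as the probability that a randomly chosen positive case receives a
# higher score than a randomly chosen negative one (ties count as half).
auc_by_rank <- function(scores, labels) {
  pos <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  ranks <- rank(scores)  # average ranks handle tied scores
  (sum(ranks[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predicted probabilities and true labels (1 = positive class).
scores <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
labels <- c(1,   1,   1,    0,   0,   0)
auc_by_rank(scores, labels)
```

In this sample 8 of the 9 positive-negative pairs are ordered correctly, so the function returns 8/9. A value of 0.5 would mean the scores rank cases no better than chance.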
Challenges
Data science evolves rapidly
Data grows even faster
Data >> Memory (now and evermore)
We must scale better
Complex infrastructure
Zoo of frameworks
Maybe the cloud?
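One common response to data that exceeds memory, short of moving to a cluster, is to process it in chunks and keep only running totals. The sketch below illustrates the idea in base R. The in-memory `textConnection` is a hypothetical stand-in for a large file, and the chunk size of two rows is chosen only to keep the example small.

```r
# Hypothetical stand-in for a file too large to load at once.
con <- textConnection(c("amount", "10", "20", "30", "40", "50"))
header <- readLines(con, n = 1)

total <- 0
count <- 0
repeat {
  chunk <- readLines(con, n = 2)   # read at most two rows per pass
  if (length(chunk) == 0) break    # no rows left
  values <- as.numeric(chunk)
  total <- total + sum(values)
  count <- count + length(values)
}
close(con)

total / count  # mean computed without holding all rows at once
```

This works for statistics that decompose into running sums. Quantiles and model fitting do not decompose so easily, which is part of the case for distributed frameworks.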
Big Data + Cloud + Machine Learning
Slow, expensive, …
Apache Spark/Hadoop + Azure + R Server
Available as a PaaS service
ML for the bloody Enterprise
Team
Version Control
Pull code
Application Server (Task Manager)
Tasks
Distributed Execution Framework
Big Data Cluster: Head Node, Worker Node
DFS
Flexibility, Distributed, Highly scalable, Fault-tolerant, Reliable, OSS-based, BigData-ready, LSML, Secure
R for the Enterprise
Team
Pull code
Microsoft R Server
Tasks
Apache Spark / Hadoop
Azure HDInsight: Head Node, Worker Node
HDFS API
Azure Blob Storage
Microsoft R
Microsoft R Open and Microsoft R Server #R
MicrosoftML #R
Microsoft R Server for Azure HDInsight #PaaS
R Server on Apache Spark
Data Science VM #R #IaaS
CNTK & GPU Instances #NN #GPU #OSS
Batch AI Training preview #PaaS #NN #GPU
Azure Machine Learning #PaaS
R scripts, modules and models #R
Jupyter Notebooks #R #SaaS
R-to-cloud: AzureSMR, AzureML #R #OSS
Cognitive Services #SaaS #NN
SQL Server R Services #R #PaaS
Power BI #R #Viz
Execute R scripts
Visual Studio
R extensions for VS2015
Built-in R support in VS2017
Microsoft Azure
© 2017, Dmitry Petukhov. CC BY-SA 4.0 license. Microsoft and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
Data Science must win!
Q&A
Now or later (use contacts below)
Ping me
Habr: @codezombie
All contacts: http://0xCode.in/author


Editor's Notes

  • #17: (c) 2017, Dmitry Petukhov. CC BY-SA 4.0 license.
  • #18: Event: https://events.techdays.ru/Future-Technologies/2017-06/