Synthetic Data Generation for Machine Learning
2020 Copyright QuantUniversity LLC.
Presented By:
Sri Krishnamurthy, CFA, CAP
Sri.Krishnamurthy@qusandbox.com
www.quantuniversity.com
03/05/2020
Boston, MA
Speaker bio
Sri Krishnamurthy, Founder and CEO, QuantUniversity
• Quant, Data Science & ML practitioner
• Prior experience at MathWorks, Citigroup and Endeca, and with 25+ financial services and energy customers
• Columnist for the Wilmott Magazine
• Author of the forthcoming book “Financial Modeling: A case study approach”, published by Wiley
• Teaches Data Science/AI at Northeastern University, Boston
• Reviewer: Journal of Asset Management
About QuantUniversity
• Boston-based Data Science, Quant Finance and Machine Learning training and consulting advisory
• Trained more than 1000 students in Quantitative methods, Data Science, ML and Big Data Technologies
• Building a platform for operationalizing AI and Machine Learning in the Enterprise
Agenda
1. Challenges with Real Datasets
2. Synthetic Dataset generation tools
▫ Proprietary
▫ Open Source
– Faker
– Data Synthesizer
– SDV
– Synthpop
– GANs
3. Demos
▫ Data Synthesizer
▫ Sales Data Generator
▫ VIX Data Generator
Challenges with Real Datasets
Coverage
• It may not be feasible to get samples for all categories
• Lighting conditions
• Modifications (glasses/no glasses, moustache/no moustache, etc.)
• Positions
All scenarios haven’t played out
• Stress scenarios
• What-if scenarios
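One way to produce scenarios that have not played out is to resample observed returns and scale them by a shock factor. The sketch below is only an illustration under stated assumptions: the function name `what_if_paths`, the `shock` parameter, and the sample returns are hypothetical and are not taken from the presentation.

```python
import random
from typing import List, Sequence


def what_if_paths(returns: Sequence[float], start: float, horizon: int,
                  n_paths: int, shock: float = 1.0, seed: int = 0) -> List[List[float]]:
    """Build price paths by resampling past returns (with replacement).

    Each sampled return is multiplied by `shock` (hypothetical parameter);
    shock > 1 exaggerates moves to mimic a stress scenario.
    """
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        level = start
        path = [level]
        for _ in range(horizon):
            r = shock * rng.choice(returns)  # one resampled, shocked return
            level *= 1.0 + r                 # compound it into the level
            path.append(level)
        paths.append(path)
    return paths


# Made-up daily returns, used purely for illustration.
history = [0.010, -0.020, 0.005, -0.015, 0.012]
stressed = what_if_paths(history, start=100.0, horizon=5, n_paths=3, shock=2.0)
print(len(stressed), len(stressed[0]))
```

Resampling keeps the synthetic paths anchored to observed behaviour, while the shock factor is a crude knob for what-if analysis; a real stress test would define its shocks far more carefully.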
Figure ref: http://www.actuaries.org/CTTEES_SOLV/Documents/StressTestingPaper.pdf
Missing values
• Missing at random
• Missing sequences
• Need data to fill frames
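As a simple illustration of filling missing sequences, gaps can be replaced with values interpolated from their observed neighbours. This is a minimal sketch, assuming a one-dimensional numeric series with `None` marking missing entries; the function name `fill_gaps` is hypothetical.

```python
from typing import List, Optional


def fill_gaps(series: List[Optional[float]]) -> List[float]:
    """Fill None entries by linear interpolation between known neighbours.

    Leading and trailing gaps are filled with the nearest observed value.
    """
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no observed values")

    filled = list(series)
    # Edge gaps: copy the nearest observed value.
    for i in range(known[0]):
        filled[i] = series[known[0]]
    for i in range(known[-1] + 1, len(series)):
        filled[i] = series[known[-1]]

    # Interior gaps: straight line between the two surrounding observations.
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


observed = [10.0, None, None, 16.0, None, 20.0, None]
print(fill_gaps(observed))  # [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 20.0]
```

Linear interpolation is only one choice; it suits smooth series and short gaps, and says nothing about whether the values are missing at random.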
• Access
▫ Hard to find
▫ Rare class problems
▫ Privacy concerns making it difficult to share
Imbalanced
• Need more samples of rare class
• Need proxies for data points that were not observed or recorded
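A common way to obtain more rare-class samples is to interpolate between existing ones. The sketch below picks two rare-class points at random and places a new point on the segment between them. It is a simplified illustration, not a method from the slides: the function name `oversample_rare` and the sample points are hypothetical, and unlike nearest-neighbour approaches it pairs points purely at random.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def oversample_rare(points: Sequence[Point], n_new: int, seed: int = 0) -> List[Point]:
    """Create n_new synthetic points on segments between rare-class pairs."""
    if len(points) < 2:
        raise ValueError("need at least two rare-class points")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(list(points), 2)  # two distinct rare-class points
        t = rng.random()                    # position along the segment a -> b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


# Made-up two-feature rare-class observations, for illustration only.
rare = [(1.0, 5.0), (2.0, 7.0), (1.5, 6.0)]
for point in oversample_rare(rare, n_new=4):
    print(point)
```

Because every synthetic point lies between two observed ones, the new samples stay inside the region the rare class already occupies; they cannot stand in for regions that were never observed.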
Labels
• Human labeling is hard
• Synthetic label generators
Tools for Synthetic Data Generation
Proprietary Tools
Company and core technology
• Tonic.ai
▫ All-in-one platform for data anonymization, subsetting, and synthesis, integrated with databases (Hadoop, Oracle, MySQL, MS SQL Server, MongoDB, Amazon Aurora/Redshift, and Google BigQuery)
▫ Uses Condenser and Masquerade
• Mostly.ai
▫ Tabular data using generative deep neural networks (no image data)
• CVEDIA
▫ Sensor modeling and algorithm training
▫ Handles images using SynCity as a custom pocket laboratory to generate highly entropic scenes, conditions, and metadata; enables real-time Hardware-In-the-Loop (HWIL), Human-In-the-Loop (HITL) or Software-In-the-Loop (SIL) simulations, even with complex sensor configurations
• Deep Vision Data
▫ Image creation, synthetic training data
• Synthesis.ai
▫ The data generation platform for computer vision
Open Source tools
SDV
https://www.computer.org/csdl/proceedings-article/dsaa/2016/07796926/12OmNwx3Q7S
Data Synthesizer
https://faculty.washington.edu/billhowe/publications/pdfs/ping17datasynthesizer.pdf
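The simplest idea behind tabular synthesizers of this kind is to learn each column's value distribution from the real table and then sample new rows from it. The sketch below samples every column independently from its observed frequencies. It is an illustration only and is not the Data Synthesizer API: the function names `fit_marginals` and `sample_rows` and the loan-like rows are hypothetical, and it deliberately ignores correlations between columns and any privacy-preserving noise.

```python
import random
from collections import Counter
from typing import Dict, List

Row = Dict[str, str]


def fit_marginals(rows: List[Row]) -> Dict[str, Counter]:
    """Count how often each value occurs in each column."""
    columns = list(rows[0].keys())
    return {col: Counter(row[col] for row in rows) for col in columns}


def sample_rows(marginals: Dict[str, Counter], n: int, seed: int = 0) -> List[Row]:
    """Draw n synthetic rows, sampling each column independently."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        row = {}
        for col, counts in marginals.items():
            values = list(counts.keys())
            weights = list(counts.values())  # observed frequencies as weights
            row[col] = rng.choices(values, weights=weights, k=1)[0]
        synthetic.append(row)
    return synthetic


# Made-up loan-like records, for illustration only.
real = [
    {"grade": "A", "purpose": "car"},
    {"grade": "A", "purpose": "home"},
    {"grade": "B", "purpose": "car"},
    {"grade": "C", "purpose": "education"},
]
for row in sample_rows(fit_marginals(real), n=3):
    print(row)
```

Independent sampling preserves each column's distribution but can produce combinations that never occur together in the real data, which is why more capable tools also model dependencies between attributes.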
Synthpop
VAE
https://arxiv.org/pdf/1808.06444.pdf
GAN
https://developers.google.com/machine-learning/gan/gan_structure
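For reference, the generator/discriminator structure is usually summarized by the standard GAN minimax objective. The formula does not appear on the slide; the notation is introduced here as an assumption: G is the generator, D the discriminator, p_data the real-data distribution, and p_z the noise distribution fed to the generator.

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
```

The discriminator tries to raise V by scoring real samples high and generated samples low, while the generator tries to lower it by producing samples the discriminator accepts as real.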
WGAN
1. Loan Data Synthesizer
2. Sales Data Generator
3. VIX Data Generator
Demo 1: Loan Data Synthesizer
Demo 2: Synthetic Sales data generation
Demo 3: Synthetic VIX generation
If you want to be a part of QuSandbox private Beta
Contact us:
info@qusandbox
Upcoming events by QuantUniversity
1. Model Governance in the Age of Data Science and AI
▫ GFMI Course, March 9th, 10th, New York, NY
2. Synthetic VIX data generation using deep learning techniques
▫ QWAFAFEW meeting – March 17th, 2020, Boston, MA
3. Using synthetic data for ML in Finance
▫ 2nd Annual Machine Learning in Quantitative Finance – April 1st, 2020, New York, NY
4. Tackling the biggest limitations of ML
▫ 2nd Annual Machine Learning in Quantitative Finance – April 1st, 2020, New York, NY
5. Foundations of Machine Learning and AI for Financial Professionals
▫ 8-week Online course offered in partnership with PRMIA – May 12th – June 30th, 2020, Online
6. A Master Class on AI and Machine Learning for Financial Professionals
▫ Invited session at the 73rd CFA Annual Conference – May 17th, 2020, Atlanta, GA
Sri Krishnamurthy, CFA, CAP
Founder and Chief Data Scientist
sri@quantuniversity.com
srikrishnamurthy
www.QuantUniversity.com
www.analyticscertificate.com
www.qusandbox.com
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be distributed or used in any other publication without the prior written consent of QuantUniversity LLC.
