Pavel Hardak, Dir Product (Workday)
Ned Borisov (Ph.D), Sr Eng Mgr (Workday)
Lightning-Fast Analytics for
Workday Transactional Data
#ExpSAIS18
Agenda
• Workday (Pavel H)
– Introduction to Workday
– Business challenges
– Platform for Transactional Apps
• Prism Analytics (Ned B)
– High Level Architecture
– Functional Modules
– Problems encountered
• Wrap-up (Pavel H)
Workday
• Pure SaaS company (founded in 2005)
• Enterprise cloud apps – HCM and Finances
– Named as “Leader” in Gartner Magic Quadrants
• 2200+ customers, 175+ of Fortune 500
– Revenue: $2.1B, 36% YoY
• 8600+ employees worldwide
– #7 in FORTUNE "100 Best Companies to Work For"
– Pleasanton (HQ), San Mateo, San Francisco
– Boulder (CO), Dublin (Ireland), Victoria (BC), …
Continuous Innovation in Cloud
Enterprise SaaS Challenges
• Concurrency
– From small to huge companies: every ‘worker’ is a Workday user
• Reliability
– All users add and change data, generating many transactions
• Security
– Customers trust us with very confidential and private information
• Scalability
– Import several years from the previous system(s) and keep growing
• Speed
– Everybody wants fast response time :)
One Platform
[Diagram: Business Process Framework, Object Data Model, Reporting and Analytics, Security, Integration, Cloud, Machine Learning]
One Source for Data | One Security Model | One Experience | One Community
Object Data Model
[Diagram: Object Data Model within One Platform: Metadata, Extensible, Durable]
Reporting and Analytics
[Diagram: Reporting and Analytics within One Platform: Dashboards, Collaboration, Distribution]
But we want more…
• Import 3rd party data from external sources
– Unknown schema, need validations and cleansing
• Blend external data with Workday data
– Self Service Data Preparation
– Publish custom report sources
– Leverage the same security paradigms
• Data Discovery and Reporting
– Visualize, slice and dice by any dimension
– Perform faster than ever before
Just add some …
• Water (?)
• Coffee (?)
• Energy drink (?)
• Apache Spark (!)
Why Apache Spark
• Wanted to standardize on ONE data processing
technology which keeps evolving
• Needed extensibility to handle diverse use cases
• Scalability for on-disk views and in-memory
processing
• SQL processing is a HUGE plus
High Level Prism Architecture
[Diagram: Prism Server handling Report Queries and Web UI Requests; Data Prep (Interactive Transforms); HDFS holding Workday Data, External Data, and Samples]
Data Preparation
• A dataset may import
other datasets to
transform them (think
SQL View)
• Transforms include:
Filter, Join, Union,
Group By, etc.
• Example data are shown
to help verify the
transformation
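
The "think SQL View" idea can be sketched in plain Python: a dataset is a recipe that imports other datasets and applies transforms lazily, so the same recipe can be run over a small sample to verify it interactively. The `Dataset` class, its method names, and the sample rows are hypothetical and chosen only for illustration. Prism's actual implementation runs on Spark and is not shown in the slides.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


class Dataset:
    """A lazily evaluated dataset, similar in spirit to a SQL view."""

    def __init__(self, compute: Callable[[], Iterable[Row]]):
        self._compute = compute  # nothing runs until rows() is called

    @classmethod
    def from_rows(cls, rows: List[Row]) -> "Dataset":
        return cls(lambda: list(rows))

    def rows(self) -> List[Row]:
        return list(self._compute())

    def filter(self, predicate: Callable[[Row], bool]) -> "Dataset":
        return Dataset(lambda: (r for r in self._compute() if predicate(r)))

    def union(self, other: "Dataset") -> "Dataset":
        return Dataset(lambda: list(self._compute()) + list(other._compute()))

    def join(self, other: "Dataset", key: str) -> "Dataset":
        def compute():
            # Inner join: index the imported dataset, then probe it.
            index = defaultdict(list)
            for right in other._compute():
                index[right[key]].append(right)
            for left in self._compute():
                for right in index.get(left[key], []):
                    yield {**right, **left}
        return Dataset(compute)

    def group_sum(self, key: str, value: str) -> "Dataset":
        def compute():
            totals = defaultdict(int)
            for r in self._compute():
                totals[r[key]] += r[value]
            return [{key: k, value: v} for k, v in sorted(totals.items())]
        return Dataset(compute)


# Hypothetical sample rows standing in for the small data samples.
grants_sample = Dataset.from_rows([
    {"employee": "ann", "org": "org1", "quantity": 10},
    {"employee": "bob", "org": "org33", "quantity": 7},
    {"employee": "ann", "org": "org1", "quantity": 5},
])
orgs_sample = Dataset.from_rows([
    {"org": "org1", "region": "EMEA"},
    {"org": "org33", "region": "APAC"},
])

report = (
    grants_sample.join(orgs_sample, "org")
    .filter(lambda r: r["region"] == "EMEA")
    .group_sum("employee", "quantity")
)
preview = report.rows()  # example data shown to verify the transformation
```

Because each transform only wraps the previous recipe, the same `report` definition could later be evaluated over full data instead of samples.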
High Level Prism Architecture
[Diagram: Prism Server handling Report Queries and Web UI Requests; Data Prep (Interactive Transforms) and Lens Build (Batch Transforms); HDFS holding Workday Data, External Data, Samples, and Data]
Lens Build
• Materializes all transforms
• Writes a columnar format, further split into small blocks
[Diagram: Spark Jobs produce the Lens]
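
As a rough illustration of a columnar format split into small blocks, the sketch below pivots materialized rows into per-column arrays and chunks each column into fixed-size blocks. The function name `build_lens`, the `block_size` parameter, and the in-memory layout are assumptions for illustration only; the slides do not describe the actual Lens storage format.

```python
from typing import Dict, List

Row = Dict[str, object]


def build_lens(rows: List[Row], block_size: int) -> Dict[str, List[list]]:
    """Pivot rows into columns, then split each column into small blocks."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if not rows:
        return {}
    # One list of values per column name (assumes all rows share a schema).
    columns = {name: [row[name] for row in rows] for name in rows[0]}
    return {
        name: [values[i:i + block_size] for i in range(0, len(values), block_size)]
        for name, values in columns.items()
    }


materialized = [
    {"employee": "ann", "quantity": 10},
    {"employee": "bob", "quantity": 7},
    {"employee": "cat", "quantity": 3},
]
lens = build_lens(materialized, block_size=2)

# A query that only needs `quantity` can read that column's blocks alone.
total = sum(sum(block) for block in lens["quantity"])
```

The point of the layout is that a query touching one column never has to read the others, and small blocks give a natural unit for caching.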
High Level Prism Architecture
[Diagram: Prism Server handling Report Queries and Web UI Requests; Query Engine (Interactive BI Queries), Data Prep (Interactive Transforms), and Lens Build (Batch Transforms); HDFS holding Workday Data, External Data, Samples, and Lens Data]
Query Engine
• Analyst-driven Analysis
– Drag & drop chart creation
– Analyst-defined computed fields
– Quick measurement aggregates
• Execution
– Query Engine executes the queries
– Interactive response is required
Spark in Prism Architecture
Prism Analytics launches and manages the lifecycle of three types of Spark applications:
• Data Prep: a single (smaller) always-on Spark Application
– executes dataset transformations over small samples of data
• Lens Build: on-demand batch Application
– one per Lens Build process
– executes dataset transformations over full datasets
• Query Engine: a single (larger) always-on Application
– executes reporting queries over Lens data
– caches columns of Lenses in memory
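
The division of labour among the three application types can be summarized as a small routing table. This is a hypothetical sketch: the `SparkApp` dataclass, the `route` function, and the workload keys are invented for illustration and do not reflect Prism's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SparkApp:
    name: str
    always_on: bool   # long-running application vs. launched per job
    data_scope: str   # what the application reads


APPS = {
    "data_prep": SparkApp("Data Prep", always_on=True, data_scope="small samples"),
    "lens_build": SparkApp("Lens Build", always_on=False, data_scope="full datasets"),
    "query": SparkApp("Query Engine", always_on=True, data_scope="Lens data, columns cached in memory"),
}


def route(workload: str) -> SparkApp:
    """Pick the application type that serves a given workload."""
    try:
        return APPS[workload]
    except KeyError:
        raise ValueError(f"unknown workload: {workload!r}") from None


# Only Lens Build is started on demand, one application per build.
on_demand = [app.name for app in APPS.values() if not app.always_on]
```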
Query Engine & Spark
[Diagram: Prism Server, Query Engine (Prism Spark Server, Spark Driver), Data Prep, and a pool of Spark Executors]
Notable Observations
• Memory Allocation Strategy
• Row Level Security
Memory Allocation Strategy
• Executors
[Diagram: Executor JVM regions: Column Data Cache, Execution, Buffer (30%, 60%, 10%)]
• Driver
[Diagram: Driver JVM regions: Accumulators, Streaming, Buffer (20%, 60%, 20%)]
→ 20% faster queries
Row-Level Security
• Implemented as a dimension predicate. For example:
SELECT employee, SUM(quantity)
FROM Employee_Stock_Grants
WHERE supervisory_org IN (org1, org33, org_508)
GROUP BY employee;
• The In-List for supervisory_org could be very large
• There can be more than one In-List
• List values can be complex (e.g. nested conjunctions)
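
A minimal sketch of row-level security as an injected dimension predicate, using Python's built-in `sqlite3` module in place of Spark SQL. Only the table and column names come from the slide. The `secure_query` helper, the sample rows, and the added `ORDER BY` (used to make the output deterministic) are assumptions for illustration.

```python
import sqlite3
from typing import List, Sequence, Tuple


def secure_query(conn: sqlite3.Connection, allowed_orgs: Sequence[str]) -> List[Tuple[str, int]]:
    """Run the report query restricted to the orgs the user may see."""
    if not allowed_orgs:
        return []  # no visible orgs means no visible rows
    # One bound parameter per allowed org, so values are never inlined.
    placeholders = ", ".join("?" for _ in allowed_orgs)
    sql = (
        "SELECT employee, SUM(quantity) FROM Employee_Stock_Grants "
        f"WHERE supervisory_org IN ({placeholders}) "
        "GROUP BY employee ORDER BY employee"
    )
    return conn.execute(sql, list(allowed_orgs)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee_Stock_Grants "
    "(employee TEXT, supervisory_org TEXT, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO Employee_Stock_Grants VALUES (?, ?, ?)",
    [("ann", "org1", 10), ("ann", "org1", 5), ("bob", "org33", 7), ("cat", "org_9", 4)],
)

# The user may see three orgs; rows for org_9 are filtered out.
visible = secure_query(conn, ["org1", "org33", "org_508"])
```

The security predicate is appended per user, which is why the size of the In-List depends on how many organizations that user is entitled to see.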
Scenario Details
• Customer Use Case
– Predicates with 10+ In-Lists
– In-Lists with 6K to 12K values
– Additional mix of conjunctions and disjunctions
• The same query with security took 100X as long as without security
Analysis
• Finding 1
– Parsing, planning and optimizing was taking ~27 seconds
– We did it 4 times
• Finding 2
– Major cause is the number of times the Catalyst
expressions (In and InSet) and their arguments were
being traversed and copied during plan analysis and
optimization.
– Minor cause is the amount of time spent in serializing
Scala’s TrieSet when shipping the plan to executors
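
To see why the number of traversals matters, the toy tree below (not Catalyst itself) models an `In` expression whose literals are child nodes and counts the nodes a single full traversal touches. Every pass over the plan visits every literal, so the cost grows with list size times the number of passes. The `Node` class, `count_visits`, and the pass count are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def count_visits(node: Node) -> int:
    """Number of nodes a single full traversal touches."""
    return 1 + sum(count_visits(child) for child in node.children)


def in_expression(column: str, values: List[str]) -> Node:
    # Each literal is its own child node, as in a naive IN-list expression.
    literals = [Node(f"lit:{v}") for v in values]
    return Node("In", [Node(f"col:{column}")] + literals)


values = [f"org{i}" for i in range(6000)]
predicate = in_expression("supervisory_org", values)
visits_per_pass = count_visits(predicate)

# With many analysis and optimizer passes, the total multiplies quickly.
passes = 50  # illustrative number, not a measured Catalyst figure
total_visits = visits_per_pass * passes
```

With ten or more such lists in one predicate, the per-pass cost multiplies again, which is consistent with planning time dominating the query.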
Solution
• Custom InSet-like expressions (case classes)
– Hide the large literal sets behind a curried argument
– Resulted in queries going from 27 sec to 4 sec
• Further Optimizations
– Our InSet-like expression does not materialize the target in-sets until after the plan is deserialized on the executors
– Resulted in improvement from 4 sec to 2 sec
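
The two ideas in the solution can be mimicked in Python with `pickle`: keep the large value set out of the part of the expression that tree traversals see, and build the lookup set only after deserialization. This is an analogy, not the Scala code from the talk; `LazyInSet` and its fields are hypothetical. In Scala, the comparable trick relies on a case class's second parameter list not being part of its product elements.

```python
import pickle
from typing import Optional, Tuple


class LazyInSet:
    """Membership test whose lookup set is built lazily, after shipping."""

    def __init__(self, column: str, values: Tuple[str, ...]):
        self.column = column          # the only part traversals need to see
        self._values = values         # compact, opaque payload
        self._lookup: Optional[frozenset] = None  # built on first use

    @property
    def children(self) -> Tuple[str, ...]:
        # The large value list is deliberately not exposed as children.
        return (self.column,)

    def __getstate__(self):
        # Ship only the compact payload, never the materialized set.
        return {"column": self.column, "_values": self._values}

    def __setstate__(self, state):
        self.column = state["column"]
        self._values = state["_values"]
        self._lookup = None

    @property
    def materialized(self) -> bool:
        return self._lookup is not None

    def eval(self, row: dict) -> bool:
        if self._lookup is None:
            self._lookup = frozenset(self._values)
        return row[self.column] in self._lookup


expr = LazyInSet("supervisory_org", tuple(f"org{i}" for i in range(12000)))
expr.eval({"supervisory_org": "org1"})       # materializes on the sending side
shipped = pickle.loads(pickle.dumps(expr))   # simulate sending to an executor
before = shipped.materialized                # set not yet built after arrival
match = shipped.eval({"supervisory_org": "org33"})
after = shipped.materialized                 # built on first evaluation
```

A traversal that walks `children` sees a single entry regardless of how many values the expression holds, and the serialized form carries only the compact tuple.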
Future Plans
• Better query latency for big datasets
• Deeper integration with reports and apps
• Integration with Kubernetes and AWS
• Improved scalability and concurrency
• Achieve ‘Zero DownTime’
…and much more I cannot share here :)
Questions?
• IF ( you are looking for …
Great work culture &&
Technology challenges &&
Lots of fun and perks )
• THEN
Come to work with us!!!
workday.com/jobs
More Info
• Building a modern data discovery and BI platform using
Apache Spark and Catalyst by Kevin Beyer
• Data Preparation in Workday Prism Analytics: Solving
Complex Problems the Workday Way by Jianneng Li
• Exploring Workday’s Architecture by James Pasley
