Optimizing Industrial Operations
in Real Time
Using the Big Data Ecosystem
Kishore Reddipalli
Director - Software Engineering
GE Digital
Agenda
• Use Case
• Spark as Analytic Runtime
• Optimization Framework
• Streaming and Batch Analysis
• Challenges
• Q&A
GE Mission
• Improve Asset Reliability and Availability
• Monitor Mission Critical Events
• Optimize the Manufacturing process
• Optimize Fleet Operations
• Reduce Unplanned Downtime
Use Case
Power Plant Efficiency:
• In a power plant, heat rate can be thought of as the input
needed to produce one unit of output. It generally indicates
the amount of fuel required to generate one unit of
electricity.
• Performance parameters tracked for any thermal power
plant, such as efficiency, fuel costs, plant load factor, and
emissions level, are a function of the station heat rate and
can be linked directly to it.
Source: https://en.wikipedia.org/wiki/Heat_rate_(efficiency)
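As a rough illustration of the heat rate definition above, the sketch below computes heat rate and the corresponding thermal efficiency. The function names and the sample fuel and output figures are hypothetical; the only fixed fact used is that one kilowatt-hour equals about 3412.14 Btu.

```python
BTU_PER_KWH = 3412.14  # energy content of one kilowatt-hour, in Btu


def heat_rate(fuel_energy_btu: float, electricity_kwh: float) -> float:
    """Heat rate in Btu/kWh: fuel energy in per unit of electricity out."""
    if electricity_kwh <= 0:
        raise ValueError("electricity_kwh must be positive")
    return fuel_energy_btu / electricity_kwh


def thermal_efficiency(heat_rate_btu_per_kwh: float) -> float:
    """Fraction of fuel energy converted to electricity (0..1)."""
    return BTU_PER_KWH / heat_rate_btu_per_kwh


# Hypothetical plant: 10.5 million Btu of fuel for 1,000 kWh generated.
hr = heat_rate(fuel_energy_btu=10_500_000, electricity_kwh=1_000)
eff = thermal_efficiency(hr)
print(f"heat rate = {hr:.0f} Btu/kWh, efficiency = {eff:.1%}")
```

A lower heat rate means less fuel per unit of electricity, which is why the other plant parameters can be tied back to it.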
Data Volume
• In aviation, a GE jet engine produces 5000 data points that
can be analyzed per second to optimize flight times.
• In Power, 500000 data points need to be analyzed to
generate the outcomes. These data points are generated
by ~1000 sensors.
• Data generated by thousands of pieces of GE equipment at
high volume and rate needs to be stored and analyzed at
petabyte scale.
Predix – Industrial Internet platform that can be
leveraged to build industrial applications
www.predix.io
Architecture
Spark as an Analytic Runtime
• REST API (Spark Job Server)
• Security
• Multi-tenancy
• Optimization Framework
• Spark SQL
• Spark Streaming
Optimization Framework
Need for framework – To simplify and bring consistency in
the development of analytics and abstract the complexity of
data connectivity and processing of large volumes of data
• API
• Schema
• Data Providers (Input / Output)
• Data Frames (Variety of Data – Timeseries, Asset,
Configuration)
• Parallelism (Partitioning of data for processing)
• Multi-Mode (Stream vs Batch)
• Multi-Stream Source
• UDF (Aggregation, Interpolation, Unit of Measure)
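To make the UDF bullet concrete, here is a minimal sketch of two such functions in plain Python: linear interpolation over timestamped samples and a unit-of-measure conversion. The names `interpolate` and `fahrenheit_to_celsius` are illustrative assumptions, not part of the framework's API, and a real deployment would register equivalents as Spark UDFs.

```python
from bisect import bisect_left


def interpolate(samples, t):
    """Linearly interpolate a value at time t.

    samples: list of (time, value) pairs sorted by time.
    """
    times = [s[0] for s in samples]
    if not times[0] <= t <= times[-1]:
        raise ValueError("t is outside the sampled range")
    i = bisect_left(times, t)
    if times[i] == t:
        return samples[i][1]  # exact hit, no interpolation needed
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def fahrenheit_to_celsius(f):
    """Unit-of-measure conversion for the temperature unit group."""
    return (f - 32.0) * 5.0 / 9.0


samples = [(0, 10.0), (10, 20.0)]
midpoint = interpolate(samples, 5)
boiling_c = fahrenheit_to_celsius(212.0)
```

Keeping these as small pure functions makes them easy to reuse across stream and batch modes.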
Optimization Framework -
Architecture
Data Providers
Data connectors that fetch data from a
variety of data sources.
Examples:
1. File (HDFS)
2. HTTP – RESTful services (Asset, Timeseries,
any business services)
3. Database (Cassandra, Postgres)
4. Messaging (Kafka, Kinesis, EventHub)
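One way to picture a data provider is as a small read/write interface that an analytic is wired to. The sketch below is an assumption about the shape of such an abstraction, not the framework's actual API: `DataProvider`, `InMemoryProvider` and `run_analytic` are hypothetical names, and the in-memory class stands in for a real HDFS, HTTP, database or messaging connector.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


class DataProvider(ABC):
    """Hypothetical connector interface: one source or sink of records."""

    @abstractmethod
    def read(self) -> Iterable[Record]:
        ...

    @abstractmethod
    def write(self, records: Iterable[Record]) -> None:
        ...


class InMemoryProvider(DataProvider):
    """Stand-in for a real connector (file, HTTP, database, messaging)."""

    def __init__(self, records=None):
        self._records: List[Record] = list(records or [])

    def read(self) -> Iterable[Record]:
        return iter(list(self._records))

    def write(self, records: Iterable[Record]) -> None:
        self._records.extend(records)


def run_analytic(source: DataProvider, sink: DataProvider,
                 analytic: Callable[[Record], Record]) -> None:
    """Read from the input provider, apply the analytic, write the output."""
    sink.write(analytic(r) for r in source.read())


source = InMemoryProvider([{"tagId": "temperature", "v": 100.0}])
sink = InMemoryProvider()
run_analytic(source, sink, lambda r: {**r, "v": r["v"] * 2})
```

Because the analytic only sees the interface, swapping the input from a file to a message stream would not change the analytic code.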
Timeseries – Dataframe Schema
{
  "tags": [
    {
      "tagId": "temperature",
      "data": [
        {
          "q": "3",
          "ts": "2015-07-23T12:25:00.000-0000",
          "v": "425.07935"
Timeseries DataFrame
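The Timeseries schema above can be flattened into one row per data point, which is roughly what a DataFrame view of it would contain. This is a sketch under assumptions: the slide's JSON is cut off, so the closing brackets in `payload` are supplied here, and the output column names (`value`, `quality`) are illustrative.

```python
import json
from datetime import datetime

# The slide fragment, closed off so that it parses (closing brackets assumed).
payload = """
{"tags": [{"tagId": "temperature",
           "data": [{"q": "3",
                     "ts": "2015-07-23T12:25:00.000-0000",
                     "v": "425.07935"}]}]}
"""


def flatten(doc):
    """Turn the nested tags/data structure into flat, typed rows."""
    rows = []
    for tag in doc["tags"]:
        for point in tag["data"]:
            rows.append({
                "tagId": tag["tagId"],
                "ts": datetime.strptime(point["ts"], "%Y-%m-%dT%H:%M:%S.%f%z"),
                "value": float(point["v"]),   # "v" arrives as a string
                "quality": int(point["q"]),   # "q" arrives as a string
            })
    return rows


rows = flatten(json.loads(payload))
```

Note that the values and quality codes arrive as strings in the schema, so the flattening step is also where type conversion happens.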
Asset Dataframe - Schema
"tagClassifications": [
  {
    "id": "OO-BL000472_Tag_Temperature_Classification_ID",
    "name": "OO-BL000472_Tag_Temperature_Classification_name",
    "description": "This is tag Temperature Classification description",
    "unitGroup": "temperature",
    "properties": [
      {
        "id": "low",
        "value": [
          80
        ],
        "type": "double"
      },
      {
        "id": "high",
        "value": [
          120
        ],
        "type": "double"
      },
      {
        "id": "threshold",
        "value": [
          100
Asset Dataframe
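The asset properties (low, high, threshold) suggest how an analytic might combine asset metadata with timeseries values. The sketch below is an assumption about that usage: the slide does not say how the three limits are applied, the `"type": "double"` on the threshold entry is filled in because the slide is cut off, and `limits` and `classify` are hypothetical helpers.

```python
# Properties from the asset schema; the threshold "type" is assumed.
classification = {
    "unitGroup": "temperature",
    "properties": [
        {"id": "low", "value": [80], "type": "double"},
        {"id": "high", "value": [120], "type": "double"},
        {"id": "threshold", "value": [100], "type": "double"},
    ],
}


def limits(classification):
    """Map each property id to its first value as a float."""
    return {p["id"]: float(p["value"][0]) for p in classification["properties"]}


def classify(value, lim):
    """Label a reading against the limits (assumed semantics)."""
    if value < lim["low"]:
        return "below_low"
    if value > lim["high"]:
        return "above_high"
    if value > lim["threshold"]:
        return "above_threshold"
    return "normal"


lim = limits(classification)
labels = [classify(v, lim) for v in (70.0, 90.0, 110.0, 130.0)]
```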
Sample Analytic
Stream Processing – Data Flow
Stream Processing
• Micro Batch Interval
• Continuous Application
• Multi Stream Sources
• Tenant-Aware Data Pipeline
• Context-Based Data Pipeline
• Window-Based Slicing – Moving Average
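The window-based moving average can be sketched without Spark as a fixed-size sliding window over incoming values. This is a simplified, count-based illustration; the `MovingAverage` class is hypothetical, and Spark Streaming windows are defined by time duration rather than by element count.

```python
from collections import deque


class MovingAverage:
    """Average over the most recent `window_size` values."""

    def __init__(self, window_size: int):
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        self._window = deque(maxlen=window_size)
        self._total = 0.0

    def update(self, value: float) -> float:
        # When the window is full, the oldest value is about to be evicted.
        if len(self._window) == self._window.maxlen:
            self._total -= self._window[0]
        self._window.append(value)
        self._total += value
        return self._total / len(self._window)


ma = MovingAverage(window_size=3)
averages = [ma.update(v) for v in [1.0, 2.0, 3.0, 4.0]]
```

Keeping a running total avoids re-summing the whole window on every new value.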
Stream Processing - Pointers
• Micro Batch Interval – “Depends on the use case”
• Data Congestion – Incoming stream rate vs processing rate
• Delayed Data – Quality in absence of data
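One common way to deal with delayed data is to compare each event's timestamp with the current watermark and set aside anything older than an allowed lateness. The sketch below illustrates that idea only; `split_late`, the watermark value and the lateness setting are assumptions, not the approach the deck describes in detail.

```python
def split_late(events, watermark, allowed_lateness):
    """Partition (event_time, value) pairs into on-time and late events.

    An event is late when its event_time is older than
    watermark - allowed_lateness (all times in seconds).
    """
    cutoff = watermark - allowed_lateness
    on_time = [e for e in events if e[0] >= cutoff]
    late = [e for e in events if e[0] < cutoff]
    return on_time, late


# Hypothetical micro-batch: the third event arrives 20 seconds behind.
events = [(100, 1.0), (95, 2.0), (80, 3.0)]
on_time, late = split_late(events, watermark=100, allowed_lateness=10)
```

Whether late events are dropped, logged or merged back in is a per-use-case decision, in the same way the micro-batch interval is.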
Batch Processing
Batch Processing
• Time range of data
• Aggregations
• Parallel Collections
• Partitioning of Data
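The batch bullets above (time range, partitioning, parallel collections, aggregation) can be sketched together: split the requested time range into slices, aggregate each slice in parallel, then combine the partial results. This is a plain-Python illustration in which a thread pool stands in for Spark partitions and executors; all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_range(start, end, step):
    """Split [start, end) into consecutive [lo, hi) slices of at most `step`."""
    return [(lo, min(lo + step, end)) for lo in range(start, end, step)]


def aggregate_slice(points, lo, hi):
    """Partial aggregate for one slice: (sum, count) of values in [lo, hi)."""
    values = [v for ts, v in points if lo <= ts < hi]
    return sum(values), len(values)


def batch_mean(points, start, end, step, workers=4):
    """Mean over the time range, computed from per-slice partial aggregates."""
    slices = partition_range(start, end, step)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda s: aggregate_slice(points, *s), slices))
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count if count else None


points = [(t, float(t)) for t in range(10)]  # (timestamp, value) pairs
mean = batch_mean(points, start=0, end=10, step=3)
```

Combining (sum, count) pairs rather than per-slice means keeps the result correct when slices hold different numbers of points.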
Challenges
Stream Processing:
- Data Arrival – Delays (Spark 2.x)
- State Persistence (Spark 2.x)
Data Providers:
- gRPC Connector (Shading)
Performance Tuning:
- Parallel Collections of Data (Read/Write)
YARN Client Mode Limitations: (Cluster Mode)
- Latency (Distribution of Jars)
- Loading from HDFS
Shading
Performance Metrics (Batch)
Performance Metrics (Stream)
Monitoring (Grafana)
Future Next Steps
• Spark 2.x – Structured Streaming
• Machine Learning Pipelines
• Zeppelin as Service – Interactive
Analysis
• Data Providers – Registration as a
Service
Q&A

Editor's Notes

  • #14: Timeseries Dataframe Configuration Dataframe
  • #16: Timeseries Dataframe Configuration Dataframe
  • #17: Timeseries Dataframe Configuration Dataframe
  • #23: Some of the challenges in industrial use cases: Late arrival of data: the batch interval must be tuned for the use case, and late data needs to be ignored. We also have use cases that need to persist intermediate state. We developed a custom Spark receiver to stream data from an in-house messaging layer, EventHub (gRPC). One challenge while building it was class conflicts, the typical Java class-loading issues that arise with different versions of third-party libraries. For that reason we used shading.
  • #27: Graphite and Grafana – Ability to monitor and visualize Spark performance and to create dashboards in a consolidated UI
  • #28: Ability to author, test and productionize the analytics; support for machine learning pipelines; support for registering custom data providers; Unit of Measure conversions; Spark 2.0 adoption – Structured Streaming