Fault-tolerant File Input & Output
Chandni Singh - Committer Apache Apex
May 4, 2016
Background: Windows in Apex
- Window: finite piece of a data set along temporal boundaries*
- Apex assigns an id to each window which helps with fault-tolerance.
- An operator is provided hooks to know which window id it is on.
File Input
- AbstractFileInputOperator for reading files of approximately equal size.
- File formats supported out of the box include text, Parquet and Avro.
- FileSplitterInput & AbstractFSBlockReader for
  - reading files of different sizes
  - parallelizing reads on a single file.
AbstractFileInputOperator
- Scans a folder periodically for new files.
- Parses the file for records.
- Fault-tolerant and scalable.
AbstractFileInputOperator : Fault tolerance
- A record is not lost.
- A record is associated with only one window id irrespective
of failures.
- If a window is replayed then all the records associated with
it will be replayed.
AbstractFileInputOperator : Fault tolerance cont’d
[Figure: diagram labeled "Failure" and "Window 0: committed"]
Fault tolerance is achieved by
- Support from the platform
  - Automatic checkpointing of the state of every operator in the DAG.
  - Automatic restoration of a failed operator in another container.
- WindowDataManager
  - Saves incremental state every window.
  - Helps with replaying windows that were completed by this operator.
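The replay guarantee can be illustrated with a minimal sketch. It is not the Malhar WindowDataManager API: the class name WindowReplaySketch, its methods, and the in-memory map (standing in for durable per-window storage) are all assumptions made for illustration. The idea shown is that the operator records which records each completed window covered, so that a replayed window emits exactly the same records.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the map below stands in for durable per-window state.
class WindowReplaySketch
{
  private final List<String> source;                           // records available to read
  private final Map<Long, int[]> completed = new HashMap<>();  // windowId -> {from, to}
  private int offset = 0;                                      // next unread record

  WindowReplaySketch(List<String> source)
  {
    this.source = source;
  }

  // Emits the records of a window. A window that was completed before is
  // replayed from its saved range, whatever maxRecords says this time.
  List<String> emitWindow(long windowId, int maxRecords)
  {
    int[] range = completed.get(windowId);
    if (range == null) {
      int to = Math.min(source.size(), offset + maxRecords);
      range = new int[] {offset, to};
      completed.put(windowId, range);  // saved when the window completes
    }
    offset = range[1];                 // continue after this window's records
    return new ArrayList<>(source.subList(range[0], range[1]));
  }

  // Simulates recovery: the read position falls back to the checkpointed value.
  void restoreTo(int checkpointedOffset)
  {
    offset = checkpointedOffset;
  }

  public static void main(String[] args)
  {
    List<String> records = java.util.Arrays.asList("a", "b", "c", "d", "e", "f");
    WindowReplaySketch operator = new WindowReplaySketch(records);
    System.out.println(operator.emitWindow(1, 2));
    System.out.println(operator.emitWindow(2, 3));
    operator.restoreTo(0);                          // failure, recover from checkpoint
    System.out.println(operator.emitWindow(1, 5));  // replays window 1 unchanged
  }
}
```

In this sketch a record stays tied to one window id across the simulated failure, which is the property the slides describe.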
AbstractFileInputOperator : Scalability
- Operator partitions read different subsets of files.
- Files are distributed between partitions based on their hash.
- Number of partitions can be changed at run time by changing a property.
- For advanced use cases, subclasses can override the directory scanner to customize behavior, such as having each partition scan a different directory.
- Auto-scaling is also supported in AbstractThroughputFileInputOperator.
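Hash-based distribution of files can be sketched as follows. This is an illustration of the idea only, not the Malhar directory scanner: the class HashFilePartitioning and its methods are hypothetical, and hashing the path string is an assumption about what "their hash" refers to.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: each partition keeps only the files whose hash maps to it.
class HashFilePartitioning
{
  // Returns the index of the partition that owns the given file path.
  static int partitionFor(String filePath, int partitionCount)
  {
    // floorMod keeps the result non-negative even when hashCode() is negative.
    return Math.floorMod(filePath.hashCode(), partitionCount);
  }

  // Filters the scanned paths down to those owned by one partition.
  static List<String> filesFor(int partitionIndex, int partitionCount, List<String> scanned)
  {
    List<String> owned = new ArrayList<>();
    for (String path : scanned) {
      if (partitionFor(path, partitionCount) == partitionIndex) {
        owned.add(path);
      }
    }
    return owned;
  }

  public static void main(String[] args)
  {
    List<String> scanned = Arrays.asList("/in/a.txt", "/in/b.txt", "/in/c.txt", "/in/d.txt");
    for (int i = 0; i < 3; i++) {
      System.out.println("partition " + i + " -> " + filesFor(i, 3, scanned));
    }
  }
}
```

Because every partition applies the same function to the same scan results, each file is read by exactly one partition without any coordination between them.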
AbstractFileInputOperator : Implementations
- LineByLineFileInputOperator in Malhar library
- Custom implementation
public class CustomFileInputOperator<RECORD> extends AbstractFileInputOperator<RECORD>
{
  public final transient DefaultOutputPort<RECORD> output = new DefaultOutputPort<RECORD>();

  @Override
  protected RECORD readEntity() throws IOException
  {
    //read record from input stream
    RECORD record = inputStream.read(...);
    return record;
  }

  @Override
  protected void emit(RECORD tuple)
  {
    output.emit(tuple);
  }
}
FileSplitterInput & AbstractFSBlockReader
- The tasks of discovering files and reading them are separated into different logical operators.
- The file splitter discovers files asynchronously and creates task descriptions (FileBlockMetadata).
- Block readers use FileBlockMetadata to read a particular block of a file.
- Fault-tolerant, parallelizes reading of a single file, and is auto-scalable.
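How a splitter might turn one file into block-level task descriptions can be sketched as below. The Block class is a hypothetical stand-in, not the real FileBlockMetadata (whose fields are not listed in the slides), and the fixed block size is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: cut a file of known length into fixed-size block descriptions.
class BlockSplitSketch
{
  // Stand-in for a block description handed to a block reader.
  static final class Block
  {
    final String filePath;
    final long offset;
    final long length;
    final boolean lastBlock;

    Block(String filePath, long offset, long length, boolean lastBlock)
    {
      this.filePath = filePath;
      this.offset = offset;
      this.length = length;
      this.lastBlock = lastBlock;
    }
  }

  static List<Block> split(String filePath, long fileLength, long blockSize)
  {
    if (blockSize <= 0) {
      throw new IllegalArgumentException("blockSize must be positive");
    }
    List<Block> blocks = new ArrayList<>();
    for (long offset = 0; offset < fileLength; offset += blockSize) {
      // The final block may be shorter than blockSize.
      long length = Math.min(blockSize, fileLength - offset);
      blocks.add(new Block(filePath, offset, length, offset + length == fileLength));
    }
    return blocks;
  }

  public static void main(String[] args)
  {
    for (Block b : split("/in/big.log", 250, 100)) {
      System.out.println(b.filePath + " offset=" + b.offset + " length=" + b.length);
    }
  }
}
```

Since each description is independent, different reader partitions can process blocks of the same file at the same time.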
FileSplitterInput & AbstractFSBlockReader: Fault tolerance
- Platform supports checkpointing state and re-deployment automatically.
- FileSplitterInput uses WindowDataManager to replay tuples of completed
windows.
- AbstractFSBlockReader relies on the upstream buffer-server to replay tuples
from a given window.
- Buffer-server is a buffer associated with each output port of an operator which
holds tuples emitted by that port.
FileSplitterInput & AbstractFSBlockReader: Scalability
- FileSplitterInput is a simple operator that uses few resources.
- Block reader does the actual work of reading files and is auto-scalable (in beta).
- Min and max partitions are configurable.
- Frequency of re-partition is controlled by a time interval property.
- Scales up/down based on the pending FileBlockMetadata in the input port queue.
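A backlog-driven scaling decision could look roughly like the sketch below. The slides only say that the reader scales on pending FileBlockMetadata within configurable minimum and maximum partition counts; the blocksPerPartition threshold, the class name, and the formula are assumptions made for illustration, not the operator's actual logic.

```java
// Hypothetical sketch: choose a partition count from the input queue backlog.
class BlockReaderScalingSketch
{
  static int targetPartitions(int pendingBlocks, int blocksPerPartition,
      int minPartitions, int maxPartitions)
  {
    if (blocksPerPartition <= 0 || minPartitions <= 0 || maxPartitions < minPartitions) {
      throw new IllegalArgumentException("invalid scaling configuration");
    }
    // Ceiling division: partitions needed to drain the backlog at the assumed rate.
    int needed = (pendingBlocks + blocksPerPartition - 1) / blocksPerPartition;
    // Clamp to the configured bounds.
    return Math.max(minPartitions, Math.min(maxPartitions, needed));
  }

  public static void main(String[] args)
  {
    System.out.println(targetPartitions(0, 10, 1, 8));
    System.out.println(targetPartitions(35, 10, 1, 8));
    System.out.println(targetPartitions(500, 10, 1, 8));
  }
}
```

Such a function would be evaluated once per re-partition interval, so short bursts in the queue do not cause constant re-deployment.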
FileSplitterInput & AbstractFSBlockReader: Implementations
- FileSplitterInput is concrete. Default behavior can be overridden.
- FS Block Readers
- FSSliceReader : record is a slice
- AbstractFSLineReader and AbstractFSReadAheadLineReader: record is a line
- Custom FS Block Reader
public class CustomFSBlockReader<RECORD> extends AbstractFSBlockReader<RECORD>
{
  public CustomFSBlockReader()
  {
    //initialize reader context
    this.readerContext = new RecordReaderContext();
  }

  @Override
  protected RECORD convertToRecord(byte[] bytes)
  {
    //convert bytes to RECORD (placeholder: substitute the conversion for the concrete record type)
    return RECORD.from(bytes);
  }
}
AbstractFileOutputOperator
- Persists data to a single file or multiple files.
- Automatic rotation of files (optional) based on
  - file size
  - window count
- Optional compression and encryption of data.
- Fault-tolerant
- Scalable as long as different partitions write to different files. Subclasses can
achieve this by appending the operator id to the file name.
AbstractFileOutputOperator : Fault tolerance
Record is persisted exactly once.
- A record is never missed.
- A record is not duplicated.
Example application that persists data exactly once: AtomicFileOutputApp
AbstractFileOutputOperator : Fault tolerance cont’d
To write exactly once
- Assumes idempotent processing.
- The checkpoint contains the size of each file the operator has written so far.
- On recovery, files are truncated to the size saved in the checkpoint being restored.
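The truncate-on-recovery step can be sketched with plain java.nio on a local temporary file. The local file system stands in for HDFS here, and the class and method names are hypothetical; this is not the operator's implementation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: remember the file size at checkpoint, truncate to it on recovery.
class TruncateOnRecoverySketch
{
  // Appends data to the file, creating it if needed.
  static void append(Path file, String data)
  {
    try {
      Files.write(file, data.getBytes(StandardCharsets.UTF_8),
          StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Discards everything written after the checkpointed size.
  static void restore(Path file, long checkpointedSize)
  {
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
      channel.truncate(checkpointedSize);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Walks through checkpoint, failure, recovery and replay; returns the final contents.
  static String demo()
  {
    try {
      Path file = Files.createTempFile("output", ".txt");
      append(file, "r1\nr2\n");
      long checkpointedSize = Files.size(file);  // saved in the checkpoint
      append(file, "r3\n");                      // written after the checkpoint, then failure
      restore(file, checkpointedSize);           // recovery removes the uncheckpointed bytes
      append(file, "r3\n");                      // the replayed window writes r3 again
      String contents = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
      Files.delete(file);
      return contents;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args)
  {
    System.out.print(demo());
  }
}
```

Without the truncation, r3 would appear twice after the replay. With it, and with idempotent upstream processing, each record ends up in the file once.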
AbstractFileOutputOperator : Fault tolerance cont’d
To avoid dangling lease issues in HDFS
- Data is always written to temporary files.
- Temporary files are renamed to their final names when a file is finalized, that is, closed for writing.
- The user can choose when files get finalized. Rotation handles finalization automatically.
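The write-to-temporary-then-rename pattern can be sketched on a local file system as below. The ".tmp" suffix, the class name, and the use of java.nio instead of the HDFS client are assumptions for illustration only.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: data goes to a temporary file and is renamed when finalized.
class TempFileFinalizeSketch
{
  // Writes data to "<finalName>.tmp" in the given directory and returns that path.
  static Path writeTemp(Path dir, String finalName, String data)
  {
    try {
      Path temp = dir.resolve(finalName + ".tmp");
      return Files.write(temp, data.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Finalizes the file by renaming the temporary file to its final name.
  static Path finalizeFile(Path temp, String finalName)
  {
    try {
      return Files.move(temp, temp.resolveSibling(finalName));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Returns true if the final file appears only after finalization.
  static boolean demo()
  {
    try {
      Path dir = Files.createTempDirectory("out");
      Path temp = writeTemp(dir, "part-0.txt", "r1\nr2\n");
      boolean hiddenBefore = !Files.exists(dir.resolve("part-0.txt"));
      Path done = finalizeFile(temp, "part-0.txt");
      boolean visibleAfter = Files.exists(done) && !Files.exists(temp);
      Files.delete(done);
      Files.delete(dir);
      return hiddenBefore && visibleAfter;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args)
  {
    System.out.println(demo());
  }
}
```

Downstream consumers that look only at final names never see a file that is still open for writing.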
AbstractFileOutputOperator : Custom Implementation
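public class CustomFileOutputOperator<RECORD> extends AbstractFileOutputOperator<RECORD>
{
  public CustomFileOutputOperator()
  {
    //rotate a file when it reaches 1 MB or after 600 windows
    setMaxLength(1024 * 1024);
    setRotationWindows(600);
  }

  @Override
  protected String getFileName(RECORD tuple)
  {
    //file name (placeholder: assumes the concrete record type exposes getFileName())
    return tuple.getFileName();
  }

  @Override
  protected byte[] getBytesForTuple(RECORD tuple)
  {
    //bytes from record (placeholder: assumes the concrete record type exposes toBytes())
    return tuple.toBytes();
  }
}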
Acknowledgements
- Apex dev team
Munagala Ramanath
Pramod Immaneni
Sasha Parfenov
Thomas Weise
Timothy Farkas
- Meetup organizers
Amol Kekre
Qybare Pula
Ian Gomez
- Apache Apex Community
© 2016 DataTorrent
Resources
* https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
• Apache Apex website - http://apex.apache.org/
• Subscribe - http://apex.apache.org/community.html
• Download - http://apex.apache.org/downloads.html
• Twitter - @ApacheApex; Follow - https://twitter.com/apacheapex
• Facebook - https://www.facebook.com/ApacheApex/
• Meetup - http://www.meetup.com/topics/apache-apex
• Free Enterprise License for Startups - https://www.datatorrent.com/product/startup-accelerator/
We Are Hiring
• jobs@datatorrent.com
• Developers/Architects
• QA Automation Developers
• Information Developers
• Build and Release
• Community Leaders
