Supercharging your Apache OODT deployments with the Process Control System Chris A. Mattmann NASA JPL/Univ. Southern California/ASF [email_address]  November 9, 2011
Apache Member involved in OODT (VP, PMC), Tika (VP, PMC), Nutch (PMC), Incubator (PMC), SIS (Mentor), Lucy (Mentor) and Gora (Champion), MRUnit (Mentor), Airavata (Mentor) Senior Computer Scientist at NASA JPL in Pasadena, CA USA Software Architecture/Engineering Prof at Univ. of Southern California And you are?
Welcome to the Apache in Space! (OODT) Track
Agenda Overview of OODT and its history What is the Process Control System (PCS)? PCS Architecture Some hands on examples Health Monitoring Pedigree/Provenance Deploying PCS  Where we’re headed
Lessons from 90's era missions Increasing data volumes (exponential growth) Increasing complexity of instruments and algorithms Increasing availability of proxy/sim/ancillary data Increasing rate of technology refresh … all of this while NASA Earth Mission funding was decreasing. The response: a data system framework based on a standard architecture and reusable software components for supporting all future missions.
Enter OODT Object Oriented Data Technology http://oodt.apache.org Funded initially in 1998 by NASA's Office of Space Science Envisaged as a national software framework for sharing data across heterogeneous, distributed data repositories OODT is both an architecture and a reference implementation providing Data Production Data Distribution Data Discovery Data Access OODT is Open Source and available from the Apache Software Foundation
Apache OODT Originally funded by NASA to focus on distributed science data system environments science data generation data capture, end-to-end Distributed access to science data repositories by the community A set of building blocks/services to exploit common system patterns for reuse Supports deployment based on a rich information model Selected as a top level Apache Software Foundation project in January 2011 Runner up for NASA Software of the Year Used for a number of science data system activities in planetary, earth, biomedicine, astrophysics http://oodt.apache.org
Apache OODT Press
Why Apache and OODT? OODT is meant to be a set of tools to help build data systems It's not meant to be "turn key" It attempts to walk the boundary between bringing in capability vs. being overly rigid in science; each discipline/project extends it Apache is the elite open source community for software developers Less than 100 projects have been promoted to top level (Apache Web Server, Tomcat, Solr, Hadoop) Differs from other open source communities; it provides a governance and management structure
Apache OODT Community Includes PMC members from NASA JPL, Univ. of Southern California, Google, Children's Hospital Los Angeles (CHLA), Vdio, South African SKA Project Projects that are deploying it operationally at Decadal-survey recommended NASA Earth science missions, NIH, and NCI, CHLA, USC, South African SKA project Use in the classroom My graduate-level software architecture and search engines courses
OODT Framework and PCS (labels from the Object Oriented Data Technology Framework diagram): OODT/Science Web Tools, Archive Client, Profile XML Data, Data System 1, Data System 2, Archive Service, Profile Service, Product Service, Query Service, Bridge to External Services, Navigation Service, Other Service 1, Other Service 2, Process Control System (PCS), Catalog & Archive Service (CAS). CAS has recently become known as the Process Control System when applied to mission work.
Current PCS deployments Orbiting Carbon Observatory (OCO-2)  - spectrometer instrument NASA ESSP Mission, launch date: TBD 2013 PCS supporting Thermal Vacuum Tests, Ground-based instrument data processing, Space-based instrument data processing and Science Computing Facility EOM Data Volume: 61-81 TB in 3 yrs  Processing Throughput:  200-300 jobs/day NPP Sounder PEATE  - infrared sounder Joint NASA/NPOESS mission, launch date: October 2011 PCS supporting Science Computing Facility (PEATE) EOM Data Volume: 600 TB in 5 yrs Processing Throughput:  600 jobs/day QuikSCAT  - scatterometer NASA Quick-Recovery Mission, launch date: June 1999 PCS supporting instrument data processing and science analyst sandbox Originally planned as a 2-year mission SMAP   - high-res radar and radiometer NASA decadal study mission, launch date: 2014 PCS supporting radar instrument and science algorithm development testbed
Other PCS applications Bioinformatics National Institutes of Health (NIH) National Cancer Institute's (NCI) Early Detection Research Network (EDRN) Children's Hospital LA Virtual Pediatric Intensive Care Unit (VPICU) Technology Demonstration JPL's Active Mirror Telescope (AMT) White Sands Missile Range Earth Science NASA's Virtual Oceanographic Data Center (VODC) JPL's Climate Data eXchange (CDX) Astronomy and Radio Prototype work on MeerKAT with South Africans and KAT-7 telescope Discussions ongoing with NRAO Socorro (EVLA and ALMA)
PCS Core Components All Core components implemented as web services XML-RPC used to communicate between components Servers implemented in Java Clients implemented in Java, scripts, Python,  PHP and web-apps Service configuration implemented in ASCII and XML files
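Since the core components all speak XML-RPC, a client in any of these languages just needs to point at the service URL. As a minimal sketch (not from the slides; the host, port, and product type name are hypothetical, and the client class is the Java one shipped with 0.3-era OODT), talking to the File Manager looks roughly like this:

import java.net.URL;
import org.apache.oodt.cas.filemgr.structs.ProductType;
import org.apache.oodt.cas.filemgr.system.XmlRpcFileManagerClient;

public class FmPing {
  public static void main(String[] args) throws Exception {
    // Connect to a running File Manager over XML-RPC (9000 is the usual default port).
    XmlRpcFileManagerClient fm =
        new XmlRpcFileManagerClient(new URL("http://localhost:9000"));
    System.out.println("File Manager alive? " + fm.isAlive());
    // Ask the catalog how many products of a (hypothetical) type it holds.
    ProductType type = fm.getProductTypeByName("GenericFile");
    System.out.println(type.getName() + ": " + fm.getNumProducts(type) + " products");
  }
}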
Core Capabilities File Manager does Data Management Tracks all of the stored data, files & metadata Moves data to appropriate locations before and after initiating PGE runs and from staging area to controlled access storage Workflow Manager does Pipeline Processing Automates processing when all run conditions are ready Monitors and logs processing status Resource Manager does Resource Management Allocates processing jobs to computing resources Monitors and logs job & resource status Copies output data to storage locations where space is available Provides the means to monitor resource usage
PCS Ingestion Use Case
File/Metadata Capabilities
PCS Processing Use Case
Advanced Workflow Monitoring
Resource Monitoring
PCS Support for OCO OCO has three PCS deployments (installation of core components): Thermal Vacuum Instrument Testing deployment A PCS configuration was successfully deployed to process and analyze 100% of all L1a products generated during t-vac testing Space-based Operations & Ground-based FTS processing deployment Automatic processing of all raw instrument data through AOPD L2 algorithm Currently operational, our FTS deployment has processed over 4 TB of FTS spectrum and FTS L1a products for science analysis to date Science Computing Facility (SCF) deployment Supports all L2 full physics algorithm processing for science analysis and cal/val Supports scientists' investigations of alternative algorithms & data products Ability to adapt to change Scaled up the database catalog size When size grew > 1 million products, moved from Lucene to Oracle in a weekend! Had to repartition the FTS archive layout and structure 2 years into the mission Recataloged all 1 million FTS products and moved all data within a few weeks! Accommodated Ops/SCF hardware reconfiguration 1 year prior to launch Physically-shared and virtually-separated to virtually-shared and physically-separated at no extra cost!
OCO Hardware Environment with PCS Source: S. Neely, OCO Hardware Review
How do we deploy PCS for a mission? We implement the following mission-specific customizations Server Configuration Implemented in ASCII properties files Product metadata specification Implemented in XML policy files Processing Rules Implemented as Java classes and/or XML policy files PGE Configuration Implemented in XML policy files Compute Node Usage Policies Implemented in XML policy files Here's what we don't change All PCS Servers (e.g., File Manager, Workflow Manager, Resource Manager) Core data management, pipeline process management and job scheduling/submission capabilities File Catalog schema Workflow Model Repository Schema
Server and PGE Configuration
What is the Level of Effort for personalizing PCS? PCS Server Configuration – “days” Deployment specific Addition of New File (Product) Type – “days” Product metadata specification Metadata extraction (if applicable) Ingest Policy specification (if remote pull or remote push) Addition of a New PGE – (initial integration, ~ weeks) Policy specification Production rules PGE Initiation Estimates based on OCO and NPP experience
A typical PCS service (e.g., fm, wm, rm)
What's PCS configuration? Configuration follows typical Apache-like server configuration A set of properties and flags that are set in an ASCII text file that initialize the service at runtime Properties configure The underlying subsystems of the PCS service For file manager, properties configure e.g., Data transfer chunk size Whether or not the catalog database should use quoted strings for columns What subsystems are actually chosen (e.g., database versus Lucene, remote versus local data transfer) Can we see an example?
PCS File Manager Configuration File Set runtime properties Choose extension points Sensible defaults if you don’t want to change them
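The slide above points at an example File Manager configuration file. As a rough textual sketch (property names follow my recollection of the 0.3-era File Manager and should be treated as illustrative; the paths are invented), the extension-point choices and runtime properties look something like:

# Choose extension points: Lucene index vs. RDBMS catalog, local vs. remote data transfer
filemgr.catalog.factory=org.apache.oodt.cas.filemgr.catalog.LuceneCatalogFactory
filemgr.repository.factory=org.apache.oodt.cas.filemgr.repository.XMLRepositoryManagerFactory
filemgr.datatransfer.factory=org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory
# Runtime properties for the chosen subsystems (paths and sizes are hypothetical)
org.apache.oodt.cas.filemgr.catalog.lucene.idxPath=/usr/local/oodt/filemgr/catalog
org.apache.oodt.cas.filemgr.datatransfer.remote.chunkSize=1024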
What’s PCS policy? Policy is the convention in which missions define The products that should be ingested and managed The PGE default input parameters and their required data and metadata inputs (data flow) The PGE pre-conditions and execution sequence (control flow) The underlying hardware/resource environment in which PGEs should run and data/metadata should be captured What nodes are available? How much disk space is available? How should we allocate PGEs to nodes? Can we see an example?
PCS File Manager Policy for OCO FTS Products The scheme for laying out products in the archive The scheme for extracting metadata from products on the server side A name, ID, and description of each product
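The policy itself is XML. A trimmed sketch of a single product-type entry (element names from memory of the 0.3-era product-types.xml; the OCO-flavored IDs and values are invented for illustration) covers exactly the three items above:

<cas:producttypes xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
  <type id="urn:oco:FTSSpectrum" name="FTSSpectrum">
    <!-- where products of this type land in the archive -->
    <repository path="file:///data/archive/fts"/>
    <versioner class="org.apache.oodt.cas.filemgr.versioning.BasicVersioner"/>
    <description>Ground-based FTS spectrum product (illustrative)</description>
    <!-- server-side metadata extraction -->
    <metExtractors>
      <extractor class="org.apache.oodt.cas.filemgr.metadata.extractors.CoreMetExtractor">
        <configuration>
          <property name="nsAware" value="true"/>
          <property name="elements" value="ProductReceivedTime,ProductName,ProductId"/>
        </configuration>
      </extractor>
    </metExtractors>
  </type>
</cas:producttypes>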
PCS Workflow Manager Policy for  OCO FTS Products Define data flow Define control flow
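The workflow policy is also XML, split across events, workflows, and tasks files. A compressed sketch (illustrative IDs; the task property name is hypothetical) of how control flow and data flow get declared:

<!-- workflows.xml: control flow, i.e. the ordered task sequence for an event -->
<cas:workflows xmlns:cas="http://oodt.jpl.nasa.gov/2.0/cas">
  <workflow id="urn:oco:FTSL1aPipeline" name="FTS L1a Pipeline">
    <tasks>
      <task id="urn:oco:FTSIngestTask"/>
      <task id="urn:oco:FTSL1aTask"/>
    </tasks>
  </workflow>
</cas:workflows>

<!-- tasks.xml: data flow, i.e. each task's pre-conditions and configuration -->
<cas:tasks xmlns:cas="http://oodt.jpl.nasa.gov/2.0/cas">
  <task id="urn:oco:FTSL1aTask" name="FTSL1a"
        class="org.apache.oodt.cas.pge.PGETaskInstance">
    <conditions>
      <condition id="urn:oco:FTSRawAvailable"/>
    </conditions>
    <configuration>
      <!-- property name is illustrative; CAS-PGE is told where its own config file lives -->
      <property name="PGEConfigFilePath" value="/usr/local/oco/pge/policy/fts-l1a.xml"/>
    </configuration>
  </task>
</cas:tasks>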
PCS Overall Architecture What have we told you about so far? What are we going to tell you about now?
The concept of “production rules” Production rules are the common terminology for the mission-specific variation points in PGE pipeline processing and in product cataloging and archiving So far, we've discussed Configuration and Policy Policy is one piece of the puzzle in production rules
Production rule areas of concern (1) Policy defining file ingestion What metadata should PCS capture per product? Where do product files go? (2) Policy defining PGE data flow and control flow (3) PGE pre-conditions (4) File staging rules (5) Queries to the PCS file manager service 1-5 are implemented in PCS (depending on complexity) as either: Java Code XML files Some combination of Java code and XML files
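For the last of these, a query against the PCS file manager service from Java looks roughly like the sketch below (the product type and metadata element names are invented; the client and query classes are the 0.3-era file manager ones):

import java.net.URL;
import java.util.List;
import org.apache.oodt.cas.filemgr.structs.Product;
import org.apache.oodt.cas.filemgr.structs.ProductType;
import org.apache.oodt.cas.filemgr.structs.Query;
import org.apache.oodt.cas.filemgr.structs.TermQueryCriteria;
import org.apache.oodt.cas.filemgr.system.XmlRpcFileManagerClient;

public class FmQuery {
  public static void main(String[] args) throws Exception {
    XmlRpcFileManagerClient fm =
        new XmlRpcFileManagerClient(new URL("http://localhost:9000"));
    // Hypothetical product type and metadata element used as the query term.
    ProductType type = fm.getProductTypeByName("FTSSpectrum");
    Query query = new Query();
    query.addCriterion(new TermQueryCriteria("ProductionDateTime", "2011-11-09"));
    List<Product> hits = fm.query(query, type);
    for (Product p : hits) {
      System.out.println(p.getProductName() + " -> " + p.getProductId());
    }
  }
}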
PCS Task Wrapper aka CAS-PGE Gathers information from the file manager  Files to stage Input metadata (time ranges, flags, etc.) Builds input file(s) for the PGE Executes the PGE Invokes PCS crawler to ingest output product and metadata Notifies Workflow and Resource Managers about task (job) status Can optionally Generate PCS metadata files
PCS experience on recent missions How long did it take to build out the PCS configuration and policy? For OCO, once each pipeline was system engineered and PGEs were designed Level of Effort Configuration for file, resource, workflow manager: 1-2 days (1x cost) Policy for file, resource, workflow manager: 1-2 days per new PGE and new Product Type Production rules: 1 week per PGE For NPP Sounder PEATE, once each PGE was system engineered and designed Level of Effort Configuration for file, resource, workflow manager: 1-2 days (1x cost) Policy for file, resource, workflow manager: 1-2 days per new PGE and new Product Type Production rules: 1 week per PGE Total Level of Effort OCO: 1.0 FTEs over 5 years NPP Sounder PEATE: 2.5 FTEs over 3 years
Some relevant experience with NRAO: EVLA prototype Explore JPL data system expertise Leverage Apache OODT Leverage architecture experience Build on NRAO Socorro F2F given in April 2011 and Innovations in Data-Intensive Astronomy meeting in May 2011 Define achievable prototype Focus on EVLA summer school pipeline Heavy focus on CASApy, simple pipelining, metadata extraction, archiving of directory-based products Ideal for OODT system
Architecture
Pre-Requisites Apache OODT Version: 0.3-SNAPSHOT JDK 6, Maven 2.2.1 Stock Linux box
Installed Services File Manager http://ska-dc.jpl.nasa.gov:9000 Crawler http://ska-dc.jpl.nasa.gov:9020 Tomcat5 Curator: http://ska-dc.jpl.nasa.gov:8080/curator/ Browser: http://ska-dc.jpl.nasa.gov/ PCS Services: http://ska-dc.jpl.nasa.gov:8080/pcs/services/ CAS Product Services: http://ska-dc.jpl.nasa.gov:8080/fmprod/ Workflow Monitor: http://ska-dc.jpl.nasa.gov:8080/wmonitor/ Met Extractors /usr/local/ska-dc/pge/extractors (Cube, Cal Tables) PCS package /usr/local/ska-dc/pcs (scripts dir contains pcs_stat, pcs_trace, etc.)
Demonstration Use Case Run EVLA Spectral Line Cube generation First step is to ingest EVLARawDataOutput from Joe Then fire off the evlascube event Workflow manager writes CASApy script dynamically Via CAS-PGE CAS-PGE starts CASApy CASApy generates Cal tables and 2 Spectral Line Cube Images CAS-PGE ingests them into the File Manager Gravy: UIs, Cmd Line Tools, Services
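A rough Java sketch of the second step, firing the evlascube event at the Workflow Manager (the metadata key/value is invented; the live demo used the installed command-line tools and services listed above, and 9001 is the usual Workflow Manager default port):

import java.net.URL;
import org.apache.oodt.cas.metadata.Metadata;
import org.apache.oodt.cas.workflow.system.XmlRpcWorkflowManagerClient;

public class FireEvlascube {
  public static void main(String[] args) throws Exception {
    XmlRpcWorkflowManagerClient wm =
        new XmlRpcWorkflowManagerClient(new URL("http://localhost:9001"));
    // Starting metadata handed to the first task in the pipeline (hypothetical key).
    Metadata met = new Metadata();
    met.addMetadata("RawDataId", "EVLARawDataOutput-001");
    // Trigger every workflow mapped to the evlascube event.
    wm.sendEvent("evlascube", met);
  }
}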
Results: Workflow Monitor
Results: Data Portal
Results: Prod Browser
Results: PCS Trace Cmd Line
Results: PCS Stat Cmd Line
Results: PCS REST Services: Trace curl http://host/pcs/services/pedigree/report/flux_redo.cal
Results: PCS REST Service: Health curl http://host/pcs/services/health/report Read up on https://issues.apache.org/jira/browse/OODT-139 Read documentation on PCS services: https://cwiki.apache.org/confluence/display/OODT/OODT+REST+Services
Results: RSS feed of prods
Results: RDF of products
Where are we headed? OPSui work (OODT-157); you will have heard about this earlier in the day from Andrew Hart Improved PCS services Integrate more services into OODT-139 including curation services and workflow services for processing Workflow2 improvements described in OODT-215
Where are we headed? Integration with Hadoop Nextgen M/R http://svn.apache.org/repos/asf/oodt/branches/wengine-branch/ Integration with more catalogs Apache Gora, MongoDB Integration with GIS services GDAL, regridding, etc. Improved science algorithm wrapping
OODT Project Contact Info Learn more and track our progress at: http://oodt.apache.org WIKI: https://cwiki.apache.org/OODT/ JIRA: https://issues.apache.org/jira/browse/OODT Join the mailing list: [email_address] Chat on IRC: #oodt on irc.freenode.net Acknowledgements Key Members of the OODT teams: Chris Mattmann, Daniel J. Crichton, Steve Hughes, Andrew Hart, Sean Kelly, Sean Hardman, Paul Ramirez, David Woollard, Brian Foster, Dana Freeborn, Emily Law, Mike Cayanan, Luca Cinquini, Heather Kincaid Projects, Sponsors, Collaborators: Planetary Data System, Early Detection Research Network, Climate Data Exchange, Virtual Pediatric Intensive Care Unit, NASA SMAP Mission, NASA OCO-2 Mission, NASA NPP Sounder PEATE, NASA ACOS Mission, Earth System Grid Federation
Alright, I'll shut up now Any questions? THANK YOU! [email_address] @chrismattmann on Twitter
