Neo4j Data Loading with Kettle
Matt Casters
Chief Solutions Architect / Kettle Project Founder
Agenda
➢ What is Kettle?
➢ The Neo4j plugins
➢ Data loading performance tips
➢ Streaming data integration
➢ Metadata driven data possibilities
➢ Kettle Execution lineage in a graph
➢ Roadmap update
➢ Q&A
What is Kettle?
Kettle: Introduction
➢ Pentaho Data Integration from Hitachi Vantara
➢ One of the most widely used ETL tools
➢ Ready for the most demanding tasks
➢ Open source: Apache License 2.0
➢ Well maintained
➢ Large community, marketplace, ...
➢ Easy to embed, install, package, rebrand
➢ Download: SourceForge / Pentaho / 8.2 / PDI-CE
Kettle: where is it used?
➢ On tiny and enormous systems, real or virtual
➢ Very small computers, Raspberry Pi sized
➢ Your laptop or browser
➢ Locally or in the cloud
➢ On Hadoop clusters, VMs, Docker, Serverless, …
➢ At large and small companies
➢ In government
➢ In education
➢ In the Neo4j Solutions Reference Architecture
Kettle: Why is it used?
➢ Reduce costs!
➢ Answers the “build or buy?” question
[Chart: accumulated cost over time, comparing build, buy, and Kettle]
Kettle: Architecture
➢ Metadata driven, engine based:
○ No code generation
○ Define what you need to happen
→ GUI, Web, code, rules, …
○ Clear and transparent, self documenting
➢ Types of work:
○ Jobs for workflows
○ Transformations for parallel data streaming
Kettle: Design
➢ 100% Exposure of our engine through UI elements
➢ Everyone should be able to play along: plugins!
➢ We built integration points for others: run everywhere!
➢ Allow the user to avoid programming anything
➢ Allow the user to program anything: JavaScript, Java,
Groovy, RegEx, Rules, Python, Ruby, R, …
➢ Transparency wins: best in class logging, data lineage,
execution lineage, debugging, data previewing, row
sniff testing, …
Kettle: things of note
➢ SpoonGit: UI integration with git
➢ WebSpoon: web interface to the full Spoon UI
➢ Data Sets: build transformation unit tests
➢ Huge list of other plugins available, including from
Neo4j, on a marketplace, …
➢ Support for the latest technology stacks
➢ Project on GitHub has over 1,000 forks
https://github.com/pentaho/pentaho-kettle
Kettle: The Toolset
➢ Spoon: GUI
➢ Scripts
➢ Server(s)
➢ Java API & SDK
➢ Standard file format
➢ Plugin ecosystem
➢ Docker image(s)
➢ Documentation, books, ...
Neo4j Kettle Plugins
Neo4j Plugins: where to find?
➢ Started by the community, extended by Neo4j
➢ Releases/Download shortcut:
○ http://neo4j.kettle.be
➢ Project:
○ https://github.com/knowbi/knowbi-pentaho-pdi-neo4j-output
Give us feedback!
Neo4j Cypher
➢ For reading and writing
➢ Dynamic Cypher
➢ Batching and UNWIND
➢ Parameters
➢ Return values
➢ Helpers
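The "Batching and UNWIND" and "Parameters" points can be illustrated with a minimal sketch. It is not the plugin's implementation: the `Person` label, the `id` and `name` properties, and the function names are assumptions made for the example. One fixed Cypher text is paired with one `rows` parameter per batch, so each statement carries many rows instead of one.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical label and property names, chosen only for illustration.
MERGE_PEOPLE = (
    "UNWIND $rows AS row "
    "MERGE (p:Person {id: row.id}) "
    "SET p.name = row.name"
)


def batches(rows: List[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield consecutive slices of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def statements(rows: List[Dict], size: int) -> Iterator[Tuple[str, Dict]]:
    """Pair the fixed Cypher text with one parameter map per batch."""
    for chunk in batches(rows, size):
        yield MERGE_PEOPLE, {"rows": chunk}
```

Each yielded pair could be handed to whatever client executes Cypher. Because the statement text never changes, only the parameter map differs from batch to batch.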
Neo4j Output
➢ Easy node creation
➢ Create/Merge of ()-[]-()
➢ Batching and UNWIND
➢ Dynamic labels
Neo4j Graph Output
➢ Update (parts of) a graph
➢ Using a logical model
➢ Using field mapping
➢ Auto-generate Cypher
Check Neo4j Connection
➢ Job Entry (workflow)
➢ Validate DBs are up
➢ Used in error diagnostics
➢ Defensive setup
➢ Pessimistic approach
Neo4j Cypher Script
➢ Job Entry (workflow)
➢ Executes series of Cypher statements
Neo4j Kettle Plugins v4
Plugins v4
➢ Bulk loading steps
➢ Performance options
➢ Encrypted/obfuscated password in variables
➢ Bug fixes & UI improvements
Neo4j Generate CSVs
➢ Generate CSV files for Neo4j Import
➢ Generates appropriate header
➢ Handles escaping, quoting, …
➢ Outputs file names
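As a rough illustration of what such CSV files look like, the sketch below writes a node file and a relationship file with the header conventions of the Neo4j import tool (`:ID`, `:LABEL`, `:START_ID`, `:END_ID`, `:TYPE`). The `Person` label, the `KNOWS` type, and the function names are assumptions for the example; the step's actual output layout is not shown in the slides.

```python
import csv
import io
from typing import Dict, List, Tuple


def nodes_csv(people: List[Dict]) -> str:
    """Render person rows as a node file with an import-style header."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(["personId:ID", "name", ":LABEL"])
    for p in people:
        # The csv module quotes and escapes values that need it.
        writer.writerow([p["id"], p["name"], "Person"])
    return buf.getvalue()


def relationships_csv(edges: List[Tuple[int, int]]) -> str:
    """Render (start, end) id pairs as a relationship file."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    for start, end in edges:
        writer.writerow([start, end, "KNOWS"])
    return buf.getvalue()
```

A name containing commas or double quotes comes out quoted with doubled inner quotes, which is the kind of escaping the bullet above refers to.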
Neo4j Split Graph
➢ Splits a graph field into nodes and relationships
➢ Used for unique value calculation
Neo4j Importer
➢ Runs a neo4j-import command
➢ Accepts the filenames of CSV files
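A sketch of how such a command line might be assembled from incoming file names follows. The `--into`, `--nodes` and `--relationships` options are those of the Neo4j 3.x `neo4j-import` tool; newer releases ship `neo4j-admin import` with different options, so check the documentation for your version. The function name and the store directory are assumptions, and the command is only built, not executed.

```python
from typing import List


def import_command(
    store_dir: str, node_files: List[str], rel_files: List[str]
) -> List[str]:
    """Build an argument list for neo4j-import (Neo4j 3.x style options)."""
    cmd = ["neo4j-import", "--into", store_dir]
    for name in node_files:
        cmd += ["--nodes", name]
    for name in rel_files:
        cmd += ["--relationships", name]
    return cmd
```

Returning a list rather than a single string keeps file names with spaces intact if the list is later passed to a process launcher.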
Data loading performance tips
Pre-processing in Kettle
➢ Do work in Kettle that can be avoided in Neo4j
➢ Calculate unique nodes
➢ Do required data conversions
➢ Data cleaning
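"Calculate unique nodes" can be pictured as deduplicating rows on a key before anything is sent to Neo4j. The sketch below does this in plain Python rather than with Kettle steps; the function name, the trimming rule, and the sample field names are assumptions.

```python
from typing import Dict, Iterable, List


def unique_nodes(rows: Iterable[Dict], key: str) -> List[Dict]:
    """Keep the first row seen per key value, preserving input order.

    Key values are trimmed first, as a small stand-in for data cleaning.
    """
    seen: Dict[str, Dict] = {}
    for row in rows:
        cleaned = dict(row, **{key: str(row[key]).strip()})
        seen.setdefault(cleaned[key], cleaned)
    return list(seen.values())
```

With the duplicates gone, the load can create each node once instead of matching the same node repeatedly.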
Parallel loading & batching
➢ Parallel node creation
➢ Limit high parallelism in the general case
➢ UNWIND in Neo4j Cypher step
➢ Create option in Neo4j Output step
➢ Use larger batch sizes (>1000)
➢ Create indexes up-front or with the options
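The effect of larger batch sizes is easy to quantify: the number of statements sent is the row count divided by the batch size, rounded up. The helper below is only this arithmetic, with a name chosen for the example.

```python
def round_trips(row_count: int, batch_size: int) -> int:
    """Number of batched statements needed to send `row_count` rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return (row_count + batch_size - 1) // batch_size
```

For one million rows, a batch size of 1 means 1,000,000 statements, while a batch size of 5,000 means 200.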
Importing data
➢ Bulk loading with import is much faster
➢ A few orders of magnitude faster
➢ Collect all the data in CSV files
➢ Use the new steps to load
➢ Seamless path to incremental loads
Streaming data loads
Streaming options
➢ Micro-batching (every X minutes)
➢ Kafka, Event Hubs, Queues,... (never ending)
Streaming options
➢ Transformations can be never ending
➢ Any operation is possible
➢ Can collect data in other data platforms
➢ Is transactionally safe if it is supported (Kafka, …)
➢ Can be parallelized & scaled out
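Micro-batching over a never-ending source can be sketched as a generator that emits a batch when it is full or when its oldest event is old enough. This is a generic illustration, not how Kettle implements streaming; the function name, the parameters, and the injectable `clock` are assumptions.

```python
import time
from typing import Callable, Iterable, Iterator, List


def micro_batches(
    events: Iterable,
    max_size: int,
    max_age_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[List]:
    """Group a (possibly endless) stream into batches.

    A batch is emitted when it holds `max_size` events or when its first
    event is at least `max_age_seconds` old, whichever comes first.
    The age check only runs when a new event arrives.
    """
    batch: List = []
    started = 0.0
    for event in events:
        if not batch:
            started = clock()  # remember when this batch began
        batch.append(event)
        if len(batch) >= max_size or clock() - started >= max_age_seconds:
            yield batch
            batch = []
    if batch:  # flush the remainder if the stream does end
        yield batch
```

Each emitted batch could then be written in one transaction, for example with an UNWIND statement.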
Metadata driven data possibilities
Metadata FTW
➢ Kettle transformations & jobs are metadata
➢ ETL Metadata Injection: transformation templates
➢ Neo4j is a great metadata database
➢ Kettle can make use of this
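The idea behind a transformation template can be sketched without any Kettle API: one template, many source descriptions, one filled-in copy per source. The dictionary layout, the field names, and the rule that the first field is the key are all assumptions for illustration; real Kettle templates are transformation files filled in by the ETL Metadata Injection step.

```python
import copy
from typing import Dict, List

# A stand-in for a transformation template (hypothetical structure).
TEMPLATE = {
    "input": {"type": "csv", "filename": None, "fields": []},
    "output": {"type": "neo4j", "label": None, "key": None},
}


def inject(template: Dict, source: Dict) -> Dict:
    """Return a filled-in copy of the template for one source description."""
    filled = copy.deepcopy(template)  # never modify the shared template
    filled["input"]["filename"] = source["filename"]
    filled["input"]["fields"] = list(source["fields"])
    filled["output"]["label"] = source["label"]
    filled["output"]["key"] = source["fields"][0]  # assumption: first field is the key
    return filled


def plan(template: Dict, sources: List[Dict]) -> List[Dict]:
    """One concrete transformation description per source."""
    return [inject(template, s) for s in sources]
```

Adding a new file type then means adding one more source description, for instance a node read from Neo4j, rather than building another transformation by hand.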
Metadata driven loads
➢ Loading hundreds of types of files
➢ Processing data from hundreds of databases
➢ Automatic data standardization and normalisation
→ Massive time gains!
Metadata driven extracts
➢ Without hardcoded sources, selections and targets
➢ Sourcing selections from users, processes, ...
➢ Using the possibilities of the Kettle engine
→ Flexibility, performance, without coding
Kettle Execution Lineage
Kettle Logging Architecture
➢ Unique ID per execution
➢ Precise sourcing of logging records
➢ Very “graphy” data
[Diagram labels: Execution, Metadata, Impact, Parent/child relation]
The Kettle Neo4j Logging plugin
➢ Stores operational metadata in a graph
➢ https://github.com/mattcasters/kettle-neo4j-logging
➢ Tools
○ View execution information: log, duration, errors
○ Find error paths
○ Jump to error location
○ Find execution path of a step
○ Get time window: “since last successful execution”
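Two of these tools can be pictured with plain dictionaries standing in for the graph. The sketch assumes each execution record stores its parent's ID and that end times are ISO-8601 strings; it does not reflect the plugin's actual data model or queries, and all names are hypothetical.

```python
from typing import Dict, List, Optional


def error_path(executions: Dict[str, Dict], failed_id: str) -> List[str]:
    """Follow parent links from a failed execution up to the root.

    Returns the path ordered from the root down to the failed execution.
    """
    path: List[str] = []
    current: Optional[str] = failed_id
    while current is not None:
        path.append(current)
        current = executions[current]["parent"]
    return list(reversed(path))


def last_success_end(runs: List[Dict], name: str) -> Optional[str]:
    """End time of the most recent error-free run of `name`, if any.

    ISO-8601 strings in the same format sort chronologically.
    """
    ends = [r["end"] for r in runs if r["name"] == name and r["errors"] == 0]
    return max(ends) if ends else None
```

The second function gives the lower bound of a "since last successful execution" window; an incremental load would pick up everything changed after that time.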
Execution lineage in a graph
➢ Documents the execution process
○ Log text, metadata, times, ...
Roadmap update
Roadmap Neo4j plugin
➢ 25 releases in 2018
➢ Major 4.0 release next week
➢ Then:
○ New Neo4j Output step
○ More graph data type operations
○ <Insert YOUR suggestion!>
➢ Tuning options for Neo4j steps running in initial
Kettle Apache Beam implementation:
→ DataFlow, Spark, Flink, …
Roadmap Neo4j Logging plugin
➢ Generic impact information logging
➢ Store data lineage in Neo4j
➢ Git revision graph loading (new step)
➢ Storing and viewing unit testing results
➢ Operational “dashboard”
Q&A
