An overview of
Neo4j Internals

Tobias Lindaaker
Hacker @ Neo Technology

tobias@neotechnology.com
twitter: @thobe, #neo4j (@neo4j)
web: neo4j.org neotechnology.com
my web: thobe.org

Monday, May 21, 2012
Outline
       This is a rough structure of how the pieces of Neo4j fit together.

       This talk will not cover how disks/fs work; we just assume they do.
       Nor will it cover the “Core API”; you are assumed to know it.

       Traversals          Core API          Cypher

       Node/Relationship Object cache        Thread local diffs

       FS Cache                              HA

       Record files                          Transaction log

                          Disk(s)

Outline
      Let’s start at the bottom: the on-disk storage file layout.

Your graph on disk
       Simple sample graph. It all boils down to linked lists of fixed size records on disk.

       Properties are stored as a linked list of property records, each holding key+value.
       Each node/relationship references its first property record.
       Each node also references the first relationship in its relationship chain.
       Each relationship references its start and end node.
       It also references the prev/next relationship record for the start/end node respectively.

       [Figure: sample graph of four nodes connected by five KNOWS relationships]
              Node: Name: Alistair, Age: 34
              Node: Name: Tobias, Age: 27, Nationality: Swedish
              Node: Name: Ian, Age: 42
              Node: Name: Jim, Age: 37, Stuff: good

Your graph on disk
       [Figure: the same sample graph drawn as records. Each property (Name, Age, Nationality, Stuff) is a separate record chained from its node. Each of the five KNOWS relationship records carries four pointer slots labelled SP, EP, SN and EN, which going by the slide notes are the prev/next relationship records for the start/end node.]
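
The relationship chains described on this slide can be walked with a few lines of code. The following is a minimal sketch in Python, not Neo4j's implementation: the RelRecord class, the NO_REL end marker, the relationships_of function and the sample wiring are all made up for illustration, and self-loops are not handled.

```python
from dataclasses import dataclass

NO_REL = -1  # hypothetical marker for "no further record in this chain"


@dataclass
class RelRecord:
    """In-memory stand-in for one relationship record (assumed field names)."""
    first_node: int
    second_node: int
    first_prev: int = NO_REL   # previous relationship in first_node's chain
    first_next: int = NO_REL   # next relationship in first_node's chain
    second_prev: int = NO_REL  # previous relationship in second_node's chain
    second_next: int = NO_REL  # next relationship in second_node's chain


def relationships_of(node_id, first_rel_of_node, rel_store):
    """Yield the IDs of all relationships in the chain of node_id."""
    rel_id = first_rel_of_node[node_id]
    while rel_id != NO_REL:
        rel = rel_store[rel_id]
        yield rel_id
        # Follow the "next" pointer that belongs to the side this node is on.
        rel_id = rel.first_next if rel.first_node == node_id else rel.second_next


# Made-up wiring: rel 0 is 0->1, rel 1 is 0->2, rel 2 is 2->1.
rel_store = {
    0: RelRecord(0, 1, first_next=1, second_next=2),
    1: RelRecord(0, 2, first_prev=0, second_next=2),
    2: RelRecord(2, 1, first_prev=1, second_prev=0),
}
first_rel = {0: 0, 1: 0, 2: 1}  # what each node record would point at

print(list(relationships_of(0, first_rel, rel_store)))  # [0, 1]
```

Note that one relationship record sits in two chains at once, one per endpoint, which is why it needs a separate prev/next pair for each side.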

Store files
                       ๏ Node store
                       ๏ Relationship store
                         • Relationship type store
                       ๏ Property store
                         • Property key store
                         • (long) String store
                         • (long) Array store

                       Short string and array values are inlined in the property store;
                       long values are stored in separate store files.

Neo4j Storage Record Layout
Byte offsets within each record: start inclusive, end exclusive.

Node (9 bytes)
     inUse               0 -  1
     nextRelId           1 -  5
     nextPropId          5 -  9

Relationship (33 bytes)
     inUse               0 -  1
     firstNode           1 -  5
     secondNode          5 -  9
     relationshipType    9 - 13
     firstPrevRelId     13 - 17
     firstNextRelId     17 - 21
     secondPrevRelId    21 - 25
     secondNextRelId    25 - 29
     nextPropId         29 - 33

Relationship Type (5 bytes)
     inUse               0 -  1
     typeBlockId         1 -  5

Property (33 bytes)
     inUse               0 -  1
     type                1 -  3
     keyIndexId          3 -  5
     propBlock           5 - 29
     nextPropId         29 - 33

Property Index (9 bytes)
     inUse               0 -  1
     propCount           1 -  5
     keyBlockId          5 -  9

Dynamic Store (125 bytes)
     inUse               0 -  1
     next                1 -  5
     data                5 - 125

NeoStore (5 bytes)
     inUse               0 -  1
     datum               1 -  5
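
To make the fixed-size layout concrete, here is a small Python sketch that packs and reads node and relationship records with the field widths from the slide. Several things are assumptions that the slide does not state: big-endian byte order, signed 4-byte integers for every ID field, -1 as the "no record" value, and locating a record at record ID times record size. The helper name read_record is hypothetical.

```python
import struct

# Field order follows the slide; encodings are assumed (big-endian, 1-byte
# inUse flag, signed 4-byte integers for all other fields).
NODE_FIELDS = ("inUse", "nextRelId", "nextPropId")
REL_FIELDS = ("inUse", "firstNode", "secondNode", "relationshipType",
              "firstPrevRelId", "firstNextRelId",
              "secondPrevRelId", "secondNextRelId", "nextPropId")
NODE = struct.Struct(">B2i")          # 1 + 2*4 = 9 bytes
RELATIONSHIP = struct.Struct(">B8i")  # 1 + 8*4 = 33 bytes


def read_record(store_bytes, layout, fields, record_id):
    """Decode one record; assumes the record starts at id * record size."""
    offset = record_id * layout.size
    return dict(zip(fields, layout.unpack_from(store_bytes, offset)))


# A made-up node store holding two records (node 0 and node 1).
node_store = NODE.pack(1, 0, 4) + NODE.pack(1, 2, -1)
node = read_record(node_store, NODE, NODE_FIELDS, 1)
print(node["nextRelId"], node["nextPropId"])  # 2 -1
```

Because every record in a store has the same size, finding a record by ID under this assumption is plain arithmetic with no index lookup.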

Outline
     Next: The two levels of cache in Neo4j.
     The low level FS Cache for the record files.
     And the high level Object cache storing a structure more optimized for traversal.

The caches
        ๏ Filesystem cache:
                • Caches regions of the store files
                    (divides each store file into equally sized regions)
                • The cache holds a fixed number of regions for each file
                • Regions are evicted based on an LFU-like policy
                    (hit count vs. miss count, i.e. hit in non-cached region)
                • Default implementation of regions uses OS mmap
        ๏ Node/Relationship cache
                • Cache a version more optimized for traversals
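
One way to read the eviction rule above: every region keeps a counter, and a non-cached region replaces the least-used cached region once its miss count exceeds that region's hit count. The Python sketch below illustrates that reading only. It is not Neo4j's implementation; the RegionCache class, its method names and the exact swap rule are assumptions, and no actual file data or mmap is involved.

```python
class RegionCache:
    """Tracks which equally sized regions of one file are cached (sketch)."""

    def __init__(self, region_size, max_regions):
        self.region_size = region_size
        self.max_regions = max_regions  # fixed number of regions per file
        self.cached = {}                # region index -> hit count
        self.misses = {}                # non-cached region index -> miss count

    def region_of(self, offset):
        return offset // self.region_size

    def access(self, offset):
        """Record an access; return True if it hit a cached region."""
        region = self.region_of(offset)
        if region in self.cached:
            self.cached[region] += 1
            return True
        self.misses[region] = self.misses.get(region, 0) + 1
        if len(self.cached) < self.max_regions:
            self._admit(region)
        else:
            coldest = min(self.cached, key=self.cached.get)
            # Assumed rule: swap when the outsider is requested more often.
            if self.misses[region] > self.cached[coldest]:
                del self.cached[coldest]
                self._admit(region)
        return False

    def _admit(self, region):
        # Carry the miss count over as the starting hit count.
        self.cached[region] = self.misses.pop(region)


cache = RegionCache(region_size=1024, max_regions=2)
for offset in (0, 10, 2048, 5000, 5000):
    cache.access(offset)
print(sorted(cache.cached))  # [0, 4]
```

In the example, region 2 is cached first but is displaced by region 4 on the second access to offset 5000, since region 4 has by then been requested more often.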

What we put in cache
       The structure of the elements in the high level object cache.

       On disk most of the information is contained in the relationship records, with the nodes just referencing their first relationship. In the cache this is turned around: each node holds references to all its relationships. The relationships are simple, only holding their properties.

       The relationships for each node are grouped by RelationshipType to allow fast traversal of a specific type.

       All references (dotted arrows) are by ID, and traversals do indirect lookup through the cache.

       Node
              ID
              Relationship ID refs (grouped by type)
                     type 1    in:  R1  R2  ...  Rn
                               out: R1  R2  ...  Rn
                     type 2    in:  R1  R2  R3  ...  Rn
                               out: R1  ...  Rn
                     ...
              Properties: Key 1 -> Val 1, Key 2 -> Val 2, ..., Key n -> Val n

       Relationship
              ID, start, end, type
              Properties: Key 1 -> Val 1, Key 2 -> Val 2, ..., Key n -> Val n
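
The cached structure described on this slide could look roughly like the Python sketch below. It is illustrative only: the class names CachedNode and CachedRelationship, the neighbours helper and the sample data are hypothetical, and plain dicts stand in for the object cache.

```python
from dataclasses import dataclass, field


@dataclass
class CachedRelationship:
    id: int
    start: int   # node ID, not an object reference
    end: int     # node ID, not an object reference
    type: str
    properties: dict = field(default_factory=dict)


@dataclass
class CachedNode:
    id: int
    # type -> {"in": [relationship IDs], "out": [relationship IDs]}
    rels: dict = field(default_factory=dict)
    properties: dict = field(default_factory=dict)

    def add(self, rel):
        direction = "out" if rel.start == self.id else "in"
        by_dir = self.rels.setdefault(rel.type, {"in": [], "out": []})
        by_dir[direction].append(rel.id)


def neighbours(node_id, rel_type, direction, node_cache, rel_cache):
    """Node IDs reachable over one relationship type and direction."""
    ids = node_cache[node_id].rels.get(rel_type, {}).get(direction, [])
    result = []
    for rel_id in ids:
        rel = rel_cache[rel_id]  # indirect lookup by ID through the cache
        result.append(rel.end if direction == "out" else rel.start)
    return result


node_cache = {i: CachedNode(i) for i in (1, 2, 3)}
rel_cache = {}
for rel in (CachedRelationship(10, 1, 2, "KNOWS"),
            CachedRelationship(11, 3, 1, "KNOWS"),
            CachedRelationship(12, 1, 3, "LIKES")):
    rel_cache[rel.id] = rel
    node_cache[rel.start].add(rel)
    node_cache[rel.end].add(rel)

print(neighbours(1, "KNOWS", "out", node_cache, rel_cache))  # [2]
```

Grouping by type first means a traversal that only follows KNOWS never touches the ID lists of other relationship types.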

Outline
     So how do traversals work...

Traversals - how do they work?
       The surface layer, the one you interact with.

       ๏ RelationshipExpanders: given (a path to) a node, returns
                 Relationships to continue traversing from that node
       ๏ Evaluators: given (a path to) a node, returns whether to:
                • Continue traversing on that branch (i.e. expand) or not
                • Include (the path to) the node in the result set or not
       ๏ Then a projection to Path, Node, or Relationship applied to
                 each Path in the result set.
            ... but also:
       ๏ Uniqueness level: policy for when it is ok to revisit a node
                 that has already been visited
       PLACEHOLDER_REMOVE
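
How the pieces on this slide interact can be shown with a toy traversal loop. This is a sketch in Python and deliberately not the Neo4j traversal API: the traverse function, the callback signatures and the single "visit each node once" uniqueness switch are assumptions chosen to mirror the bullet points.

```python
from collections import deque


def traverse(start, expander, evaluator, unique_nodes=True):
    """Breadth-first traversal yielding paths as tuples of node IDs.

    expander(path)  -> iterable of nodes to continue to from path[-1]
    evaluator(path) -> (include, keep_going) pair of booleans
    """
    visited = {start}
    queue = deque([(start,)])
    while queue:
        path = queue.popleft()
        include, keep_going = evaluator(path)
        if include:
            yield path
        if not keep_going:
            continue  # prune this branch
        for nxt in expander(path):
            if unique_nodes:
                if nxt in visited:
                    continue  # uniqueness: never revisit a node
                visited.add(nxt)
            queue.append(path + (nxt,))


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["a"]}
paths = traverse(
    "a",
    expander=lambda path: graph[path[-1]],
    # include everything except the start node, expand up to depth 2
    evaluator=lambda path: (len(path) > 1, len(path) < 3),
)
print([path[-1] for path in paths])  # projection to nodes: ['b', 'c', 'd']
```

With unique_nodes switched off, a cyclic graph like this one only terminates because the evaluator stops expanding at a fixed depth.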

More on Traversals
        This is what happens under the hood.

        ๏ Fetch node data from cache - non-blocking access
                • If not in cache, retrieve from storage, into cache
                       ‣ If region is in FS cache: blocking but short duration access
                       ‣ If region is outside FS cache: blocking slower access
        ๏ Get relationships from cached node
                • If not fetched, retrieve from storage, by following chains
        ๏ Expand relationship(s) to end up on next node(s)
                • The relationship knows the node, no need to fetch it yet
        ๏ Evaluate
                • possibly emitting a Path into the result set
        ๏ Repeat
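
The "fetch from cache, else load from storage" and "get relationships, if not fetched" steps amount to a read-through cache with lazily loaded relationships. A minimal Python sketch under that reading follows. FakeStore, LazyNode and ObjectCache are hypothetical names, the store is an in-memory stand-in that only counts reads, and none of this reflects Neo4j's actual classes.

```python
class LazyNode:
    """Cached node whose relationships are loaded on first request."""

    def __init__(self, node_id, store):
        self.id = node_id
        self._store = store
        self._rels = None  # not fetched yet

    def relationships(self):
        if self._rels is None:
            # First request: the store would follow the on-disk chains here.
            self._rels = self._store.load_relationships(self.id)
        return self._rels


class FakeStore:
    """Stand-in for the record files; counts how often it is read."""

    def __init__(self, rels_by_node):
        self.rels_by_node = rels_by_node
        self.reads = 0

    def load_node(self, node_id):
        self.reads += 1
        return LazyNode(node_id, self)

    def load_relationships(self, node_id):
        self.reads += 1
        return list(self.rels_by_node.get(node_id, []))


class ObjectCache:
    def __init__(self, store):
        self.store = store
        self.nodes = {}

    def node(self, node_id):
        if node_id not in self.nodes:  # cache miss: blocking load
            self.nodes[node_id] = self.store.load_node(node_id)
        return self.nodes[node_id]


store = FakeStore({1: [(1, 2)], 2: []})
cache = ObjectCache(store)
cache.node(1).relationships()
cache.node(1).relationships()  # served from cache, no further reads
print(store.reads)  # 2
```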

Outline
     How is Cypher different? And how does it work?

Cypher - Just convenient traversal descriptions?
       ๏ Builds on the same infrastructure as Traversals - Expanders
                • but not on the full Traversal system
       ๏ Uses graph pattern matching for traversing the graph
                • Recursive matching with backtracking

            START x=... MATCH x-->y, x-->z, y-->z, z-->a-->b, z-->b

       [Figure legend]
              Red: pattern graph
              Blue: actual graph
              Green: start node
              Purple: matches
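
Recursive matching with backtracking can be illustrated on the pattern from this slide. The Python sketch below is a generic backtracking matcher, not Cypher's engine: the match function, the edge-list encoding of the pattern and the sample graph are assumptions, and unlike Cypher it places no uniqueness constraint on what different variables may bind to.

```python
def match(pattern, graph, bindings):
    """Yield every assignment of pattern variables to graph nodes.

    pattern  -- list of (src_var, dst_var) edges still to be matched
    graph    -- dict mapping node -> list of nodes it points to
    bindings -- dict of bound variables (must contain the start variable)
    """
    if not pattern:
        yield dict(bindings)  # every edge matched: emit a copy
        return
    # Pick the first pattern edge that touches an already bound variable.
    for i, (src, dst) in enumerate(pattern):
        if src in bindings or dst in bindings:
            break
    else:
        raise ValueError("pattern is not connected to a bound variable")
    rest = pattern[:i] + pattern[i + 1:]
    if src in bindings and dst in bindings:
        if bindings[dst] in graph.get(bindings[src], ()):
            yield from match(rest, graph, bindings)
    elif src in bindings:
        for candidate in graph.get(bindings[src], ()):
            bindings[dst] = candidate
            yield from match(rest, graph, bindings)
            del bindings[dst]  # backtrack
    else:
        for candidate, targets in graph.items():
            if bindings[dst] in targets:
                bindings[src] = candidate
                yield from match(rest, graph, bindings)
                del bindings[src]  # backtrack


# x-->y, x-->z, y-->z, z-->a-->b, z-->b as a list of edges
pattern = [("x", "y"), ("x", "z"), ("y", "z"),
           ("z", "a"), ("a", "b"), ("z", "b")]
# Made-up graph; nodes 6 and 7 are distractions that match nothing.
graph = {1: [2, 3, 6], 2: [3], 3: [4, 5], 4: [5], 6: [7]}

print(list(match(pattern, graph, {"x": 1})))
```

Starting from x bound to node 1, the matcher tries each neighbour for y and z in turn and undoes a binding as soon as a later pattern edge cannot be satisfied; only one assignment survives for this graph.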

Cypher - Just convenient traversal descriptions?
       ๏ Builds on the same infrastructure as Traversals - Expanders
                • but not on the full Traversal system
       ๏ Uses graph pattern matching for traversing the graph
              • Recursive MATCH x-->y, backtracking z-->a-->b, z-->b
            START x=...
                          matching with
                                        x-->z, y-->z,


                                                                Red: pattern graph
                                                                Blue: actual graph
                                                                Green: start node
                                                                Purple: matches




                                                                          14

Monday, May 21, 2012
Cypher - Just convenient traversal descriptions?
       ๏ Builds on the same infrastructure as Traversals - Expanders
                • but not on the full Traversal system
       ๏ Uses graph pattern matching for traversing the graph
              • Recursive MATCH x-->y, backtracking z-->a-->b, z-->b
            START x=...
                          matching with
                                        x-->z, y-->z,


                                                                Red: pattern graph
                                                                Blue: actual graph
                                                                Green: start node
                                                                Purple: matches




                                                                          14

Monday, May 21, 2012
Cypher - Just convenient traversal descriptions?
       ๏ Builds on the same infrastructure as Traversals - Expanders
                • but not on the full Traversal system
       ๏ Uses graph pattern matching for traversing the graph
              • Recursive MATCH x-->y, backtracking z-->a-->b, z-->b
            START x=...
                          matching with
                                        x-->z, y-->z,


                                                                Red: pattern graph
                                                                Blue: actual graph
                                                                Green: start node
                                                                Purple: matches




                                                                          14

Monday, May 21, 2012
What about gremlin?
        ๏ Gremlin is a third-party language, built by Marko Rodriguez of
                   TinkerPop (a group of people who like to hack on graphs)
        ๏ Originally based on the idea of using XPath to describe traversals:
                   ./HAS_CART/CONTAINS_ITEM/PURCHASED/PURCHASED
                   but bastardized to distinguish between nodes and relationships:
                   ./outE[label=HAS_CART]/inV
                    /outE[label=CONTAINS_ITEM]/inV
                    /inE[label=PURCHASED]/outV
                    /outE[label=PURCHASED]/inV
        ๏ XPath is not complete enough to express full algorithms; it needs a
                   host language, and Gremlin originally defined its own.
                   This later changed to Groovy as a more complete host language,
                   and XPath was abandoned in favor of method chaining
                   [ replace ‘/’ with ‘.’ ]

        Traversals are close to XPath, which is why XPath-like
        descriptions of traversals seemed like a good idea.

Gremlin compared to Cypher
        ๏ start me=node:people(name={myname})
              match me-[:HAS_CART]->cart-[:CONTAINS_ITEM]->item,
                    item<-[:PURCHASED]-user-[:PURCHASED]->recommendation
              return recommendation

        ๏ Cypher is declarative, describes what data to get - its shape
        ๏ Gremlin is imperative, prescribes how to get the data
        ๏ Cypher has more opportunities for optimization by the engine
        ๏ Gremlin can implement pagerank, Cypher can’t (yet?)
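To make the declarative/imperative contrast concrete, here is a rough imperative walk of the same shape as the Cypher query above, sketched in Python rather than in Gremlin. The `Graph` class, the `recommend` function and the sample data are assumptions made for illustration, not part of either language. Skipping the item itself is a simplification standing in for not reusing the same PURCHASED relationship.

```python
from collections import defaultdict


class Graph:
    """Tiny in-memory graph indexed by (node, relationship type)."""

    def __init__(self, relationships):
        self.outgoing = defaultdict(list)
        self.incoming = defaultdict(list)
        for start, rel_type, end in relationships:
            self.outgoing[(start, rel_type)].append(end)
            self.incoming[(end, rel_type)].append(start)


def recommend(graph, me):
    # Every hop is spelled out: this prescribes HOW to walk the graph.
    results = []
    for cart in graph.outgoing[(me, "HAS_CART")]:
        for item in graph.outgoing[(cart, "CONTAINS_ITEM")]:
            for user in graph.incoming[(item, "PURCHASED")]:
                for recommendation in graph.outgoing[(user, "PURCHASED")]:
                    if recommendation != item:   # simplification, see lead-in
                        results.append(recommendation)
    return results


# Hypothetical sample data.
shop = Graph([
    ("me", "HAS_CART", "cart"),
    ("cart", "CONTAINS_ITEM", "book"),
    ("bob", "PURCHASED", "book"),
    ("bob", "PURCHASED", "lamp"),
])
recommendations = recommend(shop, "me")
```

The loop nesting fixes the evaluation order, whereas the Cypher query only states the shape and leaves the order to the engine.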



Outline

       Transactions involve two parts: the (thread local) changes being
       made by an active transaction, and the transaction replay log
       for recovery.

                       Traversals      Core API       Cypher

                       Node/Relationship
                                             Thread local diffs
                         Object cache

                           FS Cache                         HA

                         Record files          Transaction log

                                       Disk(s)

Transaction Isolation
        ๏ Mutating operations are not written when performed
        ๏ They are stored in a thread confined transaction state object
        ๏ This prevents a thread from seeing uncommitted changes made
                   by the transactions of other threads
        ๏ When Transaction.finish() is invoked the transaction is either
                   committed or rolled back
        ๏ Rollback is simple: discard the transaction state object
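A minimal sketch of the thread-confined transaction state described above, in Python. The class `Store`, its methods and the helper `read_in_other_thread` are hypothetical names, not Neo4j's API. The `finish(success)` signature is a simplification, and commit here skips the transaction log entirely.

```python
import threading


class Store:
    """Committed state shared by all threads, plus a private change set per thread."""

    def __init__(self):
        self._committed = {}
        self._lock = threading.Lock()
        self._local = threading.local()

    def _changes(self):
        # Each thread lazily gets its own transaction state object.
        if not hasattr(self._local, "changes"):
            self._local.changes = {}
        return self._local.changes

    def set_property(self, key, value):
        self._changes()[key] = value          # buffered, not written to the store

    def get_property(self, key):
        changes = self._changes()
        if key in changes:
            return changes[key]               # a thread sees its own changes
        with self._lock:
            return self._committed.get(key)

    def finish(self, success):
        if success:
            with self._lock:
                self._committed.update(self._changes())
        self._local.changes = {}              # rollback: just discard the state


def read_in_other_thread(store, key):
    """Read a property from a freshly started thread."""
    result = []
    worker = threading.Thread(target=lambda: result.append(store.get_property(key)))
    worker.start()
    worker.join()
    return result[0]
```

Because the change set lives in thread-local storage, no other thread can reach it, so isolation of uncommitted changes needs no extra locking.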



Transactions & Durability
        ๏ Commit is:
                • Changes made in the transaction are collected as commands
                • Commands are sorted to get predictable update order
                       ‣This prevents concurrent readers from seeing inconsistent
                        data when the changes are applied to the store
                • Write changes (in sorted order) to the transaction log
                • Mark the transaction as committed in the log
                • Apply the changes (in sorted order) to the store files
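The commit steps above can be sketched as follows. This is a Python illustration under stated assumptions, not the actual Neo4j code: the log is a plain list, the store is a dict, commands are `("SET", key, value)` tuples, and sorting by key is an assumed ordering, since the slide only says the order is predictable.

```python
def commit(tx_id, changes, log, store):
    """Commit one transaction's changes to a log (list) and a store (dict)."""
    # 1. Collect the changes made in the transaction as commands.
    commands = [("SET", key, value) for key, value in changes.items()]
    # 2. Sort the commands to get a predictable update order (assumed: by key).
    commands.sort(key=lambda command: command[1])
    # 3. Write the commands, in sorted order, to the transaction log.
    for command in commands:
        log.append(("COMMAND", tx_id, command))
    # 4. Mark the transaction as committed in the log.
    log.append(("COMMIT", tx_id))
    # 5. Apply the commands, in the same order, to the store.
    for _, key, value in commands:
        store[key] = value


log, store = [], {}
commit(1, {"node2.name": "Ian", "node1.name": "Jim"}, log, store)
```

The log is written and marked committed before the store is touched, so a crash during step 5 can be repaired by replaying the log.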

Recovery
        ๏ Transaction commands dictate state, they don’t modify state
                • e.g. SET property "count" to 5
                • rather than ADD 1 to property "count"
        ๏ Thus: Applying the same command twice yields the same state
        ๏ Recovery simply replays all transactions since the last safe point
        ๏ If tx A mutates node1.name and tx B then also mutates
                   node1.name, the intermediate state doesn’t matter, because the
                   database is not recovered until all transactions have been replayed


Outline

       High Availability in Neo4j builds on top of the transaction replay.

                       Traversals      Core API       Cypher

                       Node/Relationship
                                             Thread local diffs
                         Object cache

                           FS Cache                         HA

                         Record files          Transaction log

                                       Disk(s)

Outline

       The transaction logs are shared between all instances in a
       High Availability setup; all other parts operate on the local
       data just like in the standalone case.

                               Traversals      Core API       Cypher

                       Local   Node/Relationship
                                                     Thread local diffs
                                 Object cache

                                   FS Cache                         HA

                                 Record files          Transaction log     Shared

                                               Disk(s)

HA - the parts to it:
        ๏ Based on streaming transactions between servers
        ๏ All transactions are committed through the master
                • Then (eventually) applied to the slaves
                • Eventuality defined by the update interval
                    or when synchronization is mandated by interaction
        ๏ When writing to a slave:
                • Locks coordinated through the master
                • Transaction data buffered on the slave,
                    applied first on the master to get a txid,
                    then applied with the same txid on the slave

Creating new Nodes and Relationships
        ๏ New Nodes/Relationships don’t need locks, so they don’t need a
                   transaction synced with master until the transaction commit
        ๏ They do need an ID that is unique and equal among all instances
        ๏ Each instance allocates IDs in blocks from the master, then assigns
                   them to new Nodes/Relationships locally

                • This batch allocation can be seen in (Enterprise) WebAdmin as
                    Node/Relationship counts jumping in steps of 1000
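A sketch of block-wise ID allocation in Python. The classes `IdMaster` and `Instance` are hypothetical; the block size of 1000 is taken from the step size mentioned above. In a real cluster the block request would go over the network and need to be thread safe; both are left out here.

```python
class IdMaster:
    """Hands out disjoint blocks of IDs."""

    def __init__(self, block_size=1000):
        self.block_size = block_size
        self.next_free = 0

    def allocate_block(self):
        start = self.next_free
        self.next_free += self.block_size
        return range(start, start + self.block_size)


class Instance:
    """Assigns IDs locally and asks the master only when its block runs out."""

    def __init__(self, master):
        self.master = master
        self.ids = iter(())               # no block allocated yet

    def next_id(self):
        try:
            return next(self.ids)
        except StopIteration:
            self.ids = iter(self.master.allocate_block())
            return next(self.ids)
```

Two instances never hand out the same ID, yet each one contacts the master only once per 1000 new nodes or relationships.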




HA synchronization points
        ๏ Transactions are uniquely identified by a monotonically increasing ID
        ๏ Every request from a slave carries the latest txid applied on that slave
        ๏ Responses from master send back a stream of transactions that
                   have occurred since then, along with the actual result
        ๏ Transaction is replayed just like when committed / recovered
        ๏ Nodes/Relationships touched during this application are evicted
                   from cache to avoid cache staleness
        ๏ Transaction commands are only sorted when created, before
                   stored/transmitted, thus consistency is preserved during all
                   application phases
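The request/response synchronization above can be sketched like this in Python. `Master`, `Slave`, `handle` and `call` are hypothetical names, transactions are reduced to dicts of property changes, and the network is replaced by a direct method call.

```python
class Master:
    def __init__(self):
        self.log = []      # log[i] holds the changes of the transaction with txid i + 1
        self.store = {}

    def commit(self, changes):
        self.log.append(dict(changes))
        self.store.update(changes)
        return len(self.log)              # txid of the new transaction

    def handle(self, slave_txid, request):
        # Respond with the actual result plus everything the slave has missed.
        result = request(self)
        missing = self.log[slave_txid:]
        return missing, result


class Slave:
    def __init__(self, master):
        self.master = master
        self.txid = 0                     # latest txid applied on this slave
        self.store = {}
        self.cache = {}

    def call(self, request):
        missing, result = self.master.handle(self.txid, request)
        for changes in missing:
            self.store.update(changes)    # replayed like a commit or recovery
            for key in changes:
                self.cache.pop(key, None) # evict to avoid cache staleness
            self.txid += 1
        return result
```

Every interaction with the master doubles as a synchronization point, so a slave that talks to the master often stays close to up to date.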

Locking semantics of HA
        ๏ To be granted a lock the slave must have the latest version of the
                   Node/Relationship it is trying to lock

                • This ensures consistency
                • The implementation of “Latest version of the Node/
                    Relationship” is “Latest version of the entire graph”

                • The slave must thus sync transactions from the master


Master election
        ๏ Each instance communicates/coordinates:
              • its latest transaction id (including the master id for that tx)
              • the id for that instance
              • (logical) clock value for when the txid was written
        ๏ Election chooses:
           1. The instance with highest txid
           2. IF multiple: The instance that was master for that tx
           3. IF unavailable: The instance with the lowest clock value
           4. IF multiple: The instance with the lowest id
        ๏ Election happens when the current master cannot be reached
              • Any instance can choose to re-elect
              • Each instance runs the election protocol individually
              • Notify others when election chooses new master
        ๏ When elected, the new master broadcasts to all instances,
             forcing them to bind to the new master
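The four election rules can be sketched as a pure function, here in Python. The `Candidate` record and its field names are hypothetical; the sketch covers only the choice itself, not the coordination or the broadcast that follows it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    instance_id: int
    latest_txid: int
    master_for_tx: int   # id of the instance that was master for latest_txid
    clock: int           # logical clock value when latest_txid was written


def elect(candidates):
    # 1. The instance with the highest txid.
    highest = max(c.latest_txid for c in candidates)
    leaders = [c for c in candidates if c.latest_txid == highest]
    if len(leaders) == 1:
        return leaders[0]
    # 2. If multiple: the instance that was master for that tx.
    for candidate in leaders:
        if candidate.instance_id == candidate.master_for_tx:
            return candidate
    # 3. If unavailable: the lowest clock value; 4. if multiple: the lowest id.
    return min(leaders, key=lambda c: (c.clock, c.instance_id))
```

Since the function depends only on the communicated values, every instance that runs it on the same input arrives at the same master.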
Thank you for listening!

                                    tobias@neotechnology.com
 Tobias Lindaaker                   twitter: @thobe, #neo4j (@neo4j)
                                    web: neo4j.org neotechnology.com
 Hacker @ Neo Technology            my web: thobe.org




An overview of Neo4j Internals

  • 1. An overview of Neo4j Internals [email protected] Tobias Lindaaker twitter: @thobe, #neo4j (@neo4j) web: neo4j.org neotechnology.com Hacker @ Neo Technology my web: thobe.org Monday, May 21, 2012
  • 2. Outline This is a rough structure of how the pieces of Neo4j fit together. This talk will not cover how disks/fs works, we just assume it does. Traversals Core API Cypher Nor will it cover the “Core API”, you are assumed to know it. Node/Relationship Thread local diffs Object cache FS Cache HA Record files Transaction log Disk(s) 2 Monday, May 21, 2012
  • 3. Outline Traversals Core API Cypher Node/Relationship Thread local diffs Object cache FS Cache HA Let’s start at the bottom: the on disk Record files Transaction log storage file layout. Disk(s) 3 Monday, May 21, 2012
  • 4. Simple sample graph. It all boils down to linked lists of fixed size records on disk. Your graph on disk Properties are stored as a linked list of property records, each holding key+value. Each node/relationship references its first property record. The Nodes also reference the first node in its relationship chain. Each Relationship references its start and Name: Alistair end node. Age: 34 KNOWS It also references the prev/next relationship record for the start/end node respectively Name: Tobias Age: 27 Nationality: Swedish KNOWS KNOWS KNOWS Name: Ian Age: 42 Name: Jim Age: 37 KNOWS Stuff: good 4 Monday, May 21, 2012
  • 5. Simple sample graph. It all boils down to linked lists of fixed size records on disk. Your graph on disk Properties are stored as a linked list of property records, each holding key+value. Each node/relationship references its first Name property record. The Nodes also reference the first node in its Alistair relationship chain. Each Relationship references its start and end node. Name KNOWS It also references the prev/next relationship record for the start/end node respectively Tobias Age 34 Age 27 Nationality KNOWS Swedish KNOWS KNOWS Name Jim Age Name 37 Stuff Ian KNOWS Age good 42 4 Monday, May 21, 2012
  • 6. Simple sample graph. It all boils down to linked lists of fixed size records on disk. Your graph on disk Properties are stored as a linked list of property records, each holding key+value. Each node/relationship references its first Name property record. The Nodes also reference the first node in its Alistair relationship chain. Each Relationship references its start and end node. Name KNOWS It also references the prev/next relationship record for the start/end node respectively Tobias Age 34 Age 27 Nationality KNOWS Swedish KNOWS KNOWS Name Jim Age Name 37 Stuff Ian KNOWS Age good 42 4 Monday, May 21, 2012
  • 7. Simple sample graph. It all boils down to linked lists of fixed size records on disk. Your graph on disk Properties are stored as a linked list of property records, each holding key+value. Each node/relationship references its first Name SP EP property record. The Nodes also reference the first node in its Alistair relationship chain. SN EN Each Relationship references its start and end node. Name KNOWS It also references the prev/next relationship record for the start/end node respectively Tobias Age 34 Age 27 SP EP SP EP SN EN Nationality SN EN KNOWS Swedish KNOWS SP EP SN EN KNOWS Name SP EP Jim Age SN EN Name 37 Stuff Ian KNOWS Age good 42 4 Monday, May 21, 2012
  • 8. Simple sample graph. It all boils down to linked lists of fixed size records on disk. Your graph on disk Properties are stored as a linked list of property records, each holding key+value. Each node/relationship references its first Name SP EP property record. The Nodes also reference the first node in its Alistair relationship chain. SN EN Each Relationship references its start and end node. Name KNOWS It also references the prev/next relationship record for the start/end node respectively Tobias Age 34 Age 27 SP EP SP EP SN EN Nationality SN EN KNOWS Swedish KNOWS SP EP SN EN KNOWS Name SP EP Jim Age SN EN Name 37 Stuff Ian KNOWS Age good 42 4 Monday, May 21, 2012
  • 9. Simple sample graph. It all boils down to linked lists of fixed size records on disk. Your graph on disk Properties are stored as a linked list of property records, each holding key+value. Each node/relationship references its first Name SP EP property record. The Nodes also reference the first node in its Alistair relationship chain. SN EN Each Relationship references its start and end node. Name KNOWS It also references the prev/next relationship record for the start/end node respectively Tobias Age 34 Age 27 SP EP SP EP SN EN Nationality SN EN KNOWS Swedish KNOWS SP EP SN EN KNOWS Name SP EP Jim Age SN EN Name 37 Stuff Ian KNOWS Age good 42 4 Monday, May 21, 2012
  • 10. Simple sample graph. It all boils down to linked lists of fixed size records on disk. Your graph on disk Properties are stored as a linked list of property records, each holding key+value. Each node/relationship references its first Name SP EP property record. The Nodes also reference the first node in its Alistair relationship chain. SN EN Each Relationship references its start and end node. Name KNOWS It also references the prev/next relationship record for the start/end node respectively Tobias Age 34 Age 27 SP EP SP EP SN EN Nationality SN EN KNOWS Swedish KNOWS SP EP SN EN KNOWS Name SP EP Jim Age SN EN Name 37 Stuff Ian KNOWS Age good 42 4 Monday, May 21, 2012
  • 11. Store files ๏Node store ๏Relationship store • Relationship type store ๏Property store • Property key store • (long) String store Short string and array values are inlined in the • property store, long values are stored in (long) Array store separate store files. 5 Monday, May 21, 2012
  • 12. Neo4j Storage Record Layout Node (9 bytes) inUse nextRelId nextPropId 1 5 9 Relationship (33 bytes) inUse firstNode secondNode relationshipType firstPrevRelId firstNextRelId secondPrevRelId secondNextRelId nextPropId 1 5 9 13 17 21 25 29 33 Relationship Type (5 bytes) inUse typeBlockId 1 5 Property (33 bytes) inUse type keyIndexId propBlock nextPropId 1 3 5 29 33 Property Index (9 bytes) inUse propCount keyBlockId 1 5 9 Dynamic Store (125 bytes) inUse next data 1 5 NeoStore (5 bytes) inUse datum 1 5 Monday, May 21, 2012
  • 13. Outline Traversals Core API Cypher Next: The t wo levels of cache Node/Relationship in Neo4j. Thread local diffs The low level FS Cache for the Object cache record files. And the high level Object cache storing a structure more optimized for traversal. FS Cache HA Record files Transaction log Disk(s) 7 Monday, May 21, 2012
  • 14. The caches ๏ Filesystem cache: • Caches regions store file intofiles sized regions) (divides each of the store equally • The cache holds a fixed number of regions for each file • Regions are evicted based on ahit in non-cached region) (hit count vs. miss count, i.e. LFU-like policy • Default implementation of regions uses OS mmap ๏ Node/Relationship cache • Cache a version more optimized for traversals 8 Monday, May 21, 2012
  • 15. What we put in cache ID Relationship ID refs in: R1 R2 ... Rn The structure of the elements in the high level type 1 object cache. out R1 R2 ... Rn On disk most of the information is contained in: R1 R2 R3 ... Rn in the relationship records, with the nodes just type 2 referencing their first relationship. In the out R1 ... Rn Node cache this is turned around: the nodes hold references to all its relationships. The ... (grouped by type) relationships are simple, only holding its properties. The relationships for each node is grouped by RelationshipType to allow fast traversal of a specific type. Key 1 Key 2 ... Key n All references (dotted arrows) are by ID, and traversals do indirect lookup through the cache. Val 1 Val 2 ID start end type Val n Relationship Key 1 Key 2 ... Key n Val 1 Val 2 Val n 9 Monday, May 21, 2012
  • 16. Outline So how do traversals work... Traversals Core API Cypher Node/Relationship Thread local diffs Object cache FS Cache HA Record files Transaction log Disk(s) 10 Monday, May 21, 2012
  • 17. Traversals - how do they work? ๏ RelationshipExpanders: given (a path to) a node, returns The surface layer, the you Relationships to continue traversing from that node interact with. ๏ Evaluators: given (a path to) a node, returns whether to: • Continue traversing on that branch (i.e. expand) or not • Include (the path to) the node in the result set or not ๏ Then a projection to Path, Node, or Relationship applied to each Path in the result set. ... but also: ๏ Uniqueness level: policy for when it is ok to revisit a node that has already been visited ๏ Implemented on top of the Core API 11 Monday, May 21, 2012
  • 18. More on Traversals (This is what happens under the hood.) ๏ Fetch node data from cache - non-blocking access • If not in cache, retrieve from storage, into cache ‣ If region is in FS cache: blocking but short duration access ‣ If region is outside FS cache: blocking slower access ๏ Get relationships from cached node • If not fetched, retrieve from storage, by following chains ๏ Expand relationship(s) to end up on next node(s) • The relationship knows the node, no need to fetch it yet ๏ Evaluate • possibly emitting a Path into the result set ๏ Repeat
  • 19. Outline: How is Cypher different? And how does it work? (Diagram layers, top to bottom: Traversals, Core API, Cypher | Node/Relationship Object cache, Thread local diffs | FS Cache, HA | Record files, Transaction log | Disk(s))
  • 20. Cypher - Just convenient traversal descriptions? ๏ Builds on the same infrastructure as Traversals - Expanders • but not on the full Traversal system ๏ Uses graph pattern matching for traversing the graph • Recursive matching with backtracking. Example: START x=... MATCH x-->y, x-->z, y-->z, z-->a-->b, z-->b (Diagram legend: Red: pattern graph, Blue: actual graph, Green: start node, Purple: matches)
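Recursive matching with backtracking can be sketched as follows. This is not Cypher's matcher; it is a small illustration under assumptions: the pattern is a list of directed edges between variable names, each edge's source variable is already bound when the edge is reached, and two variables are allowed to bind to the same node. The function name `match` is hypothetical.

```python
def match(pattern, graph, bindings):
    """Yield every assignment of pattern variables to graph nodes.

    pattern  -- list of (source_var, target_var) edges, ordered so that
                source_var is bound before the edge is processed
    graph    -- dict: node -> set of successor nodes
    bindings -- dict: variable -> node, seeded with the start node(s)
    """
    if not pattern:
        yield dict(bindings)  # full match: emit a copy
        return
    (src, dst), rest = pattern[0], pattern[1:]
    successors = graph.get(bindings[src], set())
    if dst in bindings:
        # Both ends bound: the edge must exist, otherwise this branch fails.
        if bindings[dst] in successors:
            yield from match(rest, graph, bindings)
        return
    for candidate in sorted(successors):
        bindings[dst] = candidate  # tentative binding
        yield from match(rest, graph, bindings)
        del bindings[dst]          # backtrack
```

The slide's pattern `x-->y, x-->z, y-->z, z-->a-->b, z-->b` becomes the edge list `[("x","y"), ("x","z"), ("y","z"), ("z","a"), ("a","b"), ("z","b")]`, with `{"x": start_node}` as the initial bindings.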
  • 29. What about gremlin? ๏ gremlin is a third party language, built by Marko Rodriguez of Tinkerpop (a group of people who like to hack on graphs) ๏ Originally based on the idea of using xpath to describe traversals: ./HAS_CART/CONTAINS_ITEM/PURCHASED/PURCHASED but bastardized to distinguish between nodes and relationships: ./outE[label=HAS_CART]/inV/outE[label=CONTAINS_ITEM]/inV/inE[label=PURCHASED]/outV/outE[label=PURCHASED]/inV (Traversals are close to xpath, which is why xpath-like descriptions of traversals seemed like a good idea.) ๏ xpath is not complete enough to express full algorithms, it needs a host language. gremlin originally defined its own. This changed to using Groovy as a more complete host language, and xpath was abandoned in favor of method chaining [ replace ‘/’ with ‘.’ ]
  • 30. Gremlin compared to Cypher ๏ start me=node:people(name={myname}) match me-[:HAS_CART]->cart-[:CONTAINS_ITEM]->item, item<-[:PURCHASED]-user-[:PURCHASED]->recommendation return recommendation ๏ Cypher is declarative, describes what data to get - its shape ๏ Gremlin is imperative, prescribes how to get the data ๏ Cypher has more opportunities for optimization by the engine ๏ Gremlin can implement pagerank, Cypher can’t (yet?)
  • 31. Outline: Transactions involve two parts: the (thread local) changes being done by an active transaction, and the transaction replay log for recovery. (Diagram layers, top to bottom: Traversals, Core API, Cypher | Node/Relationship Object cache, Thread local diffs | FS Cache, HA | Record files, Transaction log | Disk(s))
  • 32. Transaction Isolation ๏ Mutating operations are not written when performed ๏ They are stored in a thread confined transaction state object ๏ This prevents other threads from seeing uncommitted changes from the transactions of other threads ๏ When Transaction.finish() is invoked the transaction is either committed or rolled back ๏ Rollback is simple: discard the transaction state object
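The thread-confined state object can be illustrated with Python's `threading.local`. This is a sketch under assumptions, not the Neo4j API: the `Database` class, its flat key/value store, and the `finish(success)` signature are hypothetical stand-ins for the slide's `Transaction.finish()`.

```python
import threading


class Database:
    """Toy store where uncommitted writes live in thread-local state."""

    def __init__(self):
        self.store = {}                  # committed data, visible to all threads
        self._local = threading.local()  # per-thread transaction state

    def begin(self):
        self._local.changes = {}

    def set_property(self, key, value):
        # Not written to the store when performed, only recorded.
        self._local.changes[key] = value

    def get_property(self, key):
        # A thread sees its own pending changes on top of committed data.
        changes = getattr(self._local, "changes", {})
        if key in changes:
            return changes[key]
        return self.store.get(key)

    def finish(self, success):
        changes = self._local.changes
        self._local.changes = {}         # rollback is just discarding the state
        if success:
            self.store.update(changes)
```

Because the pending changes are never reachable from another thread, no reader can observe them before `finish(success=True)` applies them.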
  • 33. Transactions & Durability ๏ Commit is: • Changes made in the transaction are collected as commands • Commands are sorted to get predictable update order ‣ This prevents concurrent readers from seeing inconsistent data when the changes are applied to the store • Write changes (in sorted order) to the transaction log • Mark the transaction as committed in the log • Apply the changes (in sorted order) to the store files
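The commit sequence above can be sketched as follows. The log entry layout (`"start"`, `"command"`, `"commit"` tuples), the `commit` function, and the in-memory list and dict standing in for the log and store files are all assumptions for illustration, not the on-disk format.

```python
def commit(tx_id, changes, log, store):
    """Commit a transaction: log first, then apply.

    changes -- dict mapping a sortable key, e.g. ("node", 1, "name"), to a value
    log     -- list standing in for the transaction log
    store   -- dict standing in for the store files
    """
    # 1. Collect the changes as commands, sorted for a predictable order.
    commands = sorted(changes.items(), key=lambda kv: kv[0])
    # 2. Write the commands to the transaction log.
    log.append(("start", tx_id))
    for key, value in commands:
        log.append(("command", tx_id, key, value))
    # 3. Mark the transaction as committed in the log.
    log.append(("commit", tx_id))
    # 4. Only now apply the changes, in the same order, to the store.
    for key, value in commands:
        store[key] = value
```

Writing the commit marker before touching the store is what makes recovery possible: anything marked committed in the log can be replayed.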
  • 34. Recovery ๏ Transaction commands dictate state, they don’t modify state • i.e. SET property "count" to 5 • rather than ADD 1 to property "count" ๏ Thus: Applying the same command twice yields the same state ๏ Recovery simply replays all transactions since the last safe point ๏ If tx A mutates node1.name, and then tx B also mutates node1.name, that doesn’t matter, because the database is not considered recovered until all transactions have been replayed
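Why replay is safe can be shown with a small sketch. It assumes a hypothetical log of `("command", tx_id, key, value)` and `("commit", tx_id)` tuples that starts at the last safe point; the `recover` function is illustrative only.

```python
def recover(log, store):
    """Replay all committed transactions from the log onto the store.

    Commands dictate state (SET key to value), so replaying them on a store
    that already has some of them applied gives the same final state.
    """
    committed = {entry[1] for entry in log if entry[0] == "commit"}
    for entry in log:
        if entry[0] == "command" and entry[1] in committed:
            _, _, key, value = entry
            store[key] = value
    return store
```

During the replay the store may briefly hold an older value (tx A's name after tx B's had already been applied), which is why the database only counts as recovered once the whole log has been replayed.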
  • 35. Outline: High Availability in Neo4j builds on top of the transaction replay. (Diagram layers, top to bottom: Traversals, Core API, Cypher | Node/Relationship Object cache, Thread local diffs | FS Cache, HA | Record files, Transaction log | Disk(s))
  • 36. Outline: The transaction logs are shared between all instances in a High Availability setup, all other parts operate on the local data just like in the standalone case. (Diagram: Local: Traversals, Core API, Cypher | Node/Relationship Object cache, Thread local diffs | FS Cache, HA | Record files | Disk(s). Shared: Transaction log.)
  • 37. HA - the parts to it: ๏ Based on streaming transactions between servers ๏ All transactions are committed through the master • Then (eventually) applied to the slaves • Eventuality is defined by the update interval, or when synchronization is mandated by interaction ๏ When writing to a slave: • Locks coordinated through the master • Transaction data buffered on the slave, applied first on the master to get a txid, then applied with the same txid on the slave
  • 38. Creating new Nodes and Relationships ๏ New Nodes/Relationships don’t need locks, so they don’t need a transaction synced with master until the transaction commit ๏ They do need an ID that is unique and equal among all instances ๏ Each instance allocates IDs in blocks from the master, then assigns them to new Nodes/Relationships locally • This batch allocation can be seen in (Enterprise) WebAdmin as Node/Relationship counts jumping in steps of 1000
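Block-wise ID allocation can be sketched like this. The class names `MasterIdAllocator` and `InstanceIdGenerator` are hypothetical; the default block size of 1000 follows the step size mentioned on the slide, and the network round trip to the master is reduced to a method call.

```python
class MasterIdAllocator:
    """Hands out disjoint blocks of IDs (lives on the master)."""

    def __init__(self, block_size=1000):
        self.block_size = block_size
        self.next_free = 0

    def allocate_block(self):
        start = self.next_free
        self.next_free += self.block_size
        return range(start, start + self.block_size)


class InstanceIdGenerator:
    """Assigns IDs locally and asks the master only when a block runs out."""

    def __init__(self, master):
        self.master = master
        self.block = iter(())  # empty until the first request

    def next_id(self):
        try:
            return next(self.block)
        except StopIteration:
            self.block = iter(self.master.allocate_block())
            return next(self.block)
```

Two instances creating nodes concurrently never collide, at the cost of gaps in the ID space when a block is only partly used.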
  • 39. HA synchronization points ๏ Transactions are uniquely identified by monotonically increasing ID ๏ All Requests from slave send the current latest txid on that slave ๏ Responses from master send back a stream of transactions that have occurred since then, along with the actual result ๏ Transaction is replayed just like when committed / recovered ๏ Nodes/Relationships touched during this application are evicted from cache to avoid cache staleness ๏ Transaction commands are only sorted when created, before stored/transmitted, thus consistency is preserved during all application phases
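The piggybacked synchronization can be sketched as a request/response pair. Everything here is a simplified assumption: the `Master` and `Slave` classes, transactions as lists of `(key, value)` commands, and a callable standing in for the actual request.

```python
class Master:
    def __init__(self):
        self.transactions = []  # transactions[i] has txid i + 1

    def commit(self, commands):
        self.transactions.append(commands)
        return len(self.transactions)  # monotonically increasing txid

    def handle(self, slave_txid, request):
        # Respond with everything committed after the slave's latest txid,
        # along with the actual result of the request.
        missing = list(
            enumerate(self.transactions[slave_txid:], start=slave_txid + 1)
        )
        return missing, request()


class Slave:
    def __init__(self, master):
        self.master = master
        self.last_txid = 0
        self.store = {}
        self.cache = {}

    def call(self, request):
        # Every request carries the slave's current latest txid.
        stream, result = self.master.handle(self.last_txid, request)
        for txid, commands in stream:
            for key, value in commands:
                self.store[key] = value    # replay, as in commit/recovery
                self.cache.pop(key, None)  # evict to avoid staleness
            self.last_txid = txid
        return result
```

Since every response brings the slave up to date, any interaction with the master doubles as a synchronization point.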
  • 40. Locking semantics of HA ๏ To be granted a lock the slave must have the latest version of the Node/Relationship it is trying to lock • This ensures consistency • The implementation of “Latest version of the Node/Relationship” is “Latest version of the entire graph” • The slave must thus sync transactions from the master
  • 41. Master election ๏ Each instance communicates/coordinates: • its latest transaction id (including the master id for that tx) • the id for that instance • (logical) clock value for when the txid was written ๏ Election chooses: 1. The instance with highest txid 2. IF multiple: The instance that was master for that tx 3. IF unavailable: The instance with the lowest clock value 4. IF multiple: The instance with the lowest id ๏ Election happens when the current master cannot be reached • Any instance can choose to re-elect • Each instance runs the election protocol individually • Notify others when election chooses new master ๏ When elected, the new master broadcasts to all instances, forcing them to bind to the new master
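The four election rules can be written as one deterministic function, which is what lets each instance run the protocol individually and arrive at the same answer. The `Instance` record and the `elect` function are hypothetical names for this sketch; how the values are coordinated between instances is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    id: int
    latest_txid: int
    master_for_tx: int  # id of the master that committed latest_txid
    clock: int          # logical clock value when latest_txid was written


def elect(instances):
    """Pick the new master from the reachable instances."""
    # 1. The instance(s) with the highest txid.
    top = max(i.latest_txid for i in instances)
    candidates = [i for i in instances if i.latest_txid == top]
    # 2. If multiple: prefer the instance that was master for that tx.
    former_masters = [i for i in candidates if i.master_for_tx == i.id]
    if former_masters:
        candidates = former_masters
    # 3. If unavailable: lowest clock value. 4. If multiple: lowest id.
    return min(candidates, key=lambda i: (i.clock, i.id))
```

Given the same input, every instance computes the same winner, so no extra voting round is modelled here.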
  • 42. Thank you for listening! tobias@neotechnology.com Tobias Lindaaker twitter: @thobe, #neo4j (@neo4j) web: neo4j.org neotechnology.com Hacker @ Neo Technology my web: thobe.org