Data Modelling at Scale
David Simons | @SwamWithTurtles
WHO AM I?
• David Simons (@SwamWithTurtles)
• Data Architect at Ovo Energy
• Technical All-Rounder
• Kafka implementation & Cloud
integration at Citi
• Linking of the court and prison
services with the Ministry of Justice
• Organised our wedding seating plan
with Python.
A BRIEF (BUT IMPORTANT) ASIDE
Black Lives
Matter
(UK orgs, Worldwide Orgs)
Trans Rights are
Human Rights
(LGBT orgs, Trans specific orgs,
International orgs)
Open Source Projects that could use contributions
• Github Collection of Open Source Projects for Social Good
• Data Kind
• Kaggle (Data Science Volunteering that often has Social Good causes)
• Police Brutality Register
• Data Police Subreddit (Increasing Accessibility of Policing Data)
• Data for Black Lives
AGENDA
• What is Data Modelling & Why
should I care?
• Data Modelling with Kafka
• Scaling your Data Model
CHAPTER ONE
WHAT IS DATA MODELLING AND WHY SHOULD I CARE?
Recommended Reading:
Data & Reality
— William Kent
CAUTION
PHILOSOPHY AHEAD
what is the world?
A COMPLEX MESS OF HUMANS AND JOKES AND EMOTIONS AND FICTION AND ILLOGIC AND FAKE NEWS AND TIME AND ENTROPY AND CONFUSION AND COLOUR AND WONDER AND FATE AND HATRED AND LOVE AND WAR AND NUANCE AND JEALOUSY AND DOGS (WHO ARE GOOD BOYS) AND CATS (WHO ARE NOT) AND COUNTRIES AND BORDERS AND MUSIC AND SOUND AND TIME AND TIDE AND DARKNESS AND BOOKS AND WORDS AND COMPLEXITIES AND ROLLERCOASTERS AND FAVOURITE FLAVOURS OF ICE CREAM AND JOHN LENNON AND GENDER (OR MAYBE NOT.)
THAT NO COMPUTER CAN EVER HOPE TO CAPTURE
does our software need to exist in the world?
MAYBE NOT THE WORLD…
BUT ANY SOFTWARE OF SUFFICIENT COMPLEXITY EXISTS WITHIN A
{business | problem | world | domain | industry}
what subset of the world does our software exist in?
data model, n:
an agreed set of
assumptions and features
that distill the world (in
which our software
exists) into something we
can hope to capture
programmatically
BUT I’M NOT A PHILOSOPHER…
WHAT DOES THAT MEAN FOR TECHNICAL PEOPLE?
TYPES OF MODEL
THE REAL WORLD
CONCEPTUAL MODEL
• Expresses the subset of the domain in terms of concepts and relations independent of design concerns.
LOGICAL MODEL
• Expresses the concepts in terms of data structures or underlying technologies
PHYSICAL MODEL
• Explicitly expresses how we have stored our data in systems (column names, DB)
Towards the real world, more: Accurate, Generic, Conceptual
Towards the physical model, more: Usable, Low-Level, Requires Technical Expertise
THE QUESTIONS WE ASK…
• What kind of things do we deal with?
• For each kind of thing, what aspects
of it do we care about? What are the
constraints on these aspects?
• When are two things the same thing?
• As something evolves, when does it
stop being the same thing?
• How do two things relate to each
other?
DATA MODELLING…
ISN’T THIS EASY?
Are two of these the same thing?
Are two of these the same thing?
{
amount: 500,
currency: "USD"
}
{
amount: 500,
currency: "USD"
}
Are two of these the same thing?
Sugababes
2009
Sugababes
1998
Are two of these the same thing?
MKS
2011
Sugababes
1998
MAKE IT STOP!
WHAT IS THE CORRECT DATA MODEL?
SIGNS OF A GOOD DATA MODEL
• It is simple. It models what you need
and nothing else.
• It is built with your technology and
software system in mind
• It does not contradict the actual world
• It is extensible
• Non-technical people understand it.
Domain experts even chip in.
CHAPTER TWO
DATA MODELLING WITH KAFKA
Recommended Reading:
Designing Event-Driven
Systems
— Ben Stopford
TYPES OF MODEL
LOGICAL MODEL
• Expresses the concepts in terms of data structures or underlying technologies
Kafka
• ???
SQL/RDBMS
• What are the tables & for which entities? What are the keys/constraints? How do we normalise everything?
Neo4j/Graph DBs
• Graph modelling - what are our entities? What are their properties? How do they relate (and what are the relations’ properties)? (more details here)
Mongo/Document Stores
• What are our entities? Which ones get top-level documents? What document validations should we enforce?
JUST IN CASE…
WHAT IS KAFKA?
• Immutable log data store, with a
multicast/pub-sub message
interface
• It’s technically not these things, but they may
be helpful abstractions:
• Message Queue with a DB store
behind it
• Real-time Streaming with a catch-up
facility
• ESB without the rules and bloat
that make it bad.
WHY DOES THIS SHAPE OUR DATA MODELS?
• Easy Answer: It’s a different data
store and therefore our low-level
models will be shaped by its
implementation details
• But Kafka has taken off not despite
its implementation details but
because of them.
A NEW DATA MODELLING PARADIGM
EVENT STREAMING
• Do not store the “state” of an object
as your primary model
• Instead store a sequence of events
that have transpired that will build up
that state.
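A minimal sketch of the idea, in Python. The account, the event names (Deposited, Withdrawn) and the apply function are hypothetical and chosen only for illustration; the point is that the log is the primary model and the balance is derived from it.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Deposited:
    amount: int


@dataclass(frozen=True)
class Withdrawn:
    amount: int


def apply(balance: int, event) -> int:
    """Return the new balance after applying one event."""
    if isinstance(event, Deposited):
        return balance + event.amount
    if isinstance(event, Withdrawn):
        return balance - event.amount
    raise TypeError(f"unknown event: {event!r}")


# The log is the primary model; it is only ever appended to.
event_log = [Deposited(100), Withdrawn(30), Deposited(5)]

# State is derived by replaying the log from an empty starting point.
balance = reduce(apply, event_log, 0)
print(balance)  # 75
```

Nothing ever overwrites the balance; a correction would be one more event appended to the log.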
MOTIVATION
• You can construct this state in
different ways for different purposes
• Back-up/Restoration for free!
• You can reconstruct the state of the
system at a given moment in time
• Better support for distributed/highly
concurrent systems
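The first and third points can be sketched as follows, assuming a hypothetical shopping-basket log in which every event carries a logical timestamp (the names ItemAdded, ItemRemoved and the "at" field are illustrative assumptions). Two different views are built from the same events, and one of them can be asked for the state at any past moment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemAdded:
    at: int    # logical timestamp (hypothetical)
    item: str


@dataclass(frozen=True)
class ItemRemoved:
    at: int
    item: str


LOG = [
    ItemAdded(1, "apple"),
    ItemAdded(2, "pear"),
    ItemRemoved(3, "apple"),
    ItemAdded(4, "plum"),
]


def basket_at(log, moment):
    """Projection 1: contents of the basket as of a given moment."""
    basket = set()
    for event in log:
        if event.at > moment:
            break  # the log is ordered, so later events can be ignored
        if isinstance(event, ItemAdded):
            basket.add(event.item)
        else:
            basket.discard(event.item)
    return basket


def removal_count(log):
    """Projection 2: an analytics view built from the same events."""
    return sum(isinstance(e, ItemRemoved) for e in log)


print(sorted(basket_at(LOG, 2)))  # ['apple', 'pear']
print(sorted(basket_at(LOG, 4)))  # ['pear', 'plum']
print(removal_count(LOG))         # 1
```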
SOME FAMILIAR EXAMPLES
• Git
• Bank Statements
• Blockchain
• Accounting Ledgers
BUT…
WHAT DOES THIS LOOK LIKE IN THE REAL WORLD?
EXAMPLE: DECRYPTO
https://github.com/SwamWithTurtles/decrypto-be
https://github.com/SwamWithTurtles/decrypto-fe
• A multi-player board game.
• Players can see a subset of words and
must communicate them to their team
mates without being intercepted (by
being too literal).
• Challenge: Make a web-app version
of this game for people to play over
hangouts during lockdown.
TYPES OF MODEL
THE REAL WORLD
CONCEPTUAL MODEL
• The list of events & their attributes
• The constraints of when they can occur
• The impact they have
LOGICAL MODEL
• The classes/text keys for events
• The name and types of their attributes
PHYSICAL MODEL
• ?
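As a rough illustration of the logical level, here is what "classes/text keys" and "names and types of attributes" could look like in Python. The event names (ClueGiven, GuessMade), their fields and the text keys are invented for this sketch and are not taken from the Decrypto repositories.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ClueGiven:
    KEY = "clue-given"   # text key for this event type (hypothetical)
    team: str
    round_number: int
    clue: str


@dataclass(frozen=True)
class GuessMade:
    KEY = "guess-made"   # text key for this event type (hypothetical)
    team: str
    round_number: int
    guess: str


# Lookup from text key back to the class that models it.
EVENT_TYPES = {cls.KEY: cls for cls in (ClueGiven, GuessMade)}


def to_json(event) -> str:
    """Serialise an event as a small envelope: its key plus its attributes."""
    return json.dumps({"type": event.KEY, "data": asdict(event)})


def from_json(raw: str):
    """Rebuild the event object from its envelope."""
    envelope = json.loads(raw)
    return EVENT_TYPES[envelope["type"]](**envelope["data"])


original = ClueGiven(team="white", round_number=1, clue="ocean")
assert from_json(to_json(original)) == original
```

The constraints and impact of each event belong to the conceptual model; this sketch only covers how the events are named and typed.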
BUT…
WHAT GOOD IS A STREAM OF EVENTS?
• (Some) Domain Experts
• Front-end Applications
• Human Reasoning
• Data Science/Analytics Teams
CAN I STORE MY DATA ELSEWHERE?
• Yes!
• Pull into whatever high-fidelity data
store you want - e.g. Neo4j,
DynamoDB, Elasticsearch,
Firebase…*
• * ksqlDB is attempting to solve this problem
• You can even use Kafka Connect but
be careful about the coupling of
physical data models.
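A small sketch of pulling events into a queryable store, using Python's built-in sqlite3 as a stand-in for whichever read store you choose. The event shapes and the customer table are assumptions made for the example; in practice the events would be consumed from Kafka rather than from a list.

```python
import sqlite3

# Stand-in for events read from the log (hypothetical shapes).
events = [
    {"type": "CustomerRegistered", "id": 1, "name": "Ada"},
    {"type": "CustomerRenamed", "id": 1, "name": "Ada L."},
    {"type": "CustomerRegistered", "id": 2, "name": "Grace"},
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")


def project(event):
    """Apply one event to the read model."""
    if event["type"] == "CustomerRegistered":
        db.execute(
            "INSERT INTO customer (id, name) VALUES (?, ?)",
            (event["id"], event["name"]),
        )
    elif event["type"] == "CustomerRenamed":
        db.execute(
            "UPDATE customer SET name = ? WHERE id = ?",
            (event["name"], event["id"]),
        )


for event in events:
    project(event)

rows = db.execute("SELECT id, name FROM customer ORDER BY id").fetchall()
print(rows)  # [(1, 'Ada L.'), (2, 'Grace')]
```

The table can be dropped and rebuilt from the events at any time, which is why the log, not the table, is treated as the source of truth.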
GIVE ME SOME…
PRACTICAL TIPS
SHOULD I DO IT?
• There is an overhead involved
(development, performance, resiliency). It
is not right for every team.
• Highly stateful services
• Highly concurrent services
• High throughput inputs
• Futureproofing
• Many different consumers of data.
[SPOILERS!]
SHOULD I DO IT: PART II
• Event sourcing wants you to keep
events in an immutable log forever.
• GDPR frowns upon keeping personal
data forever.
HOW DO I DEFINE EVENTS
• Events should be driven by domain
understanding from domain experts.
They should not be simple CRUD
statements (“UserAccountMapping
Created”)
• Events should correspond to actual,
definitive changes in state - not requests
to do so.
• Event Storming is the name given to a
domain modelling session in the event
sourcing world.
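The second point, events versus requests, can be sketched like this. WithdrawalRequested, MoneyWithdrawn and handle are hypothetical names; the sketch assumes a request is checked first and an event is logged only when the state has definitively changed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WithdrawalRequested:   # a request: it may still be refused
    amount: int


@dataclass(frozen=True)
class MoneyWithdrawn:        # an event: it has definitively happened
    amount: int


def handle(request, balance, log):
    """Append an event only if the request results in a real change."""
    if request.amount <= 0 or request.amount > balance:
        return balance  # refused: nothing happened, so nothing is logged
    log.append(MoneyWithdrawn(request.amount))
    return balance - request.amount


log = []
balance = handle(WithdrawalRequested(30), 100, log)
balance = handle(WithdrawalRequested(500), balance, log)  # refused
print(balance)  # 70
print(log)      # [MoneyWithdrawn(amount=30)]
```

Note that the event is named after what happened in the domain (money was withdrawn), not after the row that changed.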
THE TECHNICAL BITS
• Architectural Patterns: CQS/CQRS,
Event-Driven or Reactive
Programming
• Tooling to look into: RxJS (JS/front-end),
Akka (backend), Kafka, Event
Store (data layer)
• Recommended Videos:
• https://www.infoq.com/presentations/event-sourcing-jvm/
• https://www.infoq.com/presentations/event-driven-benefits-pitfalls/
• https://www.infoq.com/presentations/systems-event-driven/
CHAPTER THREE
SCALING YOUR DATA MODEL
Recommended Reading:
Domain-Driven Design
— Eric Evans
WHAT ARE THE PROBLEMS AS YOU GET BIGGER?
• As your scope grows, the complexity
of your model increases
(“user”, “person” or “account” will often be the worst offenders.)
• As your software grows,
discoverability, traceability and lineage
grow harder
• As your team grows, you will either
have many more meetings or will suffer
from breaking changes and poor
communication around your model.
THE SOLUTION…
SPLIT IT UP
CAUTION
OPINION PRESENTED AS FACT AHEAD
DATA SCOPING UTOPIA
The rules
• Each piece of data must have exactly one
point of truth on your system.
• Models within other contexts can
duplicate concepts from other contexts as
long as they know who is the boss.
• Models within a context should be
encapsulated and should not be impacted
by changes to other teams' models*.
MASTERY OF DATA
• Each piece of data must have exactly
one point of truth on your system
• Does everyone know who the
point of truth is? Is it defined or
documented?
• How do we ensure all changes
are registered in this system?
DENORMALISED MODELS
• Models within other contexts can —
perhaps even should — duplicate
concepts from other contexts in the
format they want. As long as they know
who is the boss.
• This means they must stay in sync
(including respecting of alterations)
• They should only get the data they
need but they should feel free to
transform it.
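A sketch of a downstream context keeping its own denormalised copy, under the assumption that the mastering context publishes events such as PersonRegistered, PersonRenamed and PersonRemoved (all hypothetical names). The downstream side keeps only the field it needs, in its own format, and follows every alteration made by the master.

```python
# person_id -> label text, owned by the downstream context
mailing_labels = {}


def on_master_event(event):
    """React to events from the context that masters person data."""
    if event["type"] in ("PersonRegistered", "PersonRenamed"):
        # Keep only the field we need, transformed into the format we want.
        mailing_labels[event["person_id"]] = event["name"].upper()
    elif event["type"] == "PersonRemoved":
        mailing_labels.pop(event["person_id"], None)
    # Other event types are ignored: this context has no use for them.


for e in [
    {"type": "PersonRegistered", "person_id": 7, "name": "Sam Lee",
     "email": "sam@example.com"},
    {"type": "PersonEmailChanged", "person_id": 7,
     "email": "s.lee@example.com"},
    {"type": "PersonRenamed", "person_id": 7, "name": "Sam Li"},
]:
    on_master_event(e)

print(mailing_labels)  # {7: 'SAM LI'}
```

The copy never accepts edits of its own; a change to a name has to go through the context that is "the boss" for it.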
MODEL ENCAPSULATION
• Models within a context should be
encapsulated and should not be
impacted by changes to other teams'
models*.
• This includes validation - this
should only be applied where
data is mastered.
• *Possible exception: Changes to translation layers may
need to happen due to changes in physical model.
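One way to picture the translation layer mentioned in the footnote, as a sketch only. The upstream field names (cust_id, cust_nm) and the local Customer class are invented for the example. If the upstream physical model renames a column, translate is the single place that has to change; the local model stays as it is, and no validation is repeated here because the data is mastered elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    """This context's own model; nothing outside depends on its shape."""
    customer_id: int
    display_name: str


def translate(upstream_row: dict) -> Customer:
    """Translation layer: the only code that knows upstream field names."""
    return Customer(
        customer_id=int(upstream_row["cust_id"]),
        display_name=upstream_row["cust_nm"].strip(),
    )


customer = translate(
    {"cust_id": "42", "cust_nm": " Ada Lovelace ", "vat_no": "n/a"}
)
print(customer)  # Customer(customer_id=42, display_name='Ada Lovelace')
```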
BOUNDED CONTEXT UTOPIA
Court Scheduling
  Defendant
  • Name
  • CrimeType
  • Availability
  • Special Needs
  Hearing
  • Time
  • Court Room
Prisons
  Inmate
  • Name
  • CrimeType
In-Court Transcription
  Person
  • Name
  Role
  • Type, e.g. Defense Barrister, Judge, Defendant
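The diagram above could be written down as three small, separate models. This sketch assumes that Defendant and Hearing belong to Court Scheduling, Inmate to Prisons, and Person and Role to In-Court Transcription; the field types are also assumptions.

```python
from dataclasses import dataclass
from typing import List


# --- Court Scheduling context ---
@dataclass
class Defendant:
    name: str
    crime_type: str
    availability: List[str]
    special_needs: List[str]


@dataclass
class Hearing:
    time: str
    court_room: str


# --- Prisons context ---
@dataclass
class Inmate:
    name: str
    crime_type: str


# --- In-Court Transcription context ---
@dataclass
class Person:
    name: str


@dataclass
class Role:
    type: str  # e.g. "Defense Barrister", "Judge", "Defendant"


# The same human being, modelled three different ways.
defendant = Defendant("Alex Doe", "Fraud", ["Mon", "Tue"], ["Interpreter"])
inmate = Inmate(defendant.name, defendant.crime_type)
speaker = (Person(defendant.name), Role("Defendant"))
print(inmate)  # Inmate(name='Alex Doe', crime_type='Fraud')
```

Each context carries only the attributes it cares about, so Prisons never needs to know about availability and Transcription never needs to know the crime type.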
THE PROBLEMS… (NO MORE!)
PROBLEM: As your scope grows, the complexity of your model increases.
SOLVED BY: Each piece of data must have exactly one point of truth on your system.
PROBLEM: As your software grows, discoverability, traceability and lineage grow harder.
SOLVED BY: Models within other contexts can duplicate concepts from other contexts as long as they know who is the boss.
PROBLEM: As your team grows, you will either have many more meetings or will suffer from breaking changes and poor communication around your model.
SOLVED BY: Models within a context should be encapsulated and should not be impacted by changes to other teams' models.
BUT…
WHERE ARE THE BOUNDARIES?
“A bounded context delimits the applicability of a particular
model so that team members have a clear and shared understanding
of what has to be consistent and how it relates to other contexts.
Within that context, work to keep the model logically unified but
do not worry about applicability outside those bounds.”
– ERIC EVANS
“BOUNDED CONTEXT” SMELLS
• Too Big:
• Polysemes/False Cognates
• Duplicate Concepts
• Too Small:
• Data/Feature Envy
• Incomplete Model
WITHIN YOUR CONTEXT…
• See Chapters 1 and 2 of this talk.
AN ASIDE…
THIS MAY BE A GOOD WAY TO STRUCTURE YOUR TEAMS
I CLAIM THAT…
THIS IS MADE MUCH EASIER IN AN EVENT SOURCING WORLD.
IN THE OLD STATE-WORLD
• How would we notify and push
changes?
• How do we translate information
between the different services?
• How do we decouple the physical
models?
INSTEAD…
• All state changes are business-driven
events. Contexts can listen to these
and do what they want with them.
• New contexts can be spun up and
construct their state from past
events.
• Events are perfect candidates for
MQs or Kafka to unlock a push-based
system.
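The second point, a new context building its state from past events, can be sketched with a toy in-memory log. EventLog and Context are hypothetical classes and are not Kafka's API; this version is also pull-based for simplicity, whereas a real broker would deliver new events to subscribers.

```python
class EventLog:
    """Append-only log; each consumer tracks its own read position."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def read_from(self, offset):
        return self._events[offset:]


class Context:
    """A bounded context that consumes events at its own pace."""

    def __init__(self, name):
        self.name = name
        self.offset = 0   # how far into the log this context has read
        self.seen = []

    def catch_up(self, log):
        for event in log.read_from(self.offset):
            self.seen.append(event)
            self.offset += 1


log = EventLog()
scheduling = Context("scheduling")

log.append("HearingScheduled")
scheduling.catch_up(log)
log.append("HearingPostponed")

prisons = Context("prisons")   # spun up after both events were written
prisons.catch_up(log)
scheduling.catch_up(log)

print(prisons.seen)     # ['HearingScheduled', 'HearingPostponed']
print(scheduling.seen)  # ['HearingScheduled', 'HearingPostponed']
```

Because the log keeps history, the late-arriving context ends up with the same view as one that was listening from the start.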
PUTTING IT ALL TOGETHER
Split your domain model up into
“bounded contexts”.
This may incorporate multiple
teams or systems but should be a
reasonable size.
All stakeholders of this bounded
context should define the
boundary and understand the
conceptual model (state,
entities, relationships). This should
be documented and
discoverable.
After that, they should event
storm to drive
understanding of their data
model:
What (in the real world) can make the entities/
relationships in the conceptual data model change?
When can they happen? What info is needed to
action them?
Enter… Kafka.
These events should be published on Kafka. They represent
your team’s (internally) public interface and should be
documented/publicised.
This should be the source of truth.
You and any other team that cares about this event can now
use it to update their readable/high-fidelity state (e.g. RDBMS,
Elastic, Neo4j).
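A sketch of what a published event might look like on the wire. The topic name, the envelope fields ("type", "version", "data") and the HearingScheduled event are all assumptions; the actual producer call is left out because it depends on the client library you use.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class HearingScheduled:
    hearing_id: str
    court_room: str
    time: str


def to_record(event):
    """Return (topic, key, value) ready to hand to a Kafka producer."""
    topic = "court-scheduling.events"        # hypothetical topic name
    key = event.hearing_id.encode("utf-8")   # key by the entity's id
    value = json.dumps(
        {"type": type(event).__name__, "version": 1, "data": asdict(event)},
        sort_keys=True,
    ).encode("utf-8")
    return topic, key, value


topic, key, value = to_record(
    HearingScheduled("h-1", "Court 3", "2020-07-01T10:00")
)
print(topic, key)
print(value.decode("utf-8"))
```

Keying by the entity's id means that, with Kafka's default partitioning, all events for one hearing land on the same partition and keep their order. The envelope is the part other teams depend on, so it is the part to document.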
And it ended happily ever after.
IN SUMMARY
• Data Modelling is the crystallisation of the assumptions we have made
about the real world within our domain. It is an imprecise science, but a
good model will allow frictionless progress.
• Event Sourcing asks: what if we build our domain model around state
changes instead of state? Kafka is a great backbone for this kind of
architecture: reliable, futureproof and highly scalable.
• As we scale up, data modelling as a whole org is unsustainable. We break
our model into independently changeable sections called
bounded contexts. Kafka can act well as a central nervous system.
Questions?
David Simons | @SwamWithTurtles

Data Modelling at Scale

  • 1. Data Modelling at Scale David Simons | @SwamWithTurtles
  • 2. W H O A M I ? • David Simons (@SwamWithTurtles) • Data Architect at Ovo Energy • Technical All-Rounder • Kafka implementation & Cloud integration at Citi • Linking of the court and prison services with the Ministry of Justice • Organised our wedding seating plan with Python.
  • 3. A B R I E F ( B U T I M P O R TA N T ) A S I D E Black Lives Matter (UK orgs, Worldwide Orgs) Trans Rights are Human Rights (LGBT orgs, Trans specific orgs, International orgs) Open Source Projects that could use contributions • Github Collection of Open Source Projects for Social Good • Data Kind • Kaggle (Data Science Volunteering that often has Social Good causes) • Police Brutality Register • Data Police Subreddit (Increasing Accessibility of Policing Data) • Data for Black Lives
  • 4. A G E N D A • What is Data Modelling & Why should I care? • Data Modelling with Kafka • Scaling your Data Model
  • 5. W H AT I S D ATA M O D E L L I N G A N D W H Y S H O U L D I C A R E ? C H A P T E R O N E Recommended Reading: Data & Reality — William Kent
  • 6. C AU T I O N P H I L O S O P H Y A H E A D
  • 7. w h a t i s t h e w o r l d ?
  • 8. A C O M P L E X M E S S O F H U M A N S A N D J O K E S A N D E M O T I O N S A N D F I C T I O N A N D I L L O G I C A N D F A K E N E W S A N D T I M E A N D E N T R O P Y A N D C O N F U S I O N A N D C O L O U R A N D W O N D E R A N D FAT E A N D H A T R E D A N D L O V E A N D WA R A N D N U A N C E A N D J E A L O U S Y A N D D O G S ( W H O A R E G O O D B O Y S ) A N D C AT S ( W H O A R E N O T ) A N D C O U N T R I E S A N D B O R D E R S A N D M U S I C A N D S O U N D A N D T I M E A N D T I D E A N D D A R K N E S S A N D B O O K S A N D W O R D S A N D C O M P L E X I T I E S A N D R O L L E R C O A S T E R S A N D FAV O U R I T E F L AV O U R S O F I C E C R E A M A N D J O H N L E N N O N A N D G E N D E R ( O R M AY B E N O T. ) T H AT N O C O M P U T E R C A N E V E R H O P E T O C A P T U R E
  • 9. d o e s o u r s o f t w a r e n e e d t o e x i s t i n t h e w o r l d ?
  • 10. M AY B E N O T T H E W O R L D … B U T A N Y S O F T WA R E O F S U F F I C I E N T C O M P L E X I T Y E X I S T S W I T H I N A { b u s i n e s s | p r o b l e m | w o r l d | d o m a i n | i n d u s t r y }
  • 11. w h a t s u b s e t o f t h e w o r l d d o e s o u r s o f t w a r e e x i s t s i n ?
  • 12. data model, n: an agreed set of assumptions and features that distill the world (in which our software exists) into something we can hope to capture programatically
  • 13. W H AT D O E S T H AT M E A N F O R T E C H N I C A L P E O P L E ? B U T I ’ M N O T A P H I L O S O P H E R …
  • 14. T Y P E S O F M O D E L C O N C E P T U A L M O D E L T H E R E A L W O R L D L O G I C A L M O D E L P H Y S I C A L M O D E L • Expresses the subset of the domain in terms of concepts and relations independent of design concerns. • Explicitly expresses how we have stored our data in systems (column names, DB) • Expresses the concepts in terms of data structures or underlying technologies More: Usable, Low-Level, Requires Technical Expertise More: Accurate, Generic, Conceptual
  • 15. T H E Q U E S T I O N S W E A S K … • What kind of things do we deal with? • For each kind of things, what aspects of it do we care about? What are the constraints of these aspects? • When are two things the same thing? • As something evolves, when does it stop being the same thing? • How do two things relate to each other?
  • 16. I S N ’ T T H I S E A S Y ? D ATA M O D E L L I N G …
  • 17. Are two of these the same thing?
  • 18. Are two of these the same thing? { amount: 500, currency: “USD” } { amount: 500, currency: “USD” }
  • 19. Are two of these the same thing? Sugababes 2009 Sugababes 1998
  • 20. Are two of these the same thing? MBS 2011 Sugababes 1998
  • 21. M A K E I T S T O P !
  • 22. W H AT I S T H E C O R R E C T D ATA M O D E L ?
  • 23. S I G N S O F A G O O D D ATA M O D E L • It is simple. It models what you need and nothing else. • It is built with your technology and software system in mind • It does not contradict the actual world • It is extensible • Non-technical people understand it. Domain experts even chip in.
  • 24. D ATA M O D E L L I N G W I T H K A F K A C H A P T E R T W O Recommended Reading: Designing Event-Driven Systems — Ben Stopford
  • 25. T Y P E S O F M O D E L C O N C E P T U A L M O D E L T H E R E A L W O R L D L O G I C A L M O D E L P H Y S I C A L M O D E L • Expresses the subset of the domain in terms of concepts and relations independent of design concerns. • Explicitly expresses how we have stored our data in systems (column names, DB) • Expresses the concepts in terms of data structures or underlying technologies More: Usable, Low-Level, Requires Technical Expertise More: Accurate, Generic, Conceptual
  • 26. T Y P E S O F M O D E L L O G I C A L M O D E L • Expresses the concepts in terms of data structures or underlying technologies Kafka ??? SQL/RDBMS What are the tables & for which entities? What are the keys/ constraints? How do we normalise everything? Neo4j/Graph DBs Graph modelling - what are our entities? what are their properties? how do they relate (and what are the relations’ properties) (more details here) Mongo/Document Stores What are our entities? Which ones get top-level documents? What document validations should we enforce?
  • 27. W H AT I S K A F K A ? J U S T I N C A S E …
  • 28. W H AT I S K A F K A ? • Immutable log data store, with a multicast/pub-sub message interface • It’s technically not these things but they may be a helpful abstraction: • Message Queue with a DB store behind it • Real-time Streaming with a catch-up facility • ESB without the rules and bloatedness that make it bad.
  • 29. W H Y D O E S T H I S S H A P E O U R D ATA M O D E L S ? • Easy Answer: It’s a different data store and therefore our low-level models will be shaped by its implementation details • But Kafka has taken off not despite its implementation details but because of it.
  • 30. A N E W D ATA M O D E L L I N G PA R A D I G M E V E N T S T R E A M I N G
  • 31. E V E N T S T R E A M I N G • Do not store the “state” of an object as your primary model • Instead store a sequence of events that have transpired that will build up that state.
  • 32. M O T I VAT I O N • You can construct this state in different ways for different purposes • Back-up/Restoration for free! • You can reconstruct the state of the system at a given moment in time • Better support for distributed/highly concurrent systems
  • 33. S O M E FA M I L I A R E X A M P L E S • Git • Bank Statements • Blockchain • Accounting Ledgers
  • 34. W H AT D O E S T H I S L O O K L I K E I N T H E R E A L W O R L D ? B U T …
  • 35. E X A M P L E : D E C RY P T O https://github.com/SwamWithTurtles/decrypto-be https://github.com/SwamWithTurtles/decrypto-fe • A multi-player board game. • Players can see a subset of words and must communicate them to their team mates without being intercepted (by being too literal). • Challenge: Make a web-app version of this game for people to play over hangouts during lockdown.
  • 36. E X A M P L E : D E C RY P T O
  • 37. E X A M P L E : D E C RY P T O
  • 38. E X A M P L E : D E C RY P T O
  • 39. T Y P E S O F M O D E L C O N C E P T U A L M O D E L T H E R E A L W O R L D L O G I C A L M O D E L P H Y S I C A L M O D E L • The list of events & their attributes • The constraints of when they can occur • The impact they have • The classes/text keys for events • The name and types of their attributes ?
  • 40. W H AT G O O D I S A N S T R E A M O F E V E N T S ? B U T …
  • 41. W H AT G O O D I S A N S T R E A M O F E V E N T S ? • (Some) Domain Experts • Front-end Applications • Human Reasoning • Data Science/Analytics Teams
  • 42. C A N I S T O R E M Y D ATA E L S E W H E R E ? • Yes! • Pull into whatever high-fidelity data store you want - e.g. Neo4j, DynamoDB, ElasticSearch Firebase……* • * KSQLDB is attempting to solve this problem • You can even use Kafka Connect but be careful about the coupling of physical data models.
  • 43. P R A C T I C A L T I P S G I V E M E S O M E …
  • 44. S H O U L D I D O I T ? • There is an overhead involved (development, performance, resiliency). It is not right for every team. • Highly stateful services • Highly concurrent service • High throughput inputs • Futureproofing • Many different consumers of data. [SPOILERS!]
  • 45. S H O U L D I D O I T: PA R T I I • Event sourcing wants you to keep events in an immutable log forever. • GDPR frowns upon keeping personal data forever.
  • 46. H O W D O I D E F I N E E V E N T S • Events should be driven by domain understanding from domain experts. They should not be simple CRUD statements (“UserAccountMapping Created”) • Events should correspond to actual, definitive changes in state - not requests to do so. • Event Storming is the name given to a domain modelling session in the event sourcing world.
  • 47. T H E T E C H N I C A L B I T S • Architectural Patterns: CQS/CQRS, Event-Driven or Reactive Programming • Tooling to look into: RxJS (JS/front- end), Akka (backend), Kafka, Event Store (data layer) • Recommended Videos : • https://www.infoq.com/presentations/event-sourcing-jvm/ • https://www.infoq.com/presentations/event-driven-benefits-pitfalls/ • https://www.infoq.com/presentations/systems-event-driven/
  • 48. S C A L I N G Y O U R D ATA M O D E L C H A P T E R T H R E E Recommended Reading: Domain-Driven Design — Eric Evans
  • 49. W H AT A R E T H E P R O B L E M S A S Y O U G E T B I G G E R ?
  • 50. W H AT A R E T H E P R O B L E M S A S Y O U G E T B I G G E R ? • As your scope grows, the complexity of your model increases (“user”, “person” or “account” will often be the worst offenders). • As your software grows, discoverability, traceability and lineage grow harder. • As your team grows, you will either have many more meetings or will suffer from breaking changes and poor communication around your model.
  • 51. S P L I T I T U P T H E S O L U T I O N …
  • 52. C AU T I O N O P I N I O N P R E S E N T E D A S FA C T A H E A D
  • 53. D ATA S C O P I N G U T O P I A The rules • Each piece of data must have exactly one point of truth on your system. • Models within other contexts can duplicate concepts from other contexts as long as they know who is the boss. • Models within a context should be encapsulated and should not be impacted by changes to other teams' models*.
  • 54. M A S T E RY O F D ATA • Each piece of data must have exactly one point of truth on your system. • Does everyone know who the point of truth is? Is it defined or documented? • How do we ensure all changes are registered in this system?
  • 55. D E N O R M A L I S E D M O D E L S • Models within other contexts can, perhaps even should, duplicate concepts from other contexts in the format they want, as long as they know who is the boss. • This means they must stay in sync (including respecting alterations). • They should only get the data they need, but they should feel free to transform it.
  • 56. M O D E L E N C A P S U L AT I O N • Models within a context should be encapsulated and should not be impacted by changes to other teams' models*. • This includes validation - this should only be applied where data is mastered. • *Possible exception: Changes to translation layers may need to happen due to changes in the physical model.
  • 57. B O U N D E D C O N T E X T U T O P I A • Contexts: Prisons; In-Court Transcription; Court Scheduling • Entities: Defendant (Name, CrimeType, Availability, Special Needs); Hearing (Time, Court Room); Inmate (Name, CrimeType); Person (Name); Role Type (e.g. Defense Barrister, Judge, Defendant)
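As a rough illustration of the denormalisation and encapsulation rules, the sketch below shows a prison context keeping its own `Inmate` model and taking only the fields it needs from a defendant record. It assumes, purely for the example, that the defendant is mastered by the court scheduling context; the `translate_defendant` function and the record layout are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inmate:
    """The prison context's own model: only the fields it needs, in its own format."""
    name: str
    crime_type: str


def translate_defendant(upstream: dict) -> Inmate:
    """Translation layer: the only place that knows the upstream field names.

    No validation happens here; that belongs where the data is mastered.
    """
    return Inmate(
        name=upstream["Name"].strip(),
        crime_type=upstream["CrimeType"].lower(),
    )


# Hypothetical record as published by the (assumed) mastering context.
defendant = {
    "Name": " Jo Bloggs ",
    "CrimeType": "FRAUD",
    "Availability": ["2020-06-01"],
    "SpecialNeeds": None,
}
inmate = translate_defendant(defendant)
```

If the upstream physical model changes, only `translate_defendant` should need to change, which is the possible exception noted on the encapsulation slide.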
  • 58. T H E P R O B L E M S … ( N O M O R E ! ) • Problem: As your scope grows, the complexity of your model increases. Solved by: Each piece of data must have exactly one point of truth on your system. • Problem: As your software grows, discoverability, traceability and lineage grow harder. Solved by: Models within other contexts can duplicate concepts from other contexts as long as they know who is the boss. • Problem: As your team grows, you will either have many more meetings or will suffer from breaking changes and poor communication around your model. Solved by: Models within a context should be encapsulated and should not be impacted by changes to other teams' models.
  • 59. W H E R E A R E T H E B O U N D A R I E S ? B U T …
  • 60. – E R I C E VA N S “A bounded context delimits the applicability of a particular model so that team members have a clear and shared understanding of what has to be consistent and how it relates to other contexts. Within that context, work to keep the model logically unified but do not worry about applicability outside those bounds.”
  • 61. “ B O U N D E D C O N T E X T ” S M E L L S • Too Big: • Polysemes/False Cognates • Duplicate Concepts • Too Small: • Data/Feature Envy • Incomplete Model
  • 62. W I T H I N Y O U R C O N T E X T… • See Chapters 1 and 2 of this talk.
  • 63. T H I S M AY B E A G O O D WAY T O S T R U C T U R E Y O U R T E A M S A N A S I D E …
  • 64. T H I S I S M A D E M U C H E A S I E R I N A N E V E N T S O U R C I N G W O R L D . I C L A I M T H AT …
  • 65. I N T H E O L D S TAT E - W O R L D • How would we notify and push changes? • How do we translate information between the different services? • How do we decouple the physical models?
  • 66. I N S T E A D … • All state changes are business-driven events. Contexts can listen to these and do what they want with them. • New contexts can be spun up and construct their state from past events. • Events are perfect candidates for MQs or Kafka to unlock a push-based system.
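The second point, a new context constructing its state from past events, can be sketched as a simple fold over a retained log. The list below is an in-memory stand-in for that log (for example a Kafka topic read from the beginning); the event names and fields are hypothetical.

```python
from collections import defaultdict

# In-memory stand-in for a retained event log; contents are illustrative only.
EVENT_LOG = [
    {"type": "HearingScheduled", "defendant": "Jo Bloggs", "court_room": "3"},
    {"type": "HearingScheduled", "defendant": "Sam Smith", "court_room": "1"},
    {"type": "HearingCancelled", "defendant": "Sam Smith", "court_room": "1"},
    {"type": "HearingScheduled", "defendant": "Ali Khan", "court_room": "3"},
]


def apply(state: dict, event: dict) -> dict:
    """Fold one event into this context's state: defendants booked per court room."""
    if event["type"] == "HearingScheduled":
        state[event["court_room"]].add(event["defendant"])
    elif event["type"] == "HearingCancelled":
        state[event["court_room"]].discard(event["defendant"])
    # Event types this context does not care about are simply ignored.
    return state


def rebuild(log) -> dict:
    """A newly created context builds its state by replaying every past event in order."""
    state = defaultdict(set)
    for event in log:
        apply(state, event)
    return state


rooms = rebuild(EVENT_LOG)
```

The same `apply` function can then be used for live events, so the replayed state and the ongoing state are built by one code path.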
  • 67. P U T T I N G I T A L L T O G E T H E R
  • 68. Split your domain model up into “bounded contexts”. This may incorporate multiple teams or systems but should be a reasonable size.
  • 69. All stakeholders of this bounded context should define the boundary and understand the conceptual model (state, entities, relationships). This should be documented and discoverable.
  • 70. After that, they should event storm to drive understanding of their data model: What (in the real world) can make the entities/relationships in the conceptual data model change? When can they happen? What info is needed to action them?
  • 71. Enter… Kafka. These events should be published on Kafka. They represent your team’s (internally) public interface and should be documented/publicised. This should be the source of truth.
  • 72. You and any other team that cares about this event can now use it to update their readable/high-fidelity state (e.g. RDBMS, Elastic, Neo4j).
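A minimal sketch of this step, using an in-memory SQLite database as a stand-in for the readable store. The event names, fields and table layout are hypothetical, and the plain list stands in for whatever consumer loop would deliver events from Kafka.

```python
import sqlite3

# Hypothetical events, in the order a consumer might receive them.
events = [
    {"type": "AccountOpened", "account_id": "a1", "owner": "Jo"},
    {"type": "MoneyDeposited", "account_id": "a1", "amount_pence": 500},
    {"type": "MoneyWithdrawn", "account_id": "a1", "amount_pence": 200},
]

db = sqlite3.connect(":memory:")  # stand-in for the team's RDBMS
db.execute(
    "CREATE TABLE account (id TEXT PRIMARY KEY, owner TEXT, balance_pence INTEGER)"
)


def project(conn: sqlite3.Connection, event: dict) -> None:
    """Update the read model for one event; the event log stays the source of truth."""
    if event["type"] == "AccountOpened":
        conn.execute(
            "INSERT INTO account VALUES (?, ?, 0)",
            (event["account_id"], event["owner"]),
        )
    elif event["type"] in ("MoneyDeposited", "MoneyWithdrawn"):
        sign = 1 if event["type"] == "MoneyDeposited" else -1
        conn.execute(
            "UPDATE account SET balance_pence = balance_pence + ? WHERE id = ?",
            (sign * event["amount_pence"], event["account_id"]),
        )


for e in events:
    project(db, e)

balance = db.execute(
    "SELECT balance_pence FROM account WHERE id = 'a1'"
).fetchone()[0]
```

Because the table is derived entirely from events, it can be dropped and rebuilt, or replaced with a different store, without touching the published events.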
  • 73. And it ended happily ever after.
  • 74. I N S U M M A RY • Data Modelling is the crystallisation of the assumptions we have made about the real world within our domain. It is an imprecise science, but a good model will allow frictionless progress. • Event Sourcing asks: what if we build our domain model around state changes instead of state? Kafka is a great backbone for this kind of architecture: reliable, futureproof and highly scalable. • As we scale up, data modelling as a whole org is unsustainable. We break our model into independently changeable sections called bounded contexts. Kafka can act well as a central nervous system.
  • 75. Questions? David Simons | @SwamWithTurtles