Building an Open Source, Real-Time,
Billion Object Spatio-Temporal Search
Platform
2016 International Workshop on Cloud Computing and Big Data
Benjamin Lewis, David Strohschein, Paolo Corti, David Smiley
Center for Geographic Analysis, Harvard University
Background
● Big data is everywhere: sensors (weather, pollution…), mobile devices,
social platform activities, software logs, etc.
● Data are generally streaming, so they are temporal
● Most of those data are spatial as well
● Traditional RDBMS, desktop statistics and visualization packages have
difficulty handling big data
● Current solutions involve “massive parallel software running on a large
number of servers”
Use case
● We work in a research university so we need to provide big data to students and
researchers
● Our goal is to lower barriers to interactive data exploration
● Some systems support visualization of large spatio-temporal datasets but don’t handle
search well
● Many search applications (most search engines) handle text but do not support the
geographic dimension.
● There is a great need for a tool that lets users interactively search large collections and
visualize them geographically. Supporting such increasingly common datasets requires a new
kind of map server and client.
● Project funded by the Sloan Foundation in partnership with the Dataverse team at
Harvard IQSS
Solution
● A general solution. Prototype
with geotagged tweets (tweets
containing GPS coordinates
from originating device)
● Platform adaptable to other
spatio-temporal big data streams
(weather and pollution
sensors, GeoRSS feeds, etc.)
● Integrate the new platform
within Harvard WorldMap
and Dataverse systems
Objective
● Create a missing piece of geo-infrastructure and make it
available
● Demonstrate possibility of addressing scalability limitations
with non-exotic software and hardware
● Make setting up platforms for big spatio-temporal
visualization as easy as setting up a standard GIS stack
Streaming big data
Geotagged tweets
● Geotagged tweets: tweets containing GPS coordinates from originating
device
● Currently about 2% of tweets are geotagged, about 8 million per day
● The CGA has been harvesting geo-tweets since October 2012 using the
Twitter API
● The Billion Object Platform (BOP) will provide a client and API to browse and
search the latest 1 billion geotagged tweets (about 3 months range)
● Command line tools to extract older geotagged tweets from archives
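The size of the rolling window can be related to the harvest rate with simple arithmetic. The following is a minimal sketch that takes the rounded figures quoted above at face value; the constant and function names are illustrative and are not part of the BOP.

```python
# Rough sizing of the rolling window from the approximate figures quoted above.
TWEETS_PER_DAY = 8_000_000      # "about 8 million per day"
WINDOW_SIZE = 1_000_000_000     # "the latest 1 billion geotagged tweets"


def window_days(window_size: int, per_day: int) -> float:
    """Return how many days of harvest fit in a window of the given size."""
    return window_size / per_day


days = window_days(WINDOW_SIZE, TWEETS_PER_DAY)
print(f"{days:.0f} days of tweets fit in the window")
```

With these rounded inputs the window covers a few months of tweets; the exact span depends on the actual daily volume, which varies.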
The BOP (Billion Object Platform)
● General purpose, open source platform to support exploration of large collections
of spatio-temporal entities
● Built on top of a search engine
● Supports exploration, visualization, extraction via a RESTful API
● Queryable by time, space, text
● Responsive
● Spatial heatmap to represent the distribution of results (spatial faceting: results
per cell in a grid)
● Supports temporal histograms (temporal faceting: results per date-time range)
● Supports word clouds as a mechanism to enhance results browsing by topic
● Supports downloads of subsets for registered users (up to 10,000 features)
● Sentiment stamping
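The two kinds of faceting listed above both reduce to counting results per bucket. The sketch below illustrates the idea in plain Python on three made-up points; it is not the BOP implementation, which delegates this work to the search engine, and the function names and sample data are assumptions.

```python
from collections import Counter
from datetime import datetime, timezone


def heatmap_counts(points, min_lon, min_lat, max_lon, max_lat, cols, rows):
    """Spatial faceting: count (lon, lat) points per cell of a regular grid."""
    cell_w = (max_lon - min_lon) / cols
    cell_h = (max_lat - min_lat) / rows
    counts = Counter()
    for lon, lat in points:
        if not (min_lon <= lon <= max_lon and min_lat <= lat <= max_lat):
            continue  # skip points outside the requested extent
        # Clamp so that points on the maximum edge land in the last cell.
        col = min(int((lon - min_lon) / cell_w), cols - 1)
        row = min(int((lat - min_lat) / cell_h), rows - 1)
        counts[(col, row)] += 1
    return counts


def daily_histogram(timestamps):
    """Temporal faceting: count timezone-aware timestamps per UTC day."""
    return Counter(ts.astimezone(timezone.utc).date() for ts in timestamps)


# Made-up sample records: (longitude, latitude, timestamp).
tweets = [
    (-71.11, 42.37, datetime(2016, 5, 1, 10, 0, tzinfo=timezone.utc)),
    (-71.06, 42.36, datetime(2016, 5, 1, 23, 30, tzinfo=timezone.utc)),
    (2.35, 48.86, datetime(2016, 5, 2, 8, 15, tzinfo=timezone.utc)),
]

grid = heatmap_counts([(lon, lat) for lon, lat, _ in tweets],
                      -180, -90, 180, 90, cols=4, rows=2)
histogram = daily_histogram(ts for _, _, ts in tweets)
print(dict(grid))
print(dict(histogram))
```

A client can draw the grid counts as a heatmap and the per-day counts as a histogram without ever downloading the individual records, which is what keeps the interface responsive on large collections.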
Solution Stack
● Apache Lucene: an indexing and search library
● Apache Solr: a search web server platform built on top of
Lucene
● Apache Kafka: a message broker written in Scala to provide
a platform for handling real-time data streams
● Apache ZooKeeper: enables highly reliable distributed
coordination
● Swagger: a framework for building APIs
● scikit-learn library: Machine Learning in Python
● OpenLayers: a JavaScript mapping client
● AngularJS: a JavaScript framework
Search engine features
● Faceted searches (category, space and time)
● Stemming: ability to detect words derived from a common root
● Synonym detection and controlled vocabularies such as thesauri and taxonomies
● Weighted results
● Wildcard and fuzzy search: provide results for a given term and its common
variations
● Boolean queries: search results using terms and boolean operators such as AND,
OR, NOT…
● Hit highlighting: emphasizes the matched query terms within each result, so the user
can see why it was returned
● Stop words: words filtered out during the processing of text
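Several of the features above can be combined in a single request to the search engine. The sketch below only builds the request URL and sends nothing over the network. It assumes a Solr core with a text field, a date field, and a spatial field that supports heatmap faceting; the endpoint, core name, and field names are hypothetical, since the slides do not give the schema. The `facet.range.*` and `facet.heatmap.*` parameters are standard Solr faceting parameters.

```python
from urllib.parse import urlencode

# Hypothetical names: the slides do not give the endpoint or the schema.
SOLR_SELECT_URL = "http://localhost:8983/solr/bop/select"
TEXT_FIELD = "text"
TIME_FIELD = "created_at"
GEO_FIELD = "coord"


def build_search_params(terms, bbox, start, end, gap="+1DAY"):
    """Return Solr parameters for one query by text, time, and space."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return [
        # Boolean text query, e.g. "coffee AND NOT decaf".
        ("q", f"{TEXT_FIELD}:({terms})"),
        # Filters restricting results to a time range and a bounding box.
        ("fq", f"{TIME_FIELD}:[{start} TO {end}]"),
        ("fq", f"{GEO_FIELD}:[{min_lat},{min_lon} TO {max_lat},{max_lon}]"),
        # Return facet counts only, no documents.
        ("rows", "0"),
        ("facet", "true"),
        # Temporal faceting: one count per gap-sized bucket.
        ("facet.range", TIME_FIELD),
        ("facet.range.start", start),
        ("facet.range.end", end),
        ("facet.range.gap", gap),
        # Spatial faceting: counts per grid cell over the same extent.
        ("facet.heatmap", GEO_FIELD),
        ("facet.heatmap.geom",
         f'["{min_lon} {min_lat}" TO "{max_lon} {max_lat}"]'),
    ]


params = build_search_params(
    terms="coffee AND NOT decaf",
    bbox=(-180, -90, 180, 90),
    start="2016-05-01T00:00:00Z",
    end="2016-05-08T00:00:00Z",
)
url = SOLR_SELECT_URL + "?" + urlencode(params)
print(url)
```

Setting `rows` to 0 asks only for the facet counts, which is all a heatmap and a histogram need.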
Client to enable data exploration and extraction
API for streaming geotagged tweets
Sentiment Analysis
● Sentiment analysis is a field of study that uses natural language processing tools to
identify the opinions people express in text
● Social media such as Twitter provide a constant stream of textual data, much of it
expressing an opinion, which can be analyzed using sentiment analysis tools.
● Using the scikit-learn library (Machine Learning in Python) we stamp each tweet
with a positive or negative sentiment
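A minimal sketch of positive/negative stamping with scikit-learn follows. The slides do not say which features or classifier the project uses, so the bag-of-words features, the naive Bayes model, the six training sentences, and the function names are all assumptions chosen for illustration; a real model would be trained on a large labelled corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set, for illustration only.
TRAIN_TEXTS = [
    "I love this beautiful sunny day",
    "what a great game, so happy",
    "best coffee ever, wonderful morning",
    "I hate this awful traffic",
    "terrible service, so sad and angry",
    "worst flight ever, horrible delay",
]
TRAIN_LABELS = ["positive"] * 3 + ["negative"] * 3


def train_sentiment_model(texts, labels):
    """Fit a bag-of-words naive Bayes classifier (an assumed model choice)."""
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(texts, labels)
    return model


def stamp_sentiment(model, tweets):
    """Attach a predicted sentiment label to each tweet text."""
    labels = model.predict(tweets)
    return [{"text": t, "sentiment": str(s)} for t, s in zip(tweets, labels)]


model = train_sentiment_model(TRAIN_TEXTS, TRAIN_LABELS)
stamped = stamp_sentiment(model, ["love this wonderful day",
                                  "horrible awful traffic"])
for record in stamped:
    print(record["sentiment"], "|", record["text"])
```

Storing the label alongside each tweet at indexing time lets sentiment be used as one more field to filter or facet on.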
HHypermap
Similar approach to BOP
(Solr/Lucene): provides a
searchable registry of map
service layers from OGC
and Esri public endpoints
