Innovation and Reinvention Driving Transformation
October 9, 2018
2018 HPCC Systems® Community Day
Gavin Halliday
A First Look at HPCC Systems 7.0, Innovation in Action
Renewing the foundations
• File processing
• ECLWatch workunit interface
• Visualization Framework
• DESDL
• Configuration manager
HPCC 7.0 2
Usability and Productivity
ECL Watch
Goals
• Highlight important information
• Make it easier to understand queries
• Improved support for very large queries
Examples:
• Gantt
• Graph Viewer
• Timings
• Log data visualizer
Gantt chart
New Graph Viewer
Stats and Timings
Visualization Framework
• Version 2.0 now available
• https://github.com/hpcc-systems/Visualization
• Rebranded as hpcc-js in the node npm repository
• New documentation, demos and gallery
• Includes non-visualization items like the ESP comms layer
• Dashy beta
• Not tied to HPCC Systems
• Visualizer Bundle 1.1
ECL libraries
• ECL library extensions
• Date – timestamps, time zones, formatting
• Unicode – words, prefixes and suffixes
• Maths – infinity, fmod
• Bundles
• Data Patterns
• ML – Gradient boosted trees, boosted forests
• Visualizer
ESP improvements
• DESDL improvements
• Custom mappings
• Fully integrated into ESP
• Mixing DESDL and ESDL in one service
• Allow disconnection from Dali
• Support for persistent connections.
ECL Compiler
• Activities in other languages.
EXPORT streamed dataset(r) myDataset(unsigned numRows = numRows) :=
EMBED(javascript : activity) …
• Multi-line string constants
message := '''One
Two
Three''';
• Code generator improvements
• Faster archive generation
• Faster syntax checking
Interoperability
Spark
• “An open source distributed general-purpose cluster-computing framework”
• Reading from Spark
• Files and indexes.
• Filter rows
• Select fields required
• N to M parallel reads
• Writing from Spark
• File security
• Spark cluster installation
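The read-side bullets above can be pictured with a small sketch. This is not the Spark-HPCC connector's API: `assign_parts` and `read_part` are hypothetical names, and the sketch assumes that "N to M parallel reads" means N file parts shared among M readers. The point it illustrates is that rows are filtered and fields are selected where the part is read, so only the required data has to cross the network.

```python
def assign_parts(num_parts, num_readers):
    """Share part numbers 0..num_parts-1 among readers in contiguous runs.

    Hypothetical helper: earlier readers take one extra part when the
    parts do not divide evenly; surplus readers get an empty list.
    """
    if num_readers <= 0:
        raise ValueError("num_readers must be positive")
    base, extra = divmod(num_parts, num_readers)
    assignments = []
    start = 0
    for reader in range(num_readers):
        count = base + (1 if reader < extra else 0)
        assignments.append(list(range(start, start + count)))
        start += count
    return assignments


def read_part(rows, row_filter, fields):
    """Yield only matching rows, reduced to the requested fields.

    Stands in for filtering and projecting at the source, before any
    data is transferred to the reader.
    """
    for row in rows:
        if row_filter(row):
            yield {name: row[name] for name in fields}
```

For example, `assign_parts(5, 2)` gives the first reader three parts and the second reader two.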
Log Data Visualizations (three dashboard screenshots: cluster summary, single machine, ESP transactions per minute)
https://hpccsystems.com/blog/ELK_visualizations
VS Code
https://code.visualstudio.com/
Security
User Security
• Session management
• Avoid resending credentials
• Users can log out
• Allow sessions to lock and time out
• Minimize the time passwords are retained
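The session behaviour listed above can be sketched as follows. This is an illustration of the general pattern only, not ESP's implementation: `SessionManager`, its methods, and the injected `check_password` callback are all hypothetical. The password is checked once at login and never stored; later requests carry only an opaque token, such as a session cookie value.

```python
import secrets
import time


class SessionManager:
    """Hypothetical token-based sessions with idle locking and logout."""

    def __init__(self, check_password, idle_timeout_s=600, clock=time.monotonic):
        self._check_password = check_password  # callable(user, password) -> bool
        self._idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._sessions = {}  # token -> {"user", "last_seen", "locked"}

    def login(self, user, password):
        """Verify credentials once and return a token, or None on failure."""
        if not self._check_password(user, password):
            return None
        token = secrets.token_urlsafe(32)
        self._sessions[token] = {
            "user": user,
            "last_seen": self._clock(),
            "locked": False,
        }
        return token

    def authenticate(self, token):
        """Return the user for a live session, or None if unknown or locked."""
        session = self._sessions.get(token)
        if session is None:
            return None
        now = self._clock()
        if now - session["last_seen"] > self._idle_timeout_s:
            session["locked"] = True  # lock automatically after inactivity
        if session["locked"]:
            return None
        session["last_seen"] = now
        return session["user"]

    def lock(self, token):
        """Lock a session on request; unknown tokens are ignored."""
        if token in self._sessions:
            self._sessions[token]["locked"] = True

    def unlock(self, token, password):
        """Unlock a session by re-entering the password."""
        session = self._sessions.get(token)
        if session is None or not self._check_password(session["user"], password):
            return False
        session["locked"] = False
        session["last_seen"] = self._clock()
        return True

    def logout(self, token):
        """Forget the session so the token can no longer be used."""
        self._sessions.pop(token, None)
```

The clock is injectable so the idle timeout can be exercised without waiting.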
System security
• Spark
• File access rights
• Dafilesrv authentication of requests
• The cloud
• Verifying components
• Encryption in transit
• ROXIE HTTPS support
Performance
Thor
• Keyed Join (HPCC-16476)
Thor
• LOOP
• Synchronization overhead
• LOCAL LOOP bodies
• Child Queries
• Reduced overhead
• Improvements to buffering
• Faster Startup
Index improvements
• Hourly: 60K rows, 0.02% of total
• Daily: 1.4M rows, 0.6% of total
• Weekly: 10M rows, 4% of total
• Monthly: 43M rows, 17% of total
• Historical: 520M rows, 100% of total
• Example database containing 250M unique items with 1000 updates each minute
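The figures on this slide can be roughly reproduced from the two numbers in the example. The sketch below rests on assumptions that are not stated on the slide: a month is 30 days, "historical" covers 12 such months, every update touches a different item (so the share is an upper bound), and the share is capped at 100% of the 250M unique items.

```python
UPDATES_PER_MINUTE = 1_000
UNIQUE_ITEMS = 250_000_000

# Assumed period lengths in minutes (30-day months, 12 months of history).
PERIOD_MINUTES = {
    "Hourly": 60,
    "Daily": 60 * 24,
    "Weekly": 60 * 24 * 7,
    "Monthly": 60 * 24 * 30,
    "Historical": 60 * 24 * 30 * 12,
}


def sub_index_sizes(updates_per_minute=UPDATES_PER_MINUTE, unique_items=UNIQUE_ITEMS):
    """Return {period: (rows, share_of_unique_items)} for each sub-index."""
    sizes = {}
    for name, minutes in PERIOD_MINUTES.items():
        rows = updates_per_minute * minutes
        # A sub-index cannot cover more than all of the unique items.
        share = min(rows / unique_items, 1.0)
        sizes[name] = (rows, share)
    return sizes


if __name__ == "__main__":
    for name, (rows, share) in sub_index_sizes().items():
        print(f"{name:<10} {rows:>12,} rows  {share:.2%} of total")
```

Under these assumptions the hourly sub-index holds 60,000 rows, which is why it is so much cheaper to rebuild than the historical one.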
Index improvements
• Bloom filters
• Supports multiple filters per index
• User configurable probability
• Automatically created.
• Richard’s blog post hpccsystems.com/blog/bloom-filters
• Hash distributed keys.
• When distribution fields are filtered with equalities
• Easier to create co-distributed keys
• Lower overhead calculating the part containing a match
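To show why a Bloom filter helps when a lookup has to consult several sub-indexes, here is a minimal sketch of the idea. It is not how the platform implements or sizes its filters: `BloomFilter`, `SubIndex`, and `lookup` are hypothetical names, and the SHA-256-based bit positions and the 10-bits-per-key sizing are choices made for simplicity. A filter can report a key that is absent (a false positive) but never misses a key that was added, so skipping a sub-index on a negative answer is always safe.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit set probed at several hash-derived positions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive one bit position per seed from a salted SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


class SubIndex:
    """One member of a superindex, with a filter built over its keys."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = dict(rows)
        self.filter = BloomFilter(num_bits=max(64, 10 * len(self.rows)), num_hashes=4)
        for key in self.rows:
            self.filter.add(key)


def lookup(sub_indexes, key):
    """Search only the sub-indexes whose filter cannot rule the key out."""
    matches, searched = [], []
    for sub in sub_indexes:
        if not sub.filter.might_contain(key):
            continue  # definitely absent: skip the expensive index search
        searched.append(sub.name)
        if key in sub.rows:
            matches.append((sub.name, sub.rows[key]))
    return matches, searched
```

The `searched` list returned by `lookup` shows how many sub-indexes actually had to be consulted for a given key.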
Finally
• WsSQL – now part of the core
• Over 1,000 pull requests since 6.4
Talk to us!
• Bloom filters - Richard Chapman
• DESDL - Yanrui Ma
• ELK - Rodrigo Pastrana
• Thor - Jake Cobbett-Smith
• Visualizations - Gordon Smith
• Security - Tony Fishbeck
• Spark - Rodrigo Pastrana
• Config Manager - Ken Rowland



Editor's Notes

  • #2: Good afternoon. In this presentation I am going to guide you through some of the main changes in the new version of the platform. If something catches your eye and you want to find out more, please come and chat afterwards in one of the breaks. Hopefully by the end you’ll all be dying to try it out for yourselves. [20]
  • #3: So, each major version of the platform is a chance for us to make significant changes to some of the foundations. The changes in 7.0 have enabled us to introduce various new features, but just as importantly they provide the scope for improvements in future releases. Let’s take the first of these as an example. The file changes came about through a combination of different requirements: First of all, we wanted to make it easier for ECL developers when file formats change. Previously, if the format of a file changed, you needed to update your own copy of the ECL definition before you could read it. It would be much better if you could continue to use the old definition until it was convenient for you to update your sources. Secondly, it can be slow reading files and indexes between clusters because the network capacity between them is often much smaller than within a cluster. If the data being transferred could be reduced by filtering and projecting remotely, it should progress much faster. Thirdly, there was a need to improve integration with other platforms, particularly Spark. So we revamped the file processing code to make it more flexible. As a bonus, in future versions it will make it easier to read other file formats, and even reduce the size of the generated C++ code. I’ll return to some of the other items in this list later, but for the rest of this presentation I’m going to group the changes into four main areas. [1:40]
  • #4: The first area is changes that improve your day to day experience as a developer. [10]
  • #5: EclWatch is something that all ECL developers spend quite a lot of time using – whether it is directly in a browser web page, or embedded within the eclide. We wanted to bring important information to your attention. For instance if something is wrong with your query or with the system it should be clearly presented to you, ideally on a dashboard, rather than needing to go and hunt for it. We also wanted to give you better tools to understand your queries, to dig into the detail, for example where is the time going, and what was happening at a particular point in your query. Let’s look at a few of the changes in more detail. [50]
  • #6: The workunit timings and graph pages have gained a Gantt chart at the top. It includes all the events in a workunit’s lifetime, tooltips provide extra details, and you can zoom in on any part of the chart. Here are three different examples. The first example comes from a system that is busy. It isn’t always obvious why your job took a long time to run. Was the compiler slow, was Thor busy, or is it just a slow job? Here you can quickly see that although the workunit took about 80 seconds to execute, almost one minute of that time was taken up waiting for a Thor to become available before the graph could run. The second example is that same chart zoomed in to highlight the time taken compiling a query, with a tooltip highlighting details from one of the stages. The final example is from a workunit with multiple workflow actions like persists, or independents. You can quickly see where the time has gone, and the order the graphs and subgraphs were executed in. [1:00]
  • #7: A new JavaScript graph viewer was introduced in 6.0, and in 7.0 it has been fully integrated into Gordon’s visualisation framework. As well as meaning it is available for anyone to use in their visualisations, it also allows other components of the visualisation framework to be easily included in the graph. For the moment Gordon has used that to add little tweaks like icons for the activity types, but I suspect he has many other ideas. [30]
  • #8: One problem with large queries is that the graphs can be unmanageable and take forever to display. One significant change is the graph viewer can now request a much smaller subset – for instance clicking on a subgraph in the timings list brings you to this view – which can be rendered much more quickly. [20]
  • #9: Our goal for improving the timings tab is simple enough – to make it easy to examine the performance of your query. Unfortunately the best way to present all the information that is available isn’t immediately obvious, but hopefully the changes we have made will be a step in the right direction. This example shows 4 different timings for a graph that reads from disk, sorts, and then writes to disk. The purple bars represent the total time within that activity, and the other colour bars represent times for different tasks within the activity. It helps give a better idea of where the time is going and why. Again this is another area I expect to change and improve in future versions. So please let us know what sorts of comparisons would be useful to you, and how you would like them displayed. [50]
  • #10: Many of these changes in ECL Watch rely on the improvements to the visualisation framework, which I think is worth highlighting in its own right. If you are producing any visualisations – with or without HPCC – it would be well worth your time investigating it further. For those who don’t know, the visualisation framework is a separate open source project, held in its own GitHub repository. It provides visualisations that can pull data from various sources, especially big data. It is designed to work well with all common JavaScript frameworks, and is published in the node npm repository, which makes it trivial to include in any project. There are really two different components to the library – visualisations and communications. The visualisation side provides great functionality – like the Gantt charts and graph viewer that you saw earlier. But the framework really comes into its own when it is used in combination with HPCC. For instance you can directly render the results of your Roxie query to a chart embedded on a web page. If you are including visualizations in your ECL queries, then go along to the breakout session that Gordon is hosting later, which will cover the new version of the visualizer bundle in much more detail. [1:20]
  • #11: I am not going to delve into any detail on the changes within the ecl library. What I want to bring to your attention is that there are improvements in each of these areas. So whether you need to split Unicode strings into words, or process dates in different timezones, there may well be changes in 7.0 that make your job easier. We have already heard details from Dan and Roger about some of the bundle changes, and more about the visualizer is coming up in the following breakout. [30]
  • #12: The ESP improvements really help those who are developing web services. Dynamic ESDL has been around since 5.0, allowing service definitions to be directly deployed to ESP. But up until now quite a few services could not take advantage of it because the query received from ESP needed to be modified before being passed on to Roxie - and that modification required the use of custom C++. In 7.0 a big improvement is the introduction of custom transforms. Along with the ESDL definition you can include a specification in an XML file that takes inputs like the request, security values, etc. and uses them to modify the query that gets sent to Roxie. What it means to the web service developer is that the custom C++ code can now be replaced with an XML definition. That is probably worthwhile in itself – reducing the scope for mistakes. Even better, it means that the vast majority of services can now use DESDL and be deployed directly from the command line without having to compile C++. Perhaps most significantly, you avoid the need to bring ESP down, deploy the compiled mapping code, and then bring it up again every time a new service definition is required. DESDL is now fully integrated into ESP – it is really more like an ESP v2. It is now just another way of configuring ESP services. A few other improvements to ESP have allowed greater control when instances are acting as standalone web servers. For instance, being able to connect and disconnect from Dali means that operations has control over when service definitions are updated, and allows them to isolate ESP from other parts of the system. [2:00]
  • #13: Version 6 added support for embedded languages like python, or MySQL, but their use was a bit restricted. For example there was no EMBED equivalent to an output statement that takes a stream of input records and is executed in parallel over all the nodes. The new activity attribute on an EMBED now allows you to achieve that. Other changes in the compiler focus on improving working with a local repository. Some examples include speeding up local syntax checking and generating the archives that are sent to eclccserver, and providing support for auto completion in editors. [40]
  • #14: We don’t have the resources (or the skills) to solve every problem within the HPCC code base. Instead Richard’s team concentrates on improving and extending our core functionality, while also providing you with the ability to integrate other open source projects into your solutions. Allowing other languages to create activities is part of those improvements. What else have we done? [30]
  • #15: You have probably heard of it, but what is Spark? According to Wikipedia it is “An open source distributed general-purpose cluster-computing framework”. That sounds awfully like HPCC, so why would you want to use it? They are similar, but HPCC and Spark have different strengths and development communities. For example Spark is particularly strong in the machine learning community, and many researchers use it to develop new machine learning algorithms. If you want to apply that work to your data you will be much more successful running those algorithms on spark, rather than trying to port them to HPCC. Another reason to use Spark might be familiarity. If your data analysts are already using spark, with a development environment they are familiar with, then they will want to continue using it. But if a group want to use Spark, and all your data is on HPCC, you have a problem. Well no longer. Version 7 allows Spark to read both files and indexes from HPCC. This allows you to use HPCC for the bulk of your data processing, and use Spark for the areas that particularly suit it. You can then export your results back to HPCC ready to be processed along with the rest of your data. If you want to experiment, then to make life even easier there will also be an optional package which will install and configure a Spark cluster on the same nodes that are used to run HPCC. Of course in 5 years time there may well be a new trendy platform. If so we will make sure that HPCC can also integrate with that platform, whatever it may be. [1:45]
  • #16: The log files generated by the system contain really useful information, but it can be a real pain in the neck to get at. Version 7 makes it easy to integrate an ELK stack with the system, including the ability to add Kibana dashboards into eclwatch. This integration is highly configurable, and can be useful for many different roles. For example operations can track system health, segfaults, and many other significant events. Developers can search log entries and identify problems. Here, for example, is a dashboard that shows the summary status of a complete cluster. [40]
  • #17: This example on the other hand provides details about a single machine within the cluster. [10]
  • #18: And this dashboard item can track the number of transactions per minute going through esp. If you want to know more, there is a blog post to get you started that contains various recipes for extracting different pieces of information from the logs and then visualising them within eclwatch. [20]
  • #19: A bit of a change of focus. What is VS Code and why do I care? Well, if you’re writing ECL on a Windows machine then ECL IDE provides a good development environment. If you’re not, then what can you do? VS Code provides the cross-platform equivalent. For those who haven’t heard of it, VS Code is a lightweight source code editor, which is gaining widespread adoption. It is designed from the start to be highly customizable and extensible. It has numerous downloadable extensions for different languages, different source control systems, spell checkers, and much much more. Gordon has developed an ECL extension which allows you to use VS Code in a very similar way to ECL IDE. It is fully functional, even including auto completion, and he is actively developing it. A few brave souls might even be tempted to swap from ECL IDE to VS Code – especially if you are writing code in multiple languages, or particularly value its customizability. [60]
  • #20: Here is an example of what it looks like when you are editing ecl code. You can see a tree of attributes on the right, the syntax colouring in the editor and integration of the compiler errors just like ecl ide. If you want to find out more then go to Arjuna’s breakout session later today. [20]
  • #21: Improving security is a continual task. It was improved in 6.0, and I’m sure it will be in the list of improvements for 8.0, and the foreseeable future. So what has changed? [15]
  • #22: Previously there were a couple of potential problems with the way that browsers connect to eclwatch. The scheme used for authenticating users meant the user name and password were sent with each request, and because the browser sends them automatically there wasn’t a natural way to log out or connect as a different user. This has now changed so the user and password is authenticated once, and after that the connection continues using a session cookie. What practical difference will it make? You now see a different dialog to request the username and password, and once logged in there are options in the top right corner to log out and lock your session, and sessions will lock automatically after a period of inactivity. [45]
  • #23: Adding the capability for Spark to read Thor files is great, but it raises some security issues. There is no point verifying ECL users have the rights to access files if Spark users can read any file they want. So along with the Spark integration, work needed to be done to ensure the access rights are checked and enforced consistently. And the move to host environments in the cloud also poses extra security challenges. Depending on your level of paranoia you may want the system to: verify that you are really talking to the server you think you are; sign messages to verify that the source of a message is who they claim to be; and add encryption in transit to ensure that no one can read the data being sent between components. Version 7 contains several changes to improve this situation – for instance Roxie now supports HTTPS, which allows end-to-end encryption for Roxie queries in the cloud. [55]
  • #24: Finally of the four, performance is another long term goal that is always going to be on the improvements list. Here are a few areas that are worth highlighting: [10]
  • #25: Thor has historically been very good at performing standard joins, but not so good at keyed joins. Indeed, sometimes it has been quicker to perform a full join against an index than a keyed join. To tackle this Jake has completely reimplemented keyed joins in Thor. To give you some idea of the improvement, here is a graph of the timings from the performance suite. As you can see it is fairly dramatic! There are more details in the Jira issue if you are interested. Obviously your mileage is going to vary, but I would be very surprised if you did not see a fairly dramatic improvement in your own examples. [40]
  • #26: Some of the extensions to the ML library have really stretched (and sometimes broken) the LOOP activity. As a result there are fixes to the code generator and improvements to Thor, particularly reducing the synchronization between the slave nodes. The other entries on this slide are all examples of improvements to performance, which have come about in response to issues that have been reported. Hopefully they will benefit many users. [35]
  • #27: The final performance improvement involves indexes. Indexes are used by Roxie queries to provide quick access to data. They are however read only and do not support incremental updates, and if they are large they can be slow to build. That causes a problem if the data you are storing is constantly being updated. The common solution to this problem is to use a superindex. This is where a collection of indexes with the same structure are treated as a single index. Those sub indexes are updated at different frequencies – for example on this diagram hourly, daily, weekly, monthly, yearly. [I have also included some typical figures for numbers of rows.] This scheme retains the quick access to the data, but also allows quick updates since the hourly index takes a fraction of the time to build because it is much smaller. This approach does though have a disadvantage. Now instead of searching a single index file for a match, the system has to search all 5 of the sub indexes. And since only a small proportion of the records are changed each hour, most of those searches are not going to find any matches. [1:20]
  • #28: This is where Bloom filters help. They allow the system to quickly exclude indexes from consideration. That means that most of the time that five-index lookup will be reduced to two or three. If you want to understand how they work, and how you use them from ECL, then Richard has written a great blog post for you to read. Hash distributed keys are related because they will help you to build those incremental updates. They provide a simpler way to build distributed keys that are consistently distributed, and don’t develop problems with skew over time. [35]
  • #29: There have been a lot of bug fixes, improvements and new features. When I last looked there were more than 1,000 changes that were not part of the 6.x series. [40]
  • #30: So, while you are at the conference please make the most of your opportunity to talk to the developers. Come and ask us questions, give us feedback and suggest your crazy new ideas. If you want to know who to talk to, here are some suggestions to get you started. [15]