Sergio Fernández (Redlink GmbH)
November 14th, 2016 - Sevilla
Moven
Machine/Deep Learning Models Distribution
Relying on the Maven Infrastructure
SSIX aims
to exploit the
predictive power
of Social Media
on Financial
Markets
High-Level Technical Architecture
Analysis Pipeline
Dashboard
Data Collection
RESTful API
Storage
...apps
X-Scores
Further details at http://ssix-project.eu/
Models in SSIX and Redlink
At Redlink, particularly in the SSIX project, we deal with quite
deep neural networks that produce very large models (several
gigabytes).
That raised two problems:
1. How to properly manage their distribution and versioning?
2. How to automate their testing?
Moven
moven = models + maven
https://bitbucket.org/ssix-project/moven
We started working on Moven because state-of-the-art technology lacked a proper
answer to the two needs described before (distribution and testability).
Some examples:
● TensorFlow public models use a regular git repository
● In Spark ML most people use shared storage (e.g., HDFS)
● OpenNLP bundles its models as JARs
● FreeLing uses a shared folder from the native installation packages
● Some other proprietary methods...
Since Maven does a great job for software artifacts, we decided to reuse that
infrastructure for models too.
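Reusing the Maven infrastructure means a model is addressed by ordinary Maven coordinates and resolved to a predictable repository path, so any Maven repository or mirror can serve it. A minimal sketch of that mapping (the coordinate is the example artifact used later in this deck; SNAPSHOT timestamping in remote repositories is ignored here):

```python
def coordinate_to_path(coordinate):
    """Resolve 'groupId:artifactId:version' to the standard Maven
    repository layout path (ignoring SNAPSHOT timestamping)."""
    group_id, artifact_id, version = coordinate.split(":")
    group_path = group_id.replace(".", "/")  # dots become directory separators
    return f"{group_path}/{artifact_id}/{version}/{artifact_id}-{version}.jar"

print(coordinate_to_path(
    "io.redlink.ssix.moven:moven-syntaxnet-example:1.0-SNAPSHOT"))
# io/redlink/ssix/moven/moven-syntaxnet-example/1.0-SNAPSHOT/moven-syntaxnet-example-1.0-SNAPSHOT.jar
```

This predictability is exactly what existing tooling (mirrors, proxies, access control) relies on.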
Moven key features
● Model agnostic
● Publication based on a regular Maven plugin
● Distribution relying on the existing Maven infrastructure
○ benefiting from all the features provided by existing tooling (access control,
mirroring, etc.)
● Retrieval currently supported in:
○ Java (Maven of course)
○ Python (relying on jip)
● Built-in gzip-based compression
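The built-in gzip compression matters for multi-gigabyte models. A sketch of the transparent decompress-on-read idea, assuming models are stored gzip-compressed (the helper is hypothetical, not Moven's actual API; only the standard gzip magic-header check is used):

```python
import gzip
import io

def open_model(raw_bytes):
    """Return a file-like object over model bytes, transparently
    gunzipping when the payload carries the gzip magic header."""
    if raw_bytes[:2] == b"\x1f\x8b":  # gzip files start with 0x1f 0x8b
        return io.BytesIO(gzip.decompress(raw_bytes))
    return io.BytesIO(raw_bytes)

payload = gzip.compress(b"model-weights")
assert open_model(payload).read() == b"model-weights"   # compressed path
assert open_model(b"model-weights").read() == b"model-weights"  # plain path
```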
Related work
There is some interesting work related to our goals:
● StanfordNLP has recently (>3.5.2) switched to bundling its models as Maven
artifacts
● TensorFlow Serving helps deploy new algorithms for TensorFlow models
● PipelineIO combines several technologies (Spark, NetflixOSS, etc.; they call it the
PANCAKE STACK) to provide model distribution, including incremental training,
among many other features (more details).
Publish Moven models
Create a regular Maven artifact,
placing the models at
src/main/models, just including
a plugin configuration:
<plugin>
  <groupId>io.redlink.ssix.moven</groupId>
  <artifactId>moven-maven-plugin</artifactId>
  <version>0.1.0-SNAPSHOT</version>
  <executions>
    <execution>
      <phase>process-resources</phase>
      <goals>
        <goal>copy-models</goal>
      </goals>
    </execution>
  </executions>
</plugin>
Then deploy your artifacts as usual with mvn deploy
Using your Moven models
From Java:
● Declare the dependency on your
models in your pom.xml
● The models will then be available
on your classpath:
this.getClass().getClassLoader()
    .getResourceAsStream("META-INF/resources/models/foo.ex")
● Also exposed via HTTP as static
resources when the JAR is deployed
in any Servlet >=3.0 container
(inspired by James Ward and the
WebJars project)
From Python:
● Install it: pip install moven
● Declare your models in a
models.txt file in your project (as
we do with requirements.txt),
using a syntax similar to Groovy's
Grape:
io.redlink.ssix.moven:moven-syntaxnet-example:1.0-SNAPSHOT
● Execute moven models.txt to
retrieve all models to ./moven,
organized by artifactId
● Designed specifically for
container deployments
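On the Java side, the dependency declaration is a plain pom.xml entry; a sketch using the example coordinates above (the plain jar packaging is an assumption):

```xml
<dependency>
  <groupId>io.redlink.ssix.moven</groupId>
  <artifactId>moven-syntaxnet-example</artifactId>
  <version>1.0-SNAPSHOT</version>
</dependency>
```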
let’s play
https://www.flickr.com/photos/gsfc/3533864222
Current status and future
Moven is still in a very early stage, but already being used in
production in SSIX and other Redlink projects.
We will keep exploring such approaches to better manage the
lifecycle of the models that drive our information extraction
stack (Natural Language Processing, Machine Learning, Deep
Learning, etc.).
For example, we want to target more specific needs in some
concrete environments, such as Apache Spark and/or Apache
Beam Runners API.
Thank you!
Sergio Fernández
Software Engineer
sergio.fernandez@redlink.co
https://www.wikier.org/
Redlink GmbH
http://redlink.co
Coworking Salzburg
Jakob Haringer Straße 3
5020 Salzburg (Austria)
project partially funded by the European Union’s Horizon 2020 research
and innovation programme, under grant agreement no. 645425

Moven - Apache Big Data Europe 2016 - SSIX Project
