Enterprise Deep Learning with DL4J
Skymind
Josh Patterson
Hadoop Summit 2015
Presenter: Josh Patterson
Past
Research in Swarm Algorithms
Real-time optimization techniques in mesh sensor networks
TVA / NERC
Smartgrid, Sensor Collection, and Big Data
Cloudera
Today
Patterson Consulting
josh@pattersonconsultingtn.com
Skymind (Advisor)
josh@skymind.io / @jpatanooga
Architecture, Committer on DL4J
Topics
• What is Deep Learning?
• What is DL4J?
• Enterprise Grade Deep Learning Workflows
WHAT IS DEEP LEARNING?
We want to be able to recognize handwriting
This is a hard problem
Automated Feature Engineering
• Deep Learning can be thought of as workflows for
automated feature construction
– Where previously we’d consider each stage in the
workflow as a unique technique
• Many of the techniques have been around for
years
– But now are being chained together in a way that
automates exotic feature engineering
• As LeCun says:
– “machines that learn to represent the world”
[Image slides: visualizations of the learned filters / features (see Editor’s Notes)]
These are the features learned at each neuron in a Restricted Boltzmann Machine (RBM).
These features are passed to higher levels of RBMs to learn more complicated things.
[Callout on figure: part of the “7” digit]
Learning Progressive Layers
Deep Learning Architectures
• Deep Belief Networks
– Most common architecture
• Convolutional Neural Networks
– State of the art in image classification
• Recurrent Networks
– Time series
• Recursive Networks
– Text / image
– Can break down scenes
Next-Generation Deep Learning with DL4J
DL4J
• “The Hadoop of Deep Learning”
– Command line driven
– Java, Scala, and Python APIs
– ASF 2.0 Licensed
• Java implementation
– Parallelization (YARN, Spark)
– GPU support
• Also supports multiple GPUs per host
• Runtime Neutral
– Local
– Hadoop / YARN
– Spark
– AWS
• https://github.com/deeplearning4j/deeplearning4j
Issues in Machine Learning
• Data Gravity
– We need to process the data in workflows where the data lives
• If you move data you don’t have big data
– Even if the data is not “big” we still want simpler workflows
• Integration Issues
– Ingest, ETL, Vectorization, Modeling, Evaluation, and
Deployment issues
– Most ML tools are built with previous generation architectures
in mind
• Legacy Architectures
– Parallel iterative algorithm architectures are not common
DL4J Suite of Tools
• DL4J
– Main library for deep learning
• Canova
– Vectorization library
• ND4J
– Linear algebra framework (see the minimal sketch after this list)
– Swappable backends (JBLAS, GPUs):
• http://www.slideshare.net/agibsonccc/future-of-ai-on-the-jvm
• Arbiter
– Model evaluation and testing platform
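A minimal ND4J sketch (illustrative only; the class name is hypothetical): the same linear-algebra code runs unchanged whether the backend artifact on the classpath is JBLAS on CPU or JCublas on GPU.

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class Nd4jBackendSketch {
    public static void main(String[] args) {
        INDArray x = Nd4j.rand(3, 4);   // 3x4 matrix of random values
        INDArray w = Nd4j.rand(4, 2);   // 4x2 weight matrix
        INDArray y = x.mmul(w);         // matrix multiply -> 3x2 result
        System.out.println(y);          // backend is chosen by the ND4J dependency, not the code
    }
}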
DEEP LEARNING AND SPARK
Enterprise Grade
DL4J and Parallelization
[Diagram: Traditional Serial Training vs. a Modern Parallel Engine (Hadoop / Spark). The training data is divided into splits (Split 1, Split 2, Split 3, …); Worker 1 through Worker N each train a partial model on their split, and the master combines the partial models into the global model. A rough sketch of the averaging step follows below.]
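As a rough sketch of the parameter-averaging idea only (not DL4J’s actual Spark implementation; the class and method names here are hypothetical), each worker returns its partial model’s parameters and the master averages them into the global model:

import java.util.List;
import org.nd4j.linalg.api.ndarray.INDArray;

public class ParameterAveragingSketch {
    // Average the parameter vectors produced by N workers into one global parameter vector
    public static INDArray averageParams(List<INDArray> workerParams) {
        INDArray sum = workerParams.get(0).dup();      // copy the first worker's parameters
        for (int i = 1; i < workerParams.size(); i++) {
            sum.addi(workerParams.get(i));             // accumulate in place
        }
        return sum.divi(workerParams.size());          // divide by the number of workers
    }
}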
DL4J Spark / GPUs via API
public class SparkGpuExample {

    public static void main(String[] args) throws Exception {

        // ND4J output settings for this example
        Nd4j.MAX_ELEMENTS_PER_SLICE = Integer.MAX_VALUE;
        Nd4j.MAX_SLICES_TO_PRINT = Integer.MAX_VALUE;

        // set to test mode: local master, average parameters once rather than every iteration
        SparkConf sparkConf = new SparkConf()
            .setMaster("local[*]").set(SparkDl4jMultiLayer.AVERAGE_EACH_ITERATION, "false")
            .set("spark.akka.frameSize", "100")
            .setAppName("mnist");

        System.out.println("Setting up Spark Context...");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // Network configuration: RBM layers, 784 inputs (MNIST pixels), 10 output classes
        MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
            .momentum(0.9).iterations(10)
            .weightInit(WeightInit.DISTRIBUTION).batchSize(10000)
            .dist(new NormalDistribution(0, 1)).lossFunction(LossFunctions.LossFunction.RMSE_XENT)
            .nIn(784).nOut(10).layer(new RBM())
            .list(4).hiddenLayerSizes(600, 500, 400)
            .override(3, new ClassifierOverride()).build();

        System.out.println("Initializing network");
        SparkDl4jMultiLayer master = new SparkDl4jMultiLayer(sc, conf);

        // Load MNIST, parallelize it as an RDD of DataSets, and fit the distributed model
        DataSet d = new MnistDataSetIterator(60000, 60000).next();
        List<DataSet> next = d.asList();
        JavaRDD<DataSet> data = sc.parallelize(next);
        MultiLayerNetwork network2 = master.fitDataSet(data);

        // Evaluate against the training data and print stats
        Evaluation evaluation = new Evaluation();
        evaluation.eval(d.getLabels(), network2.output(d.getFeatureMatrix()));
        System.out.println("Averaged once " + evaluation.stats());

        // Persist the learned parameters and the network configuration
        INDArray params = network2.params();
        Nd4j.writeTxt(params, "params.txt", ",");
        FileUtils.writeStringToFile(new File("conf.json"), network2.getLayerWiseConfigurations().toJson());
    }
}
Turn on GPUs and Spark
<dependency>
<groupId>org.deeplearning4j</groupId>
<artifactId>dl4j-spark</artifactId>
<version>${dl4j.version}</version>
</dependency>
<dependency>
<groupId>org.nd4j</groupId>
<artifactId>nd4j-jcublas-7.0</artifactId>
<version>${nd4j.version}</version>
</dependency>
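To run the same code on CPU instead of GPU, the CUDA backend above can be swapped for ND4J’s native (JBLAS) backend. The artifact name below is an assumption based on the ND4J releases of this era; check the ND4J documentation for the exact backend artifact for your version:

<dependency>
<groupId>org.nd4j</groupId>
<artifactId>nd4j-jblas</artifactId>
<version>${nd4j.version}</version>
</dependency>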
Building Deep Learning Workflows
• We need to get data from a raw format into a
baseline raw vector
– Model the data
– Evaluate the Model
• Traditionally these are all tied together in one
tool
– But this is a monolithic pattern
– We’d like to apply Unix principles here
• The DL4J Suite of Tools lets us do this
Modeling UCI Data: Iris
• We need to vectorize the data
– Possibly with some per-column transformations
– Let’s use Canova
• We then need to build a deep learning model
over the data
– We’ll use the DL4J lib to do this
• Finally we’ll evaluate what happened
– This is where Arbiter comes in
Canova for Command Line Vectorization
• Library of tools to take
– Audio
– Video
– Image
– Text
– CSV data
• And convert the input data into vectors in a
standardized format
– Adaptable with custom input/output formats
• Open Source, ASF 2.0 Licensed
– https://github.com/deeplearning4j/Canova
– Part of DL4J suite
Vectorization with Canova
• Set up the configuration file
– Input Formats
– Output Formats
– Set up the data types to vectorize
• Set up the schema transforms for the input
CSV data
• Generate the SVMLight vector data as the
output
– with the command line interface
Workflow Configuration (iris_conf.txt)
input.header.skip=false
input.statistics.debug.print=false
input.format=org.canova.api.formats.input.impl.LineInputFormat
input.directory=src/test/resources/csv/data/uci_iris_sample.txt
input.vector.schema=src/test/resources/csv/schemas/uci/iris.txt
output.directory=/tmp/iris_unit_test_sample.txt
output.format=org.canova.api.formats.output.impl.SVMLightOutputFormat
Iris Canova Vector Schema
@RELATION UCIIrisDataset
@DELIMITER ,
@ATTRIBUTE sepallength NUMERIC !NORMALIZE
@ATTRIBUTE sepalwidth NUMERIC !NORMALIZE
@ATTRIBUTE petallength NUMERIC !NORMALIZE
@ATTRIBUTE petalwidth NUMERIC !NORMALIZE
@ATTRIBUTE class STRING !LABEL
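For reference, the raw input rows this schema describes follow the standard UCI Iris CSV layout; two illustrative rows (values from the public UCI dataset):

5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor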
Model UCI Iris From CLI
./bin/canova vectorize -conf /tmp/iris_conf.txt
File path already exists, deleting the old file before proceeding...
Output vectors written to: /tmp/iris_svmlight.txt
./bin/dl4j train -conf /tmp/iris_conf.txt
[ …log output… ]
./bin/arbiter evaluate -conf /tmp/iris_conf.txt
[ …log output… ]
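For context, each record in the generated SVMLight output is one line: a numeric label followed by index:value pairs. The exact values depend on the normalization, so the line below is illustrative only:

0 1:0.645 2:0.795 3:0.202 4:0.083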
Skymind as DL4J Distribution
• Just as Red Hat was to Linux
– A distribution of Linux with enterprise grade
packaging
• Just as Cloudera/Hortonworks are to Hadoop
– A distribution of Apache Hadoop with enterprise
grade packaging
• Skymind is to DL4J
– A distribution of DL4J (+tool suite) with enterprise
grade packaging
Questions?
Thank you for your time and attention
“Deep Learning: A Practitioner’s Approach”
(O’Reilly, October 2015)
Editor's Notes
  • #8-#10: We plot the learned filter for each hidden neuron, one per column of W. Each filter has the same dimensions as the input data, and it is most useful to visualize the filters in the same way as the input data; for image patches, we show each filter as an image patch.
  • #18: POLR: Parallel Online Logistic Regression. Talking points: we wanted to start with a tool known to the Hadoop community, with expected characteristics; Mahout’s SGD is well known, so we used that as a base point.
  • #19: API as early adopter entry point
  • #20: This is how we enable Spark execution and GPU integration for the back end. The code is slightly different for a Spark job; the linear algebra / ND4J math code is the same, with no changes.
  • #21: Next release: we’re expanding the user base with a CLI front end to the whole thing. The domain expert rarely knows how to code; we want to make it easier for someone who knows bash to get involved (they will still need some engineering help).