Analytics in the age of
the Internet of Things
Ludwine Probst @nivdul
me
Data Engineer
@nivdul
nivdul.wordpress.com
Women in Tech
Duchess France
@duchessfr
duchess-france.org
Paris chapter Leader
Internet of Things (IoT)
Analytics with Spark
more and more
aircraft
use case: sensor data from a cross-country flight
data points: several terabytes every hour per sensor
data analysis: batch mode or real time analysis
applications:
• flight performance (optimize plane fuel consumption,
  reduce maintenance costs…)
• detect anomalies
• prevent accidents
insurance
use case: data from a connected car key
applications:
• monitoring
• real time vehicle location
• drive safety
• driving score
Why should I care?
Because it can affect & change our business and our everyday life!
Collecting
Time series
112578291481000 -5.13
112578334541000 -5.05
112578339541000 -5.15
112578451484000 -5.48
112578491615000 -5.33
Some protocols…
• DDS – Device-to-Device communication – real-time
• MQTT – Device-to-Server – collect telemetry data
• XMPP – Device-to-Server – Instant Messaging scenarios
• AMQP – Server-to-Server – connecting devices to backend
…
Challenges
limited CPU
&
memory resources
low energy communication network
Storing
• flat file:
limited utility
• relational database:
limited design
rigidity
• NoSQL database:
scalability
faster & more flexible
Storing TS
IoT data pipeline
streaming
Storm
Now, the example!
http://www.cis.fordham.edu/wisdm/index.php
http://www.cis.fordham.edu/wisdm/includes/files/sensorKDD-2010.pdf
WISDM Lab’s study
The example
Goal: identify the physical activity that a user is performing
inspired by WISDM Lab’s study http://www.cis.fordham.edu/wisdm/index.php
The situation
The labeled data comes from an accelerometer (37 users)
Possible activities are:
walking, jogging, sitting, standing, downstairs and upstairs.
This is a classification problem!
Some algorithms to use: Decision tree, Random Forest, Multinomial
logistic regression...
How can I predict the user’s
activity?
1. analyzing part:
collect & clean data from a csv file
store it in Cassandra
define & extract features using Spark
build the predictive model using MLlib
2. predicting part:
collect data in real-time (REST)
use the model to predict result
MLlib
https://github.com/nivdul/actitracker-cassandra-spark
Collect
&
store the data
The accelerometer
A sensor (in a smartphone)
computes acceleration along the X, Y and Z axes
collects data every 50 ms
Each acceleration sample contains:
• a timestamp (e.g., 1428773040488)
• acceleration along the X axis (unit is m/s²)
• acceleration along the Y axis (unit is m/s²)
• acceleration along the Z axis (unit is m/s²)
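With one sample every 50 ms, the sampling rate is 20 Hz, so a window of w seconds holds roughly 20 × w samples. The sketch below is plain Java rather than Spark and only illustrates that arithmetic and one possible way to cut a sorted series into fixed-duration windows. The class name `WindowSketch`, the millisecond unit, and the 5-second window in `main` are assumptions made for illustration, not values taken from the talk.

```java
import java.util.ArrayList;
import java.util.List;

final class WindowSketch {
    /** Sampling period stated on the slide: one sample every 50 ms. */
    static final long SAMPLE_PERIOD_MS = 50L;

    /** Expected number of samples in a window of the given length (ms). */
    static long samplesPerWindow(long windowMs) {
        return windowMs / SAMPLE_PERIOD_MS;
    }

    /**
     * Splits ascending timestamps (assumed to be in ms) into consecutive
     * windows; a new window starts once windowMs has elapsed since the
     * first sample of the current window.
     */
    static List<List<Long>> split(List<Long> timestampsMs, long windowMs) {
        List<List<Long>> windows = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long windowStart = 0L;
        for (long ts : timestampsMs) {
            if (current.isEmpty()) {
                windowStart = ts;
            } else if (ts - windowStart >= windowMs) {
                windows.add(current);
                current = new ArrayList<>();
                windowStart = ts;
            }
            current.add(ts);
        }
        if (!current.isEmpty()) {
            windows.add(current);
        }
        return windows;
    }

    public static void main(String[] args) {
        // Hypothetical 5-second window
        System.out.println(samplesPerWindow(5000L));
    }
}
```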
Accelerometer Android app
REST API collecting data coming from a phone application
Accelerometer Data Model
CREATE KEYSPACE actitracker WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
CREATE TABLE users (user_id int,
activity text,
timestamp bigint,
acc_x double,
acc_y double,
acc_z double,
PRIMARY KEY ((user_id,activity),timestamp));
COPY users FROM '/path_to_your_data/data.csv' WITH HEADER = true;
Accelerometer Data Model: logical view
user_id  activity  timestamp        acc_x  acc_y  acc_z
8        walking   112578291481000  -5.13  8.15   1.31
8        walking   112578334541000  -5.05  8.16   1.31
8        walking   112578339541000  -5.15  8.16   1.36
8        walking   112578451484000  -5.48  8.17   1.31
8        walking   112578491615000  -5.33  8.16   1.18
graph from the Cityzen Data widget
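As a rough illustration of the data model, one row of the logical view can be mapped to a small value object. This is a plain-Java sketch, not code from the project: the class name `AccelerationSample` and the whitespace-separated input are assumptions, and the field order simply follows the columns of the CREATE TABLE statement (user_id, activity, timestamp, acc_x, acc_y, acc_z).

```java
final class AccelerationSample {
    final int userId;
    final String activity;
    final long timestamp;
    final double accX;
    final double accY;
    final double accZ;

    AccelerationSample(int userId, String activity, long timestamp,
                       double accX, double accY, double accZ) {
        this.userId = userId;
        this.activity = activity;
        this.timestamp = timestamp;
        this.accX = accX;
        this.accY = accY;
        this.accZ = accZ;
    }

    /** Parses "user_id activity timestamp acc_x acc_y acc_z" (whitespace-separated). */
    static AccelerationSample parse(String row) {
        String[] f = row.trim().split("\\s+");
        if (f.length != 6) {
            throw new IllegalArgumentException("expected 6 fields, got " + f.length);
        }
        return new AccelerationSample(
                Integer.parseInt(f[0]), f[1], Long.parseLong(f[2]),
                Double.parseDouble(f[3]), Double.parseDouble(f[4]), Double.parseDouble(f[5]));
    }

    public static void main(String[] args) {
        // First row of the logical view
        AccelerationSample s = parse("8 walking 112578291481000 -5.13 8.15 1.31");
        System.out.println(s.activity + " @ " + s.timestamp);
    }
}
```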
Analyzing
https://github.com/nivdul/actitracker-cassandra-spark
Apache Spark is a large-scale in-memory data processing framework
• big data analytics in memory/disk
• complements Hadoop
• faster and more flexible
• Resilient Distributed Datasets (RDD)
interactive shell (Scala & Python)
Lambda
(Java 8)
Spark ecosystem
MLlib
• regression
• classification
• clustering
• optimization
• collaborative filtering
• feature extraction (TF-IDF, Word2Vec…)
MLlib is Apache Spark's scalable machine learning library
spark-cassandra-connector
Exposes Cassandra tables as Spark RDD
Identify features
repetitive: walking, jogging, up/down stairs
VS
static: standing, sitting
graph from the Cityzen Data widget
The activities : jogging
mean_x = 3.3
mean_y = -6.9
mean_z = 0.8
Y-axis: peaks spaced about 0.25 seconds apart
graph from the Cityzen Data widget
The activities : walking
mean_x = 1
mean_y = 10
mean_z = -0.3
Y-axis: peaks spaced about 0.5 seconds apart
graph from the Cityzen Data widget
The activities : up/downstairs
Y-axis: peaks spaced about 0.75 seconds
graph from the Cityzen Data widget
up down
The activities : standing
graph from the Cityzen Data widget
standing, sitting
static activities: no peaks
The features
• Average acceleration (for each axis)
• Variance (for each axis)
• Average absolute difference (for each axis)
• Average resultant acceleration
• Average time between peaks (max) (for Y-axis)
Goal: compute these features for all the users (37) and activities (6) over
windows of a few seconds
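The first four features in the list can be sketched in plain Java for a single window. This is an illustration rather than the Spark/MLlib implementation used in the talk. The definitions are assumptions: variance is the sample variance (dividing by n - 1), the average absolute difference is taken between each value and the window mean, and the resultant acceleration of a sample is sqrt(x² + y² + z²). The class name `WindowFeatures` is hypothetical.

```java
final class WindowFeatures {
    /** Arithmetic mean of one axis over the window. */
    static double mean(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x;
        }
        return sum / v.length;
    }

    /** Sample variance (divides by n - 1); needs at least two values. */
    static double variance(double[] v) {
        double m = mean(v);
        double ss = 0.0;
        for (double x : v) {
            ss += (x - m) * (x - m);
        }
        return ss / (v.length - 1);
    }

    /** Average absolute difference between each value and the mean. */
    static double avgAbsDiff(double[] v) {
        double m = mean(v);
        double sum = 0.0;
        for (double x : v) {
            sum += Math.abs(x - m);
        }
        return sum / v.length;
    }

    /** Average of sqrt(x^2 + y^2 + z^2) over the window. */
    static double avgResultant(double[] x, double[] y, double[] z) {
        if (x.length != y.length || y.length != z.length) {
            throw new IllegalArgumentException("axes must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += Math.sqrt(x[i] * x[i] + y[i] * y[i] + z[i] * z[i]);
        }
        return sum / x.length;
    }

    public static void main(String[] args) {
        // acc_y values of the five "walking" rows shown in the logical view
        double[] accY = {8.15, 8.16, 8.16, 8.17, 8.16};
        System.out.println("mean_y = " + mean(accY));
        System.out.println("var_y  = " + variance(accY));
    }
}
```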
Clean
&
prepare the data
Analytics with Spark
retrieve the data from Cassandra
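import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import com.datastax.spark.connector.japi.CassandraRow;
import com.datastax.spark.connector.japi.rdd.CassandraJavaRDD;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

// define Spark context
SparkConf sparkConf = new SparkConf()
    .setAppName("User's physical activity recognition")
    .set("spark.cassandra.connection.host", "127.0.0.1")
    .setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(sparkConf);

// retrieve data from Cassandra and create a CassandraRDD
CassandraJavaRDD<CassandraRow> cassandraRowsRDD =
    javaFunctions(sc).cassandraTable("actitracker", "users");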
Compute the features
MLlib
Feature: mean
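import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
import org.apache.spark.mllib.stat.Statistics;

public class ExtractFeature {
    // column statistics (mean, variance, max…) of the acceleration vectors
    private MultivariateStatisticalSummary summary;

    public ExtractFeature(JavaRDD<Vector> data) {
        this.summary = Statistics.colStats(data.rdd());
    }

    // return a Vector (mean_acc_x, mean_acc_y, mean_acc_z)
    public Vector computeAvgAcc() {
        return this.summary.mean();
    }
}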
Feature: avg time between peaks
// take the maximum of the acc_y column, using the max function from MLlib
double max = this.summary.max().toArray()[1];
// keep the timestamps of the data points whose value is greater than 0.9 * max,
// and sort them
// Here: data = RDD (ts, acc_y)
JavaRDD<Long> peaks = data.filter(record -> record[1] > 0.9 * max)
    .map(record -> record[0])
    .sortBy(time -> time, true, 1);
Feature: avg time between peaks
// retrieve the first and last elements of the (sorted) RDD
Long firstElement = peaks.first();
Long lastElement = peaks.sortBy(time -> time, false, 1).first();
// compute the delta between consecutive timestamps:
// pair every peak except the first with every peak except the last
JavaRDD<Long> firstRDD = peaks.filter(record -> record > firstElement);
JavaRDD<Long> secondRDD = peaks.filter(record -> record < lastElement);
JavaRDD<Vector> product = firstRDD.zip(secondRDD)
    .map(pair -> pair._1() - pair._2())
    // and keep the delta only if it is > 0
    .filter(value -> value > 0)
    .map(line -> Vectors.dense(line));
// compute the mean of the deltas
return Statistics.colStats(product.rdd()).mean().toArray()[0];
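The same idea can be followed on a small in-memory window without Spark. The sketch below is a plain-Java stand-in, not the project's code: it keeps the timestamps whose acc_y exceeds a fraction of the maximum, sorts them, and averages the positive gaps between consecutive peaks. The class name `PeakGapSketch`, the array-based signature, and the sample values in `main` are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class PeakGapSketch {
    /**
     * Average gap between consecutive peak timestamps, where a peak is any
     * sample whose value exceeds ratio * max. Returns NaN when fewer than
     * two distinct peak timestamps are found.
     */
    static double avgTimeBetweenPeaks(long[] ts, double[] accY, double ratio) {
        if (ts.length != accY.length) {
            throw new IllegalArgumentException("ts and accY must have the same length");
        }
        double max = Double.NEGATIVE_INFINITY;
        for (double v : accY) {
            max = Math.max(max, v);
        }
        // keep the timestamps of the samples above the threshold
        List<Long> peaks = new ArrayList<>();
        for (int i = 0; i < ts.length; i++) {
            if (accY[i] > ratio * max) {
                peaks.add(ts[i]);
            }
        }
        Collections.sort(peaks);
        // average the strictly positive deltas between consecutive peaks
        long sum = 0L;
        int count = 0;
        for (int i = 1; i < peaks.size(); i++) {
            long delta = peaks.get(i) - peaks.get(i - 1);
            if (delta > 0) {
                sum += delta;
                count++;
            }
        }
        return count == 0 ? Double.NaN : (double) sum / count;
    }

    public static void main(String[] args) {
        // Made-up window: timestamps in ms, acc_y in m/s²
        long[] ts = {0L, 100L, 200L, 300L, 400L, 500L};
        double[] accY = {1.0, 10.0, 2.0, 9.5, 3.0, 9.2};
        System.out.println(avgTimeBetweenPeaks(ts, accY, 0.9));
    }
}
```

As in the snippet above, the 0.9 threshold only makes sense when the maximum is positive; with a negative maximum no sample would qualify.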
Choose algorithms
Random Forests
Decision Trees
Multiclass Logistic Regression
MLlib
Goal: identify the physical activity that a user is performing
Decision Trees
// Split data into 2 sets : training (60%) and test (40%)
JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.6, 0.4});
JavaRDD<LabeledPoint> trainingData = splits[0].cache();
JavaRDD<LabeledPoint> testData = splits[1];
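To see what a weighted random split does, here is a plain-Java sketch in the same spirit: each element is sent to the training set with a given probability, so the resulting sizes are only approximately 60% and 40%. It is an illustration on an in-memory list, not Spark's implementation; the class name `RandomSplitSketch` and the explicit seed parameter are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class RandomSplitSketch {
    /**
     * Returns [training, test]. Each item goes to the training list with
     * probability trainFraction, otherwise to the test list.
     */
    static <T> List<List<T>> split(List<T> data, double trainFraction, long seed) {
        Random rng = new Random(seed);
        List<T> training = new ArrayList<>();
        List<T> test = new ArrayList<>();
        for (T item : data) {
            if (rng.nextDouble() < trainFraction) {
                training.add(item);
            } else {
                test.add(item);
            }
        }
        List<List<T>> result = new ArrayList<>();
        result.add(training);
        result.add(test);
        return result;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            data.add(i);
        }
        List<List<Integer>> splits = split(data, 0.6, 42L);
        System.out.println("training: " + splits.get(0).size() + ", test: " + splits.get(1).size());
    }
}
```

Fixing the seed makes the split reproducible, which helps when comparing several algorithms on the same training and test sets.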
Decision Trees
// Decision Tree parameters
Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();
int numClasses = 4;
String impurity = "gini";
int maxDepth = 9;
int maxBins = 32;
// create model
final DecisionTreeModel model = DecisionTree.trainClassifier(
trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins);
// Evaluate model on test instances and compute test error
JavaPairRDD<Double, Double> predictionAndLabel =
testData.mapToPair(p -> new Tuple2<>(model.predict(p.features()), p.label()));
Double testErrDT = 1.0 * predictionAndLabel.filter(pl -> !pl._1().equals(pl._2())).count() / testData.count();
// Save model
model.save(sc.sc(), "actitracker");
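Two quantities appear in this step: the "gini" impurity used to choose splits, and the test error computed from (prediction, label) pairs. The plain-Java sketch below shows both on small arrays. It is an illustration, not MLlib code: Gini impurity is taken as 1 − Σ pᵢ² over the class proportions at a node, the error rate is the fraction of predictions that differ from the labels, and the class name `TreeMetricsSketch` is hypothetical.

```java
final class TreeMetricsSketch {
    /** Gini impurity 1 - sum(p_i^2) for the class counts observed at a node. */
    static double gini(long[] classCounts) {
        long total = 0L;
        for (long c : classCounts) {
            total += c;
        }
        if (total == 0L) {
            return 0.0;
        }
        double sumSquares = 0.0;
        for (long c : classCounts) {
            double p = (double) c / total;
            sumSquares += p * p;
        }
        return 1.0 - sumSquares;
    }

    /** Fraction of predictions that differ from the true labels. */
    static double errorRate(double[] predictions, double[] labels) {
        if (predictions.length != labels.length || predictions.length == 0) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
        int wrong = 0;
        for (int i = 0; i < predictions.length; i++) {
            if (predictions[i] != labels[i]) {
                wrong++;
            }
        }
        return (double) wrong / predictions.length;
    }

    public static void main(String[] args) {
        // A node holding two classes in equal proportion is maximally impure for two classes
        System.out.println("gini = " + gini(new long[]{5L, 5L}));
        // One wrong prediction out of four
        System.out.println("error = " + errorRate(new double[]{0, 1, 2, 3}, new double[]{0, 1, 1, 3}));
    }
}
```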
Results
Predictions
http://www.commitstrip.com/en/2014/04/08/the-demo-effect-dear-old-murphy/?setLocale=1
Accelerometer Android app
REST API collecting data coming from a phone application
An example: https://github.com/MiraLak/accelerometer-rest-to-cassandra
Predictions!
// load the model saved before
DecisionTreeModel model = DecisionTreeModel.load(sc.sc(), "actitracker");
// connection between Spark and Cassandra using the spark-cassandra-connector
CassandraJavaRDD<CassandraRow> cassandraRowsRDD =
    javaFunctions(sc).cassandraTable("accelerations", "acceleration");
// retrieve data from Cassandra and create a CassandraRDD
JavaRDD<CassandraRow> data = cassandraRowsRDD.select("timestamp", "acc_x", "acc_y", "acc_z")
    .where("user_id=?", "TEST_USER")
    .withDescOrder()
    .limit(250);
Vector feature = computeFeature(sc);
double prediction = model.predict(feature);
How can I use my
computations?
possible applications:
• adapt the music to your speed
• detect lack of activity
• smarter pacemakers
• smarter oxygen therapy
Conclusion
Some references
• http://cassandra.apache.org/
• http://planetcassandra.org/getting-started-with-time-series-data-modeling/
• https://github.com/datastax/spark-cassandra-connector
• https://github.com/MiraLak/AccelerometerAndroidApp
• https://github.com/MiraLak/accelerometer-rest-to-cassandra
• https://spark.apache.org/docs/1.3.0/
• https://github.com/nivdul/actitracker-cassandra-spark
• http://www.duchess-france.org/analyze-accelerometer-data-with-apache-spark-and-mllib/
• http://www.cis.fordham.edu/wisdm/index.php
Thank you!

