From Big Data to
Smart data
Jie (Jack) Yang | April 2016
—What is Big Data?
—Challenge of Big Data processing
—Smart Learning framework
—Applications
—Conclusions
Outline
—No single standard definition
—5-V information assets that require innovative
techniques, algorithms, and analytics that enable
decision making, and process automation
Big Data definition
1 – Scale (Volume)
12+ TBs of tweet data every day
25+ TBs of log data every day
? TBs of data every day
2+ billion people on the Web by end 2011
30 billion RFID tags today (1.3B in 2005)
4.6 billion camera phones worldwide
100s of millions of GPS-enabled devices sold annually
76 million smart meters in 2009… 200M by 2014
The ability to manage, analyse, summarise, visualise,
and discover knowledge from the collected data in a
timely and scalable manner
2 – Speed (Velocity)
Social media and networks (millions of active users)
Mobile devices (tracking objects all the time)
Infrastructure sensors and/or instruments (measuring all kinds of data)
Various formats, types and structures:
— Text
— Numerical
— Multi-dim arrays
— Images, audio, video, sequences
— Time series
— Graph (network)
— Streaming data
— etc
3 – Complexity (Variety)
4 – Uncertainty (Veracity)
5 – Benefit (Value)
Value ($, time, performance)
Beer & Diaper (Woolworths in Illawarra)
“A number of convenience store clerks noticed that men
often bought beer at the same time they bought diapers.
The store mined its receipts and proved the clerks'
observations correct. So, the store began stocking
diapers next to the beer coolers, and sales skyrocketed”
A simple example
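The receipt-mining step in the anecdote can be sketched with the usual association-rule measures (support, confidence, lift). This is only an illustration: the receipts below are made up, `pair_stats` is a hypothetical helper, and the slide does not say which measures the store actually used.

```python
from collections import Counter
from itertools import combinations


def pair_stats(baskets):
    """Return {(a, b): (support, confidence of a -> b, lift)} for item pairs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # de-duplicate, fix pair order
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    stats = {}
    for (a, b), count in pair_counts.items():
        support = count / n                   # share of receipts with both items
        confidence = count / item_counts[a]   # P(b | a)
        lift = confidence / (item_counts[b] / n)  # > 1 means bought together more than chance
        stats[(a, b)] = (support, confidence, lift)
    return stats


# Made-up receipts, for illustration only.
receipts = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["milk", "bread"],
    ["beer", "chips"],
    ["diapers", "milk"],
]
stats = pair_stats(receipts)
support, confidence, lift = stats[("beer", "diapers")]
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 for a pair is the kind of signal that would justify shelving the two products together.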
Hardware
—Choose machines
—System failure
Challenge of Big Data processing
16 Cores, 32G RAM, $AUD6000+
Software
—Different data sources
—Really slow
—Memory issue (out of memory, for 52 million
records)
Challenge of Big Data processing
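One common way around an out-of-memory failure of this kind is to process records in bounded chunks instead of loading them all at once. The sketch below assumes a CSV source and a simple count per key; the column names and `count_by_key` are hypothetical, and the slides do not state how the memory issue was actually resolved.

```python
import csv
import io
from collections import Counter
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any iterable."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def count_by_key(file_obj, key, chunk_size=2):
    """Aggregate counts per key while holding only one chunk in memory."""
    reader = csv.DictReader(file_obj)
    totals = Counter()
    for chunk in iter_chunks(reader, chunk_size):
        totals.update(row[key] for row in chunk)
    return totals


# Tiny in-memory stand-in for a large activity log (hypothetical columns).
sample = io.StringIO(
    "user,app\nu1,maths\nu2,paint\nu3,maths\nu4,maths\nu5,paint\n"
)
totals = count_by_key(sample, "app")
print(totals.most_common())
```

With a real file, `chunk_size` would be set to whatever comfortably fits in RAM; only the running totals grow with the data.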
Smart Learning Framework
Data harvesting
data partners
Data mining
Data storage
Data streaming
Data visualisation
Hardware
—Money wise
—Tolerance to hardware failure
Smart Learning Framework
16 Cores, 32G RAM, $AUD6000+
4 Cores, 8G RAM, $AUD 600+
Main features
— Collection across different platforms and formats
• APIs
• Web crawling
— 1 master and 6 workers
• distributing–working–waiting–reactivating process
— Data volume (per day)
• 20K+ user activity records
• 25K+ records from social platforms
• 200K+ tweets around AU and EU
Data harvesting
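The "1 master and 6 workers" cycle (distributing, working, waiting, reactivating) can be sketched with a shared task queue. This is a single-machine approximation using threads; the slides describe separate machines, and `fetch`, `harvest` and the source names are hypothetical placeholders rather than the framework's real API.

```python
import queue
import threading

NUM_WORKERS = 6  # worker count taken from the slide


def fetch(source):
    """Hypothetical stand-in for an API call or a web crawl."""
    return f"records from {source}"


def worker(tasks, results):
    while True:
        source = tasks.get()        # waiting: blocks until the master hands out work
        if source is None:          # sentinel: shut this worker down
            tasks.task_done()
            break
        results.put((source, fetch(source)))  # working
        tasks.task_done()


def harvest(rounds):
    """Master: distribute each round of sources, wait, then reactivate workers."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(NUM_WORKERS)
    ]
    for t in threads:
        t.start()
    for sources in rounds:
        for source in sources:      # distributing
            tasks.put(source)
        tasks.join()                # idle workers block on get() until the next round
    for _ in threads:
        tasks.put(None)
    for t in threads:
        t.join()
    collected = {}
    while not results.empty():
        source, data = results.get()
        collected[source] = data
    return collected


collected = harvest([["twitter_api", "forum_crawl"], ["news_crawl"]])
print(sorted(collected))
```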
Main features
— save data into different formats
• Pure TXT / CSV
• (NO)SQL
— Query across all
— Fast response
Data storage
SELECT * FROM
(SELECT * FROM /web/logs/CSV) t0
JOIN
( SELECT country, count(*)
FROM mysql.web.users
GROUP BY country) t1
JOIN
(SELECT timestamp
FROM s3.root.clicks.json
WHERE user_id = 'jdoe') t2
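The idea of one query spanning several storage formats can be tried on a small scale by loading CSV and JSON data into an in-memory SQLite database and joining them. SQLite is only a stand-in here (the slide does not name its query engine), and the table names, columns, sample rows and the `user_id` join key are all assumptions made for the example.

```python
import csv
import io
import json
import sqlite3

# Made-up sample data in two different formats.
csv_logs = "user_id,page\njdoe,/home\nasmith,/cart\njdoe,/cart\n"
json_clicks = json.dumps([
    {"user_id": "jdoe", "timestamp": "2016-04-01T10:00:00"},
    {"user_id": "asmith", "timestamp": "2016-04-01T10:05:00"},
])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (user_id TEXT, page TEXT)")
conn.execute("CREATE TABLE clicks (user_id TEXT, timestamp TEXT)")

# Load each source through its own parser into a common relational form.
conn.executemany(
    "INSERT INTO logs VALUES (:user_id, :page)",
    list(csv.DictReader(io.StringIO(csv_logs))),
)
conn.executemany(
    "INSERT INTO clicks VALUES (:user_id, :timestamp)",
    json.loads(json_clicks),
)

# One query across both sources, joined on the assumed user_id key.
rows = conn.execute(
    """
    SELECT l.page, c.timestamp
    FROM logs AS l
    JOIN clicks AS c ON l.user_id = c.user_id
    WHERE l.user_id = ?
    ORDER BY l.page
    """,
    ("jdoe",),
).fetchall()
print(rows)
```

Unlike the query on the slide, this sketch copies the data into one store first; an engine that reads the files in place avoids that extra load step.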
Main features
— Preprocessing (filtering, cleansing, feature
extraction)
— Event simulation
— Saving to DBs
— Running ML jobs on the fly
• Receiver throughput = 3kb /sec
• Consumer throughput = 2kb /sec
• Consumer latency = 0.23 sec
Data streaming
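The preprocessing stages listed above (filtering, cleansing, feature extraction) chain naturally as generators, so each record flows through without the whole stream being buffered. The record layout and the two features below are assumptions made for illustration, not the framework's actual schema.

```python
def cleanse(records):
    """Drop empty messages and collapse repeated whitespace."""
    for record in records:
        text = record.get("text", "").strip()
        if text:                                   # filtering
            yield {**record, "text": " ".join(text.split())}  # cleansing


def extract_features(records):
    """Turn each cleaned record into a small feature dictionary."""
    for record in records:
        words = record["text"].split()
        yield {
            "id": record["id"],
            "n_words": len(words),
            "n_hashtags": sum(w.startswith("#") for w in words),
        }


# Made-up incoming messages; in practice this would be a live receiver.
stream = [
    {"id": 1, "text": "  Big   data #smart "},
    {"id": 2, "text": "   "},
    {"id": 3, "text": "#a #b c"},
]
features = list(extract_features(cleanse(stream)))
print(features)
```

Each stage pulls one record at a time from the previous one, so a consumer that saves to a database or feeds a model can be attached to the end of the same chain.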
Main features (35 online training jobs per day)
— Supervised (with a human assisting in classification) /
unsupervised machine learning techniques, to assist with
classification, clustering and prediction;
— Geospatial analysis: K-pop cluster in geographical regions;
— Network analysis to understand social connections between
consumers and producers;
— Other analysis including:
• More sophisticated number crunching of comments, such as
time series analysis to examine trends;
• Natural language processing techniques to assist with
sentiment analysis.
Data mining
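As an example of the unsupervised clustering mentioned above, here is a minimal k-means sketch on 2-D points. The slides do not say which clustering algorithm is used for the geospatial analysis, so k-means is an assumption; the points are made up, and plain Euclidean distance is used, which is only reasonable for projected or small-area coordinates.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)  # assign to nearest centroid
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((
                    sum(x for x, _ in cluster) / len(cluster),
                    sum(y for _, y in cluster) / len(cluster),
                ))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        centroids = new_centroids
    return centroids, clusters


# Made-up coordinates forming two obvious groups.
points = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0),
          (10.0, 10.0), (10.0, 12.0), (12.0, 10.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```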
Student behaviour analysis (OLPC, until Feb 2016):
— 153+ schools
— 20K+ active laptops
— 4.2M+ activity records
Application 1
[Charts: Most popular Apps (per school); App usage (per school)]
Car parking
Application 2
Car parking
— Every 2 minutes
— 604800 records (May to Oct 2015)
— Temporal and spatial features
Application 2
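Temporal features for readings taken every 2 minutes can be derived directly from each timestamp. The specific features below (hour, minute of day, day of week, weekend flag) are assumptions chosen for illustration; the slides do not list the features actually used.

```python
from datetime import datetime, timedelta


def temporal_features(ts):
    """Derive simple time-based features from one reading's timestamp."""
    return {
        "hour": ts.hour,
        "minute_of_day": ts.hour * 60 + ts.minute,
        "day_of_week": ts.weekday(),      # Monday = 0 ... Sunday = 6
        "is_weekend": ts.weekday() >= 5,
    }


# Three consecutive readings at the 2-minute interval from the slide;
# the start time itself is made up.
start = datetime(2015, 5, 1, 8, 0)
readings = [start + timedelta(minutes=2 * i) for i in range(3)]
features = [temporal_features(ts) for ts in readings]
print(features[1])
```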
Application 2
Average classification accuracy (%) as a function of the size of the selected samples.
Average computational time (seconds)
Social media analysis
— 70K+ films
— 228K+ users (2M + friendships)
— 1M+ reviews
— 13 features
Application 3
— User profile vs film preference
— User profile vs topics
Application 3
— Network analysis
— Opinion leadership
Application 3
4K nodes + 7K edges 76 nodes + 253 edges
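A simple first proxy for opinion leadership in a friendship network is degree centrality: the share of other users a given user is directly connected to. The slides do not state which centrality measure was used, so this choice is an assumption, and the names and edges below are made up.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return {node: degree / (n - 1)} for an undirected edge list."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(nb) / (n - 1) for node, nb in neighbours.items()}


# Made-up friendship edges.
friendships = [("ann", "bob"), ("ann", "cat"), ("ann", "dan"), ("bob", "cat")]
centrality = degree_centrality(friendships)
leader = max(centrality, key=centrality.get)
print(leader, centrality[leader])
```

On a network of several thousand nodes the same counting still runs quickly; richer measures such as betweenness or PageRank would need a graph library.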
Jie Yang, Jun Ma, A structure optimization algorithm of neural networks for large-scale data sets, Fuzz-IEEE, 2014
Jie Yang, Jun Ma, A Sparsity-Based Training Algorithm for Least Squares SVM, IEEE SSCI, 2014
Jie Yang, Jun Ma, A big-data processing framework for uncertainties in Transportation data, Fuzz-IEEE, 2015
Jie Yang, Jun Ma, Sarah K. Howard, A Structure Optimization Algorithm of Neural Networks for Pattern Learning from Educational Data, Springer Studies in Computational Intelligence ANN Modelling, 2015
Jie Yang, Jun Ma, A hybrid gene expression programming algorithm based on orthogonal design, International Journal of Computational Intelligence Systems, 2015
Jie Yang, Brian Yecies, Mining Chinese Social Media UGC A SmartLearning Framework For Analyzing Douban Movie Reviews, Journal of Big Data, 2016
Jie Yang, Jun Ma, A structure optimization framework for feed-forward neural networks using sparse representation, Knowledge-Based Systems, 2016
Jie Yang, Jun Ma, Sarah K. Howard, Exploring Technology Integration in Education using Fuzzy Representation and Feature Selection, Fuzz-IEEE, 2016
Brian Yecies, Jie Yang, Matthew Berryman, Kai Soh, Marketing Bait: Using SMART Data to Identify E-guanxi Among China’s ‘Internet Aborigines’, Film Marketing in a Global Era, 2015
Brian Yecies, Jie Yang, Matthew Berryman, Aegyung Shim, Kai Soh, Korean Female Writer-Directors and SMART Analysis of Douban Commentary Among China’s Digital Natives, Women Screenwriters: An International Guide, 2015
Brian Yecies, Jie Yang, Matthew Berryman, Aegyung Shim, Kai Soh, Korean Female Writer-Directors and SMART Analysis of Douban Commentary Among China’s Digital Natives, Participations: International Journal of Audience Research, 2016
Sarah K. Howard, Jun Ma, Jie Yang, Kate Thompson, The use of data mining to explore factors of technology integration in learning and teaching, EARLI, 2015
Sarah K. Howard, Ellie Rennie, Jun Ma, Jie Yang, Big Data, Big Theory: Moving Beyond New Empiricism to Generate Powerful Explanations, The New Data “Revolution” in Sociology, 2016
Jun Ma, Jie Yang, Rohan W. Denagamage, Murad Safadi, A Conceptual Model for Clustering Local Government Areas using Complex Fuzzy Sets, Fuzz-IEEE, 2016
Publications
— OLPC (ARC-Linkage)
— NSW-DER
— CAAR
— China-South Korean Foundation
— Healthcare (Pubmed, Seer)
— Tourism business project (UTS)
— MTR
Projects and grants
— Big Data processing:
• Data collection; streaming data; data storage; and Machine
learning
• Open source libraries
— Other domains:
• Public transportation
• Business Intelligence
• Health care
Conclusions
Thank you

SMART Seminar Series: "From Big Data to Smart data"
