Discovering the Hidden Treasure in Your Data Using Graphs
ANA PAULA APPEL
About Me
• 1997-2000: UFSCar (Computer Science)
• 2000-2003: ICMC USP (Master, DB)
• 2003-2005: Temporary Professor
• 2005-2010: PhD, ICMC USP / CMU (US)
• 2008: CMU internship
• 2010-2011: UFES Professor / Postdoc at UFSCar
• 2012-now: IBM Research Lab
• IBM Master Inventor (2016)
• Member of Academy of Technology
[Figure labels: Machine Learning, CS Theory, Data Mining, Database Systems, Graph Mining]
Ana Paula Appel
Ph.D. in Computer Science, ICMC USP / Carnegie Mellon
apappel@br.ibm.com
@paulinhaappel
Data/Graph Mining
Data Science
Science WITH Data or Science OF Data
• Science OF Data is an academic subject that studies data in all its manifestations, together with methods and algorithms to manipulate, analyze, visualize and enrich data.
• Science WITH Data occurs in other academic subjects, where analytics becomes a major way to build models, design artefacts, and generally increase our understanding of the subject in a data-driven way.
Wisdom through Data
Big Data, Analytics & Data Science
• Extracting insights from massive amounts of data to help make better decisions and predictions
Traditional Data Mining
• Data mining has a rich history and methods for analyzing ...
  • ... tabular data
  • ... textual data
  • ... time series & streams
  • ... market baskets
• What about relations and dependencies?
[Figure label: Bag of features]
• What if we want to understand the relationships between the instances in our data?
Networks
• We should turn to networks!
How it all started
• Leonhard Euler, 1736
• Seven Bridges of Königsberg:
  • “Is it possible to cross all 7 bridges without crossing the same bridge twice?”
  • No: such a walk exists only if at most two nodes have odd degree, and in Königsberg all four land masses have odd degree
• The birth of graph theory
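The degree argument above can be checked in a few lines. This is a minimal sketch, assuming the bridges are modelled as a multigraph with four land masses; the labels A to D are illustrative, and the parity test is only valid for a connected graph.

```python
from collections import Counter

# One (u, v) pair per bridge. A is the island, B and C the two banks,
# D the eastern land mass (labels are illustrative, not from the slides).
bridges = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]

def odd_degree_nodes(edges):
    """Return the nodes touched by an odd number of edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(n for n, d in degree.items() if d % 2 == 1)

def has_euler_walk(edges):
    """For a connected multigraph: a walk using every edge exactly once
    exists only when zero or two nodes have odd degree."""
    return len(odd_degree_nodes(edges)) in (0, 2)

odd = odd_degree_nodes(bridges)
print(odd, has_euler_walk(bridges))
```

All four nodes come out with odd degree, so no such walk exists.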
Milgram’s experiment
• Instructions:
  • Starting in Nebraska, given a target individual (a stockbroker in Boston), pass the message to a person you correspond with who is “closest” to the target.
• Outcome:
  • 64 of 296 chains reached the target
  • About 20% of initiated chains reached the target
  • Average chain length = 6.5
• “Six degrees of separation”
Milgram’s experiment repeated
Complex Networks
Graph for Fraud
Ian Molloy, Suresh Chari, Ulrich Finkler, Mark Wiggerman, Coen Jonker, Ted Habeck, Youngja Park, Frank Jordens, Ron van Schaik: Graph Analytics for Real-Time Scoring of Cross-Channel Transactional Fraud. Financial Cryptography 2016: 22-40
Malware Detection
Polonium: Tera-Scale Graph Mining for Malware Detection. Duen Horng (Polo) Chau, Carey Nachenberg, Jeffrey Wilhelm, Adam Wright, Christos Faloutsos. The 2nd Workshop on Large-scale Data Mining: Theory and Applications (LDMTA 2010). July 25, 2010. Washington, DC.
Identifying Successful Investors in the Startup Ecosystem
Identifying Successful Investors in the Startup Ecosystem. Srishti Gupta, Robert Pienta, Acar Tamersoy, Duen Horng Chau, Rahul C. Basole. International Conference on World Wide Web (WWW) 2015, May 18-22, 2015. Florence, Italy
Opinion Spam
Complex Networks books
Tools
• Database engines
• Visualization
• Graph analytics
Networks
Behind many systems there is an intricate wiring diagram, a network, that defines the interactions between the components.
We will never understand these systems unless we understand the networks behind them!
Components of a network
Networks or Graphs?
• Network often refers to real systems
  • Web, social network, metabolic network
  • Language: network, node, link
• Graph is the mathematical representation of a network
  • Web graph, social graph (a Facebook term)
  • Language: graph, vertex, edge
Network Elements: Edges
• Directed (also called arcs, links)
  • A -> B
  • A likes B, A gave a gift to B, A is B’s child
• Undirected
  • A <-> B or A - B
  • A and B like each other
  • A and B are siblings
  • A and B are co-authors
Computing Metrics
• Degree & degree distribution
• Connected components
Nodes
• Node network properties
  • From immediate connections
    • In-degree: how many directed edges (arcs) are incident on a node
    • Out-degree: how many directed edges (arcs) originate at a node
    • Degree (in or out): number of edges incident on a node
  • From the entire graph
    • Centrality (betweenness, closeness)
Node degree from Matrix Values
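As a small illustration of reading degrees off a matrix, the sketch below assumes the convention that entry A[i][j] = 1 means an edge from node i to node j (some texts use the transpose). The matrix itself is made up for the example.

```python
# Adjacency matrix of a small directed graph (hypothetical data).
# Assumed convention: A[i][j] == 1 means there is an edge i -> j.
A = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]
n = len(A)

# Out-degree of node i is the sum of row i.
out_degree = [sum(A[i]) for i in range(n)]
# In-degree of node j is the sum of column j.
in_degree = [sum(A[i][j] for i in range(n)) for j in range(n)]

print(out_degree, in_degree)
```

Both lists sum to the number of edges, which is a quick sanity check on any adjacency matrix.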
Is Everything Connected?
Connected Components
• Strongly connected components:
  • Each node within the component can be reached from every other node in the component by following directed links
  • Example: {B, C, D, E}, {A}, {G, H}, {F}
• Weakly connected components:
  • Every node can be reached from every other node by following links in either direction
  • Example: {A, B, C, D, E}, {G, H, F}
• In undirected networks one talks simply about “connected components”
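Weakly connected components can be found with a breadth-first search that ignores edge direction. The sketch below uses a hypothetical edge list chosen only so that the result matches the grouping shown on the slide; the slide's actual edges are not recoverable from the text.

```python
from collections import defaultdict, deque

# Hypothetical directed edges, picked to reproduce the slide's grouping.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "B"),
         ("G", "H"), ("H", "G"), ("F", "G")]

def weakly_connected_components(edges):
    """Group nodes that reach each other when edge direction is ignored."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    seen, components = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        components.append(sorted(component))
    return components

components = weakly_connected_components(edges)
largest = max(components, key=len)
print(components, largest)
```

The same routine gives plain connected components for an undirected graph, and the largest list is the candidate giant component.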
Giant Component
• If the largest component encompasses a significant fraction of the graph, it is called the giant component
Why Networks?
• Universal language for describing complex data
  • Networks from science, nature, and technology are more similar than one would expect
• Shared vocabulary between fields
  • Computer science, social science, physics, economics, statistics, biology
• Data availability (computational challenges)
  • Web/mobile, bio, health, and medical
• Impact!
  • Social networking, social media, drug design
Networks: Size Matters
• Network data: orders of magnitude
  • 436-node network of email exchange at a corporate research lab [Adamic-Adar, SocNets ’03]
  • 43,553-node network of email exchange at a university [Kossinets-Watts, Science ’06]
  • 4.4-million-node network of declared friendships on a blogging community [Liben-Nowell et al., PNAS ’05]
  • 240-million-node network of communication on Microsoft Messenger [Leskovec-Horvitz, WWW ’08]
  • 800-million-node Facebook network [Backstrom et al. ’11]
Networks Really Matter
• If you want to understand the spread of diseases, you need to figure out who will be in contact with whom
• If you want to understand the structure of the Web, you have to analyze the “links”
• If you want to understand the dissemination of news or the evolution of science, you have to follow the flow
Reasoning about Networks
• What do we hope to achieve from studying networks?
  • Patterns and statistical properties of network data
  • Design principles and models
  • Understand why networks are organized the way they are
  • Predict behavior of networked systems
Reasoning about Networks
• How do we reason about networks?
  • Empirical: study network data to find organizational principles
    • How do we measure and quantify networks?
  • Mathematical models: graph theory and statistical models
    • Models allow us to understand behaviors and distinguish surprising from expected phenomena
  • Algorithms for analyzing graphs
    • Hard computational challenges
Networks: Structure & Process
• What do we study in networks?
  • Structure and evolution:
    • What is the structure of a network?
    • Why and how did it come to have such structure?
  • Processes and dynamics:
    • Networks provide the “skeleton” for the spreading of information, behavior, diseases
    • How do information and diseases spread?
Networks: Online
• Communication networks:
  • Intrusion detection, fraud
  • Churn prediction
• Social networks:
  • Link prediction, friend recommendation
  • Social circle detection, community detection
  • Social recommendations
  • Identifying influential nodes, information virality
• Information networks:
  • Navigational aids
How it all fits together
Power Law
• A distribution follows a power law if:
  $p(x) = a\,x^{-\gamma}$
• p(x) is the probability that x occurs, where a is a proportionality constant and γ is the exponent of the law
A. Clauset, C.R. Shalizi, and M.E.J. Newman, "Power-law distributions in empirical data", SIAM Review 51(4), 661-703 (2009).
Heavy tails: right skew
• Right skew
  • Normal distribution (not heavy tailed)
    • e.g. heights of human males: centered around 180cm (5’11’’)
  • Zipf’s or power-law distribution (heavy tailed)
    • e.g. city population sizes: NYC 8 million, but many, many small towns
Normal distribution (human heights)
Heavy tails: max to min ratio
• High ratio of max to min
  • Human heights
    • Tallest man: 272cm (8’11”), shortest man: (1’10”), ratio: 4.8 (from the Guinness Book of World Records)
  • City sizes
    • NYC: pop. 8 million, Duffield, Virginia: pop. 52, ratio: 150,000
Degree distribution
• Many real networks have hubs: highly connected nodes
• A power law is easily distinguished from an exponential by plotting the distribution on log-lin and log-log axes
• A power law is a straight line in a log-log plot
[Figure: Degree Distribution (same data in different plots). Panels: lin-lin, log-lin, log-log. Axes: k and log k (horizontal), p_k and log p_k (vertical). Curve label: Power Law]
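Why a power law is a line on log-log axes: taking logarithms of p(k) = a k^(-γ) gives log p = log a - γ log k, so the slope is -γ. The sketch below checks this on exact, noise-free points; the constants are illustrative. For real, noisy data a least-squares fit on log-log axes is known to be unreliable, and the Clauset, Shalizi and Newman paper cited earlier discusses better estimators.

```python
import math

def loglog_slope(xs, ps):
    """Least-squares slope of log(p) against log(x)."""
    lx = [math.log(x) for x in xs]
    lp = [math.log(p) for p in ps]
    mx, mp = sum(lx) / len(lx), sum(lp) / len(lp)
    num = sum((a - mx) * (b - mp) for a, b in zip(lx, lp))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

a, gamma = 2.0, 2.5                      # illustrative constants
ks = [1, 2, 5, 10, 50, 100, 1000]
power_law = [a * k ** -gamma for k in ks]

slope = loglog_slope(ks, power_law)      # close to -gamma
print(slope)
```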
Erdős-Rényi vs. Scale Free
• Poisson network (Erdős-Rényi random graph): the degree distribution is a Poisson
• Scale-free network: the degree distribution is a power law
• A function is scale free if: f(ax) = c f(x)
Exponential vs. Power-Law
Power laws are seemingly everywhere
note: these are cumulative distributions, more about this in a bit…
Transitivity, triadic closure, clustering
• Transitivity:
  • If A is connected to B and B is connected to C, what is the probability that A is connected to C?
  • My friends’ friends are likely to be my friends
Clustering
• Global clustering coefficient:
  $C = \dfrac{3 \times \text{number of triangles in the graph}}{\text{number of connected triples of vertices}}$
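A minimal sketch of the ratio above on a toy undirected graph (a triangle with one pendant node; the data is hypothetical). Each connected triple is counted once by its centre node, and each triangle closes three such triples.

```python
from itertools import combinations

# Undirected toy graph as an adjacency dict (hypothetical data).
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

def global_clustering(graph):
    """3 * triangles / connected triples."""
    triples = 0
    closed = 0   # triples whose two end nodes are also linked
    for centre, nbrs in graph.items():
        for u, v in combinations(sorted(nbrs), 2):
            triples += 1
            if v in graph[u]:
                closed += 1
    # Every triangle is seen from each of its three corners,
    # so `closed` already equals 3 * (number of triangles).
    return closed / triples if triples else 0.0

c_global = global_clustering(graph)
print(c_global)
```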
Local clustering coefficient (Watts & Strogatz 1998)
• For a vertex i:
  • The fraction of pairs of neighbors of the node that are themselves connected
  • Let n_i be the number of neighbors of vertex i
  $C_i = \dfrac{\text{number of connections between the neighbors of } i}{n_i (n_i - 1) / 2}$
• Average over all n vertices:
  $C = \dfrac{1}{n} \sum_{i} C_i$
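The local coefficient and its average can be sketched as follows for an undirected graph. The toy data is hypothetical, and returning 0 for nodes with fewer than two neighbors is a convention assumed here, since the fraction is undefined in that case.

```python
from itertools import combinations

# Undirected toy graph as an adjacency dict (hypothetical data).
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

def local_clustering(graph, i):
    """Fraction of pairs of i's neighbors that are themselves connected."""
    nbrs = graph[i]
    n_i = len(nbrs)
    if n_i < 2:
        return 0.0   # assumed convention for degree 0 or 1
    links = sum(1 for u, v in combinations(sorted(nbrs), 2) if v in graph[u])
    return links / (n_i * (n_i - 1) / 2)

per_node = {i: local_clustering(graph, i) for i in graph}
average = sum(per_node.values()) / len(graph)
print(per_node, average)
```

Note that this average generally differs from the global coefficient computed from triangles and triples.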
Networks and Complex Systems
• Complex systems are around us:
  • Society is a collection of six billion individuals
  • Communication systems link electronic devices
  • Information and knowledge are organized and linked
  • Interactions between thousands of genes regulate life
  • Our thoughts are hidden in the connections between billions of neurons in our brain
What do these systems have in common?
How can we represent them?
Betweenness Centrality
• The betweenness centrality of a node v is given by the expression:
  $g(v) = \sum_{s \neq v \neq t} \dfrac{\sigma_{st}(v)}{\sigma_{st}}$
• where σ_st is the total number of shortest paths from node s to node t and σ_st(v) is the number of those paths that pass through v.
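For small graphs the sum can be evaluated directly: a shortest s-t path passes through v exactly when d(s,v) + d(v,t) = d(s,t), and then the number of such paths is σ_sv · σ_vt. The sketch below is a brute-force illustration on a hypothetical 4-cycle, not an efficient algorithm; it sums over ordered pairs, so undirected graphs count each pair twice.

```python
from collections import deque

# Undirected square a - b - c - d - a (hypothetical data).
graph = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}

def bfs_counts(graph, source):
    """Distances from source and number of shortest paths to each node."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w], sigma[w] = dist[u] + 1, 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma

def betweenness(graph, v):
    """Sum of sigma_st(v) / sigma_st over ordered pairs with s != v != t."""
    info = {s: bfs_counts(graph, s) for s in graph}
    dist_v, sigma_v = info[v]
    total = 0.0
    for s in graph:
        dist_s, sigma_s = info[s]
        for t in graph:
            if s == t or v in (s, t):
                continue
            if t not in dist_s or v not in dist_s or t not in dist_v:
                continue   # unreachable, contributes nothing
            if dist_s[v] + dist_v[t] == dist_s[t]:
                total += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return total

b_score = betweenness(graph, "b")
print(b_score)
```

Here b lies on one of the two shortest paths between a and c, giving 1/2 in each direction.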
Different notions of centrality
• In each of the following networks, X has higher centrality than Y according to a particular measure
Real-world examples
Eigenvector Centrality in directed networks
• PageRank (centrality) brings order to the Web:
  • It's not just the pages that point to you, but how many pages point to those pages, etc.
  • It is more difficult to artificially inflate centrality with a recursive definition
PageRank: The “Flow” Model
• A “vote” from an important page is worth more:
  • Each link’s vote is proportional to the importance of its source page
  • If page i with importance r_i has d_i out-links, each link gets r_i / d_i votes
  • Page j’s own importance r_j is the sum of the votes on its in-links: $r_j = \sum_{i \to j} r_i / d_i$
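The flow model can be iterated directly: start from a uniform vector and repeatedly pass each page's importance along its out-links. This is a minimal sketch on a hypothetical three-page web, assuming every page has at least one out-link; it omits the damping (teleportation) term that full PageRank uses to handle dead ends and traps.

```python
# Out-links of a three-page web (hypothetical example).
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

def flow_pagerank(links, iterations=100):
    """Iterate r_j = sum over in-links i -> j of r_i / d_i."""
    rank = {page: 1.0 / len(links) for page in links}
    for _ in range(iterations):
        nxt = {page: 0.0 for page in links}
        for i, outs in links.items():
            for j in outs:
                nxt[j] += rank[i] / len(outs)   # each out-link carries r_i / d_i
        rank = nxt
    return rank

rank = flow_pagerank(links)
print(rank)
```

For this example the iteration settles near r_y = r_a = 0.4 and r_m = 0.2, which satisfies the flow equations.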
Computing PageRank
Final PageRank Equation
PageRank & Eigenvector
Example
PageRank Example
Community Structure
• What is the cluster structure of complex networks?
• How does this structure scale from small to large networks?
• How should we think about clusters in large networks?
Community Structure
• Many networks exhibit community structure
  • Groups of nodes that have more edges among themselves than with other groups of nodes
• People naturally split into groups based on interests, age, occupation, …
• How to find communities:
  • Spectral clustering
  • Connection-based hierarchical clustering
  • Modularity
  • Partitioning
[Figure: friendship network of children in a school]
Community Detection
• Networks of tightly connected groups
• Network communities:
  • Sets of nodes with many connections inside and few to the outside (the rest of the network)
• Also called groups, communities, modules, clusters
Community Detection
• Intra-cluster density
• Inter-cluster density
$\delta_{int}(S) > \delta_{ext}(S)$
Community Detection
• How can we automatically find these groups of highly connected nodes?
• Ideally, the automatically detected clusters should correspond to the real groups.
Girvan-Newman
• Divisive hierarchical cluster detection based on the notion of edge betweenness:
  • The number of shortest paths that pass through each edge
  • Remove edges in decreasing order of betweenness
Girvan, M. & Newman, M. E. J. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA, 2002, 99
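One divisive step can be sketched as: compute the betweenness of every edge, remove the highest-scoring edge, and repeat until the graph splits. The code below is a brute-force illustration for small undirected graphs with comparable node labels, not the authors' implementation; the example graph (two triangles joined by a bridge) and all function names are hypothetical.

```python
from collections import deque

def bfs_counts(graph, source):
    """Distances from source and number of shortest paths to each node."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w], sigma[w] = dist[u] + 1, 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma

def components(graph):
    """Connected components of an undirected graph, as sorted lists."""
    seen, result = set(), []
    for start in sorted(graph):
        if start not in seen:
            dist, _ = bfs_counts(graph, start)
            seen.update(dist)
            result.append(sorted(dist))
    return result

def edge_betweenness(graph):
    """For each edge, the summed fraction of shortest s-t paths using it."""
    info = {s: bfs_counts(graph, s) for s in graph}
    scores = {}
    for u in graph:
        for v in graph[u]:
            if not u < v:                      # visit each edge once
                continue
            total = 0.0
            for s, (dist_s, sigma_s) in info.items():
                for t in dist_s:
                    if t == s:
                        continue
                    for x, y in ((u, v), (v, u)):   # path crosses as x -> y
                        dist_y, sigma_y = info[y]
                        if (x in dist_s and t in dist_y
                                and dist_s[x] + 1 + dist_y[t] == dist_s[t]):
                            total += sigma_s[x] * sigma_y[t] / sigma_s[t]
            scores[(u, v)] = total
    return scores

def girvan_newman_split(graph):
    """Remove highest-betweenness edges until the component count grows."""
    graph = {n: set(nbrs) for n, nbrs in graph.items()}   # work on a copy
    start = len(components(graph))
    while len(components(graph)) == start:
        scores = edge_betweenness(graph)       # recomputed after each removal
        if not scores:
            break
        u, v = max(scores, key=scores.get)
        graph[u].discard(v)
        graph[v].discard(u)
    return components(graph)

# Two triangles joined by the bridge 3 - 4.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
communities = girvan_newman_split(graph)
print(communities)
```

The bridge carries every shortest path between the two triangles, so it is removed first and the two triangles emerge as communities.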
Communities
Example
Link Prediction
• Given a snapshot of a social network
• Question: infer which new interactions among its members are likely to occur in the near future [Liben-Nowell and Kleinberg, 2004, 2007]
Link Prediction
• Link prediction. A network is changing over time. Given a snapshot of a network at time t, predict edges added in the interval (t, t′)
• Link completion (missing link identification). Given a network, infer links that are consistent with the structure but missing (find unobserved edges)
• Link reliability. Estimate the reliability of given links in the graph.
• Predictions: link existence, link weight, link type
Link Prediction
• “People You May Know” at Facebook
• 92% of new Facebook friendships are between friends of friends
• Common friends help
Scoring Algorithm
Link prediction by proximity scoring
1. For each pair of nodes, compute a proximity (similarity) score c(v1, v2)
2. Sort all pairs by decreasing score
3. Select the top n pairs (or those above some threshold) as new links
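The three steps can be sketched with a pluggable score function. This example assumes an undirected graph stored as an adjacency dict, scores only pairs that are not yet linked, and uses the number of common neighbors as the proximity score; the names and the toy data are hypothetical.

```python
from itertools import combinations

# Undirected toy friendship graph (hypothetical data).
graph = {
    "ann": {"bob", "cid"},
    "bob": {"ann", "cid", "dan"},
    "cid": {"ann", "bob", "dan"},
    "dan": {"bob", "cid", "eve"},
    "eve": {"dan"},
}

def common_neighbors(graph, x, y):
    """Proximity score: number of neighbors shared by x and y."""
    return len(graph[x] & graph[y])

def predict_links(graph, score, top_n):
    """Score every unlinked pair, sort by decreasing score, keep the top n."""
    candidates = [(score(graph, x, y), x, y)
                  for x, y in combinations(sorted(graph), 2)
                  if y not in graph[x]]
    # Ties are broken alphabetically so the result is deterministic.
    candidates.sort(key=lambda item: (-item[0], item[1], item[2]))
    return candidates[:top_n]

top = predict_links(graph, common_neighbors, top_n=2)
print(top)
```

Swapping `common_neighbors` for another score with the same signature changes the predictor without touching the ranking step.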
Common Neighbors
• The common-neighbors predictor captures the notion that two strangers who have a common friend may be introduced by that friend.
• This introduction has the effect of “closing a triangle” in the graph and feels like a common mechanism in real life.
Jaccard’s Coefficient
• The Jaccard coefficient, a similarity metric commonly used in information retrieval, measures the probability that both x and y have a feature f, for a randomly selected feature f that either x or y has.
• If we take “features” here to be neighbors, then this measure captures the intuitively appealing notion that the proportion of the coauthors of x who have also worked with y (and vice versa) is a good measure of the similarity of x and y.
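With neighbors as features, the coefficient is the size of the intersection of the two neighbor sets divided by the size of their union. A minimal sketch on hypothetical data, assuming a score of 0 when both nodes have no neighbors:

```python
# Neighbor sets of two authors (hypothetical data).
graph = {
    "x": {"a", "b", "c"},
    "y": {"b", "c", "d"},
}

def jaccard(graph, x, y):
    """|neighbors(x) & neighbors(y)| / |neighbors(x) | neighbors(y)|."""
    union = graph[x] | graph[y]
    if not union:
        return 0.0   # assumed convention when neither node has neighbors
    return len(graph[x] & graph[y]) / len(union)

score = jaccard(graph, "x", "y")
print(score)
```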
Adamic/Adar
• This measure refines the simple counting of common features by weighting rarer features more heavily.
• For x and y to be introduced by a common friend z, person z will have to choose to introduce the pair ⟨x,y⟩ from the $\binom{|\Gamma(z)|}{2}$ pairs of his friends; thus an unpopular person (someone with not a lot of friends) may be more likely to introduce a particular pair of his friends to each other.
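In the form used by Liben-Nowell and Kleinberg, the score sums 1 / log |Γ(z)| over the common neighbors z of x and y, so a common friend with few friends counts for more. A minimal sketch on hypothetical data:

```python
import math

# x and y share two friends: a popular "hub" and an unpopular "shy"
# (hypothetical data).
graph = {
    "x": {"hub", "shy"},
    "y": {"hub", "shy"},
    "hub": {"x", "y", "p", "q", "r", "s"},
    "shy": {"x", "y"},
    "p": {"hub"}, "q": {"hub"}, "r": {"hub"}, "s": {"hub"},
}

def adamic_adar(graph, x, y):
    """Sum of 1 / log(degree of z) over common neighbors z of x and y."""
    # A common neighbor of two distinct nodes has degree >= 2, so log > 0.
    return sum(1.0 / math.log(len(graph[z])) for z in graph[x] & graph[y])

score = adamic_adar(graph, "x", "y")
print(score)
```

The unpopular friend contributes 1 / log 2 while the hub contributes only 1 / log 6.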
Preferential Attachment
• One well-known concept in social networks is that users with many friends tend to create more connections in the future: as in finance, the rich get richer. We estimate how “rich” our two vertices are by multiplying the number of friends or followers each vertex has: |Γ(x)| · |Γ(y)|.
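The score is just the product of the two degrees. A short sketch on hypothetical data:

```python
# Undirected toy graph (hypothetical data).
graph = {
    "x": {"a", "b", "c"},
    "y": {"a", "d"},
    "a": {"x", "y"}, "b": {"x"}, "c": {"x"}, "d": {"y"},
}

def preferential_attachment(graph, x, y):
    """Product of the number of neighbors of x and of y."""
    return len(graph[x]) * len(graph[y])

score = preferential_attachment(graph, "x", "y")
print(score)
```

Unlike the neighborhood-overlap scores, this one can be positive even when x and y share no neighbors.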
Katz
• This heuristic defines a measure that directly sums over the collection of paths, exponentially damped by length to count short paths more heavily.
• The Katz measure is a variant of the shortest-path measure.
• The idea behind the Katz measure is that the more paths there are between two vertices and the shorter these paths are, the stronger the connection.
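A truncated version of the sum can be sketched by counting, for each length l, the walks of length l from x to y and weighting them by β^l. The damping factor `beta` and the cutoff `max_length` are illustrative parameters assumed here; the full measure sums over all lengths and counts walks, which may revisit nodes.

```python
# Undirected path a - b - c (hypothetical data).
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}

def katz(graph, x, y, beta=0.1, max_length=5):
    """Truncated Katz score: sum of beta**l * (walks of length l from x to y)."""
    walks = {x: 1}    # walks[n] = number of walks of the current length x -> n
    score = 0.0
    for length in range(1, max_length + 1):
        nxt = {}
        for node, count in walks.items():
            for nbr in graph[node]:
                nxt[nbr] = nxt.get(nbr, 0) + count
        walks = nxt
        score += beta ** length * walks.get(y, 0)
    return score

score = katz(graph, "a", "c")
print(score)
```

Here one walk of length 2 and two walks of length 4 connect a and c, so the score is 0.1² + 2 · 0.1⁴.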
Percolation on Complex Networks
• Percolation can be extended to networks of arbitrary topology
• We say the network percolates when a giant component forms
Scale-free networks are resilient with respect to random attack
• Gnutella network
• 20% of nodes removed
[Figure: 574 nodes in giant component before, 427 nodes in giant component after]
Targeted attacks are effective against scale-free networks
• Gnutella network
• 22 most connected nodes removed (2.8% of the nodes)
[Figure: 574 nodes in giant component before, 301 nodes in giant component after]
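The contrast between removing an arbitrary node and removing a hub can be illustrated by measuring the giant component after each removal. The sketch below uses a tiny hypothetical hub-and-spoke graph, not the Gnutella data from the slides.

```python
from collections import deque

def giant_component_size(graph, removed=frozenset()):
    """Size of the largest connected component once `removed` nodes are dropped."""
    seen, best = set(removed), 0
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for nbr in graph[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        best = max(best, size)
    return best

# Two linked hubs, each with four leaves (hypothetical data).
hubs = {"h1": ["a", "b", "c", "d"], "h2": ["e", "f", "g", "i"]}
graph = {"h1": {"h2"}, "h2": {"h1"}}
for hub, leaves in hubs.items():
    for leaf in leaves:
        graph[hub].add(leaf)
        graph[leaf] = {hub}

intact = giant_component_size(graph)
after_leaf = giant_component_size(graph, {"a"})      # a typical random failure
top_hub = max(graph, key=lambda n: len(graph[n]))    # most connected node
after_hub = giant_component_size(graph, {top_hub})   # targeted attack
print(intact, after_leaf, after_hub)
```

Losing a leaf shrinks the giant component by one node, while losing a hub cuts it in half.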
Random Failures vs. Attacks
Source: Error and attack tolerance of complex networks. Réka Albert, Hawoong Jeong and Albert-László Barabási.
Assortativity
• Social networks are assortative:
  • Gregarious people associate with other gregarious people
  • Loners associate with other loners
• The Internet is disassortative
[Figure panels: Assortative (hubs connect to hubs), Random, Disassortative (hubs are in the periphery)]
Resilience: power grids and cascading failures
• The vast system of electricity generation, transmission & distribution is essentially a single network
• Power flows through all paths from source to sink (flow calculations are important for other networks, even social ones)
• All AC lines within an interconnect must be in sync
• If frequency varies too much (as a line approaches capacity), a circuit breaker takes the generator out of the system
• Larger flows are sent to neighboring parts of the grid, triggering a cascading failure
Source: .wikipedia.org/wiki/File:UnitedStatesPowerGrid.jpg
(Dis)information cascades
• Rumor spreading
• Urban legends
• Word of mouth (movies, products)
• The Web is self-correcting:
  • A satellite image hoax is first passed around, then exposed; the hoax is blogged about, then written up on urbanlegends.about.com
Source: undetermined
Thanks
NetworkX
• subgraph(G, nbunch) - induce subgraph of G on nodes in nbunch
• union(G1, G2) - graph union
• disjoint_union(G1, G2) - graph union assuming all nodes are different
• cartesian_product(G1, G2) - return Cartesian product graph
• compose(G1, G2) - combine graphs identifying nodes common to both
• complement(G) - graph complement
• create_empty_copy(G) - return an empty copy of the same graph class
• G.to_undirected() - return an undirected representation of G (replaces convert_to_undirected(G) in current NetworkX releases)
• G.to_directed() - return a directed representation of G (replaces convert_to_directed(G) in current NetworkX releases)
Famous Graphs
import networkx as nx

# small famous graphs
petersen = nx.petersen_graph()
tutte = nx.tutte_graph()
maze = nx.sedgewick_maze_graph()
tet = nx.tetrahedral_graph()

# classic graphs
K_5 = nx.complete_graph(5)
K_3_5 = nx.complete_bipartite_graph(3, 5)
barbell = nx.barbell_graph(10, 10)
lollipop = nx.lollipop_graph(10, 20)

# random graphs (arguments: size first, then model parameters)
er = nx.erdos_renyi_graph(100, 0.15)
ws = nx.watts_strogatz_graph(30, 3, 0.1)
ba = nx.barabasi_albert_graph(100, 5)
red = nx.random_lobster(100, 0.9, 0.9)

Descobrindo o tesouro escondido nos seus dados usando grafos.

  • 1. Descobrindo o tesouro escondido nos seus dados usando grafos. ANA PAULA APPEL 1
  • 2. 2
  • 3. About Me u 1997~2000 à UFSCar (Computer Science) u 2000~2003 à ICMC USP (Master – DB) u 2003~2005 à Temporary Professor u 2005~2010 à PhD ICMC USP/CMU US u 2008 à CMU Internship u 2010~2011 à UFES Prof / Pos-doc UFSCAR u 2012~Now à IBM Research Lab u IBM Master Inventor (2016) u Member of Academy of Technology 3 Machine Learning CS Theory Data Mining Database systems Graph mining Ana Paula Appel P.hD. em Ciência da Computação ICMC USP / Carnegie Mellon [email protected] @paulinhaappel Data/Graph Mining
  • 5. Science WITH Data or Science OF Data u Science OF Data is an academic subject that studies data in all its manifestations, together with methods and algorithms to manipulate, analyze, visualize and enrich data. u Science WITH Data occurs in other academic subjects, where analytics becomes a major way to build models, design artefacts, and generally increase our understanding of the subject in a data-driven way. 5
  • 7. Big Data, Analytics & Data Science u Extracting insights from massive amounts of data to help make better decisions and predictions 7
  • 8. Traditional Data Mining u Data mining has rich history and methods for analyzing ... u ... tabular data u ... textual data u ... time series & streams u ... market baskets u What about relations and dependencies? Bag of features 8
  • 9. u What if we want to understand the relationship about each instance of our data? 9
  • 11. 11
  • 12. How everything start u Leonhard Euler, 1875 u Seven Bridges of Königsberg: u “Is possible someone cross the 7 bridges without cross the same bridge twice?” u No, the graph need to be at most two nodes with degree odd; u Born of Graph Theory 12
  • 13. Milgram’s experiment u Instructions: u From Nebraska, given a target individual (stockbroker in Boston), pass the message to a person you correspond with who is “closest” to the target. u Outcome: u 64 of 296 chains reached the target u 20% of initiated chains reached target average chain length = 6.5 u “Six degrees of separation” 13
  • 22. Graph for Fraud Ian Molloy, Suresh Chari, Ulrich Finkler, Mark Wiggerman, Coen Jonker, Ted Habeck, Youngja Park, Frank Jordens, Ron van Schaik: Graph Analytics for Real-Time Scoring of Cross-Channel Transactional Fraud. Financial Cryptography 2016: 22-40 22
  • 23. Malware Detection Polonium: Tera-Scale Graph Mining for Malware Detection. Duen Horng (Polo) Chau, Carey Nachenberg, Jeffrey Wilhelm, Adam Wright, Christos Faloutsos. The 2nd Workshop on Large-scale Data Mining: Theory and Applications (LDMTA 2010). July 25, 2010. Washington, DC. 23
  • 24. Identifying Successful Investors in the Startup Ecosystem Identifying Successful Investors in the Startup Ecosystem. Srishti Gupta, Robert Pienta, Acar Tamersoy, Duen Horng Chau, Rahul C. Basole. International Conference on World Wide Web (WWW) 2015, May 18 -22, 2015. Florence, Italy 24
  • 29. Networks Behind many systems there is an intricate wiring diagram, a network, that defines the interactions between the components We will never understand these systems unless we understand the networks behind them! 29
  • 30. Components of a network 30
  • 31. Networks or Graphs? u Network often refers to real systems u Web, Social network, Metabolic network u Language: Network, node, link u Graph is mathematical representation of a network u Web graph, Social graph (a Facebook term) u Language: Graph, vertex, edge 31
  • 32. Network Elements: Edges u Directed (also called arcs, links) uA -> B uA likes B, A gave a gift to B, A is B’s child u Undirected uA <-> B or A – B uA and B like each other uA and B are siblings uA and B are co-authors 32
  • 33. Computing Metrics u Degree & Degree Distribution u Connected Components 33
  • 34. Nodes u Node network properties u from immediate connections u In-degree how many directed edges (arcs) are incident on a node u Out-degree how many directed edges (arcs) originate at a node u degree (in or out) number of edges incident on a node u from the entire graph u centrality (betweenness, closeness) 34
  • 35. Node degree from Matrix Values 35
  • 37. Connected Components u Strongly connected components u Each node within the component can be reached from every other node in the component by following directed links u B C D E u A u G H u F u Weakly connected components: u every node can be reached from every other node by following links in either direction u ABCDE u G H F u In undirected networks one talks simply about ‘connected components’ 37
  • 38. Giant Component u if the largest component encompasses a significant fraction of the graph, it is called the giant component 38
  • 39. Why Networks? u Universal language for describing complex data u Networks from science, nature, and technology are more similar than one would expect u Shared vocabulary between fields u Computer Science, Social science, Physics, Economics, Statitics, Biology u Data availability (computational challenges) u Web/mobile, bio, health, and medical u Impact! u Social networking, Social media, Drug design 39
  • 40. Networks: Size Matters u Network data: Orders of magnitude u 436-node network of email exchange at a corporate research lab [Adamic-Adar, SocNets ‘03] u 43,553-node network of email exchange at an university [Kossinets-Wacs, Science ‘06] u 4.4-million-node network of declared friendships on a blogging community [Liben-Nowell et al., PNAS ‘05] u 240-million-node network of communica)on on Microsod Messenger [Leskovec-Horvitz, WWW ’08] u 800-million-node Facebook network [Backstrom et al. ‘11] 40
  • 41. Networks Really Matter u If you want to understand the spread of diseases, you need to figure out who will be in contact with whom u If you want to understand the structure of the Web, you have to analyze the ‘links’. u If you want to understand dissemination of news or evolution of science, you have to follow the flow. 41
  • 42. Reasoning about Networks u What do we hope to achieve from studying networks? uPatterns and statistical properties of network data uDesign principles and models uUnderstand why networks are organized the way they are uPredict behavior of networked systems 42
  • 43. Reasoning about Networks u How do we reason about networks? u Empirical: Study network data to find organizational principles u How do we measure and quantify networks? u Mathematical models: Graph theory and statistical models u Models allow us to understand behaviors and distinguish surprising from expected phenomena u Algorithms for analyzing graphs u Hard computational challenges 43
  • 44. Networks: Structure & Process u What do we study in networks? u Structure and evolution: u What is the structure of a network? u Why and how did it come to have such structure? u Processes and dynamics: u Networks provide “skeleton” for spreading of information, behavior, diseases u How do information and diseases spread? 44
  • 45. Networks: Online u Communication networks: u Intrusion detection, fraud u Churn prediction u Social networks: u Link prediction, friend recommendation u Social circle detection, community detection u Social recommendations u Identifying influential nodes, Information virality u Information networks: u Navigational aids 45
  • 46. How it all fits together 46
  • 47. Power Law u A distribution is a power law if it follows: $p(x) = a\,x^{-\gamma}$ u $p(x)$ is the probability that $x$ occurs, where $a$ is a proportionality constant and $\gamma$ is the exponent of the law 47 A. Clauset, C.R. Shalizi, and M.E.J. Newman, "Power-law distributions in empirical data" SIAM Review 51(4), 661-703 (2009).
  • 48. Heavy tails: right skew u Right skew u normal distribution (not heavy tailed) u e.g. heights of human males: centered around 180cm (5’11’’) u Zipf’s or power-law distribution (heavy tailed) u e.g. city population sizes: NYC 8 million, but many, many small towns 48
  • 50. Heavy tails: max to min ratio u High ratio of max to min u human heights u tallest man: 272cm (8’11”), shortest man: (1’10”), ratio: 4.8, from the Guinness Book of World Records u city sizes u NYC: pop. 8 million, Duffield, Virginia: pop. 52, ratio: 150,000 50
  • 51. Degree distribution u Many real networks have hubs: highly connected nodes u A power law and an exponential distribution are easily told apart by plotting the same data on log-lin and log-log axes u A power law is a straight line in a log-log plot 51 [Figure: Degree Distribution (same data in different plots); panels: lin-lin, log-lin, log-log; axes labeled k, log k and log p_k; the straight line is labeled Power Law]
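The straight-line behavior can be checked numerically. The sketch below assumes a power law of the form p(k) = a * k^(-gamma) and an exponential of the form q(k) = exp(-k / k0). The constants `a`, `gamma` and `k0` are arbitrary illustrative values, and NumPy is an assumed dependency.

```python
import numpy as np

a, gamma, k0 = 2.0, 2.5, 50.0          # illustrative constants, not from the slides
k = np.arange(1, 1001, dtype=float)

p_k = a * k ** (-gamma)                # power law (left unnormalized for simplicity)
q_k = np.exp(-k / k0)                  # exponential

# Power law: log p is linear in log k, with slope -gamma.
loglog_slope, loglog_intercept = np.polyfit(np.log10(k), np.log10(p_k), 1)

# Exponential: log q is linear in k (log-lin axes), with slope -1/k0.
loglin_slope, _ = np.polyfit(k, np.log(q_k), 1)

print(round(loglog_slope, 3), round(10 ** loglog_intercept, 3), round(loglin_slope, 3))
```

On noisy empirical data a least-squares line on log-log axes is only a visual aid. The Clauset, Shalizi and Newman paper cited on slide 47 recommends maximum-likelihood fitting instead.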
  • 52. Erdos Renyi vs. Scale Free u Poisson network (Erdos-Renyi random graph): degree distribution is a Poisson u Scale free network: degree distribution is a Power Law u A function is scale free if: f(ax) = c f(x) 52
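A rough way to see the difference is to generate one graph of each kind with a similar mean degree and compare the largest degree. This is only a sketch: the sizes, probabilities and seeds are arbitrary assumptions, and the Barabasi-Albert generator is used here as a stand-in for "a scale-free network".

```python
import networkx as nx

n = 1000
er = nx.erdos_renyi_graph(n, 0.01, seed=1)    # Poisson-like degrees, mean about 10
ba = nx.barabasi_albert_graph(n, 5, seed=1)   # heavy-tailed degrees, mean about 10

def degree_summary(G):
    """Return (mean degree, maximum degree) of G."""
    degrees = [d for _, d in G.degree()]
    return sum(degrees) / len(degrees), max(degrees)

er_mean, er_max = degree_summary(er)
ba_mean, ba_max = degree_summary(ba)
print("Erdos-Renyi    ", er_mean, er_max)
print("Barabasi-Albert", ba_mean, ba_max)
```

Both graphs have roughly the same average degree, but the Barabasi-Albert graph contains hubs whose degree is far above anything seen in the Erdos-Renyi graph.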
  • 54. Power laws are seemingly everywhere note: these are cumulative distributions, more about this in a bit… 54
  • 55. Transitivity, triadic closure, clustering u Transitivity: u if A is connected to B and B is connected to C, what is the probability that A is connected to C? u my friends’ friends are likely to be my friends 55
  • 56. Clustering u Global clustering coefficient: $C = \dfrac{3 \times \text{number of triangles in the graph}}{\text{number of connected triples of vertices}}$ 56
  • 57. Local clustering coefficient (Watts&Strogatz 1998) u For a vertex $i$ u The fraction of pairs of neighbors of the node that are themselves connected u Let $n_i$ be the number of neighbors of vertex $i$: $C_i = \dfrac{\text{number of connections between the neighbors of } i}{n_i (n_i - 1)/2}$ 57
  • 58. Local clustering coefficient (Watts&Strogatz 1998) u Average over all $n$ vertices: $\bar{C} = \dfrac{1}{n} \sum_{i} C_i$ 58
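The global and local clustering measures can be compared on a small example. The sketch below assumes NetworkX, where `transitivity` computes the triangle-based global coefficient and `clustering` / `average_clustering` compute the per-vertex Watts-Strogatz coefficient and its mean. The toy graph is an illustrative assumption.

```python
import networkx as nx

# Toy graph (illustrative): triangle 0-1-2 with a pendant node 3 attached to node 2.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])

global_c = nx.transitivity(G)          # 3 * triangles / connected triples
local_c = nx.clustering(G)             # dict: vertex -> local coefficient
average_c = nx.average_clustering(G)   # mean of the local coefficients

print(global_c, local_c, average_c)
```

The graph has 1 triangle and 5 connected triples, so the global coefficient is 3/5. The local values are 1, 1, 1/3 and 0, whose mean is 7/12. The two summaries generally differ, because the average of local values weights low-degree vertices more heavily.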
  • 59. Networks and Complex Systems u Complex systems are around us: u Society is a collection of six billion individuals u Communication systems link electronic devices u Information and knowledge are organized and linked u Interactions between thousands of genes regulate life u Our thoughts are hidden in the connections between billions of neurons in our brain What do these systems have in common? How can we represent them? 59
  • 60. Betweenness Centrality u The betweenness centrality of a node v is given by the expression: $C_B(v) = \sum_{s \neq v \neq t} \dfrac{\sigma_{st}(v)}{\sigma_{st}}$ u where $\sigma_{st}$ is the total number of shortest paths from node s to node t and $\sigma_{st}(v)$ is the number of those paths that pass through v. 60
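As a small illustration, betweenness can be computed with NetworkX on a star graph, where every shortest path between two leaves passes through the center. The choice of graph is an assumption made for the example. `normalized=False` returns raw path counts, with each unordered pair counted once in an undirected graph.

```python
import networkx as nx

G = nx.star_graph(4)   # node 0 is the center, nodes 1..4 are leaves

# Raw (unnormalized) betweenness: number of shortest paths through each node.
bc = nx.betweenness_centrality(G, normalized=False)
print(bc)
```

The center lies on the single shortest path of each of the 6 leaf pairs, so its score is 6, while every leaf scores 0.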
  • 61. Different notions of centrality u In each of the following networks, X has higher centrality than Y according to a particular measure 61
  • 62. 62
  • 64. Eigenvector Centrality in directed networks u PageRank (centrality) brings order to the Web: u it's not just the pages that point to you, but how many pages point to those pages, etc. u more difficult to artificially inflate centrality with a recursive definition 64
  • 65. PageRank: The “Flow” Model u A “vote” from an important page is worth more: u Each link’s vote is proportional to the importance of its source page u If page i with importance r_i has d_i out-links, each link gets r_i / d_i votes u Page j’s own importance r_j is the sum of the votes on its in-links 65
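The flow model can be sketched as a simple iteration in which every page repeatedly splits its current importance equally among its out-links. The three-page web and the function name `flow_rank` are hypothetical. The sketch uses no damping or teleportation and assumes every page has at least one out-link.

```python
# Hypothetical three-page web: each key links to the pages in its list.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

def flow_rank(out_links, iterations=200):
    """Iterate r_j = sum over in-links i -> j of r_i / d_i, starting from uniform ranks."""
    pages = list(out_links)
    r = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_r = {p: 0.0 for p in pages}
        for i, targets in out_links.items():
            vote = r[i] / len(targets)      # each out-link carries r_i / d_i
            for j in targets:
                new_r[j] += vote            # page j sums the votes on its in-links
        r = new_r
    return r

ranks = flow_rank(out_links)
print(ranks)
```

For this small example the ranks settle at 0.4, 0.4 and 0.2 for `y`, `a` and `m`. Practical PageRank implementations such as `networkx.pagerank` add a damping factor so that the iteration also behaves well on graphs with dead ends.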
  • 71. Community Structure u What is the cluster structure of complex networks? u How does this structure scale from small to large networks? u How should we think about clusters in large networks? 71
  • 72. Community Structure u Many networks exhibit community structure u Groups of nodes that have more edges among themselves than with other groups of nodes u People naturally divide into groups based on interests, age, occupation, … u How to find communities: u Spectral clustering u Connectivity-based hierarchical clustering u Modularity u Partition 72 Friendship network of children in a school
  • 73. Community Detection u Networks of tightly connected groups u Network communities: u Sets of nodes with many connections inside and few to the outside (the rest of the network) 73 Groups, communities, modules, clusters
  • 74. Community Detection u Intra-cluster density u Inter-cluster density 74 $\delta_{int}(S) > \delta_{ext}(S)$
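The density formulas on this slide did not survive in the transcript. The sketch below assumes the commonly used definitions: intra-cluster density is the number of internal edges of S divided by the number of possible internal pairs, and inter-cluster density is the number of edges leaving S divided by the number of possible pairs between S and the rest. The helper `cluster_densities` and the toy graph are hypothetical.

```python
import networkx as nx

def cluster_densities(G, S):
    """Return (delta_int, delta_ext) for node set S, assuming 1 < |S| < |V|."""
    S = set(S)
    n, n_s = G.number_of_nodes(), len(S)
    internal = G.subgraph(S).number_of_edges()
    # An edge is external to S when exactly one endpoint lies in S.
    external = sum(1 for u, v in G.edges() if (u in S) != (v in S))
    delta_int = internal / (n_s * (n_s - 1) / 2)
    delta_ext = external / (n_s * (n - n_s))
    return delta_int, delta_ext

# Two triangles joined by the single edge (2, 3).
G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
delta_int, delta_ext = cluster_densities(G, {0, 1, 2})
print(delta_int, delta_ext)
```

For the left triangle the internal density is 1 and the external density is 1/9, so the set satisfies the community condition shown on the slide.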
  • 75. Community Detection u How can we automatically find these groups of highly connected nodes? u Ideally, the automatically detected clusters should correspond to the real groups. 75
  • 76. Girvan-Newman u Divisive, hierarchical cluster detection based on the notion of betweenness: u Number of shortest paths that pass through each edge. u Remove edges in decreasing order of betweenness 76 Girvan, M. & Newman, M. E. J. Community structure in social and biological networks Proc. Natl. Acad. Sci. USA, 2002, 99
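A minimal sketch of the procedure using NetworkX's `girvan_newman` generator, which yields successively finer partitions as high-betweenness edges are removed. The barbell graph (two 5-node cliques joined by one edge) is an illustrative assumption chosen so that the expected split is obvious.

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.barbell_graph(5, 0)   # cliques {0..4} and {5..9} joined by the edge (4, 5)

# The edge with the highest betweenness is the bridge between the cliques.
edge_bc = nx.edge_betweenness_centrality(G, normalized=False)
bridge = max(edge_bc, key=edge_bc.get)

# First level of the divisive hierarchy: a tuple of node sets.
first_split = next(girvan_newman(G))
communities = sorted(sorted(c) for c in first_split)
print(bridge, communities)
```

Removing the bridge disconnects the graph, so the first partition returned consists of the two cliques. Calling `next` again on the same generator would continue splitting.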
  • 82. Link Prediction u Given a snapshot of a social network u Question: infer which new interactions among its members are likely to occur in the near future [Liben-Nowell and Kleinberg, 2004, 2007] 82
  • 83. Link Prediction u Link prediction. A network is changing over time. Given a snapshot of a network at time t, predict edges added in the interval (t,t′) u Link completion (missing links identification). Given a network, infer links that are consistent with the structure, but missing (find unobserved edges) u Link reliability. Estimate the reliability of given links in the graph. u Predictions: link existence, link weight, link type 83
  • 84. Link Prediction u People you may know at Facebook u 92% of new friendships on Facebook are friends-of-friends u Common friends help 84
  • 85. Scoring Algorithm u Link prediction by proximity scoring 1. For each pair of nodes compute proximity (similarity) score c(v1,v2) 2. Sort all pairs by decreasing score 3. Select top n pairs (or those above some threshold) as new links 85
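The three steps can be sketched as a small generic function. The names `top_n_links` and `common_neighbors_score` are hypothetical, and the number of common neighbors is used as the proximity score only as an example. Scoring every non-adjacent pair is quadratic in the number of nodes, so this form suits small graphs only.

```python
from itertools import combinations
import networkx as nx

def top_n_links(G, score, n):
    """Score every non-adjacent pair, sort by decreasing score, keep the top n."""
    candidates = [(u, v) for u, v in combinations(G.nodes, 2) if not G.has_edge(u, v)]
    scored = [(u, v, score(G, u, v)) for u, v in candidates]   # step 1
    scored.sort(key=lambda item: item[2], reverse=True)        # step 2
    return scored[:n]                                          # step 3

def common_neighbors_score(G, u, v):
    """Example proximity score: number of neighbors shared by u and v."""
    return len(list(nx.common_neighbors(G, u, v)))

# Toy graph (illustrative): nodes 1 and 4 share the neighbors 2, 3 and 5.
G = nx.Graph([(1, 2), (1, 3), (1, 5), (4, 2), (4, 3), (4, 5)])
predicted = top_n_links(G, common_neighbors_score, n=1)
print(predicted)
```

In this toy graph the pair (1, 4) has three common neighbors, more than any other non-adjacent pair, so it is the top prediction.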
  • 86. Common Neighbors u The common-neighbors predictor captures the notion that two strangers who have a common friend may be introduced by that friend. u This introduction has the effect of “closing a triangle” in the graph and feels like a common mechanism in real life. 86
  • 87. Jaccard’s Coefficient u The Jaccard coefficient, a similarity metric that is commonly used in information retrieval, measures the probability that both x and y have a feature f, for a randomly selected feature f that either x or y has. u If we take “features” here to be neighbors, then this measure captures the intuitively appealing notion that the proportion of the coauthors of x who have also worked with y (and vice versa) is a good measure of the similarity of x and y. 87
  • 88. Adamic/Adar u This measure refines the simple counting of common features by weighting rarer features more heavily. u For x and y to be introduced by a common friend z, person z will have to choose to introduce the pair ⟨x,y⟩ from the $\binom{|\Gamma(z)|}{2}$ pairs of his friends; thus an unpopular person (someone with not a lot of friends) may be more likely to introduce a particular pair of his friends to each other. 88
  • 89. Preferential Attachment u One well-known concept in social networks is that users with many friends tend to create more connections in the future. This is because in some social networks, as in finance, the rich get richer. We estimate how “rich” our two vertices are by multiplying the numbers of friends or followers the two vertices have: |Γ(x)| · |Γ(y)|. 89
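The neighborhood-based scores can be computed side by side for one candidate pair. The formula images are missing from the transcript, so the sketch assumes the standard definitions from the Liben-Nowell and Kleinberg paper cited on slide 82: Jaccard is |Γ(x) ∩ Γ(y)| / |Γ(x) ∪ Γ(y)|, Adamic/Adar is the sum of 1 / log |Γ(z)| over common neighbors z, and preferential attachment is |Γ(x)| · |Γ(y)|. The toy graph is illustrative.

```python
import math
import networkx as nx

G = nx.Graph([(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)])
x, y = 1, 4
shared = set(G[x]) & set(G[y])   # common neighbors of x and y: {2, 3}

# Hand-computed scores following the definitions above.
common = len(shared)
jaccard = len(shared) / len(set(G[x]) | set(G[y]))
adamic_adar = sum(1 / math.log(G.degree(z)) for z in shared)
pref_attach = G.degree(x) * G.degree(y)

# Cross-check against NetworkX's predictors, which yield (u, v, score) triples.
nx_jaccard = next(nx.jaccard_coefficient(G, [(x, y)]))[2]
nx_adamic_adar = next(nx.adamic_adar_index(G, [(x, y)]))[2]
nx_pref_attach = next(nx.preferential_attachment(G, [(x, y)]))[2]

print(common, jaccard, adamic_adar, pref_attach)
```

Any of these scores can serve as the proximity function in a score-sort-select link predictor. They differ mainly in how strongly they reward or discount high-degree nodes.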
  • 90. Katz u This heuristic defines a measure that directly sums over the collection of paths, exponentially damped by length to count short paths more heavily. u The Katz measure is a variant of the shortest-path measure. u The idea behind the Katz measure is that the more paths there are between two vertices and the shorter these paths are, the stronger the connection. 90
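A sketch of the Katz score in matrix form, assuming the usual definition: the sum over path lengths l of beta^l times the number of length-l paths between x and y, counting paths that may revisit nodes (walks). For a damping factor beta below 1 / lambda_max of the adjacency matrix A, the sum equals (I - beta A)^(-1) - I. The graph, the value of `beta` and the NumPy dependency are illustrative assumptions.

```python
import numpy as np
import networkx as nx

G = nx.path_graph(4)                       # 0 - 1 - 2 - 3
nodes = list(G.nodes)
A = nx.to_numpy_array(G, nodelist=nodes)   # adjacency matrix

beta = 0.1                                 # damping factor (illustrative)
lambda_max = np.max(np.abs(np.linalg.eigvalsh(A)))
assert beta < 1 / lambda_max               # required for the series to converge

# Closed form of sum_{l >= 1} beta^l A^l.
identity = np.eye(len(nodes))
katz = np.linalg.inv(identity - beta * A) - identity

# Same quantity from the truncated series, as a sanity check.
truncated = sum(beta ** l * np.linalg.matrix_power(A, l) for l in range(1, 30))
print(np.round(katz[0], 5))
```

From node 0 the score decreases with distance: node 1 scores higher than node 2, which scores higher than node 3, reflecting the heavier weight on short paths.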
  • 91. Percolation on Complex Networks u Percolation can be extended to networks of arbitrary topology u We say the network percolates when a giant component forms 91
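One way to see a giant component form is to compare Erdos-Renyi random graphs below and above the classic threshold of mean degree 1. This is only a sketch: the graph size, the two mean degrees and the seed are arbitrary assumptions, and `giant_fraction` is a hypothetical helper.

```python
import networkx as nx

def giant_fraction(G):
    """Fraction of nodes in the largest connected component."""
    largest = max(nx.connected_components(G), key=len)
    return len(largest) / G.number_of_nodes()

n = 2000
below = nx.fast_gnp_random_graph(n, 0.5 / n, seed=7)   # mean degree about 0.5
above = nx.fast_gnp_random_graph(n, 2.0 / n, seed=7)   # mean degree about 2

frac_below = giant_fraction(below)
frac_above = giant_fraction(above)
print(frac_below, frac_above)
```

Below the threshold the largest component covers only a tiny fraction of the nodes. Above it, a single component spans most of the graph.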
  • 92. Scale-free networks are resilient with respect to random attack u gnutella network: 574 nodes in giant component u 20% of nodes removed: 427 nodes in giant component 92
  • 93. Targeted attacks are effective against scale-free networks u gnutella network: 574 nodes in giant component u 22 most connected nodes removed (2.8% of the nodes): 301 nodes in giant component 93
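The contrast between random failures and targeted attacks can be reproduced in spirit on a synthetic graph. The sketch below uses a Barabasi-Albert graph as a stand-in for the gnutella data, which is not available here. The graph size, the number of removed nodes and the seeds are arbitrary assumptions.

```python
import random
import networkx as nx

def giant_size(G):
    """Number of nodes in the largest connected component."""
    return len(max(nx.connected_components(G), key=len))

G = nx.barabasi_albert_graph(1000, 2, seed=42)
k = 50   # remove 5% of the nodes in both scenarios

# Random failure: remove k nodes chosen uniformly at random.
rng = random.Random(42)
failed = G.copy()
failed.remove_nodes_from(rng.sample(list(G.nodes), k))

# Targeted attack: remove the k highest-degree nodes (the hubs).
degree = dict(G.degree())
hubs = sorted(degree, key=degree.get, reverse=True)[:k]
attacked = G.copy()
attacked.remove_nodes_from(hubs)

print(giant_size(G), giant_size(failed), giant_size(attacked))
```

Removing the hubs shrinks the giant component more than removing the same number of random nodes, which matches the qualitative message of the two gnutella slides.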
  • 94. Random Failures vs. Attacks Source: Error and attack tolerance of complex networks. Réka Albert, Hawoong Jeong and Albert-László Barabási. 94
  • 95. Assortativity u Social networks are assortative: u the gregarious people associate with other gregarious people u the loners associate with other loners u The Internet is disassortative 95 [Figure panels: Assortative: hubs connect to hubs; Random; Disassortative: hubs are in the periphery]
  • 96. Resilience: power grids and cascading failures u Vast system of electricity generation, transmission & distribution is essentially a single network u Power flows through all paths from source to sink (flow calculations are important for other networks, even social ones) u All AC lines within an interconnect must be in sync u If frequency varies too much (as line approaches capacity), a circuit breaker takes the generator out of the system u Larger flows are sent to neighboring parts of the grid, triggering a cascading failure Source: .wikipedia.org/wiki/File:UnitedStatesPowerGrid.jpg 96
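Degree assortativity can be quantified as the correlation between the degrees at the two ends of an edge, which NetworkX exposes as `degree_assortativity_coefficient`. The two extreme toy graphs below are illustrative assumptions, not data from the slides.

```python
import networkx as nx

# Disassortative extreme: in a star, the hub connects only to degree-1 leaves.
star = nx.star_graph(10)
r_star = nx.degree_assortativity_coefficient(star)

# Assortative extreme: two separate cliques, so every edge joins equal-degree nodes.
cliques = nx.disjoint_union(nx.complete_graph(3), nx.complete_graph(5))
r_cliques = nx.degree_assortativity_coefficient(cliques)

print(r_star, r_cliques)
```

The coefficient ranges from -1 (high-degree nodes attach to low-degree nodes) to +1 (nodes attach to others of similar degree). Values near 0 indicate no degree correlation.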
  • 97. (dis) information cascades u Rumor spreading u Urban legends u Word of mouth (movies, products) u Web is self-correcting: u Satellite image hoax is first passed around, then exposed, hoax fact is blogged about, then written up on urbanlegends.about.com Source: undetermined 97
  • 98. 98
  • 100. 100
  • 101. NetworkX u subgraph(G, nbunch) - induce subgraph of G on nodes in nbunch u union(G1,G2) - graph union u disjoint_union(G1,G2) - graph union assuming all nodes are different u cartesian_product(G1,G2) - return Cartesian product graph u compose(G1,G2) - combine graphs identifying nodes common to both u complement(G) - graph complement u create_empty_copy(G) - return an empty copy of the same graph class u convert_to_undirected(G) - return an undirected representation of G u convert_to_directed(G) - return a directed representation of G 101
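A short sketch exercising several of these operators on tiny graphs. The function names on the slide appear to follow an older NetworkX release. This example assumes a recent release and, for the directed/undirected conversion, uses the graph methods `to_undirected()` and `to_directed()` rather than the `convert_to_*` names. All graphs are illustrative.

```python
import networkx as nx

P3 = nx.path_graph(3)                                   # 0 - 1 - 2

induced = nx.subgraph(P3, [0, 1])                       # keeps only the edge (0, 1)
both = nx.disjoint_union(P3, P3)                        # nodes relabeled to 0..5
merged = nx.compose(P3, nx.Graph([(2, 3)]))             # shared node 2 is identified
comp = nx.complement(P3)                                # only the missing edge (0, 2)
grid = nx.cartesian_product(nx.path_graph(2), nx.path_graph(2))   # a 4-cycle
empty = nx.create_empty_copy(P3)                        # same nodes, no edges

D = nx.DiGraph([(0, 1), (1, 0), (1, 2)])
U = D.to_undirected()                                   # edges (0, 1) and (1, 2)

print(both.number_of_nodes(), merged.number_of_edges(), list(comp.edges()))
```

`union` is omitted here because it requires the two graphs to have disjoint node sets. `disjoint_union` relabels the nodes to guarantee that.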
  • 102. Famous Graphs
    import networkx as nx

    # small famous graphs
    petersen = nx.petersen_graph()
    tutte = nx.tutte_graph()
    maze = nx.sedgewick_maze_graph()
    tet = nx.tetrahedral_graph()

    # classic graphs
    K_5 = nx.complete_graph(5)
    K_3_5 = nx.complete_bipartite_graph(3, 5)
    barbell = nx.barbell_graph(10, 10)
    lollipop = nx.lollipop_graph(10, 20)

    # random graphs
    er = nx.erdos_renyi_graph(100, 0.15)
    ws = nx.watts_strogatz_graph(30, 3, 0.1)
    ba = nx.barabasi_albert_graph(100, 5)
    red = nx.random_lobster(100, 0.9, 0.9)