Re-identification of Anonymized CDR
datasets Using Social network Data
Alket Cecaj, Marco Mamei, Nicola Bicocchi
University of Modena and Reggio Emilia
PerCom 2014
More data: big opportunities for study
Dataset join and privacy issues
• Matching different user accounts associated with the same real person.
• Privacy issues: almost any kind of information can be inferred.
• Joining different datasets is key to advanced forms of context awareness.
Related work
Anonymization...
and re-identification
• Gender, ZIP code, and full date of birth: 63% re-identification
• Movie ratings from the Netflix Prize dataset
• Medical records of Massachusetts Hospital, using a voter list
• Re-identification of anonymous volunteers in a DNA study for the Personal Genome Project
In line with our domain
• Unique in the Crowd: the privacy bounds of Human Mobility
• Markov chain models for de-anonymization of geo-located data
Dataset join and privacy issues.
• Can we use data from social networks to re-identify users in an anonymized dataset such as a CDR one?
• Probabilistic approach to evaluate the re-identification potential.
CDR and Social Data sets
CDR and Social Dataset - Distribution of events
● CDR
  ● on average 28 events/period, max = 330, min = 3
  ● 2,019,321 users for final analysis
● Social dataset
  ● on average 20 events/period, max = 424, min = 3
  ● 700 users for final analysis
Matching users among datasets
● Time and space parameters for matching: for example, a 10-minute interval between events and the cell radius as the physical distance.
● A clone of the social dataset is used to check how many matches occurred by chance, following Bonferroni's principle.
● Exclusion of CDR users who generate events at the same time but at a distance much larger than the cell radius.
Convergence to one?
Distributions and Percentages
Probabilistic modelling
Given FT_a, U is a discrete random variable with N_U values U_i, i = 1...N_U
Overall results
Conclusions
Potential and limits of re-identifying users across multiple mobility datasets.
Future research:
• the current model and overall approach need refinement
• addressing privacy concerns through mechanisms that preserve both privacy and data utility for a single aspect
• correlation among datasets is a big opportunity to enrich the information available to a pervasive application
Thank you for your attention.
Questions are welcome.

Editor's Notes

  • #2: My name is Alket Cecaj and I am a PhD student at the University of Modena and Reggio Emilia. In this work, done together with my supervisor Marco Mamei and with Nicola Bicocchi, we examine a large dataset of 335 million anonymized call records made by 3 million users over a period of 47 days in a region of northern Italy. By combining this dataset with publicly available data from social networks such as Twitter and Flickr, we present a probabilistic approach to evaluate the potential for re-identifying users in the anonymized dataset.
  • #3: As mobile devices and Internet access become widespread, a vast quantity of data is generated. In particular, mobile telecom companies can monitor a large number of terminals as they connect to the network by collecting CDRs (Call Detail Records). There is also publicly available data from social networks such as Twitter or Flickr. These services collect geo-referenced data about their users and make it available through their REST APIs. This makes it possible to infer people's presence or actions in a given context and to study human and crowd behavior on a large scale.
  • #4: Having more data, or enriching existing data with other information, enables interesting applications. For example, it would be interesting to know whether user X in the CDR dataset is actually the same person as user Y in the Twitter data, and then to join the two datasets. The matching process is straightforward: we check whether CDR user X and Twitter user Y consistently produced data at the same time and place, and once enough geo-referenced events overlap we can be reasonably sure that the two users are the same. The downside is that merging datasets can raise privacy issues, because relations between different types of data, geo-referenced data in particular, can be used to infer socio-economic status, mobility and shopping patterns, or even a user's social graph. On the other hand, combining different datasets is a key enabler for advanced context awareness.
  • #5: The related work can be divided into two complementary parts. On one hand there is data anonymization, in particular the k-anonymity technique, which makes a person indistinguishable from at least k-1 other users. On the other hand there is data re-identification: as anonymized data becomes available to researchers, a considerable body of work on re-identification has appeared. Among the early works: (1) census re-identification, where knowing gender, ZIP code, and full date of birth allows 63% re-identification; (2) re-identification of users in the Netflix Prize movie ratings dataset, which Netflix released to improve its recommendation system, where users were re-identified by relating their movie preferences or ratings to side information from IMDb; (3) re-identification of medical records of Massachusetts Hospital using a voter list; (4) re-identification of anonymous volunteers in a DNA study for the Personal Genome Project. Closer to our work is "Unique in the Crowd", which analyzes mobility traces from CDR data and whose authors state that 4 geo-referenced points are enough to identify up to 90% of the CDR users.
  • #6: Our research goal in this work was to experiment in this direction by asking the following question (bullet point 1) and then to evaluate the potential for re-identification.
  • #7: CDR data consists of records of events generated by a mobile device (such as incoming/outgoing calls, text messages, and data transmission for Internet connections), each with a timestamp and the coordinates of the cell tower handling the event. The social dataset likewise consists of records containing an identifier (name or nickname), the description of the picture or tweet, coordinates, and the event timestamp.
  • #8: Panel a), on the left, shows the distribution of events generated by 3 million CDR users, with an average of about 28 events. On the right is the distribution for Twitter/Flickr users. We initially considered a pool of 810 users, from which we selected 700. Essentially, we excluded users who had generated either too many or too few events.
  • #9: We take a combinatorial approach, trying to match (by time and space) every user from the first dataset with every user in the second dataset. For example, we have a match if the temporal distance between an event of user X from the Flickr/Twitter dataset and an event of user Y from the CDR dataset is less than 10 minutes, and their physical distance is less than the radius of the cell tower handling Y's CDR event.
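The matching rule described in this note can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `Event` record, the function names, the haversine distance, and the single `cell_radius_km` parameter (the talk uses the radius of the specific cell handling each CDR event) are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Event:
    """Hypothetical event record: who, when, and where (degrees)."""
    user: str
    time: datetime
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def events_match(social, cdr, cell_radius_km, max_gap=timedelta(minutes=10)):
    """True if the two events are closer than max_gap in time
    and closer than the cell radius in space."""
    close_in_time = abs(social.time - cdr.time) < max_gap
    close_in_space = haversine_km(social.lat, social.lon, cdr.lat, cdr.lon) < cell_radius_km
    return close_in_time and close_in_space


def count_matches(social_events, cdr_events, cell_radius_km):
    """Compare every social event with every CDR event (brute force)
    and count matching events per (social user, CDR user) pair."""
    counts = {}
    for s in social_events:
        for c in cdr_events:
            if events_match(s, c, cell_radius_km):
                key = (s.user, c.user)
                counts[key] = counts.get(key, 0) + 1
    return counts


# Toy data, invented for illustration only.
t0 = datetime(2014, 3, 24, 12, 0)
social = [Event("ft_a", t0, 44.6471, 10.9252)]
cdr = [
    Event("c1", t0 + timedelta(minutes=5), 44.6471, 10.9252),   # near in time and space
    Event("c2", t0 + timedelta(minutes=20), 44.6471, 10.9252),  # too late
    Event("c3", t0 + timedelta(minutes=5), 44.6983, 10.6312),   # too far away
]
print(count_matches(social, cdr, cell_radius_km=2.0))  # {('ft_a', 'c1'): 1}
```

The all-pairs loop mirrors the "every user with every user" description; on datasets of the size mentioned in the talk one would presumably index events by time and cell first, but that is outside what the notes describe.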
  • #10: Consider the social user FTa (in black) producing data at different moments t1, t2, t3, and t4 within a time interval (from left to right), and the CDR users C1, C2, C3, and C4. We can build the matchings shown in the figure. We can exclude C3, because this user produced data in the same time interval but at a distance d >> r, where r is the radius of the cell. Among C1, C2, and C4, the best candidate is C2, which has the largest overlap, while C1 and C4 lack some data but still cannot be excluded.
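A rough sketch of the exclusion rule from this note, in Python. Everything concrete here is an assumption made for illustration: events are simplified to `(minute, x_km, y_km)` tuples on a flat plane, and "d >> r" is approximated by a hypothetical `far_factor` multiple of the cell radius, since the notes do not say how "much larger" was defined.

```python
from math import hypot


def classify_candidate(social_events, cdr_events, radius_km,
                       max_gap_min=10.0, far_factor=5.0):
    """Return ("excluded", 0) if the CDR user has an event close in time to a
    social event but much farther away than the cell radius; otherwise return
    ("candidate", n), where n is the number of matching events.

    Events are (minute, x_km, y_km) tuples; far_factor is a hypothetical
    stand-in for the "d >> r" condition.
    """
    matches = 0
    for s_t, s_x, s_y in social_events:
        for c_t, c_x, c_y in cdr_events:
            if abs(s_t - c_t) >= max_gap_min:
                continue  # not close in time: tells us nothing
            distance = hypot(s_x - c_x, s_y - c_y)
            if distance > far_factor * radius_km:
                return ("excluded", 0)  # same time, clearly elsewhere
            if distance < radius_km:
                matches += 1
    return ("candidate", matches)


# Toy data loosely mirroring the figure: FTa posts at t1..t4.
ft_a = [(0, 0.0, 0.0), (60, 1.0, 0.0), (120, 2.0, 0.0), (180, 3.0, 0.0)]
cdr_users = {
    "C1": [(3, 0.2, 0.0), (62, 1.0, 0.3)],
    "C2": [(2, 0.1, 0.0), (61, 1.1, 0.0), (125, 2.0, 0.2), (183, 3.0, 0.1)],
    "C3": [(1, 40.0, 40.0)],
    "C4": [(178, 3.2, 0.0)],
}
for name, events in cdr_users.items():
    print(name, classify_candidate(ft_a, events, radius_km=1.0))
```

With this toy data C3 is excluded, C2 has the most matching events, and C1 and C4 remain candidates with fewer matches, which is the situation the note describes.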
  • #11: This slide presents some statistics on the number of matchings we found and their distribution. On the left is a boxplot summarizing the number of CDR users having x matching events with FT users (for readability the y axis is in logarithmic scale). On the right we plot the percentage of FT users that can be associated with x CDR users. Of course, it is not possible to be completely sure about these users, and to deal with this kind of matching we use a probabilistic approach, illustrated in the next slide.
  • #12: The probabilistic modelling tries to answer the question: given that the CDR user C2 has n events matching FTa, how likely is it that the two users are the same? In other words, how likely is it that we actually de-anonymized the CDR user C2? We chose this approach not only because we had data from only one carrier, but also because the number of possible matchings (or matching events) is very high and, in the end, not all the CDR users can be excluded with respect to the social user i. So, given the user FTa (our social user), we consider a discrete random variable U having N_U values U_i (with i going from 1 to N_U) associated with the people who could be the user FTa. A subset of the values of U will therefore be associated with our CDR users. theta_i is the probability that two users (one from each dataset) are the same person. We then assume that the probability mass function associated with U can be modelled with a Dirichlet distribution in which each alpha_i is set to 1/N_U. So if our social user matches 10 CDR users, each of them starts with the same probability, 1/10. If a CDR user falls under the exclusion condition illustrated in the previous slide, we set alpha_i = 0. We then count the number of times each CDR user produces events matching the events of the social user, denoted M, and, following Bayes' rule, update the posterior probability as the conditional probability of theta given M. In the end there will be a single most probable hypothesis, the maximum a posteriori (MAP) theta_i.
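One way to read the update described in this note is the standard Dirichlet-multinomial conjugate update, sketched below in Python. This is an interpretation under assumptions, not the authors' code: N_U is taken to be the number of candidate CDR users supplied, excluded users are forced to probability 0, and each candidate is scored by the posterior mean (alpha_i + M_i) / sum_j(alpha_j + M_j), with the top-scoring candidate reported as the most probable hypothesis. The function names are hypothetical.

```python
def posterior_probabilities(match_counts, excluded=frozenset()):
    """Score each candidate CDR user for one social user.

    match_counts: dict mapping CDR user id -> number of matching events M_i.
    excluded:     ids that met the exclusion condition (alpha_i = 0).

    Prior: alpha_i = 1 / N_U for every non-excluded candidate.
    Score: posterior mean of a Dirichlet updated with the counts M_i.
    """
    n_u = len(match_counts)
    if n_u == 0:
        raise ValueError("no candidates")
    weights = {}
    for user, m_i in match_counts.items():
        if user in excluded:
            weights[user] = 0.0              # alpha_i = 0, evidence ignored
        else:
            weights[user] = 1.0 / n_u + m_i  # alpha_i + M_i
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("all candidates were excluded")
    return {user: w / total for user, w in weights.items()}


def most_probable(probabilities):
    """Return the candidate with the highest posterior score."""
    return max(probabilities, key=probabilities.get)


# Toy counts, invented for illustration (C3 met the exclusion condition).
counts = {"C1": 2, "C2": 4, "C3": 0, "C4": 1}
probs = posterior_probabilities(counts, excluded={"C3"})
print(most_probable(probs))   # C2
print(round(probs["C2"], 3))  # 0.548
```

The posterior mean is used here instead of the Dirichlet mode because the mode is not well defined when some alpha_i + M_i are below 1, which happens for candidates with no matching events; with real data the ranking of candidates is the same either way for those with at least one match.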
  • #13: Considering only CDR users with more than one match for each FT user, we compute the probability of matching a CDR user. Figure a), on the left, illustrates the results for one CDR-FT re-identification: the CDR user "0de7f" has a high probability and a large gap from the other CDR users. Even though we have no ground-truth evidence, this large gap suggests that the social user 1278644 is the same person as the CDR user with whom it has such a high probability. Figure b) shows the overall results: for each social user we compute the probability of the top-matching CDR user, and then we count the number of CDR users that are re-identified with a given probability, in this case a probability larger than 0.1. We re-identified 260 social users, which is about one third of the social dataset we considered.
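The counting step described for figure b) can be sketched as follows. The function name and the toy probabilities are hypothetical; only the 0.1 threshold comes from the talk.

```python
def count_reidentified(posteriors_by_social_user, threshold=0.1):
    """For each social user, take the top-scoring CDR user and keep the pair
    only if its probability is larger than the threshold."""
    reidentified = {}
    for social_user, probs in posteriors_by_social_user.items():
        if not probs:
            continue  # no candidate CDR users for this social user
        best = max(probs, key=probs.get)
        if probs[best] > threshold:
            reidentified[social_user] = (best, probs[best])
    return reidentified


# Toy posteriors, invented for illustration only.
toy = {
    "ft_1": {"cdr_a": 0.62, "cdr_b": 0.05},   # clear top candidate
    "ft_2": {"cdr_c": 0.04, "cdr_d": 0.03},   # nobody above the threshold
    "ft_3": {},                               # no candidates at all
}
result = count_reidentified(toy)
print(len(result), result)  # 1 {'ft_1': ('cdr_a', 0.62)}
```

A stricter variant could also require a minimum gap between the best and second-best candidate, in the spirit of the "large gap" argument in the note, but the notes do not state such a criterion.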
  • #14: The model is based on a number of independence assumptions that can hardly be justified in the real world. In addition, the random variable used tends to have a large number of possible outcomes, so the overall probability remains low even after a large number of matching events. Privacy concerns are the main factor preventing CDR data from being used in pervasive applications, but we believe a viable approach could be a mechanism of differential anonymization that preserves privacy without destroying the utility of the dataset for the single aspect that is useful to the specific application. Correlation among datasets represents a big opportunity to enrich the information available to a pervasive application and to achieve the pervasive computing vision.