International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 04 | Apr 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 4320
Privacy preservation using Apache Spark
Sumedha Shenoy K1, Thamatam Bhavana2, S.Lokesh3
1Student, CSE/The National Institute of Engineering, Mysuru, Karnataka, India
2Student, CSE/The National Institute of Engineering, Mysuru, Karnataka, India
3Associate Professor, Dept. of Computer Science Engineering, The National Institute of Engineering, Mysuru,
Karnataka, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - At present, huge amounts of data are available, and
preserving the privacy of that data is difficult. In medical
data, the privacy of patients is of utmost importance. A patient
dataset includes sensitive properties such as name, age, disease,
etc. To prevent revealing a person's identity, big data
anonymization techniques are used. Such techniques have
previously been implemented using Apache Hadoop. In this study,
the Spark framework is chosen for its high processing speed,
which comes from in-memory computation: it caches data in memory
for further iterations, which enhances overall performance.
Faster data anonymization techniques using Spark are proposed to
address privacy problems in medical datasets.
Key Words: Anonymization, big data, Spark, k-anonymity,
l-diversity, t-closeness, privacy preservation.
1. INTRODUCTION
Privacy and confidentiality are major aspects of social life,
and personal data is always at risk of misuse. In real life a
lot of personal data is shared, and we entrust the people around
us with keeping it safe. In education, this includes data about
students and their academic records; in economics, bank details,
salary information, and share and stock holdings; in medicine,
patients' personal data such as address and cell-phone number.
Such data should be withheld from the public domain. Otherwise
there can be severe consequences in the form of privacy breaches
and data abuse.
Data that relates to a person but does not directly identify
them is called a quasi-identifier (Q.I). Quasi-identifiers, when
analyzed together, can point to the person. For example,
consider a person of age 30 suffering from cancer and living in
a city (say A). Several people may match this description, but
the person's identity can be found if we put together a few more
quasi-identifiers and zero in on a single match. The Q.I values
therefore also play a role in protecting or disclosing a
person's privacy.
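The narrowing effect described above can be illustrated with a small sketch. The records, field names, and the matches helper are hypothetical and chosen only for illustration; they are not taken from the dataset used in this work.

```python
# Hypothetical records: no names, only quasi-identifiers plus a sensitive value.
records = [
    {"age": 30, "city": "A", "gender": "F", "disease": "cancer"},
    {"age": 30, "city": "A", "gender": "M", "disease": "cancer"},
    {"age": 30, "city": "B", "gender": "F", "disease": "flu"},
    {"age": 45, "city": "A", "gender": "F", "disease": "cancer"},
]

def matches(rows, **known):
    """Return the rows that agree with everything the observer already knows."""
    return [r for r in rows if all(r[key] == value for key, value in known.items())]

# Each additional quasi-identifier shrinks the set of candidate rows.
by_age = matches(records, age=30)
by_age_city = matches(records, age=30, city="A")
by_all = matches(records, age=30, city="A", gender="F")

print(len(by_age), len(by_age_city), len(by_all))
```

Once the candidate set contains a single row, the sensitive value in that row is effectively disclosed, even though no direct identifier was ever published.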
Anonymization handles these sensitive attributes by limiting
the data that is made available, so that privacy is preserved.
The approach is to make records difficult to distinguish from
one another, so that picking out one individual's data is not
possible. Big data is a collection of growing datasets that
inevitably includes many sensitive attributes. Anonymization
algorithms can be applied while processing these large datasets,
thus preserving privacy adequately.
2. EXISTING APPROACH
Data anonymization on medical data using Hadoop was proposed in
[1]. Health-care data contains many tuples with sensitive
attributes, enough to compromise privacy. Using Hadoop for
computation, anonymization algorithms such as k-anonymity and
l-diversity were implemented to obtain partitioned datasets. A
scalable two-phase top-down specialization approach using
MapReduce was considered in [1]. The first phase partitions the
dataset into smaller subsets to obtain intermediate anonymized
results, and the second phase merges the subsets for further
anonymization. Further demonstrations of anonymization
algorithms are given in [2].
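The partition-then-merge pattern can be sketched in plain Python as follows. This is only a simplified illustration of the two-phase idea, not the top-down specialization algorithm of [1]; the rows, the partition, local_counts and merge helpers, and the value of k are all assumptions made for the example.

```python
from collections import Counter
from functools import reduce

# Hypothetical generalized records: (age_range, city) quasi-identifier pairs.
rows = [
    ("30-39", "A"), ("30-39", "A"), ("40-49", "A"),
    ("30-39", "A"), ("40-49", "A"), ("40-49", "B"),
]

def partition(data, n_parts):
    """Phase 1a: split the data into n_parts roughly equal subsets."""
    return [data[i::n_parts] for i in range(n_parts)]

def local_counts(part):
    """Phase 1b (map): count quasi-identifier combinations inside one subset."""
    return Counter(part)

def merge(left, right):
    """Phase 2 (reduce): combine the intermediate counts of two subsets."""
    return left + right

parts = partition(rows, 3)
global_counts = reduce(merge, map(local_counts, parts), Counter())

# Combinations seen fewer than k times would still need further generalization.
k = 2
needs_more_work = sorted(qi for qi, n in global_counts.items() if n < k)
print(dict(global_counts))
print(needs_more_work)
```

In a MapReduce setting each subset would be handled by a separate mapper and the merge by a reducer; here both run in one process to keep the sketch self-contained.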
2.1 Drawbacks in the existing system-
1) Hadoop is not well suited to small data. HDFS has a
high-capacity design, which makes random reads of small volumes
of data inefficient [3].
2) MapReduce works in two processing phases, Map and Reduce.
Running these phases takes considerable time, which increases
latency and reduces processing speed [3].
3) Hadoop supports only batch processing; it is not suitable for
streaming data, and it does not provide real-time processing
[3].
3. PROPOSED APPROACH
The Apache Spark system is proposed to curb some of the
drawbacks of the Hadoop implementations. Similar algorithms are
run on a Spark cluster, specifically with PySpark, to achieve
similar results with faster processing. PySpark is the Python
API for Apache Spark; Spark also provides APIs for other
languages such as Scala and Java [4]. Python is used in this
system because it is easy to implement in. The ARX
anonymization tool is used for analysis.
3.1 Advantages of the proposed system-
1) Apache Spark processes data in memory, which avoids moving
the data to and from disk between steps. Spark is therefore
reported to be up to 100 times faster than MapReduce for some
in-memory workloads.
2) Spark is suitable for stream processing, in which input and
output data flow continuously, and it processes data in less
time.
3) In Spark, data is cached in memory for further iterations,
which increases performance.
4. IMPLEMENTATION
4.1 K-Anonymity
k-Anonymity is a property of a data set, used to describe the
data set's level of anonymity. A dataset is k-anonymous if every
combination of identity-revealing characteristics occurs in at
least k different rows of the data set. It is achieved by
increasing the similarity between rows so that each row has at
least k matches in the dataset. The probability that a given row
belongs to a particular individual is then at most 1/k. Given a
dataset and a parameter k, the generalized form of the table
should keep this probability <= 1/k while minimizing information
loss. The information loss depends on how many tuples are
generalized to the same value of an attribute. k-Anonymity is
thus an optimization problem: maximize the utility of the data
while minimizing information loss. It is an NP-hard problem and
becomes polynomially solvable if the number of quasi-identifiers
is 1.
There are two approaches to generalizing the dataset. The first
is homogeneous generalization, in which clusters are created and
all tuples in a cluster are given the same values: a generalized
value is assigned to each tuple to show that they belong to the
same group. The original values and the anonymized values can be
represented as a bipartite graph, and the order of the tuples is
changed so that a tuple cannot be recognized by its position.
Each edge in the graph denotes a possible identity.
The other approach is heterogeneous generalization. In this
approach not all values in a column are modified to satisfy
anonymity, and the dataset is anonymized with a lower value of
k. This method results in less inaccuracy and hence less
information loss. In the bipartite graph, the in-degree and
out-degree of each node should both be at least k and equal to
each other.
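The definition and the homogeneous generalization step can be sketched as follows. This is a minimal illustration under stated assumptions, not the PySpark implementation used in this work: the patient rows, the fixed 10-year age buckets, and the helper names generalize, smallest_group and is_k_anonymous are hypothetical.

```python
from collections import Counter

# Hypothetical patient rows: (age, city, disease); age and city are the quasi-identifiers.
patients = [
    (31, "A", "cancer"), (34, "A", "flu"), (38, "A", "asthma"),
    (42, "A", "flu"), (45, "A", "cancer"), (47, "A", "diabetes"),
]

def generalize(row, width=10):
    """Replace the exact age with a range of the given width; keep the other fields."""
    age, city, disease = row
    low = (age // width) * width
    return (f"{low}-{low + width - 1}", city, disease)

def smallest_group(rows):
    """Size of the smallest equivalence class over the quasi-identifiers (age, city)."""
    groups = Counter((age, city) for age, city, _ in rows)
    return min(groups.values())

def is_k_anonymous(rows, k):
    """True when every quasi-identifier combination occurs in at least k rows."""
    return smallest_group(rows) >= k

generalized = [generalize(r) for r in patients]

print(is_k_anonymous(patients, 3))      # raw ages are all distinct
print(is_k_anonymous(generalized, 3))   # two age ranges with three rows each
print(1 / smallest_group(generalized))  # upper bound on linking a row to one person
```

On a Spark cluster the same check would typically be expressed as a group-by over the quasi-identifier columns followed by a count; the pure-Python version is used here so that the sketch runs without a cluster.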
Fig -1: Bipartite view for k=3
The generalization graph must contain k disjoint assignments,
and every edge of the bipartite graph should belong to exactly
one of those assignments, so that the probability is 1/k. For k
disjoint assignments to exist, the in-degree should equal the
out-degree for each node in the graph; that is, the bipartite
graph should be k-regular.
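The relation between k disjoint assignments and k-regularity can be checked on a small constructed graph. The cyclic-shift construction below is only an assumption made for illustration (it requires k <= n) and is not the generalization procedure of this work; the helper names are hypothetical.

```python
def shifted_assignments(n, k):
    """Build k assignments; assignment j links original row i to published row (i + j) % n."""
    return [[(i, (i + j) % n) for i in range(n)] for j in range(k)]

def is_perfect_matching(assignment, n):
    """Every original row and every published row appears exactly once."""
    left = sorted(i for i, _ in assignment)
    right = sorted(j for _, j in assignment)
    return left == list(range(n)) and right == list(range(n))

def degrees(edges, n):
    """Degree of every original row (outgoing) and every published row (incoming)."""
    out_deg, in_deg = [0] * n, [0] * n
    for i, j in edges:
        out_deg[i] += 1
        in_deg[j] += 1
    return out_deg, in_deg

n, k = 6, 3
assignments = shifted_assignments(n, k)
edges = [edge for assignment in assignments for edge in assignment]

out_deg, in_deg = degrees(edges, n)
all_matchings = all(is_perfect_matching(a, n) for a in assignments)
edge_disjoint = len(edges) == len(set(edges))
k_regular = all(d == k for d in out_deg + in_deg)
print(all_matchings, edge_disjoint, k_regular)
```

Because each edge lies in exactly one of the k assignments, an observer who sees a published row has k equally plausible original rows for it, which is where the 1/k probability comes from.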
The idea of k-anonymity is thus not only to prevent certainty
about the data but to create ambiguity in the actual data,
making it harder to find the matches for a person's data. The
algorithm should be secure enough that an adversary who knows
the algorithm still cannot reverse-engineer the anonymized data.
4.2 L-Diversity
The l-diversity technique can be applied after k-anonymity has
been applied to the dataset. It is an extension of k-anonymity
that reduces the granularity of the data representation.
Sensitive attributes are made diverse within each equivalence
class (the group of k matching rows). This ensures that each
equivalence class has at least l distinct values for a sensitive
attribute [5].
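A check for this condition can be sketched as follows. It covers only the simplest variant, distinct l-diversity, and the table, column layout, and helper names are hypothetical rather than taken from the implementation in this work.

```python
from collections import defaultdict

# Hypothetical k-anonymous rows: (age_range, city, disease).
table = [
    ("30-39", "A", "cancer"), ("30-39", "A", "flu"), ("30-39", "A", "asthma"),
    ("40-49", "A", "flu"), ("40-49", "A", "flu"), ("40-49", "A", "flu"),
]

def distinct_sensitive_values(rows):
    """Map each equivalence class (age_range, city) to its set of sensitive values."""
    classes = defaultdict(set)
    for age_range, city, disease in rows:
        classes[(age_range, city)].add(disease)
    return classes

def is_l_diverse(rows, l):
    """Distinct l-diversity: every class holds at least l different sensitive values."""
    return all(len(values) >= l for values in distinct_sensitive_values(rows).values())

print(is_l_diverse(table, 1))
print(is_l_diverse(table, 2))
```

The table is 3-anonymous, yet the second class contains only one disease, so anyone known to be in that class has their diagnosis revealed; this is the gap that l-diversity is meant to close.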
4.3 T-Closeness
t-Closeness is a model that extends l-diversity; it treats the
values of a sensitive attribute distinctly by taking into
account the distribution of data values for that attribute. A
threshold t is fixed, and for every equivalence class (a group
of k matches) the distance between the distribution of the
sensitive attribute in that class and the distribution of the
attribute in the whole table should be at most t [6]. For
numerical values of the tuples, the t-closeness anonymization
algorithm is more effective than many other privacy-preserving
data mining methods.
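The threshold condition can be sketched as follows. As a simplifying assumption, the sketch measures the deviation with the variational distance (half the sum of absolute differences between the two distributions) on a categorical attribute, rather than the Earth Mover's Distance used in [6]; the table and helper names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical generalized rows: (age_range, disease).
table = [
    ("30-39", "cancer"), ("30-39", "flu"), ("30-39", "flu"),
    ("40-49", "flu"), ("40-49", "flu"), ("40-49", "flu"),
]

def distribution(values):
    """Relative frequency of each sensitive value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def variational_distance(p, q):
    """Half the L1 distance between two discrete distributions (assumed metric)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in keys)

def class_distances(rows):
    """Distance of each equivalence class from the whole-table distribution."""
    overall = distribution(disease for _, disease in rows)
    classes = defaultdict(list)
    for age_range, disease in rows:
        classes[age_range].append(disease)
    return {qi: variational_distance(distribution(vals), overall)
            for qi, vals in classes.items()}

def is_t_close(rows, t):
    """True when no equivalence class deviates from the table by more than t."""
    return all(d <= t for d in class_distances(rows).values())

print(class_distances(table))
print(is_t_close(table, 0.2), is_t_close(table, 0.1))
```

A smaller t forces every class to resemble the whole table more closely, which strengthens privacy but usually requires more generalization and therefore more information loss.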
Fig -2: Flow chart of implementation
5. CONCLUSIONS
The system provides a faster anonymization approach and avoids
some major disadvantages of the Hadoop implementation. Spark
also provides ease of use and access.
The anonymization can be further improved toward an optimal
condition that reduces information loss and improves efficiency.
Since anonymization is not just the removal of Q.I but also the
preservation of utility, it is a major factor in many big-data
issues such as scalability and dimensionality. In the future
this system can be integrated with other systems to make the
best use of privacy preservation. Analysis of the output can
also be improved.
REFERENCES
[1] Balaji K Bodkhe and Dr. Sanjay P Sood. "Privacy
preservation for medical dataset using Hadoop"
[2] Karim Abouelmehdi, Abderrahim Beni-Hessane and Hayat
Khaloufi. "Big healthcare data: preserving security and privacy"
[3] Blog reference:
https://data-flair.training/blogs/hadoop-tutorial/
[4] Apache Spark Documentation:
https://spark.apache.org/docs/latest/
[5] Machanavajjhala, Ashwin; Kifer, Daniel; Gehrke,
Johannes; Venkitasubramaniam, Muthuramakrishnan
(March 2007). "L-diversity: Privacy Beyond K-anonymity".
ACM Trans. Knowl. Discov. Data
[6] Ninghui Li, Tiancheng Li, and Suresh
Venkatasubramanian (2007). "t-Closeness: Privacy
beyond k-anonymity and l-diversity"
