
Performance Analysis of Cloud Databases Handling Social Networking Data

2013 IEEE International Conference on Cloud Computing in Emerging Markets (CCEM)

https://doi.org/10.1109/CCEM.2013.6684437

Abstract

With the growing popularity of social networking, the need for storage and analysis of the data generated by such applications has increased by leaps and bounds. With the advent of cloud computing tools that handle large amounts of data with ease, there is an increased usage of such tools to manage social networking data. However, the usage of tools that can be employed for accessing and analyzing social networking data needs to be optimized. The performance of such tools largely depends on the nature of the database used in the back-end. This research illustrates that the choice of database decides, to a large extent, the performance of the tool and the related application. It also brings clarity on using social networking tools for specific purposes and on the effect of MapReduce on various storage structures.

Performance Analysis of Cloud Databases Handling Social Networking Data

Niladri Sekhar Dey
Department of Information Technology
Padmasri Dr. B. V. Raju Institute of Technology
Hyderabad, India
[email protected]

Dr. Sujoy Bhattacharya
Department of Information Technology
Padmasri Dr. B. V. Raju Institute of Technology
Hyderabad, India
[email protected]

Keywords—SNA (Social Network Analysis), HBASE, SimpleDB, MapReduce, AWS, Crawler, Reader, FB API, Twitter API

I. INTRODUCTION

The analytical study of social networking data is very effective for achieving significant milestones in research on social structure, social behavior, trends and characteristics of the framework [1][2][3]. The analysis can lead to momentous decisions related to marketing, social studies and technology research. Collecting social network data and then analyzing it leads to major complications, as this data is mostly oversized and unstructured. Moreover, the choice of appropriate applications is very important while analyzing the data. This problem is largely addressed using cloud computing applications, which allow elastic infrastructure and unstructured storage of data [4]. As the data is the major source of analytical research, multiple migrations are connected to each analytical application, and a few social networking frameworks have their own preference of database type. Cloud service providers offer many different databases, each with its own performance issues, benefits and policies. This research enables the understanding of a performance comparison between SimpleDB (Amazon Web Services, or AWS) [5] and HBASE (Apache).

The rest of the paper is organized as follows. In Part II, we elaborate some effective results on social networking data analysis and their performance issues. In Part III, we describe the application architecture used in this research. In Part IV, we compare the results obtained from experiments on SimpleDB and HBASE. In Part V, the effect of MapReduce on the databases used is illustrated. In Part VI, we summarize and conclude our work with references to the earlier sections.

II. RELATED WORK

A. AWS Based Social Networking Applications

The integration of social networking applications with cloud computing is not a rare example. Predominantly, such applications use the cloud to host social sites or to create scalable applications for social networking sites. The social networking site Facebook uses Amazon Web Services to build scalable applications hosted over AWS with Elastic Compute Cloud (EC2) and SimpleDB. The performance of Facebook on some rich content types is discussed in Section IV.

B. Apache Based Social Networking Applications

Social networking applications also integrate with open source products such as those of the Apache Foundation. Another social networking site, Twitter, uses Apache Foundation products to host and store social networking data. Twitter uses Apache HBASE across Apache Hadoop clusters, where HBASE allows the social networking engineers from Twitter to run their MapReduce programs on HBASE [6][7]. HBASE also allows Twitter to periodically update rows of data with the help of the Hadoop Distributed File System, or HDFS [8]. The performance of Twitter on some rich content types is discussed in Section IV.

III. PERFORMANCE EVALUATION APPLICATION

Social networking data analysis is developed to understand the relation between multiple participants in the social network, called actors. The actors generate data during the multiple operations they perform; this dataset is called the social network data. To test the performance of HBASE and SimpleDB, a simple application was built during this research work, which allows multiple operations on social networking data. The application maintains a mirrored copy of the data in both HBASE and SimpleDB [9] and performs all the operations alternately on both. The complete application (Fig. 1) has many parts, which are explained below [10][11]. A simple Java framework is used, especially for the Facebook and Twitter APIs. Note that a developer account is needed to access the Facebook and Twitter APIs.

Fig. 1: Application Components for Performance Evaluation

A. Facebook Interface

The Facebook interface component is an open source package for Facebook and Java integration called RestFB, which is a simple and flexible Facebook Graph API and Old REST API client. The package contains a few useful classes such as com.restfb.batch, com.restfb.exception and com.restfb.util. The connection to Facebook through Java code using the REST API has a few steps:

• Initialization – This step ensures the connectivity between the Java client and Facebook using the personal access token. A personal access token can be obtained by joining the Facebook developer group with a valid Facebook account.
• Fetching Single Object – This step defines the conversion of Facebook objects into Java objects and the further mapping of those objects to User and Page types respectively.

B. Crawler

The Crawler component is not a URL crawler; rather, it is a Facebook post crawler. The Batch API in the com.restfb package is used; the Batch API is useful when multiple posts have to be retrieved at a time. This component has to perform the following steps:

• BatchRequestBuilder – In this step the batch request is built with specific command operations or queries.
• BatchResponseBuilder – In this step the response containers are initialized, so that the converted Facebook objects can be stored.
• executeBatch – The final step after initialization is to execute all the queries to fetch posts from Facebook and then store them in the database.

C. Contexter

The Contexter component is basically a normalization component in this application. It performs a few specific tasks in a specific order: language detection, named entity recognition, anaphoric normalization and text segmentation. It was assumed that the common text would appear in English, and the rest of the process starts with this consideration. The first step is to compute each component block to extract the text by applying the formulas:

  P(X) = P / D(T_x)                      (1)
  D(T_x) = Σ_{t = w_1}^{w_n} D_t         (2)
  P(X)_{i,j} = T(x)_{j,i}                (3)
  P(X)_{i,j} = T(x)_{i+1,i+1}            (4)

where P(X) is the extracted text component, P is the total text block, D(T_x) is the domain of recognizable keywords, P(X)_{i,j} is the extracted text component before mapping, and T(x)_{j,i} is the extracted text component after mapping.

The named entity recognition algorithm finds multiple small normalizable components of text (Eq. 1), where the domains of the known keywords are built from each collected keyword (Eq. 2). When the final text is extracted (which can actually be many pieces of text), the mapping process starts. This mapping process eventually normalizes the unstructured text: it maps the extracted texts to mapping fields (Eq. 3). Sometimes, based on a few extracted texts, new fields also need to be created (Eq. 4).

D. Twitter Interface

This interface is built on Twitter4J, an unofficial Java library for the Twitter API. With Twitter4J, the integration of this application with the Twitter service is very effective because of the available classes.

E. Reader

This interface sets a timer, either a 90 second TCP level socket timeout or a 90 second application level timer, on the receipt of new data. If 90 seconds pass with no data received, including newlines, the system disconnects and reconnects immediately. If the data is received successfully, then the Contexter component runs again.

F. Analysis

This component interfaces between the query GUI and the database. It also takes care of the pre-defined queries, result sets, user data display and, finally, the performance monitoring, which is elaborated in Section IV.

IV. CLOUD DATABASE PERFORMANCE

The cloud databases HBASE and SimpleDB are connected to this application, and the application sends all queries to both databases. The experiment is set up on a single node. The basic initial dataset was 10,000 rows with 10 fields each, measuring about 10 KB of data per block; hence the total size of the data used for this experiment was nearly 1 GB. The application was also equipped with multi-threading of queries: we included an incremental type of multi-threading code, where initially the code batches only 5 queries per second and then, for each second, 2 new threads are generated. Hence, after running the application for 1 hour, the total number of queries is approximately 7,000 over an internet connection of 10 Mbps. With this setup, we recorded the response time for both HBASE (local) and SimpleDB (AWS). The results are listed in Table I and Table II below.

TABLE I. HBASE RESPONSE TIME (all values in ms; K denotes thousand)

Data size | 5 Queries | 7,000 Queries | 10,000 Queries | 15,000 Queries | 20,000 Queries
1 MB      | 2         | 1.4 K         | 2.0 K          | 3.0 K          | 4.0 K
5 MB      | 2         | 6.98 K        | 9.58 K         | 15.6 K         | 16.3 K
10 MB     | 2         | 12 K          | 18.4 K         | 32.4 K         | 44.5 K
15 MB     | 2         | 20.88 K       | 28.46 K        | 46.2 K         | 66.9 K

TABLE II. SIMPLEDB RESPONSE TIME (all values in ms; K denotes thousand)

Data size | 5 Queries | 7,000 Queries | 10,000 Queries | 15,000 Queries | 20,000 Queries
1 MB      | 20        | 2.8 K         | 40 K           | 160 K          | 640 K
5 MB      | 20        | 14 K          | 56 K           | 224 K          | 896 K
10 MB     | 20        | 28 K          | 112 K          | 448 K          | 1702 K
15 MB     | 20        | 42 K          | 168 K          | 672 K          | 1952 K

Fig. 2: Query vs. Data Set Size in HBASE
Fig. 3: Query vs. Data Set Size in SimpleDB

We noticed a significant difference in response time between HBASE (Fig. 2) and SimpleDB (Fig. 3), where the HBASE application is deployed locally on the Apache Tomcat server and the SimpleDB application is deployed on an Elastic Cloud server on AWS. After analysing the result sets, we found formulas for calculating the query response time for HBASE (Eq. 5) and SimpleDB (Eq. 6), in which the number of queries, the amount of data and the network speed have to be considered:

  Q_R = DB_S × Q_N × 2 + N_S,   N_S = 0     (5)
  Q_R = DB_S × Q_N × 4 + N_S,   N_S = 10    (6)

where Q_R is the query response time, DB_S is the size of the database domain to scan, Q_N is the number of queries fired per second, and N_S is the network speed, taken as 0 for local access.

V. EFFECT OF MAPREDUCE ON CLOUD DATABASES

A generic MapReduce is a software programming model for processing large data sets with distributed and parallel algorithms on a configured cluster. Any simple MapReduce program consists of a Map() procedure, which performs filtering and sorting, and a Reduce() procedure, which performs a summarization operation. We used a MapReduce library for HBASE called Amazon Hadoop MapReduce and one for SimpleDB called Amazon Elastic Cloud MapReduce. After applying the MapReduce algorithms on HBASE and SimpleDB, we noticed an astounding 15% performance improvement in both cases, irrespective of network speed, as they can be configured locally on a virtual machine or on distributed clusters. We used the following standard algorithm of MapReduce [12]:

• Prepare the Map() input – the "MapReduce system" allocates Map processors, assigns the K1 input key value each processor will work on, and provides that processor with all the input data associated with that key value.
• Run the user-provided Map() code – Map() is run exactly once for each K1 key value, generating output organized by key values K2.
• "Shuffle" the Map output to the Reduce processors – the MapReduce system designates Reduce processors, assigns the K2 key value each processor will work on, and provides that processor with all the Map-generated data associated with that key value.
• Run the user-provided Reduce() code – Reduce() is run exactly once for each K2 key value produced by the Map step.
• Produce the final output – the MapReduce system collects all the Reduce output and sorts it by K2 to produce the final outcome.

VI. CONCLUSION

The performance of HBASE and SimpleDB is analyzed on a social network dataset which is large in volume, and the effect of the number of queries on the same dataset is studied. It is shown that there is a large difference in query response time depending on the database used. We have also noticed that the query response time is highly dependent on the network speed, or network bandwidth, through which AWS is accessed. In both cases, however, we found that the effect of MapReduce is significant: there is a large effect for a local VM, clustered or shared-clustered, and for Amazon Elastic Cloud MapReduce as well. During the Crawler and Reader implementation we found that the normalization process is easier for Facebook data on SimpleDB and for Twitter data on HBASE. The conclusion is that, due to internal use of the same databases, the social data from both social networking sites are compatible with SimpleDB and HBASE.

ACKNOWLEDGMENT

The research work is being carried out in the Cloud Computing Center at Padmasri Dr. B. V. Raju Institute of Technology, Hyderabad. We would like to thank the management for allocating the necessary funds for the research.
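The three Crawler steps of Section III-B (build the batch request, initialize the response containers, execute the batch) can be sketched in plain Java. The class and method names below are illustrative stand-ins, not the actual RestFB Batch API, and the Graph API call is mocked with a placeholder string:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the batch-crawling flow; names are ours, not RestFB's.
public class BatchCrawlSketch {

    // Step 1: BatchRequestBuilder - collect the queries for one batch call.
    static List<String> buildBatchRequest(String... queries) {
        List<String> batch = new ArrayList<>();
        for (String q : queries) {
            batch.add(q);
        }
        return batch;
    }

    // Step 2: BatchResponseBuilder - one empty response container per query,
    // ready to hold the converted Facebook objects.
    static Map<String, List<String>> initResponseContainers(List<String> batch) {
        Map<String, List<String>> containers = new LinkedHashMap<>();
        for (String q : batch) {
            containers.put(q, new ArrayList<>());
        }
        return containers;
    }

    // Step 3: executeBatch - run every query and store the fetched posts.
    // The string below is a placeholder for a real Graph API response.
    static void executeBatch(List<String> batch, Map<String, List<String>> containers) {
        for (String q : batch) {
            containers.get(q).add("post fetched by: " + q);
        }
    }

    public static void main(String[] args) {
        List<String> batch = buildBatchRequest("me/feed", "me/posts");
        Map<String, List<String>> containers = initResponseContainers(batch);
        executeBatch(batch, containers);
        System.out.println(containers.keySet()); // [me/feed, me/posts]
    }
}
```

In the real application the posts stored in each container would then be written to both HBASE and SimpleDB, as described in Section III.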
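The Reader's 90-second inactivity rule (Section III-E) reduces to a single comparison once timestamps are passed in explicitly. This is a minimal sketch of the decision logic only, with names of our choosing; the paper's implementation also covers the TCP-level socket timeout:

```java
// Sketch of the Reader's inactivity rule: if no data (including newlines)
// arrives within the 90-second window, disconnect and reconnect immediately.
public class ReaderTimerSketch {
    static final long WINDOW_MS = 90_000; // 90-second application-level timer

    // Returns true when the connection should be dropped and reopened.
    static boolean shouldReconnect(long lastDataAtMs, long nowMs) {
        return nowMs - lastDataAtMs >= WINDOW_MS;
    }

    public static void main(String[] args) {
        long lastData = 0;
        System.out.println(shouldReconnect(lastData, 89_000)); // false: still inside the window
        System.out.println(shouldReconnect(lastData, 90_000)); // true: reconnect now
    }
}
```

Injecting the current time, rather than calling System.currentTimeMillis() inside the method, keeps the rule deterministic and testable.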
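The empirical response-time models of Eqs. (5) and (6) in Section IV translate directly into code. The constants 2, 4 and 10 are the fitted values reported in the paper, with sizes in MB and times in ms following Tables I and II; they should be read as a sketch of the fitted trend, not as general benchmarks:

```java
// Q_R = DB_S * Q_N * 2 + N_S  (HBASE, local, N_S = 0)      -- Eq. (5)
// Q_R = DB_S * Q_N * 4 + N_S  (SimpleDB on AWS, N_S = 10)  -- Eq. (6)
public class ResponseTimeModel {

    // Predicted query response time for local HBASE, in ms.
    static double hbaseResponse(double dbSizeMb, double queriesPerSec) {
        return dbSizeMb * queriesPerSec * 2 + 0;
    }

    // Predicted query response time for SimpleDB over the network, in ms.
    static double simpleDbResponse(double dbSizeMb, double queriesPerSec) {
        return dbSizeMb * queriesPerSec * 4 + 10;
    }

    public static void main(String[] args) {
        // For the same workload the model predicts SimpleDB takes more than
        // twice as long as local HBASE, matching the trend in Tables I and II.
        System.out.println(hbaseResponse(10, 5));    // 100.0
        System.out.println(simpleDbResponse(10, 5)); // 210.0
    }
}
```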
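The five MapReduce steps listed in Section V can be run in memory on a toy word-count job. This sketches the programming model itself, not the Amazon Hadoop or Elastic Cloud MapReduce libraries used in the experiments:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {

    // Runs the five steps of Section V on an in-memory word-count job.
    static Map<String, Integer> wordCount(List<String> input) {
        // Step 1: the Map() input is already prepared, one record per K1 key.

        // Step 2: run Map() once per K1 record, emitting (K2 = word, 1) pairs.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : input) {
            for (String word : line.split(" ")) {
                mapped.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // Step 3: "shuffle" - group all Map output by the K2 key.
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Step 4: run Reduce() exactly once per K2 key, summing the counts.
        // Step 5: the TreeMap keeps the final output sorted by K2.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffled.entrySet()) {
            int sum = 0;
            for (int v : entry.getValue()) sum += v;
            result.put(entry.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("hbase simpledb", "hbase"))); // {hbase=2, simpledb=1}
    }
}
```

In a real cluster, steps 2 and 4 run on separate Map and Reduce processors and step 3 moves data across the network; here all three collapse into local loops.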

References

  1. Daniel Nurmi, Rich Wolski, Chris Grzegorczyk, Graziano Obertelli, Sunil Soman, Lamia Youseff, and Dmitrii Zagorodnov. The Eucalyptus open-source cloud-computing system. In Proceedings of the 9th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 09), Shanghai, China, 2009.
  2. Facebook Meets the Virtualized Enterprise, Washington, DC, USA, 2008. IEEE Computer Society.
  3. OpenSocial and Gadgets Specification Group. OpenSocial specification v0.9. http://www.opensocial.org/Technical-Resources/opensocial-specv09/OpenSocial-Specification.html, April 2009.
  4. David Recordon and Drummond Reed. OpenID 2.0: a platform for user-centric identity management. In DIM '06: Proceedings of the Second ACM Workshop on Digital Identity Management, pages 11-16, New York, NY, USA, 2006. ACM.
  5. Amazon. Building Facebook applications on AWS. http://aws.amazon.com/solutions/global-solution-providers/facebook/.
  6. K. Keahey, I. Foster, T. Freeman, and X. Zhang. Virtual workspaces: Achieving quality of service and quality of life in the grid. Scientific Programming Journal: Special Issue: Dynamic Grids and Worldwide Computing, 13(4):265-276, 2005.
  7. David P. Anderson. Boinc: A system for public-resource computing and storage. In GRID '04: Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing, pages 4-10, Washington, DC, USA, 2004. IEEE Computer Society.
  8. K. Czajkowski, S. Fitzgerald, I. Foster, and C. Kesselman. Grid information services for distributed resource sharing. In the 10th IEEE Symposium on High Performance Distributed Computing (HPDC), 2001.
  9. D. Neumann, J. Ster, A. Anandasivam, and N. Borissov. SORMA - Building an Open Grid Market for Grid Resource Allocation. In Lecture Notes in Computer Science: The 4th International Workshop on Grid Economics and Business Models (GECON 2007), pages 194-200, Rennes, France, 2007.
  10. Nazareno Andrade, Francisco Brasileiro, Miranda Mowbray, and Walfredo Cirne. A reciprocation-based economy for multiple services in a computational grid. In R. Buyya and K. Bubendorfer, editors, Market Oriented Grid and Utility Computing, pages 357-370. Wiley Press, 2009.
  11. Zhenhua Guo, Raminderjeet Singh, and Marlon Pierce. Building the polargrid portal using web 2.0 and opensocial. In GCE '09: Proceedings of the 5th Grid Computing Environments Workshop, pages 1-8, New York, NY, USA, 2009. ACM.
  12. Anjomshoaa et al. Job Submission Description Language (JSDL) Specification, Version 1.0. 2005.