Big data and Hadoop

1. Big Data and Hadoop. Rahul Agarwal, irahul.com

Attributions

2. Amr Awadallah: http://www.sfbayacm.org/wp/wp-content/uploads/2010/01/amr-hadoop-acm-dm-sig-jan2010.pdf

3. Hadoop: http://hadoop.apache.org/

4. Computerworld: http://www.computerworld.com/s/article/350908/5_Indispensable_IT_Skills_of_the_Future

5. Ashish Thusoo: http://www.sfbayacm.org/wp/wp-content/uploads/2010/01/sig_2010_v21.pdf

6. Big data: http://en.wikipedia.org/wiki/Big_data

7. Chukwa: http://www.cca08.org/papers/Paper-13-Ariel-Rabkin.pdf

8. Dean, Ghemawat: http://labs.google.com/papers/mapreduce.html

Agenda

9. Big Data Problem

10. What is Hadoop

11. HDFS

12. MapReduce

13. HBase

14. PIG

15. HIVE

16. Chukwa

17. ZooKeeper

18. Q&A

19. Why?

20. Big Data: Extremely large datasets that are hard to deal with using relational databases (storage/cost, search/performance, analytics and visualization). Need for parallel processing on hundreds of machines. ETL cannot complete within a reasonable time: once a run takes more than 24 hours, it can never catch up.

21. Hadoop design principles: System shall manage and heal itself (automatically and transparently route around failure; speculatively execute redundant tasks if certain nodes are detected to be slow). Performance shall scale linearly (proportional change in capacity with resource change). Compute should move to data (lower latency, lower bandwidth). Simple core, modular and extensible.

22. What is Hadoop: A scalable, fault-tolerant grid operating system for data storage and processing. Commodity hardware. HDFS: fault-tolerant, high-bandwidth clustered storage. MapReduce: distributed data processing. Works with structured and unstructured data. Open source, Apache license. Master (NameNode) / slave architecture.

23. Hadoop Projects (stack diagram): BI Reporting; ETL Tools; Hive (SQL); Pig (Data Flow); MapReduce (Job Scheduling/Execution System), with Streaming/Pipes APIs; HBase (key-value store); HDFS (Hadoop Distributed File System); ZooKeeper (Coordination); Chukwa (Monitoring).

24. HDFS (Hadoop Distributed File System): block size = 64 MB; replication factor = 3.
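
The two settings on this slide determine how a file is laid out across the cluster. The sketch below is only back-of-the-envelope arithmetic, not part of any Hadoop API: the function name `hdfs_footprint` and the 1 GB file size are assumptions made for illustration, and the block size and replication factor are the values quoted on the slide.

```python
import math

BLOCK_SIZE_MB = 64       # block size quoted on the slide
REPLICATION_FACTOR = 3   # replication factor quoted on the slide

def hdfs_footprint(file_size_mb):
    """Return (blocks, block_replicas, raw_storage_mb) for a single file.

    Hypothetical helper for illustration only.
    """
    # A file is split into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Every block is stored REPLICATION_FACTOR times on different nodes.
    block_replicas = blocks * REPLICATION_FACTOR
    # A partial last block only occupies its actual size, so raw usage
    # is simply the file size times the replication factor.
    raw_storage_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, block_replicas, raw_storage_mb

blocks, block_replicas, raw_storage_mb = hdfs_footprint(1024)  # assumed 1 GB file
print(blocks, block_replicas, raw_storage_mb)  # 16 48 3072
```

Under these assumptions, a 1 GB file becomes 16 blocks, 48 stored block replicas, and about 3 GB of raw disk.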

25. MapReduce: Patented Google framework. Distributed processing of large datasets.
    map (in_key, in_value) -> list(out_key, intermediate_value)
    reduce (out_key, list(intermediate_value)) -> list(out_value)

26. Example: count word occurrences
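
The word-count example can be sketched in a few lines that follow the map and reduce signatures from the MapReduce slide. This is a single-process simulation, not Hadoop code: the names `map_fn`, `reduce_fn`, and `run_mapreduce` and the two sample documents are assumptions, and the grouping step stands in for the shuffle that a real cluster performs between the two phases.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    """map(in_key, in_value) -> list of (out_key, intermediate_value)."""
    # in_key is the document name (unused here); in_value is its text.
    return [(word.lower(), 1) for word in in_value.split()]

def reduce_fn(out_key, intermediate_values):
    """reduce(out_key, list(intermediate_value)) -> list(out_value)."""
    return [sum(intermediate_values)]

def run_mapreduce(documents):
    """Run map, group by key (the shuffle), then reduce, all in memory."""
    groups = defaultdict(list)
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            groups[key].append(value)
    return {key: reduce_fn(key, values)[0] for key, values in sorted(groups.items())}

# Hypothetical input documents.
docs = {"doc1": "the quick brown fox", "doc2": "the lazy dog and the fox"}
counts = run_mapreduce(docs)
print(counts["the"], counts["fox"])  # 3 2
```

Because each call to `map_fn` and `reduce_fn` depends only on its own arguments, the calls could be spread across many machines, which is the property the framework exploits.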

27. HBase: The Hadoop database, an open-source version of Google BigTable. “Project's goal is the hosting of very large tables - billions of rows X millions of columns - atop clusters of commodity hardware.” Column-oriented. Random access, real-time read/write. “Random access performance on par with open source relational databases such as MySQL.”
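
To make "column-oriented" and "random access" concrete, here is a toy in-memory table keyed by row and by column name. It is not the HBase client API: the class `SparseTable`, its methods, and the `family:qualifier` column names and row keys are all assumptions chosen for illustration. The point is that each row stores only the columns it actually has, which is what makes very wide, sparse tables practical.

```python
from collections import defaultdict

class SparseTable:
    """Toy stand-in for a sparse, column-oriented table (not the HBase API)."""

    def __init__(self):
        # row key -> {"family:qualifier": value}; absent columns take no space.
        self._rows = defaultdict(dict)

    def put(self, row, column, value):
        """Write one cell."""
        self._rows[row][column] = value

    def get(self, row, column, default=None):
        """Random-access read of one cell."""
        return self._rows.get(row, {}).get(column, default)

    def scan(self, start_row, stop_row):
        """Yield (row, columns) in sorted key order; start inclusive, stop exclusive."""
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                yield row, dict(self._rows[row])

table = SparseTable()
table.put("user#001", "info:name", "Ada")
table.put("user#001", "info:email", "ada@example.org")
table.put("user#002", "info:name", "Grace")  # no email column for this row

name = table.get("user#001", "info:name")
scanned = [row for row, _ in table.scan("user#001", "user#003")]
print(name, scanned)  # Ada ['user#001', 'user#002']
```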

28. PIG: High-level language (Pig Latin) for expressing data analysis programs. Compiled into a series of MapReduce jobs. Easier to program. Optimization opportunities.
    grunt> A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
    grunt> B = FOREACH A GENERATE name;
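
For readers new to Pig Latin, the two statements on this slide roughly correspond to "parse a tab-delimited file into typed tuples" and "project one field". The plain Python below is a rough analogue only, with no Pig or MapReduce involved: the sample rows and the helper names `load_students` and `generate_names` are assumptions. The tab delimiter matches the default used by `PigStorage()`.

```python
import csv
import io

# Hypothetical contents of the 'student' file, tab-delimited.
STUDENT_DATA = "John\t18\t4.0\nMary\t19\t3.8\nBill\t20\t3.9\n"

def load_students(text):
    """Rough analogue of: A = LOAD 'student' USING PigStorage() AS (name, age, gpa)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    # Apply the declared schema: chararray, int, float.
    return [(name, int(age), float(gpa)) for name, age, gpa in reader]

def generate_names(relation):
    """Rough analogue of: B = FOREACH A GENERATE name."""
    return [(name,) for name, _age, _gpa in relation]

A = load_students(STUDENT_DATA)
B = generate_names(A)
print(B)  # [('John',), ('Mary',), ('Bill',)]
```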

29. HIVE: Managing and querying structured data. MapReduce for execution. SQL-like syntax. Extensible with types, functions, scripts. Metadata stored in an RDBMS (MySQL). Joins, Group By, Nesting. Optimizer for the number of MapReduce jobs required.
    hive> SELECT a.foo FROM invites a WHERE a.ds='<DATE>';
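
The query on this slide is ordinary enough SQL that its meaning can be tried out locally. The sketch below runs the same SELECT against an in-memory SQLite database using Python's standard `sqlite3` module. It is not Hive: nothing is compiled to MapReduce, `ds` is a plain column here, and the table layout, sample rows, and the date standing in for `<DATE>` are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema for the 'invites' table; only foo and ds appear on the slide.
conn.execute("CREATE TABLE invites (foo INTEGER, bar TEXT, ds TEXT)")
conn.executemany(
    "INSERT INTO invites VALUES (?, ?, ?)",
    [
        (1, "a", "2008-08-15"),
        (2, "b", "2008-08-15"),
        (3, "c", "2008-08-08"),
    ],
)

# Same shape as the slide's query, with a hypothetical date for <DATE>.
rows = conn.execute(
    "SELECT a.foo FROM invites a WHERE a.ds = ?", ("2008-08-15",)
).fetchall()
foos = sorted(foo for (foo,) in rows)
print(foos)  # [1, 2]
conn.close()
```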

30. ZooKeeper: A highly available, scalable, distributed service for configuration, consensus, group membership, leader election, naming, and coordination. Cluster management. Load balancing. JMX monitoring.
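
Leader election is one of the services named on this slide. A common recipe has each participant create a sequentially numbered, session-bound node, with the lowest number acting as leader. The toy below imitates that idea in memory. It does not talk to ZooKeeper or use any ZooKeeper client library, and the class `ToyElection`, its methods, and the member names are assumptions for illustration.

```python
class ToyElection:
    """In-memory imitation of leader election with sequential nodes."""

    def __init__(self):
        self._next_seq = 0
        self._nodes = {}  # sequence number -> member name

    def join(self, member):
        """Register a member and return its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._nodes[seq] = member
        return seq

    def leave(self, seq):
        """Models a session-bound node vanishing when its owner goes away."""
        self._nodes.pop(seq, None)

    def leader(self):
        """The member holding the lowest live sequence number leads."""
        return self._nodes[min(self._nodes)] if self._nodes else None

election = ToyElection()
a = election.join("node-a")
b = election.join("node-b")
c = election.join("node-c")

first = election.leader()
election.leave(a)           # the current leader fails
second = election.leader()
print(first, second)        # node-a node-b
```

The useful property is that failover needs no vote: when the leader's node disappears, the next-lowest member takes over.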

Chukwa

31. Data collection system for monitoring distributed systems

32. Agents to collect and process logs

33. Monitoring and analysis

34. Hadoop Infrastructure Care Center

35. Data Flow at Facebook

36. Choose the right tool. Hadoop:

37. Affordable Storage/Compute

38. Structured or Unstructured

39. Resilient Auto Scalability

40. Relational Databases:

41. Interactive response times

42. ACID

43. Structured data

44. Cost/Scale prohibitive
