Big Data Analysis
in Java World
by Serhiy Masyutin
Agenda
• The Big Data Problem
• Map-Reduce
• MPP-based Analytical Database
• In-Memory Data Grid
• Real-Life Project
• Q&A
The Big Data Problem
http://www.datameer.com/images/product/big_data_hadoop/img_bigdata.png
The Big Data Problem
                                Map-Reduce                  MPP AD                        IMDG
When do I need it?              In an hour                  In a minute                   Now
What do I need to do with it?   Exploratory analytics       Structured analytics          Singular event processing (some analytics), Transactions
How will I query and search?    Unstructured                Ad hoc SQL                    Structured
How do I need to store it?      I do, but not required to   I must and I am required to   Temporarily
Where is it coming from?        File/ETL                    File/ETL                      Event/Stream/File/ETL
http://blog.pivotal.io/pivotal/products/exploring-big-data-solutions-when-to-use-hadoop-vs-in-memory-vs-mpp
The Big Data Problem
Map-Reduce MPP AD IMDG
Transactions, Customer records, Geo-spatial, Sensors, Social Media, XML/JSON, Raw Logs, Text, Image, Video
(data types ordered by the amount of pre-processing they require, from least to most)
http://blog.pivotal.io/big-data-pivotal/products/exploratory-data-science-when-to-use-an-mpp-database-sql-on-hadoop-or-map-reduce
The Big Data Problem
Data is not Information
- Clifford Stoll
Map-Reduce
http://jeremykun.files.wordpress.com/2014/10/mapreduceimage.gif?w=1800
Map-Reduce
https://anonymousbi.files.wordpress.com/2012/11/hadoopdiagram.png
Map-Reduce
http://hadoop.apache.org/docs/r1.2.1/images/hdfsarchitecture.gif
Map-Reduce
https://anonymousbi.files.wordpress.com/2012/11/hadoopdiagram.png
Map-Reduce
Volume: Medium-Large | Variety: Unstructured data | Velocity: Batch processing
MPP Analytical Database
http://www.ndm.net/datawarehouse/images/stories/greenplum/gp-dia-3-0.png
MPP Analytical Database
http://my.vertica.com/docs/7.1.x/HTML/Content/Resources/Images/K-SafetyServerDiagram.png
MPP Analytical Database
http://my.vertica.com/docs/7.1.x/HTML/Content/Resources/Images/K-SafetyServerDiagramOneNodeDown.png
MPP Analytical Database
http://my.vertica.com/docs/7.1.x/HTML/Content/Resources/Images/K-SafetyServerDiagramTwoNodesDown.png
MPP Analytical Database
http://my.vertica.com/docs/7.1.x/HTML/Content/Resources/Images/DataK-Safety-K2Nodes2And3Failed.png
MPP Analytical Database
JDBC
http://www.ndm.net/datawarehouse/images/stories/greenplum/gp-dia-3-0.png
MPP Analytical Database
Volume: Small-Medium-Large | Variety: Structured data | Velocity: Interactive
(Vendor logos: Aster Database, Matrix)
In-Memory Data Grid
https://ignite.incubator.apache.org/images/in_memory_data.png
In-Memory Data Grid
https://ignite.incubator.apache.org/images/in_memory_data.png
In-Memory Data Grid
https://ignite.incubator.apache.org/images/in_memory_compute.png
In-Memory Data Grid
http://hazelcast.com/wp-content/uploads/2013/12/IMDGEmbeddedMode_w1000px.png
In-Memory Data Grid
Volume: Small-Medium | Variety: Structured data | Velocity: (Near) Real-Time
Real-Life Project
• Sensor data
• The number of devices currently doubles every year
• Data flow: ~200 GB/month
• Target data flow: ~500 GB/month
Real-Life Project
Requirements
When do I need it? In a minute
What do I need to do with it? Structured analytics
How will I query and search? Ad hoc SQL
How do I need to store it? I must and I am required to
Where is it coming from? XML
Real-Life Project
• Time-series data
• RESTful API
• Extendable analytics
• Scalability
• Speed to Market
Real-Life Project
Deployment spans Availability Zones A, B and C (diagram).
Real-Life Project
Architecture components (diagram): Devices, Clients, 3rd Party Services, Collector, Raw message store, Processor, Post-Processor, Recent data store, Permanent data store, Analytic Executor Pool, Analytics API, Client API, UI.
Real-Life Project
• Vertica stores time-series data only
• Append-only data store
• Store organizational data separately
• Use Vertica's ExternalFilter for data load
• R analytics as UDFs on Vertica
• Scale the Vertica cluster accordingly
Real-Life Project
• Choose the right tool for the job; late changes are expensive
• You can do everything yourself. Should you?
Q&A
Editor's Notes
  • #2: Introduction Hello guys and girls, my name is Serhiy Masyutin. I have more than 14 years of professional experience in different IT branches and have learned tons of languages and tools. Every day I simply enjoy my job. And yes, I like when things are done nicely, robustly, usefully and on time. I started with desktop applications in C++ and Delphi (you know it?), then moved to C++ and Java telecommunication projects, did an eCommerce project in PHP and cross-platform mobile applications in JavaScript. Currently my project is in Java and it falls into the Big Data category. So today I'm here to give you an overview of what Big Data is and how one can approach the problems it brings.
  • #3: TBD: How each approach fits CAP-theorem in slides
  • #4: What is big data? Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include capture, transfer, storage, sharing, search, analysis, visualization, and information privacy. The term often refers simply to the use of predictive analytics or other certain advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making. And better decisions can mean greater operational efficiency, cost reductions and reduced risk. In a 2001 Doug Laney, an analyst from Gartner, defined data growth challenges and opportunities as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). Gartner, and now much of the industry, continue to use this "3Vs" model for describing big data.[17]  Additionally, a new V "Veracity" is added by some organizations to describe it. If Gartner’s definition (the 3Vs) is still widely used, the growing maturity of the concept fosters a more sound difference between big data and Business Intelligence, regarding data and their use:[20] Business Intelligence uses descriptive statistics with data with high information density to measure things, detect trends etc.; Big data uses inductive statistics and concepts from nonlinear system identification [21] to infer laws from large sets of data with low information density to reveal relationships, dependencies and perform predictions of outcomes and behaviors. A more recent, consensual definition states that "Big Data represents the Information assets characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value". Big data can be described by the following characteristics: Volume – The quantity of data that is generated is very important in this context. It is the size of the data which determines the value and potential of the data under consideration and whether it can actually be considered Big Data or not. The name ‘Big Data’ itself contains a term which is related to size and hence the characteristic. Variety - The next aspect of Big Data is its variety. This means that the category to which Big Data belongs to is also a very essential fact that needs to be known by the data analysts. This helps the people, who are closely analyzing the data and are associated with it, to effectively use the data to their advantage and thus upholding the importance of the Big Data. Velocity - The term ‘velocity’ in the context refers to the speed of generation of data or how fast the data is generated and processed to meet the demands and the challenges which lie ahead in the path of growth and development. Veracity - The quality of the data being captured can vary greatly. Accuracy of analysis depends on the veracity of the source data. Complexity - Data management can become a very complex process, especially when large volumes of data come from multiple sources. These data need to be linked, connected and correlated in order to be able to grasp the information that is supposed to be conveyed by these data. This situation, is therefore, termed as the ‘complexity’ of Big Data.
  • #5: Use Nepal earthquake as a sample for data value that decreases in time as a single event and becomes useful as historical data. IMDG could be composed to work with MR & MPP when event processing is done via IMDG and historical data is batch stored into MR or MPP AD TBD text cleanup == Approaches to analysis How do you decide what belongs in the real-time tier versus the interactive tier? To understand the best use of each of these, there are some questions you can start asking to help you determine which is the best fit for your use case. It’s worth noting that any decision will also be subject to other architectural considerations unique to each business. When do I need it? Over time, the value derived from actions on a singular piece of data becomes lower, and becomes more useful in the aggregate. The decay of immediate relevance for a piece a data is something like inverse exponential. My applications need to use the data now It’s helpful to think of real time as being your “Now Data” — the in-memory data which is relevant at this moment, which you need to have fast access to. Real time brings together your now data with aggregates of your historic data, which is where customers will find immediate value. Unique to each enterprise are the interactions between what your business is doing, and events external to your company. Some parts of your business may operate and respond in real time, while others may not. Keeping data in-memory can help to alleviate problems such as large batch jobs in back end systems taking too long. Think about areas of your business where real time data analysis would give you an advantage: Online retailers need to respond quickly to queries. This is even more critical when the retailer is a target of aggregators like Google Shopping Financial institutions reacting to market events and news Airlines trying to optimize scheduling of services while aircraft are on the ground in the most efficient and cost effective way Retailers need to keep taking orders during surges in demand, even if the back end systems can’t scale to accommodate. Financial institutions calculating risk of a particular transaction in real time to rapidly make the best decision possible. In such use cases, the answer is real time products such as IMDG. Once IMDG receives the data, the user can then act on it immediately through the event framework. It can take part in application-led XA transactions (global transactions from multiple data stores,) so anything that needs to be transactional and consistent should go there. What if you need it both now and later? If this is the case, you do not want to cram all of your computation and data into a single tier and have it address both cases. Neither will be solved well. The key item is to use the right solution in the right moment. It is recommended to constrain your real time tier to only respond to business events happening now. The work done on this tier should be focused on singular events, which includes anything that should be updated as a result of a single piece of data coming in. This could be a transaction from an internal system, or a feed from an external system. Since the real time tier must be as responsive as possible, you don’t want to do long-running, exhaustive work on it. You will want to do deep exploratory analysis somewhere else. With a singular piece of data, you might decide to update a running aggregate in-memory, send it to another system, persist it, index it, or take another action. 
The key is that the action being taken is based on the singular event or a small set of singular events that are being correlated. For longer running queries and analytics, such as year end reporting, or data exploration to detect new patterns in your business, the interactive and batch tiers are more appropriate. What are my storage requirements? You may have multiple answers to this question depending on the type of data. If you do not need to store the data long term, IMDG can manage it in-memory with strong consistency and availability. If you want to store it long term, but may not be working with it in the short term, then Map-Reduce is a highly scalable storage solution. If you are required to store the data because of regulations and reporting requirements, and it is well structured, then MPP AD is a fantastic answer. Where is my data coming from? Is the data coming from a stream of events from internal or external systems? Message driven architectures? Files? Extract, transform, and load (ETL) database events? IMDG is great at handling large and varying streams of data from any type of system. IMDG can handle data streams accelerating by adding more nodes to the system, meaning your pipe in isn’t throttled. Meanwhile, MPP AD and Map-Reduce solutions are both better at taking batch updates (either file or ETL). This works out well, because IMDG can write to both these systems in batch. It can be configured to write to any backend store. In such a case, you would use IMDG to do a large data ingest, write it out to Map-Reduce store in batch, and then analyze it there. What are my latency requirements? What latency does your business require in a given scenario and data set? For machine time latency (microseconds to seconds,) IMDG is the solution. If the latency is longer, or at the speed of human interaction, SQL over Map-Reduce or MPP AD might be most appropriate. Usually, these break down pretty cleanly into customer/partner latency (real time) versus internal latency (interactive and batch), however if you are in a real time business, like stock trading, everything may be time critical.
  • #6: Wiki: Transaction data are data describing an event (the change as a result of a transaction) and is usually described with verbs. Transaction data always has a time dimension, a numerical value and refers to one or more objects (i.e. the reference data). Pre-processing Requirements In many ways, your choice of platform is determined by the data you want to analyze. Instead of thinking in terms of structured, semi-structured, or unstructured data, consider the amount of pre-processing needed to develop an effective predictive model. Predictive models and machine learning algorithms need specific and well-formatted inputs. The steps to transform the raw data into something usable for modeling depend on the source and type of data. The following chart compares the analysis approaches for different data pre-processing requirements. Transactional data and traditional customer information records are best suited for MPP AD, as they require little to no pre-processing. Geospatial data often requires relatively complex geometric calculations Raw log files, XML or JSON files and typical social media data are semi-structured, and this is a situation where SQL over Map-Reduce is ideal. Users write SQL to interact with files in HDFS, enabling quick insights without writing Pig or MapReduce jobs. Depending on the log files, MPP AD can also be used to parse semi-structured logs efficiently. For text it is easy to include open-source Natural Language Processing toolkits in your processing pipeline within Map-Reduce stack using Procedural Languages such as PL/Python. Video or image data demands extensive pre-processing requirements. Map-Reduce is the best choice, especially when combined with in-memory processing engine. Summary Whether you need to have high scalability the first step is to make your data ready for exploration and analysis.
  • #7: http://todayinsci.com/S/Stoll_Clifford/StollClifford-Quotations.htm
  • #8: http://jeremykun.files.wordpress.com/2014/10/mapreduceimage.gif?w=1800 Wiki MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. The "MapReduce Framework" orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance. The model is inspired by the map and reduce functions commonly used in functional programming,[6] although their purpose in the MapReduce framework is not the same as in their original forms.[7] The key contributions of the MapReduce framework are not the actual map and reduce functions, but the scalability and fault-tolerance achieved for a variety of applications by optimizing the execution engine once. As such, a single-threaded implementation of MapReduce (such as MongoDB) will usually not be faster than a traditional (non-MapReduce) implementation, any gains are usually only seen with multi-threaded implementations.[8]  The use of this model is beneficial only when the optimized distributed shuffle/combine operation (which reduces network communication cost) and fault tolerance features of the MapReduce framework come into play. Optimizing the communication cost is essential to a good MapReduce algorithm. The name MapReduce originally referred to the proprietary Google technology, but has since been genericized.
  • #9: https://anonymousbi.files.wordpress.com/2012/11/hadoopdiagram.png MapReduce allows for distributed processing of the map and reduction operations. Provided that each mapping operation is independent of the others, all maps can be performed in parallel – though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is associative.  The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled – assuming the input data is still available. Another way to look at MapReduce is as a 5-step parallel and distributed computation: - Prepare the Map() input – the "MapReduce system" designates Map processors, assigns the input key value K1 that each processor would work on, and provides that processor with all the input data associated with that key value. - Run the user-provided Map() code – Map() is run exactly once for each K1 key value, generating output organized by key values K2. - “Combine" the Map output to the Reduce processors – the MapReduce system designates Reduce processors, assigns the K2 key value each processor should work on, and provides that processor with all the Map-generated data associated with that key value. - Run the user-provided Reduce() code – Reduce() is run exactly once for each K2 key value produced by the Map step. - Produce the final output – the MapReduce system collects all the Reduce output, and sorts it by K2 to produce the final outcome. These five steps can be logically thought of as running in sequence – each step starts only after the previous step is completed – although in practice they can be interleaved as long as the final result is not affected. In many situations, the input data might already be distributed ("sharded") among many different servers, in which case step 1 could sometimes be greatly simplified by assigning Map servers that would process the locally present input data. Similarly, step 3 could sometimes be sped up by assigning Reduce processors that are as close as possible to the Map-generated data they need to process. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks. Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster. The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.
  • #10: HDFS The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.  Assumptions and Goals Hardware Failure Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS. Streaming Data Access Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. Large Data Sets Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance. Simple Coherency Model HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed.  “Moving Computation is Cheaper than Moving Data” A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located. HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications. NameNode and DataNodes HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. 
The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case. The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode. Robustness The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary. Emphasis on the single point of failure or HDFS. HDFS and CAP http://blog.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance/ Systems such as ZooKeeper are explicitly sequentially consistent because there are few enough nodes in a cluster that the cost of writing to quorum is relatively small. The Hadoop Distributed File System (HDFS) also chooses consistency – three failed datanodes can render a file’s blocks unavailable if you are unlucky. Both systems are designed to work in real networks, however, where partitions and failures will occur*, and when they do both systems will become unavailable, having made their choice between consistency and availability. That choice remains the unavoidable reality for distributed data stores. *For more on the inevitably of failure modes in large distributed systems, the interested reader is referred to James Hamilton’s LISA ’07 paper On Designing and Deploying Internet-Scale Services. https://www.usenix.org/legacy/event/lisa07/tech/full_papers/hamilton/hamilton_html/index.html http://stackoverflow.com/questions/19923196/cap-with-distributed-system The NameNode is a Single Point of Failure for the HDFS Cluster. HDFS is not currently a High Availability system. When the NameNode goes down, the file system goes offline. There is an optional SecondaryNameNode that can be hosted on a separate machine. It only creates checkpoints of the namespace by merging the edits file into the fsimage file and does not provide any real redundancy.
  • #11: Java integration Minimally, applications specify the input/output file locations and supply map and reduce functions via implementations of the appropriate interfaces and/or abstract classes. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job client. In short, the user creates the configuration, implements the Mapper, Combiner and Reducer, defines the input and output data types and file paths, and submits everything to the job client, which does all the work with the JobTracker.
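As a hedged illustration of that flow, here is the classic word-count job against the Hadoop MapReduce Java API. It is a minimal sketch rather than code from this talk's project; the input and output paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);          // emit (word, 1) for every token
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();                    // sum all counts shuffled to this key
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combiner reuses the reducer (summation is associative)
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```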
  • #12: The Apache Hive ™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL. Impala is the open source, native analytic database for Apache Hadoop.
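One common way to reach Hive from the Java world is through the HiveServer2 JDBC driver. This is a minimal sketch under that assumption; the host name and the raw_logs table are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC endpoint; host, database and table names are made up.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
            "jdbc:hive2://hive-host:10000/default", "user", "");
         Statement stmt = conn.createStatement()) {
      // HiveQL looks like SQL but is compiled into distributed jobs over data in HDFS.
      try (ResultSet rs = stmt.executeQuery(
              "SELECT level, COUNT(*) FROM raw_logs GROUP BY level")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
        }
      }
    }
  }
}
```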
  • #13: http://blog.pivotal.io/big-data-pivotal/products/why-mpp-based-analytical-databases-are-still-key-for-enterprises http://architects.dzone.com/articles/sql-and-mpp-next-phase-big http://www.ndm.net/datawarehouse/Greenplum/greenplum-database-overview Most DW appliances use massively parallel processing (MPP) architectures to provide high query performance and platform scalability. MPP architectures consist of independent processors or servers executing in parallel. Most MPP architectures implement a "shared-nothing architecture" where each server operates self-sufficiently and controls its own memory and disk. DW appliances distribute data onto dedicated disk storage units connected to each server in the appliance. This distribution allows DW appliances to resolve a relational query by scanning data on each server in parallel. The divide-and-conquer approach delivers high performance and scales linearly as new servers are added into the architecture. MPP Analytical Database’s shared-nothing architecture provides every segment with an independent high-bandwidth connection to dedicated storage. The segment servers are able to process every query in a fully parallel manner, use all disk connections simultaneously, and efficiently flow data between segments as query plans dictates. The degree of parallelism and overall scalability that this allows far exceeds general-purpose database systems. By transparently distributing data and work across multiple 'segment' servers, MPP Analytical Database executes mathematically intensive analytical queries “close to the data” with performance that scales linearly with the number of segment servers. MPP-Based Analytic Databases have been designed with security, authentication, disaster recovery, high availability and backup/restore in mind. Main features to achieve performance goals are: Column-oriented storage Data compression Other outstanding features: ANSI SQL support. Built-in analytical functions to work with: Time series analysis Statistical analysis Event series analysis, i.e. pattern matching UDFs, i.e. R/C++/Java http://en.wikipedia.org/wiki/Shared_nothing_architecture SNA A shared nothing architecture (SN) is a distributed computing architecture in which each node is independent and self-sufficient, and there is no single point of contention across the system. More specifically, none of the nodes share memory or disk storage. People typically contrast SN with systems that keep a large amount of centrally-stored state information, whether in a database, an application server, or any other similar single point of contention.[citation needed] The advantages of SN architecture versus a central entity that controls the network (a controller-based architecture) include eliminating single points of failure, allowing self-healing capabilities and providing an advantage with offering non-disruptive upgrades. Emphasis on contrast to HDFS master/slave architecture with single point of failure
  • #14: K-Safety K-safety is a measure of fault tolerance in the database cluster. The value K represents the number of replicas of the data in the database that exist in the database cluster. These replicas allow other nodes to take over for failed nodes, allowing the database to continue running while still ensuring data integrity. If more than K nodes in the database fail, some of the data in the database may become unavailable. It is possible for an database to have more than K nodes fail and still continue running safely, because the database continues to run as long as every data segment is available on at least one functioning cluster node. Potentially, up to half the nodes in a database with a K-safety level of 1 could fail without causing the database to shut down. As long as the data on each failed node is available from another active node, the database continues to run. If half or more of the nodes in the database cluster fail, the database will automatically shut down even if all of the data in the database is technically available from replicas. This behavior prevents issues due to network partitioning. In HP Vertica, the value of K can be zero (0), one (1), or two (2). The physical schema design must meet certain requirements. To create designs that are K-safe, HP recommends using the Database Designer. The diagram above shows a 5-node cluster that has a K-safety level of 1. Each of the nodes contains buddy projections for the data stored in the next higher node (node 1 has buddy projections for node 2, node 2 has buddy projections for node 3, etc.). Any of the nodes in the cluster could fail, and the database would still be able to continue running (although with lower performance, since one of the nodes has to handle its own workload and the workload of the failed node).
  • #15: If node 2 fails, node 1 handles requests on its behalf using its replica of node 2's data, in addition to performing its own role in processing requests. The fault tolerance of the database will fall from 1 to 0, since a single node could cause the database to become unsafe. In this example, if either node 1 or node 3 fails, the database would become unsafe because not all of its data would be available. If node 1 fails, then node 2's data will no longer be available. If node 3 fails, its data will no longer be available, because node 2 is also down and could not fill in for it. In this case, nodes 1 and 3 are considered critical nodes. In a database with a K-safety level of 1, the node that contains the buddy projection of a failed node and the node whose buddy projections were on the failed node will always become critical nodes.
  • #16: With node 2 down, either node 4 or 5 in the cluster could fail and the database would still have all of its data available. For example, if node 4 fails, node 3 is able to use its buddy projections to fill in for it. In this situation, any further loss of nodes would result in a database shutdown, since all of the nodes in the cluster are now critical nodes. (In addition, if one more node were to fail, half or more of the nodes would be down, requiring HP Vertica to automatically shut down, no matter if all of the data were available or not.)
  • #17: In a database with a K-safety level of 2, any node in the cluster could fail after node 2 and the database would be able to continue running. For example, if in the 5-node cluster each node contained buddy projections for both its neighbors (for example, node 1 contained buddy projections for both node 5 and node 2), then nodes 2 and 3 could fail and the database could continue running. Node 1 could fill in for node 2, and node 4 could fill in for node 3. Due to the requirement that half or more nodes in the cluster be available in order for the database to continue running, the cluster could not continue running if node 5 were to fail as well, even though nodes 1 and 4 both have buddy projections for its data. K-Safety Requirements Your database must have a minimum number of 2K+1 nodes to be able to have a K-safety level of K. // CAP-theorem states that in distributed database if you have a network that may drop messages, then you cannot have both complete availability and perfect consistency in the event of a partition, instead you must choose one. The CAP theorem is useful from a system engineering perspective because distributed systems must pick 2/3 of the properties to implement and 1/3 to give up. A system that “gives up” on a particular property strives makes a best effort but cannot provide solid guarantees. Different systems choose to give up on different properties, resulting in different behavior when failures occur. However, there is a fair amount of confusion about what the C, A, and P actually mean for a system. Partition-tolerance – A network partition results in some node A being unable to exchange messages with another node B. More generally, the inability of the nodes to communicate. Systems that give up on P assume that all messages are reliably delivered without fail and nodes never go down. Pretty much any context in which the CAP theorem is invoked, the system in question supports P. Consistency – For these types of distributed systems, consistency means that all operations submitted to the system are executed as if in some sequential order on a single node. For example, if a write is executed, a subsequent read will observe the new data. Systems that give up on C can return inconsistent answers when nodes fail (or are partitioned). For example, two clients can read and each receive different values. Availability – A system is unavailable when a client does not receive an answer to a request. Systems that give up on A will return no answer rather than a potentially incorrect (or inconsistent) answer. For example, unless a quorum of nodes are up, a write will fail to succeed. Vertica is a stateful distributed system and thus worthy of consideration under the CAP theorem: Partition-tolerance – Vertica supports partitions. That is, nodes can fail or messages can fail to be delivered and Vertica can continue functioning. Consistency – Vertica is consistent. All operations on Vertica are strongly ordered – i.e., there is a singular truth about what data is in the system and it can be observed by querying the database. Availability – Vertica is willing to sacrifice availability in pursuit of consistency when failures occur. Without a quorum of nodes (over half), Vertica will shut down since no modification may safely be made to the system state. 
The choice to give up availability for consistency is a very deliberate one and represents cultural expectations for a relational database as well as a belief that a database component should make the overall system design simpler. Developers can more easily reason about the database component being up or down than about it giving inconsistent (dare I say … “wrong”) answers. One reason for this belief is that a lack of availability is much more obvious than a lack of consistency. The more obvious and simplistic a failure mode is, the easier integration testing will be with other components, resulting in a higher quality overall system.
  • #18: Java integration Via JDBC: in most cases a Java application uses regular SQL, just as it would with MySQL.
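A minimal sketch of that JDBC access, assuming Vertica's JDBC driver is on the classpath; the host, credentials and the sensor_readings table are invented for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

public class VerticaQueryExample {
  public static void main(String[] args) throws Exception {
    // Standard Vertica JDBC URL format; host, database and table are hypothetical.
    String url = "jdbc:vertica://vertica-host:5433/analytics";
    String sql = "SELECT device_id, AVG(value) AS avg_value "
               + "FROM sensor_readings "
               + "WHERE reading_time >= ? "
               + "GROUP BY device_id";
    try (Connection conn = DriverManager.getConnection(url, "dbadmin", "secret");
         PreparedStatement ps = conn.prepareStatement(sql)) {
      ps.setTimestamp(1, Timestamp.valueOf("2015-05-01 00:00:00"));
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
          // Plain SQL result set, exactly as with any relational database.
          System.out.printf("%s -> %.2f%n", rs.getString(1), rs.getDouble(2));
        }
      }
    }
  }
}
```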
  • #19: https://dataddict.wordpress.com/2013/05/14/choosing-a-mpp-database-is-incredibly-hard/ HP Vertica Analytics Platform 6.1 “Bulldozer” I’ve talked a lot about Vertica because I love this product, and the last release of the platform kept the same feeling in my head: This is a trulyAdvanced Database. But, believe me, I’m not the only one; companies like Twitter, Zynga, Convertro, are big users of the platform. If you want to know more about the Bulldozer release, just see the webinar dedicated to this topic here or you ca download its Datasheet from here. Greenplum Database 4.0 But Vertica’s team is not the only one innovating in this space. The amazing engineering team at Greenplum have added some outstanding features to its new release which are very useful and synonym of the hard work of this team: High Performance gNet™ for Hadoop Out-of-the-Box Support for Big Data Analytics Multi-level Partitioning with Dynamic Partitioning Elimination Polymorphic Data Storage-MultiStorage/SSD Support Fast Query processing with a new loading technology called Scatter/Gather Streaming, allowing automatic parallelization of data loading and queries Analytics and Language Support
, proving methods for advanced analytic functions like t-statistics, p-values, and Naïve Bayes inside the database, and besides have a great integration with R Dynamic Query Prioritization and many more amazing features Teradata Aster Data Database 5.0 This is another great team which is doing very well in this field, with itsamazing platform combining highly technical research too in a single product. Some of these features are: A patent-pending SQL-MapReduce framework who allows to combine the power of MapReduce with SQL Hybrid row/column storage depending of your needs Two great things called “Always-On” and “Always-Parallel” who allow to use parallelism for data and analytics processing; and provide world-class fault tolerance A great group of ready-to-use analytic functions for rapid analytic platform development called Aster MapReduce Analytics Portfolio Rich monitoring and easy management of data and analytic processing with the intuitive Aster Management Console Great integration with several languages like Java, C, C#, Python, C++, and R Dynamic mixed workload management ensures scalable performance even with large numbers of concurrent users and workloads You can download its Datasheet from here. ParAccel Analytic Platform 4.0 This is another team which is doing a terrific job building this outstanding analytic platform. Some of its features: On-Demand integration with Hadoop Columnar-Oriented storage scheme A power Extensibility framework Advanced Query Optimization, allowing to perform better complex operations like JOINs, sorting and query planing Advanced communication protocol for the whole interconnection around the cluster to improve the process of data loading, backup and recovery and parallel query execution Advanced I/O optimization which allows to scan performance improving, using high performance algorithms to predict which data blocks will be needed for future operations. Adaptive compression encoding depending of the involved data type A great number of analytic functions ready to use for a lot of techniques like pattern matching, time series analysis, advertising attribution and optimization, sophisticated fraud detection and event analysis and a lot of statistic methods like Univariate, Multivariate, Data Mining, Mathematical, Corporate Finance, Options/Derivatives, Portfolio Management, Fixed Income and many more. You can read more about all this on its Datasheet here. Amazon Redshift Amazon Redshift is based in a version of ParAccel.
  • #20: The data model is distributed across many servers in a single location or across multiple locations.  This distribution is known as a data fabric.  This distributed model is known as a ‘shared nothing’ architecture. All servers can be active in each site. All data is stored in memory of the servers. Servers can be added or removed non-disruptively, to increase the amount of CPU and RAM available. The data model is non-relational and is object-based.  Distributed applications written on the .NET and Java application platforms are supported. The data fabric is resilient, allowing non-disruptive automated detection and recovery of a single server or multiple servers. http://www.gridgain.com/in-memory-database-vs-in-memory-data-grid-revisited/ Before moving forward, let’s clarify what we mean by “in-memory”. Although some vendors refer to SSDs, Flash-on-PCI, Memory Channel Storage, and DRAM as “in-memory”, in reality, most vendors support a tiered storage model where part of the data is stored in DRAM, which then gets overflown to a variety of flash or disk devices. Therefore it is rarely a DRAM-only, Flash-only or disk-only product. However, it’s important to note that most products in both categories are often biased towards mostly DRAM or mostly flash/disk storage in their architecture. The main point to take away is that “in-memory” products are not confined to one fixed definition, but in the end they all have a significant “in-memory” component. In-Memory Data Grids typically lack full ANSI SQL support but instead provide MPP-based (Massively Parallel Processing) capabilities where data is spread across large cluster of commodity servers and processed in explicitly parallel fashion. The main access pattern is key/value access, MapReduce, various forms of HPC-like processing, and a limited distributed SQL querying and indexing capabilities. An In-Memory Data Grid always works with an existing database providing a layer of massively distributed in-memory storage and processing between the database and the application. Applications then rely on this layer for super-fast data access and processing. Most In-Memory Data Grids can seamlessly read-through and write-through from and to databases, when necessary, and generally are highly integrated with existing databases. Emphasis on similarities with MPP AD, just needs DB for persistence.
  • #21: Hazelcast and cap The oft-discussed CAP-theorem in distributed database theory states that if you have a network that may drop messages, then you cannot have both complete availability and perfect consistency in the event of a partition, instead you must choose one. Multiple copies of data (up to 3, backup-count property) are stored in multiple machines for automatic data recovery in case of single or multiple server failures Hazelcast’s approach to the CAP-theorem notes that network partitions caused by total loss of messages are very rare on modern LaNs; instead, Hazelcast applies the CAP-theorem over wide-area (and possibly geographically diverse) networks. In the event of a network partition where nodes remain up and connected to different groups of clients (i.e. a split-brain scenario), Hazelcast will give up Consistency (“C”) and remain available (“a”) whilst partitioned (“p”). Emphasis on contrast with HP Vertica However, unlike some NoSQL implementations, C is not given up unless a network partition occurs. The effect for the user in the event of a partition would be that clients connected to one partition would see locally consistent results. However, clients connected to different partitions would not necessarily see the same result. For example, an AtomicInteger could now potentially have different values in different partitions. Fortunately, such partitions are very rare in most networks. Since Hazelcast clients are always made aware of a list of nodes they could connect to, in the much more likely event of loss of a datacenter (or region), clients would simply reconnect to unaffected nodes.
  • #22: Add computations to data. http://hazelcast.com/use-cases/imdg/in-memory-computing/
  What is in-memory computing? Microprocessors double in performance and speed roughly every two years, and software developers have created analytics that let researchers crunch millions of variables from disparate sources of information. However, the time it takes a server or a smartphone to retrieve data from a storage system for a cloud company or hosting provider hasn’t decreased much, since it still involves searching a spinning, mechanical hard drive. Such a transaction might take only milliseconds, but millions of transactions per day add up to a lot of time.
  In-memory computing (IMC) reduces the need to fetch data from disks by moving more data out of the drives and into faster memory, such as flash (or, in the case of Hazelcast, RAM). Flash-based memory can be more than 53 times faster than disk-based memory. IMC that processes distributed data in parallel across multiple nodes is a technical necessity, because the data is stored that way (in-memory across multiple nodes) due to its sheer size. Moving data from drives to memory gives ultra-fast access to data and allows developers to cut many lines of code from their applications. This helps on many fronts:
  - Fewer product delays and operations headaches
  - Better customer usability experience
  - Higher customer satisfaction
  - Business customers such as retailers, banks, and utilities can quickly detect patterns, analyze massive volumes of data on the fly, and perform operations in real time
  In-memory computing advantages. These stem from the fact that reading and writing data purely in-memory is faster than working with data stored on a drive. For example:
  - Cache large amounts of data and get fast response times for searches
  - Store session data, allowing for customization of live sessions and optimized website performance
  - Improve complex event processing
  Benefits of using Hazelcast. Being a true IMC solution, Hazelcast lets you apply and execute business logic and complex algorithms on in-memory data on the scale of terabytes and petabytes. Some of the immediate benefits include:
  - Storing vast amounts of data in-memory, ensuring extremely fast response times for extracting data
  - Bypassing the complex memory tuning that would otherwise be required to keep huge data sets in-memory
  - Saving hardware and operations costs thanks to the capacity to store large volumes of data (in RAM) within a single machine
  - Processing large volumes of in-memory data in parallel across a handful of nodes
  - Ultra-low latency and high-throughput infrastructure thanks to in-memory storage and computing
  - In-memory MapReduce and Aggregators for super-fast computation/aggregation of large amounts of data
  Hazelcast is flexible. It provides multiple locking semantics so users can attain the desired level of consistency while performing complex in-memory computing on data. Some of the Hazelcast features that enable in-memory computing are: Distributed Executor Service, EntryProcessors, MapReduce, and Distributed Aggregators (an EntryProcessor sketch follows below). The diagram on the slide depicts a typical IMC architecture with Hazelcast.
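  To make “adding computations to data” concrete, here is a minimal, hypothetical EntryProcessor sketch against the Hazelcast 3.x API: the processor is shipped to the member that owns the key and mutates the entry there, instead of pulling the value to the caller. The account/balance naming is purely illustrative.

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.map.AbstractEntryProcessor;

import java.io.Serializable;
import java.util.Map;

public class DepositExample {

    // The processor is serialized and executed on the member that owns the key,
    // so only the small processor object travels over the network, not the data.
    static class DepositProcessor extends AbstractEntryProcessor<String, Long>
            implements Serializable {
        private final long amount;

        DepositProcessor(long amount) {
            this.amount = amount;
        }

        @Override
        public Object process(Map.Entry<String, Long> entry) {
            long current = entry.getValue() == null ? 0L : entry.getValue();
            entry.setValue(current + amount);   // updated in place on the owning node
            return entry.getValue();
        }
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, Long> balances = hz.getMap("balances");

        balances.put("account-42", 100L);
        Object newBalance = balances.executeOnKey("account-42", new DepositProcessor(25L));
        System.out.println("New balance: " + newBalance); // 125

        Hazelcast.shutdownAll();
    }
}
```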
  • #23: Two topologies: every node is an application server as well (embedded mode), or a dedicated Hazelcast cluster accessed by clients.
  Java integration: Hazelcast provides a drop-in library that any Java developer can include in minutes to build elegantly simple mission-critical, transactional, and terascale in-memory applications. Data structures: Map, Queue, Lock, Cache. Computing: Aggregators and Map-Reduce.
  Hazelcast provides a convenient and familiar interface for developers to work with distributed data structures and other aspects of in-memory computing. For example, in its simplest configuration, Hazelcast can be treated as an implementation of the familiar ConcurrentHashMap that can be accessed from multiple JVMs, including JVMs spread out across the network. However, it is not necessary to deal with the overall sophistication of the architecture in order to work effectively with Hazelcast, and many users are happy integrating purely at the level of the java.util.concurrent or javax.cache APIs. http://hazelcast.org/
  Core Java:
  - java.util.concurrent.ConcurrentMap
  - com.hazelcast.core.MultiMap >> Collection<String> values = multiMap.get("key")
  - javax.cache.Cache
  - java.util.concurrent.BlockingQueue
  - java.util.concurrent.locks.Lock
  Hazelcast-specific:
  - com.hazelcast.core.IMap // EntryProcessor >> map.executeOnKey
  - com.hazelcast.core.ITopic
  - com.hazelcast.core.IExecutorService
  - com.hazelcast.core.IAtomicLong
  In-memory data computing:
  - com.hazelcast.mapreduce.aggregation.Supplier
  - com.hazelcast.mapreduce.aggregation.Aggregations
  - com.hazelcast.mapreduce.Job
  - com.hazelcast.mapreduce.JobTracker
  - com.hazelcast.mapreduce.KeyValueSource
  - com.hazelcast.mapreduce.Mapper
  - com.hazelcast.mapreduce.Reducer
  - …
  (A short aggregation sketch follows below.)
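  A short sketch tying a few of the classes above together, using the Hazelcast 3.x aggregation API; the map name and values are made up for illustration.

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.mapreduce.aggregation.Aggregations;
import com.hazelcast.mapreduce.aggregation.Supplier;

public class AggregationExample {
    public static void main(String[] args) {
        // Any JVM that starts an instance with the same configuration joins the
        // same cluster and sees the same distributed map.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, Integer> responseTimes = hz.getMap("response-times-ms");

        responseTimes.put("req-1", 120);
        responseTimes.put("req-2", 85);
        responseTimes.put("req-3", 430);

        // The aggregation runs in parallel on the members that own the data;
        // only the partial results are sent back to the caller.
        long total = responseTimes.aggregate(Supplier.all(), Aggregations.integerSum());
        System.out.println("Total response time: " + total + " ms");

        Hazelcast.shutdownAll();
    }
}
```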
  • #24: How it fits the big data problem. Apache Ignite In-Memory Data Fabric is a high-performance, integrated, and distributed in-memory platform for computing and transacting on large-scale data sets in real time, orders of magnitude faster than is possible with traditional disk-based or flash technologies. Oracle Coherence is an industry-leading in-memory data grid solution that enables organizations to predictably scale mission-critical applications by providing fast access to frequently used data. Pivotal GemFire is a distributed data management platform designed for many diverse data management situations, and it is especially useful for high-volume, latency-sensitive, mission-critical, transactional systems. GigaSpaces XAP is an in-memory computing software platform that processes all your data and apps in real time.
  • #25: Project domain. Talk about Gen3 – Oracle & .NET: it works but does not scale.
  • #27: Project goals. http://en.wikipedia.org/wiki/Business_analytics#Types_of_analytics http://en.wikipedia.org/wiki/Data_analysis Note: all system changes, i.e. the parameters for analytical computations, are time-series data.
  • #28: Project technology. Akka is an open-source toolkit and runtime simplifying the construction of concurrent and distributed applications on the JVM. Akka supports multiple programming models for concurrency, but it emphasizes actor-based concurrency, with inspiration drawn from Erlang. Language bindings exist for both Java and Scala. R is a free software environment for statistical computing and graphics.
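  For readers unfamiliar with the actor model, here is a minimal Java sketch using the classic Akka UntypedActor API that was current at the time of this talk; the actor and the word-counting logic are illustrative only and not part of the project.

```java
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;
import akka.actor.UntypedActor;

public class ActorExample {

    // Each actor processes one message at a time, so its internal state
    // needs no explicit locking -- the basis of Akka's concurrency model.
    static class WordCounter extends UntypedActor {
        private long words = 0;

        @Override
        public void onReceive(Object message) {
            if (message instanceof String) {
                words += ((String) message).split("\\s+").length;
                System.out.println("Words so far: " + words);
            } else {
                unhandled(message);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ActorSystem system = ActorSystem.create("analytics");
        ActorRef counter = system.actorOf(Props.create(WordCounter.class), "word-counter");

        // tell() is asynchronous: the caller never blocks on the actor.
        counter.tell("big data analysis in the java world", ActorRef.noSender());
        counter.tell("actor based concurrency", ActorRef.noSender());

        Thread.sleep(1000);   // crude wait so the demo output appears before shutdown
        system.shutdown();
    }
}
```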
  • #29: Talk on:
  - Vertica's built-in time-series analysis capabilities (see the JDBC sketch below)
  - Redis as an in-memory database for recent data
  - Akka as the framework for building distributed applications
  - R as the statistical computing language
  - Java-R integration via a RESTful API
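  As a hedged example of the first item, Vertica's TIMESERIES clause can be queried from Java over plain JDBC; the schema (a sensor_readings table with sensor_id, reading_ts, reading_value columns) and the connection details are assumptions for illustration, not the project's actual model.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VerticaTimeSeriesExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; Vertica's JDBC driver must be on the classpath.
        String url = "jdbc:vertica://vertica-host:5433/analytics";

        // Vertica's TIMESERIES clause buckets raw readings into fixed time slices and
        // gap-fills/interpolates missing points -- here linearly, per sensor.
        String sql =
            "SELECT slice_time, sensor_id, " +
            "       TS_FIRST_VALUE(reading_value, 'LINEAR') AS interpolated_value " +
            "FROM sensor_readings " +
            "TIMESERIES slice_time AS '1 minute' " +
            "OVER (PARTITION BY sensor_id ORDER BY reading_ts)";

        try (Connection conn = DriverManager.getConnection(url, "dbadmin", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("%s sensor=%s value=%.2f%n",
                        rs.getTimestamp("slice_time"),
                        rs.getString("sensor_id"),
                        rs.getDouble("interpolated_value"));
            }
        }
    }
}
```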
  • #30: What could be done another way http://www.vertica.com/2012/11/14/how-to-parse-anything-into-vertica-using-externalfilter/ http://www.vertica.com/2012/10/02/how-to-implement-r-in-vertica/
  • #31: Lessons learned