- Apache Spark is an open-source cluster computing framework that provides fast, in-memory processing for large-scale data analytics. It can run on an existing Hadoop cluster or in its own standalone mode.
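- Spark processes data by applying transformations and actions to resilient distributed datasets (RDDs). Transformations are lazy and describe a new RDD; actions trigger the computation and return a result. RDDs can be persisted in memory so that repeated use avoids recomputation.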
- Spark comes with modules for SQL queries, machine learning, streaming, and graph processing. Spark SQL runs SQL queries on structured data. MLlib provides scalable machine learning. Spark Streaming processes live data streams. GraphX handles graph computation.
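
The transformation/action model and in-memory persistence described above can be illustrated with a small, single-process sketch. This is a toy model and not Spark itself: `MiniRDD` is a hypothetical class invented for this example, it runs on one machine with no partitioning or fault tolerance, and it borrows the method names `parallelize`, `map`, `filter`, `persist`, `collect`, and `count` from the RDD API only to show the evaluation order (in Spark, `parallelize` lives on the `SparkContext` rather than on the RDD).

```python
from typing import Callable, Iterable, Iterator, List, Optional


class MiniRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's implementation)."""

    def __init__(self, compute: Callable[[], Iterator]):
        self._compute = compute              # recipe that produces the elements
        self._persist = False                # set by persist()
        self._cache: Optional[List] = None   # filled on first evaluation

    @classmethod
    def parallelize(cls, data: Iterable) -> "MiniRDD":
        items = list(data)
        return cls(lambda: iter(items))

    # Transformations: return a new MiniRDD and do no work yet.
    def map(self, fn: Callable) -> "MiniRDD":
        return MiniRDD(lambda: (fn(x) for x in self._iterate()))

    def filter(self, pred: Callable) -> "MiniRDD":
        return MiniRDD(lambda: (x for x in self._iterate() if pred(x)))

    def persist(self) -> "MiniRDD":
        # Only marks the dataset; the cache is filled by the first action.
        self._persist = True
        return self

    def _iterate(self) -> Iterator:
        if self._cache is not None:
            return iter(self._cache)
        if self._persist:
            self._cache = list(self._compute())
            return iter(self._cache)
        return self._compute()

    # Actions: force evaluation and return a plain Python value.
    def collect(self) -> List:
        return list(self._iterate())

    def count(self) -> int:
        return sum(1 for _ in self._iterate())


calls = []  # records every invocation of square()


def square(x: int) -> int:
    calls.append(x)
    return x * x


squares = MiniRDD.parallelize(range(1, 6)).map(square).persist()
evens = squares.filter(lambda x: x % 2 == 0)

nothing_ran_yet = (calls == [])   # True: only transformations so far
result = evens.collect()          # action: [4, 16]
total = squares.count()           # action: 5, served from the cache
```

After both actions, `calls` holds each input exactly once, because the second action reads the persisted values instead of re-running `square`. Without the `persist()` call, the same pipeline would evaluate `square` again for every action.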