From the course: Big Data Analytics with Hadoop and Apache Spark

Storage formats

- [Instructor] In this chapter, I will review the various options available and best practices for storing data in HDFS. I will start off with storage formats in this video. HDFS supports a variety of storage formats, each with its own advantages and use cases. The list includes raw text files, structured text files like CSV, XML, and JSON, native sequence files, Avro formatted files, ORC files, and Parquet files. I will review the most popular ones for analytics now. Text files carry the same format they have in a normal file system. They are stored as a single physical file in HDFS. They perform poorly because they do not support parallel operations. They require more storage and do not have an enforced schema. In general, they are not recommended. Avro files support language-neutral data serialization, so data written with one language or tool can be read with another without problems. Data is stored row by row, like CSV files. They support a self-describing schema, which can be used to enforce constraints on the data. They are compressible and hence can optimize storage. They are splittable into partitions and hence can help with parallel reads and writes. They are ideal for situations that require multi-language support. Parquet files store data column by column, similar to columnar databases. This means each column can be read separately from disk without reading other columns, which saves on IO. They support a schema. Parquet files are both compressible and splittable, and hence are optimized for performance and storage. They can also support nested data structures. Parquet files are ideal for batch analytics jobs for these reasons. Analytics applications typically have data stored as records and columns, similar to RDBMS tables. Parquet provides overall better performance and flexibility for these applications. I will show later in the course how Parquet enables parallelization and IO optimization.
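As a minimal sketch of how these formats are written and read from Spark (not taken from the course video), the snippet below writes the same DataFrame as CSV, Avro, and Parquet and then reads only two columns back from Parquet. The paths and sample records are hypothetical, and Avro output assumes the external spark-avro package is available on the classpath (for example via --packages org.apache.spark:spark-avro_2.12).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-formats-demo").getOrCreate()

# Sample records, similar to the row-and-column data used in analytics jobs
df = spark.createDataFrame(
    [(1, "alice", 42.0), (2, "bob", 17.5)],
    ["id", "name", "score"],
)

# Plain text / CSV: row-oriented, no enforced schema, least efficient
df.write.mode("overwrite").option("header", True).csv("/tmp/demo/csv")

# Avro: row-oriented, self-describing schema, compressible and splittable
# (requires the spark-avro package)
df.write.mode("overwrite").format("avro").save("/tmp/demo/avro")

# Parquet: column-oriented, schema-aware, compressible and splittable
df.write.mode("overwrite").parquet("/tmp/demo/parquet")

# Reading Parquet back: selecting columns lets Spark skip the ones not needed
scores = spark.read.parquet("/tmp/demo/parquet").select("id", "score")
scores.show()
```

Because Parquet stores each column separately, the final read only has to touch the "id" and "score" columns on disk, which is the IO saving described above.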
