Memory-efficient data loading techniques
For datasets too large to fit in memory, we can use memory mapping or chunking techniques.
Memory mapping leverages OS-level functionality to map large files directly into memory without loading the entire file. This enables random access to portions of the file, making it suitable for scenarios requiring frequent but non-sequential access. It is fast for large, structured datasets such as embeddings or tokenized text files but may have higher overhead for small, scattered reads.
Chunking, on the other hand, divides data into smaller, sequentially processed chunks. This is effective for streaming large, sequentially accessed datasets (for example, text or logs) into memory-limited environments. While simpler and more portable, chunking may be slower for random access patterns compared to memory mapping.
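As a minimal sketch of the chunking approach, the following reads a file in fixed-size byte chunks rather than all at once; the helper name, chunk size, and synthetic log file are illustrative assumptions, not part of any particular library:

```python
import os
import tempfile

def iter_chunks(path, chunk_size=1 << 16):
    """Yield successive byte chunks from a file without loading it whole."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

# Illustrative usage: stream a synthetic log file and count lines chunk by chunk.
with tempfile.NamedTemporaryFile(mode="w", suffix=".log", delete=False) as tmp:
    tmp.write("event\n" * 100_000)
    path = tmp.name

line_count = sum(chunk.count(b"\n") for chunk in iter_chunks(path))
print(line_count)  # 100000
os.remove(path)
```

Because each chunk is discarded before the next is read, peak memory stays near the chunk size regardless of the total file size.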
Here’s an example using NumPy’s memmap feature, which creates array-like objects that map to files on disk, permitting access to slices of the data without loading the entire file into memory.
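A minimal sketch of this pattern is shown below; the file path, array shape, and dtype are illustrative assumptions standing in for a large embedding matrix:

```python
import os
import tempfile
import numpy as np

# Create a synthetic binary file standing in for a large on-disk dataset;
# the shape (10,000 x 64) and float32 dtype here are assumptions for demonstration.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "embeddings.bin")
np.arange(10_000 * 64, dtype=np.float32).reshape(10_000, 64).tofile(path)

# Map the file instead of loading it; only the slices we touch are read from disk.
emb = np.memmap(path, dtype=np.float32, mode="r", shape=(10_000, 64))
row = emb[1234]        # random access to a single "embedding" row
print(float(row[0]))   # first value of row 1234: 1234 * 64 = 78976.0
```

Note that `np.memmap` supports the usual array indexing, so downstream code can often treat it like an in-memory array while the OS pages data in on demand.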