This document is a thesis submitted for the degree of Bachelor of Computer Science at Opole University of Technology. It explores the use of distributed systems for processing large text datasets, with a focus on near-duplicate text detection. The study reviews big data concepts, popular analytics frameworks such as Hadoop and Spark, and algorithms for measuring the degree of duplication between documents. The findings were applied to develop a prototype distributed anti-plagiarism system, which showed improved performance over existing solutions when analyzing large collections of text data.