Best Open Source Big Data Tools 2025

Browse free open source Big Data tools and projects below.

1

Apache HBase

Get random, realtime read/write access to your Big Data

Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables, billions of rows by millions of columns, atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's "Bigtable: A Distributed Storage System for Structured Data" by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS. Features include a Thrift gateway and a RESTful web service that supports XML, Protobuf, and binary data encoding options; support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX; and convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
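To make the "versioned" and "billions of rows by millions of columns" wording concrete, here is a minimal sketch of a Bigtable-style data model: a sparse map from row key to column to timestamped values, with scans in row-key order. It is a toy in-memory illustration only; the class `ToyVersionedTable` and its methods are hypothetical names and are not the HBase client API.

```python
from collections import defaultdict


class ToyVersionedTable:
    """Toy in-memory table: row key -> column -> {timestamp: value}.

    Illustrative model only; this is not the HBase client API.
    """

    def __init__(self, max_versions=3):
        # Assumed to be at least 1 in this sketch.
        self.max_versions = max_versions
        # Sparse storage: only cells that were written take up space.
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        versions = self._rows[row][column]
        versions[timestamp] = value
        # Keep only the newest max_versions timestamps for this cell.
        for old in sorted(versions)[:-self.max_versions]:
            del versions[old]

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before timestamp (latest if None)."""
        versions = self._rows.get(row, {}).get(column, {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(candidates)] if candidates else None

    def scan(self, start_row, stop_row):
        """Yield (row, column, latest value) for start_row <= row < stop_row."""
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                for column in sorted(self._rows[row]):
                    yield row, column, self.get(row, column)
```

A real deployment distributes such a table across many servers; the sketch only shows why reads can ask for a cell "as of" a timestamp and why range scans follow row-key order.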

Downloads: 5 This Week

Last Update: 2025-11-14
2

Apache RocketMQ

Distributed messaging and streaming platform with low latency

Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity, and flexible scalability. Messaging patterns include publish/subscribe, request/reply, and streaming. Financial-grade transactional messages. Built-in fault tolerance and high-availability configuration options based on DLedger. A variety of cross-language clients, such as Java, C/C++, Python, and Go. Pluggable transport protocols, such as TCP, SSL, and AIO. Built-in message tracing capability, with support for OpenTracing. Versatile big-data and streaming ecosystem integration. Message retroactivity by time or offset. Reliable FIFO and strictly ordered messaging in the same queue. Efficient pull and push consumption models. Million-level message accumulation capacity in a single queue. Multiple messaging protocols, such as JMS and OpenMessaging. Flexible distributed scale-out deployment architecture. Lightning-fast batch message exchange system.
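Ordering that holds only "in the same queue" implies that related messages must be routed to one queue. The sketch below shows that general idea with a stable hash of a sharding key; it is a toy model, and `ToyTopic`, `select_queue`, `send`, and `consume` are hypothetical names, not the RocketMQ client API.

```python
import zlib
from collections import deque


class ToyTopic:
    """Toy topic with several FIFO queues; not the RocketMQ client API."""

    def __init__(self, queue_count=4):
        self.queues = [deque() for _ in range(queue_count)]

    def select_queue(self, sharding_key):
        # A stable hash keeps every message with the same key in one queue.
        return zlib.crc32(sharding_key.encode("utf-8")) % len(self.queues)

    def send(self, sharding_key, body):
        index = self.select_queue(sharding_key)
        self.queues[index].append((sharding_key, body))
        return index

    def consume(self, index):
        """Pop the oldest message from one queue, or None if it is empty."""
        queue = self.queues[index]
        return queue.popleft() if queue else None
```

For example, sending the events of one order with the same key places them in a single queue, so a consumer draining that queue sees them in the order they were sent. Messages with different keys may land in different queues and carry no ordering guarantee relative to each other.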

Downloads: 1 This Week

Last Update: 2 days ago
3

Logan

Logan is a lightweight case logging system for mobile platforms

Logan is a log platform with the ability to collect, store, upload, and analyze front-end logs. We provide five components: the iOS SDK, the Android SDK, the Web SDK, the analysis-service Server SDK, and LoganSite. In addition, we also provide a Flutter plugin. LoganSite provides a visual way for developers to scan and search logs uploaded from apps and the Web. To put it simply, the traditional approach is to piece together the problems that appear in the logs of each system, whereas the new approach is to aggregate and analyze all the logs generated by a user to find the scenarios in which problems occur. In the future, we will provide a data platform based on Logan big data, including advanced functions such as machine learning, log-based troubleshooting solutions, and big data feature analysis.

Downloads: 1 This Week

Last Update: 2025-08-05
4

MyCAT

Active, high-performance open source database middleware

MyCAT is open-source software: "a large database cluster" oriented to enterprises. MyCAT is an enhanced database that can replace MySQL and supports transactions and ACID. Regarded as an enterprise database that acts as a MySQL cluster, MyCAT can take the place of an expensive Oracle cluster. MyCAT is also a new type of database, resembling a SQL server integrated with in-memory caching, NoSQL technology, and HDFS big data. As a modern enterprise database product, MyCAT combines the traditional database with a new distributed data warehouse. In short, MyCAT is a new kind of database middleware. MyCAT's objective is to smoothly migrate existing stand-alone databases and applications to the cloud at low cost, and to solve the bottlenecks caused by the rapid growth of data storage and business scale.

Downloads: 1 This Week

Last Update: 2021-06-28
5

HPCC Systems

End-to-end big data in a massively scalable supercomputing platform.

HPCC Systems® (www.hpccsystems.com) from LexisNexis® Risk Solutions is a proven, open source solution for Big Data insights that can be implemented by businesses of all sizes. With HPCC Systems, developers can design applications with Big Data at their core, enabling businesses to better analyze and understand data at scale, improving business time to results and decisions. HPCC Systems offers a consistent data-centric programming language, two processing platforms, and a single, complete end-to-end architecture for efficient processing. Read our blog (http://hpccsystems.com/blog), or connect with us on Twitter (@hpccsystems), Facebook (https://www.facebook.com/hpccsystems), and LinkedIn (http://www.linkedin.com/company/hpcc-systems). HPCC Systems is available on AWS and can be configured through the Instant Cloud Solution.

2 Reviews

Downloads: 4 This Week

Last Update: 6 days ago
6

json-scada

A portable SCADA/IoT platform centered on the MongoDB database server.

Standard IT tools applied to SCADA/IoT (MongoDB, PostgreSQL/TimescaleDB, Node.js, C#, Golang, Grafana, etc.). MongoDB as the real-time core database, persistence layer, config store, and SOE historian. Portability and interoperability across Linux, Windows, x86/64, and ARM. Horizontal scalability, from a single computer to big clusters (MongoDB sharding), on bare metal, Docker containers, VMs, cloud, or hybrid deployments. Unlimited tags, servers, and users. HTML5 Web interface. UTF-8/I18N. Protocols: IEC61850 Client, IEC60870-5-101/104 Client and Server, DNP3 Client, OPC-UA Client/Server, MQTT/Sparkplug-B, and Telegraf (various data sources for monitoring, such as Modbus, SNMP, etc.). GitHub project: https://github.com/riclolsen/json-scada. Requirements for the Windows installer: Windows 10/11 64-bit or Server 2016, and Windows PowerShell.

Downloads: 2 This Week

Last Update: 2025-11-10
7

FrincBackup

Incremental backup tool supporting removable storage devices

FrincBackup means free incremental backup. It was developed for backing up an x TB NAS with storage devices in a logical volume to multiple removable storage devices, such as 500 GB USB hard drives. Files are backed up as files (not as an archive) and are readable without any tool and without FrincBackup itself (although there is a restore mode for easier handling).

Downloads: 1 This Week

Last Update: 2014-07-18
8

Cube Platform

Cube Platform is a decentralized grid computing system that uses the Pastry P2P protocol for communication between nodes. It is a big data storage system written in Java.

Downloads: 0 This Week

Last Update: 2013-04-23
9

Fluid

Fluid, elastic data abstraction and acceleration for BigData/AI apps

Fluid provides elastic data abstraction and acceleration for BigData/AI applications in the cloud. It provides a DataSet abstraction over underlying heterogeneous data sources, with multidimensional management in a cloud environment. It enables dataset warmup and acceleration for data-intensive applications by using a distributed cache in Kubernetes, with observability, portability, and scalability. It takes the characteristics of applications and data into account when scheduling cloud applications and datasets, in order to improve performance.

Downloads: 0 This Week

Last Update: 2025-10-31
10

GnuCopy

GnuCopy is an open-source tool to copy and archive all your important data. It supports all important archive types, such as Zip and Tar, to guarantee easy and secure exchange between all types of operating systems. Additionally, you can create profiles to blacklist or whitelist specific file types or folders in order to separate your big data stores for backups.

Downloads: 0 This Week

Last Update: 2023-07-28
11

ODD Platform

First open-source data discovery and observability platform

Unlock the power of big data with OpenDataDiscovery Platform. Experience seamless end-to-end insights, powered by unprecedented observability and trust, from ingestion to production, while building your ideal tech stack. Democratize data and accelerate insights. Find data that fits your use case and discover hints left by your peers to leverage existing knowledge. Explore tags, ownership details, links to other sources, and other information to shorten and simplify the data discovery phase. Forget unnerved stakeholders and time wasted digging for the root cause of data issues when something fails. With ODD's automatic company-wide ingestion-to-product lineage, you'll have answers in just seconds and stakeholders won't need to wait. Sleep well, knowing all your data is in check. Forget manual testing, days of debugging, and weeks of worrying. Know the impact of each code change with automatic testing. Enjoy lineage and alerts powered by data quality information.

Downloads: 0 This Week

Last Update: 2025-02-19
12

Oasis Development Tool

The OASIS Development Tool is an innovative IDE for code generation, code debugging, and visual coding using the OASIS Programming Language. The OASIS Programming Language is a 4GL concurrency and database language that runs on a distributed OASIS Runtime Machine Environment (RME) as interpreted OASIS Scripts sequenced into OASIS Polyglot Runtime Components (PRC) with just-in-time patterns. The IDE is designed specifically for the OASIS Programming Language and is built around the concept of visual, online, data-centric, concurrent, and runtime code. It has a number of visual drag-and-drop coding features. The tool is by no means a representative of the cyclical UML model-and-code concept, but rather a replacement for it. The IDE focuses on (team-based) system engineering, metaprogramming, visual coding, concurrent processing, databases, and Big Data.

Downloads: 0 This Week

Last Update: 2015-02-15
13

Streams for IBM i

Batch performance boosting and Big Data framework for IBM i

Streams for IBM i is a suite of tools for IBM i (previously known as AS/400 and iSeries) that can significantly improve the performance characteristics of batch processes. Through extensive use of parallel programming techniques, Streams for IBM i delivers significant performance improvements for single-streamed batch jobs. Streams for IBM i can split an existing batch process into a number of concurrent streams, completely eliminate backup-related delays, introduce new robust recovery policies, and even modify the program logic of existing applications, all without any code modifications. Streams for IBM i includes a feature that allows manipulation of batch job QTEMP libraries.
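The general idea of splitting one batch job into concurrent streams can be sketched as follows. This is a generic illustration of the technique in Python, not how Streams for IBM i works internally; the functions `split_into_streams`, `process_stream`, and `run_batch`, and the summing workload, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_streams(records, stream_count):
    """Partition records round-robin into stream_count lists."""
    return [records[i::stream_count] for i in range(stream_count)]


def process_stream(stream):
    # Hypothetical per-stream work: here, just sum the record values.
    return sum(stream)


def run_batch(records, stream_count=4):
    """Process the batch as several concurrent streams and merge the results."""
    streams = split_into_streams(records, stream_count)
    with ThreadPoolExecutor(max_workers=stream_count) as pool:
        partial_totals = list(pool.map(process_stream, streams))
    return sum(partial_totals)
```

The approach only pays off when the streams are independent of one another, which is why the partitioning step matters as much as the parallel execution.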

Downloads: 0 This Week

Last Update: 2014-06-16
14

giServer

giServer: the easy-to-use and extensible batch and integration server

The giServer is an easy-to-use integration server for process automation and event-driven or scheduled execution of batch jobs. Instead of using complex XML configuration files, an elaborate GUI for batch job management is included. Some possible usage scenarios are:
- Automatic processing of incoming data files
- Big Data applications
- Process automation
- Data Mining/Aggregation applications
- Automatic Reporting
- Processing and analysis of database records

Downloads: 0 This Week

Last Update: 2015-05-15
15

wzd

Powerful storage server, designed for big data storage systems

wZD is a server written in Go that uses a modified version of the BoltDB database as a backend for saving and distributing any number of small and large files and NoSQL keys/values in a compact form inside micro Bolt databases (archives). Files and values are distributed across BoltDB databases according to the number of directories or subdirectories and the general structure of the directories. Using wZD can permanently solve the problem of a large number of files on any POSIX-compatible file system, including a clustered one. Outwardly, it works like a regular WebDAV server.

Downloads: 0 This Week

Last Update: 2020-05-19