This document began in 2013 as a survey of the landscape of "Big Data" offerings. I've added to and updated it over the years, but it is by no means complete (and may contain inaccuracies). Pull requests are welcome for corrections and additions!
Apache Spark is a fast and general engine for large-scale data processing.
Apache Storm is a free and open source distributed realtime computation system.
Hindsight is an open source stream processing software system developed by Mozilla. It was implemented as a successor to Heka.
Amazon CloudWatch is a monitoring service for AWS cloud resources and the applications you run on AWS.
Fluentd is an open source data collector designed for processing high-volume data streams.
Combining the massively popular Elasticsearch, Logstash, and Kibana yields an end-to-end stack that delivers actionable insights in real time from almost any type of structured or unstructured data source.
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Disco is an implementation of MapReduce for distributed computing.
Inferno is an open-source Python map/reduce library, powered by Disco.
Apache Samza is a distributed stream processing framework.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
Hadoop is kind of a container/base for many other projects, such as HBase, Hive, Spark (see above), Mahout, Pig, Tez, ...
- http://www.cloudera.com/content/cloudera/en/home.html
- Tez - The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data.
- http://hortonworks.com/
- http://www.jethrodata.com/home
- http://www.platfora.com/
- http://www.splicemachine.com/
- http://sqrrl.com/
- Cascading is an application development platform for building data applications on Apache Hadoop. Its APIs define data-processing flows and expose a rich set of operations (sort, average, filter, merge, etc.) that let you think in terms of the data and the business problem.
- http://gethue.com/ Hue is a Web interface for analyzing data with Apache Hadoop.
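The "simple programming models" Hadoop refers to are exemplified by MapReduce (which Disco, above, also implements). A minimal word-count sketch in plain Python, conceptual rather than Hadoop's actual Java API:

```python
from collections import defaultdict

# Conceptual MapReduce word count: map emits (key, value) pairs, the
# framework shuffles (groups) by key, and reduce folds each key's values.
def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["a b a", "b c"])))
# counts == {"a": 2, "b": 2, "c": 1}
```

In a real cluster, the map and reduce phases run in parallel across many nodes and the shuffle moves data between them; the programmer supplies only the two functions.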
Graylog2 is an integrated log capture and analysis solution for operational intelligence. Non Graylog2-authored components include MongoDB for metadata and Elasticsearch for log file storage and text search.
Suro is a data pipeline service for collecting, aggregating, and dispatching large volumes of application events, including log data.
http://techblog.netflix.com/2013/12/announcing-suro-backbone-of-netflixs.html
Manta is an open-source, HTTP-based object store that uses OS containers to allow running arbitrary compute on data at rest (i.e., without copying data out of the object store).
Flink is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. One interesting feature is that it offers "exactly once" message delivery (and handling of out-of-order events).
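The "exactly once" guarantee means each record affects the result once even when messages are redelivered after a failure. A toy illustration of the underlying idea (deduplicating on record identity; this is a conceptual sketch, not Flink's actual checkpoint-based mechanism):

```python
# Conceptual sketch: an idempotent sink that yields effectively-exactly-once
# results by remembering which record IDs it has already applied, so a
# redelivered message (e.g. after an upstream retry) changes nothing.
def process(records, seen=None, sink=None):
    seen = set() if seen is None else seen
    sink = [] if sink is None else sink
    for record_id, value in records:
        if record_id in seen:      # duplicate delivery: skip
            continue
        seen.add(record_id)
        sink.append(value)
    return sink

# Record 1 is delivered twice but applied only once.
out = process([(1, "a"), (2, "b"), (1, "a")])
# out == ["a", "b"]
```

Flink itself achieves this with distributed snapshots of operator state rather than per-record dedup tables, but the observable guarantee is the same.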
Apache NiFi is an easy-to-use, powerful, and reliable system to process and distribute data. NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.
PNDA is an Open source Platform for Network Data Analytics from Cisco.
Streamlio is an enterprise-grade, unified, end-to-end real-time solution. It uses Pulsar, Heron, and BookKeeper (linked below).
Snowplow is an enterprise-strength marketing and product analytics platform. It does three things:
- Identifies your users, and tracks the way they engage with your website or application
- Stores your users' behavioural data in a scalable "event data warehouse" you control: in Amazon S3 and (optionally) Amazon Redshift or Postgres
- Lets you analyze that behavioural data with a wide range of tools, from big data tools (e.g. Spark, via EMR) to more traditional tools (e.g. Looker, Mode, Caravel, Re:dash)
StreamSets is a data operations platform: with it you can efficiently develop batch and streaming dataflows, operate them with full visibility and control, and evolve your architecture over time.
Snap is a powerful open telemetry framework. Easily collect, process, and publish telemetry data at scale.
DataFusion: a modern distributed compute platform implemented in Rust. Code is on GitHub.
See serialization.
Apache Mesos is a cluster manager that simplifies the complexity of running applications on a shared pool of servers.
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
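The "distributed commit log" framing means producers append to an ordered, durable log and each consumer tracks its own read offset, so consumers can replay from any position independently. A single-node, in-memory sketch of that idea (conceptual; not Kafka's API, and ignoring partitions and replication):

```python
# Toy commit log: producers append; each consumer keeps its own offset and
# commits it after reading, so different consumers progress independently.
class CommitLog:
    def __init__(self):
        self.entries = []
        self.offsets = {}          # consumer name -> next offset to read

    def append(self, message):
        self.entries.append(message)
        return len(self.entries) - 1   # offset of the appended message

    def poll(self, consumer):
        start = self.offsets.get(consumer, 0)
        batch = self.entries[start:]
        self.offsets[consumer] = len(self.entries)  # commit new position
        return batch

log = CommitLog()
log.append("a")
log.append("b")
first = log.poll("c1")    # ["a", "b"]
log.append("c")
second = log.poll("c1")   # ["c"], only messages after the committed offset
```

Because the log itself never mutates past entries, rewinding a consumer's offset replays history, which is what makes the commit-log model useful for rebuilding downstream state.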
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Logging_Solutions_Overview
Persisted Queue, fast enough to record everything.
Sawzall is a procedural domain-specific programming language, used by Google to process large numbers of individual log records.
Druid is an open-source analytics data store designed for OLAP queries on timeseries data (trillions of events, petabytes of data). Druid provides cost-effective and always-on real-time data ingestion, arbitrary data exploration, and fast data aggregation.
Aerospike is a distributed, scalable NoSQL database.
NuoDB is a distributed database offering a rich SQL implementation and true ACID transactions. Designed for the modern datacenter, and as a scale-out cloud database, NuoDB is the NewSQL solution you need to simplify application deployment.
Elasticsearch is a distributed, RESTful search and analytics engine.
logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and store them for later use (like, for searching).
Kibana lets you visualize logs and time-stamped data.
Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse solution that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools.
Riemann aggregates events from your servers and applications with a powerful stream processing language.
The Apache Mahout project's goal is to build a scalable machine learning library.
H2O is an in-memory platform for machine learning and predictive analytics on big data.
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
Sensu is often described as a "monitoring router". Essentially, Sensu takes the results of "check" scripts run across many systems and, if certain conditions are met, passes their information to one or more "handlers".
dCache is a system for storing and retrieving huge amounts of data, distributed among a large number of heterogeneous server nodes, under a single virtual filesystem tree with a variety of standard access methods.
Cockroach is a distributed key/value datastore which supports ACID transactional semantics and versioned values as first-class features. The primary design goal is global consistency and survivability, hence the name. Cockroach aims to tolerate disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention. Cockroach nodes are symmetric; a design goal is one binary with minimal configuration and no required auxiliary services.
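"Versioned values as first-class features" means a write creates a new timestamped version rather than overwriting, so reads can be served as of an earlier point in time. A toy sketch of that multi-version idea (greatly simplified; not CockroachDB's actual MVCC implementation, which uses hybrid logical clocks and distributed transactions):

```python
# Toy multi-version key/value store: each put appends a (version, value)
# pair, and get can read either the latest value or the value as of a
# given version.
class VersionedStore:
    def __init__(self):
        self.data = {}    # key -> list of (version, value), oldest first
        self.clock = 0    # monotonically increasing version counter

    def put(self, key, value):
        self.clock += 1
        self.data.setdefault(key, []).append((self.clock, value))
        return self.clock

    def get(self, key, at=None):
        versions = self.data.get(key, [])
        if at is None:
            return versions[-1][1] if versions else None
        for version, value in reversed(versions):
            if version <= at:     # newest version visible at time `at`
                return value
        return None

store = VersionedStore()
v1 = store.put("k", "old")
store.put("k", "new")
latest = store.get("k")            # "new"
historical = store.get("k", at=v1) # "old"
```

Keeping old versions around is what makes consistent snapshot reads cheap: a reader pinned to a version simply ignores anything newer.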
TaskCluster is a set of components that manages task queuing, scheduling, execution and provisioning of resources. It was designed to run automated builds and tests at Mozilla.
LMAX is a platform for processing events (intended for financial trading) at high speed with low latency. Its "disruptor" component is open source.
Charted is a stripped down charting tool for quick & dirty visualizations of publicly accessible data. May be of use for things like public data exported to S3.
Atlas is a telemetry-recording and display tool from Netflix.
AWS Lambda is a compute service that runs your code in response to events and automatically manages the compute resources for you, making it easy to build applications that respond quickly to new information.
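In the Python runtime, "your code" is just a handler function taking an event and a context; Lambda invokes it once per event. A minimal sketch (the `(event, context)` signature is Lambda's standard Python entry point; the S3-style event shape and key names here are illustrative):

```python
# Minimal Lambda-style handler: receives an event (here, a fabricated
# S3 notification) and returns a plain dict as its result.
def lambda_handler(event, context):
    records = event.get("Records", [])
    keys = [r["s3"]["object"]["key"] for r in records if "s3" in r]
    return {"statusCode": 200, "processed": len(keys), "keys": keys}

# Invoked locally with a hand-built event; in production, Lambda supplies
# the event (from S3, Kinesis, API Gateway, etc.) and the context object.
result = lambda_handler(
    {"Records": [{"s3": {"object": {"key": "logs/a.gz"}}}]}, None)
```

The appeal for data pipelines is that scaling and scheduling are implicit: one handler invocation per event, with no servers to manage.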
Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks, such as Spark and MapReduce. It achieves high performance by leveraging lineage information and using memory aggressively. Tachyon caches working set files in memory, thereby avoiding going to disk to load datasets that are frequently read. This enables different jobs/queries and frameworks to access cached files at memory speed.
VoltDB is an in-memory, operational database built to rapidly import, operate on, and then export huge amounts of data at high speed.
Bond is a cross-platform framework for working with schematized data. It supports cross-language de/serialization and powerful generic mechanisms for efficiently manipulating data.
Streamtools is a graphical toolkit for dealing with streams of data. Streamtools makes it easy to explore, analyse, modify and learn from streams of data.
Fenzo is a scheduler Java library for Apache Mesos frameworks that supports plugins for scheduling optimizations and facilitates cluster autoscaling.
Luigi is a Python (2.7, 3.3, 3.4) package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
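The core of what a workflow manager like Luigi does is dependency resolution: run each task only after everything it requires has completed. A conceptual sketch of that ordering (not Luigi's actual `luigi.Task`/`requires()` API; task names are made up):

```python
# Toy dependency resolver: depth-first traversal of a task DAG yields an
# execution order in which every task runs after its prerequisites.
def plan_run(targets, requires):
    done, order = set(), []

    def visit(task):
        if task in done:
            return
        for dep in requires.get(task, []):   # schedule prerequisites first
            visit(dep)
        done.add(task)
        order.append(task)

    for task in targets:
        visit(task)
    return order

# "report" requires "aggregate", which requires "extract".
plan = plan_run(["report"],
                {"report": ["aggregate"], "aggregate": ["extract"]})
# plan == ["extract", "aggregate", "report"]
```

Luigi layers the rest (output targets, idempotent re-runs, failure handling, a visualizer) on top of this basic DAG traversal.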
Kubernetes is an open source Container Cluster orchestration framework that was started by Google in 2014.
Airflow is a platform to programmatically author, schedule and monitor workflows.
DistributedLog (DL) is a high-performance, replicated log service, offering durability, replication and strong consistency as essentials for building reliable distributed systems.
Apache Beam is an open source, unified programming model that you can use to create a data processing pipeline. You start by building a program that defines the pipeline using one of the open source Beam SDKs. The pipeline is then executed by one of Beam's supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.
Beam SQL allows a Beam user (currently only available in Beam Java) to query bounded and unbounded PCollections with SQL statements. Your SQL query is translated to a PTransform, an encapsulated segment of a Beam pipeline. You can freely mix SQL PTransforms and other PTransforms in your pipeline.
AWS Glue is a fully managed ETL service that makes it easy to move data between your data stores. AWS Glue simplifies and automates the difficult and time consuming data discovery, conversion, mapping, and job scheduling tasks.
Pachyderm is a data lake that offers complete version control for data and leverages the container ecosystem to provide reproducible data processing.
BookKeeper is a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads
Apache Pulsar (incubating) provides pub-sub messaging scalable out to millions of topics.
Heron provides topology-based real-time stream processing.
Prometheus is an open-source systems monitoring and alerting toolkit.
Kudu is a Hadoop-compatible storage layer to enable fast analytics on fast data.
Apache Kylin™ is an open source Distributed Analytics Engine designed to provide SQL interface and multi-dimensional analysis (OLAP) on Hadoop/Spark supporting extremely large datasets.
MatrixDS is a data project workbench: GitHub meets Docker for data scientists and data analysts. Designed from the ground up for the data community, it is a place to build, share, and manage data projects at any scale.
Split helps product teams make smarter decisions by enabling them to turn every feature roll-out into an experiment.
Netifi Proteus is the next-generation reactive microservices platform that allows developers to focus on their product by transparently providing API management, routing, service discovery, predictive load balancing, and ultra low latency RPC.
ClickHouse is an open source column-oriented database management system capable of real-time generation of analytical data reports using SQL queries.
MQTT is a lightweight publish/subscribe messaging protocol. It is useful for use with low power sensors, but is applicable to many scenarios.
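MQTT subscriptions use topic filters with two wildcards defined by the spec: `+` matches exactly one topic level and `#` matches all remaining levels (and must come last). A small sketch of that matching rule:

```python
# MQTT topic filter matching: levels are separated by "/";
# "+" matches one level, "#" matches the rest of the topic.
def matches(topic_filter, topic):
    f_parts = topic_filter.split("/")
    t_parts = topic.split("/")
    for i, part in enumerate(f_parts):
        if part == "#":              # multi-level wildcard, always last
            return True
        if i >= len(t_parts):
            return False
        if part != "+" and part != t_parts[i]:
            return False
    return len(f_parts) == len(t_parts)

matches("sensors/+/temp", "sensors/kitchen/temp")   # True
matches("sensors/#", "sensors/kitchen/humidity")    # True
matches("sensors/+/temp", "sensors/kitchen/co2")    # False
```

This level-by-level matching is what keeps the protocol cheap enough for low-power sensors while still supporting flexible routing.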
Roughtime is a project that aims to provide secure time synchronisation.
Great Expectations is a framework that helps teams save time and promote analytic integrity with a new twist on automated testing: pipeline tests. Pipeline tests are applied to data (instead of code) and at batch time (instead of compile or deploy time). Blog post: "Down with pipeline debt"
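A pipeline test, in this sense, is an assertion evaluated against a batch of data when it arrives. A conceptual sketch of the idea (not the Great Expectations API; the expectation names and data are made up for illustration):

```python
# Toy pipeline tests: each expectation is a (name, predicate) pair checked
# against every row in a batch at run time; failures are reported rather
# than silently passed downstream.
def validate(batch, expectations):
    failures = []
    for name, check in expectations:
        bad_rows = [row for row in batch if not check(row)]
        if bad_rows:
            failures.append((name, len(bad_rows)))
    return failures

batch = [{"user_id": 1, "age": 34}, {"user_id": 2, "age": -5}]
report = validate(batch, [
    ("user_id is present", lambda r: r.get("user_id") is not None),
    ("age is non-negative", lambda r: r.get("age", 0) >= 0),
])
# report == [("age is non-negative", 1)]
```

Great Expectations adds a declarative vocabulary of expectations, profiling, and documentation on top of this basic check-the-batch loop.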
Sia is a decentralized storage platform secured by blockchain technology. The Sia Storage Platform leverages underutilized hard drive capacity around the world to create a data storage marketplace that is more reliable and lower cost than traditional cloud storage providers.
Mixpanel: Understand every user's journey. Acquire, engage, and retain with actionable user analytics.
dbt (data build tool) is a productivity tool that helps analysts get more done and produce higher quality results.
Analysts commonly spend 50-80% of their time modeling raw data—cleaning, reshaping, and applying fundamental business logic to it. dbt empowers analysts to do this work better and faster.
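A dbt model is essentially a SQL SELECT statement materialized as a view or table over raw data. The same cleaning-and-reshaping idea can be shown with plain sqlite3 (table and column names here are made up for illustration; dbt adds templating, dependency ordering, and testing around SQL like this):

```python
import sqlite3

# Raw, messy source data as it might land from an application.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "10.50", "complete"), (2, "bad", "complete"), (3, "7.25", "CANCELLED")],
)

# The "model": cast amounts to numbers, normalize status casing, and drop
# rows whose amount does not parse (CAST of non-numeric text yields 0.0).
conn.execute("""
    CREATE VIEW orders AS
    SELECT id, CAST(amount AS REAL) AS amount, LOWER(status) AS status
    FROM raw_orders
    WHERE CAST(amount AS REAL) > 0
""")
rows = conn.execute("SELECT id, amount, status FROM orders ORDER BY id").fetchall()
# rows == [(1, 10.5, "complete"), (3, 7.25, "cancelled")]
```

Keeping the transformation in SQL, versioned like code, is the core of the dbt workflow.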
InfluxDB is used as a data store for any use case involving large amounts of timestamped data, including DevOps monitoring, application metrics, IoT sensor data, and real-time analytics. Conserve space on your machine by configuring InfluxDB to keep data for a defined length of time, automatically expiring & deleting any unwanted data from the system. InfluxDB also offers a SQL-like query language for interacting with data.
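The retention behaviour described above amounts to: points whose timestamps fall outside a configured window are expired automatically. A simplified sketch of that pruning (InfluxDB applies this per database via named retention policies; this is just the concept):

```python
# Toy retention enforcement: keep only points newer than the cutoff,
# where timestamps are seconds since an arbitrary epoch.
def enforce_retention(points, now, retention_seconds):
    cutoff = now - retention_seconds
    return [(ts, value) for ts, value in points if ts >= cutoff]

points = [(100, 0.5), (900, 0.7), (950, 0.9)]
kept = enforce_retention(points, now=1000, retention_seconds=200)
# kept == [(900, 0.7), (950, 0.9)]; the point at t=100 has expired
```

Automatic expiry is what keeps a high-frequency metrics workload's disk usage bounded without any manual cleanup.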
StreamX is a Kafka Connect-based connector to copy data from Kafka to object stores like Amazon S3, Google Cloud Storage, and Azure Blob Store. It focuses on reliable and scalable data copying, and can write the data out in different formats (like Parquet, so that it can readily be used by analytical tools) and with different partitioning schemes.
Databricks Delta is a unified data management system to simplify large-scale data management.
Apache Arrow is a cross-language development platform for in-memory data.
TrailDB is an efficient tool for storing and querying series of events.
Pinot is a realtime distributed OLAP datastore, which is used at LinkedIn to deliver scalable real time analytics with low latency.
Intake is a data access layer and cataloging system.
Apache Crunch is a framework for writing, testing and running MapReduce pipelines.
Spline - Data Lineage Tracking and Visualization tool for Apache Spark. The project consists of two parts:
- A core library that sits on drivers, capturing data lineages from the jobs being executed by analyzing Spark execution plans
- A Web UI application that visualizes the stored data lineages.
Apache Gobblin is a distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.
Dask provides scalable analytics in Python.
Prefect is a workflow management system.
M3 - The fully open source metrics platform built on M3DB, a distributed timeseries database
QuestDB is an open-source SQL time-series database for storing, streaming, and querying data quickly.
Mortar: Wide-Scale Stream Processing. Mortar is a lightweight distributed stream processing engine for federated systems.
Borealis: Distributed Stream Processing Engine. Borealis is a distributed stream processing engine that is being developed at Brandeis University, Brown University, and MIT.
MillWheel: Fault-Tolerant Stream Processing at Internet Scale
Megastore: Providing Scalable, Highly Available Storage for Interactive Services
S4 is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.
Ruled out because: dead project.
Scribe is a server for aggregating log data streamed in real time from a large number of servers.
Ruled out because: dead project.
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Logging_Solutions_Overview
- Kinesis
- Amazon Kinesis is a fully managed service for real-time processing of streaming data at massive scale.
- Ruled out because: payload size too small, data retention too short.
- Velocidata
- Hardware enhanced data transformation claiming to be 100 to 1000x faster than conventional ETL processes. Essentially, VelociData gives you a purpose-built supercomputer in a 4U form factor at a monthly or annual cost that radically transforms the economics of Big Data, analytics and BI.
- Sisense
- End to end BI product highly optimized for x64 architectures enabling a single commodity server to deliver the same data processing power as a (Hadoop) cluster made up of 10 individual nodes.
- Splunk
- Platform for Operational Intelligence. Customers use Splunk to search, monitor, analyze and visualize machine data. By monitoring and analyzing everything from customer clickstreams and transactions to network activity and call records, Splunk Enterprise turns your machine data into valuable insights.
- Google Cloud Dataflow
- Not available yet
- Azure Stream Analytics
- Part of Microsoft’s Azure cloud platform.
- Cirro
- Cirro’s Next Generation Data Federation allows organizations to ask questions and harvest value previously unavailable in disconnected corporate data silos. Cirro can spread a query across multiple data sources and use the local processing capability located where each of the sources resides to help resolve the query.
- Alpine Data Labs
- Basically BI for the non-developer: “Alpine Chorus 5.0 is the world's only advanced analytics platform that is 100% visual, web-based, collaborative and optimized for Big Data.”
- logentries
- Log management built for the cloud: fast search and real-time log processing, with centralized search, aggregation, and correlation. Processes log data in any format, connects with any system, and shares results with any team.
- papertrail
- Frustration-free log management.
- Jethro
- A SQL-on-Hadoop engine, Jethro acts as a BI-on-Hadoop acceleration layer that speeds up big data query performance for BI tools like Tableau, Qlik, and MicroStrategy from any data source like Hadoop or Amazon S3.
- Periscope announced their Unified Data Platform in 2017.
- Sentry Open-source error tracking that helps developers monitor and fix crashes in real time. Iterate continuously. Boost efficiency. Improve user experience.
- Alteryx A platform for end-to-end automation of analytics, machine learning, and data science processes
- http://jasonwilder.com/blog/2013/11/19/fluentd-vs-logstash/
- http://highscalability.com/blog/2014/8/18/1-aerospike-server-x-1-amazon-ec2-instance-1-million-tps-for.html
- https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Logging_Solutions_Overview
- Databricks https://www.youtube.com/watch?v=dJQ5lV5Tldw
- http://sqrrl.com/
- http://www.splicemachine.com/
- http://www.platfora.com/
- http://www.crn.com/slide-shows/applications-os/240164423/the-10-coolest-big-data-products-of-2013.htm?itc=nextpage
- http://www.knime.org/
- http://www.zdatainc.com/2014/09/apache-storm-apache-spark/
- http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html
- The Big Data Information Architecture
- http://highlyscalable.wordpress.com/2013/08/20/in-stream-big-data-processing/
- Storm vs Spark
- http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
- https://aws.amazon.com/blogs/aws/dynamodb-streams-preview/
- http://blog.confluent.io/2015/03/04/turning-the-database-inside-out-with-apache-samza/
- Prometheus vs. other time-series databases
- Workflow Managers