
Existing options for building a data pipeline

This document began in 2013 as a survey of the landscape of "Big Data" offerings. I've added to and updated it over the years, but it is by no means complete (and may contain inaccuracies). Pull requests are welcome for corrections and additions!

Systems and Frameworks

Spark

Apache Spark is a fast and general engine for large-scale data processing.

Storm

Apache Storm is a free and open source distributed realtime computation system.

Hindsight

Hindsight is an open source stream processing software system developed by Mozilla. It was implemented as a successor to Heka.

CloudWatch

Amazon CloudWatch is a monitoring service for AWS cloud resources and the applications you run on AWS.

Fluentd

Fluentd is an open source data collector designed for processing high-volume data streams.

ELK Stack (ElasticSearch Logstash Kibana)

By combining the massively popular Elasticsearch, Logstash and Kibana we have created an end-to-end stack that delivers actionable insights in real-time from almost any type of structured and unstructured data source.

Flume

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

Disco

Disco is an implementation of MapReduce for distributed computing.

Inferno (DDFS - Disco Distributed Filesystem)

Inferno is an open-source python map/reduce library, powered by Disco.

Samza

Apache Samza is a distributed stream processing framework.

Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

Hadoop is kind of a container/base for many other projects, such as HBase, Hive, Spark (see above), Mahout, Pig, Tez, ...
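The MapReduce programming model at the heart of Hadoop (and Disco, above) can be sketched at toy scale in plain Python. This sketch shows only the model (map, shuffle, reduce), not Hadoop's distributed execution, and all names in it are illustrative:

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to each record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def reduce_phase(pairs, reducer):
    """Group values by key (the 'shuffle'), then reduce each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count: the mapper emits (word, 1), the reducer sums counts.
def tokenize(line):
    for word in line.split():
        yield word.lower(), 1

counts = reduce_phase(map_phase(["the quick fox", "the lazy dog"], tokenize),
                      lambda word, ones: sum(ones))
# counts["the"] == 2
```

The same mapper/reducer pair, unchanged, is what a MapReduce framework would distribute across a cluster, with the framework handling the shuffle between phases.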

Graylog

Graylog2 is an integrated log capture and analysis solution for operational intelligence. Components not authored by Graylog2 include MongoDB for metadata and Elasticsearch for log file storage and text search.

Suro

Suro is a data pipeline service for collecting, aggregating, and dispatching large volumes of application events, including log data.

http://techblog.netflix.com/2013/12/announcing-suro-backbone-of-netflixs.html

Manta

Manta is an open-source, HTTP-based object store that uses OS containers to allow running arbitrary compute on data at rest (i.e., without copying data out of the object store). General overview here.

Flink

Flink is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. One interesting feature is that it offers the option of “exactly once” message delivery (and handles out-of-order events).

NiFi

Apache NiFi is an easy-to-use, powerful, and reliable system to process and distribute data. NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.

Panda / PNDA

PNDA is an Open source Platform for Network Data Analytics from Cisco.

Streamlio

Streamlio is an enterprise-grade, unified, end-to-end real-time solution. It uses Pulsar, Heron, and BookKeeper (linked below).

Snowplow

Snowplow is an enterprise-strength marketing and product analytics platform. It does three things:

  1. Identifies your users, and tracks the way they engage with your website or application
  2. Stores your users' behavioural data in a scalable "event data warehouse" you control: in Amazon S3 and (optionally) Amazon Redshift or Postgres
  3. Lets you analyze that behavioural data with a wide range of tools, from big data tools (e.g. Spark via EMR) to more traditional tools (e.g. Looker, Mode, Caravel, Re:dash)

StreamSets

StreamSets is the industry’s first data operations platform. With the StreamSets data operations platform you can efficiently develop batch and streaming dataflows, operate them with full visibility and control, and easily evolve your architecture over time.

Snap

Snap is a powerful open telemetry framework. Easily collect, process, and publish telemetry data at scale.

DataFusion

DataFusion is a modern distributed compute platform implemented in Rust. Code is on GitHub.

Components

Serialization Formats

See serialization.

Mesos

Apache Mesos is a cluster manager that simplifies the complexity of running applications on a shared pool of servers.

Kafka

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Logging_Solutions_Overview
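Kafka's "distributed commit log" framing, in which producers append to an ordered immutable log while each consumer tracks its own read offset, can be sketched in a few lines (single-partition, in-memory, and purely illustrative; this is not the Kafka client API):

```python
class CommitLog:
    """Toy single-partition log: an append-only list of messages."""
    def __init__(self):
        self.messages = []

    def append(self, message):
        self.messages.append(message)
        return len(self.messages) - 1  # offset of the new message

    def read(self, offset):
        """Return messages from `offset` onward; the log is never mutated."""
        return self.messages[offset:]

class Consumer:
    """Each consumer keeps its own offset, so consumers progress independently."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        batch = self.log.read(self.offset)
        self.offset += len(batch)
        return batch

log = CommitLog()
log.append("event-1")
log.append("event-2")
slow, fast = Consumer(log), Consumer(log)
fast.poll()             # returns ["event-1", "event-2"]
log.append("event-3")
# slow still sees everything from offset 0; fast will only see the new message.
```

Because consumption is just "read from my offset", the same log can feed many independent consumers, and replaying history is simply resetting an offset.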

Chronicle Queue

Chronicle Queue is a persisted queue, fast enough to record everything.

Sawzall

Sawzall is a procedural domain-specific programming language, used by Google to process large numbers of individual log records.

Druid

Druid is an open-source analytics data store designed for OLAP queries on timeseries data (trillions of events, petabytes of data). Druid provides cost-effective and always-on real-time data ingestion, arbitrary data exploration, and fast data aggregation.

Aerospike

Aerospike is a distributed, scalable NoSQL database.

http://highscalability.com/blog/2014/8/18/1-aerospike-server-x-1-amazon-ec2-instance-1-million-tps-for.html

NuoDB

NuoDB is a distributed database offering a rich SQL implementation and true ACID transactions. Designed for the modern datacenter, and as a scale-out cloud database, NuoDB is the NewSQL solution you need to simplify application deployment.

ElasticSearch

ElasticSearch is a distributed, RESTful search and analytics engine.

Logstash

logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and store them for later use (like, for searching).

Kibana

Kibana lets you visualize logs and time-stamped data.

Redshift

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse solution that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools.

Riemann

Riemann aggregates events from your servers and applications with a powerful stream processing language.

Mahout

The Apache Mahout project's goal is to build a scalable machine learning library.

H2O

H2O is the world’s fastest in-memory platform for machine learning and predictive analytics on big data.

Presto

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

Sensu

Sensu is often described as the “monitoring router”. Essentially, Sensu takes the results of “check” scripts run across many systems and, if certain conditions are met, passes their information to one or more “handlers”.
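That routing behaviour can be sketched as a tiny dispatcher (a hypothetical structure; real Sensu checks emit JSON results and handlers are configured declaratively):

```python
def route(check_result, handlers):
    """Send a check result to every handler whose condition matches."""
    fired = []
    for name, (condition, handler) in handlers.items():
        if condition(check_result):
            handler(check_result)
            fired.append(name)
    return fired

alerts = []
handlers = {
    # Only page on critical status (0 = OK, 1 = warning, 2 = critical).
    "pager": (lambda r: r["status"] >= 2, alerts.append),
    "log":   (lambda r: True, lambda r: None),  # log everything
}
route({"check": "disk", "status": 2}, handlers)   # fires both handlers
route({"check": "disk", "status": 0}, handlers)   # fires only 'log'
```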

dCache

dCache is a system for storing and retrieving huge amounts of data, distributed among a large number of heterogeneous server nodes, under a single virtual filesystem tree with a variety of standard access methods.

Cockroach

Cockroach is a distributed key/value datastore which supports ACID transactional semantics and versioned values as first-class features. The primary design goal is global consistency and survivability, hence the name. Cockroach aims to tolerate disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention. Cockroach nodes are symmetric; a design goal is one binary with minimal configuration and no required auxiliary services.

TaskCluster

TaskCluster is a set of components that manages task queuing, scheduling, execution, and provisioning of resources. It was designed to run automated builds and tests at Mozilla.

LMAX

LMAX is a platform aimed at processing events (it’s intended for financial trading) at high speed with low latency. The “disruptor” is open source.

Charted

Charted is a stripped down charting tool for quick & dirty visualizations of publicly accessible data. May be of use for things like public data exported to S3.

Atlas

Atlas is a telemetry-recording and display tool from Netflix.

AWS Lambda

AWS Lambda is a compute service that runs your code in response to events and automatically manages the compute resources for you, making it easy to build applications that respond quickly to new information.[1]

Tachyon

Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks, such as Spark and MapReduce. It achieves high performance by leveraging lineage information and using memory aggressively. Tachyon caches working set files in memory, thereby avoiding going to disk to load datasets that are frequently read. This enables different jobs/queries and frameworks to access cached files at memory speed.

VoltDB

VoltDB is an in-memory, operational database built to rapidly import, operate on, and then export HUGE amounts of data at lightning speed.

Microsoft Bond

Bond is a cross-platform framework for working with schematized data. It supports cross-language de/serialization and powerful generic mechanisms for efficiently manipulating data.

Streamtools

Streamtools is a graphical toolkit for dealing with streams of data. Streamtools makes it easy to explore, analyse, modify and learn from streams of data.

Fenzo

Fenzo is a scheduler Java library for Apache Mesos frameworks that supports plugins for scheduling optimizations and facilitates cluster autoscaling.

Luigi

Luigi is a Python (2.7, 3.3, 3.4) package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
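Luigi's core dependency-resolution idea, running each task only after the tasks it requires have completed, can be sketched without the library (real Luigi tasks are `luigi.Task` subclasses declaring `requires()` and `output()`; this toy version omits cycle detection and completed-output checking):

```python
def run_pipeline(tasks, requires):
    """Topologically order tasks so each runs after its dependencies."""
    done, order = set(), []

    def run(task):
        if task in done:
            return
        for dep in requires.get(task, []):
            run(dep)          # recurse into dependencies first
        order.append(task)
        done.add(task)

    for task in tasks:
        run(task)
    return order

# "report" needs "aggregate", which needs the two extract tasks.
requires = {"report": ["aggregate"], "aggregate": ["extract_a", "extract_b"]}
run_pipeline(["report"], requires)
# → ["extract_a", "extract_b", "aggregate", "report"]
```

The `done` set is what makes reruns cheap: tasks already satisfied are skipped, which is the same idea Luigi applies by checking whether a task's output already exists.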

Kubernetes

Kubernetes is an open source Container Cluster orchestration framework that was started by Google in 2014.

Airflow

Airflow is a platform to programmatically author, schedule and monitor workflows.

DistributedLog

DistributedLog (DL) is a high-performance, replicated log service, offering durability, replication and strong consistency as essentials for building reliable distributed systems.

Apache Beam

Apache Beam is an open source, unified programming model that you can use to create a data processing pipeline. You start by building a program that defines the pipeline using one of the open source Beam SDKs. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.

Beam SQL

Beam SQL allows a Beam user (currently only available in Beam Java) to query bounded and unbounded PCollections with SQL statements. Your SQL query is translated to a PTransform, an encapsulated segment of a Beam pipeline. You can freely mix SQL PTransforms and other PTransforms in your pipeline.

AWS Glue

AWS Glue is a fully managed ETL service that makes it easy to move data between your data stores. AWS Glue simplifies and automates the difficult and time consuming data discovery, conversion, mapping, and job scheduling tasks.

Pachyderm

Pachyderm is a data lake that offers complete version control for data and leverages the container ecosystem to provide reproducible data processing.

Apache Bookkeeper

BookKeeper is a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads.

Apache Pulsar

Apache Pulsar (incubating) provides pub-sub messaging scalable out to millions of topics.

Heron

Heron provides topology-based real-time stream processing.

Prometheus

Prometheus is an open-source systems monitoring and alerting toolkit.

Kudu

Kudu is a Hadoop-compatible storage layer that enables fast analytics on fast data.

Kylin

Apache Kylin™ is an open source Distributed Analytics Engine designed to provide SQL interface and multi-dimensional analysis (OLAP) on Hadoop/Spark supporting extremely large datasets.

MatrixDS

MatrixDS is the data project workbench. MatrixDS is GitHub meets Docker for data scientists and data analysts. Designed from the ground up for the data community, it is a place to build, share, and manage data projects at any scale.

Split.io

Split helps product teams make smarter decisions by enabling them to turn every feature roll-out into an experiment.

Proteus

Netifi Proteus is the next-generation reactive microservices platform that allows developers to focus on their product by transparently providing API management, routing, service discovery, predictive load balancing, and ultra low latency RPC.

ClickHouse

ClickHouse is an open source column-oriented database management system capable of real time generation of analytical data reports using SQL queries.

MQTT

MQTT is a lightweight publish/subscribe messaging protocol. It is useful for use with low power sensors, but is applicable to many scenarios.
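MQTT topics are hierarchical, and subscriptions may use the `+` (single-level) and `#` (multi-level) wildcards. The matching rule can be sketched as follows (simplified; the spec has additional rules, e.g. for topics beginning with `$`):

```python
def topic_matches(pattern, topic):
    """Match an MQTT topic against a subscription pattern.
    '+' matches exactly one level, '#' matches all remaining levels."""
    p_levels = pattern.split("/")
    t_levels = topic.split("/")
    for i, p in enumerate(p_levels):
        if p == "#":
            return True                      # matches everything below here
        if i >= len(t_levels):
            return False                     # pattern is deeper than topic
        if p != "+" and p != t_levels[i]:
            return False                     # literal level mismatch
    return len(p_levels) == len(t_levels)

topic_matches("sensors/+/temperature", "sensors/kitchen/temperature")  # True
topic_matches("sensors/#", "sensors/kitchen/humidity")                 # True
topic_matches("sensors/+/temperature", "sensors/kitchen/humidity")     # False
```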

Roughtime

Roughtime is a project that aims to provide secure time synchronisation.

Great Expectations

Great Expectations is a framework that helps teams save time and promote analytic integrity with a new twist on automated testing: pipeline tests. Pipeline tests are applied to data (instead of code) and at batch time (instead of compile or deploy time). Blog post: "Down with pipeline debt"
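The pipeline-test idea, asserting properties of a data batch before it flows downstream, can be sketched without the library (the names here are hypothetical; Great Expectations expresses such checks as `expect_*` methods on a dataset):

```python
def validate_batch(rows, expectations):
    """Run each expectation over the batch; collect failures instead of raising."""
    failures = []
    for name, check in expectations:
        if not all(check(row) for row in rows):
            failures.append(name)
    return failures

batch = [{"user_id": 1, "age": 34}, {"user_id": 2, "age": -5}]
expectations = [
    ("user_id_not_null", lambda r: r["user_id"] is not None),
    ("age_in_range",     lambda r: 0 <= r["age"] <= 120),
]
validate_batch(batch, expectations)   # → ["age_in_range"]
```

The key difference from ordinary unit tests is when this runs: at batch time, against each new slice of data, so a schema drift or bad upstream export fails loudly instead of silently corrupting downstream tables.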

Sia

Sia is a decentralized storage platform secured by blockchain technology. The Sia Storage Platform leverages underutilized hard drive capacity around the world to create a data storage marketplace that is more reliable and lower cost than traditional cloud storage providers.

Mixpanel

Mixpanel: Understand every user's journey. Acquire, engage, and retain with actionable user analytics.

DBT

dbt (data build tool) is a productivity tool that helps analysts get more done and produce higher quality results.

Analysts commonly spend 50-80% of their time modeling raw data—cleaning, reshaping, and applying fundamental business logic to it. dbt empowers analysts to do this work better and faster.

InfluxDB

InfluxDB is used as a data store for any use case involving large amounts of timestamped data, including DevOps monitoring, application metrics, IoT sensor data, and real-time analytics. Conserve space on your machine by configuring InfluxDB to keep data for a defined length of time, automatically expiring & deleting any unwanted data from the system. InfluxDB also offers a SQL-like query language for interacting with data.
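The retention behaviour described above, automatically expiring points older than a cutoff, amounts to the following (an illustrative sketch, not InfluxDB's implementation):

```python
def enforce_retention(points, now, retention_seconds):
    """Keep only points whose timestamp falls inside the retention window."""
    cutoff = now - retention_seconds
    return [p for p in points if p["time"] >= cutoff]

points = [
    {"time": 100, "cpu": 0.31},
    {"time": 900, "cpu": 0.57},
]
enforce_retention(points, now=1000, retention_seconds=300)
# keeps only the point at time 900 (the cutoff is 700)
```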

StreamX

StreamX is a Kafka Connect-based connector to copy data from Kafka to object stores like Amazon S3, Google Cloud Storage, and Azure Blob Store. It focuses on reliable and scalable data copying. It can write the data out in different formats (like Parquet, so that it can readily be used by analytical tools) and can satisfy different partitioning requirements.

Databricks Delta

Databricks Delta is a unified data management system to simplify large-scale data management.

Apache Arrow

Apache Arrow is a cross-language development platform for in-memory data.

TrailDB

TrailDB is an efficient tool for storing and querying series of events.

Pinot

Pinot is a realtime distributed OLAP datastore, which is used at LinkedIn to deliver scalable real time analytics with low latency.

Intake

Intake is a data access layer and cataloging system.

Crunch

Apache Crunch is a framework for writing, testing and running MapReduce pipelines.

Spline

Spline is a data lineage tracking and visualization tool for Apache Spark. The project consists of two parts:

  • A core library that sits on drivers, capturing data lineages from the jobs being executed by analyzing Spark execution plans
  • A Web UI application that visualizes the stored data lineages.

Gobblin

Apache Gobblin is a distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.

Dask

Dask provides scalable analytics in Python.

Prefect

Prefect is a workflow management system.

M3DB

M3 is a fully open-source metrics platform built on M3DB, a distributed timeseries database.

QuestDB

QuestDB is an open-source SQL time-series database for storing, streaming, and querying data quickly.

Ideas and Research

Mortar

Mortar: Wide-Scale Stream Processing. Mortar is a light-weight distributed stream processing engine for federated systems.

Borealis

Borealis: Distributed Stream Processing Engine. Borealis is a distributed stream processing engine that is being developed at Brandeis University, Brown University, and MIT.

MillWheel

MillWheel: Fault-Tolerant Stream Processing at Internet Scale

Megastore

Megastore: Providing Scalable, Highly Available Storage for Interactive Services

Spanner

Spanner

Ruled out

S4

S4 is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.

Ruled out because: dead project.

Scribe

Scribe is a server for aggregating log data streamed in real time from a large number of servers.

Ruled out because: dead project.


Other Commercial Offerings

  • Kinesis
    • Amazon Kinesis is a fully managed service for real-time processing of streaming data at massive scale.
    • Ruled out because: payload size too small, data retention too short.
  • Velocidata
    • Hardware enhanced data transformation claiming to be 100 to 1000x faster than conventional ETL processes. Essentially, VelociData gives you a purpose-built supercomputer in a 4U form factor at a monthly or annual cost that radically transforms the economics of Big Data, analytics and BI.
  • Sisense
    • End to end BI product highly optimized for x64 architectures enabling a single commodity server to deliver the same data processing power as a (Hadoop) cluster made up of 10 individual nodes.
  • Splunk
    • Platform for Operational Intelligence. Customers use Splunk to search, monitor, analyze and visualize machine data. By monitoring and analyzing everything from customer clickstreams and transactions to network activity and call records, Splunk Enterprise turns your machine data into valuable insights.
  • Google Cloud Dataflow
    • Not available yet
  • Azure Stream Analytics
    • Part of Microsoft’s Azure cloud platform.
  • Cirro
    • Cirro’s Next Generation Data Federation allows organizations to ask questions and harvest value previously unavailable in disconnected corporate data silos. Cirro can spread a query across multiple data sources and use the local processing capability located where each of the sources resides to help resolve the query.
  • Alpine Data Labs
    • Basically BI for the non-developer: “Alpine Chorus 5.0 is the world's only advanced analytics platform that is 100% visual, web-based, collaborative and optimized for Big Data.”
  • logentries
    • Log Management Built for the Cloud. Fast Search & Real-time Log Processing. Centralized search, aggregation, and correlation. See query results in seconds. Connect Distributed Systems, Apps & Users. Process log data in any format, connect with any system, and share with any team. Smarter Analytics, Quicker Insights
  • papertrail
    • Frustration-free log management.
  • Jethro
    • A SQL-on-Hadoop engine, Jethro acts as a BI-on-Hadoop acceleration layer that speeds up big data query performance for BI tools like Tableau, Qlik and Microstrategy from any data source like Hadoop or Amazon S3.
  • Periscope announced their Unified Data Platform in 2017.
  • Sentry: Open-source error tracking that helps developers monitor and fix crashes in real time. Iterate continuously. Boost efficiency. Improve user experience.
  • Alteryx: A platform for end-to-end automation of analytics, machine learning, and data science processes.

Bibliography

  1. http://jasonwilder.com/blog/2013/11/19/fluentd-vs-logstash/
  2. http://highscalability.com/blog/2014/8/18/1-aerospike-server-x-1-amazon-ec2-instance-1-million-tps-for.html
  3. https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Logging_Solutions_Overview
  4. Databricks https://www.youtube.com/watch?v=dJQ5lV5Tldw
  5. http://sqrrl.com/
  6. http://www.splicemachine.com/
  7. http://www.platfora.com/
  8. http://www.crn.com/slide-shows/applications-os/240164423/the-10-coolest-big-data-products-of-2013.htm?itc=nextpage
  9. http://www.knime.org/
  10. http://www.zdatainc.com/2014/09/apache-storm-apache-spark/
  11. http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html
  12. The Big Data Information Architecture
  13. http://highlyscalable.wordpress.com/2013/08/20/in-stream-big-data-processing/
  14. Storm vs Spark
  15. http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
  16. https://aws.amazon.com/blogs/aws/dynamodb-streams-preview/
  17. http://blog.confluent.io/2015/03/04/turning-the-database-inside-out-with-apache-samza/
  18. Prometheus vs. other time-series databases
  19. Workflow Managers