dharmeshkakadia/Data-Infra-Projects

Data-Infra-Projects

This is an attempt to list interesting data infrastructure projects.

It is intended for anyone who is designing modern large-scale architectures and needs to choose tools, technologies, or frameworks. The purpose is to help in making those choices with resources such as comparisons, use cases, features, maturity, or really anything that helps in making an informed decision.

Abstractions

Distributed Coordination

These are implementations/libraries that help in writing distributed applications which require some form of coordination.

Infrastructure Management

Comparisons

File Systems

Distributed Databases

Infrastructure Logging/Monitoring

Infrastructure Helpers

MultiCloud/CrossCloud utilities

Virtualization

Virtualization++

Generalized Data Processing

Comparisons

  • Tez vs Dryad
  • Hadoop vs Spark - Too many differences, no good link.

Large-scale Distributed ML

pub-sub / messaging

Data Ingest

Data change management

Graph Storing and/or Processing

SQL Engines

Stream Processing

Security

Performance Analysis

Workflow engines/DAG-executors/Pipelines

Comparisons

Configuration Management

Service Discovery

Comparison

Testing

Visualization

Libraries

  • Zoie
  • Norbert - cluster manager and networking layer built on top of Zookeeper.
  • Okapi - Large-scale ML & graph analytics on Giraph
  • Scalding - A Scala API for Cascading
  • SummingBird - Streaming MapReduce with Scalding and Storm
  • Curator - set of Java libraries that make using Apache ZooKeeper much easier
  • Turbine - Low latency high throughput aggregator for real time streams
  • DataFu - Collection of MapReduce libraries
  • Twill (previously known as Weave) - library for writing YARN applications

Search

Others

  • Nutch - web crawler
  • Ambari - Hadoop Deployment + Management
  • Bigtop - Hadoop Packaging
  • Skuld
  • Camus - LinkedIn's Kafka to HDFS pipeline.
  • Kiji - collect, analyze and serve data in real time on Apache Hadoop and HBase
