Awesome Hadoop
A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources. Inspired by Awesome PHP, Awesome Python and Awesome Sysadmin.
Hadoop
- Apache Hadoop
- Apache Tez
- SpatialHadoop - SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data.
- GIS Tools for Hadoop - Big Data Spatial Analytics for the Hadoop Framework
- Elasticsearch Hadoop - Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.
- dumbo - Python module that allows you to easily write and run Hadoop programs.
- hadoopy - Python MapReduce library written in Cython.
- mrjob - mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.
- pydoop - Pydoop is a package that provides a Python API for Hadoop.
- hdfs-du - HDFS-DU is an interactive visualization of the Hadoop distributed file system.
- White Elephant - Hadoop log aggregator and dashboard
- Kiji Project
- Genie - Genie provides REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.
- Kylin - Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
- Crunch - Go-based toolkit for ETL and feature extraction on Hadoop
YARN
- Apache Slider - Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.
- Apache Twill - Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.
- mpich2-yarn - Running MPICH2 on YARN
NoSQL
Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable.
- Apache HBase
- Apache Phoenix - A SQL skin over HBase
- happybase - A developer-friendly Python library to interact with Apache HBase.
- Hannibal - Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.
- Haeinsa - Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase
- hindex - Secondary Index for HBase
- Apache Accumulo - The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
- OpenTSDB - The Scalable Time Series Database
- Apache Cassandra
SQL on Hadoop
- Apache Hive
- Hive Plugins
- UDF
- http://nexr.github.io/hive-udf/
- https://github.com/edwardcapriolo/hive_cassandra_udfs
- https://github.com/livingsocial/HiveSwarm
- https://github.com/ThinkBigAnalytics/Hive-Extensions-from-Think-Big-Analytics
- https://github.com/karthkk/udfs
- https://github.com/kevinweil/elephant-bird - Twitter
- https://github.com/lovelysystems/ls-hive
- https://github.com/stewi2/hive-udfs
- https://github.com/klout/brickhouse
- https://github.com/markgrover/hive-translate (PostgreSQL translate())
- https://github.com/deanwampler/HiveUDFs
- https://github.com/myui/hivemall (Machine Learning UDF/UDAF/UDTF)
- https://github.com/edwardcapriolo/hive-geoip (GeoIP UDF)
- Storage Handler
- https://github.com/dvasilen/Hive-Cassandra
- https://github.com/yc-huang/Hive-mongo
- https://github.com/balshor/gdata-storagehandler
- https://github.com/karthkk/hive-hbase-json
- https://github.com/sunsuk7tp/hive-hbase-integration
- https://bitbucket.org/rodrigopr/redisstoragehandler
- https://github.com/zhuguangbin/HiveJDBCStorageHanlder
- https://github.com/chimpler/hive-solr
- https://github.com/bfemiano/accumulo-hive-storage-manager
- SerDe
- Libraries and tools
- https://github.com/forward/rbhive
- https://github.com/synctree/activerecord-hive-adapter
- https://github.com/hrp/sequel-hive-adapter
- https://github.com/forward/node-hive
- https://github.com/recruitcojp/WebHive
- shib - WebUI for query engines: Hive and Presto
- clive - Clojure library for interacting with Hive via Thrift
- http://www.phphiveadmin.net/
- https://github.com/anjuke/hwi
- https://code.google.com/a/apache-extras.org/p/hipy/
- https://github.com/dmorel/Thrift-API-HiveClient2 (Perl - HiveServer2)
- PyHive - Python interface to Hive and Presto
- https://github.com/recruitcojp/OdbcHive
- Hive-Sharp
- HiveRunner - An Open Source unit test framework for hadoop hive queries based on JUnit4
- Beetest - A super simple utility for testing Apache Hive scripts locally for non-Java developers.
- Hive_test - Unit test framework for hive and hive-service
- UDF
- Cloudera Impala
- Presto
- Apache Tajo
- Apache Drill
Workflow, Lifecycle and Governance
- Apache Oozie
- Azkaban
- Apache Falcon - Data management and processing platform
Data Ingestion and Integration
- Apache Flume
- Flume Plugins
- Flume MongoDB Sink
- Flume HornetQ Channel
- Flume MessagePack Source
- Flume RabbitMQ source and sink
- Flume UDP Source
- Stratio Ingestion - Custom sinks: Cassandra, MongoDB, Stratio Streaming and JDBC
- Flume Custom Serializers
- Real-time analytics in Apache Flume
- .Net FlumeNG Clients
- Suro - Netflix's distributed Data Pipeline
- Apache Sqoop
- Apache Kafka
DSL
- Apache Pig
- Apache DataFu - A collection of libraries for working with large-scale data in Hadoop
- varaha - Machine learning and natural language processing with Apache Pig
- packetpig - Open Source Big Data Security Analytics
- akela - Mozilla's utility library for Hadoop, HBase, Pig, etc.
- seqpig - Simple and scalable scripting for large sequencing data sets (e.g., bioinformatics) in Hadoop
- Lipstick - Pig workflow visualization tool. Introducing Lipstick on A(pache) Pig
- PigPen - PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.
Libraries and Tools
- Kite Software Development Kit - A set of libraries, tools, examples, and documentation
- gohadoop - Native Go clients for Apache Hadoop YARN.
- Hue - A Web interface for analyzing data with Apache Hadoop.
- Zeppelin
- Jumbune - Jumbune is an open-source product built for analyzing Hadoop clusters and MapReduce jobs.
- Apache Thrift
- Apache Avro - Apache Avro is a data serialization system.
- Elephant Bird - Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.
- Spring for Apache Hadoop
Realtime Data Processing
Distributed Computing and Programming
- Apache Spark
- Apache Crunch
- Cascading - Cascading is the proven application development platform for building data applications on Hadoop.
- Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing.
Packaging, Provisioning and Monitoring
- Apache Bigtop - Packaging and tests of the Apache Hadoop ecosystem
- Apache Ambari
- Ganglia Monitoring System
- ankush - A big data cluster management tool that creates and manages clusters of different technologies.
- Apache ZooKeeper
- Apache Curator - ZooKeeper client wrapper and rich ZooKeeper framework
- Buildoop - Hadoop Ecosystem Builder
- Deploop - The Hadoop Deploy System
Search
- Elasticsearch
- Apache Solr
- SenseiDB - Open-source, distributed, realtime, semi-structured database
Benchmark
- Big Data Benchmark
- HiBench
- Big-Bench
- hive-benchmarks
- hive-testbench - Testbench for experimenting with Apache Hive at any data scale.
Machine learning and Big Data analytics
- Apache Mahout
- Cloudera Oryx - The Oryx open source project provides simple, real-time large-scale machine learning / predictive analytics infrastructure.
- MLlib - MLlib is Apache Spark's scalable machine learning library.
- R - R is a free software environment for statistical computing and graphics.
- RHive - RHive is an R extension facilitating distributed computing via Apache Hive.
- RHadoop
Misc.
Resources
Various resources, such as books, websites and articles.
Websites
Useful websites and articles
- Hadoop Weekly
- The Hadoop Ecosystem Table
- Hadoop 1.x vs 2
- Apache Hadoop YARN: Yet Another Resource Negotiator
- Introducing Apache Hadoop YARN
- Apache Hadoop YARN - Background and an Overview
- Apache Hadoop YARN - Concepts and Applications
- Apache Hadoop YARN - ResourceManager
- Apache Hadoop YARN - NodeManager
- Migrating to MapReduce 2 on YARN (For Users)
- Migrating to MapReduce 2 on YARN (For Operators)
- Hadoop and Big Data: Use Cases at Salesforce.com
- All you wanted to know about Hadoop, but were too afraid to ask: genealogy of elephants.
- What is Bigtop, and Why Should You Care?
- Hadoop - Distributions and Commercial Support
- Ganglia configuration for a small Hadoop cluster and some troubleshooting
- Hadoop illuminated - Open Source Hadoop Book
- NoSQL Database
- 10 Best Practices for Apache Hive
- Hadoop Operations at Scale
Presentations
- Hadoop 24/7
- An example Apache Hadoop Yarn upgrade
- Apache Hadoop In Theory And Practice
- Hadoop Operations at LinkedIn
- Hadoop Performance at LinkedIn
- Docker based Hadoop provisioning
Books
- Hadoop: The Definitive Guide
- Hadoop Operations
- Apache Hadoop Yarn
- HBase: The Definitive Guide
- Programming Pig
- Programming Hive
- Hadoop in Practice, Second Edition
- Hadoop in Action, Second Edition
Other Awesome Lists
Other amazingly awesome lists can be found in the awesome-awesomeness list.