In this blog post I want to write down details of some of the big data frameworks. Ideally I want to create a nice interactive graph, but I don’t know what technology would be best for that. Any suggestions? I will most likely update this post continuously, rather than try to write the best version in one go.

Note – this is not intended to be a complete list. Also, some of the classification/information might be wrong. I am doing a brain dump first and will refine each section later.

Some of the technologies

  1. Data Processing
    1. Batch
      1. Hadoop – the old workhorse, with two operators: map and reduce. Integrates with general-purpose programming languages
      2. Hive – SQL-like syntax, works on large amounts of data
      3. Pig – similar use case to Hive, but with a more powerful syntax
      4. Spark – new-age Hadoop, with several more operators than just map and reduce. Also integrates with general-purpose programming languages
        1. Shark
        2. SparkSQL
    2. Streaming
      1. Spark Streaming
      2. Storm
  2. Data Storage
    1. File Systems
      1. HDFS
      2. Tachyon – in-memory file system, used alongside Spark when iterating over the same dataset several times
    2. Storage formats
      1. Parquet
      2. Protobuf
      3. Thrift
      4. Avro
    3. Data Compression
      1. LZO
      2. Snappy
  3. Message Queues
    1. RabbitMQ
    2. Kafka
  4. Workflow Scheduling
    1. Oozie
    2. Azkaban
    3. Luigi
  5. Storage systems
    1. CouchDB
    2. HBase
    3. Cassandra
    4. Sqrrl
    5. Several, several more
  6. Cluster management OS/Containers
    1. YARN
    2. Mesos
    3. Microsoft Research REEF
  7. Visualizations
    1. D3.js
    2. Bokeh
  8. ML
    1. 0xdata (H2O)
    2. Oryx, Mahout
    3. Spark MLlib
    4. GraphLab, Giraph (Grafos)
    5. SparkR
  9. Cluster management suite
    1. Hortonworks
    2. MapR
    3. Cloudera
    4. Pivotal
  10. Search
    1. Lucene
    2. Solr/Lucene
    3. Elasticsearch/Lucene
  11. Unknown – not sure how these relate
    1. Akka
  12. Languages
    1. Java 8
    2. Scala
    3. IPython Notebook
    4. R?
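
To make the "two operators: map and reduce" note under Hadoop a bit more concrete, here is a rough sketch of a word count written in that style. This is plain Python, not Hadoop's actual API; the function names (`map_phase`, `reduce_phase`, `word_count`) are made up for illustration, and everything runs on one machine, so the shuffle/grouping step a real cluster performs between map and reduce is folded into the reduce here.

```python
from collections import Counter
from functools import reduce


def map_phase(line):
    """Map: turn one line of text into a list of (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_phase(counts, pair):
    """Reduce: fold one (word, n) pair into the running totals."""
    word, n = pair
    counts[word] += n
    return counts


def word_count(lines):
    # Apply the map step to every line, then flatten the results.
    pairs = [pair for line in lines for pair in map_phase(line)]
    # Fold all pairs into a single Counter with the reduce step.
    return dict(reduce(reduce_phase, pairs, Counter()))


if __name__ == "__main__":
    print(word_count(["spark and hadoop", "hadoop and hive"]))
```

The point is only the shape of the computation: one function applied independently to each record, and one function that combines the results. Frameworks like Spark offer more operators on top of these two, but the sketch above does not attempt to show them.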

Edit – I wanted to create a nice graph of all these technologies, but it seems like someone has already done a good job of that –