
Elasticon notes

500 B documents
1) JBOD over RAID
2) unicast discovery
3) mem for JVM heap + fs cache
4) tune kernel params, user/process/network limits
5) JVM might corrupt data in es

6) tune JVM params, network/connectivity, recovery params, gateway params, caching params

7) tribe node of greater than 150 nodes
8) refresh interval = -1
9) bulk index thread pool
10) disable _all
11) explain/validate queries
12) search templates
13) high cardinality fields – disable aggregation/sorting
14) do search on client nodes
15) monitoring – nagios
16) upgrade es during upgrade
17) 600-700 TB.
18) reindexing faster than recovery
19) bulk index while replica is zero
20) what if you kill master
21) field level statistic
22) merge count
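
Items 8 and 19 above (refresh interval = -1, bulk index while replica is zero) are typically applied together as index settings before a large load and reverted afterwards. A minimal sketch in Python that only builds the JSON bodies. The function names are hypothetical, the restore values (`1s`, one replica) are my assumptions and not numbers from the talk, and sending the body to the cluster is left out.

```python
import json

def bulk_load_settings():
    """Index settings to apply before a large bulk index (items 8 and 19)."""
    return {
        "index": {
            "refresh_interval": "-1",  # disable periodic refresh while loading
            "number_of_replicas": 0,   # no replicas to keep in sync during the load
        }
    }

def restore_settings(refresh_interval="1s", replicas=1):
    """Settings to put back after the load; the defaults here are assumed values."""
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": replicas,
        }
    }

if __name__ == "__main__":
    # Print the two request bodies so they can be inspected or pasted elsewhere.
    print(json.dumps(bulk_load_settings(), indent=2))
    print(json.dumps(restore_settings(), indent=2))
```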


1) field data and doc values

1) suro netflix (equivalent to herd), asgard (netflix oss)
2) search nodes
3) Apollo discovery plugin open source – raigad does it
4) ram/2 for es, use jstat and check JVM
5) unbounded bulk indexing, unless heap error
6) file descriptor limit increase
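
A small sketch of items 4 and 6 above, assuming "ram/2" means half of the machine's memory goes to the Elasticsearch heap and the rest is left for the fs cache. The function names are hypothetical; `resource` is the Python standard library module and is only available on Unix-like systems.

```python
def es_heap_mb(total_ram_mb):
    """Half of RAM for the Elasticsearch heap; the other half is left to the fs cache."""
    return total_ram_mb // 2

def open_file_limits():
    """Return the (soft, hard) file descriptor limits of the current process."""
    import resource  # stdlib, Unix only, so imported lazily
    return resource.getrlimit(resource.RLIMIT_NOFILE)

if __name__ == "__main__":
    print("heap for a 64 GB box:", es_heap_mb(64 * 1024), "MB")
    print("file descriptor limits (soft, hard):", open_file_limits())
```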


Strata notes

Keynotes – nothing substantial, a couple of them pitches by sponsors. Salesforce – how mobile Internet is changing lives (couple of interesting points). Rubicon – in-the-moment analytics (check out). Data drill – data exploration for non-IT. Data science getting its due (open government data) – DJ Patil, first chief data scientist. New government data – open and machine readable. Big Data report?

Spark talk – hall of fame. Alibaba Taobao. Zebrafish, microscope the fish. Spark mapping the brain. Laser, activate individual neurons. Scala resource optimization, better shuffle, data frames!

Tsar – Google analytics for twitter

Streaming design patterns – kappa architecture. External lookup.

Belgian MoD – @stevenbeeckman
Sex and cash theory
Startup bus. Civilian data analyst.

Quid – contextual vs global models (see pic).

Raster maps. GeoTrellis – geographic data processing.

Keynote – cyber operations room, Stafford Beer, Chile

Open data platform

New data visualizations – length better than area. Edward Tufte.

The connected cow. Estrus

Netflix. Hadoop + s3 instead of hdfs.

Crunch – faster cascading. Hive on tez fastest. Spark – easy maintenance.

Fastest SQL – hive on tez. HAWQ/hive

Cloudera presentation – click stream data

Adobe presentation – middle America. TB is standard, PB is the limit. Most of the time people are mining structured data. Old organizations (>100 years) using data.

MapR – Myriad. The day YARN was announced, Mesos had been in production for a year. Actor based bidi RPC(?). Mesos creates virtual clusters. Omega paper/Google – single scheduler framework not viable. Slider

Spark – shuffle (common inefficiency), job on driver vs worker. RDD.toDebugString(). collect() transfers data from the workers to the driver. Only the driver can perform operations on RDDs (no RDDs within RDDs). For converting batch to stream – transform/foreachRDD. Testing: instead of SparkContext.stop() use LocalSparkContext.stop(). Spark-packages.

Cybernetic Revolutionaries.

Big Data Frameworks

In this blog post I want to write down details of some of the big data frameworks. Ideally I want to create a nice interactive graph, but I don’t know what technology will be best for that. Any suggestions? I will most likely update this post continuously, rather than create the best one in one go.

Note – this is not intended to be a complete list. Also, some of the classification/information might be wrong. I am first doing a brain dump, will later refine each section.

Some of the technologies

  1. Data Processing
    1. Batch
      1. Hadoop – old workhorse, two operators : map and reduce. Plus integrate with programming language
      2. Hive – SQL-like syntax, works on large amounts of data
      3. Pig – similar use case as hive, but more powerful syntax
      4. Spark – new age hadoop, has several more operators than just map and reduce. Plus integrate with programming language
        1. Shark
        2. SparkSQL
    2. Streaming
      1. Spark Streaming
      2. Storm
  2. Data Storage
    1. File Systems
      1. HDFS
      2. Tachyon – in-memory file system, used alongside spark when iterating over same dataset several times
    2. Storage formats
      1. Parquet
      2. Protobuf
      3. Thrift
      4. Avro
    3. Data Compression
      1. LZO
      2. Snappy
  3. Message Queues
    1. RabbitMQ
    2. Kafka
  4. Workflow Scheduling
    1. Oozie
    2. Azkaban
    3. Luigi
  5. Storage systems
    1. CouchDB
    2. HBase
    3. Cassandra
    4. Sqrrl
    5. Several, several more
  6. Cluster management OS/Containers
    1. YARN
    2. Mesos
    3. Microsoft Research REEF
  7. Visualizations
    1. D3.js
    2. Bokeh
  8. ML
    1. HexData
    2. Oryx, Mahout
    3. Spark MLlib
    4. Graphlab, Giraph (Grafos)
    5. SparkR
  9. Cluster management suite
    1. Hortonworks
    2. MapR
    3. Cloudera
    4. Pivotal
  10. Search
    1. Lucene
    2. Solr/Lucene
    3. Elasticsearch/Lucene
  11. Unknown, don’t know how related
    1. Akka
  12. Languages
    1. Java 8
    2. Scala
    3. IPython Notebook
    4. R?
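
The Hadoop entry in the list above mentions its two operators, map and reduce. A minimal word-count sketch of that model in plain Python (not the Hadoop API; all function names are hypothetical), with the grouping step that the framework performs between the two phases written out explicitly:

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, which the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    """Run map over every line, group by word, then reduce each group."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

if __name__ == "__main__":
    print(word_count(["spark and hadoop", "hadoop and hive"]))
```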

Edit – I wanted to create a nice graph of all these technologies, but seems like someone has already done a good job at that –

First Tachyon Meetup

1) Meetup –

2) Video / Slides –

Summary –

Tachyon is an in-memory file system by the UC Berkeley lab famous for in-memory systems – AMPLab

1) Why is Tachyon needed?
a) Sharing data between two subsequent jobs (say, Spark jobs) requires it to go through HDFS – this slows them down 

b) Two jobs working on the same data each need to create a copy of it in their own process. Also, placing data within the JVM will lead to garbage collection issues.

2) Tachyon write performance is 100x that of HDFS in test conditions. In a real-world application, it was 4x faster than memHDFS

3) Tachyon writes only one copy (in memory). For reliability it uses the concept of lineage – it knows how the data was produced and re-runs the processing if the data is lost.

4) Assumption about lineage – the programs/jobs must be deterministic (MR/Spark have same kind of restrictions)
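
Points 3 and 4 can be illustrated with a toy store that keeps a single in-memory copy of each dataset plus the recipe that produced it. This is only a sketch of the idea, not Tachyon's implementation or API. The class and method names are hypothetical, and the `loader` function stands in for re-reading input from durable storage.

```python
class LineageStore:
    """Toy store: one in-memory copy of each dataset plus how to rebuild it."""

    def __init__(self):
        self._data = {}     # name -> value, the single in-memory copy
        self._lineage = {}  # name -> (function, names of its inputs)

    def add_source(self, name, loader):
        """Register an input that can always be re-read by calling loader()."""
        self._lineage[name] = (loader, ())

    def derive(self, name, func, *inputs):
        """Compute a dataset from its inputs and remember how it was produced."""
        self._lineage[name] = (func, inputs)
        self._data[name] = func(*(self.get(i) for i in inputs))

    def evict(self, name):
        """Simulate losing the in-memory copy (e.g. a node failure)."""
        self._data.pop(name, None)

    def get(self, name):
        """Return the data, re-running its lineage if the copy is gone."""
        if name not in self._data:
            func, inputs = self._lineage[name]
            # Only correct if func is deterministic (point 4 above).
            self._data[name] = func(*(self.get(i) for i in inputs))
        return self._data[name]

if __name__ == "__main__":
    store = LineageStore()
    store.add_source("raw", lambda: [1, 2, 3])
    store.derive("doubled", lambda xs: [2 * x for x in xs], "raw")
    store.evict("doubled")
    print(store.get("doubled"))  # rebuilt from lineage, not from a replica
```

The recovery path makes the determinism assumption concrete: if the function gave different output on a re-run, the rebuilt data would silently differ from what was lost.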

5) A particular example of suitability to machine learning – the same data set needs to be iterated over several times, say to find minima

6) As of now there is no concept of security – no concept of users or ACLs

7) Concepts I didn’t clearly understand – Spark OFF HEAP stores in Tachyon (?), ramdisk

1) Meetup –

2) Video/Slides – Not available

Summary – 

1) TellApart is mainly into ad personalization for retail companies (e.g. Nordstrom). They have a large, nice office. The main difference in their model is that people have to click AND buy something – they operate on revenue sharing

2) One half of the talk was about lambda architecture. It is basically a big data design pattern where a datum needs to be acted upon immediately (streaming/real-time) and also in a more elaborate manner later on (batch). Stupid example using music recommendation – if a person listens only to melancholic piano music but suddenly likes a grunge metal track, the immediate recommendation should be more grunge metal tracks, but later on (a few mins/hours later, when the batch processing system has processed this datum) the recommendation should include some heavy metal (or whatever goes along with melancholy piano and grunge metal).
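
A minimal sketch of the pattern in point 2, using plain Python counters in place of real batch and streaming systems. Counting plays per genre is a stand-in for an actual recommender, and the class and method names are hypothetical. A query merges the precomputed batch view with the speed layer's view of events the batch job has not yet seen.

```python
from collections import Counter

class LambdaViews:
    """Toy lambda architecture: batch view + speed view merged at query time."""

    def __init__(self):
        self.master_log = []         # every event ever seen (batch layer input)
        self.batch_view = Counter()  # rebuilt from the full log on each batch run
        self.speed_view = Counter()  # incremental counts since the last batch run

    def on_event(self, genre):
        """Real-time path: record the event and update the speed view at once."""
        self.master_log.append(genre)
        self.speed_view[genre] += 1

    def run_batch(self):
        """Batch path: recompute from all data, then drop the now-covered speed view."""
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self):
        """Serve the merged result of both layers."""
        return self.batch_view + self.speed_view

if __name__ == "__main__":
    views = LambdaViews()
    for genre in ["piano", "piano", "piano"]:
        views.on_event(genre)
    views.run_batch()
    views.on_event("grunge")  # visible immediately through the speed view
    print(views.query())
```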

3) Two major models out there for lambda architecture – hadoop (batch) + storm (realtime) and spark (batch) + spark streaming (realtime). There was no clear comparison online, so I asked some people there about their experiences. Quote – “A company has a storm stream pipeline, a redundant storm stream pipeline for failover, if even that fails – page engineers”. Not very inspiring.

4) The remaining part of talk was about ad placements – the math and strategy. Discussions about optimization strategy (nash equilibrium, etc), response time (a decision about placing ad has to be taken within 100ms, spark stack – 40ms), ML issues (cold start problem, models per user vs models per feature-set, modeling the competition) 

I want to start a new series where I summarize the various meetups I go to. I want to include the following information –


1) Actual link of meetup for anyone who is willing to go to the next one.

2) Video/Slides of the presentation

3) My summary of things


Caveats – 1) My summary is going to be far from thorough 2) Cross verify any claims/facts


The first meetup I want to start with was called “New Developments in Scalable Machine Learning”.

1) Meetup –

2) Video –

Summary – 

1) This was a panel discussion instead of traditional presentation
2) The only way to upgrade Hadoop version is to start a new company (Ted Dunning)
3) Everyone is excited and talking about Spark – mainly because we have reached a point where we can have clusters doing most of the computation in-memory (0xdata is in-memory ML computing engine)
4) Data ingestion/cleaning/munging is 80-90% of the ML pipeline, according to all the panelists
5) At production scale, the focus is on the time it takes for an ML model to score, version controlling and hot swapping of ML models
6) Deep learning has helped a lot of customers – in doing things that were not feasible in a reasonable amount of time. Progression : Logistic regression -> GBM -> deep learning
7) The panelists are not too excited about GPU computing just yet – GPU computing is hard, and performance improves only in very specific applications (dense matrix multiplication)
But the most important of all, 
8) The technologies are changing rapidly. It is important for someone to learn a technology, use it in production for a while, but then be ready to move onto a better newer technology. This is going to be the norm in the immediate future. There will be a lot of relearning involved wrt technologies.


1. apt-get install ruby-full build-essential
2. apt-get install rubygems
3. gem install rails

This blog post has some nice code that dumps the visual tree in the Immediate window. Check it out.

So how can we consume information?

Edit : I found this really amazing website a few days after writing this post :

This is going to be a long post. Right now it is a brain dump, need to organize it better.

So in this post, I will try to reason with myself about a still unanswered question: how can we systematically consume the information present around us for our betterment?

Specifically, how can we better our careers?

To give some context on how I reached this question – I am subscribed to Goodreads and one of my friends read a series of books on career improvement and design patterns. This led me to read the book “The Passionate Programmer”. It gave some nice advice about what to do to improve one’s career. I also found many online courses like Stanford Classes – Google University, Udacity, etc.

Now it is established that there is a lot of material available to improve oneself – books, tutorials, online courses which mimic a classroom a la Stanford classes, online courses with no end goal. I talked to a few friends about how this information can be consumed and used to better our careers. Sure enough, everyone was interested in it, but we stumbled on some problems, which might not need to be solved.

Before going into these problems, let me put down my thoughts about how the institution of the university has solved the problem of too much material. The biggest problems, I think, when a person sits down to work through online material are motivation to continue working and lack of continuous rewards. The university system has solved this wonderfully. Talking of rewards first, it’s brilliant how the learning is divided into chunks of years. So a person learns 1st grade material, then 2nd, and so on. Right from kindergarten to PhD! This gives a sense of continuous rewards. There is a yearly reward of passing a year and motivation to work towards it. The year is further divided into smaller chunks at higher levels, where there is need of more motivation, thereby giving more immediate rewards.

The second problem is motivation. The university system has many factors to motivate. In the initial schooling years, it’s friends. Then it’s the learning. In college it’s the promise of a job, a career, a ticket to a good life. At still higher levels it’s the joy of learning and changing one’s field. Of course, these reasons are in no way exhaustive, just some of the motivations.

Coming back to learning from material, after several years in an environment where an exam motivates you to learn and the rewards are more mainstream, like the promise of a job, how do we transition ourselves to a system where motivation is from self and rewards are not immediately visible, like a better way to handle people or a better code architecture? I don’t yet have an answer for this. A group for this is good, but is extremely dependent on the motivation of its members.

The next problem is the order in which to learn information, and personal interests. We are used to a system of CS10X courses, then CS20X courses, then 30X, 40X, 50X etc. Information is ordered; there are people dedicating their entire lives to ordering this information (Board of Studies). Can all information be ordered? Can design patterns, functional programming, learning about the business side of your company, customer success stories be learnt in an order? Is an ordering even required? Then comes the problem of personal interest. How does a small group of people who have come together to learn and grow together also respect individual interests?

Udacity has started with the aim of an online university. Will it evolve over time to an even better method of learning? Right now the content writers gate the information flow. Will there be a time where there will be no select group of people gating information flow to learn? Is such a model even possible? Is there a market for a social media site dedicated only to foster discovery of content, learning, motivation and a rewards system for continuous learning? Will an online model of learning be able to surpass the traditional way of learning, which is severely restricted when it comes to scaling?