
Big Data Frameworks

In this blog post I want to write down details of some of the big data frameworks. Ideally I want to create a nice interactive graph, but I don't know what technology would be best for that. Any suggestions? I will most likely update this post continuously, rather than create the best one in one go.

Note – this is not intended to be a complete list. Also, some of the classification/information might be wrong. I am first doing a brain dump and will refine each section later.

Some of the technologies

  1. Data Processing
    1. Batch
      1. Hadoop – the old workhorse; two operators: map and reduce. Also integrates with programming languages
      2. Hive – SQL-like syntax, works on large amounts of data
      3. Pig – similar use case as Hive, but a more powerful syntax
      4. Spark – new-age Hadoop; has several more operators than just map and reduce. Also integrates with programming languages
        1. Shark
        2. SparkSQL
    2. Streaming
      1. Spark Streaming
      2. Storm
  2. Data Storage
    1. File Systems
      1. HDFS
      2. Tachyon – in-memory file system, used alongside Spark when iterating over the same dataset several times
    2. Storage formats
      1. Parquet
      2. Protobuf
      3. Thrift
      4. Avro
    3. Data Compression
      1. LZO
      2. Snappy
  3. Message Queues
    1. RabbitMQ
    2. Kafka
  4. Workflow Scheduling
    1. Oozie
    2. Azkaban
    3. Luigi
  5. Storage systems
    1. CouchDB
    2. HBase
    3. Cassandra
    4. Sqrrl
    5. Several, several more
  6. Cluster management OS/Containers
    1. YARN
    2. Mesos
    3. Microsoft Research REEF
  7. Visualizations
    1. D3.js
    2. Bokeh
  8. ML
    1. HexData
    2. Oryx, Mahout
    3. Spark MLlib
    4. Graphlab, Giraph (Grafos)
    5. SparkR
  9. Cluster management suite
    1. Hortonworks
    2. MapR
    3. Cloudera
    4. Pivotal
  10. Search
    1. Lucene
    2. Solr/Lucene
    3. Elasticsearch/Lucene
  11. Unknown, don’t know how related
    1. Akka
  12. Languages
    1. Java 8
    2. Scala
    3. IPython Notebook
    4. R?
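To make the map/reduce contrast above concrete, here is a minimal word count in plain Python – a toy sketch, not actual Hadoop or Spark code, and all function names here are made up. Hadoop forces every job into a map phase (emit key/value pairs) followed by a reduce phase (combine values per key); Spark layers many more operators on top of that same idea.

```python
from itertools import groupby

def map_phase(lines):
    # map: each input line -> a list of (word, 1) pairs
    return [(word, 1) for line in lines for word in line.split()]

def reduce_phase(pairs):
    # shuffle: sort/group the pairs by key, then reduce each
    # group by summing its counts
    pairs = sorted(pairs)
    return {key: sum(count for _, count in group)
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

lines = ["big data frameworks", "big data processing"]
counts = reduce_phase(map_phase(lines))
print(counts)  # {'big': 2, 'data': 2, 'frameworks': 1, 'processing': 1}
```

Everything a Hadoop job does has to be squeezed into those two functions; Spark's richer operator set (filter, join, groupByKey, etc.) removes that straitjacket.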

Edit – I wanted to create a nice graph of all these technologies, but seems like someone has already done a good job at that –

First Tachyon Meetup

1) Meetup –

2) Video / Slides –

Summary –

Tachyon is an in-memory file system by the UC Berkeley lab famous for in-memory systems – AMPLab

1) Why is Tachyon needed?
a) Sharing data between two subsequent jobs (say, Spark jobs) requires it to go through HDFS – this slows them down 

b) Two jobs working on the same data need to create a copy of it in their process. Also, placing data within the JVM will lead to garbage collection issues.

2) Tachyon's write performance over HDFS is 100x in test conditions. In a real-world application, performance vs. memHDFS was 4x faster

3) Tachyon writes only one copy (in memory). For reliability it uses the concept of lineage – it knows how the data was produced, and it re-runs the processing to recover it.

4) Assumption behind lineage – the programs/jobs must be deterministic (MR/Spark have the same kind of restriction)

5) A particular example of its suitability for machine learning – the same data set needs to be iterated over several times, say, to find a minimum

6) As of now there is no concept of security – no users or ACLs

7) Concepts I didn't clearly understand – Spark OFF_HEAP storage in Tachyon (?), ramdisk
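The lineage idea from points 3) and 4) can be sketched in a few lines of Python – a toy illustration, not Tachyon's actual API; the class and method names are made up. Instead of replicating data for fault tolerance, record the deterministic job that produced it and re-run that job whenever the single in-memory copy is lost.

```python
# Toy sketch of lineage-based recovery (not Tachyon's real API).
class LineageBlock:
    def __init__(self, parent_data, transform):
        self.parent_data = parent_data  # durable input (e.g. on disk)
        self.transform = transform      # deterministic producing job
        self._cached = None             # the single in-memory copy

    def read(self):
        if self._cached is None:
            # copy lost (or never materialized): re-run the lineage
            self._cached = self.transform(self.parent_data)
        return self._cached

    def evict(self):
        self._cached = None             # simulate losing the in-memory copy

block = LineageBlock([1, 2, 3], lambda xs: [x * x for x in xs])
print(block.read())   # [1, 4, 9]
block.evict()         # data lost from memory...
print(block.read())   # ...recomputed from lineage: [1, 4, 9]
```

This is also why determinism matters: if the transform gave different output on each run, the "recovered" data would not match the lost copy.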

1) Meetup –

2) Video/Slides – Not available

Summary – 

1) TellApart is mainly into ad personalization for retail companies (e.g. Nordstrom). They have a nice large office. The main difference in their model is that people have to click AND buy something – they operate on revenue sharing

2) One half of the talk was about the lambda architecture. It is basically a big data design pattern where a datum needs to be acted upon immediately (streaming/real-time) and also in a more elaborate manner later on (batch). A toy example using music recommendation – if a person listens to only melancholic piano music but suddenly likes a grunge metal track, the immediate recommendation should be more grunge metal tracks, but later on (a few minutes/hours, once the batch processing system has processed this datum) the recommendations should include some heavy metal (or whatever goes along with melancholic piano and grunge metal).

3) There are two major stacks out there for the lambda architecture – Hadoop (batch) + Storm (real-time), and Spark (batch) + Spark Streaming (real-time). There was no clear comparison online, so I asked some people there about their experiences. Quote – “A company has a Storm stream pipeline, a redundant Storm stream pipeline for failover, and if even that fails – page engineers”. Not very inspiring.

4) The remaining part of the talk was about ad placement – the math and strategy. Discussions covered optimization strategy (Nash equilibrium, etc.), response time (a decision about placing an ad has to be taken within 100ms; the Spark stack – 40ms), and ML issues (the cold start problem, models per user vs. models per feature-set, modeling the competition)
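The lambda architecture from point 2) can be sketched as follows – a toy illustration in plain Python, not real Storm/Spark code, with all names made up. Every event feeds both a speed layer, which updates an approximate view immediately, and a master log that a batch layer periodically recomputes from; queries merge the batch view with the speed layer's recent deltas.

```python
# Toy sketch of the lambda architecture pattern (no real Storm/Spark here).
class LambdaPipeline:
    def __init__(self):
        self.master_log = []   # immutable log of all events
        self.batch_view = {}   # accurate view, rebuilt periodically
        self.speed_view = {}   # deltas since the last batch run

    def ingest(self, user, item):
        self.master_log.append((user, item))
        # real-time path: update the speed view immediately
        self.speed_view.setdefault(user, []).append(item)

    def run_batch(self):
        # batch path: recompute from the full log, then reset the deltas
        self.batch_view = {}
        for user, item in self.master_log:
            self.batch_view.setdefault(user, []).append(item)
        self.speed_view = {}

    def query(self, user):
        # serve the merged batch + speed views
        return self.batch_view.get(user, []) + self.speed_view.get(user, [])

p = LambdaPipeline()
p.ingest("alice", "piano")
p.run_batch()
p.ingest("alice", "grunge")  # visible immediately via the speed layer
print(p.query("alice"))      # ['piano', 'grunge']
```

In the music example, the speed layer is what recommends more grunge right away, while the batch layer later rebuilds a richer model from the full listening history.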

I want to start a new series where I summarize the various meetups I go to. I want to include the following information –


1) The actual link of the meetup for anyone who wants to go to the next one.

2) Video/Slides of the presentation

3) My summary of things


Caveats – 1) My summary is going to be far from thorough 2) Cross-verify any claims/facts


The first meetup I want to start with was called “New Developments in Scalable Machine Learning”.

1) Meetup –

2) Video –

Summary – 

1) This was a panel discussion instead of a traditional presentation
2) The only way to upgrade Hadoop version is to start a new company (Ted Dunning)
3) Everyone is excited and talking about Spark – mainly because we have reached a point where we can have clusters doing most of the computation in-memory (0xdata is an in-memory ML computing engine)
4) Data ingestion/cleaning/munging is 80-90% of the ML pipeline, according to all the panelists
5) At production scale, the focus is on the time it takes for an ML model to score, version control, and hot-swapping of ML models
6) Deep learning has helped a lot of customers – in doing things that were not feasible in a reasonable amount of time. Progression: logistic regression -> GBM -> deep learning
7) The panelists are not too excited about GPU computing just yet – GPU computing is hard, and the performance improvements show up only in very specific applications (dense matrix multiplication)
But the most important of all,
8) The technologies are changing rapidly. It is important to learn a technology, use it in production for a while, but then be ready to move on to a newer, better technology. This is going to be the norm in the immediate future. There will be a lot of relearning involved with respect to technologies.


1. sudo apt-get install ruby-full build-essential
2. sudo apt-get install rubygems
3. sudo gem install rails

This blog post has some nice code that dumps the visual tree in the Immediate Window. Check it out.

So how can we consume information?

Edit: I found this really amazing website a few days after writing this post:

This is going to be a long post. Right now it is a brain dump, need to organize it better.

So in this post, I will try to reason with myself about a still unanswered question: how can we systematically consume the information present around us for our betterment?

Specifically, how can we better our careers?

To give context on how I reached this question – I am subscribed to Goodreads, and one of my friends read a series of books on career improvement and design patterns. This led me to read the book “The Passionate Programmer”. It gave some nice advice about what to do to improve one's career. I also found many online courses like Stanford classes – Google University, Udacity, etc.

Now it is established that there is a lot of material available to improve oneself – books, tutorials, online courses which mimic a classroom a la Stanford classes, online courses with no end goal. I talked to a few friends about how this information can be consumed and used to better our careers. Sure enough, everyone was interested in it, but we stumbled on some problems, which might not need to be solved.

Before going into these problems, let me put down my thoughts on how the institution of the university has solved the problem of too much material. The biggest problems, I think, when a person sits down to do online material, are the motivation to continue working and the lack of continuous rewards. The university system has solved this wonderfully. Talking of rewards first, it's brilliant how the learning is divided into chunks of years. So a person learns 1st grade material, then 2nd, and so on – right from kindergarten to PhD! This gives a sense of continuous rewards. There is a yearly reward of passing a year and motivation to work towards it. The year is further divided into smaller chunks at higher levels, where there is a need for more motivation, thereby giving more immediate rewards.

The second problem is motivation. The university system has many factors to motivate. In the initial schooling years, it's friends. Then it's the learning. In college it's the promise of a job, a career, a ticket to a good life. At still higher levels it's the joy of learning and changing one's field. Of course, these reasons are in no way exhaustive, just some of the motivations.

Coming back to learning from material: after several years in an environment where an exam motivates you to learn and the rewards are more mainstream, like the promise of a job, how do we transition ourselves to a system where motivation comes from within and the rewards are not immediately visible, like a better way to handle people or a better code architecture? I don't yet have an answer for this. A group helps, but is extremely dependent on the motivation of its members.

The next problem is the order in which to learn information, and personal interests. We are used to a system of CS10X courses, then CS20X courses, then 30X, 40X, 50X, etc. Information is ordered; there are people dedicating their entire lives to ordering this information (boards of studies). Can all information be ordered? Can design patterns, functional programming, learning about the business side of your company, and customer success stories be learnt in an order? Is an ordering even required? Then comes the problem of personal interest. How does a small group of people who have come together to learn and grow together also respect individual interests?

Udacity has started with the aim of being an online university. Will it evolve over time into an even better method of learning? Right now the content writers gate the information flow. Will there be a time when no select group of people gates the information flow for learning? Is such a model even possible? Is there a market for a social media site dedicated only to fostering discovery of content, learning, motivation, and a rewards system for continuous learning? Will an online model of learning be able to surpass the traditional way of learning, which is severely restricted when it comes to scaling?

Time to revamp the blog

Hi readers :)

It has been long since I contributed to this blog. Now, I feel, is the time to revamp it. I will try to categorize my posts into the following –

1. Programming/Technology

2. Politics

3. Others

for now, until the posts are numerous enough to have their own separate blogs :)

It happened! After many failed attempts and three and a half years of waiting! It was bound to happen, sooner or later!

It was 5 in the morning. The wing had a deserted look. The block always looked a little scary at odd hours. With not a sound, not a movement, not even a flicker of light, I always found mega block, which is a huge building, to be hypnotic and to give off an eerie feeling. But today was not that day. Today was the “trip”, and soon I would find people waking up and getting ready for it.

Everything was going as per my schedule, and by 5:30 I was ready to leave. But to my disappointment I found only 11 people ready by then. It wasn't really surprising; I knew not everyone would be ready on time. Hungry, and ready to leave, we all set out to Thadambail to have a breakfast of buns and sambar. I have, of late, started to like that stuff; it's something available only in this part of the state. The buns were good, not oily, with a soft center and a tinge of sweet and chilli(?). The rest of the class also joined us soon. With all of us nourished, we set out on the trip by 7:30.

Unfortunately (or fortunately?) the girls did not join us. Some had backed out at the last moment for various reasons. With it being an all-guys trip, its nature had completely changed. The implications being the free flow of ideas, thoughts and motives, and some relaxation of the schedule :) Nevertheless, it would have been great to have everyone on board.

To the comfort of the driver, we had all gathered and asked him to begin the drive. This put an end to his early morning boredom, which I suppose started much before 5:30 AM. The drive was smooth, first on a national highway, then a state highway, then just a road, and in the end on some path. The almost three-hour drive made me sing more songs than I have sung in the entire last year! With some melody in our voices, some noise and a lot of screeching, we did a splendid job of entertaining ourselves, and even managed to screw up our larynxes enough to make us shut up and sit after a couple of hours of “singing”. The all-time favorites were bidi jalileye, yaaron, and dost dost na raha, among several others. The gult song “ring ringa ring ringa ring ringa ring ringa reeee…” brought out the headbanger in all of us.

The drive was interrupted by a half-hour pit stop at some place on the way. I have no recollection of the name of the place. Everyone took this opportunity to refresh themselves – and the villagers. As this was the last place where we could get some items to keep us nourished, people took what they thought would give them some energy: buns, puris, idlis, biscuits, chips, cold drinks, packed peanuts, DSP Black, Old Monk, Smirnoff, bananas, small, king… There was nothing much worth mentioning about the place except small shops, old men staring at you, carnivorous cows, autos(!!), and a fully stocked bar, yet not enough variety of food to eat.

We reached the place where the trek would start; the bus would not go any further. The trek started by crossing a 10-meter-wide, knee-deep stream. This set a precedent for the rest of the trek. All of us, after wading through the stream, slipping on slippery stones, looking out for leeches and other creatures in the water, and having reached the other side safely, made a triumphant call, followed by a photo session as a mark of our triumph. This turned out to be just the start – the water perfectly harmless, and almost no leeches in this weather.

The trek was an easy one by trekking standards. It is just around a 4 km trek (that's what I was told). For non-trekkers, like me, it was sufficiently tiring. Large portions of the trek were just a walk on slightly inclined ground. I did notice how one and a half hours had gone by since we started trekking. It went on smoothly until the last half an hour. With uncertainty of direction, difficult terrain, growing heat, humidity and fatigue, things started to get a little challenging. The path grew narrower, less visible and obvious, and the nagging yet pleasant sound of falling water, which made you think you were close to the falls, filled us all intermittently. We kept trekking, expecting the fall to be ahead of us at the next turn, at every turn. But it was not. We were getting impatient, and I started to doubt whether we were on the right path. And then we saw water! We had reached! No, we hadn't. We continued on, and then came the shouts, the cries! Yes, we found it! We had reached! I, after slipping on some mud, falling off some rock into the water, and some maneuvering, finally got a view of it! It was beautiful. It looked much better than in the pictures! It was worth all the effort (though for a trekker, this would have been one of the easy treks).

What happened next I will cover in another post :)

