Archive for August, 2014

First Tachyon Meetup

1) Meetup –

2) Video / Slides –

Summary –

Tachyon is an in-memory file system by AMPLab – the UC Berkeley lab famous for in-memory systems.

1) Why is Tachyon needed?
a) Sharing data between two subsequent jobs (say, Spark jobs) requires the data to go through HDFS – this slows them down.

b) Two jobs working on the same data each need to create a copy of it in their own process. Also, placing the data within the JVM leads to garbage collection issues.

2) Tachyon's write performance is 100x that of HDFS under test conditions. A real-world application run against memHDFS came out 4x faster.

3) Tachyon writes only one copy (in memory). For reliability it uses the concept of lineage – it knows how the data was produced, and on failure it re-runs that processing.

4) Assumption behind lineage – the programs/jobs must be deterministic (MR/Spark impose the same kind of restriction).

5) A particular example of its suitability for machine learning – the same data set needs to be iterated over several times, say, to find a minimum.

6) As of now there is no concept of security – no concept of users or ACLs

7) Concepts I didn’t clearly understand – Spark’s OFF_HEAP storage being backed by Tachyon (?), ramdisk.
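The lineage idea in points 3–4 can be sketched in a few lines of plain Python. This is a toy model of the concept, not Tachyon's actual API – all class and method names here are invented for illustration:

```python
# Toy sketch of lineage-based recovery: instead of replicating data,
# record how a dataset was derived and recompute it from its parent
# when the single in-memory copy is lost.

class Dataset:
    def __init__(self, parent=None, transform=None, records=None):
        self.parent = parent        # lineage: where the data came from
        self.transform = transform  # deterministic function applied to parent
        self._cache = records       # in-memory copy (may be evicted)

    def map(self, fn):
        # Only the transformation is recorded; nothing is computed yet.
        return Dataset(parent=self, transform=fn)

    def collect(self):
        if self._cache is None:
            # Cache lost or never materialized: recompute from lineage.
            self._cache = [self.transform(r) for r in self.parent.collect()]
        return self._cache

    def evict(self):
        self._cache = None  # simulate memory pressure


source = Dataset(records=[1, 2, 3])
doubled = source.map(lambda x: x * 2)
assert doubled.collect() == [2, 4, 6]

doubled.evict()                        # the only in-memory copy is gone
assert doubled.collect() == [2, 4, 6]  # recovered by re-running the transform
```

Note that this only works because the transform is deterministic (point 4): a transform that called, say, `random()` would recompute different data than what was lost.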

1) Meetup –

2) Video/Slides – Not available

Summary – 

1) TellApart is mainly in ad personalization for retail companies (e.g., Nordstrom). They have a large, nice office. The main difference in their model is that people have to click AND buy something – they operate on revenue sharing.

2) One half of the talk was about the lambda architecture. It is basically a big-data design pattern where a datum needs to be acted upon immediately (streaming/real-time) and also in a more elaborate manner later on (batch). Stupid example using music recommendation – if a person listens only to melancholic piano music but suddenly likes a grunge metal track, the immediate recommendation should be more grunge metal tracks, but later on (a few minutes/hours, once the batch processing system has processed this datum) the recommendations should include some heavy metal (or whatever goes along with melancholic piano and grunge metal).

3) Two major stacks out there for the lambda architecture – Hadoop (batch) + Storm (real-time), and Spark (batch) + Spark Streaming (real-time). There was no clear comparison online, so I asked some people there about their experiences. Quote – “A company has a Storm streaming pipeline, a redundant Storm streaming pipeline for failover, and if even that fails – page the engineers.” Not very inspiring.

4) The remaining part of the talk was about ad placements – the math and strategy. Discussions covered optimization strategy (Nash equilibrium, etc.), response time (a decision about placing an ad has to be made within 100 ms – with their Spark stack, 40 ms), and ML issues (the cold-start problem, models per user vs. models per feature set, modeling the competition).
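The two-path pattern from point 2 can be sketched in plain Python. This is purely illustrative – the class names and the trivial "models" are made up, not anything TellApart described:

```python
# Toy sketch of the lambda architecture: a speed layer acts on each
# event immediately, a batch layer recomputes a richer view later,
# and a serving layer merges the two.

class SpeedLayer:
    """Acts on each event immediately (the real-time path)."""
    def __init__(self):
        self.recent = []

    def ingest(self, event):
        self.recent.append(event)

    def recommend(self):
        # Naive stand-in: recommend more of whatever was just played.
        return self.recent[-1]

class BatchLayer:
    """Recomputes an elaborate view over all history, on a schedule."""
    def __init__(self):
        self.history = []
        self.view = None

    def run_batch(self, events):
        self.history.extend(events)
        # Stand-in for a richer model: the user's overall taste profile.
        self.view = sorted(set(self.history))

def serve(speed, batch):
    # Serving layer: fresh signal plus the deeper batch-computed profile.
    return {"immediate": speed.recommend(), "profile": batch.view}

speed, batch = SpeedLayer(), BatchLayer()
speed.ingest("piano")
speed.ingest("grunge")           # sudden new taste: acted on instantly
batch.run_batch(speed.recent)    # hours later, batch folds it into the profile
assert serve(speed, batch) == {"immediate": "grunge",
                               "profile": ["grunge", "piano"]}
```

The point of the pattern is exactly this split: the immediate recommendation reflects only the latest signal, while the batch view integrates it into the whole history.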

I want to start a new series where I summarize the various meetups I go to. I want to include the following information –


1) Actual link of meetup for anyone who is willing to go to the next one.

2) Video/Slides of the presentation

3) My summary of things


Caveats – 1) My summary is going to be far from thorough. 2) Cross-verify any claims/facts.


The first meetup I want to start with was called “New Developments in Scalable Machine Learning”.

1) Meetup –

2) Video –

Summary – 

1) This was a panel discussion instead of a traditional presentation
2) The only way to upgrade Hadoop version is to start a new company (Ted Dunning)
3) Everyone is excited and talking about Spark – mainly because we have reached a point where clusters can do most of the computation in-memory (0xdata is an in-memory ML computing engine)
4) Data ingestion/cleaning/munging is 80-90% of the ML pipeline, according to all the panelists
5) At production scale, the focus is on the time it takes for an ML model to score, version control, and hot swapping of ML models
6) Deep learning has helped a lot of customers – in doing things that were not feasible in a reasonable amount of time before. Progression: logistic regression -> GBM -> deep learning
7) The panelists are not too excited about GPU computing just yet – GPU computing is hard, and the performance improvements show up only in very specific applications (dense matrix multiplication)
But the most important point of all,
8) The technologies are changing rapidly. It is important to learn a technology and use it in production for a while, but then be ready to move on to a newer, better technology. This is going to be the norm in the immediate future. There will be a lot of relearning involved with respect to technologies.
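Point 5 above – scoring latency, version control, and hot swapping of models – can be sketched with a tiny model registry. This is a hypothetical illustration; the names are invented, not anything the panelists showed:

```python
# Toy sketch of versioned, hot-swappable ML models in a serving process:
# new model versions are registered alongside old ones and activated
# atomically, so the service never restarts to pick up a new model.
import threading

class ModelRegistry:
    """Serves scoring requests while models can be swapped underneath."""
    def __init__(self):
        self._lock = threading.Lock()
        self._models = {}   # version string -> model object with .score()
        self._active = None

    def register(self, version, model):
        with self._lock:
            self._models[version] = model

    def activate(self, version):
        # Hot swap: the active reference is replaced in one step, so
        # new requests see the new version without any downtime.
        with self._lock:
            self._active = self._models[version]

    def score(self, features):
        return self._active.score(features)

class LinearModel:
    """Trivial stand-in for a real trained model."""
    def __init__(self, weight):
        self.weight = weight

    def score(self, x):
        return self.weight * x

registry = ModelRegistry()
registry.register("v1", LinearModel(1.0))
registry.activate("v1")
assert registry.score(10) == 10.0

registry.register("v2", LinearModel(2.0))
registry.activate("v2")  # swapped live, without restarting the service
assert registry.score(10) == 20.0
```

Keeping old versions in the registry also makes rollback a one-line `activate("v1")` if the new model misbehaves.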