1) Meetup – http://www.meetup.com/Tachyon/events/200387252/

2) Video / Slides – http://files.meetup.com/14452042/Tachyon_Meetup_2014_8.pdf

Summary –

Tachyon is an in-memory file system by the UC Berkley lab famous for in-memory systems – AMPLab 

1) Why is Tachyon needed?
a) Sharing data between two subsequent jobs (say, Spark jobs) requires it to go through HDFS – this slows them down 

b) Two jobs working on the same data need to create a copy of it in their process. Also placing data within JVM will lead garbage collection issues.

2) Tachyon write performance over HDFS is 100x in test conditions. A real world application performance vs memHDFS resulted in 4x faster performance

3)Tachyon writes only one copy (in memory). For reliability it uses a concept of lineage – it knows how the data was produced and it re-runs the processing.

4) Assumption about lineage – the programs/jobs must be deterministic (MR/Spark have same kind of restrictions)

5) A particular example of suitability to machine learning – the same data set needs to be iterated over several times, say to find minima

6) As of now there is no concept of security – no concept of users or ACLs

7) Concepts I didn’t clearly understand – Spark OFF HEAP stores in Tachyon (?), ramdisk

Advertisements