Category: Programming

Elasticon notes

500 B documents
1) JBOD over RAID
2) unicast discovery
3) mem for JVM heap + fs cache
4) tune kernel params, user/process/network limits
5) JVM might corrupt data in es

6) tune JVM params, network/connectivity, recovery params, gateway params, caching params

7) tribe node for greater than 150 nodes
8) refresh interval = -1
9) bulk index thread pool
10) disable _all
11) explain/validate queries
12) search templates
13) high cardinality fields – disable aggregation/sorting
14) do search on client nodes
15) monitoring – nagios
16) upgrade es during upgrade
17) 600-700 TB.
18) reindexing faster than recovery
19) bulk index while replica is zero
20) what if you kill master
21) field level statistic
22) merge count
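
Notes 8 and 19 above refer to two index-level settings, `refresh_interval` and `number_of_replicas`. Here is a minimal sketch of the settings bodies one might send to the index `_settings` endpoint before and after a big bulk load. The helper names are hypothetical, the "restore" values (`1s`, one replica) are my assumptions rather than something from the talk, and no HTTP call is made here.

```python
import json


def bulk_load_settings():
    """Settings to apply before a large bulk index (hypothetical helper).

    refresh_interval = -1 turns off periodic refresh (note 8) and
    number_of_replicas = 0 avoids indexing every document twice (note 19).
    """
    return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}


def post_load_settings(refresh_interval="1s", replicas=1):
    """Settings to restore afterwards; the defaults here are assumptions."""
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": replicas,
        }
    }


if __name__ == "__main__":
    # These bodies would go in a PUT to /<index>/_settings.
    print(json.dumps(bulk_load_settings(), indent=2))
    print(json.dumps(post_load_settings(), indent=2))
```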


1) field data and doc values

1) suro netflix (equivalent to herd), asgard (netflix oss)
2) search nodes
3) Apollo discovery plugin open source – raigad does it
4) ram/2 for es, use jstat and check JVM
5) unbounded bulk indexing, unless heap error
6) file descriptor limit increase

Strata notes

Keynotes – nothing substantial, a couple of them pitches by sponsors. Salesforce – how mobile Internet is changing lives (couple of interesting points). Rubicon – in the moment analytics (check out). Data drill – data exploration for non IT. Data science getting its due (open government data) – DJ Patil first chief data scientist. New government data – open and machine readable. Big Data report?

Spark talk – hall of fame. Alibaba Taobao. Zebrafish, microscope the fish. Spark mapping the brain. Laser, activate individual neurons. Scala resource optimization, better shuffle, data frames!

Tsar – Google analytics for twitter

Streaming design patterns – kappa architecture. External lookup.

Belgian MoD – @stevenbeeckman
Sex and cash theory
Startup bus. Civilian data analyst.

Quid – contextual vs global models (see pic).

Raster maps. GeoTrellis – geographic data processing.

Keynote – cyber operations room, Stafford Beer, Chile

Open data platform

New data visualizations – length better than area. Edward Tufte.

The connected cow. Estrus

Netflix. Hadoop + s3 instead of hdfs.

Crunch – faster cascading. Hive on tez fastest. Spark – easy maintenance.

Fastest SQL – Hive on Tez. HAWQ/Hive

Cloudera presentation – click stream data

Adobe presentation – middle America. Tb is standard, pb is limits. Most of the time people are mining structured data. Old organizations (>100 years) using data.

Mapr – myriad. The day yarn was announced, mesos was in production for a year. Actor based bidi RPC(?). Mesos create virtual clusters. Omega paper/Google – single scheduler framework not viable. Slider

Spark – shuffle (common inefficiency), job on driver vs worker. RDD.toDebugString(). collect() transfers data from the workers to the driver. Only the driver can perform operations on RDDs (no RDDs within RDDs). For converting batch to stream – transform/foreachRDD. Testing: instead of SparkContext.stop() use LocalSparkContext.stop(). Spark-packages.

Cybernetic Revolutionaries.

Big Data Frameworks

In this blogpost I want to write down details of some of the big data frameworks. Ideally I want to create a nice interactive graph, but I don’t know what technology will be best to do that. Any suggestions? I most likely will continuously update this post, rather than create the best one in one go.

Note – this is not intended to be a complete list. Also, some of the classification/information might be wrong. I am first doing a brain dump, will later refine each section.

Some of the technologies

  1. Data Processing
    1. Batch
      1. Hadoop – old workhorse, two operators : map and reduce. Plus integrate with programming language
      2. Hive – SQL-like syntax, works on large amounts of data
      3. Pig – similar use case as hive, but more powerful syntax
      4. Spark – new age hadoop, has several more operators than just map and reduce. Plus integrate with programming language
        1. Shark
        2. SparkSQL
    2. Streaming
      1. Spark Streaming
      2. Storm
  2. Data Storage
    1. File Systems
      1. HDFS
      2. Tachyon – in-memory file system, used alongside spark when iterating over same dataset several times
    2. Storage formats
      1. Parquet
      2. Protobuf
      3. Thrift
      4. Avro
    3. Data Compression
      1. LZO
      2. Snappy
  3. Message Queues
    1. RabbitMQ
    2. Kafka
  4. Workflow Scheduling
    1. Oozie
    2. Azkaban
    3. Luigi
  5. Storage systems
    1. CouchDB
    2. HBase
    3. Cassandra
    4. Sqrrl
    5. Several, several more
  6. Cluster management OS/Containers
    1. YARN
    2. Mesos
    3. Microsoft Research REEF
  7. Visualizations
    1. D3.js
    2. Bokeh
  8. ML
    1. HexData
    2. Oryx, Mahout
    3. Spark MLlib
    4. Graphlab, Giraph (Grafos)
    5. SparkR
  9. Cluster management suite
    1. Hortonworks
    2. MapR
    3. Cloudera
    4. Pivotal
  10. Search
    1. Lucene
    2. Solr/Lucene
    3. Elasticsearch/Lucene
  11. Unknown, don’t know how related
    1. Akka
  12. Languages
    1. Java 8
    2. Scala
    3. IPython Notebook
    4. R?

Edit – I wanted to create a nice graph of all these technologies, but seems like someone has already done a good job at that –

1. apt-get install ruby-full build-essential
2. apt-get install rubygems
3. gem install rails

This blog post has a nice code which dumps the visual tree in the immediate window. Check it out.

So how can we consume information?

Edit : I found this really amazing website few days after writing this post :

This is going to be a long post. Right now it is a brain dump, need to organize it better.

So in this post, I will try to reason with myself about a still unanswered question: how can we systematically consume the information present around us for our betterment?

Specifically, how can we better our careers?

To give some context on how I reached this question – I am subscribed to Goodreads and one of my friends read a series of books on career improvement and design patterns. This led me to read the book “A passionate programmer”. It gave some nice advice about what to do to improve one’s career. I also found many online courses like Stanford Classes – Google University, Udacity, etc.

Now it is established that there is a lot of material available to improve oneself – books, tutorials, online courses which mimic a class room a la Stanford classes, online courses with no end goal. I talked to few friends on how this information can be consumed and used to better our careers. Sure enough everyone was interested in it, but we stumbled on some problems, which might not need to be solved.

Before going into these problems, let me put down my thoughts about how the institution of the university has solved the problem of too much material. The biggest problems, I think, when a person sits down to work through online material, are motivation to continue working and lack of continuous rewards. The university system has solved this wonderfully. Talking of rewards first, it’s brilliant how the learning is divided into chunks of years. So a person learns 1st grade material, then 2nd, and so on. Right from kindergarten to PhD! This gives a sense of continuous rewards. There is a yearly reward of passing a year and motivation to work towards it. At higher levels, where more motivation is needed, the year is further divided into smaller chunks, thereby giving more immediate rewards.

The second problem is of motivation. The university system has many factors to motivate. In the initial schooling years, it’s friends. Then it’s the learning. In college it’s the promise of a job, a career, a ticket to a good life. At still higher levels it’s the joy of learning and changing one’s field. Of course, these reasons are in no way exhaustive, just some of the motivations.

Coming back to learning from material, after several years in an environment where an exam motivates you to learn and the rewards are more mainstream, like the promise of a job, how do we transition ourselves to a system where motivation is from self and rewards are not immediately visible, like a better way to handle people or a better code architecture? I don’t yet have an answer for this. A group for this is good, but is extremely dependent on the motivation of its members.

The next problem is the order in which to learn information, and personal interests. We are used to a system of CS10X courses, then CS20X courses, then 30X, 40X, 50X etc. Information is ordered, and there are people dedicating their entire lives to ordering this information (Board of Studies). Can all information be ordered? Can design patterns, functional programming, learning about the business side of your company, customer success stories be learnt in an order? Is an ordering even required? Then comes the problem of personal interest. How does a small group of people who have come together to learn and grow together also respect individual interests?

Udacity has started with the aim of an online university. Will it evolve over time to an even better method of learning? Right now the content writers gate the information flow. Will there be a time where there will be no select group of people gating information flow to learn? Is such a model even possible? Is there a market for a social media site dedicated only to foster discovery of content, learning, motivation and a rewards system for continuous learning? Will an online model of learning be able to surpass the traditional way of learning, which is severely restricted when it comes to scaling?

Time to revamp the blog

Hi readers 🙂

It has been long since I contributed to this blog. Now, I feel, is the time to revamp the blog. I will try to categorize my posts into the following –

1. Programming/Technology

2. Politics

3. Others

for now, until the posts are numerous enough to have their own separate blogs 🙂

National Instruments R & D

Hello guys, it’s been long since I last blogged. I have been wanting to get back to blogging for a long time, but was looking for some topic to resume with, and now that I have one, I hope to continue blogging 🙂

In this blog I will talk about my interview process at National Instruments R & D. So, without wasting any more words, let’s get started.

Something about the company: I really don’t know much about it, except that it produces software typically used by electrical and electronics engineers for design and testing 🙂 I’ve always wanted to work in a products company, though of late I was getting more inclined towards systems programming, but no issues 🙂

Profile they offered :

Two  profiles were offered – one for software, one for hardware. Hardware was open for ECE and EEE, Software was open for CS, IT, ECE, EEE

They had three kinds of postings:

1)      Full time positions (the regular thing)

2)      Full time interns – I don’t know much about it, except that it is for 6 months, at the end of which, based on one’s performance, s/he is inducted into the company. This was for MCAs.

3)      Part time interns – This is for people who are in and around b’lore, who can visit the company 2-3 days a week

Coming to the more interesting, the interview process  🙂 :

I was sitting for the software profile, so I don’t know much about what happened to hardware people. There were basically 3 rounds, 1 written and 2 interviews. They asked about coding right from the first question in first round to the last question in the last round.

Round 1 : Written round, basically a C Apti

This round lasted for an hour and consisted of 2 parts. The first part contained code snippets and simple questions on them. There were around 12 questions and the answers had to be written out (not MCQ). The second part involved writing an algo, code, or both for a problem statement.

Some simple problems were: work out what the given code does (one was about printing nodes at the kth level, one was about summing the digits of a number until a single digit remains…), replace a given piece of code with simpler code, find the number of graphs with n vertices, and a few more simple ones. The only question for which I didn’t get a solution was

if((a == 5) || (a ==7))



The compiler generates following assembly code

cmp eax, 5

jz label

cmp eax, 7

jz label

Optimise the above code to use a single jz instruction
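
One possible answer, which I did not have at the time, so treat it as a sketch rather than the expected solution: 5 is binary 101 and 7 is binary 111, so the two values differ only in bit 1. Forcing that bit on and comparing with 7 needs just one jump. The assembly in the comments assumes it is fine to overwrite eax (otherwise copy it to a scratch register first). The Python function is only there to check the bit trick; its name is made up.

```python
# Candidate single-jump sequence (assumes eax may be clobbered):
#
#     or  eax, 2      ; force bit 1 on: 5 (101b) and 7 (111b) both become 7
#     cmp eax, 7
#     jz  label
#
# The function below mirrors that test so it can be checked exhaustively
# over a range of values.


def is_five_or_seven(a):
    """True exactly when a is 5 or 7, using one comparison."""
    return (a | 2) == 7


if __name__ == "__main__":
    for a in range(-4096, 4097):
        assert is_five_or_seven(a) == (a == 5 or a == 7)
    print("bit trick agrees with (a == 5) || (a == 7) on the tested range")
```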

The second part of Round 1 was to give an algo/code for a problem statement. The gist of the statement is –

N soldiers of two armies are standing in two rows facing each other. Their individual strengths are given by two arrays G and F. You are the commander of army F. If a G soldier’s strength is greater than or equal to that of the F soldier facing him, G’s soldier wins, or else F’s soldier wins. The soldier who loses dies, but the strength of the winning soldier is unaffected. Rearrange F’s soldiers such that the sum of strengths of the surviving soldiers in F is maximum.

Given two arrays int G[], int F[], int n = number of soldiers in both G and F

The value at each index represents the strength of that soldier

G[] = { given and order fixed}, ex: G[] = {2, 10, 7}

F[] = {given, order not fixed}, ex: F[] = {2, 9, 6}

sum = 0, all are dead

Rearrange F so that the sum of strengths of the surviving soldiers is max. For example, the above can be rearranged as

F[] = {2, 6, 9}, sum = 9, which is already better, though not yet the maximum
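
Here is a sketch of a greedy approach I believe works for this problem; it is my own reconstruction, not the answer the interviewers gave. The idea: count the largest number k of fights F can win (sort both arrays and walk through them with two pointers), then let the k strongest F soldiers be the winners by pairing them, weakest first, against the k weakest G soldiers. Everyone else is placed in the leftover positions and loses. The function name is made up.

```python
def arrange_soldiers(G, F):
    """Return (arrangement of F, sum of surviving F strengths).

    G's order is fixed; an F soldier survives only if strictly stronger
    than the G soldier he faces.
    """
    n = len(G)
    f_sorted = sorted(F)
    # Positions in G, ordered from weakest to strongest opponent.
    g_order = sorted(range(n), key=lambda i: G[i])

    # Two-pointer count of the maximum number of fights F can win.
    k = 0
    for f in f_sorted:
        if k < n and f > G[g_order[k]]:
            k += 1

    winners = f_sorted[n - k:]   # the k strongest F soldiers, ascending
    losers = f_sorted[:n - k]    # everyone else

    result = [None] * n
    for pos, f in zip(g_order[:k], winners):
        result[pos] = f          # i-th weakest winner faces i-th weakest G
    for pos, f in zip(g_order[k:], losers):
        result[pos] = f

    total = sum(f for f, g in zip(result, G) if f > g)
    return result, total


if __name__ == "__main__":
    print(arrange_soldiers([2, 10, 7], [2, 9, 6]))
```

On the example above, G = {2, 10, 7} and F = {2, 9, 6}, this gives the arrangement {6, 2, 9} with sum 15: the 6 beats the 2, the 9 beats the 7, and the 2 is sacrificed against the 10.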

That was pretty much the first round. Around 80-90 people wrote the first round, and around 22 were selected. 3 from IT, 4 from CS, 1 from MCA and the rest from ECE.

Round 2: First round of interviews

After the written round, the shortlisted candidates were called for an interview. It was scheduled for half an hour for each person. My interview started almost on time. They took my resume, studied it for 10 mins and then I was called in. It was a one on one interview. The interviewer started with asking me to tell about myself. After clearing the purpose of the interview and what they are looking for in a candidate in that round, he proceeded on to ask questions. The questions themselves were easy. I was asked about 4 – 5 such questions in the entire round.

I always proceeded with first telling the interviewer what my strategy would be in solving the problem, and when he was satisfied, to write the code. The following questions I can recall were asked to me:

  • Given two arrays A[a1, a2…an] and B[b1, b2…bn], write a program(WOP) to find the fraction A/B in its simplest form
  • Given a string, find the largest palindrome in it
  • Given two trees, find out if one tree is subtree of the other(this was not required to be coded)
  • I don’t remember any more questions
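
For the palindrome question, a minimal sketch of the "expand around each centre" idea, which is roughly the strategy I would describe before coding. I am assuming "largest palindrome" means the longest contiguous palindromic substring; the function name is my own.

```python
def longest_palindrome(s):
    """Return the longest palindromic substring of s (first one on ties)."""
    if not s:
        return ""
    best_start, best_len = 0, 1
    for center in range(len(s)):
        # Try an odd-length centre (one char) and an even-length centre
        # (the gap between two chars).
        for lo, hi in ((center, center), (center, center + 1)):
            while lo >= 0 and hi < len(s) and s[lo] == s[hi]:
                lo -= 1
                hi += 1
            length = hi - lo - 1  # lo and hi have stepped one past the match
            if length > best_len:
                best_start, best_len = lo + 1, length
    return s[best_start:best_start + best_len]


if __name__ == "__main__":
    print(longest_palindrome("forgeeksskeegfor"))
```

This runs in O(n²) time in the worst case with O(1) extra space, which is usually enough for an interview answer.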

This round was pretty much easy, the interviewer was friendly and would change the question if you were to get stuck at some point. The sad part was some people were eliminated in this round, based on such easy questions.

Around 14 people were selected from this round, 2 from IT, 3 from CS, 1 from MCA, rest from ECE.

Round 3: Second round of interviews + HR

This 1-on-1 interview round was scheduled for one hour. It started almost on time. My interview was slightly different from others. My interviewer spent around 10 mins asking me about things in my resume, regarding extra curricular activities, on my role in club, specifically the workshops, events I conducted, co-coordinated, attended, regarding my role in the cultural fest, on my internships. Then he started with the technical questions.

In the entire remaining duration we discussed only one question. He asked a question which, fortunately, had come up in my earlier round: given pointers to two nodes, how would you tell whether one is a subtree of the other? He asked for two approaches – comparing the values of the pointers, and comparing the values of the nodes themselves.

The solution is rather straightforward. If comparing memory addresses, traverse one tree and check each node’s address against the other node. If comparing by value, it gets more complicated, as all the children of the other node also need to be compared to infer whether it’s a subtree or not.
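
Both approaches can be sketched roughly as follows. This is a reconstruction from memory, not the code I wrote in the interview; the `Node` class and function names are made up, and treating an empty tree as a subtree of anything is my assumption.

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right


def contains_node(root, target):
    """Approach 1: compare pointers (object identity)."""
    if root is None:
        return False
    if root is target:
        return True
    return contains_node(root.left, target) or contains_node(root.right, target)


def same_tree(a, b):
    """True if the two trees have identical shape and values."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return (a.value == b.value
            and same_tree(a.left, b.left)
            and same_tree(a.right, b.right))


def is_subtree_by_value(root, other):
    """Approach 2: compare node values, including all children."""
    if other is None:
        return True   # assumption: an empty tree is a subtree of any tree
    if root is None:
        return False
    return (same_tree(root, other)
            or is_subtree_by_value(root.left, other)
            or is_subtree_by_value(root.right, other))
```

The boundary conditions worth checking are the ones the interviewer poked at: empty trees, a single node, and a candidate that matches only partway down.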

I was asked to code both the approaches, and he checked them against various boundary conditions. I had made some implementation mistakes here and there, some of which he found, and some I realized while explaining the code to him. He was cool with it, as long as I corrected them.

After the tech questions, there were two HR questions –

1) Have you spent too much time on debugging something, and if yes, where was the fault, how did you correct it and how did you go about finding it?

2) As this was an R&D position, he asked me to give an example where I have shown motivation and initiative, when I didn’t really have to do it.

I didn’t have much trouble answering the above questions.

Finally, after around 30-45 mins of discussions among themselves, they published the results. Five people made it through: 1 from CS, 1 from IT and 2 from EC. They gave 1 full time internship to an MCA.

So that’s my entire experience. They haven’t yet given the joining dates. From my entire experience, the two things that helped me the most –

1)      The ability to code. Even when they say an algo is enough, after giving the algo, if you have time, code it. It gives you an advantage.

2)      Extra curricular activities. Depending on the profile of the company and the job position, change your resume to highlight a few key areas, instead of putting in all information, especially the langs/platforms/etc. you know and the subjects which you studied.

These are my personal opinions, which I feel will help you.

All the best to everyone, who is looking for placements this year or in future. If you can code and have practice, you don’t really have to worry about anything else.

Microsoft internship…part II

This is sequel to the post

Round 1: Written round
The results to the written C Apti round were announced overnight. 27 people out of 101 got selected. Here are the questions of the C apti:

a) A recursion problem, just needed some patience

b) Given a function parser(char *) which parses the input string, write all the possible test cases for it.

c) You have been given a memory space to work with. You only know its base address and size. You cannot use more memory. Write the functions void *Allocate(int) and void Delete(void *)

d) Given two linked lists, each node containing one digit of a very long number. Subtract the two numbers and return a linked list which contains the difference, with each node containing one digit
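
For question (d), a rough sketch of digit-by-digit subtraction with a borrow. The question as I remember it left the details open, so these are my assumptions: the most significant digit comes first, neither number has leading zeros, and the first number is at least as large as the second. For brevity the digits are copied into Python lists instead of reversing the linked lists in place. All names are made up.

```python
class DigitNode:
    def __init__(self, digit, next_node=None):
        self.digit = digit
        self.next_node = next_node


def from_digits(digits):
    """Build a linked list (most significant digit first) from a list."""
    head = None
    for d in reversed(digits):
        head = DigitNode(d, head)
    return head


def to_digits(head):
    """Flatten a linked list back into a Python list of digits."""
    digits = []
    while head is not None:
        digits.append(head.digit)
        head = head.next_node
    return digits


def subtract(a, b):
    """Return a - b as a linked list, assuming a >= b >= 0."""
    x = to_digits(a)[::-1]   # least significant digit first
    y = to_digits(b)[::-1]
    result = []
    borrow = 0
    for i in range(len(x)):
        d = x[i] - borrow - (y[i] if i < len(y) else 0)
        if d < 0:
            d += 10
            borrow = 1
        else:
            borrow = 0
        result.append(d)
    # Drop leading zeros of the answer, but keep a single 0.
    while len(result) > 1 and result[-1] == 0:
        result.pop()
    return from_digits(result[::-1])
```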

The questions required some thinking, and the implementations were not rigorously tested. Not very tough, but a challenging round.

Round 2: Logic and implementation
The next round was a group process. 27 members were divided into groups of 4 each, each group having a mentor. All of them were made to be seated in a single room.The process was simple. One question would be given and everyone has to think about its implementation and discuss the answer with his/her mentor. When the mentor gives the go ahead, you have to implement it after which the mentor would check your code and give test conditions.

There were two such questions with half an hour each for each question. The questions were as follows:

a) Given a binary tree and a number as input, find the paths from the root to the leaves such that the path sum is equal to the given number

b) Given a char * as input, containing alphanumeric characters and special symbols. Every special symbol has to be replaced by “_(its ASCII)_”

for example, since the ASCII code of $ is 36,

Input: A$BED

Output String: A_36_BED

The constraint here is that only limited additional memory is available, exactly enough to accommodate the new string, and that extra memory is available only at the end of the given string. You cannot create a new string; you have to work on the existing string.
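
A sketch of how question (b) can be done in place, assuming the buffer is modelled as a Python list of characters with spare slots at the end (standing in for the char array with extra room). The trick is to compute the final length first and then fill the buffer from the back, so nothing unread gets overwritten. The function name is made up, and note that `ord("$")` is 36.

```python
def expand_specials_in_place(buf, length):
    """Replace each non-alphanumeric char in buf[:length] with _<ASCII>_.

    buf is a list of single characters with spare capacity at the end.
    Returns the new logical length of the string held in buf.
    """
    def replacement(ch):
        return "_" + str(ord(ch)) + "_"

    # Pass 1: work out how long the expanded string will be.
    new_length = 0
    for i in range(length):
        new_length += 1 if buf[i].isalnum() else len(replacement(buf[i]))
    if new_length > len(buf):
        raise ValueError("buffer too small for the expanded string")

    # Pass 2: copy from the back so unread characters are never clobbered.
    write = new_length - 1
    for read in range(length - 1, -1, -1):
        ch = buf[read]
        if ch.isalnum():
            buf[write] = ch
            write -= 1
        else:
            for c in reversed(replacement(ch)):
                buf[write] = c
                write -= 1
    return new_length


if __name__ == "__main__":
    buf = list("A$BED") + [None] * 3   # three spare slots at the end
    n = expand_specials_in_place(buf, 5)
    print("".join(buf[:n]))
```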

This round was doable and challenging. It was fun, I thoroughly enjoyed it. 12 out of 27 people got selected.

Round 3: Personal Interviews
The next round was the personal interview. It was a 1 on 1 interview. I was asked only 1 question and we discussed that same question for 50 mins. This varied from person to person. I got enough indications that I wouldn’t be able to get past this round, as I was encountering more problems, one after the other, in the implementation I chose. The question was pretty straightforward though:

Given two arrays as input. Array a contains integers. Array b contains the indices of array a which needs to be deleted.

I was trying to implement a solution with complexity O(n), but the logical complexity became very high and the interviewer was not very impressed with it.
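
In hindsight, one simple way to stay linear is to mark the positions to delete first and then compact the array in a single pass. This is a sketch, not what I wrote in the interview. My assumptions: the indices in b refer to positions in the original array a, and duplicate or out-of-range indices are ignored. It takes O(n + m) time and O(n) extra space for the marks.

```python
def delete_indices(a, b):
    """Remove from list a, in place, the elements at the indices listed in b."""
    to_delete = [False] * len(a)
    for idx in b:
        if 0 <= idx < len(a):
            to_delete[idx] = True

    # Compact: copy every kept element forward over the deleted ones.
    write = 0
    for read in range(len(a)):
        if not to_delete[read]:
            a[write] = a[read]
            write += 1
    del a[write:]   # drop the now-unused tail
    return a


if __name__ == "__main__":
    print(delete_indices([10, 20, 30, 40, 50], [1, 3]))
```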

8 out of 12 people were selected for the next round. This will be after the end sems, which is a pain for those who got selected, as they may have to stay back during the holidays, and they have been asked to go through the basics.

This is pretty much the entire procedure. The interviewers were very friendly. A very good experience. The entire process was just too much fun 😀 I wanna write more just for the heck of it 😀

The day has finally come in my engineering career where I sit for companies. Well, this is for an internship. So Microsoft is coming to take interns for the summer, which is more than six months away. Well, some details of how the week has been. Just finished the economics test yesterday, coming now after writing the OS lab end sem exam, and have to submit the DBS project report tomorrow. Among all these, a test by MS. The preparations are great! 😉

In another few minutes I will be leaving for the test, and thought the best use of the last moments would be to write a blog. Well, the first round is the written C-apti. Some hope for me there. The interview will be tomorrow. But tomorrow is a long way off; there are 101 people competing for this internship and I have no clue how many are on offer.

My programming skills have lost touch and polish of late, and I am wary of how I will do in the interview. I can probably discuss Chandrayaan-1 and India’s secret nuclear submarine, the Advanced Technology Vessel (ATV). I wish he lets me speak about Computer Graphics and I get a chance to work on DirectX.

All other strategies are dynamic! This is going to be more of a test in distracting him into subjects i know. Will update after my tests.


P.S. : Here is the sequel