Friday, February 20, 2015

The Philosophy of PANNS -- A "Biased" Story for the Nearest Neighbor Search in High Dimensional Spaces

I put Panns source code online about a half year ago, more precisely, in the middle of 2014. After that, except occasionally I replied to some emails inquiring about the technical details of this small tool, I almost completely forgot such a thing due to my stressful PhD studies. Until one day, one of my colleagues started talking about Panns during the lunch time, I was a bit surprised that the tool was gaining its popularity. What's more, some people even compared it to other popular similar tools, which really flattered me quite a lot though Panns performed actually badly (regarding its building and querying time) in the comparison. Still, I was very happy about the fact that people are using it, testing it, and maybe also tweaking it.

So why did I decide to write a post about Panns? Well, I am not really going to present the algorithm line by line, nor I am going to talk about any theoretical stuff here. Instead, I want to talk about a little bit philosophy. Yes, you heard me, philosophy. ;) After I started my PhD journey, I realized that you need to have a philosophical stance on almost everything. Especially if you are working in the system science field, you must know there is simply no such a silver bullet which can solve all your problems. During the development, you need to make various design decisions in order to balance the different tradeoffs. All such design decisions are based on your philosophical stance on the system.

In the following text, I want to explain the philosophy of Panns. Then you will understand why it is designed in such a way which further explains why it is much better than others on certain aspects and doomed to lose in other aspects. So, let's start with a brief background story.

How did Panns start in the first place?

My PhD research has been focusing on Information-Centric Networking. It does not matter at all if you have never heard of this term. The point is we want to interconnect the content within a network with their semantic meanings, which essentially requires a language model to back up the idea. In the beginning of 2014, we started a new project -- Kvasir. In Kvasir project, we explored different innovative ways of seamlessly integrating intelligent content provision and recommendation into the web browsing by utilizing the aforementioned idea. Kvasir eventually turned into a more and more interesting  and promising project, so we decided to develop a real system. Panns, technically speaking, is only a small function module in Kvasir system.

Why vector-based language model?

We chose to use vector-based language model in Kvasir. Certainly I know that there is a whole bunch of different language models out there for use. However, there is no evidence in the literature showing  one specific model dominates others. Besides that, if there is really a reason for choosing vector-based models, it is because I want to use LSA (Latent Semantic Analysis) and I am simply good at algebraic models.

Why Python?

The reason is not because Python is my working language. The primary consideration is based on the teaching purpose. Since I am working in a university wherein teaching is an important part of my daily work along with research. We know C/C++ can definitely lead to better performance comparing to Python. However, we also noticed that most students are more comfortable with Python partly due to its popularity and partly due to its simplicity and friendly learning curve. We did worry that using C/C++ might cause the code less accessible to the students. With Python, I can write more elegant code and the students can focus on the key ideas instead of being distracted by mechanical details, of course, at the price of performance. However, in reality, Panns performance is not that bad in our context.

Why random projection trees?

The essential problem we are trying to solve is looking for k nearest neighbors for a given point in very high dimensional spaces. What's more, you need to finish the job fast. This is a really hard question and there is a large body of literature behind it. In general, you need to trade accuracy for speed, and it is totally fine in the language model context where you are trying to find the most similar articles. The small numerical error will not destroy the results and users mostly still feel the returned articles are relevant. Choosing random projection trees (RP-tree) is based on the following several considerations - first, simplicity on the structure, which makes theoretical analysis feasible which further shows the algorithm has a bounded error. Second, both tree building and random projections can be easily parallelized on a computer cluster. As you can see, we are pursuing horizontal scalability.

Why separating the index file into two parts?

The index file generated by Panns are separated into two parts. One contains all the RP-trees, and the other contains the raw vectors. The rationale behind is that the tree-structure needs to be small so that it can reside within the physical memory for fast access, whereas the raw data vectors are usually much larger and it is better to utilize mmap technique so that we do not need load the whole file into the memory. Such design is based on the well-known fact that content popularity distribution is highly skewed and only a very small amount of content are indeed very popular. It further indicates only a small part of the data set is frequently visited.

Why spending so much time in indexing?

Panns spends much longer time in indexing comparing to other tools. The theoretical argument goes like this: as the number of points grows larger and larger, as the dimensionality increases higher and higher, those highly condensed points tend to be uniformly distributed in the space. The split point needs to be carefully placed on the random projection to reduce the probability of misclassifications. Recall that RP-tree is essentially a clustering algorithm. If you find the theoretical argument is boring, then the following argument may make more sense: we care more about the accuracy and we have time to wait for the indexing to finish. Because once the index is constructed, we seldom re-build it within a short period. In that case, we do not really mind to wait a bit longer. 10 seconds, 10 minutes, really make no difference to us. Besides, the indexing can be much faster if parallel indexing mode is enabled.

When do you really need Panns?

Simply put, if you have a huge data set (e.g., over several millions of points) of very high dimensionality (e.g., at least over one thousand), and you are happy with several hundred milliseconds query speed. Then Panns is for you. Or if you are a student who just start this subject and want to get some hands-on experience, Panns is for you. For others, well, why not give it a try :)

Is it finished?

Of course not. But as I promised, I only talk about the design philosophy in this post. I am not planning to do any experiments or plot anything right now. In the [next post], I would like to spend some more time in explaining the performance of Panns in some measurement study, and  talk about why the result is so and what causes such behaviors. So we can have a better understanding on what Panns is really good at.