Tuesday, March 24, 2015

The Philosophy of PANNS -- A Very Preliminary Evaluation & A Not-So-Fair Comparison (maybe)

Though I have promised a long time ago [here] to perform a simple evaluation on PANNS, I finally got it done just today due to my overwhelming workload before moving to the Cambridge.

I. Caveat

Before we start, I must clarify several things beforehand. The purpose is not merely trying to avoid criticism, but to emphasize the context and goal of making PANNS as a small tool.

First, frankly, a large part of the hyped data science deals with data engineering stuff. However, it is well-known that there is no silver bullet in the engineering world. It further means that a good engineer has to look into a specific application area to determine the proper tool. Different data sets may lead to different conclusions when you are evaluating your machine learning/data mining algorithms, as we will see in this post.

Second, I only evaluated the accuracy and the index size here. I admit that PANNS will be beaten miserably by other software in terms of index building time. However, we assume that index will not be built frequently. Once the index is built, it will be only queried all the time. Besides, the index should be as small as possible so that it can work on the large data set. Please check "The Philosophy of PANNS" for more details.

Third, I only compare to Annoy in this post. I am not saying the other candidates are bad. Simply because there are already such comparisons and Annoy seems superior to others in many aspects. In fact, Annoy can be an order of magnitude faster than PANNS in terms of speed. But you will also see how PANNS will win out in other aspects in the follow of this post.

II. Evaluation

The data set used in the evaluation is synthetic. Each data point in the data set is 2000-dimension and follows a standard normal distribution $\mathcal{N}(0,1)$. The index contains 256 RPTrees. The accuracy and index size are measured, the numbers presented are the average results of 50 experiments. The two tables below summarize our results, one for Euclidean similarity and one for Angular similarity. In most cases, I prefer tables to figures when presenting numbers, so I did not bother to plot the result.

 Data Set Size 5000 10000 15000 20000 25000 30000 Accuracy - Panns 75.0 % 51.2 % 39.2 % 30.4 % 27.2 % 25.2 % Accuracy - Annoy 58.2 % 38.0 % 29.4 % 23.2 % 20.6 % 18.0 % Index - Panns 29.6 MB 59 MB 88 MB 116 MB 149 MB 174 MB Index - Annoy 56.0 MB 112 MB 169 MB 224 MB 279 MB 334 MB

Table 1. Comparison between PANNS and Annoy using Euclidean similarity.

 Data Set Size 5000 10000 15000 20000 25000 30000 Accuracy - Panns 75.0 % 53.6 % 36.0 % 37.0 % 27.0 % 25.0 % Accuracy - Annoy 65.0 % 36.8 % 26.4 % 26.4 % 19.2 % 17.2 % Index - Panns 29.6 MB 59 MB 88 MB 116 MB 149 MB 174 MB Index - Annoy 35.0 MB 70 MB 93 MB 140 MB 159 MB 188 MB

Table 2. Comparison between PANNS and Annoy using Angular similarity.

From both tables, the first thing we noticed is the difference in the index size. PANNS is able to save much more space when using the same amount of RPTrees. In the case of Euclidean similarity, the PANNS index is only half size of the Annoy. This benefit becomes even more noticeable when dealing with extremely large data sets and many RPTrees. From this results, we can understand why Panns lost quite a lot in some evaluations where the index only had small number of RPTrees.

In terms of accuracy, we can see PANNS consistently outperforms Annoy in all cases in our evaluation. But the difference starts diminishing as there are more and more data points. However, since PANNS uses much less space for storing the index, we can incorporate more RPTrees in the index given the same index size to achieve better accuracy. However, it becomes difficult to argue whether it is fair comparison to some extent.

In terms of building time, it is so true that Annoy is faster. However, I must point out this only holds when you use serial execution. PANNS provides parallel index building to take advantage of multiple cores on your workstation. In my case, it turned out PANNS is much faster than Annoy because I have 16 cores on my workstation. I hope this does not count as cheating ;-)

In terms of scalability, I also tried building the index from the English Wikepedia dump which consists of 3.8 million documents approximately. PANNS was able to get the job decently done (though took a long time) whereas Annoy always failed due to memory issue. However, I think further investigation is definitely needed.

III. Inconclusive Conclusion

Well, Erik already gave good summaries on "what to use in which context" in his post. I only provide some complementary advice in the following.

In general, PANNS generates much smaller index without sacrificing the accuracy. This becomes more important when you are dealing with large data sets and still want to incorporate as many RPTrees as possible to achieve satisfying accuracy.

The future trend is parallel computing. We need squeeze out all your computational resources from your devices. The slowness of PANNS can be effectively ameliorated by using parallel building and parallel query on multiple cores (even across the network). Please check out our paper on parallel RPTree building on Apache Spark [here].

We are still carrying on the research work in improving the PANNS accuracy. As mentioned, this more relates to the data engineering stuff. There are even better algorithms which can outperform PANNS from 20% to 100%, of course at the price of even longer building time. In the future, we will gradually incorporate these algorithms with a balance between efficiency and accuracy. All in all, PANNS is going to remain as simple and compact as possible for teaching purpose, as we already mention in the [previous article].