Sando Code Search Tool gets revved up! (In more ways than one...)

Wednesday, September 03, 2014

David C. Shepherd - Searching the Linux Source Tree in 0.5 Seconds

"Our recent work on the Sando Code Search extension, a tool that uses Lucene to search code, has focused on making it more scalable and robust. To demonstrate our progress I'll provide demos of both Sando and FindInFiles (i.e., a grep-like feature in Visual Studio) searching the entire Linux kernel. As you'll see, there's a fundamental difference between Lucene-based search tools and regular-expression-based search tools.

Before we begin, let's briefly examine the Linux source tree. At the time of our demo it contained 47,528 files, which occupied 1.71 GB on disk. Most of these files were C code, but there were also a fair number of documentation and configuration files. Sando and FindInFiles both search all text files.

Searching the Linux Source Tree with FindInFiles

To use FindInFiles I configured it to search the directory containing the Linux code, entered my search, and selected Find All. In this running example the user is searching for encryption algorithms, specifically those related to AES, and thus uses the regular expression query "encrypt*aes". Executing this search caused FindInFiles to run its regular expression matching algorithm against every line of every file in that directory, recursively. As you can see in "Starting the Search", this used about 50% of the CPU on an eight-core machine for a considerable amount of time.

Starting the Search: Notice that when the FindInFiles search begins, CPU utilization rises to 50% on an 8-core machine.

After about one minute and forty seconds the search completed, having searched 47,407 files. Unfortunately, no lines matched this particular search (see "Finishing the Search"). As often happens with a regular-expression-based search, the word ordering in the query did not match the word ordering in the code. In this situation the user would likely have to run another search with re-ordered search terms (e.g., "aes*encrypt") to find relevant code.

Finishing the Search: After about 1m 40s the search completes; no results were found after searching 47,407 files.
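
To make the ordering problem concrete, here is a minimal Python sketch of a grep-like scan. It is not how FindInFiles is implemented; it only illustrates the general technique of testing a regular expression against every line of every file under a directory. The sample lines and the names `scan_lines` and `search_tree` are invented for illustration, and the sketch assumes the `*` in the query is meant as "any run of characters", which Python's `re` module spells `.*`.

```python
import os
import re

def scan_lines(lines, regex):
    """Return (line_number, line) pairs for lines the compiled regex matches."""
    return [(n, line) for n, line in enumerate(lines, start=1) if regex.search(line)]

def search_tree(root, pattern):
    """Walk root recursively and test every line of every file against pattern."""
    regex = re.compile(pattern)
    hits = []
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # errors="ignore" keeps the scan going over non-UTF-8 bytes.
            with open(path, encoding="utf-8", errors="ignore") as handle:
                for n, line in scan_lines(handle, regex):
                    hits.append((path, n, line.rstrip("\n")))
    return hits

# Hypothetical sample lines, not taken from the Linux tree.
sample = [
    "static int aes_encrypt(struct crypto_tfm *tfm)",
    "/* encrypt one block with aes */",
    "int unrelated_helper(void)",
]

# The same two words in a different order match different lines.
forward = scan_lines(sample, re.compile(r"encrypt.*aes"))
reverse = scan_lines(sample, re.compile(r"aes.*encrypt"))
```

In this toy input, `forward` picks up only the comment line and `reverse` picks up only the function signature, so a user who guesses the wrong order misses the code they wanted. The cost of each query also grows with the total size of the tree, because nothing is computed ahead of time.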

Searching the Linux Source Tree with Sando

Next we searched the same Linux source tree using Sando. Unlike FindInFiles, which is based on regular expression matching, Sando is built upon information retrieval technology (think Google). It uses Lucene.NET to pre-index source code and provide ranked results. Typing in the same query as before, minus the regular expression syntax (i.e., "encrypt aes"), returns results almost instantly, as you can see below. Just as importantly, the most relevant results are returned first, with less relevant results toward the bottom. Additionally, in Sando's UI, selecting a result in the list provides a preview of the program element with matching terms in bold.

Searching with Lucene: The same search returns almost instantly when using Lucene-based searchers.
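
For readers who have not worked with an index-based searcher, the idea can be sketched in a few lines of Python. This is a toy inverted index, not Lucene.NET and not Sando's code: the tokenizer (splitting identifiers on non-alphanumeric characters) and the scoring (a raw count of matching terms) are simplifying assumptions, and the file names and contents are made up.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on non-alphanumerics, so aes_encrypt -> aes, encrypt."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def build_index(docs):
    """Map each term to {doc_id: occurrences}. Built once, before any query."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] += 1
    return index

def search(index, query):
    """Rank documents by how often the query terms occur, in any order."""
    scores = Counter()
    for term in tokenize(query):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    return [doc_id for doc_id, _score in scores.most_common()]

# Hypothetical documents, not taken from the Linux tree.
docs = {
    "aes.c": "int aes_encrypt(ctx) { /* aes encrypt block */ }",
    "des.c": "int des_encrypt(ctx)",
    "util.c": "int helper(void)",
}
index = build_index(docs)
results = search(index, "encrypt aes")
```

A query is now a handful of dictionary lookups rather than a pass over every file, and because terms are looked up independently, "encrypt aes" and "aes encrypt" return the same ranking. A real engine such as Lucene uses a far more refined scoring model than this raw count.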

Of course, there is a cost to pre-indexing. For the Linux source tree that cost is about 50 minutes of low-CPU background processing. Fortunately, this only happens once, after which incremental updates and switching branches trigger at most a few seconds of indexing. Additionally, for most medium-sized projects initial indexing completes in a matter of seconds. For instance, Sando can index its own source code in less than ten seconds.
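
One common way to keep updates cheap is to re-index only the files whose modification time has changed since the last pass. The post does not say how Sando detects changes, so the Python sketch below is purely an illustration of that general approach; `changed_files` and the `seen` dictionary are hypothetical names.

```python
import os

def changed_files(root, seen):
    """Return paths under root that are new or modified since the last call.

    `seen` maps path -> last observed modification time (in nanoseconds)
    and is updated in place, so the caller keeps it between passes.
    """
    changed = []
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mtime = os.stat(path).st_mtime_ns
            if seen.get(path) != mtime:
                seen[path] = mtime
                changed.append(path)
    return changed
```

The first call with an empty `seen` returns every file, which corresponds to the expensive initial indexing. Later calls return only the files touched in between, so the work scales with the size of the change rather than the size of the tree. A fuller version would also drop index entries for files that have been deleted.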

..."

David reached out to me today with news about the updated Sando Code Search Tool/VS Extension and I just loved how he used VS and Sando to index and search the Linux source tree...

Also make sure you click through to the full post, not only to see the pretty animated GIFs but also to see a number of other code search tools for VS and beyond. I dig that he took the time to highlight other similar tools.

Finally, the source for this project is still on CodePlex, https://sando.codeplex.com. :)