When Xamarin meets Lucene…

Introduction

As soon as we are dealing with a larger amount of data, it can become difficult to find what we are actually looking for. Obviously, we can ease the task of finding information by structuring our data and by offering an intuitive user interface.

Nonetheless, there are several scenarios where a search engine can come in handy.

Probably the best example is our good old friend the Internet. Information is stored and obtained in various ways, and it is an immense and still growing collection of information resources. If you do not know exactly what you are looking for, your search engine of choice is an essential helper to point you in the right direction.

Implementing search capabilities in your desktop application is no rocket science because you can rely on powerful search engines that do the difficult work for you. It is more a matter of configuration than of implementing complex algorithms yourself. Especially as software grows, handcrafted search functionality is simply no longer satisfying.

What do I expect from a “good” search engine? First, the obvious: it should return the most relevant results for what I am looking for. It should find my information even if I misspell it (we all make mistakes). It should suggest similar results, and it should do all that fast. Pretty basic needs, but quite some work if you have to implement this from scratch.

A search engine for mobile?

Some months ago, we had the opportunity to work on a very interesting project. The goal was to build a rich product catalog with enhanced search features that runs on mobile devices. Back then, our data was stored in a closed SAP environment, and the easiest solution would have been to create a web service that provides the data (and handles all the searching, filtering, etc.). However, one challenging requirement was the offline capability: once the data has been synchronized with the device, it needs to be searchable without an internet connection. This means that we need a client-side search engine, and so the journey began.

One problem in finding a search engine for mobile devices is the diversity of programming languages. Assuming that you have an application that runs on iOS and Android, you also need a search engine that is supported by both platforms. We did some research to find mobile-optimized search engines, without much success. In the meantime, it became clear that we had to write our application in C# using Xamarin, because our client wanted to maintain the codebase themselves afterwards.

Note: The Xamarin platform enables developers to write iOS, Android, Mac and Windows apps with native user interfaces using C#. Xamarin utilizes Mono, a free and open source project to run Microsoft .NET applications on different platforms. You can re-use your existing C# code and share a significant amount across device platforms.

How mobile is your .NET?

Xamarin provides a convenient tool called the .NET Mobility Scanner. It shows you how much of your .NET code can run on other operating systems.

Suddenly we had the funny idea to scan an existing .NET search engine that we usually use for desktop applications.

We scanned Lucene.Net and the result was quite interesting: 95% of the code was reported as ready for mobilization! For iOS and Android alone, it even reached 99% compatibility. There was actually only one piece of code that was not supported, because of a System.Configuration dependency. Nothing critical.

You can find the scan results here:

scan.xamarin.com/Home/Report

Note: Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. It has been ported to other programming languages including C#, targeting .NET runtime users.

The scan results raised several questions. Could we adapt Lucene.Net to actually run on mobile devices? Would it perform well? Is it stable enough? Did others already try it out? One thing was clear: we all agreed to give it a try.

I looked around but found only one person on Twitter who had used this library for iOS and Android projects. He told me that it actually works fine but that memory consumption was always a bit of a problem. Nonetheless, we wanted to try it out ourselves, so we downloaded the Lucene.Net source code and quickly fixed the 1% issue with the System.Configuration dependency. Everything was ready for some extensive testing.

Make your data searchable

In order to make your data searchable, the first thing you need to do is build an index. Lucene stores its data as documents containing fields of text. You can basically index everything that contains textual information. Take your data, create a document with certain fields for each item, and save these documents to a physical directory on your filesystem.

A simplified example of how this could look (using Lucene.Net v3.0.3):

using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

public void BuildIndex()
{
  // The index lives in a folder called "index" inside the documents directory
  var indexPath = Path.Combine(
    Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments),
    "index"
  );
  var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
  var indexDirectory = FSDirectory.Open(indexPath);
  var writer = new IndexWriter(indexDirectory, analyzer, IndexWriter.MaxFieldLength.LIMITED);
  var data = new List<Data>()
  {
    new Data { Id = 0, Text = "Introducing the Technology" },
    new Data { Id = 1, Text = "Xamarin meets Lucene" },
    new Data { Id = 2, Text = "A full-featured search engine for mobile" },
    new Data { Id = 3, Text = "Make your data searchable" }
  };

  // One Lucene document per data row
  foreach (var row in data)
  {
    Document doc = new Document();
    // Id is stored as a single, exact term; Text is run through the analyzer
    doc.Add(new Field("Id", row.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("Text", row.Text, Field.Store.YES, Field.Index.ANALYZED));
    writer.AddDocument(doc);
  }

  writer.Optimize();
  writer.Commit();
  // Release the writer and the directory to avoid locked files
  writer.Dispose();
  indexDirectory.Dispose();
}

This is our small Data class:

public class Data
{
    public int Id { get; set; }
    public string Text { get; set; }
}

If you want to try it out yourself, you'll need to download and install Xamarin (xamarin.com/download) and the slightly modified Lucene.Net library (github.com/chrigu-ebert/Xamarin-Lucene.Net).

Indexing in Lucene.Net

You just indexed a couple of documents with Lucene.Net, yeah! Let's have a look at the code example above. There are a couple of important things you need to keep in mind.

We open an index directory using FSDirectory.Open, in which Lucene will store its indexed data. If you open a directory or file, you should always close it by calling the corresponding Dispose method: indexDirectory.Dispose(). If you don't do this, you might corrupt your index because of locked files. The same applies to the IndexWriter, which actually writes data into the directory.
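One way to make sure nothing is left open is to wrap both objects in using statements. The following is only a minimal sketch, not part of the original project: the SafeIndexing class, the IndexTexts method, and the indexPath parameter are hypothetical names chosen for illustration, and it assumes the Lucene.Net 3.0.3 API used throughout this post.

```csharp
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class SafeIndexing
{
    // Hypothetical helper: indexes the given texts into the index at indexPath
    // and returns how many documents were added by this call.
    public static int IndexTexts(string indexPath, string[] texts)
    {
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        var added = 0;

        // The writer is disposed first, then the directory,
        // even if AddDocument throws an exception.
        using (var indexDirectory = FSDirectory.Open(indexPath))
        using (var writer = new IndexWriter(indexDirectory, analyzer, IndexWriter.MaxFieldLength.LIMITED))
        {
            foreach (var text in texts)
            {
                var doc = new Document();
                doc.Add(new Field("Text", text, Field.Store.YES, Field.Index.ANALYZED));
                writer.AddDocument(doc);
                added++;
            }
            writer.Commit();
        }
        return added;
    }
}
```

The behavior is the same as calling Dispose by hand; the using blocks simply guarantee that it also happens when something goes wrong in the middle of indexing.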

You might have noticed that the IndexWriter needs an analyzer instance, in our case the StandardAnalyzer. When you want to insert data into a Lucene index, or when you want to get the data back out of the index you will need to use an Analyzer to do this. Lucene provides many different analyzer classes such as:

  • SimpleAnalyzer
  • StandardAnalyzer
  • StopAnalyzer
  • WhitespaceAnalyzer

There are analyzers for working with different languages, and analyzers that determine how words are treated (and which words are ignored) or how whitespace is handled. Understanding analyzers is somewhat tricky, and as we do not want to lose time, we simply use the StandardAnalyzer. It works very well, especially on English content.
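To get a feeling for what an analyzer does, it helps to look at the tokens it produces. The sketch below assumes the token stream API of Lucene.Net 3.0.3 (ITermAttribute from Lucene.Net.Analysis.Tokenattributes); the AnalyzerDemo class and its Tokenize method are hypothetical names used only for this illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;

public static class AnalyzerDemo
{
    // Hypothetical helper: returns the terms an analyzer produces for a text.
    public static List<string> Tokenize(Analyzer analyzer, string text)
    {
        var tokens = new List<string>();

        // The field name only matters for analyzers that behave differently per field.
        using (var stream = analyzer.TokenStream("Text", new StringReader(text)))
        {
            var termAttribute = stream.AddAttribute<ITermAttribute>();
            while (stream.IncrementToken())
            {
                tokens.Add(termAttribute.Term);
            }
        }
        return tokens;
    }
}
```

Calling Tokenize with a StandardAnalyzer and with a WhitespaceAnalyzer on the same sentence, for example "Introducing the Technology", is a quick way to compare how each one treats casing and common words such as "the".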

Last but not least, we loop over our data, create a new Document for each item, and pass it to the IndexWriter. Each Document contains a set of fields that hold the data we want to make searchable. Normally we store Field content as a string, but there is also a NumericField type, which is very powerful if you search by numeric ranges.
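As a rough illustration of the NumericField idea, here is a small sketch that indexes a few numbers and counts how many fall into a range. It is not taken from our project: the "Price" field, the NumericDemo class, and the CountInPriceRange method are made up for this example, and it uses an in-memory RAMDirectory instead of FSDirectory so that it does not touch the filesystem.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class NumericDemo
{
    // Hypothetical helper: indexes the prices and counts those within [min, max].
    public static int CountInPriceRange(int[] prices, int min, int max)
    {
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

        using (var directory = new RAMDirectory())
        {
            using (var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.LIMITED))
            {
                foreach (var price in prices)
                {
                    var doc = new Document();
                    // Third argument: true means the value is indexed and can be queried
                    doc.Add(new NumericField("Price", Field.Store.YES, true).SetIntValue(price));
                    writer.AddDocument(doc);
                }
                writer.Commit();
            }

            using (var searcher = new IndexSearcher(directory))
            {
                // Both bounds are inclusive here
                var query = NumericRangeQuery.NewIntRange("Price", min, max, true, true);
                return searcher.Search(query, 10).TotalHits;
            }
        }
    }
}
```

For example, CountInPriceRange(new[] { 5, 20, 42, 99 }, 10, 50) should count the two values that lie between 10 and 50.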

It is important to understand the Field attributes, especially the store and index values, to avoid common mistakes:

  • name: The name of the field, used to build queries later.
  • value: The string representation of your data.
  • store: Specifies whether you want to store the value of the field in the index. It does not affect indexing or searching with Lucene. It just tells Lucene whether it should act as a datastore for the values in the field. If you use Field.Store.YES, the value of that field will be included in your search result documents. If you are storing your data in a database and only using the Lucene index for searching, you can get away with Field.Store.NO on all of your fields. However, if you are using the index as storage as well, you will want Field.Store.YES.
  • index: Specifies whether and how the value is indexed. The possible values are listed below.

Field.Index.ANALYZED:

Index the tokens produced by running the field's value through an Analyzer. This makes a lot of sense for longer texts, but you might run into problems if you try to sort on analyzed fields or if you want to find exact matches on single terms (e.g. unique IDs).

Field.Index.ANALYZED_NO_NORMS:

Index the tokens produced by running the field's value through an Analyzer, and also separately disable the storing of norms. No norms means that a few bytes will be saved by not storing some normalization data. This data is what is used for boosting and field-length normalization. The benefit is less memory usage, as norms take up one byte of RAM per indexed field for every document in the index during searching. Only use this flag if you are sure that you're not using that normalization data.

Field.Index.NO:

The field will not be indexed and is therefore unsearchable. However, you can use Field.Index.NO along with Field.Store.YES to store a value that you don't want to be searchable.

Field.Index.NOT_ANALYZED:

Index the field's value without using an Analyzer, so it can be searched. As no analyzer is used, the value will be stored as a single term. This is useful for unique IDs like product numbers or if you want to sort the results using this field.

Field.Index.NOT_ANALYZED_NO_NORMS:

Index the field's value without an Analyzer and also disable the storing of norms.
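To make the difference between these options more tangible, here is a small sketch that builds one document with three differently configured fields and checks which of them can be found. The field names ("ProductNumber", "Description", "ImagePath") and the FieldOptionsDemo class are invented for this example; it again uses a RAMDirectory so that it stays self-contained.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class FieldOptionsDemo
{
    // Hypothetical product document with one field per indexing strategy.
    public static Document BuildProductDocument(string productNumber, string description, string imagePath)
    {
        var doc = new Document();
        // Exact, single-term value: good for lookups by ID
        doc.Add(new Field("ProductNumber", productNumber, Field.Store.YES, Field.Index.NOT_ANALYZED));
        // Tokenized full text: good for keyword search
        doc.Add(new Field("Description", description, Field.Store.YES, Field.Index.ANALYZED));
        // Stored only: returned with results but not searchable
        doc.Add(new Field("ImagePath", imagePath, Field.Store.YES, Field.Index.NO));
        return doc;
    }

    // Indexes the document in memory and returns the number of hits for the query.
    public static int CountHits(Document doc, Query query)
    {
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        using (var directory = new RAMDirectory())
        {
            using (var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.LIMITED))
            {
                writer.AddDocument(doc);
                writer.Commit();
            }
            using (var searcher = new IndexSearcher(directory))
            {
                return searcher.Search(query, 10).TotalHits;
            }
        }
    }
}
```

With a document built this way, a TermQuery on the exact "ProductNumber" value or on a lowercase word from "Description" should find it, while a TermQuery on "ImagePath" should not, even though the stored path can still be read from the result with doc.Get("ImagePath").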

Finally, you'll have to call writer.Commit() to persist the changes. It is a good idea to call writer.Optimize() from time to time to restructure your index and improve search performance. If your index is getting bigger, the optimization can take some time (several seconds).

Are you still with me? At this point, you hopefully understand how you can make your data searchable. Get yourself a cookie, congrats!

Searching in Lucene.Net

Searching data using Lucene is incredibly powerful. I could write books just about that but this is not the goal of this blog post. We will do some really simple searches to explain the basics. Based on this you can build your own, amazingly complex queries.

We will use the following helper method to execute basic queries:

public List<Data> GetDataForQuery(Query query, int limit = 50)
{
  var data = new List<Data>();
  var indexPath = Path.Combine(
    Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments),
    "index"
  );
  var indexDirectory = FSDirectory.Open(indexPath);
  using (var searcher = new IndexSearcher(indexDirectory))
  {
    var hits = searcher.Search(query, limit);
    Console.WriteLine(hits.TotalHits + " result(s) found for query: " + query.ToString());
    foreach (var scoreDoc in hits.ScoreDocs)
    {
      var document = searcher.Doc(scoreDoc.Doc);
      data.Add(new Data()
      {
        Id = int.Parse(document.Get("Id")),
        Text = document.Get("Text")
      });
    }
  }
  indexDirectory.Dispose();
  return data;
}

We simply use FSDirectory and IndexSearcher to open our index and to perform Lucene queries. We loop over the results and return them as a list of Data objects. As you might have noticed, similar to the IndexWriter, we have to explicitly dispose the IndexSearcher (done automatically by the using statement) and the indexDirectory.

Since we stored both field values during indexing (Field.Store.YES), we can retrieve them using document.Get("FieldName"). Our helper method takes two arguments: a Lucene Query object, which will be explained below, and a limit parameter. The hits variable has a property called TotalHits, which gives you the total number of documents that match your query. If you have thousands of documents stored in your index, it doesn't make sense to return them all. Usually it is enough to return a certain subset (limit) while knowing that there are probably more results.

Getting all documents becomes as simple as that:

var query = new MatchAllDocsQuery();
var data = GetDataForQuery(query);

The following example shows how to find a document by id using a TermQuery:

// Term values are always strings, even for numeric IDs
var term = new Term("Id", "2");
var query = new TermQuery(term);
var data = GetDataForQuery(query);
if (data.Any())
{
  // writes "A full-featured search engine for mobile"
  Console.WriteLine(data.First().Text);
}

The following example shows a common mistake. We're trying to use a PrefixQuery to find documents which have a Text-field that starts with “Intro”:

// this does not return any data!
var term = new Term("Text", "Intro");
var query = new PrefixQuery(term);
var data = GetDataForQuery(query);

We do have a document with the Text “Introducing the Technology”, so you would expect this to work. But since our Text field is analyzed using the StandardAnalyzer, the text is tokenized and, in this case, stored in lowercase. A PrefixQuery built by hand is not analyzed, so the prefix has to match the stored form. If you change your Term to new Term("Text", "intro"), the query returns the document.
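The effect is easy to reproduce in isolation. The following sketch indexes the sample sentence in memory and counts the hits for a given prefix; the PrefixDemo class and the CountPrefixHits method are hypothetical names, and the RAMDirectory is used only to keep the example self-contained.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class PrefixDemo
{
    // Hypothetical helper: counts documents whose Text field contains
    // a term starting with the given prefix (compared as-is, not analyzed).
    public static int CountPrefixHits(string prefix)
    {
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        using (var directory = new RAMDirectory())
        {
            using (var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.LIMITED))
            {
                var doc = new Document();
                doc.Add(new Field("Text", "Introducing the Technology", Field.Store.YES, Field.Index.ANALYZED));
                writer.AddDocument(doc);
                writer.Commit();
            }
            using (var searcher = new IndexSearcher(directory))
            {
                var query = new PrefixQuery(new Term("Text", prefix));
                return searcher.Search(query, 10).TotalHits;
            }
        }
    }
}
```

CountPrefixHits("Intro") should find nothing, while CountPrefixHits("intro") should find the document. If the prefix comes from user input, lowercasing it first (for example with ToLowerInvariant()) is one simple way to stay consistent with what the StandardAnalyzer stored.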

An easier way is to use the Lucene query syntax and the QueryParser:

var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
var parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "Text", analyzer);
var query = parser.Parse("Xamarin");
var data = GetDataForQuery(query);

Because we use the same analyzer (StandardAnalyzer) as we did while indexing the documents, the sample code above returns our document.

You can perform wildcard queries using an asterisk (*) in the middle or at the end of a word:

// this will match the word Technology
var query = parser.Parse("Tec*gy");

The query parser can do a lot more. A tilde character (~) at the end of a word indicates a fuzzy query:

// this will match Xamarin, assuming that we're a little drunk
var query = parser.Parse("amixarin~");

Summary

As you can see, indexing and searching data is actually pretty simple. The examples above are just scratching the surface. As soon as you start combining queries using BooleanQuery and giving weight to certain fields, it starts to get really serious. So far we didn't even talk about filtering and sorting.
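Just to give a taste of combining queries, here is a minimal sketch that requires one clause and boosts an optional second one. It is an illustration under assumptions, not code from our project: the BooleanQueryDemo class and its Find method are hypothetical, and it assumes the Occur enum and the Boost property as they exist in Lucene.Net 3.0.3.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class BooleanQueryDemo
{
    // Hypothetical helper: returns the texts that contain a term starting with
    // requiredPrefix, ranking those that also contain preferredTerm higher.
    public static List<string> Find(string[] texts, string requiredPrefix, string preferredTerm)
    {
        var results = new List<string>();
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

        using (var directory = new RAMDirectory())
        {
            using (var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.LIMITED))
            {
                foreach (var text in texts)
                {
                    var doc = new Document();
                    doc.Add(new Field("Text", text, Field.Store.YES, Field.Index.ANALYZED));
                    writer.AddDocument(doc);
                }
                writer.Commit();
            }

            // MUST: a document has to match this clause to be returned at all
            var required = new PrefixQuery(new Term("Text", requiredPrefix));
            // SHOULD: optional clause; matching it raises the score
            var preferred = new TermQuery(new Term("Text", preferredTerm));
            preferred.Boost = 2.0f;

            var query = new BooleanQuery();
            query.Add(required, Occur.MUST);
            query.Add(preferred, Occur.SHOULD);

            using (var searcher = new IndexSearcher(directory))
            {
                // Results come back ordered by score, best match first
                foreach (var scoreDoc in searcher.Search(query, 10).ScoreDocs)
                {
                    results.Add(searcher.Doc(scoreDoc.Doc).Get("Text"));
                }
            }
        }
        return results;
    }
}
```

Run against the four sample texts from the indexing example with the prefix "search" and the preferred term "mobile", this should return the two texts containing "search" or "searchable", with the one that also mentions "mobile" ranked first.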

I strongly suggest that you give it a try. We have worked for months on a project using Lucene and Xamarin together and indexed thousands of documents. The performance and possibilities are simply amazing.

If you are curious and have already tried it out, you could also have a look at the Linq to Lucene project. I haven't tried it on a mobile device so far, but it helps a lot to get started.

The code examples were tested on Lucene.Net version 3.0.3. Things might differ significantly in older or newer versions. The stable version on apache.org hasn't changed for quite a while. If you want to get the latest version, which is under active development, you can clone the GitHub repository (links below).

Links