Search engines are only for searching text, right? Wrong! At their heart, search engines are all about quickly and efficiently filtering and then ranking data according to some notion of similarity (a notion that’s flexibly defined in Lucene and Solr). Search engines also deal effectively with both sparse data and ambiguous data, which are hallmarks of modern data applications. Lucene and Solr are capable of crunching numbers, answering complex geospatial questions (as you’ll see shortly), and much more. These capabilities blur the line between search applications and traditional database applications (and even NoSQL applications).
For example, Lucene and Solr now:
- Support several types of joins and grouping options
- Have optional column-oriented storage
- Provide several ways to deal with text and with enumerated and numerical data types
- Enable you to define your own complex data types and storage, ranking, and analytics functions
To get started, you need the following prerequisites:
- Lucene and Solr.
- Java 6 or higher.
- A modern web browser. (I tested on Google Chrome and Firefox.)
- 4 GB of disk space (less if you don’t want to use all of the flight data).
- Terminal access with a bash (or similar) shell on *nix. For Windows, you need Cygwin. I only tested on OS X with the bash shell.
- wget, if you choose to download the data by using the download script that’s in the sample code package. You can also download the flight data manually.
- Apache Ant 1.8+ for compilation and packaging purposes, if you want to run any of the Java code examples.
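Before moving on, it can help to confirm that the command-line tools listed above are on your PATH. The following is a minimal sketch and is not part of the sample code package: the function names have_tool and check_prereqs are made up for illustration, and the list of tools you pass in is up to you.

```shell
# Succeeds (exit status 0) if the named command can be found on the PATH.
# have_tool is a hypothetical helper name, not something from the sample code.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Prints one status line per tool and returns non-zero if any tool is missing.
# check_prereqs is likewise a hypothetical name used only in this sketch.
check_prereqs() {
  local missing=0
  local tool
  for tool in "$@"; do
    if have_tool "$tool"; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example call (assumed tool list, taken from the prerequisites above):
#   check_prereqs java ant wget
```

The sketch only checks that each tool exists, not which version is installed, so the Java 6 and Ant 1.8 minimums still need a manual look (for example, with java -version and ant -version).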
With the prerequisites in place, complete these steps to set up the sample application:
- Download this article’s sample code ZIP file and unzip it to a directory of your choice. I’ll refer to this directory as $SOLR_AIR.
- At the command line, change to the $SOLR_AIR directory:
- Start Solr:
- Run the script that creates the necessary fields to model the data:
- Point your browser at http://localhost:8983/solr/#/ to display the new Solr Admin UI. Figure 1 shows an example:
Figure 1. Solr UI
- At the terminal, view the contents of the bin/download-data.sh script for details on what to download from RITA and OpenFlights. Download the data sets either manually or by running the script:
The download might take significant time, depending on your bandwidth.
- After the download is complete, index some or all of the data.
To index all data:
To index data from a single year, use any value between 1987 and 2008 for the year. For example:
- After indexing is complete (which might take significant time, depending on your machine), point your browser at http://localhost:8983/solr/collection1/travel. You’ll see a UI similar to the one in Figure 2:
Figure 2. The Solr Air UI
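Two of the steps above lend themselves to a quick sanity check from the terminal: confirming that a year falls in the 1987 to 2008 range before indexing it, and confirming that Solr is answering on port 8983 before opening the browser. The following is a rough sketch under stated assumptions. The helper names valid_year and solr_is_up are invented here, curl is assumed to be installed (it is not among the listed prerequisites), and the default URL is the Admin UI address from the steps without its #/ fragment, which is never sent to the server.

```shell
# Succeeds if $1 is a four-digit year within the range the flight data
# covers according to the steps above (1987 through 2008).
# valid_year is a hypothetical helper name.
valid_year() {
  case "$1" in
    [0-9][0-9][0-9][0-9]) ;;   # exactly four digits: keep checking
    *) return 1 ;;             # anything else is rejected
  esac
  [ "$1" -ge 1987 ] && [ "$1" -le 2008 ]
}

# Succeeds if the given URL (default: the local Solr base URL) answers
# without an HTTP error status. Assumes curl is available.
# solr_is_up is a hypothetical helper name.
solr_is_up() {
  curl --silent --fail --output /dev/null "${1:-http://localhost:8983/solr/}"
}

# Example calls:
#   valid_year 1999 && echo "1999 is in range"
#   solr_is_up && echo "Solr is answering on port 8983"
```

Neither helper replaces the scripts in the sample code package. They only guard against a mistyped year or a Solr instance that has not finished starting.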