This article discusses the Google Search Appliance, and some aspects to consider when deciding whether it might be the appropriate search solution.
Blue Fish Development Group
701 Brazos St. #700
Austin, TX 78701
(512) 469-9300
If you’re reading this article, chances are pretty good you’re familiar with Google. You may even have used it to find this article. Over the past decade, Google has established itself as the most widely used web search engine, and we all know that when we’re looking for something on the web, we start with Google. It has become such a pervasive part of our web experience that we even use ‘google’ as a verb (both the Oxford English Dictionary
What you may not be as familiar with are Google’s enterprise offerings. Google says that their mission is “to organize the world’s information and make it universally accessible and useful.”
There are a number of reasons you may be interested in a Google Search Appliance to tackle your enterprise search problems, not the least of which is the Google brand. Google has made a name for itself as the premier web search engine, and years of using Google have made users comfortable with the way it works. They are accustomed to the way it looks and feels, and they place a great deal of trust in the results they receive. If introducing a different tool might be jarring to your users, this is an element that is not to be discounted.
Google also commands an impressive engineering department that is hard at work improving search algorithms and developing toolkits to leverage the latest technologies. They are masters at leveraging the open source community to further speed development efforts. While the core algorithms of what makes Google tick are kept tightly under wraps, the interfaces used to manipulate Google and many of its associated components (Google Maps, Google Calendar, etc.) are opened up to the world to encourage forward development by a broader community.
One differentiating factor that you have probably already figured out for yourself is that the Google Search Appliance is an appliance. By combining a hardware and software solution (the GSA runs a special in-house flavor of Linux called ‘Google Linux’), Google has locked down the control points to improve ease of installation and configuration. This also gives them the opportunity to highly optimize and tune their system, in contrast to the broader platform support that some other search vendors provide.
When you set up one of these appliances (a rack-mount box with a bright yellow swiss-cheese motif), you are presented with the minimal set of configuration options needed to get up and running. Configuration is all accomplished through a straightforward web interface, and it’s almost trivial to point it at your intranet or file shares to begin crawling and indexing content right away.
The extreme degree of standardization inherent in an appliance deployment model has other advantages as well. Multiple GSAs can be “chained” together for increased capacity and performance with minimal configuration. Updates can be reliably retrieved by the appliance with little worry about version conflicts or platform quirks. Troubleshooting can be greatly simplified by removing a multitude of variables that might be available in an environment where hardware, operating system, and software come from multiple different vendors.
This approach also has its drawbacks, however. The search algorithms used by the appliance can’t readily be tuned if the need arises, and deployments requiring any significant customization can become more cumbersome.
Crawling and extracting content from websites both internal and external is supported out-of-the-box. So is crawling file systems and querying SQL databases, and these are all easily configurable through the GSA’s configuration web interface. However, this may only cover a small fraction of business information within an organization.
Many organizations have large numbers of critical documents in other places that they need to search. These documents might reside in content management systems such as Documentum or FileNet. They might reside in source control systems like CVS or Subversion. They might also reside in collaboration systems such as eRoom or SharePoint, or elsewhere. For these types of additional content repositories, Google provides a connector framework that can be used to connect to them and extract content. The connector framework provides out-of-the-box support for Documentum, FileNet, SharePoint, and LiveLink. Additional connectors for other systems, or those providing advanced functionality, are available from 3rd party developers (at costs ranging from free to tens of thousands of dollars). Connectors can also be custom-built if necessary for obscure or home-grown systems that house important content.
These connectors, however, require their own application servers to run, and can’t reside on the appliance itself. In a large, heterogeneous enterprise environment, the cost and work involved in setting up and maintaining a large number of connectors may become a significant amount of overhead that can undermine the value provided by a consolidated search interface.
Google Enterprise can extract and index content from over 200 document formats, leveraging the Oracle (formerly Stellent) libraries for extracting content from obscure document formats. The supported list is about what you would expect, encompassing most standard text, markup, image, word-processing, spreadsheet, presentation, and database types, along with a myriad of others. Conspicuously missing currently is support for Office 2007 documents, although I’ve been assured that’s coming in an update in the next few months.

Figure 1: Google Enterprise Interface
The GSA provides an out-of-the-box search interface that is about as close to google.com as you can get.
Reporting comes standard as well. Out-of-the-box reports provide administrators visibility into what searches are being performed, and this data can be sliced across data sets, time, user identity, and search terms. You can also see how the results fared against the users’ queries. How many users clicked on the top result? How many clicked on the 2nd? How many had to click down to the next page? And so forth.
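To give a feel for the kind of click analysis described above, here is a minimal Python sketch. It is not the GSA’s reporting code or log format; the function name, the list-of-click-positions input, and the page size of 10 are all assumptions made for illustration.

```python
from collections import Counter


def click_position_report(click_ranks, page_size=10):
    """Summarize which result positions users clicked.

    click_ranks is a list of 1-based result positions, one entry per click.
    (Hypothetical input format, not an actual GSA log.)
    """
    total = len(click_ranks)
    if total == 0:
        return {"top_result": 0.0, "second_result": 0.0, "later_pages": 0.0}
    counts = Counter(click_ranks)
    # Any click beyond the first page means the user had to page down.
    later = sum(1 for rank in click_ranks if rank > page_size)
    return {
        "top_result": counts[1] / total,
        "second_result": counts[2] / total,
        "later_pages": later / total,
    }


# Ten sample clicks: five on the top result, two on the 2nd, two past page one.
report = click_position_report([1, 1, 2, 1, 14, 3, 1, 2, 11, 1])
```

A high share of clicks on later pages is the sort of signal that would suggest the top results are not answering the query.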
All the previous points aside, what you really want to know is whether a Google Search Appliance will help your users find the information that they are looking for.
Google has built their name and reputation on the success of google.com and its sometimes uncanny ability to find the most useful, relevant results - the ones you’re looking for. One of the primary ideas behind the relevance ranking scheme is a patented algorithm known as PageRank.
This works like a charm in the context of the World Wide Web, but behind the firewall of an organization, the picture is likely to be very different. Documents that reside on a file share, or in a content management system, don’t generally link to one another. So there is no vast, interconnected web to leverage. The content of documents and their metadata may be the only pieces of information available to a search engine, and in many cases, that metadata may be extremely sparse.
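To make the contrast concrete, here is a minimal Python sketch of link-based ranking. It is a simplified textbook power iteration in the spirit of PageRank, not Google’s actual implementation; the function name, the damping factor of 0.85, and the sample document names are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified link-based ranking by power iteration.

    links maps each document to the list of documents it links to.
    """
    docs = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(docs)
    rank = {d: 1.0 / n for d in docs}
    for _ in range(iterations):
        new_rank = {d: (1.0 - damping) / n for d in docs}
        for d in docs:
            targets = links.get(d, [])
            if targets:
                # Pass this document's rank along its outgoing links.
                share = damping * rank[d] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A document with no outgoing links spreads its rank evenly.
                share = damping * rank[d] / n
                for t in docs:
                    new_rank[t] += share
        rank = new_rank
    return rank


# A small "web": several pages link to the home page.
web = {"home": ["about"], "about": ["home"], "blog": ["home"], "faq": ["home"]}
# A file share: the same number of documents, but no links at all.
file_share = {"policy.doc": [], "budget.xls": [], "notes.txt": [], "plan.ppt": []}

web_rank = pagerank(web)
share_rank = pagerank(file_share)
```

In the linked example, "home" ends up with the highest score because the other pages point to it. In the file share example every document receives exactly the same score, so link analysis alone gives the search engine nothing to rank by.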
The smart folks over at Google, who recognize that what works for web search doesn’t necessarily work for enterprise search, have been busily writing more capable algorithms to handle the problems inherent in the enterprise search space. A not-so-quick look through Google’s many filed patents shows more than a few that are clearly targeted towards the complexities of enterprise search, such as “Methods and Systems for Determining a Meaning of a Document to Match the Document to Content” or “Method and Apparatus for Characterizing Documents Based on Clusters of Words”. These are just a couple of examples.
The Google Search Appliance leverages “hundreds of factors, only one of which is PageRank”
I have yet to see a detailed and unbiased “search-off” between various competitors in the enterprise search space. (Perhaps if I can accumulate enough free time in the future, I may attempt something like this myself.) As such, it’s difficult to say how the out-of-the-box results relevance of the Google Enterprise solution really stacks up against the competition. Google claims as many success stories as any other search solution provider, and the reality is that because the organization of information within different enterprises can vary so drastically, some search algorithms may simply work better for some data than others.
Google is so confident in their results relevance that they (unlike many competitors) do not provide much configuration or tooling for tuning search results. Even if the out-of-the-box results from a GSA work very well for many situations, when they fall short, there’s not much recourse. Tuning can be a key element of a successful search solution, especially within enterprises where years of cultural evolution of how knowledge is handled may make for unusual conventions or organizational schemes that could easily skew the results of a search system that doesn’t understand them. If, for example, a particular custom piece of metadata is critical to the organization of data within your enterprise, a system would need to recognize that piece of metadata and be able to present it to the users in order to be most effective.
One method of tuning that is available is ‘source biasing’. This allows an administrator to specify that results from a particular source (say, a well-controlled content management system) be regarded as more authoritative than those from other sources (say, an informal wiki page).
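Conceptually, biasing by source can be pictured as a per-source weight applied to each result’s base relevance score. The Python sketch below is only an illustration of that idea; how the GSA applies the bias internally is not documented here, and the function name, the weights, and the sample scores are all assumptions.

```python
def apply_source_bias(results, bias_by_source, default_bias=1.0):
    """Re-rank results by multiplying each base score by a per-source weight.

    results is a list of (title, source, base_score) tuples.
    The weights are hypothetical, not actual GSA settings.
    """
    rescored = [
        (title, source, score * bias_by_source.get(source, default_bias))
        for title, source, score in results
    ]
    return sorted(rescored, key=lambda r: r[2], reverse=True)


results = [
    ("Expense policy (wiki draft)", "wiki", 0.80),
    ("Expense policy v3", "cms", 0.70),
]
# Treat the content management system as more authoritative than the wiki.
bias = {"cms": 1.5, "wiki": 0.8}
ranked = apply_source_bias(results, bias)
```

With these sample weights, the content management system’s document moves ahead of the wiki draft even though its unbiased score was lower.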
You can also define synonyms as part of the system configuration, and these synonyms can be configured to either be automatically included as part of the search itself, or simply to provide the user with search suggestions. For example, if a user enters ‘SOP’ as their search term, the system could search for both ‘SOP’ and ‘standard operating procedure’ under the covers, or alternately, suggest “did you mean ‘standard operating procedure’?” depending on configuration.
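The two synonym behaviors can be sketched in a few lines of Python. This is an illustration of the concept only, not the GSA’s synonym file format or API; the dictionary, the function name, and the "expand" and "suggest" mode names are assumptions.

```python
# Hypothetical synonym table; keys are lower-cased search terms.
SYNONYMS = {"sop": ["standard operating procedure"]}


def expand_query(query, synonyms, mode="expand"):
    """Apply synonyms either silently ("expand") or as suggestions ("suggest")."""
    alternatives = synonyms.get(query.strip().lower(), [])
    if mode == "expand":
        # Search for the original term and its synonyms under the covers.
        return {"search_terms": [query] + alternatives, "suggestions": []}
    # Search only for what the user typed, and offer "did you mean ...?".
    return {"search_terms": [query], "suggestions": alternatives}


expanded = expand_query("SOP", SYNONYMS, mode="expand")
suggested = expand_query("SOP", SYNONYMS, mode="suggest")
```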
One potential way to alleviate the pitfalls of imperfect (or unknown) relevance quality is to arm users with rich metadata. When such metadata is available, and the search system understands how to leverage it, users can use this metadata to quickly home in on what they are after by selecting search refinements driven by this metadata. Although the Google Search Appliance provides minimal support for retrieving, displaying, and leveraging metadata, it is not a focus area for the product. Contrast this with Endeca, who have mastered the complexities of leveraging rich metadata in a search context.
The GSA capabilities regarding metadata are extremely limited out-of-the-box, but can be augmented by a free set of code provided by Google.
The problem is that sometimes, the metadata that would be most useful in this context simply doesn’t exist, or at least, doesn’t exist reliably. Few organizations have the content management discipline to ensure that every document on their network is tagged with accurate information about who wrote it, what version it is, which departments it may be relevant for, etc.
Google’s stance is that since this sort of metadata is generally so unreliable behind the firewall, there’s little point in going to lengths to drive the search results based on said metadata, and it is instead more advantageous to focus on the default (non-metadata driven) results rankings. In many cases, this is absolutely true. However, if the information management discipline is mature enough to provide this metadata, it would be an egregious oversight to omit its use from a search solution.
Several other features provided by Google Enterprise are worth mentioning here. These features bring a little more polished feel to the search solution and, in addition to making it feel “more like Google”, can bring some interesting and useful tools to the fingertips of your users. In essence, the goal of all these features is to provide the users with shortcuts that allow them to get to what they’re looking for as quickly and easily as possible. Note that this is by no means an exhaustive list of current features, and history has shown that as new features enter the Google ‘stack’, Google often makes an effort to incorporate analogous capabilities into the GSA.
Try going to google.com and entering “weather” along with your zip code as the search terms. The first result that comes back isn’t exactly a result like the others, but chances are pretty good that it’s telling you the information that you were after, namely, “What’s the weather like in the zip code I entered?” This is what Google calls “OneBox”, and they have brought it to their enterprise solutions as well.
OneBox works by recognizing patterns in the search terms, and firing off a real-time query to an external system to bring back information. In the weather example, the information presented above the results is not stored in Google’s index. It is harvested at query time from a site that keeps current weather data.
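The pattern-then-lookup flow described above can be sketched roughly as follows in Python. This is not the OneBox module API; the module table, the function names, and the stubbed weather provider (which returns a placeholder string instead of calling a real service) are assumptions for illustration.

```python
import re


def lookup_weather(zip_code):
    # Stand-in for a real-time call to an external weather service.
    return f"Current weather for {zip_code}: (fetched live at query time)"


# Each hypothetical module pairs a trigger pattern with a provider function.
ONEBOX_MODULES = [
    (re.compile(r"^weather\s+(\d{5})$", re.IGNORECASE), lookup_weather),
]


def onebox_lookup(query):
    """Return live information if the query matches a module's pattern."""
    for pattern, provider in ONEBOX_MODULES:
        match = pattern.match(query.strip())
        if match:
            # The answer comes from the external system, not the search index.
            return provider(*match.groups())
    return None  # No module triggered; only ordinary results are shown.


hit = onebox_lookup("weather 78701")
miss = onebox_lookup("expense policy")
```

The key design point is that the provider is called at query time, so the information shown above the regular results is always current rather than whatever was last crawled.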
OneBox modules can connect to everything from weather services and stock quotes to salesforce.com installations and company directories. Here again, Google has provided a framework to enable 3rd parties to create OneBox modules, so the door is open to connect to any system that you think might provide value to your users. Google hosts a gallery of OneBox modules
Similar to how google.com displays a couple of sponsored links at the top of the search results for search queries matching certain words, a GSA can display what it calls ‘Keymatches’ in the same way. If you happen to know that a particular document is the authoritative source for a subject in your organization, but it doesn’t always show up at the top of the search results for certain terms that people may use, you can create a Keymatch for that search term that directs users to the correct document.
You even have the option of opening this Keymatch authoring capability up to other users within your organization, not just administrators. Users who have some authoritative knowledge about certain information areas within the organization can now become empowered to quickly and easily help steer other users in the direction of the “right” information.
Of course, implementing this sort of scheme definitely requires some discipline and trust within your user community. You can imagine the usability nightmare that could emerge if dozens of users all entered different Keymatches for the same search term!
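A Keymatch can be thought of as a simple lookup from a trigger word to a promoted document. The Python sketch below illustrates that idea and the clutter that results when several people register entries for the same term; it is not the GSA’s Keymatch format, and the function name, tuple layout, and intranet URLs are hypothetical.

```python
def keymatches_for(query, keymatches):
    """Return promoted (title, url) entries whose trigger word is in the query.

    keymatches is a list of (trigger, title, url) tuples (hypothetical layout).
    """
    words = set(query.lower().split())
    return [
        (title, url)
        for trigger, title, url in keymatches
        if trigger.lower() in words
    ]


KEYMATCHES = [
    ("vacation", "Official Time-Off Policy", "http://intranet.example.com/hr/time-off"),
    ("vacation", "Bob's Vacation Tips", "http://intranet.example.com/wiki/bob"),
    ("401k", "Retirement Plan Overview", "http://intranet.example.com/hr/401k"),
]

promoted = keymatches_for("vacation policy", KEYMATCHES)
```

Here the query for "vacation policy" triggers two competing promoted links, which is exactly the situation that calls for some discipline about who may author Keymatches.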
For certain pieces of information that are searched for most often, Google has provided a facility for users to get the information they’re after without even finishing their search. Much like the “type-ahead” functionality seen now on many websites, the “Search-As-You-Type” feature communicates with the server under the covers to perform queries as you are typing in your request. For matches that occur against a well-known set of results (say, dictionary definitions, stock quotes, or company directory listings) the results can be shown immediately to the user in a JavaScript pop-up window.
Being able to easily find the phone number or email address of a colleague by simply typing the first few letters of their name can be a huge time saver, especially in an organization with hundreds or thousands of employees.
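On the server side, a lookup like this amounts to a prefix search over a well-known list. The following Python sketch shows one simple way to do that with a sorted directory and a binary search; it is not how the GSA implements the feature, and the directory entries, extensions, and function name are made up for illustration.

```python
import bisect

# Hypothetical company directory, kept sorted by lower-cased name.
DIRECTORY = sorted([
    ("alice anderson", "x1001"),
    ("alan baker", "x1002"),
    ("bob chen", "x1003"),
])


def suggest(prefix, directory, limit=5):
    """Return up to `limit` directory entries whose name starts with prefix."""
    prefix = prefix.lower()
    names = [name for name, _ in directory]
    # Binary search finds the first name that could match the prefix.
    start = bisect.bisect_left(names, prefix)
    matches = []
    for name, phone in directory[start:]:
        if not name.startswith(prefix):
            break  # Sorted order means no later entry can match either.
        matches.append((name, phone))
        if len(matches) == limit:
            break
    return matches


al_matches = suggest("al", DIRECTORY)
```

A browser would call something like this on each keystroke and render the returned entries in the pop-up as the user types.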
For all the slick features that the Google Search Appliance brings to the table, there are some features available from other enterprise search vendors that are notably absent in Google’s offering.
I’ve already touched on a couple of these. Perhaps most notable is the almost non-existent search tuning capability. Although in many ways search tuning can be a headache unto itself, the lack of any capability for tuning may be a turn-off for some potential customers. If the out-of-the-box results work well for your data and your users, then you probably don’t care about any tuning capabilities. However, if you do need to improve those results, you will miss the tuning capabilities dearly. And the unfortunate reality is that you may not know which category you fall into until well into the implementation of a particular solution.
Another significant gap is around leveraging rich metadata, as I discussed briefly in the section entitled “Metadata to the Rescue?” Again, this is a situation where, if you don’t need the metadata capabilities (you may just not have the metadata available to work with), then you won’t miss them, but if those capabilities will be instrumental in building an effective search solution, their absence will be a deal-breaker.
For organizations looking forward to increased content management discipline and maturity, this may be an especially important factor. Just because that metadata doesn’t exist now doesn’t mean it might not in the future, and when it does exist, it will be important to have the capabilities required to make the most of it.
Another key factor to consider is security integration. The GSA can integrate with various security systems, such as LDAP and NTLM, and it can use a variety of authentication methods such as HTTP Basic, HTTPS, X.509, and SAML. However, the GSA is only able to use a late-binding security model. That is, security credentials are checked for each search result at the time it is returned to the user. A late-binding model ensures that access decisions always reflect current permissions, but can often be more difficult to implement and less performant than an early-binding model. It may also require significant integration if secure documents reside in 3rd party repositories, as the GSA must be able to “ask” that repository at query time about the security status of particular documents with respect to the user’s credentials. This integration can become increasingly complex if multiple different repositories with different security models need to be searched together.
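The late-binding check can be sketched as a filter that runs at query time, asking the repository about each candidate result in turn. The Python below is a conceptual illustration only, not the GSA’s authorization interface; the function names, the in-memory ACL standing in for a remote repository, and the user and document names are assumptions.

```python
def late_binding_filter(user, ranked_doc_ids, can_read, limit=10):
    """Check access at query time, one candidate result at a time."""
    visible = []
    for doc_id in ranked_doc_ids:
        # In a real deployment this would be a call out to the repository.
        if can_read(user, doc_id):
            visible.append(doc_id)
            if len(visible) == limit:
                break
    return visible


# Stand-in for a repository's access control list (hypothetical data).
ACL = {
    "hr-salaries.xls": {"carol"},
    "handbook.pdf": {"carol", "dave"},
}


def repository_can_read(user, doc_id):
    return user in ACL.get(doc_id, set())


ranked = ["hr-salaries.xls", "handbook.pdf"]
dave_results = late_binding_filter("dave", ranked, repository_can_read)
carol_results = late_binding_filter("carol", ranked, repository_can_read)
```

Every query pays for one access check per candidate result, which is where the performance cost comes from. An early-binding model would instead store the permissions in the index at crawl time and filter against that copy, trading freshness for speed.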
What much of this boils down to is that the GSA is intended as a “general-use” search tool. After all, it is sold as an appliance. It is intended to minimize effort and maximize value out-of-the-box. It is not intended as a platform for building focused, search-based applications.
You should keep in mind the following points when evaluating Google Enterprise as a potential search solution:
I hope this article has provided some helpful insights into the Google Search Appliance, and some of the relevant aspects to consider. If you have any questions or comments, I encourage you to comment on this article.
[...] Appliance than I thought there was. Dan Burton, a consultant here at Blue Fish, recently wrote an overview of the Google Search Appliance, and I learned a few things I didn’t previously know. One innovative technique they use is [...]
Blue Fish Development Group » Blog Archive » Maybe Google Search Appliance is More Googley Than I Thought | April 9th, 2008 1:16 pm