If you’re reading this article, chances are pretty good you’re familiar with Google. You may even have used it to find this article. Over the past decade, Google has established itself as the most widely used web search engine, and we all know that when we’re looking for something on the web, we start with Google. It has become such a pervasive part of our web experience that we even use ‘google’ as a verb (both the Oxford English Dictionary [1] and Merriam-Webster [2] authoritatively incorporated ‘google’ as part of the English language in 2006).
What you may not be as familiar with are Google’s enterprise offerings. Google says that their mission is “to organize the world’s information and make it universally accessible and useful.” [3] The World Wide Web obviously represents a huge portion of the world’s information, but it is by no means all of it. Much more information is locked away behind corporate firewalls, residing in documents on file shares, tucked away deep inside content management systems and source control systems, embedded in databases, and hidden in all manner of other places. But how do you organize it and make it accessible and useful? Google’s answer to this problem is what they call “Google Enterprise” and the “Google Search Appliance” (GSA). So what does a Google Search Appliance provide? What’s missing? How does it work? And is a Google Search Appliance the right choice for your enterprise search needs?
The Power of the Google Name
There are a number of reasons you may be interested in a Google Search Appliance to tackle your enterprise search problems, not the least of which is the Google brand. Google has made a name for itself as the premier web search engine, and years of using Google have made users comfortable with the way it works. They are accustomed to the way it looks and feels, and they place a great deal of trust in the results they receive. If introducing a different tool might be jarring to your users, this is an element that is not to be discounted.
Google also commands an impressive engineering department that is hard at work improving search algorithms and developing toolkits to leverage the latest technologies. They are masters at leveraging the open source community to further speed development efforts. While the core algorithms of what makes Google tick are kept tightly under wraps, the interfaces used to manipulate Google and many of its associated components (Google Maps, Google Calendar, etc.) are opened up to the world to encourage forward development by a broader community.
Search as an Appliance
One differentiating factor that you have probably already figured out for yourself is that the Google Search Appliance is an appliance. By combining a hardware and software solution (the GSA runs a special in-house flavor of Linux called ‘Google Linux’), Google has locked down the control points to improve ease of installation and configuration. This also gives them the opportunity to highly optimize and tune their system, in contrast to the broader platform support that some other search vendors provide.
When you set up one of these appliances (a rack-mount box with a bright yellow swiss-cheese motif), you are presented with the minimal set of configuration options needed to get up and running. Configuration is all accomplished through a straightforward web interface, and it’s almost trivial to point it at your intranet or file shares to begin crawling and indexing content right away.
The extreme degree of standardization inherent in an appliance deployment model has other advantages as well. Multiple GSAs can be “chained” together for increased capacity and performance with minimal configuration. Updates can be reliably retrieved by the appliance with little worry about version conflicts or platform quirks. Troubleshooting can be greatly simplified by removing a multitude of variables that might be available in an environment where hardware, operating system, and software come from multiple different vendors.
This approach also has its drawbacks, however. The search algorithms used by the appliance can’t readily be tuned if the need arises, and deployments requiring any significant customization can become more cumbersome.
Crawling and Indexing Content
Crawling and extracting content from websites both internal and external is supported out-of-the-box. So is crawling file systems and querying SQL databases, and these are all easily configurable through the GSA’s configuration web interface. However, this may only cover a small fraction of business information within an organization.
Many organizations have large numbers of critical documents in other places that they need to search. These documents might reside in content management systems such as Documentum or FileNet. They might reside in source control systems like CVS or Subversion. They might also reside in collaboration systems such as eRoom or SharePoint, or elsewhere. For these types of additional content repositories, Google provides a connector framework that can be used to connect to them and extract content. The connector framework provides out-of-the-box support for Documentum, FileNet, SharePoint, and LiveLink. Additional connectors for other systems, or those providing advanced functionality, are available from 3rd party developers (at costs ranging from free to tens of thousands of dollars). Connectors can also be custom-built if necessary for obscure or home-grown systems that house important content.
These connectors, however, require their own application servers to run, and can’t reside on the appliance itself. In a large, heterogeneous enterprise environment, the cost and work involved in setting up and maintaining a large number of connectors may become a significant amount of overhead that can undermine the value provided by a consolidated search interface.
Google Enterprise can extract and index content from over 200 document formats, leveraging the Oracle (formerly Stellent) libraries for extracting content from obscure document formats. The supported list is about what you would expect, encompassing most standard text, markup, image, word-processing, spreadsheet, presentation, and database types, along with a myriad of others. Conspicuously missing currently is support for Office 2007 documents, although I’ve been assured that’s coming in an update in the next few months.
Serving Up Results
The GSA provides an out-of-the-box search interface that is about as close to google.com as you can get.
For anyone who has used google.com, the organization, layout, and style will be extremely familiar and instantly usable. For those customers wishing to customize the look and feel of their search results, the interface can be adjusted using a set of templates and style sheets to provide a customized look and feel. The search results information can also be returned as raw XML, which can be useful if you are trying to integrate the results powered by the GSA into another application within your enterprise.
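To give a feel for what integrating the raw XML output into another application might involve, here is a minimal Python sketch. The sample document and its element names (GSP, RES, R, U, T, S) are assumptions made for illustration, as are the example.com URLs; check the XML reference for your appliance version before relying on any of them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample response. The element names and attributes are
# assumptions for illustration, not a guaranteed description of GSA output.
SAMPLE_XML = """<GSP VER="3.2">
  <Q>expense policy</Q>
  <RES SN="1" EN="2">
    <R N="1">
      <U>http://intranet.example.com/policies/expenses.html</U>
      <T>Expense Policy</T>
      <S>How to file an expense report.</S>
    </R>
    <R N="2">
      <U>http://intranet.example.com/hr/travel.html</U>
      <T>Travel Guidelines</T>
      <S>Booking travel and claiming expenses.</S>
    </R>
  </RES>
</GSP>"""


def parse_results(xml_text):
    """Return one dict per result with its rank, URL, title, and snippet."""
    root = ET.fromstring(xml_text)
    results = []
    for r in root.iter("R"):
        results.append({
            "rank": int(r.get("N")),
            "url": r.findtext("U", default=""),
            "title": r.findtext("T", default=""),
            "snippet": r.findtext("S", default=""),
        })
    return results


if __name__ == "__main__":
    for hit in parse_results(SAMPLE_XML):
        print(hit["rank"], hit["title"], hit["url"])
```

Once the results are plain dictionaries like these, another application can render, filter, or merge them with its own data however it likes.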
Reporting comes standard as well. Out-of-the-box reports provide administrators visibility into what searches are being performed, and this data can be sliced across data sets, time, user identity, and search terms. You can also see how the results fared against the users’ queries. How many users clicked on the top result? How many clicked on the 2nd? How many had to click down to the next page? And so forth.
All the previous points aside, what you really want to know is whether a Google Search Appliance will help your users find the information that they are looking for.
Why PageRank isn’t Enough
Google has built their name and reputation on the success of google.com and its sometimes uncanny ability to find the most useful, relevant results – the ones you’re looking for. One of the primary ideas behind the relevance ranking scheme is a patented algorithm known as PageRank. [5] This leverages the vast, interconnected nature of the web to judge an individual page’s importance based on how, and how often, it is linked to by other pages. The basic idea is that the more often authoritative pages link to a page, the more authoritative that page itself is.
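The link-based idea can be illustrated with the simplified, textbook form of PageRank computed by repeated iteration. This is only a sketch: the damping factor of 0.85, the iteration count, and the tiny four-page graph are assumptions chosen for illustration, and it is not a description of what Google actually runs in production.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with every page equally important.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its importance on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical four-page web: "home" is linked to by everything else.
    graph = {
        "home": ["products"],
        "products": ["home"],
        "blog": ["home", "products"],
        "press": ["home"],
    }
    for page, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
        print(f"{page}: {score:.3f}")
```

Notice that the calculation depends entirely on the links between pages, which is why the approach has so little to work with once documents stop linking to one another.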
This works like a charm in the context of the World Wide Web, but behind the firewall of an organization, the picture is likely to be very different. Documents that reside on a file share, or in a content management system, don’t generally link to one another. So there is no vast, interconnected web to leverage. The content of documents and their metadata may be the only pieces of information available to a search engine, and in many cases, that metadata may be extremely sparse.
The smart folks over at Google, who recognize that what works for web search doesn’t necessarily work for enterprise search, have been busily writing more capable algorithms to handle the problems inherent in the enterprise search space. A not-so-quick look through Google’s many filed patents shows more than a few that are clearly targeted towards the complexities of enterprise search, such as “Methods and Systems for Determining a Meaning of a Document to Match the Document to Content” or “Method and Apparatus for Characterizing Documents Based on Clusters of Words”. These are just a couple of examples. [6]
The Google Search Appliance leverages “hundreds of factors, only one of which is PageRank” [7] to ultimately rank results and return them to the user. These take into account not only literal document content, but also document structure, format, source, age, and other metadata. Google prides itself on the fact that this combination of advanced ranking techniques provides high quality out-of-the-box relevance ranking, on par with the quality of results users are accustomed to getting from google.com.
I have yet to see a detailed and unbiased “search-off” between various competitors in the enterprise search space. (Perhaps if I can accumulate enough free time in the future, I may attempt something like this myself.) As such, it’s difficult to say how the out-of-the-box results relevance of the Google Enterprise solution really stacks up against the competition. Google claims as many success stories as any other search solution provider, and the reality is that because the organization of information within different enterprises can vary so drastically, some search algorithms may simply work better for some data than others.
(Almost) No Tuning Allowed
Google is so confident in their results relevance that they (unlike many competitors) do not provide much configuration or tooling for tuning search results. Even if the out-of-the-box results from a GSA work very well for many situations, when they fall short, there’s not much recourse. Tuning can be a key element of a successful search solution, especially within enterprises where years of cultural evolution of how knowledge is handled may make for unusual conventions or organizational schemes that could easily skew the results of a search system that doesn’t understand them. If, for example, a particular custom piece of metadata is critical to the organization of data within your enterprise, a system would need to recognize that piece of metadata and be able to present it to the users in order to be most effective.
One method of tuning that is available is ‘source biasing’. This allows an administrator to specify that results from a particular source (say, a well-controlled content management system) be regarded as more authoritative than those from other sources (say, an informal wiki page).
You can also define synonyms as part of the system configuration, and these synonyms can be configured to either be automatically included as part of the search itself, or simply to provide the user with search suggestions. For example, if a user enters ‘SOP’ as their search term, the system could search for both ‘SOP’ and ‘standard operating procedure’ under the covers, or alternately, suggest “did you mean ‘standard operating procedure’?” depending on configuration.
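Conceptually, these two tuning options boil down to something like the following Python sketch. The synonym table, the source weights, and the function names are all hypothetical; on a real GSA both features are configured through the administration interface rather than written as code, and the way the appliance actually combines a bias with its relevance score is not public.

```python
# Hypothetical synonym table and per-source weights, for illustration only.
SYNONYMS = {"sop": ["standard operating procedure"]}
SOURCE_BIAS = {"cms": 1.5, "wiki": 0.8}


def expand_query(query, synonyms=SYNONYMS):
    """Return the original query plus any configured synonyms for it."""
    terms = [query]
    terms.extend(synonyms.get(query.lower(), []))
    return terms


def apply_source_bias(results, bias=SOURCE_BIAS):
    """Re-rank (doc_id, source, base_score) tuples by a per-source weight.

    Sources without a configured weight keep their base score (weight 1.0).
    """
    rescored = [
        (doc_id, source, score * bias.get(source, 1.0))
        for doc_id, source, score in results
    ]
    return sorted(rescored, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    print(expand_query("SOP"))
    print(apply_source_bias([
        ("wiki-page", "wiki", 1.0),
        ("controlled-doc", "cms", 0.9),
    ]))
```

In the example, the controlled document starts with a slightly lower score than the wiki page but ends up first once the source weights are applied.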
Metadata to the Rescue?
One potential way to alleviate the pitfalls of imperfect (or unknown) relevance quality is to arm users with rich metadata. When such metadata is available, and the search system understands how to leverage it, users can quickly home in on what they are after by selecting search refinements driven by this metadata. Although the Google Search Appliance provides minimal support for retrieving, displaying, and leveraging metadata, it is not a focus area for the product. Contrast this with Endeca, which has mastered the complexities of leveraging rich metadata in a search context.
The GSA capabilities regarding metadata are extremely limited out-of-the-box, but can be augmented by a free set of code provided by Google.
The problem is that sometimes, the metadata that would be most useful in this context simply doesn’t exist, or at least, doesn’t exist reliably. Few organizations have the content management discipline to ensure that every document on their network is tagged with accurate information about who wrote it, what version it is, which departments it may be relevant for, etc.
Google’s stance is that since this sort of metadata is generally so unreliable behind the firewall, there’s little point in going to lengths to drive the search results based on said metadata, and it is instead more advantageous to focus on the default (non-metadata driven) results rankings. In many cases, this is absolutely true. However, if the information management discipline is mature enough to provide this metadata, it would be an egregious oversight to omit its use from a search solution.
Several other features provided by Google Enterprise are worth mentioning here. These features bring a more polished feel to the search solution and, in addition to making it feel “more like Google”, can bring some interesting and useful tools to the fingertips of your users. In essence, the goal of all these features is to provide users with shortcuts that allow them to get to what they’re looking for as quickly and easily as possible. Note that this is by no means an exhaustive list of current features, and history has shown that as new features enter the Google ‘stack’, Google often makes an effort to incorporate analogous capabilities into the GSA.
Try going to google.com and entering “weather” along with your zip code as the search terms. The first result that comes back isn’t exactly a result like the others, but chances are pretty good that it’s telling you the information that you were after, namely, “What’s the weather like in the zip code I entered?” This is what Google calls “OneBox”, and they have brought it to their enterprise solutions as well.
OneBox works by recognizing patterns in the search terms, and firing off a real-time query to an external system to bring back information. In the weather example, the information presented above the results is not stored in Google’s index. It is harvested at query time from a site that keeps current weather data.
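The pattern-then-lookup behavior can be sketched in a few lines of Python. Everything here is hypothetical: the trigger pattern, the function names, and the stand-in provider (which returns canned data rather than calling a real service). It is meant only to show the shape of the idea, not the actual OneBox module format.

```python
import re


def weather_provider(match):
    """Stand-in for a real-time call to an external weather service."""
    return {"type": "weather", "zip": match.group("zip")}


# Each trigger pairs a query pattern with the provider to call when it matches.
TRIGGERS = [
    (re.compile(r"^weather\s+(?P<zip>\d{5})$", re.IGNORECASE), weather_provider),
]


def onebox_lookup(query):
    """Return live data for the first matching trigger, or None if none match."""
    for pattern, provider in TRIGGERS:
        match = pattern.match(query.strip())
        if match:
            return provider(match)
    return None


if __name__ == "__main__":
    print(onebox_lookup("weather 90210"))
    print(onebox_lookup("quarterly report"))
```

A query that matches no trigger simply falls through to the normal indexed results, which is why this kind of module can be added without disturbing ordinary searches.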
OneBox modules can connect to everything from weather services and stock quotes to salesforce.com installations and company directories. Here again, Google has provided a framework to enable 3rd parties to create OneBox modules, so the door is open to connect to any system that you think might provide value to your users. Google hosts a gallery of OneBox modules [9] that are written by 3rd parties and made available to other GSA users, usually for free. Many enterprise software vendors, such as Salesforce.com, Cognos, BusinessObjects, and others have published OneBox modules for use with their respective systems. One particularly slick module allows you to see not just other employees’ contact information, but also their free/busy schedules via their Exchange calendars.
Similar to how google.com displays a couple of sponsored links at the top of the search results for search queries matching certain words, a GSA can display what it calls ‘Keymatches’ in the same way. If you happen to know that a particular document is the authoritative source for a subject in your organization, but it doesn’t always show up at the top of the search results for certain terms that people may use, you can create a Keymatch for that search term that directs users to the correct document.
You even have the option of opening this Keymatch authoring capability up to other users within your organization, not just administrators. Users who have some authoritative knowledge about certain information areas within the organization can now become empowered to quickly and easily help steer other users in the direction of the “right” information.
Of course, implementing this sort of scheme definitely requires some discipline and trust within your user community. You can imagine the usability nightmare that could emerge if dozens of users all entered different Keymatches for the same search term!
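To make the Keymatch idea and its risk a bit more concrete, here is a small Python sketch of a promoted-result table that also reports terms for which different people have promoted different documents. The class, its methods, and the sample data are hypothetical and say nothing about how the GSA stores Keymatches internally.

```python
from collections import defaultdict


class KeymatchTable:
    """Maps a search term to the documents people have promoted for it."""

    def __init__(self):
        self._entries = defaultdict(list)

    def add(self, term, title, url, author):
        self._entries[term.lower()].append(
            {"title": title, "url": url, "author": author}
        )

    def lookup(self, query):
        """Return the promoted entries for an exact (case-insensitive) term."""
        return list(self._entries.get(query.lower(), []))

    def conflicts(self):
        """Return terms for which more than one distinct URL is promoted."""
        return {
            term: entries
            for term, entries in self._entries.items()
            if len({entry["url"] for entry in entries}) > 1
        }


if __name__ == "__main__":
    table = KeymatchTable()
    table.add("vacation policy", "HR Handbook", "http://intranet.example.com/hr", "alice")
    table.add("Vacation Policy", "Old Policy Memo", "http://files.example.com/memo", "bob")
    print(table.lookup("vacation policy"))
    print(sorted(table.conflicts()))
```

A periodic report along the lines of `conflicts()` is one way an administrator might keep user-authored Keymatches from working against each other.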
Being able to easily find the phone number or email address of a colleague by simply typing the first few letters of their name can be a huge time saver, especially in an organization with hundreds or thousands of employees.
For all the slick features that the Google Search Appliance brings to the table, there are some features available from other enterprise search vendors that are notably absent in Google’s offering.
I’ve already touched on a couple of these. Perhaps most notable is the almost non-existent search tuning capability. Although in many ways search tuning can be a headache unto itself, the lack of any capability for tuning may be a turn-off for some potential customers. If the out-of-the-box results work well for your data and your users, then you probably don’t care about any tuning capabilities. However, if you do need to improve those results, you will miss the tuning capabilities dearly. And the unfortunate reality is that you may not know which category you fall into until well into the implementation of a particular solution.
Another significant gap is around leveraging rich metadata, as I discussed briefly in the section entitled “Metadata to the Rescue?” Again, this is a situation where, if you don’t need the metadata capabilities (you may just not have the metadata available to work with), then you won’t miss them, but if those capabilities will be instrumental in building an effective search solution, their absence will be a deal-breaker.
For organizations looking forward to increased content management discipline and maturity, this may be an especially important factor. Just because that metadata doesn’t exist now doesn’t mean it might not in the future, and when it does exist, it will be important to have the capabilities required to make the most of it.
Another key factor to consider is security integration. The GSA can integrate with various security systems, such as LDAP and NTLM, and it can use a variety of authentication methods such as HTTP Basic, HTTPS, X.509, and SAML. However, the GSA is only able to use a late-binding security model. That is, security credentials are checked for each search result at the time it is returned to the user. A late-binding model ensures constant security enforcement, but can often be more difficult to implement and less performant than an early-binding model. It may also require significant integration if secure documents reside in 3rd party repositories, as the GSA must be able to “ask” that repository at query time about the security status of particular documents with respect to the user’s credentials. This integration can become increasingly complex if multiple different repositories with different security models need to be searched together.
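The difference between the two models can be sketched in Python as follows. The function names and the dictionary-based access control lists are hypothetical stand-ins: in a real deployment the late-binding check would be a call out to the repository that owns each document, which is exactly where the integration effort and the performance cost come from.

```python
def can_read(repo_acls, user, doc_id):
    """Stand-in for asking the owning repository at query time."""
    return user in repo_acls.get(doc_id, set())


def late_binding_search(ranked_doc_ids, user, repo_acls, page_size=10):
    """Check each candidate against the live repository as results are served."""
    visible = []
    for doc_id in ranked_doc_ids:
        if can_read(repo_acls, user, doc_id):  # one live check per candidate
            visible.append(doc_id)
            if len(visible) == page_size:
                break
    return visible


def early_binding_search(ranked_doc_ids, user, indexed_acls, page_size=10):
    """Filter using permissions captured at index time, which may be stale."""
    allowed = [d for d in ranked_doc_ids if user in indexed_acls.get(d, set())]
    return allowed[:page_size]


if __name__ == "__main__":
    ranked = ["plan.doc", "salaries.xls", "memo.txt"]
    # Permissions as they were when the content was indexed.
    indexed = {"plan.doc": {"dana"}, "salaries.xls": {"dana"}, "memo.txt": {"dana"}}
    # Permissions now: dana has since lost access to salaries.xls.
    live = {"plan.doc": {"dana"}, "salaries.xls": set(), "memo.txt": {"dana"}}

    print(early_binding_search(ranked, "dana", indexed))
    print(late_binding_search(ranked, "dana", live))
```

In the example, the early-binding search still shows the salary spreadsheet because it relies on a stale snapshot, while the late-binding search hides it at the cost of one repository check per candidate result.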
What much of this boils down to is that the GSA is intended as a “general-use” search tool. After all, it is sold as an appliance. It is intended to minimize effort and maximize value out-of-the-box. It is not intended as a platform for building focused, search-based applications.
You should keep in mind the following points when evaluating Google Enterprise as a potential search solution:
- As an appliance, the GSA provides extremely easy installation, configuration, and update mechanisms.
- The dedicated hardware/software solution is highly optimized, and it performs and scales well.
- The GSA handles basic content sources with great ease. Web content, file shares, and relational databases are very easy to set up. Extracting content from other repositories may require more cost, advanced configuration, hardware, maintenance, or even coding, depending on the particulars.
- The GSA has a “black-box” set of relevance algorithms, with little to no opportunity to influence results ranking. The efficacy of the out-of-the-box relevance may vary depending on your data, and although anecdotally it is generally quite good, if it doesn’t work well for you, there is little you can do about it.
- The user interface provided by the GSA is familiar and intuitive because of its resemblance to google.com, and can be customized for look and feel as needed.
- A number of extras like OneBox and Search-As-You-Type can greatly enhance the search experience, but also may require additional configuration or coding.
- Open APIs encourage 3rd party development of additional functionality for the GSA. A growing library of OneBox modules and content connectors is already available, so you may find that what you need has already been written by someone else.
- The metadata handling capabilities of the GSA are very basic, and are simply not geared towards a rich-metadata environment or application. If you intend to leverage rich metadata as part of your search solution, you may be better off looking for a search platform, rather than a search appliance.
- The GSA handles several standard security modes, but is not well-equipped to deal with reconciling multiple security models across different repositories, and does not support an early-binding security model, which is less secure but may be more convenient for some applications.
I hope this article has provided some helpful insights into the Google Search Appliance, and some of the relevant aspects to consider. If you have any questions or comments, I encourage you to comment on this article.