As more and more applications reach out to a global audience, internationalization is becoming an increasingly important aspect of these applications. Since Endeca applications are focused on Information Access, considerations about internationalization are often critical success factors for these globally targeted applications. This article discusses a few of the challenges surrounding building effective internationalized applications on the Endeca platform, and a few approaches to tackling those challenges.
In general there are two primary areas in an application where internationalization becomes an important consideration: the underlying data and the user interface. Endeca’s own documentation does a good job of discussing best practices for handling data in multiple languages in Endeca’s forge process (see the Chapter entitled “Using Internationalized Data” in the Endeca IAP Developer’s Guide). As such, I’m going to focus more in this article on some of the challenges in working with the Endeca pipeline configuration and application user interface.
Imagine an Endeca-powered web application that allows users to search for documents in a company knowledge base. Since the company’s customers are all over the world, the application is translated into several languages. The most popular documents within the knowledge base are translated into multiple languages as well. The application gives the user the option to select which languages they can read, so that any documents not written in those languages are filtered out when the user searches the knowledge base.
There are several different ways that languages play an important part in this example. One of the important distinctions that needs to be made in any internationalized application is the difference between interface languages, record languages, and interest languages.
The interface language is the language that a particular user chooses to use when they are using the application. All the text controls and labels in the application should be translated into this language when displayed to the user. For instance, if a user selects “français” as their interface language, they might see a button labeled “Paramètres” instead of “Settings”.
The record language is the language describing a particular record or document. Most often this is the language a particular document or record is written in. Usually there is just one record language per record, but sometimes there can be several. For instance, a single User’s Guide document might contain sections written in different languages that contain the same basic material.
The interest languages are the languages that a user selects when using the application that dictate which content is shown to that user. Some applications tie this to the interface language, but that may not be appropriate for all applications. Users may be interested in content in multiple languages even though they must necessarily choose a single interface language. In the example above, not all documents are translated into all languages, so some content may only be available in the default language. Additionally, it may be valuable for application administrators to have access to records across all languages available in the system.
For the various representations of languages in an application, it is highly recommended to use ISO standard language codes. This will make the language determination clear and consistent. For the majority of cases, the familiar 2-letter ISO 639-1 codes (such as “en” and “fr”) are sufficient.
When dealing with an internationalized application, there are a couple of different ways that you can structure your properties and dimensions. In all cases, the goal is to provide the user with dimension values in their selected interface language. For these examples, consider a dimension called “Geography” that describes geographical information. We’ll just define the first level of hierarchy to keep things simple.
User interface frameworks that provide support for internationalization generally have the concept of a resource key. The resource key is a string that is used to look up (generally from resource bundles) the string to display to a user. For example, the resource key might be “n_america”, which would provide the string “North America”, or “Amérique du Nord”, or “Nordamerika” back to the application to display depending on whether the user’s interface language is set to English, French, or German, respectively.
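As a rough illustration of the resource key lookup described above, here is a minimal Java sketch. The class name DimensionLabels and the in-memory bundles are assumptions made for the example; a real application would more likely load .properties files through ResourceBundle.getBundle(). Falling back to English, and then to the raw key, is one possible policy rather than a requirement.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

// Hypothetical helper: resolves a resource key to a label in the interface language.
class DimensionLabels {
    // In-memory stand-ins for per-language resource bundles (illustrative only).
    private static final Map<String, ResourceBundle> BUNDLES = new HashMap<>();

    static {
        try {
            BUNDLES.put("en", bundle("n_america=North America\n"));
            // \u00e9 is the Java escape for "é", used to avoid source-encoding issues.
            BUNDLES.put("fr", bundle("n_america=Am\u00e9rique du Nord\n"));
            BUNDLES.put("de", bundle("n_america=Nordamerika\n"));
        } catch (IOException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private static ResourceBundle bundle(String properties) throws IOException {
        return new PropertyResourceBundle(new StringReader(properties));
    }

    // Returns the translated label, falling back to English and then to the key itself.
    static String label(String resourceKey, String interfaceLanguage) {
        ResourceBundle chosen = BUNDLES.getOrDefault(interfaceLanguage, BUNDLES.get("en"));
        return chosen.containsKey(resourceKey) ? chosen.getString(resourceKey) : resourceKey;
    }
}
```

Calling DimensionLabels.label("n_america", "de") returns “Nordamerika”, while an interface language with no bundle falls back to the English string.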
However, when localizing an Endeca-powered application, it’s not so straightforward. Let’s examine how we could naively approach rendering our Endeca dimensions using resource bundles. In this approach we make each value of the dimension a resource key, and then use that resource key to look up the appropriate string to display to the user for that dimension value at render time. This is shown in the Developer Studio screenshot below.
Using Resource Keys as Dimension Values
This has the advantage of keeping your dimension configuration simple and straightforward. It also means that the only place you need to maintain translations for the dimension values is in resource bundles handled by the user interface.
However, this approach has several shortcomings as well. First, it means that the data in your records has to be tagged using the resource key value only, rather than the various translations of a value. This may or may not be problematic, depending on how your data is set up. (You could set up synonyms for each dimension value for each language to get around this, but it rather defeats the point of keeping your dimension configuration simple and isolating the translations in resource bundles.)
It also means that you can’t effectively use dimension search with this approach. Since the dimension values are actually the resource keys (which are never displayed directly to the user) the user can’t search using the translated values, which only exist in resource bundles handled by the user interface framework and are not available to the Endeca engine.
Another minor drawback is that the display order for Alpha-sort dimensions may appear jumbled, since the alphabetic sort order for the resource keys won’t always match the sort order of the translations into a given language. This can be dealt with in the user interface code by having it re-sort the values appropriately after they are translated, but it is additional work that you may want to avoid.
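The re-sorting step mentioned above could look something like the following sketch, which uses the standard java.text.Collator so that accented characters are ordered by the rules of the interface language rather than by raw character codes. The class and method names are hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: orders already-translated labels for display.
class TranslatedSort {
    static List<String> sortForDisplay(List<String> translatedLabels, String interfaceLanguage) {
        // A Collator applies locale-aware comparison rules (accents, case) for the language.
        Collator collator = Collator.getInstance(Locale.forLanguageTag(interfaceLanguage));
        List<String> sorted = new ArrayList<>(translatedLabels); // leave the input untouched
        sorted.sort(collator);
        return sorted;
    }
}
```

The input list here is assumed to hold the labels after translation, not the resource keys, which is exactly why the sort has to happen in the user interface layer under this approach.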
A more complex, but more complete approach is to instead create, for each intended dimension, a separate dimension for each available interface language. Suppose our application will support English, French, Spanish, and German as interface languages. In that case, the configuration would look something like the following screenshots.
Defining Dimensions per Language
Prefixing each dimension with its 2-letter language code makes them easy to identify. Notice also that I have selected “French” as the dimension language. Selecting “French” as the dimension language tells Endeca to treat values in this dimension as French words, which helps dimension searches against this dimension be more effective, as rules for matching words in French can be applied.
If the application requires dimension search to work in multiple languages, this is a preferable approach, since the dimension values themselves will be translated into the available interface languages. We can also maintain dimension ordering in the Endeca configuration, and dispense with any special user interface processing that might otherwise be required. How the data is mapped into these dimensions, and what classification synonyms may be required still depends on how the inbound data is structured, but since we are maintaining the translated dimension values here, there is no harm in maintaining translations as appropriate.
Recall that in addition to the interface language, and the record language, we also need to consider the user’s interest languages. The specifics of the application will dictate how the user’s interest languages are determined. For example, they might be stored in a user profile, assigned by an administrator, or selected automatically based on the user’s interface language, locale, or other settings.
In order to provide the user with the appropriate options, the record language will need to be represented as a property and/or dimension value on each record. This will allow the application to use the user’s interest languages to filter their results accordingly. This is achievable simply by making the record language a filter-enabled property. However, in many cases, especially when the user is allowed multiple interest languages, implementing the language as a dimension is more appropriate, as it brings all the power of Endeca dimensions: Guided Navigation based on language, refinement statistics, multi-valued selections, etc.
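For the simpler filter-enabled property case, the application has to turn the user’s interest languages into a filter expression. The sketch below builds an OR expression over a property; the property name “record_lang” is hypothetical, and the OR(name:value,...) form reflects my understanding of Endeca record filter syntax, so it should be checked against the documentation for your MDEX version before use.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: builds a record filter expression from interest languages.
class InterestLanguageFilter {
    static String recordFilter(String propertyName, List<String> interestLanguages) {
        if (interestLanguages.isEmpty()) {
            return ""; // no interest languages selected: apply no language restriction
        }
        // Assumed syntax: OR(prop:value,prop:value,...)
        return interestLanguages.stream()
                .map(code -> propertyName + ":" + code)
                .collect(Collectors.joining(",", "OR(", ")"));
    }
}
```

With interest languages “en” and “fr”, this produces OR(record_lang:en,record_lang:fr), which would keep only records tagged with one of those two codes.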
Just like the previous example of Geography, you will want to translate the dimension values for the Language dimension into multiple languages themselves! Following our previous example, you could create a dimension called “en:Language”, one called “fr:Langue”, one called “es:Lengua”, etc. These would contain values that correspond to the translations of the possible record languages into each of the possible interface languages.
Multiple Language Dimensions
There are a few things to notice about this example. First, notice that we have made the two-letter ISO language code a classifiable synonym for each dimension value. This makes it easy to standardize the incoming data on using those codes for the record language. Notice also that Italian is not one of the interface languages that we have created dimensions for, but it is listed as one of the dimension values. This is because (in our hypothetical application) we are not offering an Italian translation of the user interface, but we do have some records in Italian that will be in our index. It may often be the case that an application has a small number of interface languages, but a much larger number of record languages.
Manually creating all these dimension values for each interface language dimension in Developer Studio can be tedious, especially when there will be many different record languages in the index. One option we have employed is to make these language dimensions autogenerated, and then have a Java manipulator in the pipeline look at the two-letter language code for each record and set the value for each of the Language dimensions to the appropriately localized name using the Java Locale object. For example, if the current record has “en” as its language, the manipulator sets a property called “fr_lang” to “Anglais”, “en_lang” to “English”, and so on. You need to store the set of available interface languages so that the Java manipulator knows which languages to translate the two-letter code into. Then you can map “en_lang” to “en:Language”, “fr_lang” to “fr:Langue”, etc. If you unexpectedly get a document with a new record language, it automatically shows up in the language dimensions. If you add a new interface language, however, you still need to create the new language dimension for it and update the list of interface languages for the Java manipulator so that the dimension gets populated appropriately.
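The translation step inside such a manipulator might be sketched as follows. This is plain Java showing only the Locale logic; the class name is hypothetical and the manipulator plumbing (reading the record and writing the properties back) is left out, since it depends on the Forge adapter API you are using.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: derives the "<lang>_lang" property values for one record.
class LanguageNameTranslator {
    static Map<String, String> localizedNames(String recordLanguage, List<String> interfaceLanguages) {
        Locale record = Locale.forLanguageTag(recordLanguage);
        Map<String, String> properties = new LinkedHashMap<>();
        for (String ui : interfaceLanguages) {
            // getDisplayLanguage(inLocale) names the record language in the given UI language.
            String name = record.getDisplayLanguage(Locale.forLanguageTag(ui));
            properties.put(ui + "_lang", name);
        }
        return properties;
    }
}
```

The exact strings, including capitalization, come from the locale data shipped with the JDK, so the French name for “en” may come back as “anglais” rather than “Anglais”. If the display casing matters, the manipulator or the user interface may need to adjust it.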
In the scenario where we create a language-specific dimension for each logical dimension in our index, we will then need to filter the dimensions displayed to the user so that only those corresponding to the user’s interface language are displayed.
The simplistic way to do this is to issue a query that returns all dimensions, and delegate to the user interface the job of only displaying the ones appropriate for a user’s selected interface language. If we’re sticking to a naming convention like the one shown above for our dimensions, this is straightforward to implement. In this case, the user interface simply displays only the dimensions that are prefixed with the code that matches the user’s interface language.
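A minimal sketch of that user interface filter is shown below, assuming the “xx:Name” naming convention used in this article. The class and method names are hypothetical, and the dimension names are treated as plain strings rather than objects from the Endeca API.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: picks the dimensions to render for one interface language.
class DimensionFilter {
    // Keeps only dimensions whose name starts with "<lang>:".
    static List<String> forInterfaceLanguage(List<String> dimensionNames, String interfaceLanguage) {
        String prefix = interfaceLanguage + ":";
        return dimensionNames.stream()
                .filter(name -> name.startsWith(prefix))
                .collect(Collectors.toList());
    }

    // Strips the language prefix so the user sees "Langue" rather than "fr:Langue".
    static String displayName(String dimensionName) {
        int separator = dimensionName.indexOf(':');
        return separator < 0 ? dimensionName : dimensionName.substring(separator + 1);
    }
}
```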
However, this may have an extremely detrimental effect on application performance. If our application has 20 available interface languages, we are (nearly) multiplying the amount of work that the MDEX engine has to do and the amount of information being transferred over the network by 20 for every query that is executed.
A slightly more involved, but much more performant solution is to take advantage of the setSelection() method of the ENEQuery object to specify which properties and dimensions to compute and return. The dimensions returned for a given user can be restricted to those matching the user’s chosen interface language. It may be necessary for the application to dynamically determine and/or cache which dimensions are appropriate for which interface languages so that it can take full advantage of the reduction in MDEX engine work and network bandwidth.
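The caching mentioned above might be sketched like this. The SelectionCache class is hypothetical and works on plain dimension names that follow the “xx:Name” convention; the Endeca call itself is deliberately left out, so the resulting list would still need to be converted into whatever argument setSelection() expects in your version of the API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Hypothetical helper: remembers which dimensions belong to each interface language.
class SelectionCache {
    private final List<String> allDimensionNames;
    private final Map<String, List<String>> byLanguage = new ConcurrentHashMap<>();

    SelectionCache(List<String> allDimensionNames) {
        this.allDimensionNames = new ArrayList<>(allDimensionNames); // defensive copy
    }

    // Computes the list once per language, then serves it from the cache.
    List<String> selectionFor(String interfaceLanguage) {
        return byLanguage.computeIfAbsent(interfaceLanguage, lang ->
                allDimensionNames.stream()
                        .filter(name -> name.startsWith(lang + ":"))
                        .collect(Collectors.toList()));
    }
}
```

The design choice here is to pay the cost of scanning the dimension list once per interface language rather than once per query. The cache would need to be rebuilt whenever the index configuration changes, for example after a new interface language is added.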
You should also set the language for each query that you send to Endeca using the setLanguageId() method of the ENEQuery object. For example, if you know you are querying only French documents, set the query language to “fr” so that French stemming will be used in the search.
As you can see from some of the complexities we’ve discussed, properly internationalizing an Endeca-powered application can be full of new and interesting challenges. I hope this article has provided a helpful head start for those embarking on such an endeavor. If you have any questions or comments, I encourage you to comment on this article.