When migrating content from one system to another, don’t forget about the data that accompanies the content.
Blue Fish Development Group
701 Brazos St. #700
Austin, TX 78701
(512) 469-9300
When migrating content from one system to another, don’t forget about the data that accompanies the content.
Migrating enterprise content from one system to another requires undertaking two separate yet intertwined migration efforts: moving the actual content files and moving the related data. When planning a migration initiative, it’s easy to overlook the data migration that is implied in the process. Yet the majority of problems encountered during content migrations are actually problems with the data, which includes the obvious metadata as well as the not so obvious data components such as ownership, security and permissions, folders, workflow and lifecycle state, version stack, and audit history. This article describes the types of problems you can expect to see when moving the content’s data and explores options for planning and performing the migration of the data that is related to your content.
Upon first glance, migrating content from one repository into another - or even from a filesystem into a repository - seems simple. There are tools that allow you to map the attributes from the source system to their corresponding attributes in the target system, then push a button and watch the content move. However, in practice it’s rarely this simple. To understand the potential pitfalls in even the simplest content migrations, let’s turn to the familiar iceberg metaphor.
Figure 1: The Tip of the Iceberg
The visible portion of the iceberg represents the attributes that can be mapped in a straightforward fashion. For example, object_name in the source system might map to object_name in the target system, while doc_no in the source system might map to document_number in the target. But lurking below the waterline is a continent of challenges and issues - issues related to attribute transformations, attributes requiring enrichment, and exceptions. We can cast these challenges as different strata in the iceberg; the deeper you go, the more complex or challenging the issues become.
Figure 2: The Full Iceberg
Each layer of the iceberg presents its own unique challenges and must be planned for differently. The one common theme is that your approach and plan must anticipate that you will find problems once you begin looking at migrated content and data in the target system.
Many of the attributes in your source repository will map seamlessly to your target repository. Since these mappings are, well - straightforward, this part of your migration specification usually gets the most attention. Your intuition for how to plan this part of the migration will probably serve you well - it boils down to doing a little bit of analysis as you plan.
At Blue Fish we’ve developed a set of best practices that has served us well across a broad range of content migration efforts. Start with your target repository’s object model and understand where each attribute will come from. If your source system is a Documentum docbase or another formal content repository, it’s likely that many of the attribute values will map directly from attributes in that system. If your source is a filesystem, there will be fewer explicit attributes to pull from. In either case, some values may require transforming, which we will discuss in the next section.
Next, analyze your source documents. Does your legacy system have all the metadata that the target repository requires? If not, we deal with those missing values in the next two sections.
Assess the quality of the metadata. Do the values exist? Are they complete? Are they valid? Are they in the expected format? For example, are all dates “mm/dd/yyyy”, or are some “dd/mm/yyyy” or “dd-mm-yyyy” or “dd-MMM-yy” or “Month dd, yyyy” or “mm/dd/yyyy hh:mm:ss” and so on. Were users consistent in how they tagged and classified documents? Did the legacy system enforce consistency? An invalid or incomplete value moved into a non-critical target attribute may be a nuisance, but a missing or invalid value moved into a mandatory target attribute will cause your migration to fail.
Things to watch out for:
The first submerged layer of the iceberg consists of the complex mappings we call attribute transformations. For example, if in the source system the author_name is “John Smith” but it must be “Smith, John” in the target, the value must be transformed during the migration into the new format.
To use a transformation, you must be able to programmatically derive the value of the target attribute from information in the source repository, such as the document’s folder path, owner, or other attributes. Many transformations are expressed in terms of conditional statements. For example, you may need to set an archived attribute to true only if the document is more than 3 years old.
Some common types of transformations we’ve seen include:
Planning for this part of the iceberg is similar to planning for the straightforward mappings. In fact, you can often do the analysis for transformations at the same time you are analyzing for the straightforward mappings. The main reason that we think of transformations as more challenging is that they can frequently get out of control, adding a lot of complexity to what you may have thought was a simple migration. It’s not uncommon to build up a transformation that evaluates the values of three or more source attributes. In a recent migration, I recently had a transformation that required evaluating three document level attributes and one folder level attribute to determine whether a document’s status was “draft” or “released” in the target system.
Things to watch out for:
After you deal with attribute transformations, you’ll face an even greater challenge: target attributes whose values cannot be mapped or programmatically derived from information in the source repository. Typically, this is the case when either the source system has no corresponding attribute or the source attribute was not mandatory (and therefore not populated). Enrichment is basically a last resort when there is no programmatic means for providing the target attribute values.
These attributes require users to “enrich” the data in a manual or semi-automated fashion, supplying the missing data based on their knowledge of the content, the system, and the related business processes. For example, if a document is missing its title, the author of that document must go back and add a title before the document can be migrated.
When planning enrichment activities, give some thought to who is best suited to do the work. Since enrichment activities require users with an understanding of the business context, plan to have the right business user(s) enrich the right set of attributes. You can divide the content into enrichment groups and route each group to the user(s) with the relevant knowledge. This is especially useful if the schedule is tight since it allows you to parallelize the enrichment process, Or if knowledge about specific attribute values is segmented across departments, you can slice the enrichment process and have different users enrich different attributes. And make sure you work with the appropriate managers to allocate time for the selected business users.
Even though knowledgeable users are providing the missing values, don’t expect them to get it right on the first try. Build in safeguards. For example, if you are using an Excel spreadsheet to capture missing attribute values, use drop down lists to ensure consistency and completeness and avoid data entry mistakes. If a drop down list must match the data dictionary in the target system, make sure you export the latest version of the list before starting enrichment, and periodically refresh the list if needed.
In addition to drop down lists, provide a default value so that the attribute value cannot end up blank. As a last resort you could default to the object owner’s name for future resolution. Have the enrichment process provide immediate feedback for potential errors. For example, use the built-in data validation capabilities of Excel to prompt the user to correct errors immediately. And lastly, be prepared to validate or otherwise check the results of enrichment activities.
To determine the duration of the expected enrichment activities, run a sample set through your enrichment process. Since attribute enrichment is basically a linear process, you can safely project a completion date based on the time it took to complete the sample set, after adding in a reasonable amount of schedule contingency.
Don’t try to use a statistical sample (again, more on this later). Instead, query your source repository to determine exactly which documents are missing necessary attribute values. You’ll need a separate query for each attribute. In a Documentum repository, an example would be “select * from dm_document (all) where title is nullstring”.
Finally, select a migration tool that directly consumes the enriched attribute values. This speeds up the process and avoids translation or transcription errors.
Things to watch out for:
Now we are into the deepest and darkest region of the iceberg. This is where most migrations go wrong, because the approach failed to expect the unexpected. Most exceptions won’t be discovered until you start migrating data into the target repository. Identifying and addressing them is where you will spend most of your time, so you’d best plan for it.
The following list of examples is by no means complete but will give you a flavor of the variety and severity of exceptions you can expect to encounter.
There are four ways of dealing with exceptions:
One of the things that makes exception handling particularly hard is that some exceptions will be “hiding behind” other exceptions. What I mean by this is that a document might have two exceptions, but you won’t know about the second exception until you resolve the first exception. For example, you may find a set of documents whose document_type attribute is not populated. To resolve this exception, you decide to change your migration specification, populating the document_type based on the naming convention of the folder the document is linked into. You then re-run your migration, only to discover that some folders don’t conform to the naming convention.
We’ve found the most effective way of dealing with exceptions is to find them early and fix them iteratively. Start performing migrations right away, perhaps into a test environment. Then run your migration again - you’ll find new exceptions that need to be resolved. Plan on having many, many iterations. Most migration tools can’t process documents incrementally, so either select one that can or plan each iteration to have enough time to roll back your migration and rerun it from scratch.
At Blue Fish, we use a methodology we call Agile Migrations, where we perform frequent, incremental migrations during which we only re-migrate the small number of documents that have changed. Having worked on migrations for years, I can tell you that this beats the old “big-bang” approach hands down.
Additional planning steps that will keep you out of trouble include:
One of the most common migration planning mistakes is to employ an approach that runs one or more sample sets through the migration process and then projects a migration timeline based on the analysis of the samples. If you’ve learned one thing from this article, hopefully it is that it’s very hard to predict how many errors and exceptions you’ll encounter during the migration process. Even a statistically valid sampling of the data will give you a false sense of security since you are certain to run into exceptions that were not found in the sample set.
We learned this lesson the hard way. Several years ago we embarked on an effort to migrate 600,000 documents from three Documentum source repositories and one fileshare into a new consolidated Documentum repository. Determined to exercise due diligence in the planning stage, we started off by compiling a list of 150 assumptions about the state of the content and metadata. For example, we assumed that every document had an object name. We scripted DQL statements and programmatically validated all of these assumptions.
After validating the assumptions, we divided the documents into 350 batches. The plan was to run 20 of the batches, representing 6% of the total content to be migrated, through the migration process. We expected this test drive to uncover 80% of the exceptions. However, all 20 batches had their unique problems that required specific handling. At this point we realized we could not predict the number of exceptions and shifted our approach to expect exceptions and facilitate addressing them as they arose.
In summary, the smart and sane approach is to expect exceptions and be prepared to handle them when, not if, they arise.
Before tackling a content migration initiative, remember the iceberg. The straightforward attribute mapping is just the tip. Each layer has its own challenges and requires different planning. Exceptions take most of the time, and planning to respond quickly to the exceptions boils down to three things:
Following this approach and setting the right expectations up front will prepare everyone for the challenges. When data issues and exceptions are spotted in the target system and quickly addressed, the migration team and business users alike will realize the process is “working as designed”. Everyone will see the progress and feel confident that the migration effort will be successful and will be completed on time. And you will feel a sense of accomplishment and relief knowing you successfully charted the iceberg and navigated around it instead of colliding with it.
To learn more about the tools and processes we use here at Blue Fish, visit the Content Migrations section of our Web site.
Subscribe to our newsletter to be notified when new articles are posted. You can unsubscribe at any time.
You must be logged in to post a comment.