Bugs are never fun, especially the ones that must be tracked across several different platforms, applications, and test cases. The Information Access (IA) team encountered such a bug during advanced testing of the Documentum Adapter. This article discusses the approaches we took when faced with a bug spanning two platforms (Documentum and Endeca) and describes our experience testing potential solutions to the problem.
The Documentum Adapter for Endeca ultimately makes it possible for business users to leverage the search and refinement power of Endeca with the critical information stored in Documentum repositories. The adapter allows Endeca to index Documentum objects and their properties, enables the full text extraction and indexing of supported object types, and provides the ability to map Documentum taxonomies to Endeca search refinements. This particular bug was encountered during the full text extraction process.
Blue Fish encountered a timeout in a production environment while running extended test cases against the adapter. When the Endeca Document Converter encountered files of unknown formats or very large sizes (ten megabytes or larger), it would continue to process, or attempt to process, the content for five minutes before timing out and moving on to the next file. The session timeout caused Documentum to close the IDfCollection session (not the IDfSessionManager session) and throw an error when the adapter attempted to advance to the next record in the data reader. This made the next content record inaccessible and fatally terminated the entire crawl.
The discovery process proved more challenging than simply following the stack trace. The team knew the exception was being thrown when trying to advance the record set, but diagnosis was complicated by three factors: the adapter had to be deployed and launched through Endeca to run, the bug was only reproducible during certain types of crawls, and it was unknown which particular aspects of these repositories were causing the exception.
The key to solving this bug was not in the exception messages in the adapter logs, nor in the Endeca logs created during content crawling; it was in correlating the information the two logs provided. Each time the Documentum exception was thrown, a matching entry appeared at the same time in the Endeca log: an exception from the Endeca Document Converter caused by an unknown document type. The only time the Documentum data reader session was closed was when this five-minute timeout was observed in the Endeca Document Converter. Once the source of the bug was known, the next problem became implementing a solution.
Quick and Easy
The first attempt at solving the problem was a straightforward multi-threaded solution that should have allowed the converter to attempt the conversion on one thread while the adapter moved on with its record processing on another. This was implemented and tested. The exception still occurred, with no change in behavior. To our surprise, the Endeca Document Converter was using 100% of the CPU during conversion, causing thread starvation! Because the adapter thread was starved for CPU time, it could not move on to the next record until the converter released some CPU cycles; after remaining in this state for five minutes, the SessionClosed exception occurred again.
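The attempted pattern can be sketched with standard Java concurrency utilities. The class and method names here are illustrative, not the adapter's actual code: the conversion runs on a worker thread and the caller abandons it after a deadline. As described above, this failed in practice because a conversion that saturates the CPU starves the calling thread anyway.

```java
import java.util.concurrent.*;

public class ConversionWithTimeout {

    // Run a (simulated) conversion on a worker thread and give up after a
    // deadline so the caller can advance to the next record. In practice the
    // real converter saturated the CPU, so the adapter thread starved anyway.
    static String convertWithTimeout(Callable<String> conversion, long timeoutMs) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> result = pool.submit(conversion);
            return result.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return null;                       // deadline passed: skip this record
        } catch (InterruptedException | ExecutionException e) {
            return null;                       // conversion failed outright
        } finally {
            pool.shutdownNow();                // best effort; a CPU-bound task
        }                                      // may ignore the interrupt
    }

    public static void main(String[] args) {
        String fast = convertWithTimeout(() -> "converted", 1_000);
        String slow = convertWithTimeout(() -> {
            Thread.sleep(5_000);               // stand-in for a stuck conversion
            return "never";
        }, 200);
        System.out.println(fast + " / " + slow);   // prints "converted / null"
    }
}
```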
After regrouping, it was time to take a closer look at the architecture to see where the point of failure was and how it might be refactored to eliminate the problem of records timing out. The architecture of the Documentum Adapter suggested two potential solutions: either separate the Documentum work (getting the files) from the Endeca work (crawling the files), or keep track of the record being processed and, should it fail, re-query the record set and skip the failed record.
Solution 1: Separate the Work
The first solution required pulling all the records from Documentum into memory. This approach would allow the adapter to close the Documentum data reader session before moving on to indexing the records into Endeca. With the records buffered, the Endeca Document Converter had as much time as it needed to time out and could advance to the next record without any concern about a Documentum session being closed. Testing quickly led to the realization that Documentum repositories are huge: our calculations showed that holding these records in memory would require gigabytes of memory, which would be untenable for most clients.
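A minimal sketch of this buffering approach, using generic map records in place of the actual Documentum result objects (the types and method names are illustrative, not the adapter's real code):

```java
import java.util.*;

public class BufferedCrawl {

    // Drain every record into memory so the Documentum session can be closed
    // before the slow conversion/indexing step begins. Generic maps stand in
    // for the real IDfCollection records.
    static List<Map<String, String>> drain(Iterator<Map<String, String>> reader) {
        List<Map<String, String>> buffered = new ArrayList<>();
        while (reader.hasNext()) {
            buffered.add(reader.next());       // ~13 KB per document adds up fast
        }
        return buffered;                       // the data reader session can be
    }                                          // closed here, before indexing

    public static void main(String[] args) {
        List<Map<String, String>> fake = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            fake.add(Map.of("r_object_id", "id" + i));
        }
        System.out.println(drain(fake.iterator()).size());   // prints 3
    }
}
```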
| Process memory size (MB) | Mem chg from baseline size (KB) | Num of docs held | Num docs chg from baseline | KB/doc |
Table 1 – Consistent growth of java.exe process for each doc added
Table 1 shows the impact on the java.exe process as more documents are held in memory for processing. Each document held in memory takes approximately 13 kilobytes, and after holding 4,000 documents the process size had already reached 84 megabytes. Realistically, Documentum repositories can contain hundreds of thousands or even millions of objects. Such memory requirements were deemed unrealistic, and this approach was abandoned.
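Extrapolating the measured per-document cost makes the problem concrete. This is a back-of-the-envelope calculation: the 84 MB figure above also includes the baseline process size, so only the per-document increment is projected here.

```java
public class MemoryEstimate {

    // Project the per-document memory cost measured in Table 1 (about 13 KB
    // per held document) onto realistic repository sizes. This counts only
    // the per-document increment, not the baseline process size.
    static double estimateMb(long docCount, double kbPerDoc) {
        return docCount * kbPerDoc / 1024.0;
    }

    public static void main(String[] args) {
        System.out.printf("4,000 docs:     ~%.0f MB%n", estimateMb(4_000, 13));      // ~51 MB
        System.out.printf("1,000,000 docs: ~%.0f MB%n", estimateMb(1_000_000, 13));  // ~12,695 MB, i.e. over 12 GB
    }
}
```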
Solution 2: Skip the Failing Record
With the first approach no longer an option, efforts refocused on providing a failsafe for the query timeouts. By keeping track of which record was failing and skipping it altogether, the adapter gains a more robust error-handling mechanism that covers the timeout as well as other edge cases that might arise while processing a record: a record could be extremely large and legitimately take that long to process, or it might be missing some crucial data that Endeca needs. By solving the one timeout problem, this approach potentially addressed other problems that might occur during record processing. It also preserved the original architecture and avoided more extensive testing requirements.
The new solution relied on the record set returned from Documentum being ordered by object id. If a record failed during processing, a new query would return only the records that came after the failed one.
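The re-query idea can be sketched as follows. The DQL statement, attribute list, and object id below are illustrative, not the adapter's actual query: because the result set is ordered by r_object_id, the crawl can resume immediately after a failure by asking only for ids strictly greater than the last one processed.

```java
public class ResumeQueryBuilder {

    // Because the crawl's result set is ordered by r_object_id, a failed
    // record can be skipped by re-querying for ids strictly greater than the
    // last id processed. The DQL and attribute list here are illustrative.
    static String resumeQuery(String lastObjectId) {
        return "SELECT r_object_id, object_name FROM dm_document"
             + " WHERE r_object_id > '" + lastObjectId + "'"
             + " ORDER BY r_object_id";
    }

    public static void main(String[] args) {
        // "0900000180001a2b" is a made-up object id for demonstration only
        System.out.println(resumeQuery("0900000180001a2b"));
    }
}
```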
The first lesson here is that sometimes an API does not provide the level of access or control necessary for optimal operation. Had the Documentum API provided a way to set the IDfCollection timeout (it did not), or had the Endeca Document Converter provided a way to configure its timeout, the IA team would not have had to solve the problem by modifying the adapter code so heavily.
Since the error occurred in a piece of code bridging two existing platforms and could not be isolated independently, every candidate fix required deploying a new build of the adapter to test. Troubleshooting an issue that spans two platforms in this way is truly an edge case.