One of the key reasons we like Ephesoft so much is its straightforward implementation of
Intelligent Document Capture concepts. Ephesoft’s Search Classification mode automatically
learns from the semantic content on the sample pages provided, and builds a Lucene index from this information.
From there, this index is used to adaptively interpret and classify any new documents that are fed
into the system.
As new documents are fed in, Ephesoft’s high-level classification process is basically:
“How do I figure out what kind of document this is? I’ll compare this doc’s written content with the learning samples
I have, and make an educated guess from there.” In this way, intelligent document recognition (IDR) mimics the
common-sense semantic approach that humans generally employ, and therein lies its flexibility and utility.
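To make the “compare against the learning samples” idea concrete, here is a minimal sketch in Python. It is not Ephesoft’s code and does not use Lucene; it assumes a plain TF-IDF weighting with cosine similarity as a stand-in for the general technique, and every name in it (build_index, classify, the sample document types) is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weigh(counts, idf):
    """Turn raw term counts into a unit-length TF-IDF vector."""
    vec = {term: n * idf[term] for term, n in counts.items() if term in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {term: w / norm for term, w in vec.items()} if norm else {}


def build_index(samples):
    """Build a toy index from {doc_type: [sample page text, ...]}."""
    pages = [(doc_type, Counter(tokenize(text)))
             for doc_type, texts in samples.items()
             for text in texts]
    # Document frequency: how many sample pages contain each term.
    doc_freq = Counter(term for _, counts in pages for term in counts)
    total = len(pages)
    # Smoothed inverse document frequency, so rare terms weigh more.
    idf = {term: math.log((1 + total) / (1 + df)) + 1.0
           for term, df in doc_freq.items()}
    vectors = [(doc_type, weigh(counts, idf)) for doc_type, counts in pages]
    return idf, vectors


def classify(text, index):
    """Return (best matching doc type, cosine score) for new page text."""
    idf, vectors = index
    query = weigh(Counter(tokenize(text)), idf)
    best_type, best_score = None, 0.0
    for doc_type, vec in vectors:
        score = sum(w * vec.get(term, 0.0) for term, w in query.items())
        if score > best_score:
            best_type, best_score = doc_type, score
    return best_type, best_score


# Hypothetical learning samples, one page of text per document type.
samples = {
    "Invoice": ["Invoice number 1001 total amount due payment terms"],
    "Resume": ["Work experience education skills references"],
}
index = build_index(samples)
print(classify("Amount due on invoice 2040", index))
```

The score plays the role of a confidence value: a page whose text shares nothing with the samples comes back as (None, 0.0), which is the kind of case a human reviewer would want to see flagged.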
Until now, most of this classification and interpretive work was hidden under the hood in Ephesoft. In version 3.1,
however, this classification logic is exposed in the batch class administration area via the new Test Classification
feature. We are very excited about this addition to the software, and we’ll explain more below.
The new Test Classification feature functions similarly to the existing Test KV feature. First, you’ll need to set up
a new batch class with a few document types, and then go through the normal document training process. If you need
an explanation of these steps, Ephesoft’s documentation is a fine start.
Once the learning is complete, we can make use of the new Test Classification feature.
Inside the batch class directory is a new folder, test-classification, where test candidates can be
loaded in the same way as with the existing test-extraction folder. To use this, you’ll need to copy in
some document or page samples that (ideally) are representative of the larger document population which will be fed into
Ephesoft later during the production phase.
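If you have a large pool of samples, the copy step can be scripted. The sketch below is only an illustration: the folder name test-classification comes from Ephesoft, but the function name, the list of file extensions, and the demo paths are our own assumptions, and the demo runs entirely inside a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path

# Assumption: the samples are TIFF or PDF files. Adjust to your own inputs.
SAMPLE_EXTENSIONS = {".tif", ".tiff", ".pdf"}


def load_test_samples(source_dir, batch_class_dir, limit=None):
    """Copy sample files into <batch_class_dir>/test-classification.

    Returns the names of the files that were copied, in sorted order.
    """
    target = Path(batch_class_dir) / "test-classification"
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(source_dir).iterdir()):
        if not path.is_file() or path.suffix.lower() not in SAMPLE_EXTENSIONS:
            continue
        if limit is not None and len(copied) >= limit:
            break
        shutil.copy2(path, target / path.name)
        copied.append(path.name)
    return copied


# Demo with placeholder files in a throwaway directory (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "samples"
    source.mkdir()
    for name in ("a.pdf", "b.tif", "notes.txt"):
        (source / name).write_bytes(b"placeholder")
    batch_class = Path(tmp) / "BC-demo"
    copied = load_test_samples(source, batch_class)
    print(copied)
```

In real use, you would point batch_class_dir at your own batch class directory; the limit argument is there to keep the first OCR pass short.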
From there, just find and press the Test Classification button in the batch class administration
area. When this runs, Ephesoft will use whatever classification methods have been assigned to each document type,
and apply those analyses to the files that have been loaded into the test-classification folder. This can
include any of Ephesoft’s recognition modes (image classification, search classification, barcode recognition, etc.),
but we’ll focus on search classification since that’s what we end up using most frequently with our clients.
The first time this runs, there may be a short delay while Ephesoft runs OCR on the files. If your environment is not especially powerful, or if you’ve loaded a large set of samples, you’ll want to allow several minutes for the first run to complete. When the process is finished, the classification results pop up in a modal window, and the full power of this new feature should be immediately evident.
Essentially, Ephesoft is now providing us with a fully transparent readout of all its classification logic, on a
per-document and per-page basis. The table as a whole reveals how the entire set of files is interpreted, how the
files are grouped into individually classified docs, what each doc is composed of internally, and with what level of
confidence. Furthermore, if any questions arise regarding how these confidences and classifications were reached,
one can simply open up the _HOCR files that now accompany each page in the test-classification folder.
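When there are many pages to check, a quick word count per _HOCR file can show which terms the OCR pass actually recovered. The following is a rough sketch under stated assumptions: we assume the files are well-formed XML whose names match *_HOCR*.xml, we make no assumption about the schema beyond that, and the function names and the tag names in the demo are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def count_words(root):
    """Count alphabetic words (2+ letters) in all text under an XML element.

    Purely numeric text, such as coordinates, is skipped on purpose.
    """
    words = Counter()
    for text in root.itertext():
        words.update(re.findall(r"[a-z]{2,}", text.lower()))
    return words


def summarize_hocr_folder(folder, top=10):
    """Map each *_HOCR*.xml file name to its most common words."""
    summary = {}
    # Assumption: file naming pattern; change the glob if yours differs.
    for path in sorted(Path(folder).glob("*_HOCR*.xml")):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            continue  # Not well-formed XML; skip rather than guess.
        summary[path.name] = count_words(root).most_common(top)
    return summary


# Demo on an inline snippet with made-up tag names.
demo = ET.fromstring(
    "<page><span><value>Invoice</value><x0>12</x0></span>"
    "<span><value>invoice total</value></span></page>"
)
print(count_words(demo).most_common(2))
```

Comparing these word lists against the text of your learning samples is a quick way to spot pages where poor OCR, rather than the classifier itself, explains a low confidence.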
We are really pleased with how Ephesoft has set up this new feature, and we think it will be an invaluable addition for
use cases that involve “fuzzy” data sets with a lot of document variability. This feature allows the same level of
instrumental fine-tuning that we’ve enjoyed for development of extraction logic, now to be applied to the classification
of different document types. In certain use cases this will prove even more valuable than extraction testing, because
some document populations have such a high degree of variability that classification really is the biggest value that
Ephesoft brings. And without correct classification on each document, whatever custom extraction logic has been developed
will effectively be moot. We give a big thumbs-up to Ephesoft for giving us such a valuable insight into this most
important step in the document capture workflow.