Xerox Scientists Develop New Way to Categorize Documents
13 Jul, 2005
Xerox Corporation scientists have injected more human know-how into text mining, the practice of using computer analysis of documents to extract new information. The result is better categorization, with higher-quality, customized results.
In a paper titled "Work Practice in Research: A Case Study," being presented at the International Council on Systems Engineering (INCOSE) symposium, Nathaniel G. Martin, an ethnographer and computer scientist in the Xerox Innovation Group in Webster, New York, describes the new technology.
Categorization is a powerful form of text mining. It associates a document with subject categories that a computer learns from a "training set" of documents that a subject matter expert has classified by hand. The new software program improves the speed and accuracy of categorizing systems because it helps the subject matter expert interactively create the training set, choosing and refining the categories and the conditions under which they are applied.
It is a technique that could improve results from traditional categorizing systems and is particularly useful for classifying short documents, according to Martin.
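As a rough illustration of the training-set idea described above, the sketch below trains a small naive Bayes categorizer on a few hand-labeled snippets and then lets the "expert" add a new category and retry. It is not the Xerox software, whose internals are not described here. The class name TinyCategorizer, the sample service-log phrases, and the category names are all hypothetical, and naive Bayes is simply a common stand-in technique chosen for brevity.

```python
import math
from collections import Counter, defaultdict


class TinyCategorizer:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of training docs
        self.vocabulary = set()                  # every word seen in training

    def add_example(self, text, category):
        """Add one hand-labeled document to the training set."""
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def categorize(self, text):
        """Return the category with the highest log-probability score."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Prior for the category plus smoothed likelihood of each word.
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical service-log snippets labeled by a subject matter expert.
model = TinyCategorizer()
model.add_example("paper jam in feeder tray", "paper handling")
model.add_example("sheets misfeed from tray two", "paper handling")
model.add_example("streaks on printed page", "image quality")
model.add_example("faded color on output page", "image quality")

# With only two categories, this log entry is forced into one of them.
first_guess = model.categorize("toner smears after fuser")

# The expert refines the training set by introducing a new category.
model.add_example("toner smears near fuser", "fusing")
model.add_example("fuser overheats and toner flakes", "fusing")
second_guess = model.categorize("toner smears after fuser")
print(first_guess, "->", second_guess)
```

The point of the sketch is the loop at the end: because the training set can be extended at any time, the categories do not have to stay fixed after the first round of labeling.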
The scientists' discovery grew out of a request from a Xerox engineering group for help analyzing service logs, the record of calls from service technicians in the field to company engineers about problems with production printer and copier operation. The engineering group was manually classifying these logs so they could identify and devote their efforts to solving the most important problems.
They asked Xerox Innovation Group (XIG) scientists to develop an algorithm that would automate the way service log problems were grouped into categories. A traditional categorizing system would have learned from the work the engineers had already done, following the classification pattern already defined by the user. The categories would then remain static.
However, when Martin and his colleagues used ethnographic techniques like conducting open-ended interviews and videotaping an engineer as he continued to categorize the service logs, they realized that what he was doing did not fit the traditional description of categorizing.
This new technique reduced the time required to categorize the service logs from a week to a few minutes, making the group more productive. The new software program is now being used in other Xerox organizations to analyze unstructured responses such as comments from customers. Xerox has applied for a patent on the technology.
In addition, at the INCOSE symposium Anthony M. Federico, Vice President of Platform Development for the Xerox Production Systems Group, will give a keynote speech on "System Engineering in Advanced Color Imaging." Symposium attendees can also tour Xerox's Webster research and manufacturing complex to learn about the principles of color digital printing and how paper choice impacts printing.