A major purpose is to transform the textual product description, within each category, into a table of normalized attributes and values that reflects the product's characteristics in a clear, consistent and unambiguous manner. We love to call it "the single version of the truth".
The challenge is to take classified product descriptions, in any language, dialect, or structure, including synonyms, abbreviations, spelling mistakes, different systems of measurement, etc., and to transform them, automatically and accurately, into a normalized table of appropriate technical attributes and values.
The role of the Extractor is to automatically transform any classified product description into a normalized table of attributes and values, efficiently and correctly.
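To make the idea of a normalized table concrete, here is a minimal sketch in Python. It is not the Extractor's implementation: the unit table, the synonym table, the regular expression and the function name `extract_attributes` are all assumptions made for illustration, and a real description would need far richer rules.

```python
import re

# Hypothetical unit table: raw unit spelling -> (standard unit, conversion factor).
UNIT_FACTORS = {
    "mm": ("mm", 1.0),
    "cm": ("mm", 10.0),
    "in": ("mm", 25.4),
    "inch": ("mm", 25.4),
}

# Hypothetical attribute-name synonyms and abbreviations for a single category.
ATTRIBUTE_SYNONYMS = {
    "len": "length",
    "length": "length",
    "dia": "diameter",
    "diam": "diameter",
    "diameter": "diameter",
}

# Matches "<name> <number> <unit>", e.g. "DIA 0.5 in" or "LEN. 4 cm".
PATTERN = re.compile(
    r"(?P<name>[a-z]+)\.?\s*[:=]?\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[a-z]+)",
    re.IGNORECASE,
)

def extract_attributes(description):
    """Return {standard attribute: (value in the standard unit, standard unit)}."""
    table = {}
    for match in PATTERN.finditer(description):
        name = ATTRIBUTE_SYNONYMS.get(match["name"].lower())
        unit = UNIT_FACTORS.get(match["unit"].lower())
        if name is None or unit is None:
            continue  # unknown attribute name or unit: leave it for manual review
        standard_unit, factor = unit
        table[name] = (round(float(match["value"]) * factor, 2), standard_unit)
    return table

print(extract_attributes("BOLT HEX DIA 0.5 in LEN. 4 cm"))
```

In this sketch both measurements end up in millimetres under standard attribute names, regardless of how they were written in the raw description.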
We use the Extractor after classification. The extraction process starts with auto-extraction of the data using existing Knowledge Bases. This is followed by an interactive session in which subject domain experts "train" (teach) the system by extracting one product description manually, then allowing the Extractor to use this example to automatically extract values from other product descriptions. The session is repeated, with more manual examples provided from the un-extracted products, until all required products are properly extracted.
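The session described above can be sketched as a loop. This is only an illustration under strong assumptions: the "learning" step here is a deliberately simple stand-in (it remembers the token that precedes the value in the expert's example), not the Extractor's actual learning method, and the names `learn_pattern`, `auto_extract`, `run_session` and `ask_expert` are hypothetical.

```python
import re

def learn_pattern(description, attribute, value):
    """Derive a pattern from one manual example: the token before the value is the cue."""
    tokens = description.split()
    index = tokens.index(value)  # raises ValueError if the value is not a token
    if index == 0:
        raise ValueError("example value has no preceding cue token")
    cue = tokens[index - 1]
    # Capture whatever token follows the cue in other descriptions.
    return attribute, re.compile(rf"(?<!\S){re.escape(cue)}\s+(\S+)")

def auto_extract(description, knowledge_base):
    """Apply every stored pattern to one description."""
    row = {}
    for attribute, pattern in knowledge_base:
        match = pattern.search(description)
        if match:
            row[attribute] = match.group(1)
    return row

def run_session(descriptions, knowledge_base, ask_expert, required, max_rounds=10):
    """Auto-extract, then ask the expert for one example per round until done."""
    results = {}
    for _ in range(max_rounds):  # bounded so an unhelpful example cannot loop forever
        results = {d: auto_extract(d, knowledge_base) for d in descriptions}
        pending = [d for d in descriptions if not required.issubset(results[d])]
        if not pending:
            break
        example = pending[0]
        # ask_expert returns {attribute: value} for one un-extracted description.
        for attribute, value in ask_expert(example).items():
            knowledge_base.append(learn_pattern(example, attribute, value))
    return results  # reflects the last extraction pass
```

A call such as `run_session(descriptions, [], ask_expert, {"flow"})` would start from an empty knowledge base; passing a previously built list corresponds to starting from existing Knowledge Bases.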
The technology behind the Extractor consists of methods such as fuzzy matching, pattern recognition, Inductive Logic Programming (ILP) and "Machine Learning by Examples", adapted to handle unstructured data with no semantics.
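As an illustration of the fuzzy-matching idea only, the sketch below uses Python's standard `difflib` module to map a misspelled term to the closest entry in a small vocabulary. The vocabulary, the cutoff of 0.8 and the function name `fuzzy_lookup` are assumptions; nothing here is taken from the Extractor itself.

```python
import difflib

# Hypothetical vocabulary of standard terms for one attribute.
STANDARD_TERMS = ["stainless steel", "carbon steel", "aluminium", "brass"]

def fuzzy_lookup(raw_term, vocabulary=STANDARD_TERMS, cutoff=0.8):
    """Return the closest standard term, or None if nothing is similar enough."""
    # Normalize case and collapse repeated whitespace before comparing.
    candidate = " ".join(raw_term.lower().split())
    matches = difflib.get_close_matches(candidate, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(fuzzy_lookup("Stainles  Steal"))
print(fuzzy_lookup("xyz"))
```

String similarity of this kind helps with spelling mistakes, but it does not resolve abbreviations or synonyms, which is where lexicons and learned patterns come in.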
There are several deliverables when using the Extractor. The first is a table of normalized technical attributes and values that describe the products in an unambiguous way; the second is a collection of pattern examples, organized as a knowledge base, to be used in the future to auto-extract other products; the third is a set of lexicons that link raw data terms, in the context of a category, to the proper standard lexical term.
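One possible way to picture such a lexicon is a lookup keyed by both category and raw term, so that the same raw term can resolve differently in different categories. The entries, the category names and the function name `standard_term` below are invented for illustration.

```python
# Hypothetical lexicon: (category, raw term) -> standard lexical term.
LEXICON = {
    ("fasteners", "ss"): "stainless steel",
    ("fasteners", "galv"): "galvanized",
    ("cables", "ss"): "single strand",
}

def standard_term(category, raw_term):
    """Look up a raw term in the context of its category; fall back to the raw term."""
    key = (category.lower(), raw_term.lower().strip(". "))
    return LEXICON.get(key, raw_term)

print(standard_term("Fasteners", "SS"))
print(standard_term("cables", "SS"))
```

Keying on the category is the design point here: a flat synonym list could not hold both meanings of the same abbreviation.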