Inventors:
Alfredo Alba - Morgan Hill CA, US
Chad E. DeLuca - San Jose CA, US
Vuk Ercegovac - Campbell CA, US
Thomas D. Griffin - Campbell CA, US
Jun Rao - San Jose CA, US
Eugene J. Shekita - San Jose CA, US
Asim V. Singh - San Jose CA, US
Yuanyuan Tian - San Jose CA, US
Kevin B. Wang - Mountain View CA, US
Assignee:
International Business Machines Corporation - Armonk NY
International Classification:
G06N 5/02
Abstract:
Embodiments of the invention relate to building a distributed reverse semantic index. In one general embodiment, a plurality of documents is received, each document having at least one defined rule and/or semantic. The documents are distributed among a plurality of nodes of a system and processed in a generally parallel fashion. Processing the documents includes processing the text data of each document and breaking each document into fields to index the text data, creating index data by deferring how to categorize the text data based upon the defined rules and/or semantics. The index data is then combined to create an indexer-agnostic semantic index that includes a plurality of semantic index shards, and the documents are semantically classified, based on the index shards, into groups by document type to create the distributed reverse semantic index.
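
The flow described in the abstract (distribute, process in parallel, break into fields, build shards, combine, group by document type) can be illustrated with a minimal single-machine sketch. It is not the claimed implementation: the `RULES` table, the document layout (`id`, `type`, `text`), the round-robin partitioning, and every function name below are assumptions made for illustration, and a thread pool stands in for the plurality of nodes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-type rules: each maps raw text to named fields.
RULES = {
    "email": lambda text: dict(zip(("subject", "body"), text.split("\n", 1))),
    "note": lambda text: {"body": text},
}

def partition(docs, num_nodes):
    """Distribute documents round-robin across num_nodes simulated nodes."""
    return [docs[i::num_nodes] for i in range(num_nodes)]

def build_shard(docs):
    """Build one shard: (field, term) -> sorted doc ids, plus doc types."""
    postings = defaultdict(set)
    types = {}
    for doc in docs:
        types[doc["id"]] = doc["type"]
        fields = RULES[doc["type"]](doc["text"])  # break document into fields
        for field, text in fields.items():
            for term in text.lower().split():
                postings[(field, term)].add(doc["id"])
    return {"postings": {k: sorted(v) for k, v in postings.items()},
            "types": types}

def build_index(docs, num_nodes=2):
    """Build shards in parallel, then combine them and group docs by type."""
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        shards = list(pool.map(build_shard, partition(docs, num_nodes)))
    groups = defaultdict(list)
    for shard in shards:
        for doc_id, doc_type in shard["types"].items():
            groups[doc_type].append(doc_id)
    return {"shards": shards,
            "groups": {t: sorted(ids) for t, ids in groups.items()}}

def search(index, field, term):
    """Look up a term in a field across every shard."""
    hits = []
    for shard in index["shards"]:
        hits.extend(shard["postings"].get((field, term.lower()), []))
    return sorted(hits)

# Made-up sample documents, used only to exercise the sketch.
docs = [
    {"id": 1, "type": "email", "text": "Index status\nshards merged"},
    {"id": 2, "type": "note", "text": "merged shards today"},
    {"id": 3, "type": "email", "text": "Lunch\nindex later"},
]
index = build_index(docs, num_nodes=2)
```

In this sketch each shard is a plain dictionary, so the combined structure does not depend on any particular indexing engine; a real deployment would substitute its own shard format and node scheduler.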