On a robust document classification approach using TF-IDF scheme with learned, context-sensitive semantics.

Pandit, Sushain

Date

2008-01-01

Authors

Pandit, Sushain

Department

Computer Science

Abstract

Document classification is a well-known task in the information retrieval domain and relies on various indexing schemes to map documents into a form that a classification system can consume. Term Frequency-Inverse Document Frequency (TF-IDF) is one such class of term-weighting functions, used extensively for document representation. A major drawback of this scheme is that it ignores key semantic links between words and/or word meanings and compares documents based solely on word frequencies. The majority of current approaches that try to address this issue either rely on alternate representation schemes or are based on probabilistic models. We use a non-probabilistic approach to build a robust document classification system, which relies on enriching the classical TF-IDF scheme with context-sensitive semantics using a neural-net-based learning component.
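
The drawback described in the abstract can be illustrated with a minimal sketch of plain TF-IDF weighting and cosine similarity. This is not the paper's method: the function names, the toy documents, and the particular weighting variant (relative term frequency times log of inverse document frequency) are assumptions chosen for illustration only. It shows that two sentences built from synonyms receive zero similarity under classical TF-IDF, because the scheme only compares shared surface terms.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: documents 0 and 1 mean the same thing
# but use different content words; documents 0 and 2 share words.
docs = [
    "the car needs fuel".split(),
    "the automobile needs petrol".split(),
    "the car needs fuel today".split(),
]
vecs = tf_idf_vectors(docs)

sim_synonyms = cosine(vecs[0], vecs[1])  # only zero-weight terms are shared
sim_shared = cosine(vecs[0], vecs[2])    # "car" and "fuel" overlap

print("synonym pair:", sim_synonyms)
print("shared-word pair:", sim_shared)
```

In this toy corpus the words common to every document ("the", "needs") get an IDF of zero, so the synonym pair has no weighted overlap at all, while the pair that repeats "car" and "fuel" scores above zero. Any semantic enrichment of the kind the abstract proposes would have to supply the link between "car" and "automobile" that the raw frequencies cannot.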