Mining Massive Text Data and Tracking Target Features


2009-12-25  09:30 - 10:20
Room 308, Mathematics Research Center Building (formerly New Math. Bldg.)

We present a systematic data mining procedure for exploring large free-form text datasets to discover useful features and to develop tracking statistics (often referred to as performance or risk measures). The procedure has three steps: text classification, construction of tracking statistics, and inference under measurement error. The main difficulty in deriving the inference scheme is accounting for classification errors, for which we propose two types of approaches: "plug-in" and "projection" methods. We also consider bootstrap calibration for fine-tuning. Finally, as an illustrative example, we apply the proposed procedure to an aviation safety report repository to show its utility in aviation risk management and in general decision-support systems. Although most illustrations here are drawn from aviation safety data, the procedure applies to many other domains, for example mining medical reports to track medical errors or possible disease outbreaks.
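
The abstract does not spell out the plug-in method or the bootstrap step, so the following is only a generic sketch of the idea, not the speaker's procedure. It assumes a binary classifier whose sensitivity and specificity are treated as known, corrects the observed fraction of flagged reports with the standard misclassification adjustment, and attaches a percentile bootstrap interval. The function names, the error rates, and the toy data are all hypothetical.

```python
import random


def plug_in_rate(predicted, sensitivity, specificity):
    """Adjust the observed positive fraction for classifier error.

    Assumption: sensitivity and specificity are known constants that are
    "plugged in"; their own estimation uncertainty is ignored here.
    """
    if sensitivity + specificity <= 1.0:
        raise ValueError("classifier must perform better than chance")
    observed = sum(predicted) / len(predicted)
    corrected = (observed + specificity - 1.0) / (sensitivity + specificity - 1.0)
    # The raw correction can fall outside [0, 1] in small samples, so clip it.
    return min(1.0, max(0.0, corrected))


def bootstrap_interval(predicted, sensitivity, specificity,
                       n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the corrected rate."""
    rng = random.Random(seed)
    n = len(predicted)
    estimates = sorted(
        plug_in_rate([predicted[rng.randrange(n)] for _ in range(n)],
                     sensitivity, specificity)
        for _ in range(n_boot)
    )
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Hypothetical data: 100 reports, 30 flagged by the classifier.
predicted = [1] * 30 + [0] * 70
rate = plug_in_rate(predicted, sensitivity=0.90, specificity=0.95)
lower, upper = bootstrap_interval(predicted, sensitivity=0.90, specificity=0.95)
print(f"corrected rate: {rate:.3f}, 95% interval: ({lower:.3f}, {upper:.3f})")
```

The sketch resamples only the classifier outputs. A fuller treatment would also propagate the uncertainty in the estimated error rates, which is presumably part of what the calibration step in the talk addresses.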