Unsupervised semantic organization of spoken documents by topic analysis


2009-12-25  11:30 - 12:00
Room 308, Mathematics Research Center Building (ori. New Math. Bldg.)

Spoken documents are difficult to display on a screen, and difficult to retrieve and browse. It is therefore important to develop technologies for global semantic organization of the entire spoken document archive. A framework based on unsupervised topic analysis was proposed to achieve this global semantic organization, offering the user a global picture of the semantic structure of the archive. Probabilistic Latent Semantic Analysis (PLSA)/Non-negative Matrix Factorization (NMF) was used as a typical example tool for unsupervised topic analysis and was found to be helpful. Unlike conventional document clustering approaches, PLSA makes it possible to derive both the relationships among the topic clusters and appropriate terms to serve as topic labels. Chinese broadcast news was used as the example set of spoken documents. The choice of units other than words as the terms in the processing is also considered in the system, based on the special structure of the Chinese language.
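To make the idea of topic analysis with topic labels concrete, the following is a minimal sketch of PLSA trained with EM on a tiny term-count table. It is not the system described in the talk: the function name `plsa`, the toy English counts standing in for transcribed news stories, the number of topics, and the iteration count are all assumptions chosen for illustration. The highest-probability terms under each topic are then read off as candidate topic labels.

```python
import random


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM (illustrative sketch).

    counts: list of dicts mapping term -> count, one dict per document.
    Returns (P(term | topic) as a list of dicts, P(topic | doc) as a list of lists).
    """
    rng = random.Random(seed)
    vocab = sorted({t for doc in counts for t in doc})
    n_docs = len(counts)

    # Random positive initialisation, normalised to probability distributions.
    p_t_z = []
    for _ in range(n_topics):
        w = {t: rng.random() + 0.1 for t in vocab}
        s = sum(w.values())
        p_t_z.append({t: v / s for t, v in w.items()})
    p_z_d = []
    for _ in range(n_docs):
        w = [rng.random() + 0.1 for _ in range(n_topics)]
        s = sum(w)
        p_z_d.append([v / s for v in w])

    for _ in range(n_iter):
        acc_t_z = [dict.fromkeys(vocab, 0.0) for _ in range(n_topics)]
        acc_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        for d, doc in enumerate(counts):
            for t, n in doc.items():
                # E-step: posterior P(topic | doc, term), up to normalisation.
                post = [p_z_d[d][z] * p_t_z[z][t] for z in range(n_topics)]
                norm = sum(post) or 1.0
                for z in range(n_topics):
                    r = n * post[z] / norm  # expected count assigned to topic z
                    acc_t_z[z][t] += r
                    acc_z_d[d][z] += r
        # M-step: renormalise the expected counts.
        for z in range(n_topics):
            s = sum(acc_t_z[z].values()) or 1.0
            p_t_z[z] = {t: v / s for t, v in acc_t_z[z].items()}
        for d in range(n_docs):
            s = sum(acc_z_d[d]) or 1.0
            p_z_d[d] = [v / s for v in acc_z_d[d]]
    return p_t_z, p_z_d


# Hypothetical toy term counts (not data from the talk).
docs = [
    {"election": 3, "vote": 2, "party": 1},
    {"vote": 2, "party": 2, "minister": 1},
    {"typhoon": 3, "rain": 2, "flood": 1},
    {"rain": 2, "flood": 2, "wind": 1},
]

term_given_topic, topic_given_doc = plsa(docs, n_topics=2)

# Candidate topic labels: the three most probable terms under each topic.
labels = [
    sorted(dist, key=dist.get, reverse=True)[:3] for dist in term_given_topic
]
for z, top_terms in enumerate(labels):
    print("topic", z, "->", top_terms)
```

The keys of each count dictionary are whatever units the tokenizer produces, so the same sketch applies unchanged if terms are units other than words.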