In many cases, all measurable, observed data and information are recorded, resulting in thousands (sometimes millions) of records that were written, logged or acquired by speech-to-text recognition. In such a huge amount of data it is practically impossible to manually distinguish the valuable “signal” from the worthless “noise”. How can this tremendous amount of data be examined to dramatically reduce or even prevent critical incidents and accidents?
Computerized selection of the few vital and valuable records, by discovering hidden patterns and relationships in data and texts, is a long-awaited solution worldwide.
Data Mining and Text Mining
Data Mining is the process of discovering hidden patterns and relationships in data.
Text Mining applies Data Mining tools to textual data in order to extract patterns from natural language, i.e. mostly unstructured data in which identical things are described in different words and, vice versa, different things may be described in similar words. Text Mining differs from web search, where the user is looking for something that is already known and has been written by someone else. An efficient Text Mining solution is an indispensable part of the FavoWeb intelligent incident data collection and management system.
FavoWeb FRACAS (Failure Reporting, Analysis and Corrective Action System) supports a crucial task of the Safety and Security mission: pattern recognition, classification, categorization and labeling of data sets and free texts.
FavoWeb FRACAS now includes a safety text-categorization system capable of assigning incoming failure/incident reports to one or more predefined categories on the basis of their textual content.
FavoWeb FRACAS Text Mining
• An integrated approach to large-scale text mining tasks provides a uniquely comprehensive solution to their typical difficulties:
• High Dimension (large number of input parameters: the single words of a vocabulary)
• Sparse Document Vectors (small number of distinct words in each document)
• Heterogeneous Use of Terms (documents of the same category may share only a few words)
• High Level of Redundancy (many different features are relevant to the classification)
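The sparsity and small-overlap properties listed above can be illustrated with a minimal Python sketch. It is not FavoWeb code: the function names, the sample report texts and the vocabulary size of 10,000 are assumptions chosen for illustration only.

```python
from collections import Counter

def sparse_vector(text):
    """Store only the words that actually occur: {word: count}."""
    return dict(Counter(text.lower().split()))

def sparse_dot(u, v):
    """Dot product that iterates only over the words of the shorter vector."""
    if len(u) > len(v):
        u, v = v, u
    return sum(count * v.get(word, 0) for word, count in u.items())

def density(vector, vocabulary_size):
    """Fraction of vocabulary entries that are non-zero in this document."""
    return len(vector) / vocabulary_size

# Two hypothetical reports that would belong to the same category.
report_a = sparse_vector("pump leak at hydraulic pump")
report_b = sparse_vector("hydraulic line leak")

overlap = sparse_dot(report_a, report_b)    # only "leak" and "hydraulic" are shared
fill = density(report_a, 10_000)            # assumed vocabulary of 10,000 words
```

Even in this toy case each report fills only a tiny fraction of the vocabulary, and two reports on the same subject share just two words.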
Text Mining for Prediction
Prediction is the ultimate goal of FavoWeb FRACAS text mining.
The FavoWeb Text Mining process is a complete solution covering all three main stages of Text Mining:
1. Text Pre-processing - data cleaning and transformations, selection of subsets, preliminary feature selection, and reduction of the large number of parameters to a manageable amount.
FavoWeb tool capabilities:
• Binary and word-frequency coding
• Reduction of vocabulary dimension by stemming, lemmatization, word frequency, etc.
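As a rough illustration of the two coding schemes and of vocabulary reduction, the Python sketch below uses a deliberately naive suffix-stripping stemmer and a document-frequency cut-off. The suffix list, function names and sample reports are assumptions made for this example and do not describe the FavoWeb implementation.

```python
import re
from collections import Counter

# Hypothetical suffix list for a deliberately naive stemmer (not Porter's algorithm).
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lower-case the text, keep runs of letters, and stem each token."""
    return [stem(tok) for tok in re.findall(r"[a-z]+", text.lower())]

def build_vocabulary(documents, min_doc_freq=1):
    """Keep only stems that occur in at least min_doc_freq documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return sorted(term for term, df in doc_freq.items() if df >= min_doc_freq)

def encode(text, vocabulary, binary=False):
    """Word-frequency coding by default; 0/1 presence coding if binary=True."""
    counts = Counter(tokenize(text))
    if binary:
        return [1 if counts[term] else 0 for term in vocabulary]
    return [counts[term] for term in vocabulary]

reports = [
    "Hydraulic pump leaking; pump replaced.",
    "Engine warning light flashed during landing.",
    "Leaks found in hydraulic pump line.",
]
vocab = build_vocabulary(reports, min_doc_freq=2)
frequency_code = encode(reports[0], vocab)
binary_code = encode(reports[0], vocab, binary=True)
```

Here "leaking" and "Leaks" collapse to the same stem, and the frequency cut-off drops every stem that appears in only one report, so the vocabulary shrinks to the three stems the first and third reports share.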
2. Model Building and Validation - considering various models and choosing the best one, based on predictive performance, to assign new reports to one or more of a set of predefined categories on the basis of their textual content.
The FavoWeb FRACAS model building and validation stage uses the following approaches:
• Classical and fast SVM (Support Vector Machine) methods
• Cross-validation to tune the kernel type and penalty value
• One-vs-one and One-vs-rest for multi-class categorization
• Algorithms for unbalanced data sets and data sets with blank (missing) values
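The following Python sketch shows how these pieces might fit together on toy data. It is an illustration under stated assumptions, not the FavoWeb algorithm: it uses a small linear SVM trained by batch subgradient descent as a stand-in, tunes only the penalty value C (the kernel is fixed to linear), uses One-vs-rest only, and re-weights the positive class as a simple balancing heuristic. All function names and the data are hypothetical.

```python
def train_linear_svm(X, y, C=1.0, epochs=200, lr=0.05):
    """Batch subgradient descent on 0.5*||w||^2 + C * sum(weight_i * hinge_i).

    y holds +1/-1 labels. Positives are weighted by n_neg/n_pos so that a
    rare category is not swamped by the majority (a balancing heuristic).
    """
    n_pos = sum(1 for label in y if label > 0)
    n_neg = len(y) - n_pos
    pos_weight = n_neg / n_pos if n_pos else 1.0
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0           # gradient of the 0.5*||w||^2 term
        for x, label in zip(X, y):
            margin = label * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:                     # sample violates the margin
                weight = C * (pos_weight if label > 0 else 1.0)
                for j, xj in enumerate(x):
                    grad_w[j] -= weight * label * xj
                grad_b -= weight * label
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def train_one_vs_rest(X, labels, C):
    """One binary SVM per category: that category (+1) against all others (-1)."""
    return {c: train_linear_svm(X, [1 if l == c else -1 for l in labels], C)
            for c in sorted(set(labels))}

def predict(models, x):
    """Pick the category whose binary SVM gives the largest decision value."""
    def score(c):
        w, b = models[c]
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(models, key=score)

def cross_val_accuracy(X, labels, C, k=3):
    """Plain k-fold cross-validation; fold i holds samples i, i+k, i+2k, ..."""
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        models = train_one_vs_rest([X[i] for i in train],
                                   [labels[i] for i in train], C)
        correct += sum(predict(models, X[i]) == labels[i] for i in test)
    return correct / len(X)

def select_penalty(X, labels, candidates=(0.01, 0.1, 1.0, 10.0)):
    """Return the candidate C with the best cross-validated accuracy."""
    return max(candidates, key=lambda C: cross_val_accuracy(X, labels, C))

# Toy word-count vectors over a hypothetical three-word vocabulary.
X = [[2, 0, 0], [1, 0, 0], [3, 0, 0],
     [0, 2, 0], [0, 1, 0], [0, 3, 0],
     [0, 0, 2], [0, 0, 1], [0, 0, 3]]
labels = ["hydraulic"] * 3 + ["engine"] * 3 + ["electrical"] * 3

best_C = select_penalty(X, labels)
models = train_one_vs_rest(X, labels, best_C)
```

Tuning the kernel type as well would mean repeating the same cross-validation loop over each candidate kernel and keeping the best (kernel, penalty) pair.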
3. Deployment - using the best model selected at the previous stage and applying it to new data to generate predictions or estimates of the expected outcome.
FavoWeb uses the following approaches:
• New point recognition for binary, One-vs-one or One-vs-rest multi-class categorization
• Evaluation of Accuracy
• Evaluation of Confidence of new point recognition
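A minimal Python sketch of this stage is given below. It assumes that each new report has already been scored by one One-vs-rest classifier per category, and it uses a softmax share of the winning decision value as the confidence. That measure is a heuristic chosen for illustration, not a calibrated probability and not necessarily the measure FavoWeb uses. The decision values and category names are invented.

```python
import math

def recognize(decision_values):
    """Return (category, confidence) for one new report.

    decision_values maps each category to the decision value of its
    one-vs-rest classifier. Confidence is the softmax share of the
    winning value, a heuristic number in (0, 1].
    """
    top = max(decision_values.values())
    exps = {c: math.exp(v - top) for c, v in decision_values.items()}
    total = sum(exps.values())
    best = max(exps, key=exps.get)
    return best, exps[best] / total

def accuracy(predicted, actual):
    """Fraction of reports whose predicted category matches the known one."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical decision values for three new reports with known categories.
scored_reports = [
    {"hydraulic": 1.8, "engine": -1.2, "electrical": -0.9},
    {"hydraulic": -0.1, "engine": 0.1, "electrical": -1.5},
    {"hydraulic": -1.0, "engine": -0.7, "electrical": 1.1},
]
known = ["hydraulic", "hydraulic", "electrical"]

results = [recognize(values) for values in scored_reports]
predicted = [category for category, _ in results]
overall_accuracy = accuracy(predicted, known)
```

The second report is misclassified, and its confidence is also the lowest of the three because two decision values are nearly tied. A low confidence can therefore be used to route a report to manual review.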
Text Mining Applications
• Aviation & Aerospace Safety
• Access and Surveillance Security
• Profiling Tax Cheaters
• Anti-terrorist Efforts
• Fault-prone Products Detection