Of course, such talks always contained a tutorial part on how to assemble a text mining workflow:
1. Importing data from various formats (plain text of course, but also PDF, DOC, XML, PubMed records, web scraping, and more);
2. Filtering and cleaning the text of punctuation, stop words, and other useless parts;
3. Stemming, i.e. reducing each word to its stem so that "promise" and "promising" refer to the same concept;
4. Tagging specific entities in text, such as part of speech, names, locations, medical terms, sentiment, and more;
5. Bag of Words or keyword extraction to represent the original text;
6. Classification based on the previous tagging.
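To make steps 2, 3, and 5 concrete, here is a minimal, self-contained sketch in plain Python. The toy documents, the tiny stop-word list, and the naive suffix stripper are stand-ins for real components (a production pipeline would use a proper stemmer such as Porter's), but the flow is the same: clean, stem, then count.

```python
import re

# A toy corpus standing in for imported documents (step 1).
documents = [
    "The pasta was promising, but the service was not!",
    "Promise of great sushi; the sashimi delivered.",
]

# A tiny stop-word list; real pipelines ship much larger ones.
STOP_WORDS = {"the", "was", "but", "not", "of", "a", "and"}

def clean(text):
    # Step 2: lowercase, strip punctuation, drop stop words.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def stem(word):
    # Step 3: a naive suffix stripper standing in for a real stemmer;
    # it maps both "promise" and "promising" to the stem "promis".
    for suffix in ("ing", "ed", "es", "e", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bag_of_words(doc):
    # Step 5: represent a document as a dictionary of stem counts.
    bag = {}
    for token in clean(doc):
        s = stem(token)
        bag[s] = bag.get(s, 0) + 1
    return bag

bags = [bag_of_words(d) for d in documents]
```

Each document is now a sparse vector of stem counts, ready to feed a downstream tagger or classifier.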
Tagging and classification were often the steps with the highest expectations. The most frequently asked questions revolved around them, like "Where do I find the tags for the classes I need?" That is, "Where do I find an ontology that associates words with classes for my problem?" Suppose I want to classify restaurants based on their ethnicity. I need a subset of documents tagged with the restaurant ethnicity to use as a training set. Where can I find that?
I believe there is quite a bit of misunderstanding here. A general text mining tool does not usually provide tagged data for specific classification problems. It is the user's responsibility to collect and tag the data, i.e. to define the ontology, and to build a text mining solution on top of it.
Some text mining solutions offer an editor to manually tag each sentence with the right class, but this is of course a tedious and lengthy process! Another option is to buy a vertical solution that analyzes a specific text problem in a specific context, for example a solution for medical document classification. In the medical field, many ontologies associating words with medical classes are widely available. The more restricted the text context and the classification problem, the easier it is to find a pre-packaged ontology or tag dictionary.
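In practice, such a tag dictionary can be as simple as a mapping from cue words to classes. The sketch below uses the restaurant-ethnicity example from above; the dictionary entries and class names are invented for illustration, and a real ontology would be far larger and curated.

```python
# A hypothetical hand-built tag dictionary mapping cue words to
# classes -- this is the "ontology" the user must provide.
TAG_DICTIONARY = {
    "sushi": "japanese",
    "sashimi": "japanese",
    "ramen": "japanese",
    "pasta": "italian",
    "risotto": "italian",
    "taco": "mexican",
    "burrito": "mexican",
}

def classify(text):
    # Count how many cue words of each class appear, then pick
    # the majority class (None if no cue word matches).
    counts = {}
    for word in text.lower().split():
        cls = TAG_DICTIONARY.get(word.strip(".,;!?"))
        if cls:
            counts[cls] = counts.get(cls, 0) + 1
    return max(counts, key=counts.get) if counts else None
```

For example, `classify("Great sushi and fresh sashimi, decent pasta.")` picks the class with the most matching cue words. The hard part is never this code; it is filling the dictionary for your domain.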
My experience with text mining, though, is that a data scientist often has to deal with a new classification problem in a wide and unknown text context. Defining the ontology quickly then becomes a bit of an art. Here are a few tricks I have used in the past to avoid manual tagging:
- Using the document origin as the document class, for example the search keys that were used to retrieve the documents;
- Borrowing the ontology from a similar text context: even if it is not a 100% match, it might still be sufficient to quickly tag the data;
- Importing a dictionary from the web (for example the Stanford dictionary for sentiment analysis) whose tags can easily be translated into the tags you need.
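The first trick is the cheapest to implement: the search key that retrieved a document becomes its class label, and the training set falls out for free. The queries and documents below are invented placeholders.

```python
# Hypothetical retrieval results: each search key maps to the
# documents it returned. The search key doubles as the class label.
retrieved = {
    "italian restaurant": ["Best pasta in town", "Cozy family trattoria"],
    "sushi bar": ["Omakase counter downtown", "Fresh nigiri daily"],
}

# Flatten into a (text, label) training set -- no manual tagging needed.
training_set = [
    (doc, query) for query, docs in retrieved.items() for doc in docs
]
```

The labels are noisy (a query can return off-topic documents), but they are often good enough to bootstrap a first classifier, which can then be refined.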