NLP / NLU: Text Processing
Text processing involves reading text from various sources, removing irrelevant structures such as HTML tags, and normalising the text so it is easier to work with.
Capturing Text Data
The simplest source is a plain text file, but text sources can include JSON, XML, CSV, TSV, a SQL database, HTML, document OCR, or any other source where text can be communicated.
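As a rough sketch of pulling text out of structured sources, the snippet below reads one field from JSON and one column from CSV using only the Python standard library. The in-memory strings stand in for real files, and the field name `body` is a hypothetical choice made for illustration.

```python
import csv
import io
import json

# In-memory stand-ins for files; real code would use open(path, encoding="utf-8").
json_source = io.StringIO('{"id": 1, "body": "Text from a JSON field."}')
csv_source = io.StringIO("id,body\n2,Text from a CSV column.\n")

# Collect the raw text from each source into one list.
texts = [json.load(json_source)["body"]]
texts += [row["body"] for row in csv.DictReader(csv_source)]

print(texts)  # ['Text from a JSON field.', 'Text from a CSV column.']
```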
Cleaning
Depending on the source, text data can be incredibly messy. Consider extracting the plain text from a web page that includes broken tags, JavaScript, CSS, and embedded images. BeautifulSoup is a very useful Python library that takes care of the edge cases and problematic parts of parsing complicated HTML. The output, however, won’t always be perfect. There is a good chance you will still need regular expressions to handle what remains, which could include leftover JavaScript and a large amount of whitespace. BeautifulSoup also allows you to walk the DOM and extract text based on prior knowledge of the HTML document structure.
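One possible way to combine BeautifulSoup with a regex clean-up pass is sketched below. It assumes the `bs4` package is installed and uses Python's built-in `html.parser` backend. The helper name `html_to_text` and the sample page are made up for illustration.

```python
import re

from bs4 import BeautifulSoup


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose contents are code or styling rather than prose.
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    # Collapse the runs of whitespace left behind by the markup.
    return re.sub(r"\s+", " ", text).strip()


sample = """
<html><head><style>p { color: red; }</style></head>
<body>
  <p>First paragraph.</p>
  <script>console.log("tracking");</script>
  <p>Second   paragraph.</p>
</body></html>
"""

cleaned = html_to_text(sample)
print(cleaned)  # First paragraph. Second paragraph.
```

Real pages are rarely this tidy, so the regex step usually grows to match whatever residue a particular site leaves behind.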
Normalisation
Plain text is great, but it is still human language with embellishments and adornments that are convenient for a human reader but irrelevant for a machine. A common step is to convert all text to lower case, as case doesn’t change the meaning of a word (at least in most cases). It is also common to remove punctuation where the low-level details of the structure of the text don’t aid the task at hand. Lower case conversion and punctuation removal are the two most common steps for text normalisation, but it is worth keeping in mind that other steps may be necessary depending on what you are trying to achieve.
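The two steps can be sketched in a few lines of standard-library Python. The function name `normalise` is a hypothetical choice, and the pattern assumes plain ASCII English text; accented letters and other scripts would need a broader character class.

```python
import re


def normalise(text: str) -> str:
    """Lower-case the text and replace punctuation with spaces."""
    text = text.lower()
    # Anything that is not a lower-case letter, digit, or whitespace becomes a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse repeated whitespace introduced by the substitution.
    return re.sub(r"\s+", " ", text).strip()


normalised = normalise("The first time you see The Second Renaissance, it may look boring.")
print(normalised)
# the first time you see the second renaissance it may look boring
```

Replacing punctuation with a space rather than deleting it keeps words that were joined only by punctuation from running together.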
Tokenisation
“Token” is a fancy word for symbol. In most cases, the granularity of tokens is at the word level. Tokenisation can be achieved using basic Python, but the tools in the NLTK library can be more useful because they handle edge cases, such as treating ‘Dr.’ as a single token that includes the period, which would require hand-coded logic when using only the basic Python functions.
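To illustrate the kind of hand-coded logic involved, the sketch below tokenises a sentence with the standard library only, first naively and then with a rule for a few abbreviations. The abbreviation list is a hypothetical, deliberately tiny example. NLTK's `word_tokenize` is intended to cover cases like this without such rules, but it needs extra tokeniser data to be downloaded, so it is not used here.

```python
import re

text = "Dr. Smith graduated from the University of Washington."

# Naive approach: every run of word characters is a token, so "Dr." loses its period.
naive_tokens = re.findall(r"\w+", text)

# Hand-coded rule: try a few known abbreviations (hypothetical list) before plain words.
ABBREVIATIONS = ("Dr.", "Mr.", "Mrs.", "Ms.")
pattern = "|".join(re.escape(a) for a in ABBREVIATIONS) + r"|\w+"
tokens = re.findall(pattern, text)

print(naive_tokens[:2])  # ['Dr', 'Smith']
print(tokens[:2])        # ['Dr.', 'Smith']
```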
Stop Word Removal
Stop words are uninformative words in text such as “the”, “in”, “at”, “is”, etc. These words don’t add a lot of meaning to a sentence. They are also very commonly occurring words that can contribute a lot of the volume in text. Their removal can help shrink the vocabulary needed to describe the data and, as a result, help optimise tasks that take longer for larger vocabularies. You can see what words NLTK considers to be stop words in English by executing the following snippet:
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")  # one-time download of the stop word corpus
print(stopwords.words("english"))
This list is based on the stop words from a single corpus. Different corpora may consider different words to be stop words. It is worth keeping in mind that the usefulness or irrelevance of specific stop words is very application dependent.
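The removal itself is just a membership test against whichever list suits the application. The sketch below uses a small hand-picked set as a stand-in. It is not NLTK's list, which could be swapped in by building the set from `stopwords.words("english")`.

```python
# Illustrative stand-in for a real stop word list.
STOP_WORDS = {"the", "in", "at", "is", "a", "of", "and", "to"}

tokens = ["the", "cat", "is", "in", "the", "hat"]

# A set gives constant-time lookups, which matters for large documents.
filtered = [token for token in tokens if token not in STOP_WORDS]

print(filtered)  # ['cat', 'hat']
```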
Part-of-Speech Tagging
Part-of-speech tagging is the act of annotating the text with the part of speech each word takes within the grammatical structure of the text. NLTK includes a pos_tag function that takes a tokenised sentence and returns a POS tag for each word. POS tagging allows for grammatical parsing of text.
Named Entity Recognition
Named entities are noun phrases that refer to some specific object, person, or place. NLTK provides the ne_chunk function, which takes a tokenised, POS-tagged list of tuples and determines which chunks of the text are most likely named entities. In the wild, performance is not always great, but training on a large corpus definitely helps.
Stemming and Lemmatisation
Further simplifying text data requires additional normalisation steps to help deal with variations and modifications of words. Stemming is the process of reducing a word to the stem shared by its variations. For example, “branching”, “branches”, and “branched” all have the same stem, “branch”. This helps reduce complexity while still maintaining the meaning that is carried by words. Stemming is meant to be a very fast and crude operation carried out by very simple search and replace rules. This can result in stems that are not actual words. This is OK because machines can still infer the meaning encoded by these stems.
NLTK includes multiple stemming algorithms to choose from, including PorterStemmer, SnowballStemmer, and other language-specific stemmers.
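The “branch” example above can be reproduced with either stemmer. This sketch assumes NLTK is installed. Neither stemmer needs extra data files when used as shown.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # the language is chosen by name

words = ["branching", "branches", "branched"]
porter_stems = [porter.stem(word) for word in words]
snowball_stems = [snowball.stem(word) for word in words]

print(porter_stems)    # ['branch', 'branch', 'branch']
print(snowball_stems)  # ['branch', 'branch', 'branch']
```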
Lemmatisation is another process to reduce the complexity of words, but in this case the process uses a dictionary to map different variants of a word back to its root. With this approach, we are able to translate non-trivial variations of words back to their root. For example, “is”, “was”, and “were” can be converted back to the root “be”. The default lemmatiser in NLTK uses the WordNet database to reduce words to their root form. Lemmatisation needs to know, or make assumptions about, the part of speech of each word so that roots can be disambiguated properly.
Typical Workflow
First normalise the input text. Then tokenise the normalised text. Remove stop words. Depending on the application, it may then be useful to apply either stemming or lemmatisation as the last step in the text preprocessing flow. That said, it is not uncommon to first apply lemmatisation and then stemming. The output of this workflow can then be used as the input for further analysis.
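Put together, the workflow might look like the sketch below. It assumes NLTK is installed for the stemmer. To stay self-contained, it uses whitespace splitting for tokenisation and a small hand-picked stop word set instead of NLTK's tokeniser and stop word list. The function name `preprocess` is hypothetical.

```python
import re

from nltk.stem import PorterStemmer

# Illustrative stand-in for a real stop word list.
STOP_WORDS = {"the", "in", "at", "is", "a", "of", "and", "to", "were"}
stemmer = PorterStemmer()


def preprocess(text: str) -> list[str]:
    # 1. Normalise: lower-case and replace punctuation with spaces.
    normalised = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    # 2. Tokenise: split on whitespace.
    tokens = normalised.split()
    # 3. Remove stop words.
    tokens = [token for token in tokens if token not in STOP_WORDS]
    # 4. Stem each remaining token.
    return [stemmer.stem(token) for token in tokens]


result = preprocess("The branches were branching in the wind.")
print(result)  # ['branch', 'branch', 'wind']
```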