NLP / NLU: Introduction
NLP Overview
Natural language allows us to develop complex thoughts and reason about them. Much progress has been made, but computers still struggle to properly parse and understand natural language.
Structured Languages
Human languages are not precisely structured. Structured languages include mathematics, programming languages, and formal logic. Their structure ensures that what is communicated contains no ambiguity. This lack of ambiguity makes structured languages easy for computers to process and to execute the instructions written in them.
Grammar
Structured languages are easy to parse because they follow strict grammars. Violations of grammatical rules in structured languages are reported as syntax errors. Grammars can be defined in Backus-Naur Form (BNF).
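As an illustrative sketch, here is a tiny BNF grammar for integer addition together with a minimal recursive-descent parser for it in Python. The grammar and all function names are hypothetical examples, not part of any particular tool; note how a violation of the grammar surfaces as a syntax error, just as in a real structured language.

```python
# Grammar being parsed (illustrative BNF):
#   <expr>   ::= <number> | <number> "+" <expr>
#   <number> ::= one or more digits

def parse_number(tokens, pos):
    """Parse <number>; raise a syntax error if the grammar is violated."""
    if pos >= len(tokens) or not tokens[pos].isdigit():
        raise SyntaxError(f"expected a number at position {pos}")
    return int(tokens[pos]), pos + 1

def parse_expr(tokens, pos=0):
    """Parse <expr> ::= <number> ("+" <number>)*; returns the sum."""
    value, pos = parse_number(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        rhs, pos = parse_number(tokens, pos + 1)
        value += rhs
    return value, pos

def evaluate(text):
    tokens = text.split()
    value, pos = parse_expr(tokens)
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing tokens")
    return value

print(evaluate("1 + 2 + 3"))  # 6
```

Because the grammar is unambiguous, every valid input has exactly one parse, and anything else is rejected outright.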
Structured Text
Natural languages also have grammatical rules, but the way people speak and communicate informally often breaks these rules. Humans are able to accurately parse unstructured communication and even handle the ambiguity present. The presence of grammatical structure even in unstructured text means that computers can begin to parse information out of the text, even if they are unable to understand it the way humans can.
Context is Everything
Part of the problem that prevents computers from fully understanding unstructured text is the critical role context plays in determining the underlying meaning.
Consider the statement “The sofa didn’t fit through the door because it was too narrow”. ‘It’ clearly refers to the door in this statement. But what about the statement “The sofa didn’t fit through the door because it was too wide”? Here ‘it’ clearly refers to the sofa. Comparing the two cases, swapping ‘narrow’ for ‘wide’ changes which object ‘it’ refers to, because of our innate understanding of the context described in each statement. Unless a computer can parse the underlying symbolic relationships, in this case the spatial relationship between the sofa and the door, it cannot resolve the semantics of the statements and disambiguate what ‘it’ refers to.
NLP and Pipelines
Natural language processing is one of the fastest growing technologies in the world. NLP pipelines are the sequential steps that take raw text and produce structured output.
NLP pipelines consist of three stages:
- Text Processing
- Feature Extraction
- Modeling
Each stage transforms text in some way and produces output that the next stage uses as input.
It is worth keeping in mind that the workflow for building these pipelines is not strictly linear. You will iterate over the three stages, refining what each one needs to achieve in order to produce the desired result from the whole pipeline.
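The three stages above can be sketched as a chain of functions, where each stage's output is the next stage's input. Everything here is a deliberately toy illustration (the tag-stripping regex, the bag-of-words features, and the keyword "model" are all stand-ins for real implementations):

```python
import re
from collections import Counter

def text_processing(raw):
    # Stage 1: strip markup and normalise to lowercase word tokens.
    text = re.sub(r"<[^>]+>", " ", raw).lower()
    return re.findall(r"[a-z']+", text)

def feature_extraction(tokens):
    # Stage 2: bag-of-words counts as a simple numerical representation.
    return Counter(tokens)

def modeling(features):
    # Stage 3: a toy "model" that flags whether the word 'error' appears.
    return features["error"] > 0

def pipeline(raw):
    return modeling(feature_extraction(text_processing(raw)))

print(pipeline("<p>An ERROR occurred.</p>"))  # True
```

In practice you would revisit each stage repeatedly: a poor prediction from the model often traces back to how the text was cleaned or which features were extracted.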
Text Processing
Text processing involves extracting plain text from whatever source the text originated from. This means removing any tags or structures that are present in the raw input but are not relevant to what the pipeline is trying to achieve.
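As a minimal sketch, assuming the raw input is an HTML snippet, text processing might strip the tags, decode entities, and normalise case and whitespace (the regex-based tag removal here is a simplification; real pipelines typically use a proper HTML parser):

```python
import html
import re

def clean_text(raw_html):
    """Extract plain text from a (hypothetical) HTML snippet."""
    text = re.sub(r"<[^>]+>", " ", raw_html)   # drop tags
    text = html.unescape(text)                 # decode entities like &amp;
    text = text.lower()                        # normalise case
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

raw = "<h1>Sofas &amp; Doors</h1><p>The sofa didn't fit.</p>"
print(clean_text(raw))  # sofas & doors the sofa didn't fit.
```

Which structures count as irrelevant depends on the task: for sentiment analysis the tags can go, but for web scraping the tags may carry the signal you need.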
Feature Extraction
We now have clean, normalised text, but before feeding it into a model we need to convert it into the features the model expects, removing any irrelevant relationships encoded in the text. For example, ASCII representations of text imply sequential relationships between characters, which can be misleading for a model.
Statistical models need some form of numerical representation of the text to compute with. This can take the form of features like bag of words, TF-IDF, Word2Vec, or GloVe. There are many ways of representing textual information, and only with practice will you learn what is likely needed to solve a given problem.
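The two simplest representations mentioned above can be sketched in a few lines of pure Python. This is a toy corpus and a simplified TF-IDF formula (raw count times log of inverse document frequency); library implementations such as scikit-learn's add smoothing and normalisation on top of this idea:

```python
import math
from collections import Counter

docs = [
    "the sofa did not fit through the door",
    "the door was too narrow",
    "the sofa was too wide",
]

def bag_of_words(doc):
    """Bag of words: raw term counts, order discarded."""
    return Counter(doc.split())

def tf_idf(doc, corpus):
    """TF-IDF: term frequency weighted by inverse document frequency."""
    tf = bag_of_words(doc)
    n = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d.split())
        scores[term] = count * math.log(n / df)
    return scores

scores = tf_idf(docs[0], docs)
# 'the' appears in every document, so its IDF (and score) is 0,
# while rarer words like 'sofa' get a positive weight.
print(scores["the"], scores["sofa"])
```

This shows why TF-IDF is often preferred over raw counts: ubiquitous words like 'the' are down-weighted to zero, leaving the distinctive terms to characterise each document.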
Modeling
The final stage takes the extracted features and produces a model that can take input and make predictions to solve a given problem. Models can be statistical or deterministic in nature, may require training, and can even be composed of multiple models to create an ensemble of distinct components.
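The ensemble idea can be sketched as a majority vote over several component models. The three rule-based "models" below are purely illustrative stand-ins (in a real system some components would be trained statistical models), but the aggregation logic is the same:

```python
from collections import Counter

# Three toy component models, each mapping a feature Counter to a label.
def keyword_model(features):
    return "positive" if features.get("great", 0) > 0 else "negative"

def length_model(features):
    return "positive" if sum(features.values()) > 3 else "negative"

def negation_model(features):
    return "negative" if features.get("not", 0) > 0 else "positive"

def ensemble_predict(features, models):
    """Aggregate the component models by majority vote."""
    votes = Counter(model(features) for model in models)
    return votes.most_common(1)[0][0]

features = Counter("this movie was great".split())
print(ensemble_predict(features, [keyword_model, length_model, negation_model]))
# positive
```

An ensemble like this tends to be more robust than any single component, since one model's mistake can be outvoted by the others.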