Arushi Raghuvanshi – Hear an overview of key concepts for leveraging NLP in production applications like voice assistants, question answering, search, topic summarization, and more.

This is a companion blog post to my talk at the Women in Big Data event on August 20, 2020. The talk slides are available here

There are different challenges between academic or theoretical NLP and practical or applied NLP. There are quite a few online resources on the academic side, including published papers, lessons on AI and NLP theory and fundamentals, blog posts breaking down the latest and greatest models, etc. There is less information online about using all of this in practice, in real-world, customer-facing applications, which is what I’ll cover here. I will outline some key differences between academia and industry, introduce core NLP concepts that are applicable to a variety of applied use cases, go through best practices and tools for collecting data and building an application, and discuss how to securely improve deployed models over time.

Academia vs industry

The first key difference between academia and industry is data. The data available to build a production application may be very limited in quantity compared to standard research datasets. For example, SQuAD, a popular question-answering dataset, has over 100,000 questions, but developers may only have a couple hundred representative question-answer pairs to build a production system. Production data may also be noisier or have different characteristics than standard data sets. For example, it could contain a lot of domain-specific terms like product names. Because of this, pre-trained or out of the box models might not work well.

The second difference is the need for reliability and interpretability in user-facing applications. There has been a trend towards deep learning models that perform well on large amounts of data, but they may pick up on unintended or undesirable data trends because real-world data naturally has bias. For example, many companies have shown up in the news for accidentally building sexist or racially biased models. When building models in practice, it’s important to think about bias and design models that are easy to evaluate, consistent, and only rely on intended features.

Next, academic papers tend to focus on a single, well-defined component. While these components may work well individually, they often break down as part of a larger pipeline. For example, to build a voice assistant you may need a pipeline of speech recognition, natural language understanding, and question answering components. If the speech recognition is off, it makes it more difficult to understand what the user is asking, and even more difficult to answer their question or complete the task.

In academia, accuracy tends to be the main metric that researchers are trying to improve on, but in practice, developers also care about usability and scalability. In production, device constraints, inference time, interaction design, and other factors play a role in the overall success of an application. Often, these factors that contribute to usability are more important than minor accuracy improvements, requiring a different approach to model optimization.

Finally, security is an important factor in real-world applications. AI is a data-driven field. More data leads to better models, but developers must be careful about keeping this data secure and not violating customer trust. There have been many recent news articles about data breaches and a general sentiment of users feeling like companies are spying on them or not respecting their privacy.

These are some of the fundamental differences between AI theory and practice. Next, I’ll share some best practices and tools to solve NLP problems for production systems.

Applied NLP overview

The solution to many of the problems outlined above is to break down complex problems into a series of models that can be evaluated well. Instead of training a single, deep, end-to-end, black box system, train multiple, simpler, more well-defined models. Since each of these models is solving a simpler problem, they require less data to achieve high accuracy and less compute power to train. It is also easier to evaluate each of these subcomponents quickly and thoroughly, which makes it easier to efficiently deploy fixes to issues like picking up on unintended data trends.

With that in mind, we’ve found that most practical applications of NLP can be modeled as the following four general categories, and more complex problems can be handled with a pipeline of these models:

Text classification

For all NLP models, the input is a piece of text. With text classification, given a query or piece of text, the model outputs a single label.

One example application of this type is sentiment analysis. Given some text, the model can output a label of positive, negative, or neutral. Another example is topic classification. Consider an application with pre-defined topics of weather, news, and sports. Given a user query, the model can output the appropriate label. Third, extractive summarization or highlight extraction can be modeled as a text classification problem. For each sentence in the text, the model can output a binary label of whether that sentence is a highlight (included in the summary) or not.

example of text classification using inputs and outputs — Examples of sentiment analysis, domain classification, and highlight extraction.

Some models that can be used for text classification are logistic regression, support vector machines, random forest, decision trees, and neural networks (of which there are many network architectures available).

Some features that can be used to train these models include:

N-grams, which are sequences of n words or tokens in the order they appear
Bag of words, which is a count of all the words in the text (without paying attention to the order)
Word shape or orthographic features that consider if there are capitalizations, punctuation, numerics, etc.
Length of the text
Gazetteers, which are indexes or dictionaries containing domain-specific vocabulary and their frequencies – the feature is whether words or phrases in the input text appear in the domain gazetteer
For NN models, the input can be a character, word, or sentence level embedding or vector representation

While it’s good to be aware of these models and features, there are many libraries, toolkits, and frameworks with implementations of these models and feature extractors. Most of the work you’ll do for AI in practice will be framing the problem and collecting data. The model itself will often be a few lines calling a library. When starting out, it’s more important to focus on collecting the right data and framing the problem than getting caught up in the details of the model implementations.

Sequence labeling

The next model category is sequence labeling. Now, given a piece of text as input, the model will output a label for every word or token in the text. One application of this is entity recognition, which is extracting key words or phrases and their labels. Another application is part of speech tagging.

Entity recognition with IOB tagging input and output — Example of entity recognition with IOB tagging

Models that can be used for sequence labeling include maximum entropy markov models (MEMM), conditional random fields (CRF), long short-term memory networks (LSTM), and more complex recurrent neural network architectures (bi-LSTM, bi-LSTM + CRF, etc.).

Good features to use are the same as for the text classification model described above.

Sequence to sequence

When given some text input, sequence to sequence models output a sequence of tokens of arbitrary length. Some applications for this are machine translation or natural language generation. MT requires a lot of data to get right, so in practice it is generally better to use a pre-trained model or one available through an API. NLG generally doesn’t reliably work well enough to be used in production. In practice, developers usually use rules-based or templated responses instead. Given that most developers aren’t training these models from scratch, I won’t go into architecture details on this type here.

Information retrieval

The last category is information retrieval. IR is the problem of retrieving a document from an index or database based on a search term. Some applications of this are Question Answering, Search, and Entity Resolution. For example, say someone wants to know which artist played the song Bohemian Rhapsody, and you have an index that contains songs and artist names. You can search that index with the song title Bohemian Rhapsody to get the document with the artist field populated as Queen.

Example of structured question answering for conversational interfaces.

Note that this is more complicated than a simple database lookup because it incorporates fuzzy matching. Some relevant features that can be used to get optimal rankings include:

Exact matching
Matching on normalized text
N-grams for phrase matching
Character n-grams for partial word matching and misspellings
Deep embedding based semantic matching, leveraging models such as BERT, GloVe, or sentence transformers
Phonetic matching, which can directly use phonetic signals from the speech recognition model, or generate phonemes from the transcribed text using models such as double metaphone or grapheme to phoneme

Note that there are some areas of NLP that I didn’t cover. I didn’t touch on unsupervised models at all. But the majority of practical NLP applications can be modeled as one of these four categories, or for more complex problems, a combination of them.

Example application

To make this more concrete, let’s walk through an example application that uses the concepts we’ve discussed so far. More specifically, I’ll be giving you an example of building a food ordering conversational interface with the MindMeld platform. This is a complex problem, so it involves a pipeline of multiple models shown here:

Let’s consider the example query “I’d like a hummus wrap and two chicken kebabs.”

The Domain Classifier is a text classification model that assigns an incoming query into one of a set of pre-defined buckets or domains. The given query would be labeled as the food ordering
Intent Classifiers are also text classification models that predict which of the domain’s intents is expressed in the request. In this case, an intent classifier could label the query as the build order
Entity Recognizers discern and label entities — the words and phrases that must be identified to understand and fulfill requests — with sequence labeling models. For our example query, this would extract hummus wrap and chicken kebabs as dish entities and two as a number
Entity Role Classifiers add another level of labeling a role when knowing an entity’s type is not enough to interpret it correctly. These are also text classification models. The number entity two can be further classified as the quantity role (to differentiate it from a size role, e.g. 16 drinks vs a 16 ounce drink).
An Entity Resolver maps each identified entity to a canonical value using Information Retrieval. For example, hummus wrap can be mapped to the closest canonical item of Veggie Hummus Wrap, ID:‘B01CUUBRZY’.
The Language Parser finds relationships between the extracted entities and groups them into a meaningful hierarchy using weighted rules. In this case, two and chicken kebabs would be grouped together.
The Question Answerer supports the creation of a knowledge base, which encompasses all of the important world knowledge for a given application use case. The question answerer then leverages the knowledge base to find answers, validate questions, and suggest alternatives in response to user queries. This is an Information Retrieval model. Since the user has not specified a restaurant name, the question answerer can be used to find restaurants that carry the requested dishes.
The Dialogue Manager analyzes each incoming request and assigns it to a dialogue state handler, which then executes the required logic and returns a response. This is a rule-based system. In this case, it would use a template to construct a response like “I found veggie hummus wrap and two chicken kebabs available at Med Wraps and Palmyra. Where would you like to order from?”
Finally, the Application Manager orchestrates the query workflow — in essence, directing the progress of the query between and within components.

MindMeld implements all of these models for you with some reasonable defaults. Once you’ve added your data, you can simply run the following in the command line to train all of these models and start testing them:

If you would like to further experiment with one of the models, let’s take an intent classifier for example, you can do so with the following syntax in python:

To download the code and try it out yourself you can make a copy of this Google colab notebook and follow the commands. More information is available in the MindMeld documentation.

Now that you understand some fundamental NLP concepts and how to frame an NLP problem, the next step is to collect data.

Data collection

Before jumping into data collection, it’s always a good idea to check if there are any pre-trained models you can use. Hugging Face is a popular platform that has implementations of many state of the art models. CoreNLP, spaCy, and NLTK are platforms that have implementations of many NLP fundamentals, such as named entity recognition, part of speech tagging, etc. And you can always do a simple Google search to look for additional models. Even if these pre-trained models don’t perfectly fit your use case, they can still be useful for fine tuning or as features.

Example of using pre-trained sentence transformers found via Hugging Face

Example of using pre-trained Named Entity Recognition from spaCY

If you are training a model, first check to see if there are any existing datasets available. There are a lot of open-source datasets that can be used as a starting point. There may be data within your organization that you may want to use. Or you might be able to scrape or compile data from a website or publicly available API.

While it’s good to check for existing models and data, don’t hesitate to build a new dataset if one doesn’t already exist that accurately represents your use case. Representative data is essential to building a high-quality application. Crowdsourcing tools can be useful for generating initial data.

Example platforms for crowdsourcing data collection.

When leveraging crowdsourcing tools, it’s important to define your task well. If the task description is too specific, you will get lots of very similar looking data, but if it’s too general, a lot of the data may be irrelevant or not useful. To strike the right balance, iterate. Work in small batches, see how the results look, and update your task description accordingly.

Some data collection platforms help match you with workers who are trained in your specific use case, which is really useful if you want clean, consistent data. For cases where you want more variation or generally want to see how the public responds to certain prompts, it may be better to go with tools that anyone can contribute to. You can also do things like target specific geographic areas to get a variation in slang and regional language that people might use.

Whatever approach you take, consider implementing validation checks to automatically discard any excessively noisy or irrelevant data. You can target workers with better ratings to help reduce noise, but even then, you should implement some automated validation like checking length, removing whitespaces, and making sure at least some words appear in the relevant language dictionary.

In addition to collecting the text itself, remember that we want to collect labels for our models. It’s incredibly important for these labels to be clean, because without clean data our models can’t learn. If you use crowdsourcing tools or data teams for this, you should give contributors some training and evaluation before they start labeling. You can have multiple people label the same queries, and only accept labels with a certain level of agreement. Once you have an initial model, you can help speed up labeling time by using model predictions to bootstrap labels. This transforms the label generation task into a verification task, which is generally faster and easier.

Finally, if you don’t have any other resources, you can create and label your data yourself, in house. This can be a great way to bootstrap an initial model. It gets you to think more closely about the data you are trying to collect, and you can add data over time from user logs or other sources as resources become available.

Toolkits and frameworks

Once you’ve framed your problem and collected data, the next step is to train your model. Scikit-learn is a popular toolkit for classic models that we talked about like logistic regression, support vector machines, and random forest.

For neural networks, you can use libraries like PyTorch or Tensorflow. Here’s a great tutorial on using a PyTorch LSTM for part of speech tagging, and here’s one for Tensorflow.

Some more NLP specific toolkits are CoreNLP, NLTK, spaCy, and Hugging Face. I mentioned these toolkits before in the context of pre-trained models, but they are also very useful as feature extractors. These toolkits can be used to generate features from text, like n-grams and bag of words. These feature vectors can then be fed into models implemented via, for example, scikit-learn.

For more complex problems involving multiple NLP components, namely conversational interfaces, you can use a variety of platforms including MindMeld, Dialogflow, Amazon Lex, wit.ai, RASA, and Microsoft LUIS. These platforms have a lot of preset defaults for feature extractors and models and have the whole pipeline set up, so all you have to do is provide your data and implement any custom logic. Even if you’re not building a full conversational interface, these platforms can be really useful for their subcomponents, like question answering or custom entity extraction.

Finally, there are tools on the infrastructure side that can be particularly useful for AI. Elasticsearch is useful because it is not only a database, but also a full-text search engine with a lot of IR capabilities built-in. AWS, Google compute engine, and other similar platforms are great for cloud compute to train heavier models efficiently. Kubernates is a platform for easy deployment and scaling of your systems. And DVC is a tool for data versioning, so that if you have multiple people training models, they can be synchronized on the data they are using.

Improving models in a secure way

The key to intelligent ML systems is to improve them over time. All of the leaders in the AI space have become so by leveraging usage and behavior data from real users to continually improve their models. As an organization, it is essential to do this in a secure way.

The most important thing to start with is communication. It is important to clearly communicate if any user data will be stored, how long it will be stored for, who will be able to access it, and what it will be used for. Even if you are abiding by data policies, if users are unaware of these agreements, it may come across as ‘spying.’ This communication can be done at onboarding, with user agreements, through an FAQ section of a website, via a published white paper, or any other accessible location.

In order to define these data policies, some things to think about include what data needs to be stored to improve your system. Can you store only some extracted trends or metadata, or do you need to keep the full raw logs? You should only store what is absolutely necessary to add value to the end user and always remove any extra sensitive or personally identifiable information. Think about how long this data will be stored. Will it be deleted after a set amount of time, say one year, or is it crucial to store it indefinitely until the user requests it to be deleted? Who will be able to access the data? If the data is never read or inspected by humans, people may be more comfortable with their data being used. If that is not possible, it is good to make the data available only to a small team of analysts who have a high level of data security training. Finally, what will the data be used for? If it provides value to the end user, they are more likely to allow you to use it. When possible, it is beneficial to provide useful reports to end users or customers and measurable accuracy improvements on models.

Once you’ve defined a data policy, you need to build a secure data pipeline that can enforce this policy.

example data pipeline — Example data pipeline. User queries and model outputs are stored in a secure temporary cache until they can be processed and saved in a more permanent data store with relevant access permissions.

For example, you need to keep track of information like from which user each piece of data came from, so you can delete it if they ask for it to be removed. The platform needs to be able to enforce permissions, so only authorized individuals are able to access data stores. You can also build models to remove sensitive information. For example, if you don’t need to store person names and those exist in your data, you can use an entity recognition model to recognize those person names and replace them with a generic token.

Once you have data, an efficient way to improve models is with Active Learning. In production, raw data is cheap, but labeling data is not. We can use model uncertainty to select which queries to label first to improve models quickly.

To help do active learning on a regular basis, you can build out a semi-automated pipeline that selects logs from the data store, bootstraps annotations, which can be verified by a human labeler, and checks to see if the accuracy increases with the new data. If it does, the new model can be deployed, and if not, the data can be sent to the developer team for further inspection and model experimentation. In addition to increasing the training set with this pipeline, it’s good to add to the test set. For the test set, it’s better to randomly select queries to get an accurate distribution of user behavior.

You can further speed up this pipeline by using auto labeling. Tools like snorkel enable labeling data automatically, with an algorithm or model, rather than manually with a human labeler. The auto labeling system can abstain from labeling queries for which there is low confidence. These can be sent to human labelers or ignored. Either way, it allows for some model improvement without a human-in-the-loop, which is beneficial for security reasons and time or resource constraints.

About the author

Arushi Raghuvanshi is a Senior Machine Learning Engineer at Cisco through the acquisition of MindMeld, where she builds production level conversational interfaces. She has developed instrumental components of the core Natural Language Processing platform, drives the effort on active learning to improve models in production, and is leading new initiatives such as speaker identification. Prior to MindMeld, Arushi earned her Master’s degree in Computer Science with an Artificial Intelligence specialization from Stanford University. She also holds a Bachelor’s degree from Stanford in Computer Science with a secondary degree in Electrical Engineering. Her prior industry experience includes time working at Microsoft, Intel, Jaunt VR, and founding a startup backed by Pear Ventures and Lightspeed Ventures. Arushi has publications in leading conferences including EMNLP, IEEE WCCI, and IEEE ISMVL.

Click here to learn more about the offerings from Webex and to sign up for a free account.

Applied natural language processing— Using AI to build real products