What’s new with entity resolution in MindMeld?
Entity resolution (ER) is the process of disambiguating an entity by mapping a textual mention to the most appropriate real-world name in a lookup knowledge base (KB). For example, ER resolves `madrid fc` to `Real Madrid Club de Fútbol`, where the former is one of the paraphrases of the latter. Entity resolution is also known as entity matching, entity linking, record linkage, or record deduplication.
Entity resolution is available as part of MindMeld’s NLP pipeline, where it disambiguates the entities identified in the user input by matching them against a pre-populated KB. Check out the official documentation to learn how to create a KB for the entity resolver and work with it while building a MindMeld app.
Until recently, MindMeld offered two choices for entity resolution: one based on the Elasticsearch full-text search and analytics engine, and a simple exact-matching algorithm as a fallback. With the growing diversity of MindMeld applications, these options are not always feasible. To extend support, MindMeld now provides two additional options for entity resolution: one based on TF-IDF, and another based on pre-trained neural model representations (BERT, GloVe, fastText, etc.). Notably, these new options do not depend on Elasticsearch or its services.
Before jumping into further details about these new options, let’s do a quick recap of how an entity knowledge base is structured in MindMeld.
An entity object in the KB mainly consists of three fields. The `id` field refers to a unique, officially recognized record in the knowledge base. The `cname` field (canonical name) and the `permit list` field hold popular usages for the item (e.g., a food item in a food-ordering app). The text in the `cname` field is generally used in conversational responses, and the entries in the `permit list` field, together with the canonical name, serve as aliases during disambiguation. Often, an entity resolver gives its best results when the `permit list` field is comprehensively populated (e.g., with alternate usages, spelling errors, and short forms). This curation can be tedious in some applications, but it is unavoidable when dealing with entities in a highly specialized domain.
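To make the three fields concrete, here is a minimal sketch of how entity objects and their aliases could be pictured in Python. It is not MindMeld's KB file format: the food items, the IDs, the `permit_list` key spelling, and the `collect_aliases` helper are all assumptions made for illustration, so check the official documentation for the exact schema.

```python
# Hypothetical entity objects; values and key spellings are illustrative only.
entities = [
    {
        "id": "B01",
        "cname": "Pad Thai",
        "permit_list": ["pad thai noodles", "phad thai", "padthai"],
    },
    {
        "id": "B02",
        "cname": "Margherita Pizza",
        "permit_list": ["margarita pizza", "cheese pizza"],
    },
]


def collect_aliases(entity):
    """Return the canonical name plus permit-list entries, lowercased and deduplicated."""
    seen = []
    for text in [entity["cname"], *entity["permit_list"]]:
        key = text.lower()
        if key not in seen:
            seen.append(key)
    return seen


# Map each entity ID to every alias a resolver could match against.
alias_index = {e["id"]: collect_aliases(e) for e in entities}
```

Every alias in `alias_index` points back to a single `id`, which is what lets a resolver return the canonical record no matter which alias matched.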
In resolvers other than exact matching, the first step is to obtain a vector representation for the input text that needs to be disambiguated, and for all the entries in the KB (cnames plus permit list) that serve as aliases for disambiguation. Then, by using some form of vector similarity matching (e.g., cosine similarity), the aliases are scored and ranked.
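The score-and-rank step can be sketched in a few lines of plain Python. The vectors below are made-up three-dimensional toys rather than real embeddings, and `rank_aliases` is a hypothetical helper, not a MindMeld function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_aliases(query_vec, alias_vecs, top_n=3):
    """Score every alias against the query vector and return the best `top_n`."""
    scored = [(alias, cosine_similarity(query_vec, vec)) for alias, vec in alias_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors, invented for illustration only.
alias_vecs = {
    "real madrid": [0.9, 0.1, 0.0],
    "fc barcelona": [0.1, 0.9, 0.0],
    "juventus": [0.0, 0.2, 0.9],
}
ranked = rank_aliases([0.8, 0.2, 0.1], alias_vecs)
```

Here `ranked[0]` is the alias whose vector points in the direction closest to the query, and the resolver would return the entity that owns that alias.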
In the newly added entity resolution choices, the TF-IDF-based resolver curates a variety of n-gram features (i.e., surface-level text features) before computing cosine similarities on the sparse vectors. On the other hand, a pre-trained embedder-based resolver matches using cosine similarity on dense vector representations of text.
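As a rough illustration of the sparse side, the sketch below builds TF-IDF vectors over character bigrams and trigrams and ranks aliases by cosine similarity. It is a simplified stand-in written in plain Python, not MindMeld's implementation: the n-gram sizes, the IDF smoothing, and all function names are assumptions.

```python
import math
from collections import Counter


def char_ngrams(text, n_values=(2, 3)):
    """Character n-grams of the padded, lowercased text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for n in n_values for i in range(len(padded) - n + 1)]


def build_idf(documents):
    """Smoothed inverse document frequency for every n-gram seen in the aliases."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(char_ngrams(doc)))
    return {g: math.log((1 + n_docs) / (1 + df)) + 1.0 for g, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector as a dict; n-grams unseen in the KB are dropped."""
    counts = Counter(g for g in char_ngrams(text) if g in idf)
    return {g: tf * idf[g] for g, tf in counts.items()}


def sparse_cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[g] for g, w in u.items() if g in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def resolve(query, aliases):
    """Rank aliases by TF-IDF cosine similarity to the query, best first."""
    idf = build_idf(aliases)
    query_vec = tfidf_vector(query, idf)
    scores = {a: sparse_cosine(query_vec, tfidf_vector(a, idf)) for a in aliases}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


aliases = ["real madrid", "fc barcelona", "juventus", "manchester united"]
results = resolve("madrid fc", aliases)
```

Because the features are character n-grams, `madrid fc` still overlaps heavily with `real madrid` even though the word order and wording differ.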
Leveraging pre-trained embedders for entity resolution has some advantages over other approaches. For example, they offer semantic understanding of text without having to extensively populate permit lists (e.g., `performed below expectations` is equivalent to `poor performance`), and they provide an easy transition into multilingual entity matching (e.g., inferring that `tercio` in Spanish is the same as `third` in English). However, discrepancies between pretraining and inference, such as differences in the lengths of input texts, put pre-trained embedders at a disadvantage. In addition, inference with embedder models can take longer than with other resolver options because of the underlying dense vector computations. Nevertheless, when finetuned appropriately, embedder models can surpass resolver options that rely mainly on surface-level text features.
In the analysis below, pre-trained embedder-based resolvers are compared with Elasticsearch and TF-IDF entity resolvers. The datasets curated for this comparison involve both surface-level text matching as well as semantic matching.
The following are the average performances of the different entity resolvers on short-text entity matching, measured across various in-house curated datasets. Top-1 retrieval accuracy is the reported measure:
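For reference, top-1 retrieval accuracy is simply the fraction of test mentions whose highest-ranked candidate is the expected entity. The sketch below uses invented IDs and a hypothetical `top1_accuracy` helper; it is not the evaluation code behind the reported numbers.

```python
def top1_accuracy(predictions, gold_ids):
    """Fraction of queries whose top-ranked entity ID equals the gold ID.

    predictions: one ranked list of entity IDs per query (best first).
    gold_ids: the expected entity ID for each query.
    """
    if len(predictions) != len(gold_ids):
        raise ValueError("predictions and gold_ids must have the same length")
    if not gold_ids:
        return 0.0
    hits = sum(1 for ranked, gold in zip(predictions, gold_ids) if ranked and ranked[0] == gold)
    return hits / len(gold_ids)


# Invented example: two of the four queries have the right entity at rank 1.
predictions = [["B01", "B02"], ["B02", "B01"], ["B03"], []]
gold_ids = ["B01", "B01", "B03", "B02"]
score = top1_accuracy(predictions, gold_ids)
```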
The pre-trained BERT variants are available as part of Hugging Face sentence-transformers, and the plot presents the scores of only the five best-performing variants. Pre-trained word-embedding models such as fastText generally perform worse than BERT embedder models or TF-IDF-based resolvers. This lower performance could be attributed to domain shift and the lack of finetuning.
A further analysis using different configurations of the best performing BERT variant (‘distilbert-base-nli-stsb-mean-tokens’) yields the following results:
The results show that alternate similarity scores such as BERTScore aren’t competitive. In addition, using cosine similarity while concatenating different layers of the BERT model yields gains that match the performance of Elasticsearch. This is intuitive, as different layers of BERT can capture complementary information. Even after quantizing the BERT variant for a smaller memory footprint and faster inference, the performance degrades by only 2-3%.
Additionally, when evaluated on randomly noised data containing spelling mistakes in the input, the TF-IDF-based resolver outperforms the others. This could be due to the diverse set of n-grams captured by this resolver. (For this experiment, permit list texts are re-used as test instances, and spelling mistakes are induced into them. Hence, at 0% noise, 100% accuracy is observed, as all test entities are also present in the permit lists!)
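The post does not say how the spelling mistakes were generated. One plausible way to build such a test set, shown here purely as an assumption, is to apply a random character-level edit (delete, substitute, or swap) at each position with a given probability:

```python
import random
import string


def add_spelling_noise(text, noise_rate, rng):
    """Apply a random character edit at each non-space position with probability `noise_rate`."""
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        if chars[i] != " " and rng.random() < noise_rate:
            op = rng.choice(["delete", "substitute", "swap"])
            if op == "delete":
                i += 1
                continue
            if op == "substitute":
                out.append(rng.choice(string.ascii_lowercase))
                i += 1
                continue
            if op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], chars[i]])
                i += 2
                continue
        # No edit chosen (or a swap at the last character): keep the character.
        out.append(chars[i])
        i += 1
    return "".join(out)


rng = random.Random(0)  # fixed seed so the noised test set is reproducible
clean = add_spelling_noise("pad thai noodles", 0.0, rng)
noisy = add_spelling_noise("pad thai noodles", 0.2, rng)
```

At a noise rate of 0 the text is returned unchanged, which mirrors the 0% noise setting in which every test instance is still an exact permit list entry.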
Finally, the following plot illustrates the differences in inference time across the different resolver choices:
(Left to right: most accurate BERT, TF-IDF, Elasticsearch.) Inference time per entity, measured across knowledge bases of different sizes. The X-axis shows the size of the knowledge base, and the Y-axis shows the time per entity in milliseconds. Yellow is the time taken to encode the input text, and green is the time taken for similarity computation.
The inference times of TF-IDF and Elasticsearch are quite comparable, whereas the best BERT variant, even when quantized, is 20 times slower. The slowdown drops to 10x when the top 4 layers are not concatenated, but this comes with a loss in accuracy.
MindMeld entity resolver configurations support a variety of parameters, depending on the resolver being used. The following snippet, when supplied in an app’s `config.py`, uses a pre-trained BERT model of your choice from Hugging Face:
ENTITY_RESOLVER_CONFIG = {
    'model_type': 'resolver',
    'model_settings': {
        'resolver_type': 'sbert_cosine_similarity',
        'pretrained_name_or_abspath': 'distilbert-base-nli-stsb-mean-tokens',
        # …
    }
}
You can use other embedder models by modifying the `embedder_type` parameter:
ENTITY_RESOLVER_CONFIG = {
    'model_type': 'resolver',
    'model_settings': {
        'resolver_type': 'embedder_cosine_similarity',
        'embedder_type': 'glove',
        # …
    }
}
You can also specify run-time configs such as `batch_size` when using an embedder model, along with embedder model-specific configurations. To load a TF-IDF-based resolver, you can do the following:
ENTITY_RESOLVER_CONFIG = {
    'model_type': 'resolver',
    'model_settings': {
        'resolver_type': 'tfidf_cosine_similarity',
        # …
    }
}
For each entity object in the KB, special embeddings, which are the mean/max pool of all aliases’ embeddings, are also computed and used for resolution if configured. Such special embeddings often improve the accuracy of resolvers with only a marginal increase in computational cost. For full details and all configurable choices, see the `Configurations` section in the official documentation.
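The pooling idea can be sketched as follows. The two-alias example and the `mean_pool` and `max_pool` helpers are illustrative assumptions; they show the arithmetic only, not how MindMeld computes or stores these embeddings.

```python
def mean_pool(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def max_pool(vectors):
    """Element-wise maximum of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


# Toy embeddings for two aliases of the same entity (made-up numbers).
alias_embeddings = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
]
entity_mean = mean_pool(alias_embeddings)
entity_max = max_pool(alias_embeddings)
```

Each pooled vector adds a single extra candidate per entity alongside its individual aliases, which is consistent with the marginal extra cost noted above.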
Overall, the Elasticsearch-based resolver is recommended unless a special scenario prevents its use. As a fallback, use an embedder model-based resolver when more semantic matching is required, or a TF-IDF-based resolver when it is not. The ER module in MindMeld does not yet provide any APIs to benchmark which resolver works best for your application. Stay tuned, as we plan to add that support, along with ways to finetune embedder model-based resolvers.