
Amazon Comprehend for Named Entity Recognition

Posted on March 15, 2022

This is the first in a series of blogs on AI services provided by AWS. This blog discusses Amazon Comprehend, an NLP service that uses machine learning to uncover insights from text, and in particular how Comprehend can be used for Named Entity Recognition (NER). We also compare Comprehend in this regard with spaCy, one of the most popular Python libraries for NLP.

The blog is organized as follows:

What is Amazon Comprehend?

Named Entity Recognition

Using Comprehend for NER

Comparison with spaCy

Limitations of Comprehend

Conclusion

What is Amazon Comprehend?

Comprehend is one of the AI language services offered by AWS, and it is used to analyze unstructured text. It provides a range of functionality, from detecting the language and sentiment of a text to extracting named entities, key phrases and Personally Identifiable Information (PII), and tagging parts of speech.

The other language services provided by AWS include Transcribe (automatic speech recognition), Translate (for fluent translation), Polly (text-to-speech conversion) and Lex (for building chatbots).

As mentioned above, this blog will be focused on Named Entity Recognition using Comprehend.

Named Entity Recognition

Named Entity Recognition (NER) is the process of identifying named entities in a text and classifying them by type. A named entity is a noun or noun phrase which denotes a specific person, location, organization, time, etc.

For example, the sentence ‘Elon Musk founded SpaceX in 2002.’ has three named entities:

Elon Musk – Person
SpaceX – Organization
2002 – Time
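To make the idea concrete, here is a minimal sketch of how the three entities above could be represented in Python, together with their character offsets in the sentence. The `NamedEntity` class and the `locate` helper are hypothetical names introduced only for this illustration; they are not part of Comprehend or spaCy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NamedEntity:
    text: str   # surface form as it appears in the sentence
    type: str   # entity category, e.g. "Person"
    begin: int  # character offset where the entity starts
    end: int    # character offset just past the end of the entity


def locate(sentence: str, text: str, type_: str) -> NamedEntity:
    """Build a NamedEntity from the first occurrence of `text` in `sentence`."""
    begin = sentence.find(text)
    if begin == -1:
        raise ValueError(f"{text!r} does not occur in the sentence")
    return NamedEntity(text, type_, begin, begin + len(text))


sentence = "Elon Musk founded SpaceX in 2002."
entities = [
    locate(sentence, "Elon Musk", "Person"),
    locate(sentence, "SpaceX", "Organization"),
    locate(sentence, "2002", "Time"),
]
for entity in entities:
    print(entity)
```

An NER system produces records of this kind automatically; here the entity texts and types are supplied by hand.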

Using Comprehend for NER

The entities for which NER can be done using Comprehend fall into two categories:

1. Built-in entities:
These are the most common types into which named entities can be classified in general, and are available with Comprehend by default. These include:

Person – Names of people (‘Mark Zuckerberg’)
Organization – Large organizations, companies, religious groups, sports teams, etc (‘Facebook’)
Location – Names of places/countries/cities (‘London’)
Date – Specific dates, days, months, time (‘February 24, 2022’)
Event – An event, such as a festival, concert, election, etc (‘Christmas’)
Quantity – Some amount like number, percentage, etc (‘400’)
Commercial Item – A branded product (‘Galaxy S22’)
Title – Name of a movie, book, song, etc (‘The Power of Subconscious Mind’)
Other – Entities that don’t qualify for any of the above entity types (‘COVID-19’)

2. Custom entities:

These are entities which are of specific interest for a particular use case, and are not provided by the pre-trained Comprehend model.
These may include skills, designations, weapon types, etc, depending on the use case.

In this blog, we will focus on built-in entity detection using Comprehend. Detection of custom entities, which involves training Comprehend, will be the subject matter of the next blog.

Languages supported

Comprehend supports the following languages for Entity Recognition: Hindi, English, German, Spanish, Italian, Portuguese, French, Japanese, Korean and Arabic.
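When calling Comprehend, the language is passed as a short language code (the code example later in this blog uses 'en'). The sketch below maps the languages listed above to their ISO 639-1 codes and validates a code before it is used. The names `SUPPORTED_LANGUAGE_CODES` and `check_language_code` are hypothetical, and the set only reflects the list in this blog, so please check the AWS documentation for the current one.

```python
# ISO 639-1 codes for the languages listed in this blog.
SUPPORTED_LANGUAGE_CODES = {
    "hi": "Hindi",
    "en": "English",
    "de": "German",
    "es": "Spanish",
    "it": "Italian",
    "pt": "Portuguese",
    "fr": "French",
    "ja": "Japanese",
    "ko": "Korean",
    "ar": "Arabic",
}


def check_language_code(code: str) -> str:
    """Return the normalized code, or raise if it is not in the list above."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGE_CODES:
        raise ValueError(f"Language code {code!r} is not in the supported list")
    return normalized


print(check_language_code("EN"), "->", SUPPORTED_LANGUAGE_CODES["en"])
```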

Detect Entities using Comprehend

The following operations can be used to detect entities in a document or a set of documents.

DetectEntities: For real-time results on a single text/string of less than 5000 bytes of UTF-8 encoded characters.
BatchDetectEntities: For entity detection on a set of up to 25 documents, each less than 5000 bytes of UTF-8 encoded characters.
StartEntitiesDetectionJob: For asynchronous entity detection on a set of documents.
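As a rough sketch of how the batch limits above might be handled in practice, the helper below splits a list of documents into groups of at most 25 and sends each group through `batch_detect_entities`. The names `batches` and `batch_detect` are hypothetical. The client is expected to be a boto3 Comprehend client created by the caller, and the sketch assumes that documents which fail are reported separately in the response's `ErrorList`, so they are simply left with an empty entity list here.

```python
from typing import Iterator, List

MAX_DOCS_PER_BATCH = 25   # document limit stated above for BatchDetectEntities
MAX_BYTES_PER_DOC = 5000  # per-document UTF-8 size limit stated above


def batches(documents: List[str], size: int = MAX_DOCS_PER_BATCH) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` documents."""
    for start in range(0, len(documents), size):
        yield documents[start:start + size]


def batch_detect(client, documents: List[str], language_code: str = "en") -> List[list]:
    """Return one entity list per input document, in input order."""
    for doc in documents:
        # The text above says "less than 5000 bytes", so 5000 itself is rejected.
        if len(doc.encode("utf-8")) >= MAX_BYTES_PER_DOC:
            raise ValueError("A document reaches the per-document byte limit")

    results: List[list] = [[] for _ in documents]
    offset = 0
    for batch in batches(documents):
        response = client.batch_detect_entities(TextList=batch, LanguageCode=language_code)
        for item in response["ResultList"]:
            # "Index" is the position of the document within this batch.
            results[offset + item["Index"]] = item["Entities"]
        offset += len(batch)
    return results
```

With AWS credentials configured, this could be called as batch_detect(boto3.client('comprehend'), documents).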

Comprehend can be used from different languages/frameworks (C++, .NET, Python, Go, JavaScript, Ruby, etc), and AWS provides appropriate SDKs for them.

Here, we illustrate the use of Amazon Comprehend for Entity Detection in Python, using the DetectEntities approach.

Python Code for Real-time Entity Detection using DetectEntities

import boto3

# Create a Comprehend client; credentials and region are read from the AWS configuration
comprehend_client = boto3.client('comprehend')

text = 'My friend Karan Sharma visited me around Christmas in Mussoorie. On the very first evening, we ordered pizzas and Coca-cola from Dominos, and enjoyed watching Game of Thrones. We spent some three days together, before he left on December 29. Soon thereafter, we had COVID-19 cases reported in town.'

# Detect the built-in entities in the English-language text
entities = comprehend_client.detect_entities(Text=text, LanguageCode='en')

Output

The result of entity detection contains all detected entities, their types, their character offsets in the input text, and a score for each of them, which indicates how much confidence Amazon Comprehend has in assigning the type for the respective entity.

In this example, all nine built-in entity types mentioned above have been detected.

{'Entities': [{'Score': 0.9994445443153381,
   'Type': 'PERSON',
   'Text': 'Karan Sharma',
   'BeginOffset': 10,
   'EndOffset': 22},
  {'Score': 0.9369783401489258,
   'Type': 'EVENT',
   'Text': 'Christmas',
   'BeginOffset': 41,
   'EndOffset': 50},
  {'Score': 0.9983173608779907,
   'Type': 'LOCATION',
   'Text': 'Mussoorie',
   'BeginOffset': 54,
   'EndOffset': 63},
  {'Score': 0.943016529083252,
   'Type': 'QUANTITY',
   'Text': 'first evening',
   'BeginOffset': 77,
   'EndOffset': 90},
  {'Score': 0.628623366355896,
   'Type': 'COMMERCIAL_ITEM',
   'Text': 'Coca-cola',
   'BeginOffset': 114,
   'EndOffset': 123},
  {'Score': 0.9578254818916321,
   'Type': 'ORGANIZATION',
   'Text': 'Dominos',
   'BeginOffset': 129,
   'EndOffset': 136},
  {'Score': 0.9986073970794678,
   'Type': 'TITLE',
   'Text': 'Game of Thrones',
   'BeginOffset': 159,
   'EndOffset': 174},
  {'Score': 0.9522485733032227,
   'Type': 'QUANTITY',
   'Text': 'three days',
   'BeginOffset': 190,
   'EndOffset': 200},
  {'Score': 0.9994234442710876,
   'Type': 'DATE',
   'Text': 'December 29',
   'BeginOffset': 229,
   'EndOffset': 240},
  {'Score': 0.9356407523155212,
   'Type': 'OTHER',
   'Text': 'COVID-19',
   'BeginOffset': 266,
   'EndOffset': 274}],
 'ResponseMetadata': {'RequestId': 'dbc83d8f-d4ca-490e-8a23-cc2646d9c832',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'dbc83d8f-d4ca-490e-8a23-cc2646d9c832',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '1013',
   'date': 'Sun, 27 Feb 2022 14:55:16 GMT'},
  'RetryAttempts': 0}}
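A response like the one above is usually post-processed before use. The following sketch checks that each entity's offsets really point at its text in the input, drops entities below a confidence threshold, and groups the rest by type. The function name `group_entities` and the threshold of 0.9 are illustrative assumptions, and the sample response is abbreviated from the output above, with scores rounded.

```python
from collections import defaultdict
from typing import Dict, List


def group_entities(response: dict, text: str, min_score: float = 0.9) -> Dict[str, List[str]]:
    """Group entity texts by type, keeping only those with Score >= min_score."""
    grouped = defaultdict(list)
    for entity in response["Entities"]:
        # BeginOffset/EndOffset index into the original input string.
        span = text[entity["BeginOffset"]:entity["EndOffset"]]
        if span != entity["Text"]:
            raise ValueError(f"Offsets do not match text for {entity['Text']!r}")
        if entity["Score"] >= min_score:
            grouped[entity["Type"]].append(entity["Text"])
    return dict(grouped)


text = ('My friend Karan Sharma visited me around Christmas in Mussoorie. '
        'On the very first evening, we ordered pizzas and Coca-cola from Dominos, '
        'and enjoyed watching Game of Thrones.')

# Three entities taken from the output above (scores rounded).
sample_response = {
    "Entities": [
        {"Score": 0.9994, "Type": "PERSON", "Text": "Karan Sharma", "BeginOffset": 10, "EndOffset": 22},
        {"Score": 0.9983, "Type": "LOCATION", "Text": "Mussoorie", "BeginOffset": 54, "EndOffset": 63},
        {"Score": 0.6286, "Type": "COMMERCIAL_ITEM", "Text": "Coca-cola", "BeginOffset": 114, "EndOffset": 123},
    ]
}

print(group_entities(sample_response, text))
# {'PERSON': ['Karan Sharma'], 'LOCATION': ['Mussoorie']}
```

The low-confidence COMMERCIAL_ITEM entity is filtered out here; a suitable threshold depends on the use case.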

Comparison with spaCy

spaCy is a free open-source library for NLP in Python. It provides extensive support for NER, POS tagging, dependency parsing, word vectors and a lot more.

In this section, we illustrate how Amazon Comprehend performs better compared to spaCy for certain use cases.

Like Comprehend, spaCy also supports a number of different entity types. Below is the code for entity detection using spaCy. Both this code and the Comprehend code shared above are run on the set of sentences listed after it.

Python Code for Entity Detection using spaCy

import spacy

# Load the small English pipeline (install it with: python -m spacy download en_core_web_sm)
NER = spacy.load("en_core_web_sm")
text = 'INPUT TEXT'  # Input for NER detection
textres = NER(text)

# Collect each entity in a dict shaped like Comprehend's output (spaCy reports no score)
entity_list = []
for ent in textres.ents:
    dictres = {'Type:': ent.label_, 'Text:': ent.text, 'BeginOffset:': ent.start_char, 'EndOffset:': ent.end_char}
    entity_list.append(dictres)
print({'Entities:': entity_list})

Input for Comprehend/spaCy Comparison

‘I ordered a bottle of Coca-Cola along with pizzas.’
‘Coca-Cola was ranked as the top carbonated soft drink company in the United States last year.’
‘John moved to 1313 Mockingbird Lane in 2012.’
‘I am going home for Diwali.’
‘I am going to buy a Lamborghini Diablo this month.’
‘COVID-19 was caused by SARS-CoV-2 virus.’

Observations

The observations from the result, along with both the outputs, are listed below.

Observation	Sentence	Comprehend Output	spaCy Output
Comprehend is better able to distinguish entity types depending on the context. In these examples, the type of entity that Coca-Cola represents changes with the context. While Comprehend detects the change, spaCy does not.	‘I ordered a bottle of Coca-Cola along with pizzas.’ versus ‘Coca-Cola was ranked as the top carbonated soft drink company in the United States last year.’	{'Entities': [{'Score': 0.9521364569664001, 'Type': 'COMMERCIAL_ITEM', 'Text': 'Coca-Cola', 'BeginOffset': 22, 'EndOffset': 31}]} versus {'Entities': [{'Score': 0.9911801218986511, 'Type': 'ORGANIZATION', 'Text': 'Coca-Cola', 'BeginOffset': 0, 'EndOffset': 9}, {'Score': 0.9979116320610046, 'Type': 'LOCATION', 'Text': 'United States', 'BeginOffset': 69, 'EndOffset': 82}, {'Score': 0.9921985268592834, 'Type': 'DATE', 'Text': 'last year', 'BeginOffset': 83, 'EndOffset': 92}]}	{'Entities:': [{'Type:': 'ORG', 'Text:': 'Coca-Cola', 'BeginOffset:': 22, 'EndOffset:': 31}]} versus {'Entities:': [{'Type:': 'ORG', 'Text:': 'Coca-Cola', 'BeginOffset:': 0, 'EndOffset:': 9}, {'Type:': 'GPE', 'Text:': 'the United States', 'BeginOffset:': 65, 'EndOffset:': 82}, {'Type:': 'DATE', 'Text:': 'last year', 'BeginOffset:': 83, 'EndOffset:': 92}]}
Comprehend is better able to identify addresses as locations than spaCy.	‘John moved to 1313 Mockingbird Lane in 2012.’	{'Entities': [{'Score': 0.9993337988853455, 'Type': 'PERSON', 'Text': 'John', 'BeginOffset': 0, 'EndOffset': 4}, {'Score': 0.9881025552749634, 'Type': 'LOCATION', 'Text': '1313 Mockingbird Lane', 'BeginOffset': 14, 'EndOffset': 35}, {'Score': 0.9984847903251648, 'Type': 'DATE', 'Text': '2012', 'BeginOffset': 39, 'EndOffset': 43}]}	{'Entities:': [{'Type:': 'PERSON', 'Text:': 'John', 'BeginOffset:': 0, 'EndOffset:': 4}, {'Type:': 'DATE', 'Text:': '1313', 'BeginOffset:': 14, 'EndOffset:': 18}, {'Type:': 'DATE', 'Text:': '2012', 'BeginOffset:': 39, 'EndOffset:': 43}]}
Comprehend seems to be trained on a more varied dataset than spaCy. For example, even though both Comprehend and spaCy have the entity type ‘event’, the former recognizes the Indian festival Diwali as an event, while the latter labels it as a person.	‘I am going home for Diwali.’	{'Entities': [{'Score': 0.9419634342193604, 'Type': 'EVENT', 'Text': 'Diwali', 'BeginOffset': 23, 'EndOffset': 29}]}	{'Entities:': [{'Type:': 'PERSON', 'Text:': 'Diwali', 'BeginOffset:': 20, 'EndOffset:': 26}]}
Comprehend is better at detecting the names of commercial items than spaCy.	‘I am going to buy a Lamborghini Diablo this month.’	{'Entities': [{'Score': 0.8843019604682922, 'Type': 'ORGANIZATION', 'Text': 'Lamborghini', 'BeginOffset': 20, 'EndOffset': 31}, {'Score': 0.9460159540176392, 'Type': 'COMMERCIAL_ITEM', 'Text': 'Diablo', 'BeginOffset': 32, 'EndOffset': 38}]}	{'Entities:': [{'Type:': 'PERSON', 'Text:': 'Lamborghini Diablo', 'BeginOffset:': 20, 'EndOffset:': 38}, {'Type:': 'DATE', 'Text:': 'this month', 'BeginOffset:': 39, 'EndOffset:': 49}]}
Comprehend ends up detecting more entities overall, and tags those that fit none of its other types with the OTHER tag, while spaCy misses these entities.	‘COVID-19 was caused by SARS-CoV-2 virus.’	{'Entities': [{'Score': 0.9892012476921082, 'Type': 'OTHER', 'Text': 'COVID-19', 'BeginOffset': 0, 'EndOffset': 8}, {'Score': 0.9807910919189453, 'Type': 'OTHER', 'Text': 'SARS-CoV-2', 'BeginOffset': 23, 'EndOffset': 33}]}	{'Entities:': []}
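A comparison of this kind can also be done programmatically. The sketch below normalizes both outputs to (begin, end, type) triples and reports where the two tools agree and where they differ, using the results for the ‘John moved to 1313 Mockingbird Lane in 2012.’ sentence. The label mapping `SPACY_TO_COMPREHEND` is our own assumption about which spaCy labels correspond to which Comprehend types, not an official equivalence, and the function names are hypothetical.

```python
# Assumed correspondence between spaCy labels and Comprehend types.
SPACY_TO_COMPREHEND = {
    "ORG": "ORGANIZATION",
    "GPE": "LOCATION",
    "LOC": "LOCATION",
    "PRODUCT": "COMMERCIAL_ITEM",
    "WORK_OF_ART": "TITLE",
}


def comprehend_spans(entities):
    """Convert Comprehend entities to a set of (begin, end, type) triples."""
    return {(e["BeginOffset"], e["EndOffset"], e["Type"]) for e in entities}


def spacy_spans(entities):
    """Convert the spaCy dicts built by the code above (keys end in ':')."""
    return {
        (e["BeginOffset:"], e["EndOffset:"],
         SPACY_TO_COMPREHEND.get(e["Type:"], e["Type:"]))
        for e in entities
    }


def compare(comprehend_entities, spacy_entities):
    """Split the spans into those both tools agree on and those unique to each."""
    c, s = comprehend_spans(comprehend_entities), spacy_spans(spacy_entities)
    return {"agree": c & s, "comprehend_only": c - s, "spacy_only": s - c}


comprehend_out = [
    {"Type": "PERSON", "Text": "John", "BeginOffset": 0, "EndOffset": 4},
    {"Type": "LOCATION", "Text": "1313 Mockingbird Lane", "BeginOffset": 14, "EndOffset": 35},
    {"Type": "DATE", "Text": "2012", "BeginOffset": 39, "EndOffset": 43},
]
spacy_out = [
    {"Type:": "PERSON", "Text:": "John", "BeginOffset:": 0, "EndOffset:": 4},
    {"Type:": "DATE", "Text:": "1313", "BeginOffset:": 14, "EndOffset:": 18},
    {"Type:": "DATE", "Text:": "2012", "BeginOffset:": 39, "EndOffset:": 43},
]

result = compare(comprehend_out, spacy_out)
for key in ("agree", "comprehend_only", "spacy_only"):
    print(key, sorted(result[key]))
```

Here the two tools agree on John and 2012, and differ on the address. Exact span matching is strict: a partial overlap such as ‘United States’ versus ‘the United States’ would be counted as a disagreement.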

Can we say that Comprehend is better than spaCy overall? No, because this comparison is based only on Entity Recognition, and both Comprehend and spaCy provide a host of other features, which need to be independently evaluated.

This will be taken up in subsequent blogs.

Limitations of Comprehend

Comprehend is a capable service for NLP, but its usage has its own set of limitations.

1. Amazon Comprehend is a paid service, unlike, for example, spaCy.

The requests to Comprehend are measured in units of 100 characters (1 unit = 100 characters) with a 3 unit minimum charge per request.
For entity detection in particular, the charges (price per unit) are as follows:

Up to 10M Units – $0.0001

From 10M-50M Units – $0.00005

Over 50M Units – $0.000025

For volume higher than 100M units per month, the AWS team needs to be contacted separately.
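To get a feel for these numbers, here is a small worked sketch of the arithmetic. It assumes the tiers are applied marginally (the first 10M units at the first price, the next 40M at the second, and the rest at the third), which is our reading of the price list above and should be confirmed against the AWS pricing page. The function names are hypothetical.

```python
import math

# (tier size in units, price per unit in USD), as listed above.
TIERS = [
    (10_000_000, 0.0001),   # first 10M units
    (40_000_000, 0.00005),  # units from 10M to 50M
    (math.inf, 0.000025),   # units over 50M
]


def units_for_request(num_chars: int) -> int:
    """1 unit = 100 characters, with a 3 unit minimum charge per request."""
    return max(3, math.ceil(num_chars / 100))


def monthly_cost(total_units: int) -> float:
    """Price a month's units tier by tier (marginal tiers are an assumption)."""
    cost, remaining = 0.0, total_units
    for size, price in TIERS:
        used = min(remaining, size)
        cost += used * price
        remaining -= used
        if remaining <= 0:
            break
    return cost


# Example: 10,000 requests of 550 characters each in one month.
units = 10_000 * units_for_request(550)
print(f"{units} units, about ${monthly_cost(units):.2f}")
# 60000 units, about $6.00
```

Note that a very short request, say 120 characters, is still charged as 3 units because of the per-request minimum.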

2. For real-time entity analysis, each call to DetectEntities can have only up to 5000 bytes of characters, and even when the inputs are sent in a batch, the maximum number of documents allowed is 25, with the aforementioned byte limit applying to each one of them.
This limit does not hold for StartEntitiesDetectionJob, which, however, is an asynchronous operation.
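One possible workaround for the real-time byte limit is to split a long text into smaller pieces before sending it. The sketch below splits on whitespace and measures each piece in UTF-8 bytes rather than characters, since non-ASCII characters take more than one byte. The function name `split_by_bytes` and the default of 4999 bytes are assumptions made for this illustration.

```python
from typing import List


def split_by_bytes(text: str, max_bytes: int = 4999) -> List[str]:
    """Split text on whitespace into chunks of at most max_bytes UTF-8 bytes."""
    chunks: List[str] = []
    current: List[str] = []
    current_size = 0
    for word in text.split():
        word_size = len(word.encode("utf-8"))
        if word_size > max_bytes:
            raise ValueError("A single word exceeds the byte limit")
        # One extra byte for the joining space if the chunk already has words.
        extra = word_size + (1 if current else 0)
        if current_size + extra > max_bytes:
            chunks.append(" ".join(current))
            current, current_size = [word], word_size
        else:
            current.append(word)
            current_size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


long_text = "Karan visited Mussoorie in December. " * 500
pieces = split_by_bytes(long_text)
print(len(pieces), max(len(p.encode("utf-8")) for p in pieces))
```

Two caveats apply: the offsets returned for each chunk are relative to that chunk and not to the original text, and an entity that straddles a chunk boundary may be missed. Splitting at sentence boundaries would reduce the second problem.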

Conclusion

In this blog, we discussed Amazon Comprehend and, in particular, how it can be used for Named Entity Recognition. We restricted this to built-in entities and fetching results for a single text. We also saw how Comprehend compares with spaCy in this regard.

In our next blog, we will illustrate how we can train Comprehend for custom entity recognition. With this, we will be able to detect any kind of entity we are interested in for our specific use cases. This will unleash the full potential of Comprehend!

References

Amazon Comprehend

Industrial-Strength Natural Language Processing in Python
