I was searching for pre-trained models that could read text and automatically extract entities out of it, like cities, places, times, and dates. Training a model from scratch is time-consuming and requires a lot of data, so if somebody has already done it, why not reuse it?
Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a sub-task of information extraction that seeks to locate and classify named entities in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
spaCy is the leading open-source library for advanced NLP. It ships excellent pre-trained named-entity recognizers in a number of models. Note that we used the “en_core_web_sm” model. I have read that some spaCy models are case-sensitive.
Here is the output for the paragraph I entered into the tool.
If you look at the spaCy documentation, it explains these entity types:
- PERSON (People, including fictional): It incorrectly classified “AI”, “CAGR”, and “Tencent” as persons in our context.
- NORP (Nationalities or religious or political groups): It correctly classified “Asian” and “Chinese” as nationalities.
- GPE (Countries, cities, states): It correctly classified the country “U.S.” but misclassified “Alibaba” and “AI” in our context.
- ORG (Companies, agencies, institutions, etc.): It correctly classified “Baidu”, “Google”, “IBM”, and “Microsoft”.
- CARDINAL (Numerals that do not fall under another type): It correctly classified “one” and “three”.
- PERCENT (Percentage, including “%”): “45%”, “50%”, and “65%” were classified correctly.
- DATE (Absolute or relative dates or periods): “2017” was classified correctly.
For the entity types that were misclassified, we need to re-train the model with our own contextual data as the training set.
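A minimal sketch of what such re-training looks like with spaCy v3’s training API, using a blank pipeline and a hypothetical two-example training set (a real re-training run needs many annotated examples and care to avoid degrading the model on other entity types):

```python
import random

import spacy
from spacy.training import Example

# Hypothetical contextual training data in spaCy's annotation format:
# (text, {"entities": [(start_char, end_char, label)]})
TRAIN_DATA = [
    ("Tencent reported strong growth", {"entities": [(0, 7, "ORG")]}),
    ("Alibaba expanded its cloud business", {"entities": [(0, 7, "ORG")]}),
]

nlp = spacy.blank("en")          # start from an empty English pipeline
ner = nlp.add_pipe("ner")        # add a fresh NER component
ner.add_label("ORG")             # register the label we will train

# Initialize weights from the training examples
optimizer = nlp.initialize(
    lambda: [Example.from_dict(nlp.make_doc(t), a) for t, a in TRAIN_DATA]
)

# A few update passes over the (tiny) training set
for _ in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)
```

In practice you would update an existing model (e.g. “en_core_web_sm”) rather than a blank one, and mix your contextual examples with general data.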
A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns parts of speech to each word (and other token), such as noun, verb, adjective, etc.
A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between “head” words and words which modify those heads.
The figure below shows a snapshot of the dependency parse of the paragraph above. The full image can be viewed in the Dependency Visualizer here.
Dependency parsers can read various forms of plain-text input and can output various analysis formats, including part-of-speech tagged text, phrase structure trees, and a grammatical relations (typed dependency) format.
Dependency parsing can be used to solve various complex NLP (Natural Language Processing) problems such as named entity recognition, relation extraction, and translation. For more details on dependency parsing, watch this Stanford video.
Read about Parsey McParseface (and SyntaxNet), an open-source dependency parser, here.