In this post I'll take a quick look at TextBlob, an open source Python library for natural language processing (NLP) and text processing.
You can find the official repo and website here:
https://github.com/sloria/TextBlob
https://textblob.readthedocs.io/en/dev/
Some of the primary capabilities of TextBlob include:
- Word and Sentence Tokenization
- Part-of-speech tagging (PoS)
- Noun-phrase extraction
- Sentiment analysis
- Word Lemmatization
- Word Pluralization
- Word Singularization
- Spelling correction
- Text Classification (rule-based vs machine learning)
- Language translation and detection
- Word and phrase frequencies
- WordNet integration
There are a variety of ways to install and use TextBlob. See the website link above for more options.
$ pip install -U textblob
$ python -m textblob.download_corpora
You can view the TextBlob class API here:
https://textblob.readthedocs.io/en/dev/api_reference.html#module-textblob.blob
Word and Sentence Tokenization
Tokenization is the process in NLP that breaks text into words, punctuation, and other smaller pieces of text (tokens). In addition to word tokenization, TextBlob can also break text into individual sentences. This can be a lot more convenient (and effective) than using other methods to split sentences into words.
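The examples that follow assume a TextBlob instance created from a sample sentence (the same one used later in the sentiment section):

from textblob import TextBlob

# Create a TextBlob object from our sample text
blob = TextBlob("Yesterday we went to the movie theater and saw Star Wars. It was very good.")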
Word Tokens:
print(blob.words)
# output: ['Yesterday', 'we', 'went', 'to', 'the', 'movie', 'theater', 'and', 'saw', 'Star', 'Wars', 'It', 'was', 'very', 'good']
Sentence Tokens:
print(blob.sentences)
# output:
# [
# Sentence("Yesterday we went to the movie theater and saw Star Wars."),
# Sentence("It was very good.")
# ]
Part of speech tagging
Part of Speech (PoS) tagging assigns linguistic (primarily grammatical) information to word tokens.
Here we use the tags property on our TextBlob instance to get the part-of-speech tags back as tuples, with each tuple in the form (word, PoS tag).
print(blob.tags)
# output:
# [('Yesterday', 'NN'), ('we', 'PRP'), ('went', 'VBD'), ('to', 'TO'), ('the', 'DT'), ('movie', 'NN'), ('theater', 'NN'), ('and', 'CC'), ('saw', 'VBD'), ('Star', 'NNP'), ('Wars', 'NNP'), ('It', 'PRP'), ('was', 'VBD'), ('very', 'RB'), ('good', 'JJ')]
Part of Speech tagging types:
- CC: coordinating conjunction
- CD: cardinal digit
- DT: determiner
- EX: existential there ("there is" ... think of it like "there exists")
- FW: foreign word
- IN: preposition/subordinating conjunction
- JJ: adjective ("big")
- JJR: adjective, comparative ("bigger")
- JJS: adjective, superlative ("biggest")
- LS: list marker ("1)")
- MD: modal (could, will)
- NN: noun, singular ("desk")
- NNS: noun, plural ("desks")
- NNP: proper noun, singular ("Harrison")
- NNPS: proper noun, plural ("Americans")
- PDT: predeterminer ("all the kids")
- POS: possessive ending ("parent's")
- PRP: personal pronoun (I, he, she)
- PRP$: possessive pronoun (my, his, hers)
- RB: adverb (very, silently)
- RBR: adverb, comparative (better)
- RBS: adverb, superlative (best)
- RP: particle (give up)
- TO: to (go "to" the store)
- UH: interjection (errrrrrrrm)
- VB: verb, base form (take)
- VBD: verb, past tense (took)
- VBG: verb, gerund/present participle (taking)
- VBN: verb, past participle (taken)
- VBP: verb, singular present, non-3rd person (take)
- VBZ: verb, 3rd person singular present (takes)
- WDT: wh-determiner (which)
- WP: wh-pronoun (who, what)
- WP$: possessive wh-pronoun (whose)
- WRB: wh-adverb (where, when)
If you wanted to get all words of a particular type (or types) you could use something like:
types = ['VB', 'VBD', 'VBG', 'VBN', 'VBZ']
verbs = list()
for word, tag in blob.tags:
    if tag in types:
        # lemmatize as a verb so forms like "went" reduce to their base form
        verbs.append(word.lemmatize('v'))
print(verbs)
Sentiment Analysis
Sentiment Analysis can be used to classify the sentiment of text. In the case of TextBlob, it scores text on a range from negative to positive, with neutral in the middle.
In TextBlob, sentiment is represented by two numbers - polarity and subjectivity.
Polarity: This represents how negative or positive the sentiment is, and is represented as a float value within the range -1.0 (negative sentiment) to 1.0 (positive sentiment). 0 would represent neutral sentiment.
Subjectivity: This represents whether the text expresses factual, verifiable information or just an opinion, emotion, or judgement. It is represented as a float between 0.0 and 1.0, with 1.0 representing a more subjective (opinion based) sentiment.
To retrieve sentiment on a text string, just create a TextBlob object and access its sentiment property:
from textblob import TextBlob
blob = TextBlob("Yesterday we went to the movie theater and saw Star Wars. It was very good.")
print(blob.sentiment)
# output
# Sentiment(polarity=0.9099999999999999, subjectivity=0.7800000000000001)
Above you can see that it considers this a very positive sentiment. It also considers it fairly subjective (an opinion).
TextBlob uses a rule-based approach to sentiment analysis, as opposed to a machine learning based approach. This has some limitations including the lack of factoring in context for words. It looks at the words and frequency to determine sentiment, not the context or placement of the words.
This may work fine for some domains and use cases, but may not give the desired results in other cases.
Now let's take a look at some variations of the above sentence that might be considered less subjective.
For example, I would consider the following text slightly less subjective, but TextBlob did not change its score:
from textblob import TextBlob
blob = TextBlob("Yesterday we went to the movie theater and saw Star Wars. I read a review that said it was very good.")
print(blob.sentiment)
# output
# Sentiment(polarity=0.9099999999999999, subjectivity=0.7800000000000001)
No change in the subjectivity score above.
Replacing the sentence with "Yesterday we went to the movie theater and saw Star Wars. Everyone that saw it said it was very good.", also did not change the subjectivity score.
However, if all you are looking for is a threshold of polarity, then this may not matter.
You can also check the sentiment on individual sentence tokens rather than the entire blob. For example:
blob = TextBlob("Yesterday we went to the movie theater and saw Star Wars. It was very good.")
for sentence in blob.sentences:
    print(sentence.sentiment)
# output:
# Sentiment(polarity=0.0, subjectivity=0.0)
# Sentiment(polarity=0.9099999999999999, subjectivity=0.7800000000000001)
Here we can see that the first sentence did not have any polarity or subjectivity associated with it (neutral sentiment), which makes sense. For the second sentence, we get the polarity and subjectivity scores we saw above.
Handling Negatives:
blob = TextBlob("Yesterday we went to the movie theater and saw Star Wars. It was not very good.")
print(blob.sentiment)
# output
# Sentiment(polarity=-0.26923076923076916, subjectivity=0.46153846153846156)
Above I changed the text to use a negative - It was not very good.
You can see from the polarity score that it recognized the negation of very good and now assigned it a negative polarity score to properly reflect a negative sentiment.
However, the subjectivity score also changed. I'm not quite sure why "It was not very good" vs "It was very good" would have different levels of subjectivity.
Another example with a negative:
blob = TextBlob("John Jacobs is not a bad person.")
print(blob.sentiment)
# output
# Sentiment(polarity=0.3499999999999999, subjectivity=0.6666666666666666)
This seems about right for sentiment. As with any statement like this, how positive it actually is remains open to interpretation.
Classification
In addition to Sentiment Analysis (which can be thought of as a form of classification), you can create custom classifiers.
Classification is the process of assigning a label to text. In TextBlob, to create a custom classifier, you have to provide training data that matches text with a relevant label.
When used, TextBlob will pick the label with the highest confidence score.
For example, let's say your domain has some specific language that means great, such as "kablamo".
If we update our first sentence to use it and check the results we get:
text = "Yesterday we went to the movie theater and saw Star Wars. It was kablamo."
blob = TextBlob(text)
print(blob.sentiment)
# output
# Sentiment(polarity=0.0, subjectivity=0.0)
Let's create some training data, which is made up of text and the appropriate label that should be applied to it:
training = [
    ('The new Superman movie was kablamo.', 'pos'),
    ('I thought Star Trek Discovery season one was kablamo', 'pos'),
    ('Erin Gray was Kablamo as Colonel Wilma Deering', 'pos'),
    ('Mark Wahlberg was not very kablamo in Planet of the Apes.', 'neg'),
    ('That movie was definitely not kablamo', 'neg'),
    ('They pulled off one of the most kablamo movies ever', 'pos'),
]
Next we create our custom classifier object and pass it to the TextBlob() constructor, overriding the default.
We then call the classify() method on our textblob object.
from textblob import classifiers

classifier = classifiers.NaiveBayesClassifier(training)
text = "Yesterday we went to the movie theater and saw Star Wars. It was kablamo."
blob = TextBlob(text, classifier=classifier)
print (blob.classify())
# output
# pos
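If you want to see the underlying probability estimates rather than just the winning label, the NaiveBayesClassifier also provides a prob_classify() method. A small sketch reusing the classifier and text from above (the exact probabilities will depend on your training data):

prob_dist = classifier.prob_classify(text)
print(prob_dist.max())
print(round(prob_dist.prob("pos"), 2))
print(round(prob_dist.prob("neg"), 2))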
You can also call the classify method directly on some text. For example:
classifier = classifiers.NaiveBayesClassifier(training)
text = [
"Yesterday we went to the movie theater and saw Star Wars. It was kablamo.",
"Last week I saw Megawars and it was not very kablamo.",
"I watched a very kablamo movie on Netflix last night called Operation Gizmo."
]
for sentence in text:
    print(classifier.classify(sentence))
# output:
# pos
# neg
# pos
You can retrain the classifier model by adding additional data using the update() method. For example:
new_data = [
    ('This was the most kablamo movie I have ever seen.', 'pos'),
    ("Cargo Shorts Two was not kablamo. I wish I had not seen it.", 'neg'),
    ("Not one person thought that was a kablamo movie", 'neg'),
    ("It was almost kablamo, but ultimately failed.", 'neg'),
]
# Update model with new data
classifier.update(new_data)
Keep in mind that the more data you have, the better your classifier will perform. There are many pre-existing data sources that can be used to provide additional training data.
Training on large quantities of data can take a long time, so it is best to keep your classifier object in memory for as long as it will be needed.
An alternative would be to use something like pickle to serialize the trained classifier and save it to a file that can be read back in when needed.
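Here is a minimal sketch of that approach, assuming the trained classifier object from above and a hypothetical file name:

import pickle

# Serialize the trained classifier to disk (hypothetical file name)
with open('kablamo_classifier.pickle', 'wb') as f:
    pickle.dump(classifier, f)

# Later, load it back without retraining
with open('kablamo_classifier.pickle', 'rb') as f:
    classifier = pickle.load(f)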
A more likely scenario is that you will load your training data from a CSV/JSON/TSV file, since to be truly effective you would want a considerable amount of data to train on.
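TextBlob's classifiers can read training data from a file object when you pass a format argument. A small sketch assuming a hypothetical training_data.csv with one "text,label" pair per row:

from textblob import classifiers

# Load training data from a CSV file (hypothetical file name)
with open('training_data.csv', 'r') as fp:
    classifier = classifiers.NaiveBayesClassifier(fp, format="csv")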
Noun Phrases
A noun phrase is generally the noun and the relevant words that augment that noun such as adjectives and other grammatical components. These can be very useful for helping to give insight as to what a piece of text might be about. This is similar to Named Entity Recognition (NER) available in other NLP tools.
You can retrieve noun phrases with the noun_phrases property:
text = '''
Tesseract is an open source ocr engine developed by Google. It can recognize and process text in more than 100 languages out of the box.
Spacy is an open source Python libary for Natural Language Processing (NLP) and excels at extracting information from unstructured content.
'''
blob = TextBlob(text)
print("This is about: ", blob.noun_phrases)
# output:
# This is about: ['tesseract', 'open source ocr engine', 'google', 'process text', 'spacy', 'open source', 'python', 'language processing', 'nlp']
The above could be used for a simple content tagging application, for example, suggesting options for a "keyword" taxonomy in a CMS like Drupal.
It's important to test this against multiple sets of the content you intend to use it on, as the results and quality may vary. The above results are pretty good, but using it on our original test sentence produces lower quality results:
blob = TextBlob("Yesterday we went to the movie theater and saw Star Wars. It was not very good.")
print(blob.noun_phrases)
# output
# ['yesterday', 'movie theater', 'wars']
In this case it turned "Star Wars" into "wars", which could have a detrimental effect if that was used in an automated way to classify the meaning of a piece of content. Keep in mind, this is not a shortcoming of NLP or Named Entity Recognition in general, but potentially just a limitation of the data this model was trained on. Using other models and other tools like Spacy or Hugging Face will produce different results. This is another reason why these tools are often best used as an "assistant" to humans. Using the content tagging idea above, this would present possible tags for a content creator/editor to select from as a list of suggestions.
Lemmatization
Lemmatization involves breaking a word down to its simplest form (its lemma). This is somewhat similar to stemming, which is used by many search engine libraries like Solr, though lemmatization generally produces a proper dictionary word rather than a truncated stem.
For example, talked, talking, talk, are all forms of the lemma "talk".
This can be a useful tool for identifying the frequency that words appear in text.
For example, text that includes fishing, fished, and fish can all be lemmatized to identify that it contains three instances of "fish".
It can also be combined with PoS tagging to lemmatize just verbs, so for example, ignore instances of fish when used as a noun.
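As a rough sketch of that idea (the sample sentence here is just an illustration), you could combine the PoS tags with verb lemmatization and count the results:

from textblob import TextBlob

blob = TextBlob("We fished all morning, and after fishing we cooked the fish we caught.")

# Count lemmatized verb forms only, so "fished" and "fishing" both count as "fish"
verb_tags = ('VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ')
counts = {}
for word, tag in blob.tags:
    if tag in verb_tags:
        lemma = word.lemmatize('v').lower()
        counts[lemma] = counts.get(lemma, 0) + 1
print(counts)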
By default, TextBlob lemmatizes the word as a noun. You can specify the part of speech like this:
from textblob import Word

word = Word("quicker")
print(word.lemmatize("a"))
# output
# quick
The full list of parts of speech you can lemmatize (noun is the default) are:
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
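For example, to lemmatize a verb form (using the talked/talk example from earlier), you would pass 'v'. I would expect output like the following, but verify on your own install:

from textblob import Word

print(Word("talked").lemmatize("v"))
# expected output
# talk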
WordNet
WordNet is somewhat like a thesaurus, though there are some differences, or as they state on their web page: "A large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept."
TextBlob provides methods to interact with WordNet.
The basic unit in WordNet is the synset. A synset would represent a set of related words. For example, Car, Vehicle, and Automobile would be grouped under CAR.
A word might belong to multiple synsets because it might have different meanings.
For example:
word = Word("program")
synsets = word.synsets
print(synsets)
# sample output
# [Synset('plan.n.01'), Synset('program.n.02'), Synset('broadcast.n.02'), Synset('platform.n.02'), Synset('program.n.05'), Synset('course_of_study.n.01'), Synset('program.n.07'), Synset('program.n.08'), Synset('program.v.01'), Synset('program.v.02')]
This isn't too helpful on its own since many will have the same name.
You can get the definition for a synset with synset.definition().
For example:
word = Word("program")
synsets = word.synsets
for synset in synsets:
    print(synset, synset.definition())
# output
# Synset('plan.n.01') a series of steps to be carried out or goals to be accomplished
# Synset('program.n.02') a system of projects or services intended to meet a public need
# Synset('broadcast.n.02') a radio or television show
# Synset('platform.n.02') a document stating the aims and principles of a political party
# Synset('program.n.05') an announcement of the events that will occur as part of a theatrical or sporting event
# Synset('course_of_study.n.01') an integrated course of academic studies
# Synset('program.n.07') (computer science) a sequence of instructions that a computer can interpret and execute
# Synset('program.n.08') a performance (or series of performances) at a public presentation
# Synset('program.v.01') arrange a program of or for
# Synset('program.v.02') write a computer program
As you can see above, it is a mix of noun and verb synsets.
You can limit to a particular part of speech with:
from textblob.wordnet import NOUN, VERB

synsets = Word("program").get_synsets(pos=NOUN)  # or pos=VERB
Synonyms
You can then get the synonyms for a synset with lemma_names().
For example:
print(synsets[5].lemma_names())
# output
# ['course_of_study', 'program', 'programme', 'curriculum', 'syllabus']
Word Frequency
The ability to find how many times specific words appear can help determine the focus of a piece of content as well.
text = TextBlob("I am a big vuejs fan. I have used vuejs for a variety of projects.")
print(text.word_counts['vuejs'])
# output:
# 2
Removing stop words
Stopwords are words like "of", "a", "is", etc. that don't add much meaning and so can be omitted from text analysis tasks like word frequency.
To remove stopwords, first import nltk and the stopwords list from the NLTK corpus.
import nltk
from nltk.corpus import stopwords
text = '''
My skills include Drupal, Python, PHP, React, Vuejs, Adobe Suite, Machine Learning and other AI related tools and frameworks.
Most recently I have worked on chatbots and AI assistants. I publish a monthly AI newsletter. I have also created a variety of Drupal modules.
I have also recently worked on an opensource chatbot module allowing organizations to leverage any chatbot platform for their Drupal websites.
'''
# create a textblob object
blob = TextBlob(text)
nltk.download('stopwords')
counts = {}
for word in blob.words:
    if word not in stopwords.words('english'):
        count = blob.word_counts[str(word).lower()]
        counts[word] = count
print(counts)
# output:
# {'My': 1, 'skills': 1, 'include': 1, 'Drupal': 3, 'Python': 1, 'PHP': 1, 'React': 1, 'Vuejs': 1, 'Adobe': 1, 'Suite': 1, 'Machine': 1, 'Learning': 1, 'AI': 3, 'related': 1, 'tools': 1, 'frameworks': 1, 'Most': 1, 'recently': 2, 'I': 4, 'worked': 2, 'chatbots': 1, 'assistants': 1, 'publish': 1, 'monthly': 1, 'newsletter': 1, 'also': 2, 'created': 1, 'variety': 1, 'modules': 1, 'opensource': 1, 'chatbot': 2, 'module': 1, 'allowing': 1, 'organizations': 1, 'leverage': 1, 'platform': 1, 'websites': 1}
Notice also that we lowercase the word when looking up its count, since the keys in word_counts are stored in lowercase.
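If the goal is to surface the most prominent terms, you could sort the resulting dictionary. A small sketch continuing from the counts dict above:

# Sort by count, highest first, and keep the top five terms
top_terms = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:5]
print(top_terms)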
Spelling
You can use TextBlob to check and correct spelling.
The correct() method will automatically try to correct misspelled words.
blob = TextBlob("I came in first in the speling bee.")
print(blob.correct())
# output:
# I came in first in the spelling bee.
You can use the spellcheck() method on Word objects to get a confidence score.
word = Word('speling')
print(word.spellcheck())
# output:
# [('spelling', 1.0)]
Singularization and Pluralization
word = Word("developer")
plural = word.pluralize()
print(plural)
# output:
# developers
word = Word("developers")
singular = word.singularize()
print(singular)
# output:
# developer
Using the capabilities in combination
Getting text data ready for NLP processing often means doing some data cleaning/normalization. For example, before doing a word frequency check you might run the text through the spelling correction and singularization methods to ensure misspelled and plural/singular words don't get counted as separate words. TextBlob gives you a lot of features for that type of text processing.
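As a rough sketch of that kind of cleanup (the sample text and pipeline here are illustrative assumptions, not a recommended recipe), you might correct spelling first and then count singularized, lowercased words:

from textblob import TextBlob, Word

text = "My favorite developper tools are debuggers. A good debugger saves hours."

# Run spelling correction first, then count singularized, lowercased words
blob = TextBlob(text).correct()
counts = {}
for word in blob.words:
    normalized = Word(word.lower()).singularize()
    counts[normalized] = counts.get(normalized, 0) + 1
print(counts)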
Stay tuned for a followup post where I discuss some more practical applications of these tools and do some comparisons with other NLP libraries like Spacy and HuggingFace.