Thursday, December 30, 2021

NLP summary

 

To get going with code without a lot of theory, this notebook is amazing:

https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle



Key Python libraries

transformers

nltk

spacy

newspaper

textblob


Typical uses of NLP:


1. Text classification (e.g. sentiment analysis: https://www.kaggle.com/c/nlp-getting-started/overview)

2. Summarization

3. Text completion (Fill missing words, Q&A, text generation)

4. Translation


Transformers:

  • The Transformer architecture was introduced by Google in 2017 in the "Attention Is All You Need" paper; BERT followed in 2018.
  • Basis of the Hugging Face transformers library.
  • Uses a different approach than RNNs: instead of processing the sequence token by token in one direction, self-attention lets every token look at the whole text (both directions) at once.
  • The architecture is composed of an encoder and a decoder. The encoder adds a positional encoding, which takes the position of a word into account, something that is not done in word2vec (a sketch of the sinusoidal positional encoding follows this list).
  • Transformers learn their own embeddings, i.e. they don't use word2vec; they build their own.
  • BERT is the most famous architecture built on the Transformer. There are many versions of BERT trained on different corpora.
  • GPT-3 is a much larger model than BERT (hundreds of times more parameters: ~175B vs ~0.34B for BERT-large).
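As an aside, here is a minimal numpy sketch of the sinusoidal positional encoding from the original Transformer paper (note that BERT actually learns its position embeddings instead):

import numpy as np

def positional_encoding(max_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]                        # (max_len, 1)
    i = np.arange(d_model)[None, :]                          # (1, d_model)
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                    # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                    # odd dimensions
    return pe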

There are two typical ways to use these models:

1. For typical applications such as sentiment analysis, an off-the-shelf model can be used directly. For example, sentiment analysis can be done as simply as:

>>> from transformers import pipeline
>>> classifier = pipeline('sentiment-analysis')
>>> classifier('We are very happy to show you the 🤗 Transformers library.')
[{'label': 'POSITIVE', 'score': 0.9998}]

Under the hood this code uses a default model. To use a specific model:

>>> from transformers import AutoModelForSequenceClassification, AutoTokenizer
>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
>>> model = AutoModelForSequenceClassification.from_pretrained(model_name)
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>> classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
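The classifier defined this way is called like the default one; an illustrative call (the nlptown model above predicts a 1-5 star rating and handles several languages):

>>> classifier('Nous sommes très heureux de vous présenter la bibliothèque 🤗 Transformers.')
# returns a star-rating label, e.g. [{'label': '5 stars', 'score': ...}] (score omitted here)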

2. In case we have labeled data, we need to 'fine-tune' an existing model. For some reason this is not very intuitive. Here's the tutorial:

https://huggingface.co/docs/transformers/training

Here's a colab notebook

https://colab.research.google.com/github/huggingface/notebooks/blob/master/transformers_doc/training.ipynb
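In short, the flow is: tokenize the labeled texts, wrap them in a Dataset, and hand everything to the Trainer. A minimal sketch (train_df with 'text' and 'label' columns is a placeholder; the base model and hyperparameters are just example choices):

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize the labeled data and split off a validation set
dataset = Dataset.from_pandas(train_df)   # expects 'text' and 'label' columns
dataset = dataset.map(lambda batch: tokenizer(batch['text'], truncation=True, padding='max_length'),
                      batched=True)
dataset = dataset.train_test_split(test_size=0.1)

args = TrainingArguments(output_dir='finetuned', num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset['train'], eval_dataset=dataset['test'])
trainer.train()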



If you want to build a model from scratch, use the following steps.

Typical steps in a (non-transformer) NLP pipeline:

1. Cleaning: Remove non-useful words/characters, e.g. hashtags (#), URLs (http...), punctuation. For a good cleaning function check this solution: https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert#4.-Embeddings-and-Text-Cleaning

2. Tokenization: Convert a sentence into a list of words. The words are then converted to integers that correspond to the index in the vocabulary.

3. Embedding: 

A. Bag of words: Initially a sentence would be translated into a vector containing the counts of each word, e.g. 60k columns for all possible words, with each row containing the number of occurrences of each word -> the problem is the huge number of features

B. TFIDF: Scale the count of each word according to how frequently it occurs across the whole corpus, so rare words get a higher weight.

C. Word2Vec: map each word to a vector of fixed length (e.g. 200). The challenge is to make similar words end up close to each other, i.e. their cosine similarity should be high. There are two approaches to building this:

i. Skip-gram: given the word at position i, predict the most likely words at positions i+1 to i+N and i-N to i-1

ii. CBOW: the opposite of skip-gram; given a neighborhood of words, predict the most likely word at position i

D. Other techniques: GloVe, fastText (from Facebook)

4. Aggregate embeddings per sentence: after embedding the words, combine the embeddings into a single sentence embedding; the easiest way is to just average them, though this is not entirely straightforward. Follow this post: https://www.kaggle.com/varun08/sentiment-analysis-using-word2vec

5. Add meta-features: features such as the sentence length, n-gram counts, counts of specific words of interest (get them from https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert):

  • word_count: number of words in the text
  • unique_word_count: number of unique words in the text
  • stop_word_count: number of stop words in the text
  • url_count: number of URLs in the text
  • mean_word_length: average character count per word
  • char_count: number of characters in the text
  • punctuation_count: number of punctuation characters in the text
  • hashtag_count: number of hashtags (#) in the text
  • mention_count: number of mentions (@) in the text

6. Modeling: You could run an unsupervised clustering of the text (see https://towardsdatascience.com/unsupervised-sentiment-analysis-a38bf1906483), then follow it with simple supervised learning if needed. In some cases an LSTM is used, where each sentence is represented as an N_words x N_features matrix; the LSTM is fed the words one at a time and produces a single output for the sentence. A minimal sketch combining TF-IDF features (step 3B), a few meta-features (step 5) and a simple classifier follows this list.
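A minimal end-to-end sketch of steps 3B, 5 and 6 (df is a placeholder for a pandas DataFrame with 'text' and 'target' columns; the hyperparameters are arbitrary):

from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Step 5: a few meta-features
df['word_count'] = df['text'].str.split().str.len()
df['char_count'] = df['text'].str.len()
df['hashtag_count'] = df['text'].str.count('#')

# Step 3B: TF-IDF features over unigrams and bigrams
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
X_text = vectorizer.fit_transform(df['text'])

# Combine the sparse TF-IDF matrix with the dense meta-features
X = hstack([X_text, csr_matrix(df[['word_count', 'char_count', 'hashtag_count']].values)]).tocsr()
y = df['target']

# Step 6: a simple supervised baseline
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Validation accuracy:", clf.score(X_val, y_val))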



-----------------------------------------------------------------------------------------------

Text cleaning

import re
import string

def remove_html(text):
    html = re.compile(r'<.*?>')
    return html.sub(r'', text)

def remove_punct(text):
    table=str.maketrans('','',string.punctuation)
    return text.translate(table)

def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
# Thanks to https://www.kaggle.com/rftexas/text-only-kfold-bert
def clean_tweets(tweet):
    """Removes links and non-ASCII characters"""
    
    tweet = ''.join([x for x in tweet if x in string.printable])
    
    # Removing URLs
    tweet = re.sub(r"http\S+", "", tweet)
    
    return tweet
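Typical usage on a pandas text column (df and 'text' are placeholder names):

df['text_clean'] = (df['text']
                    .apply(clean_tweets)
                    .apply(remove_html)
                    .apply(remove_emoji)
                    .apply(remove_punct))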

-----------------------------------------------------------------------------------------------

Train a word2vec model on sentences
# Creating the model and setting values for the various parameters
num_features = 300  # Word vector dimensionality
min_word_count = 40 # Minimum word count
num_workers = 4     # Number of parallel threads
context = 10        # Context window size
downsampling = 1e-3 # (0.001) Downsample setting for frequent words

# Initializing and training the model
from gensim.models import word2vec
print("Training model....")
model = word2vec.Word2Vec(sentences,
                          workers=num_workers,
                          size=num_features,        # note: renamed to vector_size in gensim >= 4.0
                          min_count=min_word_count,
                          window=context,
                          sample=downsampling)

# To make the model memory efficient (deprecated and no longer needed in gensim >= 4.0)
model.init_sims(replace=True)

# Saving the model for later use. Can be loaded using Word2Vec.load()
model_name = "300features_40minwords_10context"
model.save(model_name)
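A quick sanity check on the trained vectors (the query word is just an example and must appear at least min_word_count times in the corpus):

print(model.wv.most_similar('good', topn=5))   # nearest neighbours by cosine similarity
print(model.wv['good'].shape)                  # (300,) with the settings above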


-----------------------------------------------------------------------------------------------

To average embeddings in a sentence (based on https://www.kaggle.com/varun08/sentiment-analysis-using-word2vec)


# Function to average all word vectors in a paragraph
def featureVecMethod(words, model, num_features):
    # Pre-initialising an empty numpy array for speed
    featureVec = np.zeros(num_features, dtype="float32")
    nwords = 0

    # Converting index2word (a list) to a set for faster membership tests
    # (in gensim >= 4.0 this attribute is called model.wv.index_to_key)
    index2word_set = set(model.wv.index2word)

    for word in words:
        if word in index2word_set:
            nwords = nwords + 1
            featureVec = np.add(featureVec, model.wv[word])

    # Dividing the result by the number of words to get the average
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec
# Function for calculating the average feature vector
def getAvgFeatureVecs(reviews, model, num_features):
    counter = 0
    reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32")
    for review in reviews:
        # Printing a status message every 1000th review
        if counter%1000 == 0:
            print("Review %d of %d"%(counter,len(reviews)))
            
        reviewFeatureVecs[counter] = featureVecMethod(review, model, num_features)
        counter = counter+1
        
    return reviewFeatureVecs
# Calculating average feature vector for training set
clean_train_reviews = []
for review in train['review']:
    clean_train_reviews.append(review_wordlist(review, remove_stopwords=True))
    
trainDataVecs = getAvgFeatureVecs(clean_train_reviews, model, num_features)


Here, model is the word2vec model trained earlier on the entire text.


# Fitting a random forest classifier to the training data
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators = 100)
    
print("Fitting random forest to training data....")    
forest = forest.fit(trainDataVecs, train["sentiment"])
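The test set goes through the same averaging before prediction (test DataFrame and column names are placeholders mirroring the training code):

clean_test_reviews = [review_wordlist(review, remove_stopwords=True) for review in test['review']]
testDataVecs = getAvgFeatureVecs(clean_test_reviews, model, num_features)
predictions = forest.predict(testDataVecs)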

-----------------------------------------------------------------------------------------------


Averaging embeddings over a sentence


# This function creates a normalized vector for the whole sentence.
# Assumes: word_tokenize from nltk, stop_words is a set of stop words,
# and embeddings_index is a dict mapping word -> 300-d vector (e.g. GloVe).
def sent2vec(s):
    words = str(s).lower()   # .decode('utf-8') dropped: str is already unicode in Python 3
    words = word_tokenize(words)
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        try:
            M.append(embeddings_index[w])
        except KeyError:
            continue
    M = np.array(M)
    v = M.sum(axis=0)
    if type(v) != np.ndarray:
        return np.zeros(300)
    return v / np.sqrt((v ** 2).sum())
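To build a feature matrix from a list of texts (texts is a placeholder for any iterable of strings):

import numpy as np
X = np.vstack([sent2vec(t) for t in texts])   # shape: (n_texts, 300)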






Get bigrams

from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
import seaborn as sns

def get_top_tweet_bigrams(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

# Plot the 10 most frequent bigrams
plt.figure(figsize=(16, 5))
top_tweet_bigrams = get_top_tweet_bigrams(tweet['text'])[:10]
x, y = map(list, zip(*top_tweet_bigrams))
sns.barplot(x=y, y=x)

# Generic n-gram generator (STOPWORDS is assumed to be a set of stop words)
def generate_ngrams(text, n_gram=1):
    tokens = [token for token in text.lower().split(' ')
              if token != '' and token not in STOPWORDS]
    ngrams = zip(*[tokens[i:] for i in range(n_gram)])
    return [' '.join(ngram) for ngram in ngrams]
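Example of using generate_ngrams to count the most common trigrams (reusing the tweet['text'] column from the bigram example above):

from collections import Counter
trigram_counts = Counter()
for text in tweet['text']:
    trigram_counts.update(generate_ngrams(text, n_gram=3))
print(trigram_counts.most_common(10))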

-----------------------------------------------------------------------------------------------

Tools for sentiment analysis:

https://medium.com/@b.terryjack/nlp-pre-trained-sentiment-analysis-1eb52a9d742c

Financial: FinBERT on Hugging Face


https://huggingface.co/yiyanghkust/finbert-tone


from transformers import BertTokenizer, BertForSequenceClassification
from transformers import pipeline

finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone',num_labels=3)
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')

nlp = pipeline("sentiment-analysis", model=finbert, tokenizer=tokenizer)

sentences = ["there is a shortage of capital, and we need extra financing",  
             "growth is strong and we have plenty of liquidity", 
             "there are doubts about our finances", 
             "profits are flat"]
results = nlp(sentences)
print(results)  #LABEL_0: neutral; LABEL_1: positive; LABEL_2: negative


NLTK (uses VADER)

import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
sid.polarity_scores(sentence)   # returns {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}

TextBlob

from textblob import TextBlob
TextBlob(sentence).sentiment   # returns Sentiment(polarity, subjectivity), polarity in [-1, 1]

Vader (https://github.com/cjhutto/vaderSentiment, https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/)

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
# note: depending on how you installed (e.g., using source code download versus pip install),
# you may need to import like this:
# from vaderSentiment import SentimentIntensityAnalyzer

# --- examples -------
sentences = ["VADER is smart, handsome, and funny.",  # positive sentence example
             "VADER is smart, handsome, and funny!",  # punctuation emphasis handled correctly (sentiment intensity adjusted)
             "VADER is very smart, handsome, and funny.", # booster words handled correctly (sentiment intensity adjusted)
             "VADER is VERY SMART, handsome, and FUNNY.",  # emphasis for ALLCAPS handled
             "VADER is VERY SMART, handsome, and FUNNY!!!", # combination of signals - VADER appropriately adjusts intensity
             "VADER is VERY SMART, uber handsome, and FRIGGIN FUNNY!!!", # booster words & punctuation make this close to ceiling for score
             "VADER is not smart, handsome, nor funny.",  # negation sentence example
             "The book was good.",  # positive sentence
             "At least it isn't a horrible book.",  # negated negative sentence with contraction
             "The book was only kind of good.", # qualified positive sentence is handled correctly (intensity adjusted)
             "The plot was good, but the characters are uncompelling and the dialog is not great.", # mixed negation sentence
             "Today SUX!",  # negative slang with capitalization emphasis
             "Today only kinda sux! But I'll get by, lol", # mixed sentiment example with slang and constrastive conjunction "but"
             "Make sure you :) or :D today!",  # emoticons handled
             "Catch utf-8 emoji such as such as 💘 and 💋 and 😁",  # emojis handled
             "Not bad at all"  # Capitalized negation
             ]

analyzer = SentimentIntensityAnalyzer()
for sentence in sentences:
    vs = analyzer.polarity_scores(sentence)
    print("{:-<65} {}".format(sentence, str(vs)))

Flair

!pip3 install flair
import flair
flair_sentiment = flair.models.TextClassifier.load('en-sentiment')
s = flair.data.Sentence(sentence)
flair_sentiment.predict(s)
total_sentiment = s.labels   # list of Label objects, each with .value ('POSITIVE'/'NEGATIVE') and .score
total_sentiment

DeepMoji

!git clone https://github.com/huggingface/torchMoji
import os
os.chdir('torchMoji')
!pip3 install -e .
!python3 scripts/download_weights.py
!python3 examples/text_emojize.py --text f" {sentence} "



