To get going with code without a lot of theory, this notebook is a great starting point:
https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle
Key Python libraries:
transformers
nltk
spacy
newspaper
textblob
Typical uses of NLP:
1. Text classification (e.g. sentiment analysis, https://www.kaggle.com/c/nlp-getting-started/overview)
2. Summarization
3. Text completion (Fill missing words, Q&A, text generation)
4. Translation
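Each of these tasks is available as an off-the-shelf pipeline in the transformers library. A minimal sketch (the default models are downloaded on first use, so exact outputs will vary):
from transformers import pipeline

summarizer = pipeline("summarization")                  # 2. Summarization
print(summarizer("Some long article text ...", max_length=40, min_length=10))

unmasker = pipeline("fill-mask")                        # 3. Fill missing words
print(unmasker(f"NLP is used to {unmasker.tokenizer.mask_token} text."))

generator = pipeline("text-generation")                 # 3. Text generation
print(generator("Natural language processing makes it possible to", max_length=20))

translator = pipeline("translation_en_to_fr")           # 4. Translation
print(translator("Natural language processing is useful."))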
Transformers:
- Introduced by Google researchers in 2017 in the "Attention Is All You Need" paper.
- Basis of the Hugging Face transformers library.
- Uses a different approach than RNNs: rather than following the sequence in one direction, it attends to the text in both directions at once.
- The architecture is composed of an encoder and a decoder. The encoder contains a positional encoding, which takes the position of a word into account (something word2vec does not do).
- Transformers learn their own embeddings, i.e. they don't use word2vec; they build their own.
- BERT is the most famous architecture implementing the transformer model. There are many versions of BERT trained on different corpora.
- GPT-3 is a much larger model compared to BERT (roughly 470 times bigger).
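A quick way to see the learned sub-word vocabulary in action (a minimal sketch assuming the standard bert-base-uncased checkpoint):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Transformers learn their own embeddings"))
# rare words get split into sub-word pieces marked with '##'
print(tokenizer("Transformers learn their own embeddings")["input_ids"])
# integer ids into BERT's own vocabulary, with [CLS]/[SEP] special tokens added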
The way to use these models is as follows:
1. For typical applications such as sentiment analysis, an off-the-shelf model can be used. For example, sentiment analysis is as simple as:
>>> from transformers import pipeline
>>> classifier = pipeline('sentiment-analysis')
>>> classifier('We are very happy to show you the 🤗 Transformers library.')
[{'label': 'POSITIVE', 'score': 0.9998}]
Under the hood this code uses a default model. To use a specific model:
>>> from transformers import AutoModelForSequenceClassification, AutoTokenizer
>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
>>> model = AutoModelForSequenceClassification.from_pretrained(model_name)
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>> classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
2. If we have labeled data, we need to 'fine-tune' an existing model. This is less intuitive than it should be. Here's the tutorial:
https://huggingface.co/docs/transformers/training
Here's a colab notebook
https://colab.research.google.com/github/huggingface/notebooks/blob/master/transformers_doc/training.ipynb
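A minimal fine-tuning sketch along the lines of that tutorial (the dataset name "imdb" and the 1000-example subsets are illustrative; any labeled text/label dataset works):
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

dataset = load_dataset("imdb")  # any dataset with 'text' and 'label' columns
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

tokenized = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
args = TrainingArguments(output_dir="finetuned", num_train_epochs=1, per_device_train_batch_size=8)
trainer = Trainer(model=model,
                  args=args,
                  train_dataset=tokenized["train"].shuffle(seed=42).select(range(1000)),  # small subset for speed
                  eval_dataset=tokenized["test"].shuffle(seed=42).select(range(1000)))
trainer.train()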
If you want to build a model from scratch, use the following steps.
Typical steps in an NLP pipeline without transformers:
1. Cleaning: Remove non-useful words/characters, for example #, http, etc. For a good cleaning function, check this solution: https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert#4.-Embeddings-and-Text-Cleaning (sample cleaning functions are also included below).
2. Tokenization: Convert a sentence into a list of words. The words are then converted to integers that correspond to the index in the vocabulary.
3. Embedding:
A. Bag of words: Initially a sentence would be translated into a vector containing the counts of each word, e.g. 60k columns for all possible words, with each row containing the number of occurrences of each word. The problem is the huge number of features.
B. TF-IDF: Scale the count of each word down according to how frequently it occurs across the whole corpus (see the sketch after this list).
C. Word2Vec: map each word into a vector of fixed length (e.g. 200). The challenge is to make similar words end up close to each other, i.e. with high cosine similarity. There are two approaches to building this:
i. Skip-gram: given the word at position i, predict the most likely words at positions i-N to i-1 and i+1 to i+N.
ii. CBOW: the opposite of skip-gram; given a neighborhood of words, predict the most likely word at position i.
D. Other techniques: GloVe, fastText (from Facebook)
4. Aggregate embeddings per sentence: After embedding the words, combine the embeddings into a single sentence embedding. The easiest approach is to average them, but doing this well is not straightforward. Follow this post (https://www.kaggle.com/varun08/sentiment-analysis-using-word2vec)
5. Add metafeatures: Some features such as the sentence length, n-gram counts, and counts of specific words of interest (get them from https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert):
word_count: number of words in text
unique_word_count: number of unique words in text
stop_word_count: number of stop words in text
url_count: number of URLs in text
mean_word_length: average character count in words
char_count: number of characters in text
punctuation_count: number of punctuation marks in text
hashtag_count: number of hashtags (#) in text
mention_count: number of mentions (@) in text
6. Modeling: You could run unsupervised clustering of the text (see https://towardsdatascience.com/unsupervised-sentiment-analysis-a38bf1906483), then follow it with simple supervised learning if needed. In some cases an LSTM is used, where each sentence is represented as an N_words x N_features matrix; the LSTM is fed the word vectors one at a time and produces a single output for the sentence.
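A minimal bag-of-words / TF-IDF sketch for steps 2-3B plus a simple classifier (scikit-learn; train['text'] and train['target'] are placeholder column names):
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = train['text']      # cleaned sentences (pandas Series of strings)
labels = train['target']   # e.g. 0/1 sentiment labels

# Bag of words: one column per vocabulary word, each row holds word counts for a sentence
bow = CountVectorizer(max_features=60000).fit_transform(texts)

# TF-IDF: same idea, but counts are downweighted for words common across the corpus
X = TfidfVectorizer(max_features=60000).fit_transform(texts)

clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.score(X, labels))  # training accuracy; use a proper train/validation split in practice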
-----------------------------------------------------------------------------------------------
import re
import string

def remove_html(text):
    html = re.compile(r'<.*?>')
    return html.sub(r'', text)

def remove_punct(text):
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

def remove_emoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

# Thanks to https://www.kaggle.com/rftexas/text-only-kfold-bert
def clean_tweets(tweet):
    """Removes links and non-ASCII characters"""
    tweet = ''.join([x for x in tweet if x in string.printable])
    # Removing URLs
    tweet = re.sub(r"http\S+", "", tweet)
    return tweet
-----------------------------------------------------------------------------------------------
# Creating the model and setting values for the various parameters
num_features = 300    # Word vector dimensionality
min_word_count = 40   # Minimum word count
num_workers = 4       # Number of parallel threads
context = 10          # Context window size
downsampling = 1e-3   # (0.001) Downsample setting for frequent words

# Initializing the train model
from gensim.models import word2vec
print("Training model....")
model = word2vec.Word2Vec(sentences,
                          workers=num_workers,
                          size=num_features,        # renamed to vector_size in gensim >= 4.0
                          min_count=min_word_count,
                          window=context,
                          sample=downsampling)
# sg=1 switches to skip-gram; the default (sg=0) is CBOW

# To make the model memory efficient
model.init_sims(replace=True)

# Saving the model for later use. Can be loaded using Word2Vec.load()
model_name = "300features_40minwords_10context"
model.save(model_name)
-----------------------------------------------------------------------------------------------
To average embeddings over a sentence (based on https://www.kaggle.com/varun08/sentiment-analysis-using-word2vec):
import numpy as np

# Function to average all word vectors in a paragraph
def featureVecMethod(words, model, num_features):
    # Pre-initialising empty numpy array for speed
    featureVec = np.zeros(num_features, dtype="float32")
    nwords = 0
    # Converting Index2Word, which is a list, to a set for faster lookups
    index2word_set = set(model.wv.index2word)
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1
            featureVec = np.add(featureVec, model[word])
    # Dividing the result by number of words to get average
    featureVec = np.divide(featureVec, nwords)
    return featureVec

# Function for calculating the average feature vector
def getAvgFeatureVecs(reviews, model, num_features):
    counter = 0
    reviewFeatureVecs = np.zeros((len(reviews), num_features), dtype="float32")
    for review in reviews:
        # Printing a status message every 1000th review
        if counter % 1000 == 0:
            print("Review %d of %d" % (counter, len(reviews)))
        reviewFeatureVecs[counter] = featureVecMethod(review, model, num_features)
        counter = counter + 1
    return reviewFeatureVecs

# Calculating average feature vector for training set
clean_train_reviews = []
for review in train['review']:
    clean_train_reviews.append(review_wordlist(review, remove_stopwords=True))

trainDataVecs = getAvgFeatureVecs(clean_train_reviews, model, num_features)
Here, model is the word2vec model trained earlier on the entire text.
# Fitting a random forest classifier to the training data
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100)

print("Fitting random forest to training data....")
forest = forest.fit(trainDataVecs, train["sentiment"])
-----------------------------------------------------------------------------------------------
Averaging embeddings over a sentence
# this function creates a normalized vector for the whole sentence
# (assumes embeddings_index maps word -> vector, e.g. loaded from GloVe,
#  and stop_words is a set of stop words, e.g. set(stopwords.words('english')))
import numpy as np
from nltk.tokenize import word_tokenize

def sent2vec(s):
    words = str(s).lower()   # .decode('utf-8') was only needed in Python 2
    words = word_tokenize(words)
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        try:
            M.append(embeddings_index[w])
        except KeyError:
            continue
    M = np.array(M)
    v = M.sum(axis=0)
    if type(v) != np.ndarray:
        return np.zeros(300)
    return v / np.sqrt((v ** 2).sum())
Get bigrams
from sklearn.feature_extraction.text import CountVectorizer

def get_top_tweet_bigrams(corpus, n=None):
vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
bag_of_words = vec.transform(corpus)
sum_words = bag_of_words.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
return words_freq[:n]
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(16,5))
top_tweet_bigrams=get_top_tweet_bigrams(tweet['text'])[:10]
x,y=map(list,zip(*top_tweet_bigrams))
sns.barplot(x=y,y=x)
# STOPWORDS is assumed to be a set of stop words (e.g. from wordcloud or nltk)
def generate_ngrams(text, n_gram=1):
    token = [token for token in text.lower().split(' ') if token != '' if token not in STOPWORDS]
    ngrams = zip(*[token[i:] for i in range(n_gram)])
    return [' '.join(ngram) for ngram in ngrams]
-----------------------------------------------------------------------------------------------
Tools for sentiment analysis:
https://medium.com/@b.terryjack/nlp-pre-trained-sentiment-analysis-1eb52a9d742c
Financial: FinBERT (on the Hugging Face hub)
https://huggingface.co/yiyanghkust/finbert-tone
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import pipeline
finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone',num_labels=3)
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')
nlp = pipeline("sentiment-analysis", model=finbert, tokenizer=tokenizer)
sentences = ["there is a shortage of capital, and we need extra financing",
"growth is strong and we have plenty of liquidity",
"there are doubts about our finances",
"profits are flat"]
results = nlp(sentences)
print(results) #LABEL_0: neutral; LABEL_1: positive; LABEL_2: negative
NLTK (uses vader)
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
sid.polarity_scores(sentence)
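polarity_scores returns a dict with neg, neu, pos, and a normalized compound score in [-1, 1]; the usual rule of thumb is compound >= 0.05 for positive and <= -0.05 for negative.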
TextBlob
from textblob import TextBlob
TextBlob(sentence).sentiment
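This returns a named tuple with polarity in [-1, 1] and subjectivity in [0, 1].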
Vader (https://github.com/cjhutto/vaderSentiment, https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/)
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
# note: depending on how you installed (e.g., using source code download versus pip install),
# you may need to import like this:
# from vaderSentiment import SentimentIntensityAnalyzer

# --- examples -------
sentences = ["VADER is smart, handsome, and funny.",       # positive sentence example
             "VADER is smart, handsome, and funny!",       # punctuation emphasis handled correctly (sentiment intensity adjusted)
             "VADER is very smart, handsome, and funny.",  # booster words handled correctly (sentiment intensity adjusted)
             "VADER is VERY SMART, handsome, and FUNNY.",  # emphasis for ALLCAPS handled
             "VADER is VERY SMART, handsome, and FUNNY!!!",  # combination of signals - VADER appropriately adjusts intensity
             "VADER is VERY SMART, uber handsome, and FRIGGIN FUNNY!!!",  # booster words & punctuation make this close to ceiling for score
             "VADER is not smart, handsome, nor funny.",   # negation sentence example
             "The book was good.",                         # positive sentence
             "At least it isn't a horrible book.",         # negated negative sentence with contraction
             "The book was only kind of good.",            # qualified positive sentence is handled correctly (intensity adjusted)
             "The plot was good, but the characters are uncompelling and the dialog is not great.",  # mixed negation sentence
             "Today SUX!",                                 # negative slang with capitalization emphasis
             "Today only kinda sux! But I'll get by, lol",  # mixed sentiment example with slang and contrastive conjunction "but"
             "Make sure you :) or :D today!",              # emoticons handled
             "Catch utf-8 emoji such as 💘 and 💋 and 😁",    # emojis handled
             "Not bad at all"]                             # capitalized negation

analyzer = SentimentIntensityAnalyzer()
for sentence in sentences:
    vs = analyzer.polarity_scores(sentence)
    print("{:-<65} {}".format(sentence, str(vs)))
Flair
!pip3 install flair
import flair
flair_sentiment = flair.models.TextClassifier.load('en-sentiment')
s = flair.data.Sentence(sentence)
flair_sentiment.predict(s)
total_sentiment = s.labels
total_sentiment
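total_sentiment is a list of labels, typically something like [POSITIVE (0.99)], i.e. the predicted class plus its confidence.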
DeepMoji
!git clone https://github.com/huggingface/torchMoji
import os
os.chdir('torchMoji')
!pip3 install -e .
!python3 scripts/download_weights.py
!python3 examples/text_emojize.py --text "{sentence}"  # {sentence} is expanded by the notebook shell