Saturday, October 1, 2022

Loud fan of desktop

 Upon restart the fan of the desktop got loud again.


I cleaned the dust out of the desktop, but it was still loud (quieter than the first time).

I tried to run nvidia-smi to check the temperature, but it couldn't find the driver.

Tried sudo apt update: didn't work.

Tried sudo apt upgrade: worked.

Ran sudo ubuntu-drivers devices

Then sudo ubuntu-drivers autoinstall

Didn't work.

Then I just decided to remove all the NVIDIA drivers using

sudo apt-get remove --purge '^nvidia-.*'

sudo reboot

After reboot the noise was gone.


So it seems the upgrade pulled in NVIDIA driver versions that were not compatible, and purging them got rid of the noise.


Sunday, July 3, 2022

Gitlab CI/CD docker deployment

Source

GitLab CI CD Tutorial for Beginners [Crash Course] - YouTube

.gitlab-ci.yml · main · Nana Janashia / gitlab-cicd-crash-course · GitLab



Main idea:

1. .gitlab-ci.yml hosts the classical 3 stages, which are

    a. test

    b. build

    c. deploy

2. build uses a docker image plus the docker-in-docker (dind) service to build the image and push it to Docker Hub

3. deploy SSHes into the target machine, stops and removes the existing container, then runs the new image



variables:
  IMAGE_NAME: nanajanashia/demo-app
  IMAGE_TAG: python-app-1.0

stages:
  - test
  - build
  - deploy

run_tests:
  stage: test
  image: python:3.9-slim-buster
  before_script:
    - apt-get update && apt-get install make
  script:
    - make test


build_image:
  stage: build
  image: docker:20.10.16
  services:
    - docker:20.10.16-dind
  variables:
    DOCKER_TLS_CERTDIR: "/certs"
  before_script:
    - docker login -u $REGISTRY_USER -p $REGISTRY_PASS
  script:
    - docker build -t $IMAGE_NAME:$IMAGE_TAG .
    - docker push $IMAGE_NAME:$IMAGE_TAG


deploy:
  stage: deploy
  before_script:
    - chmod 400 $SSH_KEY
  script:
    - ssh -o StrictHostKeyChecking=no -i $SSH_KEY root@161.35.223.117 "
        docker login -u $REGISTRY_USER -p $REGISTRY_PASS &&
        docker ps -aq | xargs docker stop | xargs docker rm &&
        docker run -d -p 5000:5000 $IMAGE_NAME:$IMAGE_TAG"




Tuesday, June 14, 2022

ssh tunneling for closed ports

 

Source:

http://woshub.com/ssh-tunnel-port-forward-windows/



Problem: some ports are blocked on the local machine by IT. Rather than asking IT to open them, we use ssh tunneling to reach the service anyway.


Assumption:

You can ssh to the remote machine from the local machine using ssh user@ip


Approach:

Forward a local port through the ssh tunnel to the open port on the remote machine.



There are 3 modes of ssh tunneling:


Local tunneling (-L): when the local machine has IT restrictions but the remote is fine

Remote tunneling (-R): the reverse direction; not yet sure what the use case for this is

Dynamic tunneling (-D): the ssh client acts as a SOCKS proxy; not needed here


Local tunneling command


```
ssh -L 8888:10.247.2.145:7070 ubuntu@10.247.2.145
```

10.247.2.145: IP of the remote machine

7070: port that is open on the remote server

8888: local port that you will use to tunnel the traffic


After running the command above an ssh shell will open; we don't need the shell itself, only the tunnel it keeps open.


In the local browser open

127.0.0.1:8888


That's it





Sunday, March 13, 2022

Adding ssh keys to github

To authenticate a connection to a repository over ssh, you need to:


1. Generate ssh keys in your machine

ssh-keygen -o -t rsa -C "your@email.com"

2. Copy the public key to github

Display the key

cat ~/.ssh/id_rsa.pub

Copy it into the SSH keys section of your GitHub settings

3. Make sure the repository's .git/config file has the right url. Here is how it should look; make sure the url has the structure shown below.


[core]
        repositoryformatversion = 0
        filemode = true
        bare = false
        logallrefupdates = true
[remote "origin"]
        url = ssh://git@github.com:22/mohamedabolfadl/dubbizle_scraper.git
        fetch = +refs/heads/*:refs/remotes/origin/*
[branch "main"]
        remote = origin


Source

https://jdblischak.github.io/2014-09-18-chicago/novice/git/05-sshkeys.html


Setup the working space

 

1. Create anaconda environment

conda create --name scrappers --clone base


2. Activate environment

conda activate scrappers


3. Install key libs

conda config --add channels conda-forge
conda install cookiecutter
sudo apt-get install tree

Docker [Unverified] https://docs.docker.com/engine/install/ubuntu/

curl -fsSL https://get.docker.com -o get-docker.sh
DRY_RUN=1 sh ./get-docker.sh   # DRY_RUN=1 only prints the steps; drop it to actually install
4. Create cookiecutter project

mkdir defproj
cd defproj
cookiecutter -c v1 https://github.com/drivendata/cookiecutter-data-science
mkdir project_name
cd project_name
mkdir docker

5. Download Dockerfiles

curl -o Dockerfile https://raw.githubusercontent.com/mohamedabolfadl/default_docker_datascience/main/Dockerfile
cp Dockerfile project_name/docker/Dockerfile
curl -o project_name/docker/main.py https://raw.githubusercontent.com/mohamedabolfadl/default_docker_datascience/main/main.py


6. Initialize git repository [Not working at the push stage]

git init
git add .
git commit -m "first commit"
git branch -M main
git remote add origin https://github.com/USERNAME/REPONAME.git
git remote -v
git remote set-url origin git@github.com:USERNAME/REPONAME.git
git push -u origin main

7. Run Docker
docker build -t scrapper .

docker run -it -p 8080:8080 --mount type=bind,source=/home/mohamed/Desktop/old/defproj,target=/app  scrapper




Thursday, December 30, 2021

NLP summary

 

To get going with code without a lot of theory, this notebook is amazing:

https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle



Key python libraries

transformers

nltk

spacy

newspaper

textblob


Typical uses of NLP (a quick pipeline sketch follows this list):


1. Text classification (e.g. sentiment analysis, https://www.kaggle.com/c/nlp-getting-started/overview)

2. Summarization

3. Text completion (Fill missing words, Q&A, text generation)

4. Translation
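
Most of these tasks can be tried in a few lines with the Hugging Face pipeline API. A minimal sketch follows; the task names are real pipeline tasks, but the input strings and generation arguments are just placeholders:

from transformers import pipeline

# Each pipeline downloads a default model for its task on first use.
summarizer = pipeline("summarization")
print(summarizer("Some long article text to condense ...", max_length=60, min_length=10))

translator = pipeline("translation_en_to_de")
print(translator("Machine learning is fun."))

generator = pipeline("text-generation")
print(generator("The weather today is", max_length=20))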


Transformers:

  • Introduced by Google in 2017 (the "Attention Is All You Need" paper).
  • Basis of the Hugging Face transformers library.
  • Uses a different approach than RNNs: instead of processing the sequence step by step in one direction, self-attention lets every token look at the whole sequence (in both directions at once for encoder models like BERT).
  • The architecture is composed of an encoder and a decoder. The encoder adds a positional encoding, which takes the position of a word into account, something word2vec does not do.
  • Transformers learn their own embeddings, i.e. they don't use word2vec; they build their own.
  • BERT is the most famous architecture implementing the transformer encoder. There are many versions of BERT trained on different corpora.
  • GPT-3 is a much larger model compared to BERT (about 470 times bigger).

The way to use these models is as follows:

1. For typical applications such as sentiment analysis, an off-the-shelf model can be used. For example, sentiment analysis can be as simple as:

>>> from transformers import pipeline
>>> classifier = pipeline('sentiment-analysis')
>>> classifier('We are very happy to show you the 🤗 Transformers library.')
[{'label': 'POSITIVE', 'score': 0.9998}]

Under the hood this code uses a default model. To use a specific model:

>>> from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
>>> model = AutoModelForSequenceClassification.from_pretrained(model_name)
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>> classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

2. In case we have labeled data, we need to 'fine-tune' an existing model. For some reason, this is not so intuitive. Here's the tutorial

https://huggingface.co/docs/transformers/training

Here's a colab notebook

https://colab.research.google.com/github/huggingface/notebooks/blob/master/transformers_doc/training.ipynb
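
A condensed sketch of the fine-tuning flow from those links; the dataset name, model name and hyperparameters here are only placeholders for illustration:

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Any labeled text dataset works; "imdb" is the one used in the HF docs.
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

tokenized = dataset.map(tokenize, batched=True)

# Small subsets so the sketch finishes quickly
small_train = tokenized["train"].shuffle(seed=42).select(range(1000))
small_eval  = tokenized["test"].shuffle(seed=42).select(range(1000))

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="finetuned-model",
                         num_train_epochs=1,
                         per_device_train_batch_size=8)

trainer = Trainer(model=model, args=args,
                  train_dataset=small_train,
                  eval_dataset=small_eval)
trainer.train()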



If you want to build a model from scratch (without transformers), these are the typical steps in the NLP pipeline:

1. Cleaning: Remove non-useful words/characters, for example '#' and 'http'. To get a good cleaning function check this solution: https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert#4.-Embeddings-and-Text-Cleaning (a cleaning snippet is also included further below)

2. Tokenization: Convert a sentence into a list of words. The words are then converted to integers that correspond to the index in the vocabulary.

3. Embedding: 

A. Bag of words: initially a sentence would be translated into a vector containing the counts of each word, e.g. 60k columns for all possible words, where each row contains the number of occurrences of each word -> Problem is that there is a huge number of features

B. TF-IDF: scale the count of each word down according to how frequently it occurs across the whole corpus, so very common words get less weight (see the sketch right after this list)

C. Word2Vec: map each word into a vector of fixed length (e.g. 200). The challenge is to make similar words end up close to each other, i.e. their cosine similarity should be high. There are 2 approaches to build this:

i. Skip-gram: given the word at position i, predict the words in the surrounding window (i-N to i-1 and i+1 to i+N)

ii. CBOW: the opposite of skip-gram; given a neighborhood of words, predict the most likely word at position i

D. Other techniques: GloVe, fastText (Facebook)

4. Aggregate embeddings per sentence: after embedding the words, combine the embeddings into a single sentence embedding; the easiest way is to just average them. Doing this well is not straightforward; follow this post (https://www.kaggle.com/varun08/sentiment-analysis-using-word2vec)

5. Add metafeatures: simple features such as the sentence length, n-gram counts, or counts of specific words of interest (taken from https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert):

  • word_count: number of words in the text
  • unique_word_count: number of unique words in the text
  • stop_word_count: number of stop words in the text
  • url_count: number of URLs in the text
  • mean_word_length: average character count per word
  • char_count: number of characters in the text
  • punctuation_count: number of punctuation marks in the text
  • hashtag_count: number of hashtags (#) in the text
  • mention_count: number of mentions (@) in the text

6. Modeling: you could run an unsupervised clustering of the text (check https://towardsdatascience.com/unsupervised-sentiment-analysis-a38bf1906483), then follow it with simple supervised learning if needed. In some cases an LSTM is used, where each sentence is represented as an N_words x N_features matrix; the LSTM is fed the word vectors one at a time and produces a single output for the whole sentence.
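
A minimal sketch of steps 3A/3B above (bag of words and TF-IDF) with scikit-learn; the toy corpus is made up and CountVectorizer does the tokenization internally:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the fire is spreading fast",
          "the weather is nice today",
          "forest fire near the highway"]

# Bag of words: one column per vocabulary word, each cell = count of that word in the sentence
bow = CountVectorizer()
X_counts = bow.fit_transform(corpus)
print(bow.get_feature_names_out())   # sklearn >= 1.0
print(X_counts.toarray())

# TF-IDF: same layout, but counts are scaled down for words that appear in many documents
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.toarray().round(2))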



-----------------------------------------------------------------------------------------------

Text cleaning

import re
import string

def remove_html(text):
    html = re.compile(r'<.*?>')
    return html.sub(r'', text)

def remove_punct(text):
    table=str.maketrans('','',string.punctuation)
    return text.translate(table)

def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
# Thanks to https://www.kaggle.com/rftexas/text-only-kfold-bert
def clean_tweets(tweet):
    """Removes links and non-ASCII characters"""
    
    tweet = ''.join([x for x in tweet if x in string.printable])
    
    # Removing URLs
    tweet = re.sub(r"http\S+", "", tweet)
    
    return tweet

-----------------------------------------------------------------------------------------------

Train a word2vec model on sentences
# Creating the model and setting values for the various parameters
num_features = 300  # Word vector dimensionality
min_word_count = 40 # Minimum word count
num_workers = 4     # Number of parallel threads
context = 10        # Context window size
downsampling = 1e-3 # (0.001) Downsample setting for frequent words

# Initializing and training the model
# Note: this snippet targets the older gensim (<4.0) API; in gensim 4+ `size` is called
# `vector_size` and init_sims() is no longer needed.
from gensim.models import word2vec
print("Training model....")
model = word2vec.Word2Vec(sentences,             # sentences: list of tokenized sentences
                          workers=num_workers,
                          size=num_features,
                          min_count=min_word_count,
                          window=context,
                          sample=downsampling)   # add sg=1 for skip-gram; the default (sg=0) is CBOW

# To make the model memory efficient
model.init_sims(replace=True)

# Saving the model for later use. Can be loaded using Word2Vec.load()
model_name = "300features_40minwords_10context"
model.save(model_name)
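
A quick sanity check of the embeddings trained above; similar words should come out close in the vector space (assuming "good" is in the vocabulary):

# Nearest neighbours by cosine similarity in the embedding space
print(model.wv.most_similar("good", topn=5))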


-----------------------------------------------------------------------------------------------

To average embeddings in a sentence. Based on (https://www.kaggle.com/varun08/sentiment-analysis-using-word2vec)


# Function to average all word vectors in a paragraph
# (uses the older gensim (<4.0) API: model.wv.index2word and model[word])
import numpy as np

def featureVecMethod(words, model, num_features):
    # Pre-initialising empty numpy array for speed
    featureVec = np.zeros(num_features, dtype="float32")
    nwords = 0
    
    #Converting Index2Word which is a list to a set for better speed in the execution.
    index2word_set = set(model.wv.index2word)
    
    for word in  words:
        if word in index2word_set:
            nwords = nwords + 1
            featureVec = np.add(featureVec,model[word])
    
    # Dividing the result by number of words to get average
    featureVec = np.divide(featureVec, nwords)
    return featureVec
# Function for calculating the average feature vector
def getAvgFeatureVecs(reviews, model, num_features):
    counter = 0
    reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32")
    for review in reviews:
        # Printing a status message every 1000th review
        if counter%1000 == 0:
            print("Review %d of %d"%(counter,len(reviews)))
            
        reviewFeatureVecs[counter] = featureVecMethod(review, model, num_features)
        counter = counter+1
        
    return reviewFeatureVecs
# Calculating average feature vector for training set
# (review_wordlist comes from the referenced notebook: it tokenizes a review and optionally removes stopwords)
clean_train_reviews = []
for review in train['review']:
    clean_train_reviews.append(review_wordlist(review, remove_stopwords=True))
    
trainDataVecs = getAvgFeatureVecs(clean_train_reviews, model, num_features)


Here, model is the word2vec model trained earlier on the entire text.


# Fitting a random forest classifier to the training data
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators = 100)
    
print("Fitting random forest to training data....")    
forest = forest.fit(trainDataVecs, train["sentiment"])

-----------------------------------------------------------------------------------------------


Averaging embeddings over a sentence


# This function creates a normalized vector for the whole sentence
# Assumes word_tokenize (nltk), stop_words (a set of stopwords) and
# embeddings_index (word -> vector dict, e.g. loaded from GloVe) are already defined.
import numpy as np
from nltk.tokenize import word_tokenize

def sent2vec(s):
    words = str(s).lower()          # .decode('utf-8') was only needed on Python 2
    words = word_tokenize(words)
    words = [w for w in words if not w in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        try:
            M.append(embeddings_index[w])
        except:
            continue
    M = np.array(M)
    v = M.sum(axis=0)
    if type(v) != np.ndarray:
        return np.zeros(300)
    return v / np.sqrt((v ** 2).sum())
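
A quick usage sketch for the function above (stop_words and embeddings_index are assumed to be defined as in the source notebook):

v = sent2vec("The weather is nice today")
print(v.shape)   # (300,) normalized sentence vector, or zeros if no known word was found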






Get bigrams

from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
import seaborn as sns

def get_top_tweet_bigrams(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

plt.figure(figsize=(16, 5))
top_tweet_bigrams = get_top_tweet_bigrams(tweet['text'])[:10]   # tweet is the tweets dataframe
x, y = map(list, zip(*top_tweet_bigrams))
sns.barplot(x=y, y=x)

def generate_ngrams(text, n_gram=1):
    # STOPWORDS: any set of stop words (e.g. wordcloud.STOPWORDS or the nltk stopword list)
    token = [token for token in text.lower().split(' ') if token != '' if token not in STOPWORDS]
    ngrams = zip(*[token[i:] for i in range(n_gram)])
    return [' '.join(ngram) for ngram in ngrams]

-----------------------------------------------------------------------------------------------

Tools for sentiment analysis:

https://medium.com/@b.terryjack/nlp-pre-trained-sentiment-analysis-1eb52a9d742c

Financial: FinBERT on Hugging Face


https://huggingface.co/yiyanghkust/finbert-tone


from transformers import BertTokenizer, BertForSequenceClassification
from transformers import pipeline

finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone',num_labels=3)
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')

nlp = pipeline("sentiment-analysis", model=finbert, tokenizer=tokenizer)

sentences = ["there is a shortage of capital, and we need extra financing",  
             "growth is strong and we have plenty of liquidity", 
             "there are doubts about our finances", 
             "profits are flat"]
results = nlp(sentences)
print(results)  #LABEL_0: neutral; LABEL_1: positive; LABEL_2: negative


NLTK (uses VADER)

import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
sid.polarity_scores(sentence)   # sentence: any input string

TextBlob

from textblob import TextBlob
TextBlob(sentence).sentiment   # returns Sentiment(polarity, subjectivity)

Vader (https://github.com/cjhutto/vaderSentiment, https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/)

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
# note: depending on how you installed (e.g., using source code download versus pip install), you may need to import like this:
# from vaderSentiment import SentimentIntensityAnalyzer

# --- examples -------
sentences = ["VADER is smart, handsome, and funny.",  # positive sentence example
             "VADER is smart, handsome, and funny!",  # punctuation emphasis handled correctly (sentiment intensity adjusted)
             "VADER is very smart, handsome, and funny.", # booster words handled correctly (sentiment intensity adjusted)
             "VADER is VERY SMART, handsome, and FUNNY.",  # emphasis for ALLCAPS handled
             "VADER is VERY SMART, handsome, and FUNNY!!!", # combination of signals - VADER appropriately adjusts intensity
             "VADER is VERY SMART, uber handsome, and FRIGGIN FUNNY!!!", # booster words & punctuation make this close to ceiling for score
             "VADER is not smart, handsome, nor funny.",  # negation sentence example
             "The book was good.",  # positive sentence
             "At least it isn't a horrible book.",  # negated negative sentence with contraction
             "The book was only kind of good.", # qualified positive sentence is handled correctly (intensity adjusted)
             "The plot was good, but the characters are uncompelling and the dialog is not great.", # mixed negation sentence
             "Today SUX!",  # negative slang with capitalization emphasis
             "Today only kinda sux! But I'll get by, lol", # mixed sentiment example with slang and constrastive conjunction "but"
             "Make sure you :) or :D today!",  # emoticons handled
             "Catch utf-8 emoji such as such as 💘 and 💋 and 😁",  # emojis handled
             "Not bad at all"  # Capitalized negation
             ]

analyzer = SentimentIntensityAnalyzer()
for sentence in sentences:
    vs = analyzer.polarity_scores(sentence)
    print("{:-<65} {}".format(sentence, str(vs)))

Flair

!pip3 install flair
import flair
flair_sentiment = flair.models.TextClassifier.load('en-sentiment')
s = flair.data.Sentence(sentence)
flair_sentiment.predict(s)
total_sentiment = s.labels
total_sentiment

DeepMoji

!git clone https://github.com/huggingface/torchMoji
import os
os.chdir('torchMoji')
!pip3 install -e .
!python3 scripts/download_weights.py
!python3 examples/text_emojize.py --text f" {sentence} "



