Comp 440 HW3: Mining Sentiment on Twitter

Samantha Fritsche

3/26/2020

In [0]:
import pandas as pd
import numpy as np

Data Collection

I collected and cleaned my data on my local machine, then uploaded the JSON files to Dropbox. While collecting my labeled tweet dataset, I confirmed that a given tweet contained only ":)" or ":(", but not both. I also stripped out usernames, links, punctuation, and stopwords, and I skipped retweets so that a single tweet by a popular individual would not be given greater weight than any other tweet. I preserved the original text in the "text" field and stored the cleaned text in the "clean_text" field. I assigned each tweet a sentiment of "happy," "sad," or "?" based on whether it included ":)", ":(", or neither, respectively. The last category only came into play for my keyword-based dataset, for which I collected tweets containing the keyword "healthcare." I did not deliberately collect tweets with either emoticon in them, but I did allow them. I've included below a code sample showing how I gathered my "happy" dataset. For all of the datasets, I avoided duplicating tweets by ensuring that the ID of each new stream tweet did not match any of the past 100 IDs.

Overall, I collected 32,500 tweets - 15k "happy" tweets, 15k "sad" tweets, and 2.5k "healthcare" tweets.

import tweepy
import json, string
import re
from nltk.corpus import stopwords

class MyStreamListener(tweepy.StreamListener):

    def __init__(self, stop_at=15000):
        super(MyStreamListener, self).__init__()
        self.tweet_counter = 0
        self.datalist = []
        self.stop_at = stop_at
        self.file_entries = 0
        self.checkpoint = 100

        self.file_cap = 5000

        self.ids = [0] * 100  # IDs of the 100 most recently accepted tweets
        self.id_exists = False
        self.stops = stopwords.words("english")


    def on_status(self, status):

        # Write a file every 5000 entries read
        # If the connection messes up, we'll have some of the stream at least

        if self.tweet_counter % self.checkpoint == 0:
            print(self.tweet_counter)

        if len(self.datalist) >= self.file_cap:
            print('file created')
            with open('nsmile_tweets_' + str(self.file_entries) + '.json', 'w') as outfile:
                json.dump(self.datalist, outfile)
            self.file_entries += 1
            self.datalist.clear()

        if self.tweet_counter < self.stop_at:
            tweetj = status._json

            # Skip tweets whose ID matches one of the recently accepted IDs
            self.id_exists = tweetj['id'] in self.ids

            if 'retweeted_status' not in tweetj and not self.id_exists:
                if (tweetj['lang']=='en'):

                    text = ''
                    if 'extended_tweet' in tweetj:
                        # print('hit full')
                        text = tweetj['extended_tweet']['full_text'] # Make sure to get full tweet text
                    else:
                        text = tweetj['text']

                    tweetj['text'] = text

                    # ----------- Set invalids -----------
                    if ':)' in text:  # skip messages without certain things

                        if ':(' not in text:

                            # print("Orig: " + tweetj['text'])

                            text = text.lower() # Make everything lower
                            text = re.sub(r'http://\S+|https://\S+', '', text)  # Remove links
                            text = re.sub(r'@[^\s]+', '', text)  # Remove usernames
                            text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
                            text = ' '.join([word for word in text.split() if word not in (self.stops)]) # Remove stopwords
                            # print("New: " + text)

                            tweetj['clean_text'] = text

                            # ----------- Set sentiment start -----------
                            tweetj['sentiment'] = 'happy'
                            # ----------- Set sentiment end -----------

                            self.datalist.append(tweetj)

                            self.tweet_counter += 1
                            # Drop the oldest ID (pop() with no argument would drop the newest)
                            self.ids.pop(0)
                            self.ids.append(tweetj['id'])
            return True

        else:
            return False

consumer_token = 'xxx'
consumer_secret = 'xxx'

access_token = 'xxx'
access_secret = 'xxx'

auth = tweepy.OAuthHandler(consumer_token, consumer_secret)

auth.set_access_token(access_token, access_secret)

listener = MyStreamListener(stop_at=15000)

streaming_api = tweepy.Stream(auth, listener, timeout=60, tweet_mode='extended')

streaming_api.filter(track=[':)'])
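The cleaning and de-duplication steps inside the listener can also be sketched as standalone helpers, which makes them easier to test without a live stream. This is a minimal sketch rather than the code used for collection: `clean_tweet`, `RecentIds`, and the tiny `STOPWORDS` set are hypothetical names introduced here, and the actual run used NLTK's English stopword list.

```python
import re
import string
from collections import deque

# Hypothetical, deliberately tiny stopword set (assumption for illustration only;
# the listener above uses nltk.corpus.stopwords.words("english")).
STOPWORDS = {"a", "an", "and", "the", "is", "it", "this", "to", "of", "my"}


def clean_tweet(text, stopwords=STOPWORDS):
    """Lowercase, then strip links, usernames, punctuation, and stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)   # remove links
    text = re.sub(r"@\S+", "", text)           # remove usernames
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in stopwords)


class RecentIds:
    """Remember only the most recent `size` tweet IDs."""

    def __init__(self, size=100):
        # A deque with maxlen discards the oldest entry automatically.
        self._ids = deque(maxlen=size)

    def seen(self, tweet_id):
        """Return True for a recent duplicate; otherwise record the ID."""
        if tweet_id in self._ids:
            return True
        self._ids.append(tweet_id)
        return False


recent = RecentIds(size=100)
print(clean_tweet("@mpatte10 Oh my goodness! Good one! :)"))
print(recent.seen(42), recent.seen(42))
```

A bounded deque keeps the "past 100 IDs" window correct without any manual popping, since the oldest ID is always the one evicted.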

Data Preparation

To prepare my data for use as a training data set, I concatenated all of the data I'd collected into a single pandas DataFrame. I only included my text and sentiment data as I was only interested in doing text analysis.

In [0]:
sm1_df = pd.read_json('https://www.dropbox.com/s/gp1nuytnu42mbm1/nsmile_tweets_0.json?dl=1')[['clean_text', 'text', 'sentiment']]
sm2_df = pd.read_json('https://www.dropbox.com/s/1qh2hxgpzpz530q/nsmile_tweets_1.json?dl=1')[['clean_text', 'text', 'sentiment']]
sm3_df = pd.read_json('https://www.dropbox.com/s/dnnbi6p7kswqjqj/nsmile_tweets_2.json?dl=1')[['clean_text', 'text', 'sentiment']]

fr1_df = pd.read_json('https://www.dropbox.com/s/donahmoq4uhybme/nfrown_tweets_0.json?dl=1')[['clean_text', 'text', 'sentiment']]
fr2_df = pd.read_json('https://www.dropbox.com/s/e885oryqk4q3czx/nfrown_tweets_1.json?dl=1')[['clean_text', 'text', 'sentiment']]
fr3_df = pd.read_json('https://www.dropbox.com/s/0oycgnwbao086d4/nfrown_tweets_2.json?dl=1')[['clean_text', 'text', 'sentiment']]

misc_sm_df = pd.read_json('https://www.dropbox.com/s/apxxw40v3jo3532/nmisc_smile_tweets.json?dl=1')[['clean_text', 'text', 'sentiment']]
misc_fr_df = pd.read_json('https://www.dropbox.com/s/uqsw3zf2e3y1tl3/nmisc_frown_tweets.json?dl=1')[['clean_text', 'text', 'sentiment']]
misc_qu_df = pd.read_json('https://www.dropbox.com/s/s6pz9q94uqhcd1m/nmisc_quest_tweets.json?dl=1')[['clean_text', 'text', 'sentiment']]
In [32]:
tweet_df = pd.concat([sm1_df, sm2_df, sm3_df, fr1_df, fr2_df, fr3_df, misc_sm_df, misc_fr_df, misc_qu_df], ignore_index=True).reset_index(drop=True)
tweet_df.head()
Out[32]:
index | clean_text | text | sentiment
0 | passport renewal took like hour half bad | Passport renewal only took like an hour and a half, not bad :) | happy
1 | oh goodness good one | @mpatte10 Oh my goodness! Good one! :) | happy
2 | say jumped plane fell rainbownowell read happened skydiving skydivingnewengland maine | Can you say you have jumped out of a plane, and then fell through a rainbow?...No?......Well I can, because I did :) \n.\nRead how it happened here https://t.co/iYekJ3g5om\n.\n#Skydiving #SkydivingNewEngland #Maine https://t.co/L9iiKYGLDw | happy
3 | course | @gloriagaynor But, of course! :) | happy
4 | oh didnt know | @JonComms Oh my. Didn't know this :) | happy

After concatenating the data, I constructed a Tf-Idf matrix for all of the data.

In [0]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

count_vect = CountVectorizer()
counts = count_vect.fit_transform(tweet_df['clean_text'])

tfidf_transformer = TfidfTransformer()
full_tfidf = tfidf_transformer.fit_transform(counts)
In [34]:
display(full_tfidf)
<32500x29578 sparse matrix of type '<class 'numpy.float64'>'
	with 240337 stored elements in Compressed Sparse Row format>

Since I reset the indices of my DataFrame, the row indices of the matrix correspond to the index of my DataFrame. Each row of the matrix represents a tweet, and each column represents a one-word "feature" that a tweet may include. Below I printed information about the first three rows of the matrix, i.e. the first three tweets in the DataFrame, which can be seen above. If you compare the "clean_text" field to the number of stored elements, you can see that the number of distinct words in the former is equal to the latter.

In [35]:
display(full_tfidf[0,:])
display(full_tfidf[1,:])
display(full_tfidf[2,:])
<1x29578 sparse matrix of type '<class 'numpy.float64'>'
	with 7 stored elements in Compressed Sparse Row format>
<1x29578 sparse matrix of type '<class 'numpy.float64'>'
	with 4 stored elements in Compressed Sparse Row format>
<1x29578 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements in Compressed Sparse Row format>
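To make the link between words and stored elements concrete, here is a small pure-Python sketch of the weighting. It assumes the default settings of `TfidfTransformer` as I understand them (raw counts, smoothed idf of the form ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row). The function name `tfidf_rows` is hypothetical, and splitting on whitespace only approximates `CountVectorizer`, which by default also drops one-character tokens.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {word: weight} dict per document, L2-normalised."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: the number of documents each word appears in.
    df = Counter(word for tokens in tokenised for word in set(tokens))
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Raw count times smoothed idf.
        weights = {
            word: count * (math.log((1 + n_docs) / (1 + df[word])) + 1.0)
            for word, count in counts.items()
        }
        # Guard against an empty document, whose norm would be zero.
        norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
        rows.append({word: v / norm for word, v in weights.items()})
    return rows


docs = [
    "passport renewal took like hour half bad",
    "oh goodness good one",
    "oh didnt know",
]
rows = tfidf_rows(docs)
# One stored (non-zero) element per distinct word in each document.
stored = [len(row) for row in rows]
print(stored)
```

Each row only stores weights for the words the tweet actually contains, which is why the matrix is so sparse: a tweet with seven distinct words has seven stored elements out of 29,578 columns.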

Although I wanted my Tf-Idf matrix to include features for all of the data, I wanted to train my classifier only on the labeled dataset and then apply it to the unlabeled data. Therefore, I needed to split the matrix into two subsets, labeled and unlabeled. I then split the labeled subset into training and test data.

In [0]:
tweet_df_labeled = tweet_df[tweet_df['sentiment']!='?']

tweet_df_unlabeled = tweet_df[tweet_df['sentiment']=='?'].copy() # denote rows to remove
mask_indices = list(tweet_df_unlabeled.index) # array position indices
In [0]:
mask = np.ones(full_tfidf.shape[0], dtype=bool)
mask[mask_indices] = False

labeled_tfidf = full_tfidf[mask]
In [38]:
labeled_tfidf
Out[38]:
<30003x29578 sparse matrix of type '<class 'numpy.float64'>'
	with 204868 stored elements in Compressed Sparse Row format>
In [0]:
(labeled_tfidf_train, labeled_tfidf_test, labeled_train, labeled_test) = train_test_split(labeled_tfidf, 
                                                                  tweet_df_labeled['sentiment'], 
                                                                  test_size=0.2, random_state=1)

Using the training data, I fit a Gradient Boosting Classifier and produced the confusion matrix for the test data. My classifier achieved an accuracy of approximately 68%. This is a relatively low score, so in the next section I attempted to improve it by including bigrams as features.

In [0]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix

gbc = GradientBoostingClassifier().fit(labeled_tfidf_train, labeled_train)
labeled_pred = gbc.predict(labeled_tfidf_test)
In [41]:
c_matrix = confusion_matrix(labeled_test, labeled_pred)
display(c_matrix)
display((c_matrix[0,0]+c_matrix[1,1])/(c_matrix[1,0]+c_matrix[0,1]+c_matrix[0,0]+c_matrix[1,1]))
array([[2678,  346],
       [1554, 1423]])
0.6833861023162806
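The same confusion matrix also shows where the errors fall. The sketch below recomputes the accuracy and adds per-class recall, assuming the usual scikit-learn layout (rows are true labels, columns are predictions, labels in sorted order, so "happy" then "sad"). The variable names are mine and the numbers are copied from the output above.

```python
# Confusion matrix reported above; assumed layout: rows = true label,
# columns = predicted label, both in the order ("happy", "sad").
c_matrix = [[2678, 346],
            [1554, 1423]]

total = sum(sum(row) for row in c_matrix)

# Accuracy: correct predictions (the diagonal) over all test tweets.
accuracy = (c_matrix[0][0] + c_matrix[1][1]) / total

# Recall per class: correct predictions over all tweets truly in that class.
happy_recall = c_matrix[0][0] / sum(c_matrix[0])
sad_recall = c_matrix[1][1] / sum(c_matrix[1])

print(f"accuracy={accuracy:.3f}")
print(f"happy recall={happy_recall:.3f}, sad recall={sad_recall:.3f}")
```

Under that layout, the classifier recovers close to 89% of the happy test tweets but fewer than half of the sad ones, so most of the error comes from sad tweets being predicted as happy.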

I was also interested in seeing the most "important" features in determining sentiment. As expected, words like "miss," "sad," and "thanks," which generally have a clear positive or negative skew, are high on the list.

In [42]:
top_indices = gbc.feature_importances_.argsort()[::-1]
names = count_vect.get_feature_names()
print([names[i] for i in top_indices[:30]])
['miss', 'sad', 'thanks', 'sorry', 'want', 'thank', 'love', 'great', 'wish', 'good', 'wanna', 'baby', 'heart', 'happy', 'omg', 'new', 'nice', 'pls', 'hi', 'supernatural', 'welcome', 'cute', 'much', 'check', 'awesome', 'enjoy', 'poor', 'bad', 'hey', 'im']

Next, I examined predictions for the unlabeled dataset. To do this, I subset the Tf-Idf matrix again, but this time kept only the unlabeled rows. In addition to the hard "happy" and "sad" predictions, I produced a probability or "score" of happiness/sadness for each tweet. I appended these to my tweet DataFrame.

In [0]:
n_mask_indices = list(tweet_df_labeled.index) # Recall that these are rows that are "happy" or "sad"
n_mask = np.ones(full_tfidf.shape[0], dtype=bool)
n_mask[n_mask_indices] = False

unlabeled_tfidf = full_tfidf[n_mask]
In [0]:
unlabeled_pred = list(gbc.predict(unlabeled_tfidf))
unlabeled_pred_probs = gbc.predict_proba(unlabeled_tfidf)
In [0]:
happy_probs = unlabeled_pred_probs[:,0]
sad_probs = unlabeled_pred_probs[:,1]
In [46]:
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is deprecated in newer pandas

tweet_df_unlabeled['pred'] = unlabeled_pred

tweet_df_unlabeled['happiness'] = happy_probs
tweet_df_unlabeled['sadness'] = sad_probs

display(tweet_df_unlabeled.sort_values(by='happiness', ascending=False).head())
display(tweet_df_unlabeled.sort_values(by='sadness', ascending=False).head())
index | clean_text | text | sentiment | pred | happiness | sadness
31342 | latest chronically awesome reader thanks healthcare parkinsons | The latest The Chronically Awesome Reader! https://t.co/CcU7YhoAQA Thanks to @IsobelKnight2 #healthcare #parkinsons | ? | happy | 0.898682 | 0.101318
32037 | hi andrew thanks support working parent thrilled taxes support affordable childcare well healthcare ways society cares seniors | @oceansidewebtv @albertaNDP @RachelNotley Hi Andrew, thanks for your support. As a working parent, I am thrilled for my taxes support more affordable childcare - as well as healthcare and other ways our society cares for seniors. | ? | happy | 0.887219 | 0.112781
31983 | thanks she’s going fine minor injury thankfully super glad live rad country public healthcare | @ScottCrockatt Thanks! She’s going to be fine, just a minor injury thankfully. Super glad we live in such a rad country with public healthcare! | ? | happy | 0.883968 | 0.116032
31761 | thanks jeff looking forward | Thanks Jeff. Looking forward to it. https://t.co/6CJdmgzcgL | ? | happy | 0.873127 | 0.126873
32477 | someone preexisting condition needs health insurance survive id rather healthcare fucked good guys win political points thanks | @MmeScience @sahilkapur As someone with a pre-existing condition who needs health insurance to survive I'd rather not have my healthcare fucked up so "the good guys" can win political points, thanks | ? | happy | 0.843994 | 0.156006

index | clean_text | text | sentiment | pred | happiness | sadness
31682 | sad thing education healthcare jobs poor sods want jumlaman blogman mercenaries think otherwise | @tavleen_singh And the sad thing is education healthcare jobs is what the poor sods want, but JumlaMan, BlogMan and their mercenaries think otherwise. https://t.co/vls1v4vMLv | ? | sad | 0.091702 | 0.908298
31057 | sad hear jeff work healthcare amp breaks heart desperately need medicare | @JeffHisDudeness So sad to hear this Jeff. I work in healthcare &amp; this just breaks my heart. We desperately need Medicare for ALL. | ? | sad | 0.092471 | 0.907529
30307 | miserable human must someone want access healthcare ripped away sick people sad | @RickLangel @thehill How miserable of a human being must someone be to want access to healthcare to be ripped away from sick people? Sad. | ? | sad | 0.135246 | 0.864754
31075 | miss vote healthcare bill | @dccc Did I miss the vote on @RepJayapal's healthcare bill? | ? | sad | 0.160904 | 0.839096
30509 | american healthcare system damn messed neglected it’s sad | the American healthcare system is so damn messed up and neglected it’s sad | ? | sad | 0.164510 | 0.835490

Although my model doesn't perform very well in general, it appears to have predicted the sentiment correctly for most of the happiest and saddest posts. One exception is tweet #32477, which is clearly negative but was scored as happy. It's likely that words like "thanks" and "good," which are generally positive, skewed it positive.
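The "happiness" and "sadness" columns above take columns 0 and 1 of the `predict_proba` output, which relies on the classifier ordering its classes as "happy" then "sad". A slightly more defensive sketch looks the column up by label instead. The helper `proba_column` is hypothetical, and the lists below stand in for `gbc.classes_` and the `predict_proba` result, using the scores of the top happy and top sad rows shown above.

```python
def proba_column(classes, probs, label):
    """Return the probability column for `label`, looked up by class name."""
    col = list(classes).index(label)
    return [row[col] for row in probs]


# Stand-ins (assumptions for illustration): in the notebook these would be
# gbc.classes_ and gbc.predict_proba(unlabeled_tfidf).
classes = ["happy", "sad"]
probs = [[0.898682, 0.101318],   # tweet 31342
         [0.091702, 0.908298]]   # tweet 31682

happy_probs = proba_column(classes, probs, "happy")
sad_probs = proba_column(classes, probs, "sad")
print(happy_probs, sad_probs)
```

Looking the column up by name keeps the scores correct even if the label strings change, for example if a third class were added later.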

Bigrams

My last classifier uses only unigrams as features and is not very accurate. To improve the classifier, I introduced bigrams as well. This allows us to consider phrases such as "not good" as being collectively negative, even if "not" and "good" might not be negative on their own. The workflow was largely the same as for the unigram classifier. I produced a Tf-Idf matrix for the data, but this time included both unigrams and bigrams as features. I then subset the labeled dataset, trained the model on its training portion, evaluated it on the labeled test data, and applied it to the unlabeled data.

In [0]:
bi_count_vect = CountVectorizer(ngram_range=(1,2))
bi_counts = bi_count_vect.fit_transform(tweet_df['clean_text'])

bi_tfidf_transformer = TfidfTransformer()
bi_full_tfidf = bi_tfidf_transformer.fit_transform(bi_counts)
In [0]:
bi_mask = np.ones(bi_full_tfidf.shape[0], dtype=bool)
bi_mask[mask_indices] = False

bi_labeled_tfidf_cl = bi_full_tfidf[bi_mask]
In [0]:
(bi_labeled_tfidf_train, bi_labeled_tfidf_test, bi_labeled_train, bi_labeled_test) = train_test_split(bi_labeled_tfidf_cl, 
                                                                  tweet_df_labeled['sentiment'], 
                                                                  test_size=0.2, random_state=1)
In [0]:
bi_gbc = GradientBoostingClassifier().fit(bi_labeled_tfidf_train, bi_labeled_train)
bi_labeled_pred = bi_gbc.predict(bi_labeled_tfidf_test)

I show the confusion matrix for this model below. I was disappointed to discover that the model's accuracy was nearly the same. Moreover, when I displayed the most important features, I discovered that they were all unigram features. This could potentially be a negative effect of having removed stopwords at the start. With further work, I would consider leaving some or all stopwords in.
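The stopword effect can be illustrated with the first tweet in the dataset, whose cleaned text lost the word "not". The sketch below builds bigrams before and after stopword removal. `ngrams` and the small `STOPWORDS` set are hypothetical helpers for illustration, whereas the actual pipeline used NLTK's English stopword list and `CountVectorizer(ngram_range=(1,2))`.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical subset of a stopword list (assumption for illustration only).
STOPWORDS = {"not", "only", "an", "and", "a"}

raw = "passport renewal only took like an hour and a half not bad".split()
kept = [word for word in raw if word not in STOPWORDS]

bigrams_raw = ngrams(raw, 2)
bigrams_kept = ngrams(kept, 2)

print("not bad" in bigrams_raw)    # negation survives when stopwords are kept
print("not bad" in bigrams_kept)   # negation is gone after stopword removal
print(bigrams_kept)
```

Because the negation word is removed before the bigrams are formed, the bigram model sees "half bad" instead of "not bad", so it has little chance of learning negated phrases.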

In [51]:
bi_c_matrix = confusion_matrix(bi_labeled_test, bi_labeled_pred)
display(bi_c_matrix)
display((bi_c_matrix[0,0]+bi_c_matrix[1,1])/(bi_c_matrix[0,0]+bi_c_matrix[1,1]+bi_c_matrix[0,1]+bi_c_matrix[1,0]))
array([[2623,  401],
       [1507, 1470]])
0.6820529911681387
In [52]:
bi_top_indexes = bi_gbc.feature_importances_.argsort()[::-1]
bi_names = bi_count_vect.get_feature_names()
print([bi_names[i] for i in bi_top_indexes[:30]])
['miss', 'sad', 'thanks', 'sorry', 'want', 'thank', 'love', 'wish', 'great', 'good', 'wanna', 'baby', 'happy', 'heart', 'omg', 'new', 'nice', 'hi', 'pls', 'supernatural', 'cute', 'welcome', 'awesome', 'check', 'much', 'oh', 'im', 'enjoy', 'poor', 'hey']
In [0]:
bi_unlabeled_mask = np.ones(bi_full_tfidf.shape[0], dtype=bool)
bi_unlabeled_mask[n_mask_indices] = False

bi_unlabeled_tfidf = bi_full_tfidf[bi_unlabeled_mask]
In [0]:
bi_unlabeled_pred = list(bi_gbc.predict(bi_unlabeled_tfidf))
bi_unlabeled_pred_probs = bi_gbc.predict_proba(bi_unlabeled_tfidf)
In [0]:
bi_happy_probs = bi_unlabeled_pred_probs[:,0]
bi_sad_probs = bi_unlabeled_pred_probs[:,1]
In [58]:
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is deprecated in newer pandas

tweet_df_unlabeled['bi_pred'] = bi_unlabeled_pred

tweet_df_unlabeled['bi_happiness'] = bi_happy_probs
tweet_df_unlabeled['bi_sadness'] = bi_sad_probs

display(tweet_df_unlabeled.sort_values(by='bi_happiness', ascending=False).head(5))
display(tweet_df_unlabeled.sort_values(by='bi_sadness', ascending=False).head(5))
index | clean_text | text | sentiment | pred | happiness | sadness | bi_pred | bi_happiness | bi_sadness
31342 | latest chronically awesome reader thanks healthcare parkinsons | The latest The Chronically Awesome Reader! https://t.co/CcU7YhoAQA Thanks to @IsobelKnight2 #healthcare #parkinsons | ? | happy | 0.898682 | 0.101318 | happy | 0.899134 | 0.100866
32037 | hi andrew thanks support working parent thrilled taxes support affordable childcare well healthcare ways society cares seniors | @oceansidewebtv @albertaNDP @RachelNotley Hi Andrew, thanks for your support. As a working parent, I am thrilled for my taxes support more affordable childcare - as well as healthcare and other ways our society cares for seniors. | ? | happy | 0.887219 | 0.112781 | happy | 0.886858 | 0.113142
31983 | thanks she’s going fine minor injury thankfully super glad live rad country public healthcare | @ScottCrockatt Thanks! She’s going to be fine, just a minor injury thankfully. Super glad we live in such a rad country with public healthcare! | ? | happy | 0.883968 | 0.116032 | happy | 0.877798 | 0.122202
31761 | thanks jeff looking forward | Thanks Jeff. Looking forward to it. https://t.co/6CJdmgzcgL | ? | happy | 0.873127 | 0.126873 | happy | 0.870848 | 0.129152
30498 | follow laws country congress work together come better healthcare mind ur country thanks | @SatchofBridgend @PittMom16 @thehill It is. But we follow laws in this country. Congress can work together to come up with better healthcare. Until mind ur own country thanks | ? | happy | 0.831058 | 0.168942 | happy | 0.842282 | 0.157718

index | clean_text | text | sentiment | pred | happiness | sadness | bi_pred | bi_happiness | bi_sadness
31057 | sad hear jeff work healthcare amp breaks heart desperately need medicare | @JeffHisDudeness So sad to hear this Jeff. I work in healthcare &amp; this just breaks my heart. We desperately need Medicare for ALL. | ? | sad | 0.092471 | 0.907529 | sad | 0.080971 | 0.919029
31682 | sad thing education healthcare jobs poor sods want jumlaman blogman mercenaries think otherwise | @tavleen_singh And the sad thing is education healthcare jobs is what the poor sods want, but JumlaMan, BlogMan and their mercenaries think otherwise. https://t.co/vls1v4vMLv | ? | sad | 0.091702 | 0.908298 | sad | 0.091563 | 0.908437
30307 | miserable human must someone want access healthcare ripped away sick people sad | @RickLangel @thehill How miserable of a human being must someone be to want access to healthcare to be ripped away from sick people? Sad. | ? | sad | 0.135246 | 0.864754 | sad | 0.155372 | 0.844628
31075 | miss vote healthcare bill | @dccc Did I miss the vote on @RepJayapal's healthcare bill? | ? | sad | 0.160904 | 0.839096 | sad | 0.158897 | 0.841103
30509 | american healthcare system damn messed neglected it’s sad | the American healthcare system is so damn messed up and neglected it’s sad | ? | sad | 0.164510 | 0.835490 | sad | 0.164923 | 0.835077

While examining the happiest and saddest posts as scored by the unigram and bigram models, I was pleasantly surprised to find that the distinctly negative tweet that had been labeled "happy" by the unigram classifier, #32477 ("@MmeScience @sahilkapur As someone with a pre-existing condition who needs health insurance to survive I'd rather not have my healthcare fucked up so "the good guys" can win political points, thanks"), was not among the bigram model's five happiest tweets. Although the accuracy was essentially the same for the labeled dataset, there's a chance the performance still improved for a topic like healthcare, where many double negatives might occur.

Conclusion

Overall, I admit that I'm disappointed with my model. With more time or resources, I would consider collecting a much larger dataset, or including some or all stopwords. Still, I believe there is some merit to this model in that it generally makes good predictions for very happy or very sad tweets. Given that I streamed my data over a very limited time period, it's possible that world events, pop culture, or the Twitter community were influencing user behavior in such a way that much of the data was difficult to classify. It might also be interesting to look into a threshold for the "happy" or "sad" class. For example, if a predicted happy "score" is 50.0001%, it might be better labeled as neutral rather than happy.
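The threshold idea could be sketched as follows. This is only an illustration of the suggestion, not something I evaluated: the function name `label_with_neutral` and the default `margin` of 0.1 are arbitrary assumptions, and a sensible margin would need to be tuned on labeled data.

```python
def label_with_neutral(p_happy, margin=0.1):
    """Map a happy probability to 'happy', 'sad', or 'neutral'.

    Scores within `margin` of 0.5 are treated as too uncertain to call.
    """
    if p_happy >= 0.5 + margin:
        return "happy"
    if p_happy <= 0.5 - margin:
        return "sad"
    return "neutral"


# A barely-happy score and the two most extreme scores reported above.
for score in (0.500001, 0.898682, 0.091702):
    print(score, label_with_neutral(score))
```

With this rule, only tweets the model is reasonably sure about receive a hard label, which matches the observation that its most confident predictions are its most reliable ones.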