import pandas as pd
import numpy as np
I collected and cleaned my data on my local machine, then uploaded the JSON files to Dropbox. While collecting my labeled tweet dataset, I confirmed that a given tweet contained only ":)" or ":(", but not both. In addition, I stripped out usernames, links, punctuation, and stopwords. I also skipped retweets so that a single tweet by a popular individual would not be given greater weight than any other tweet. I preserved the original text in the "text" field and stored the cleaned text in the "clean_text" field. I also assigned each tweet a sentiment based on whether it included ":)", ":(", or neither - "happy," "sad," or "?," respectively. The last sentiment category only came into play while I was collecting my keyword-based dataset, for which I included tweets containing the keyword "healthcare." I did not deliberately collect tweets with either emoticon in them for that dataset, but I did allow them. Below is a code sample showing how I gathered my "happy" dataset. For all of the datasets, I avoided duplicating tweets by checking that each new streamed tweet's ID did not match any of the previous 100 IDs.
Overall, I collected 32,500 tweets - 15k "happy" tweets, 15k "sad" tweets, and 2.5k "healthcare" tweets.
import tweepy
import json, string
import re
from nltk.corpus import stopwords
class MyStreamListener(tweepy.StreamListener):
    def __init__(self, stop_at=15000):
        super(MyStreamListener, self).__init__()
        self.tweet_counter = 0
        self.datalist = []
        self.stop_at = stop_at
        self.file_entries = 0
        self.checkpoint = 100
        self.file_cap = 5000
        self.ids = [0] * 100
        self.id_exists = False
        self.stops = stopwords.words("english")

    def on_status(self, status):
        # Print progress every 100 tweets and write a file every 5,000 entries,
        # so if the connection drops we still keep part of the stream
        if self.tweet_counter % self.checkpoint == 0:
            print(self.tweet_counter)
        if len(self.datalist) >= self.file_cap:
            print('file created')
            with open('nsmile_tweets_' + str(self.file_entries) + '.json', 'w') as outfile:
                json.dump(self.datalist, outfile)
            self.file_entries += 1
            self.datalist.clear()
        if self.tweet_counter < self.stop_at:
            tweetj = status._json
            self.id_exists = tweetj['id'] in self.ids
            # Skip retweets and tweets we've already seen
            if 'retweeted_status' not in tweetj and not self.id_exists:
                if tweetj['lang'] == 'en':
                    if 'extended_tweet' in tweetj:
                        text = tweetj['extended_tweet']['full_text']  # Make sure to get the full tweet text
                    else:
                        text = tweetj['text']
                    tweetj['text'] = text
                    # ----------- Set invalids -----------
                    if ':)' in text:  # Only keep tweets containing ':)'...
                        if ':(' not in text:  # ...and not also ':('
                            text = text.lower()  # Make everything lowercase
                            text = re.sub(r'http://\S+|https://\S+', '', text)  # Remove links
                            text = re.sub(r'@[^\s]+', '', text)  # Remove usernames
                            text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
                            text = ' '.join([word for word in text.split() if word not in self.stops])  # Remove stopwords
                            tweetj['clean_text'] = text
                            # ----------- Set sentiment start -----------
                            tweetj['sentiment'] = 'happy'
                            # ----------- Set sentiment end -----------
                            self.datalist.append(tweetj)
                            self.tweet_counter += 1
                            self.ids.pop(0)  # Drop the oldest ID so the list tracks the 100 most recent
                            self.ids.append(tweetj['id'])
            return True
        else:
            return False
consumer_token = 'xxx'
consumer_secret = 'xxx'
access_token = 'xxx'
access_secret = 'xxx'
auth = tweepy.OAuthHandler(consumer_token, consumer_secret)
auth.set_access_token(access_token, access_secret)
listener = MyStreamListener(stop_at=15000)
streaming_api = tweepy.Stream(auth, listener, timeout=60, tweet_mode='extended')
streaming_api.filter(track=[':)'])
To prepare my data for use as a training dataset, I concatenated all of the data I'd collected into a single pandas DataFrame. I included only the text and sentiment fields, since I was interested only in text analysis.
sm1_df = pd.read_json('https://www.dropbox.com/s/gp1nuytnu42mbm1/nsmile_tweets_0.json?dl=1')[['clean_text', 'text', 'sentiment']]
sm2_df = pd.read_json('https://www.dropbox.com/s/1qh2hxgpzpz530q/nsmile_tweets_1.json?dl=1')[['clean_text', 'text', 'sentiment']]
sm3_df = pd.read_json('https://www.dropbox.com/s/dnnbi6p7kswqjqj/nsmile_tweets_2.json?dl=1')[['clean_text', 'text', 'sentiment']]
fr1_df = pd.read_json('https://www.dropbox.com/s/donahmoq4uhybme/nfrown_tweets_0.json?dl=1')[['clean_text', 'text', 'sentiment']]
fr2_df = pd.read_json('https://www.dropbox.com/s/e885oryqk4q3czx/nfrown_tweets_1.json?dl=1')[['clean_text', 'text', 'sentiment']]
fr3_df = pd.read_json('https://www.dropbox.com/s/0oycgnwbao086d4/nfrown_tweets_2.json?dl=1')[['clean_text', 'text', 'sentiment']]
misc_sm_df = pd.read_json('https://www.dropbox.com/s/apxxw40v3jo3532/nmisc_smile_tweets.json?dl=1')[['clean_text', 'text', 'sentiment']]
misc_fr_df = pd.read_json('https://www.dropbox.com/s/uqsw3zf2e3y1tl3/nmisc_frown_tweets.json?dl=1')[['clean_text', 'text', 'sentiment']]
misc_qu_df = pd.read_json('https://www.dropbox.com/s/s6pz9q94uqhcd1m/nmisc_quest_tweets.json?dl=1')[['clean_text', 'text', 'sentiment']]
tweet_df = pd.concat([sm1_df, sm2_df, sm3_df, fr1_df, fr2_df, fr3_df, misc_sm_df, misc_fr_df, misc_qu_df], ignore_index=True).reset_index(drop=True)
tweet_df.head()
After concatenating the data, I constructed a Tf-Idf matrix over the entire combined dataset.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
count_vect = CountVectorizer()
counts = count_vect.fit_transform(tweet_df['clean_text'])
tfidf_transformer = TfidfTransformer()
full_tfidf = tfidf_transformer.fit_transform(counts)
display(full_tfidf)
Since I reset the indices of my DataFrame, the matrix row indices correspond to the index of my DataFrame. Each row of the matrix represents a tweet, with each column representing a one-word "feature" from the full vocabulary; the stored elements in a row are the features that tweet actually contains. Below I print information about the first three rows of the matrix, i.e., the first three tweets in the DataFrame shown above. If you compare the "clean_text" field to the number of stored elements, you can see that the number of distinct words in the former equals the number of stored elements in the latter.
display(full_tfidf[0,:])
display(full_tfidf[1,:])
display(full_tfidf[2,:])
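One quick way to verify this is to compare the number of distinct words in each tweet's "clean_text" against the number of stored elements in the corresponding matrix row. (A small check along these lines; note that CountVectorizer's default tokenizer drops single-character tokens, so the two counts can differ slightly for some tweets.)
# Compare distinct word counts in "clean_text" with stored elements per matrix row.
# Caveat: the default CountVectorizer tokenizer ignores single-character tokens,
# so the two numbers may differ slightly for tweets containing such tokens.
for i in range(3):
    n_words = len(set(tweet_df['clean_text'].iloc[i].split()))
    print(i, n_words, full_tfidf[i, :].getnnz())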
However, although I want my Tf-Idf matrix to include features from all of the data, I want to train my classifier only on the labeled dataset and then apply it to the unlabeled data. Therefore, I needed to break the matrix up into two subsets - labeled and unlabeled. I then split the labeled subset into training and test data.
tweet_df_labeled = tweet_df[tweet_df['sentiment']!='?']
tweet_df_unlabeled = tweet_df[tweet_df['sentiment']=='?'].copy() # denote rows to remove
mask_indices = list(tweet_df_unlabeled.index) # array position indices
mask = np.ones(full_tfidf.shape[0], dtype=bool)
mask[mask_indices] = False
labeled_tfidf = full_tfidf[mask]
labeled_tfidf
(labeled_tfidf_train, labeled_tfidf_test, labeled_train, labeled_test) = train_test_split(labeled_tfidf,
tweet_df_labeled['sentiment'],
test_size=0.2, random_state=1)
Using the training data, I fit a Gradient Boosting Classifier and produced the confusion matrix for the test data. My classifier achieved an accuracy of approximately 68%. This is a relatively low score, so in the next section I attempt to improve it by including bigrams as features.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
gbc = GradientBoostingClassifier().fit(labeled_tfidf_train, labeled_train)
labeled_pred = gbc.predict(labeled_tfidf_test)
c_matrix = confusion_matrix(labeled_test, labeled_pred)
display(c_matrix)
display((c_matrix[0,0]+c_matrix[1,1])/(c_matrix[1,0]+c_matrix[0,1]+c_matrix[0,0]+c_matrix[1,1]))
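The same number can be obtained directly with scikit-learn's accuracy_score, as a cross-check on the manual computation from the confusion matrix:
from sklearn.metrics import accuracy_score
# Cross-check the accuracy computed from the confusion matrix above
display(accuracy_score(labeled_test, labeled_pred))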
I was also interested in seeing the most "important" features in determining sentiment. Logically, words like "miss," "sad," and "thanks," which generally have a clear positive/negative skew, are high on the list.
top_indices = gbc.feature_importances_.argsort()[::-1]
names = count_vect.get_feature_names()
print([names[i] for i in top_indices[:30]])
Next, I examined predictions for the unlabeled dataset. To do this, I subset the Tf-Idf matrix again, but this time kept only the unlabeled rows. In addition to the hard "happy"/"sad" predictions, I produced a probability or "score" of happiness/sadness for each tweet and appended these to my tweet DataFrame.
n_mask_indices = list(tweet_df_labeled.index) # Recall that these are rows that are "happy" or "sad"
n_mask = np.ones(full_tfidf.shape[0], dtype=bool)
n_mask[n_mask_indices] = False
unlabeled_tfidf = full_tfidf[n_mask]
unlabeled_pred = list(gbc.predict(unlabeled_tfidf))
unlabeled_pred_probs = gbc.predict_proba(unlabeled_tfidf)
happy_probs = unlabeled_pred_probs[:,0]
sad_probs = unlabeled_pred_probs[:,1]
pd.set_option('display.max_colwidth', -1)
tweet_df_unlabeled['pred'] = unlabeled_pred
tweet_df_unlabeled['happiness'] = happy_probs
tweet_df_unlabeled['sadness'] = sad_probs
display(tweet_df_unlabeled.sort_values(by='happiness', ascending=False).head())
display(tweet_df_unlabeled.sort_values(by='sadness', ascending=False).head())
Although my model doesn't perform very well in general, it does appear to have correctly predicted the sentiment of the happiest and saddest posts, for the most part. Tweet #32477 is clearly negative; it's likely that words like "thanks" and "good," which are generally positive, skewed its prediction toward positive.
My last classifier used only unigrams as features and was obviously not terribly accurate. To improve the classifier, I introduced bigrams as well. This allows phrases such as "not good" to be considered collectively negative, even if "not" and "good" might not be negative on their own. The workflow was largely the same as for the unigram classifier: I produced a Tf-Idf matrix for the data, but this time included both unigrams and bigrams as features. I then subset the labeled dataset, trained the model on it, and tested the model on the held-out labeled test data and the unlabeled data.
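As a quick illustration on a made-up sentence (not from my dataset), setting ngram_range=(1, 2) makes the vectorizer emit "not good" as its own feature alongside the individual words:
# Toy example: with ngram_range=(1, 2), "not good" becomes a single feature
toy_vect = CountVectorizer(ngram_range=(1, 2))
toy_vect.fit(['not good at all'])
print(toy_vect.get_feature_names())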
bi_count_vect = CountVectorizer(ngram_range=(1,2))
bi_counts = bi_count_vect.fit_transform(tweet_df['clean_text'])
bi_tfidf_transformer = TfidfTransformer()
bi_full_tfidf = bi_tfidf_transformer.fit_transform(bi_counts)
bi_mask = np.ones(bi_full_tfidf.shape[0], dtype=bool)
bi_mask[mask_indices] = False
bi_labeled_tfidf_cl = bi_full_tfidf[bi_mask]
(bi_labeled_tfidf_train, bi_labeled_tfidf_test, bi_labeled_train, bi_labeled_test) = train_test_split(bi_labeled_tfidf_cl,
tweet_df_labeled['sentiment'],
test_size=0.2, random_state=1)
bi_gbc = GradientBoostingClassifier().fit(bi_labeled_tfidf_train, bi_labeled_train)
bi_labeled_pred = bi_gbc.predict(bi_labeled_tfidf_test)
I show the confusion matrix for this model below. I was disappointed to discover that the model's accuracy was nearly the same. Moreover, when I displayed the most important features, I discovered that they were all unigram features. This could potentially be a negative effect of having removed stopwords at the start; with further work, I would consider leaving some or all stopwords in.
bi_c_matrix = confusion_matrix(bi_labeled_test, bi_labeled_pred)
display(bi_c_matrix)
display((bi_c_matrix[0,0]+bi_c_matrix[1,1])/(bi_c_matrix[0,0]+bi_c_matrix[1,1]+bi_c_matrix[0,1]+bi_c_matrix[1,0]))
bi_top_indexes = bi_gbc.feature_importances_.argsort()[::-1]
bi_names = bi_count_vect.get_feature_names()
print([bi_names[i] for i in bi_top_indexes[:30]])
bi_unlabeled_mask = np.ones(bi_full_tfidf.shape[0], dtype=bool)
bi_unlabeled_mask[n_mask_indices] = False
bi_unlabeled_tfidf = bi_full_tfidf[bi_unlabeled_mask]
bi_unlabeled_pred = list(bi_gbc.predict(bi_unlabeled_tfidf))
bi_unlabeled_pred_probs = bi_gbc.predict_proba(bi_unlabeled_tfidf)
bi_happy_probs = bi_unlabeled_pred_probs[:,0]
bi_sad_probs = bi_unlabeled_pred_probs[:,1]
pd.set_option('display.max_colwidth', -1)
tweet_df_unlabeled['bi_pred'] = bi_unlabeled_pred
tweet_df_unlabeled['bi_happiness'] = bi_happy_probs
tweet_df_unlabeled['bi_sadness'] = bi_sad_probs
display(tweet_df_unlabeled.sort_values(by='bi_happiness', ascending=False).head(5))
display(tweet_df_unlabeled.sort_values(by='bi_sadness', ascending=False).head(5))
Examining the happiest and saddest posts as scored by the unigram-and-bigram model, I was pleasantly surprised to find that the distinctly negative tweet the unigram classifier had labeled "happy," #31477 ("@MmeScience @sahilkapur As someone with a pre-existing condition who needs health insurance to survive I'd rather not have my healthcare fucked up so "the good guys" can win political points, thanks"), no longer appeared. Although the accuracy was essentially the same on the labeled dataset, there's a chance performance still improved for a topic like healthcare, where many double negatives might occur.
Overall, I would admit that I'm disappointed with my model. With more time or resources, I would consider collecting a much larger dataset or including some or all stopwords. Still, I believe there is some merit to this model, in that it generally makes good predictions for very happy or very sad tweets. Given that I streamed my data over a very limited time period, it's possible that world events, pop culture, or the Twitter community were influencing user behavior in such a way that much of the data was difficult to classify. It might also be interesting to look into a threshold for the "happy" or "sad" class - for example, a tweet whose predicted "happy" probability is just barely above 0.5 might be better labeled as neutral rather than happy.
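As a rough sketch of that last idea (the 0.45/0.55 cutoffs below are arbitrary, purely for illustration), tweets whose predicted probabilities sit near 0.5 could be relabeled as neutral:
# Sketch of a neutral band around 0.5; the cutoffs here are arbitrary illustrations
def label_with_threshold(happy_prob, lower=0.45, upper=0.55):
    if happy_prob >= upper:
        return 'happy'
    elif happy_prob <= lower:
        return 'sad'
    return 'neutral'

tweet_df_unlabeled['thresholded_pred'] = tweet_df_unlabeled['bi_happiness'].apply(label_with_threshold)
display(tweet_df_unlabeled['thresholded_pred'].value_counts())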