Quantbase
5 min read · Jun 16, 2021


What is Sentiment Analysis?

In short, it’s figuring out how positive or negative a snippet of text is. For example, when I tell another human “I really liked dinner last night!”, chances are I’m conveying positive sentiment about a certain experience. If you’ve heard of something like the 7–38–55 rule, you know it’s not always that easy, though: the rule states that in verbal communication, only 7% of meaning is conveyed through the actual words spoken, 38% through tone of voice, and 55% through body language (sarcasm, for example, is hard to convey through text).

Any bot that has to judge sentiment from the text alone, then, really has its work cut out for it. Luckily, there’s VADER sentiment analysis.

How does it work? A closer look.

VADER (Valence Aware Dictionary for Sentiment Reasoning) is a model for figuring out sentiment by looking at the positive/negative sense (polarity) of a text’s emotion, as well as its strength or intensity (LOVED vs. liked). It works through a dictionary that maps words, phrases, and emojis to emotional polarities and intensities, then sums these up to produce an overall sentiment score. VADER is on the simpler end, summing up the overall intensity of phrases; some models, packages, and strategies take this further with more complex calculations on phrases, taking cultural vernacular and the like into account.
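To make the dictionary-and-sum idea concrete, here’s a toy sketch of the mechanism. The valence values and the capitalization multiplier below are invented for illustration; the real VADER lexicon and its boosting rules are more involved.

```python
# Toy illustration of VADER's core idea: look each word up in a valence
# dictionary, boost intensity for ALL-CAPS words, and sum the result.
# These valence values are made up for the example, not VADER's real lexicon.
TOY_LEXICON = {
    "liked": 1.5,
    "loved": 2.75,   # stronger valence than "liked"
    "hated": -2.5,
}

def toy_sentiment(text):
    """Sum per-word valences, boosting all-caps words for emphasis."""
    score = 0.0
    for token in text.split():
        word = token.strip("!.,?")
        valence = TOY_LEXICON.get(word.lower(), 0.0)
        if word.isupper() and len(word) > 1:
            valence *= 1.25  # toy stand-in for VADER's capitalization boost
        score += valence
    return score

print(toy_sentiment("I liked dinner!"))   # 1.5
print(toy_sentiment("I LOVED dinner!"))   # 3.4375 (capitalization boosted)
```

With the real library, `SentimentIntensityAnalyzer().polarity_scores(text)` returns a dict of neg/neu/pos proportions plus a normalized `compound` score in [-1, 1], which is what the strategy below thresholds.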

The strategy we’re looking at today targets the WallStreetBets community on Reddit. It takes as input a number of tickers to analyze (we scrape tickers from recent posts before running the algorithm), then goes through posts and comments, running the VADER process on each one and aggregating a total sentiment score per ticker, weighted by upvotes. The idea here is that if a highly positive-sentiment comment about a ticker is getting a lot of upvotes, that comment is likely indicative of the sentiment of the community at large; if it isn’t getting upvoted, it may only voice the opinions of a minority.
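That weighting scheme can be sketched in a few lines. The comment scores and upvote counts here are made-up sample data; in the real pipeline the scores come from VADER.

```python
def weighted_sentiment(comments):
    """Upvote-weighted average of per-comment sentiment scores."""
    total_upvotes = sum(upvotes for _, upvotes in comments)
    if total_upvotes == 0:
        return 0.0
    return sum(score * upvotes for score, upvotes in comments) / total_upvotes

# (sentiment score, upvotes): the heavily upvoted positive comment dominates,
# while the negative comment with few upvotes barely moves the average.
gme_comments = [(0.9, 900), (0.2, 50), (-0.5, 50)]
print(weighted_sentiment(gme_comments))  # 0.795
```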

Backtesting and Results

Backtesting is the process of running our algorithm over a historical period, using past price movement, to determine how well the strategy would have performed. There are a number of ways you can do this yourself; consider this comprehensive list of backtesting software. The algorithm we built is backtested over the period from 3/2/2021 to 6/18/2021 (so we can benchmark it against the BUZZ social sentiment ETF, which launched in early March), and here are the results we achieved over that time period with this algorithm, accounting for slippage:

  • Annualized Return: 172.65% (this unrealistic number largely comes from GME and AMC’s meteoric rise in the time period we’ve backtested)
  • Max Drawdown (the largest decline from a peak to a subsequent trough): -9.1%
  • Sharpe Ratio (expected return in excess of the risk-free rate, divided by the standard deviation of returns): 2.74
  • Profit/Loss Ratio (average profit on winning investments divided by average loss on losing investments): 2.77
  • Average Win (when an investment is sold at a gain, the average of that gain): 9.11%
  • Average Loss: -3.29%
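For reference, here is roughly how two of those metrics are computed from a daily return series. The returns below are made-up sample data, not our backtest, and the 252-trading-day annualization is a common convention rather than anything specific to this strategy.

```python
import numpy as np

daily_returns = np.array([0.01, -0.02, 0.015, 0.03, -0.005, 0.02])
risk_free_daily = 0.0  # assume a ~0% risk-free rate for simplicity

# Sharpe ratio: mean excess return over its standard deviation,
# annualized with ~252 trading days.
sharpe = ((daily_returns.mean() - risk_free_daily)
          / daily_returns.std(ddof=1)) * np.sqrt(252)

# Max drawdown: worst peak-to-trough decline of the equity curve.
equity = np.cumprod(1 + daily_returns)
running_peak = np.maximum.accumulate(equity)
max_drawdown = ((equity - running_peak) / running_peak).min()

print(round(sharpe, 2), round(max_drawdown, 4))
```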

Why Reddit?

We’ve seen a massive increase in the interest in fast-paced stocks over just the last six months — it’s what made us realize that something like Quantbase needs to exist! Communities like WallStreetBets have more than 10x’d in size since January, and most communities related to algorithmic, crypto, and leveraged trading have done the same, some as much as 20x. Obviously, there’s a lot of interest in beating the market here by actively trading, and people are putting a lot of effort and time into discussing assets that they believe in. With our sentiment analyzing indices, we’re able to capitalize on that immediately.

Reddit WallStreetBets

We’re homing in on WallStreetBets because that’s where a lot of the active, engaged discussion about individual tickers happens, and it’s a large community. There are others we’re looking into, but WSB is one of the most exciting communities to gauge the sentiment of, and it’s one I’ve personally been a part of since 2015.

Scraping

There are thousands of tickers listed on US exchanges alone. To automate the sentiment analysis, we can’t input these tickers individually or manually; that would be tedious, and it would make the codebase incredibly annoying to look at and edit. The quick and easy solution I used was Selenium to scrape the tickers from a website that lists US stocks filtered by market cap, then put those in a dictionary for the sentiment analysis to act on. I then used PRAW, the Python Reddit API wrapper, to pull text from posts and comments and run the VADER sentiment algorithm on it. Outputting the resulting sentiment score per ticker lets us pick the top 15 stocks by weekly sentiment score and send them to our order executor, which rebalances our portfolio based on the weighted sentiment scores of these top picks.
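The final selection step described above can be sketched like this. The tickers and scores are made-up sample data, and the column names are illustrative rather than our production schema.

```python
import pandas as pd

# Hypothetical weekly sentiment scores, as produced by the aggregation step.
scores = pd.DataFrame({
    "ticker": ["GME", "AMC", "PLTR", "TSLA", "BB"],
    "weekly_sentiment": [0.81, 0.74, 0.32, 0.55, 0.18],
})

TOP_N = 3  # the live strategy uses the top 15
top = scores.nlargest(TOP_N, "weekly_sentiment").copy()

# Portfolio weights proportional to each pick's sentiment score.
top["weight"] = top["weekly_sentiment"] / top["weekly_sentiment"].sum()
print(top[["ticker", "weight"]])
```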

Source Code

#!/usr/bin/env python
# coding: utf-8

import datetime as dt

import nltk
import pandas as pd
import praw
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

nltk.download('vader_lexicon')
nltk.download('stopwords')

# To use this, make a Reddit app. The client ID is in the top-left corner of
# the app page, the client secret is given, and the user agent is the username
# that the app is under.
reddit = praw.Reddit(client_id='*********',
                     client_secret='******************',
                     user_agent='*********')

# For example purposes. To use this as a live trading tool, you'd want to
# populate this with tickers that have been mentioned on the pertinent
# community (WSB in our case) in a specified period.
stocks = ["GME", "AMC"]


def commentSentiment(ticker, urlT):
    """Average VADER sentiment label (-1, 0, or 1) over a post's comments."""
    try:
        submission = reddit.submission(url=urlT)
        subComments = submission.comments
    except Exception:
        return 0

    bodyComment = []
    for comment in subComments:
        try:
            bodyComment.append(comment.body)
        except AttributeError:  # e.g. a "load more comments" placeholder
            return 0

    sia = SIA()
    results = []
    for line in bodyComment:
        scores = sia.polarity_scores(line)
        scores['headline'] = line
        results.append(scores)

    df = pd.DataFrame.from_records(results)
    if df.empty:
        return 0

    # Label each comment +1/-1/0 based on VADER's compound score.
    df['label'] = 0
    df.loc[df['compound'] > 0.1, 'label'] = 1
    df.loc[df['compound'] < -0.1, 'label'] = -1

    # Average label across all comments.
    return df['label'].mean()


def latestComment(ticker, urlT):
    """Return the created_utc timestamp of the newest comment on a post."""
    try:
        submission = reddit.submission(url=urlT)
        subComments = submission.comments
    except Exception:
        return 0

    updateDates = []
    for comment in subComments:
        try:
            updateDates.append(comment.created_utc)
        except AttributeError:
            return 0

    if not updateDates:
        return 0
    updateDates.sort()
    return updateDates[-1]


def get_date(date):
    return dt.datetime.fromtimestamp(date)


submission_statistics = []
for ticker in stocks:
    for submission in reddit.subreddit('wallstreetbets').search(ticker, limit=130):
        # Only keep self-posts on WSB itself.
        if submission.domain != "self.wallstreetbets":
            continue
        d = {}
        d['ticker'] = ticker
        d['num_comments'] = submission.num_comments
        d['comment_sentiment_average'] = commentSentiment(ticker, submission.url)
        if d['comment_sentiment_average'] == 0:
            continue
        d['latest_comment_date'] = latestComment(ticker, submission.url)
        d['score'] = submission.score
        d['upvote_ratio'] = submission.upvote_ratio
        d['date'] = submission.created_utc
        d['domain'] = submission.domain
        d['num_crossposts'] = submission.num_crossposts
        d['author'] = submission.author
        submission_statistics.append(d)

dfSentimentStocks = pd.DataFrame(submission_statistics)
dfSentimentStocks = dfSentimentStocks.assign(
    timestamp=dfSentimentStocks["date"].apply(get_date),
    commentdate=dfSentimentStocks["latest_comment_date"].apply(get_date))
dfSentimentStocks.sort_values("latest_comment_date", ascending=True,
                              inplace=True, na_position='last')

# Quick look at the most active authors, then export the results.
dfSentimentStocks.author.value_counts()
dfSentimentStocks.to_csv('Reddit_Sentiment_Equity.csv', index=False)

How can I use this?

Quantbase! It’s incredibly easy: just pick how much you want to deposit into this strategy, and the algorithm (the one you just saw above; no secrets here) does the rest. It’s just as easy as investing in an ETF, or even easier.
