Sentiment Analysis and Word Cloud in RStudio

Maung Agus Sutikno
4 min read · Aug 6, 2022


A technical article behind Citayam Fashion Week: a sentiment analysis that moves from hypothesis building through analysis to conclusions. Hopefully, this write-up helps not only with learning the R programming language, but also with applying it to Twitter sentiment analysis.

Photo by NEXT Academy on Unsplash

Installing the required packages is the first step we need to take. The packages are related to Twitter API connections, data visualization, and sentiment analysis itself.

install.packages("twitteR")
install.packages("ROAuth")
install.packages("plyr")
install.packages("dplyr")
install.packages("stringr")
install.packages("ggplot2")
install.packages("httr")
install.packages("wordcloud")
install.packages("sentiment")
install.packages("RCurl")
install.packages("syuzhet")
install.packages("rugarch")
install.packages("parallel")

Loading the libraries for the installed packages.

library(twitteR)
library(ROAuth)
library(plyr)
library(dplyr)
library(stringr)
library(ggplot2)
library(httr)
library(wordcloud)
# library(sentiment) is called after installing it from the CRAN archive below
library(RCurl)
library(syuzhet)
library(rugarch)
library(parallel)

Installing the sentiment package and its dependencies. Because sentiment and Rstem have been removed from CRAN, they are installed from the CRAN source archive.

install.packages("tm", dependencies = TRUE)
install.packages("ftp://cran.r-project.org/pub/R/src/contrib/Archive/Rstem_0.4-1.tar.gz", repos=NULL,
type="source", dependencies = TRUE)
install.packages("https://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.2.tar.gz", repos = NULL,
type = "source", dependencies = TRUE)

Performing a handshake with the Twitter API.

oauth_endpoint(authorize = "https://api.twitter.com/oauth",
               access = "https://api.twitter.com/oauth/access_token")

Connecting to Twitter API.

download.file(url = 'http://curl.haxx.se/ca/cacert.pem', destfile = 'cacert.pem')
reqURL <- 'https://api.twitter.com/oauth/request_token'
accessURL <- 'https://api.twitter.com/oauth/access_token'
authURL <- 'https://api.twitter.com/oauth/authorize'

Submitting the Twitter API credentials. If these credentials are not available yet, we can apply for them on the Twitter Developer Portal.

consumerKey = "Xxxxxxxxxxxxxx"
consumerSecret = "Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
accessToken = "8xxxxx-Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
accessSecret = "Xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
Cred <- OAuthFactory$new(consumerKey = consumerKey,
                         consumerSecret = consumerSecret,
                         requestURL = reqURL,
                         accessURL = accessURL,
                         authURL = authURL)

The next step is Twitter's dynamic PIN authorization. Once we have run the code for the first time and saved the credentials, future sessions can start from this line, with the libraries already loaded.

save(Cred, file = 'twitter authentication.Rdata')
load('twitter authentication.Rdata')
setup_twitter_oauth(consumer_key = consumerKey, consumer_secret = consumerSecret, access_token = accessToken, access_secret = accessSecret)
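
As a quick, optional sanity check (not part of the original flow), a small test search should confirm that the authentication worked, assuming the credentials are valid:

# a minimal connectivity check; returns one tweet if authentication succeeded
test_tweet <- searchTwitter("hello", n = 1)
length(test_tweet)  # should print 1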

Harvesting some tweets by providing the keyword the tweets must contain, the number of tweets, the start date, and the language.

some_tweets = searchTwitter("citayam fashion week",
                            n = 5000,
                            since = "2022-06-29",
                            lang = "en")

Exploring the result to see how many tweets have been downloaded.

length.some_tweets <- length(some_tweets)
length.some_tweets
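
Each element of the list is a twitteR status object, so the raw text can be inspected with getText() before any cleaning; for example:

# inspect the text of the first downloaded tweet
some_tweets[[1]]$getText()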

Saving the collected tweets data frame as a CSV file.

some_tweets.df <- ldply(some_tweets, function(t) t$toDataFrame())
write.csv(some_tweets.df, "CFW_tweets.csv")
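
twitteR also ships a convenience helper, twListToDF(), that builds the same data frame in one call; either approach works:

# equivalent one-liner using twitteR's built-in converter
some_tweets.df <- twListToDF(some_tweets)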

Getting the text ready for the cleaning process.

some_txt = sapply(some_tweets, function(x) x$getText())

Cleaning 1: removing retweet markers ("RT"/"via") together with the user mentions attached to them.

some_txt1 = gsub("(RT|via)((?:\\b\\W*@\\w+)+)","",some_txt)

Cleaning 2: removing HTTP links.

some_txt2 = gsub("http[^[:blank:]]+", "", some_txt1)

Cleaning 3: removing @username mentions.

some_txt3 = gsub("@\\w+", "", some_txt2)

Cleaning 4: removing punctuation.

some_txt4 = gsub("[[:punct:]]", " ", some_txt3)

Cleaning 5: removing any remaining non-alphanumeric characters.

some_txt5 = gsub("[^[:alnum:]]", " ", some_txt4)

Exporting to a CSV file after the data is cleaned. The text should now look much cleaner, containing only plain words.

write.csv(some_txt5, "CFW_tweets_clean.csv")
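
For reuse, the five cleaning steps can be wrapped into a single helper function. This is a sketch of the same pipeline, using the same regular expressions as above:

# helper that applies all five cleaning steps in order
clean_tweets <- function(txt) {
  txt <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", txt)  # retweet markers and mentions
  txt <- gsub("http[^[:blank:]]+", "", txt)            # links
  txt <- gsub("@\\w+", "", txt)                        # remaining mentions
  txt <- gsub("[[:punct:]]", " ", txt)                 # punctuation
  txt <- gsub("[^[:alnum:]]", " ", txt)                # non-alphanumerics
  txt
}

some_txt5 <- clean_tweets(some_txt)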

Creating a word corpus. Here, we will do a cleaning process again for the corpus by removing punctuation and numbers.

install.packages("tm")
library(tm)

some_txt6 <- Corpus(VectorSource(some_txt5))
some_txt6 <- tm_map(some_txt6, removePunctuation)
some_txt6 <- tm_map(some_txt6, removeNumbers)

At this point, the code still works even though a warning appears; the warning is not an error. It only shows up when we build a corpus from a VectorSource using Corpus instead of VCorpus: the underlying code checks whether the number of document names matches the length of the corpus content, and since a plain character vector carries no document names, the warning pops up.
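
If the warning is bothersome, the corpus can instead be built with VCorpus, which behaves the same for our purposes:

# alternative corpus construction that avoids the warning
some_txt6 <- VCorpus(VectorSource(some_txt5))
some_txt6 <- tm_map(some_txt6, removePunctuation)
some_txt6 <- tm_map(some_txt6, removeNumbers)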

some_txt6 <- tm_map(some_txt6, content_transformer(tolower))

Removing stop words.

some_txt6 <- tm_map(some_txt6, removeWords, stopwords("english"))
some_txt6 <- tm_map(some_txt6, stripWhitespace)
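
Since every tweet matched the search keyword, those words will dominate the cloud. An optional extra step (not in the original flow) is to drop the search terms themselves as custom stop words:

# optionally remove the search keywords so they don't dominate the cloud
some_txt6 <- tm_map(some_txt6, removeWords, c("citayam", "fashion", "week"))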

Creating the word cloud, preparing the color palette first. After we run the code, the word cloud should be displayed.

install.packages("RColorBrewer")
library(RColorBrewer)

pal <- brewer.pal(8,"Dark2")

wordcloud(some_txt6, min.freq = 50, max.words = Inf,
          random.order = FALSE, colors = pal)
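
Note that wordcloud() itself has no width or height arguments; those belong to a graphics device. To save the cloud to a file, wrap the call in a png() device, for example:

# save the cloud as a 10 x 10 inch PNG file
png("CFW_wordcloud.png", width = 10, height = 10, units = "in", res = 300)
wordcloud(some_txt6, min.freq = 50, max.words = Inf,
          random.order = FALSE, colors = pal)
dev.off()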

As a quick example of sentiment scoring, the sentence below scores 1 (one) for joy and positive, and 0 (zero) for anger, anticipation, disgust, fear, sadness, surprise, trust, and negative.

get_nrc_sentiment("I bought an iPhone a few days ago. It is such a nice phone. I love it")
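
The call returns a one-row data frame with ten columns, eight emotions plus the two sentiment polarities; based on the scores described above, the output should look roughly like this:

#   anger anticipation disgust fear joy sadness surprise trust negative positive
# 1     0            0       0    0   1       0        0     0        0        1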

Running the data through sentiment analysis. Here, we use the NRC sentiment and emotion lexicon via syuzhet's get_nrc_sentiment() function.

The NRC Sentiment and Emotion Lexicons is a collection of seven lexicons, including the widely used Word-Emotion Association Lexicon. The lexicons have been developed with a wide range of applications in mind; they can be used in a multitude of contexts such as sentiment analysis, product marketing, consumer behaviour analysis, and even political campaign analysis. Each lexicon has a list of words and their associations with certain categories of interest such as emotions (joy, sadness, fear, etc.), sentiment (positive and negative), or colour (red, blue, black, etc.)
- National Research Council Canada

mysentiment <- get_nrc_sentiment(some_txt5)
SentimentScores <- data.frame(colSums(mysentiment))
names(SentimentScores) <- "Score"
SentimentScores <- cbind("sentiment" = rownames(SentimentScores), SentimentScores)
rownames(SentimentScores) <- NULL

ggplot(data = SentimentScores, aes(x = sentiment, y = Score)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  theme(legend.position = "none") +
  xlab("Sentiment") + ylab("Score") +
  ggtitle("Total Sentiment Score Based on Tweets")

More about syuzhet, the sentiment analysis package used here, is available in its documentation on CRAN.
