Improve customer engagement with Twitter Text Analytics and Machine Learning

The world of tweeting has transformed the way organisations and public figures engage with their customers and the public. You just need to observe Donald Trump’s tweets!

Tweet if you are unhappy!

Discover customer sentiments with Twitter insights

Just post a tweet to an organisation’s Twitter account and, most likely, you’ll get a response faster than by calling their call centre.

Still not satisfied with their response?

Create a catchy hashtag and, if it goes viral, it will definitely get them to notice! In today’s world of hashtags and retweets, bad news can spread like wildfire.

Are you using Twitter text analytics and machine learning to enhance customer engagement?

Besides using Twitter as a communication tool, you can take your customer engagement to the next level by tapping into Twitter’s Search API and readily available data and text analytics tools.

With these tools, you will be able to get a better sense of what your customers or followers are tweeting about, apply machine learning techniques to efficiently recognise and identify key messages in tweets and understand how customers/followers are feeling towards you.

Use analytics tools to tune into your customer’s voice on Twitter

In this segment, I will share a real-life example of how data science tools and data from Twitter’s Search API are used to gain insight into what customers are tweeting.

Using a flexible programming language like Python, I will:

  • use Twitter’s API to extract information about the tweets that are posted on an organisation’s account
  • with the information, track the number of tweets being posted and identify events that lead to a spike in tweets
  • connect to open source Python modules to process tweets and convert them into machine-readable formats
  • once converted, execute machine learning algorithms to get an insight into topics being discussed on Twitter and sentiment


To demonstrate the insights you can gain from applying data analytics to Twitter API data, let’s look at a recent case faced by an Australian telco, Optus.

For those who are unaware, Optus bought the Australian rights to the 2018 World Cup in Russia and implemented a paywall where Australian soccer fans had to pay $15 to watch all the games. These games would be streamed via the internet to your mobile and TV.

Unfortunately for Optus, their streaming service was not up to the task and many ended up not being able to watch! Now, you definitely do not want to get in the way of a football fan and the world’s biggest game! And if they can’t watch their game, you can be sure Twitter will be the first to hear about it.

Optus streaming error

Using Twitter’s search API, we search for all tweets posted on the Twitter account Optus Sport

Through a Python script, we connect to Twitter’s Search API.

Twitter offers different access plans for its Search API. We are using the free standard plan, which lets us query 7 days of history and returns a sample of Twitter data. The paid premium and enterprise plans offer longer history and complete data.

We run a query for all tweets on OptusSport. The data is retrieved and stored in a JSON file.

JSON (short for JavaScript Object Notation) is a text format that is commonly used to transfer data across the internet.

In this case, data is being transferred from Twitter’s servers to my computer. JSON files store information in a manner that is easy to access and understand. In short, HUMAN READABLE.

With the JSON file from the search API, the following information can be extracted from the user tweets:

  • the actual tweet and date and time of the post
  • screen name of the user
  • location (if set by the user)
  • # of followers
  • # of friends
  • # of likes the tweet generated
  • # of retweets generated
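As a rough illustration of pulling these fields out of the JSON, here is a minimal sketch. The sample tweet is made up, and the key names (`statuses`, `favorite_count`, `followers_count` and so on) assume the standard v1.1 Search API response layout. The helper `extract_fields` is a hypothetical name, not part of any library.

```python
import json

# Made-up sample in the assumed shape of a v1.1 Search API response.
raw = '''
{"statuses": [
  {"created_at": "Sat Jun 16 12:05:31 +0000 2018",
   "text": "Playback error again #optusfail",
   "favorite_count": 4,
   "retweet_count": 2,
   "user": {"screen_name": "example_fan", "location": "Sydney",
            "followers_count": 120, "friends_count": 80}}
]}
'''

def extract_fields(tweet):
    """Return the fields of interest from one tweet object (hypothetical helper)."""
    user = tweet.get("user", {})
    return {
        "text": tweet.get("text"),
        "created_at": tweet.get("created_at"),
        "screen_name": user.get("screen_name"),
        "location": user.get("location") or None,  # empty string means not set
        "followers": user.get("followers_count"),
        "friends": user.get("friends_count"),
        "likes": tweet.get("favorite_count"),
        "retweets": tweet.get("retweet_count"),
    }

rows = [extract_fields(t) for t in json.loads(raw)["statuses"]]
print(rows[0])
```

Each tweet becomes a flat dictionary, which is convenient to load into a table for the counting and plotting steps that follow.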

Tracking the number of tweets across time

A tweet too many!

With the extracted data, using the date and time of each tweet, one can plot the number of tweets over a time period. In our Optus Sport example, there was a spike in tweets at the 10 pm (AEST) mark, which is when the Round 1 World Cup matches began in Australia. Something amiss here? Either the coverage is really excellent and exciting or something’s gone wrong!

Optus Sport hourly tweets
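The hourly counts behind a chart like this can be sketched as follows. This assumes the `created_at` strings use the v1.1 timestamp format and treats AEST as a fixed UTC+10 offset. The sample timestamps are made up and `hourly_counts` is a hypothetical helper.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

AEST = timezone(timedelta(hours=10))  # fixed UTC+10 offset, no daylight saving

def hourly_counts(created_at_values):
    """Count tweets per local (AEST) hour."""
    counts = Counter()
    for value in created_at_values:
        posted = datetime.strptime(value, "%a %b %d %H:%M:%S %z %Y")
        local = posted.astimezone(AEST)
        counts[local.replace(minute=0, second=0)] += 1  # truncate to the hour
    return counts

# Made-up timestamps, for illustration only.
sample = [
    "Sat Jun 16 11:58:02 +0000 2018",
    "Sat Jun 16 12:05:31 +0000 2018",
    "Sat Jun 16 12:40:10 +0000 2018",
]
for hour, n in sorted(hourly_counts(sample).items()):
    print(hour.strftime("%d %b %H:00"), n)
```

A sudden jump in one of these hourly buckets is the kind of spike described above.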


Plotting a tweet distribution count of the top hashtags embedded in tweets gives us an indication that there may be some issue behind the coverage. Notice the hashtags optusfail, foptus, fail and optusout!

Hashtags embedded in optussport tweets
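Counting hashtags needs nothing more than a regular expression and a counter. A minimal sketch, with made-up tweets and a hypothetical helper name `top_hashtags`:

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")

def top_hashtags(texts, n=10):
    """Return the n most common hashtags, ignoring case."""
    counts = Counter()
    for text in texts:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts.most_common(n)

# Made-up tweets, for illustration only.
sample = [
    "Spinning wheel again #OptusFail #WorldCup",
    "Money back please #optusfail",
    "What a goal! #WorldCup",
    "#optusout",
]
print(top_hashtags(sample, n=3))
```

Lower-casing the tags means #OptusFail and #optusfail are counted as the same hashtag.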

Crawl through thousands of tweets with a little help from the machine

Millions of tweets

It is going to be a tedious task to manually crawl through all the tweets to get a sense of how customers are feeling.

Thankfully, with the help of open source machine learning modules built in Python, there is a way to segment and cluster tweets to help you identify key topics that are being tweeted.

Helping your computer understand human language in a tweet

A machine can only analyse and classify text if it is able to interpret human language. Thankfully, we can use Python to tap into the Natural Language Toolkit module to do this.

After cleaning our collected tweet text of unwanted symbols and punctuation, the Natural Language Toolkit (NLTK) is used to:

  • process and remove unimportant words (STOPWORDS) from the text,
  • strip out keywords from sentences and convert them into a numerical format that allows the computer to run algorithms to tag or cluster similar related words and identify sentiment within sentences
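The cleaning and stop word removal can be sketched with the standard library alone. Note the assumptions: the tiny `STOPWORDS` set below is made up and stands in for the much longer English list that NLTK provides, and `clean_tweet` is a hypothetical helper rather than the exact routine used for this analysis.

```python
import re

# Tiny made-up stop word list; NLTK ships a much longer English list.
STOPWORDS = {"the", "is", "a", "to", "and", "my", "on", "of", "it", "this"}

def clean_tweet(text):
    """Lower-case a tweet, strip noise and return its remaining keywords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop links
    text = re.sub(r"[@#]\w+", " ", text)       # drop mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)      # drop symbols, digits, punctuation
    return [word for word in text.split() if word not in STOPWORDS]

print(clean_tweet("@OptusSport the stream is a JOKE!!! Fix it https://t.co/x #optusfail"))
```

The resulting keyword lists are the input to the frequency, collocation and clustering steps later on.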

What is the machine telling us about the tweets?

We’ve got our dataset and it’s now time to step into the machine and take a look at what it sees.
In this section, I will be applying text analytics and machine learning methods to try to get a sense of the topics being discussed in Optus Sport’s tweets.

Basic Word Frequency Insights

Let’s start with the most basic analysis, which is the frequency of key words in our dataset of tweets. We plot the top 20 words by count and observe that words like ‘service’, ‘coverage’ and ‘watching’ have a high count. This, however, does not give us any deep insight into the topics being discussed.

Word frequency of Optus Sport tweets

Words that occur frequently together – Word Collocations

A single word count isn’t too insightful. What if we tried looking at words that occurred frequently together?
Using our tweet text dataset, we run the words of each tweet sentence against the COLLOCATIONS module found in NLTK.

The top 20 combinations are shown below. Combinations like (‘taking’, ‘piss’), (‘playback’, ‘error’) and (‘absolute’, ‘joke’) provide a better view of the sentiment and the issues being faced.

[(‘free’, ‘air’), (‘sporting’, ‘event’), (‘premier’, ‘league’), (‘biggest’, ‘sporting’), (‘biggest’, ‘event’), (‘serge’, ‘para’), (‘last’, ‘night’), (‘money’, ‘back’), (‘fetch’, ‘box’), (‘first’, ‘half’), (‘social’, ‘media’), (‘taking’, ‘piss’), (‘chrome’, ‘cast’), (‘playback’, ‘error’), (‘spinning’, ‘wheel’), (‘act’, ‘together’), (‘black’, ‘screen’), (‘half’, ‘time’), (‘good’, ‘enough’), (‘absolute’, ‘joke’)]
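To show the idea behind word pairs like these, here is a simplified sketch that counts adjacent word pairs with the standard library. It is a stand-in rather than NLTK’s collocations module, which can also rank pairs with statistical association measures. The sample tweets and the helper name `top_bigrams` are made up.

```python
from collections import Counter

def top_bigrams(tokenised_tweets, n=20):
    """Return the n most frequent adjacent word pairs across all tweets."""
    counts = Counter()
    for tokens in tokenised_tweets:
        counts.update(zip(tokens, tokens[1:]))  # pairs never span two tweets
    return counts.most_common(n)

# Made-up, already cleaned tweets, for illustration only.
sample = [
    ["playback", "error", "again"],
    ["another", "playback", "error"],
    ["want", "money", "back"],
]
print(top_bigrams(sample, n=3))
```

Counting pairs within each tweet separately avoids pairing the last word of one tweet with the first word of the next.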

Can the machine listen to your Twitter conversation?

With the help of open-source text analytics algorithms, we experiment with the machine’s ability to process human language and classify and cluster tweet text into various discussion topics.

Term frequency and Inverse Document frequency algorithm (TF-IDF)

The first algorithm we apply is the TF-IDF module in Python’s scikit-learn library. The module scans through the keywords in each tweet and weights each keyword by how often it appears in that tweet, discounted by how many tweets in the dataset contain it. Each tweet in the dataset is then represented by the scores of its keywords.

With this score, we can then utilise clustering techniques like K-means to identify natural groupings of keywords. In our example, 3 key clusters were observed (see below).

With the clusters identified, we can label them based on the keywords we observe. It appears that the bulk of tweets (cluster 1) come from customers whose service is working but who may not be satisfied. In cluster 2 (3% of tweets), customers are asking Optus to fix the coverage. At the extreme end (2% of tweets), they want their MONEY BACK!

Tweet clusters based on term frequency inverse document frequency scores

Gensim Doc2Vec algorithm to measure similarities between keywords in tweets

As in the previous step, we apply a different algorithm, Gensim’s Doc2Vec, to measure the similarity between keywords in tweets. Its output lets us apply clustering techniques to see how these words group together, and the resulting clusters help us identify the key topics being discussed.

After clustering and visualising the clustered words, we see that there are 3 distinct clusters (see image below).

Doc2Vec tagged document clusters from tweets

Using WordCloud to visualise the words in each topic cluster and TextBlob to measure sentiment within each cluster

The Python library WordCloud is a great way to visualise blocks of words. We use it to visualise the 3 clusters of words.

The words of each cluster are also passed through another library, TextBlob. When we run the words of each cluster through TextBlob, it returns a score between -1.0 (negative sentiment) and 1.0 (positive sentiment), with 0.0 being neutral.

FIFA World Cup 2018

Cluster 1 Optus Tweet Topics

-ve sentiment words: 500 (30%)
+ve sentiment words: 1182 (70%)

Cluster 2 Optus Tweets Topics

-ve sentiment words: 434 (34%)
+ve sentiment words: 860 (66%)

Cluster 3 Optus Tweet Topics

-ve sentiment words: 807 (50%)
+ve sentiment words: 822 (50%)
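Tallies like the ones above can be produced by scoring each word and counting how many fall below or above zero. In this minimal sketch the small `POLARITY` dictionary is entirely made up and stands in for the polarity scores a library such as TextBlob would return. `sentiment_tally` is a hypothetical helper.

```python
# Hypothetical polarity scores in [-1.0, 1.0]; a stand-in for a real sentiment library.
POLARITY = {
    "joke": -0.6, "fail": -0.8, "error": -0.5,
    "good": 0.7, "great": 0.8, "free": 0.4,
}

def sentiment_tally(words):
    """Count negative and positive words and express each as a share of the total."""
    scores = [POLARITY[w] for w in words if w in POLARITY]  # unknown words are skipped
    negative = sum(1 for s in scores if s < 0)
    positive = sum(1 for s in scores if s > 0)
    total = negative + positive
    if total == 0:
        return {"negative": 0, "positive": 0, "negative_pct": 0, "positive_pct": 0}
    negative_pct = round(100 * negative / total)
    return {
        "negative": negative,
        "positive": positive,
        "negative_pct": negative_pct,
        "positive_pct": 100 - negative_pct,  # keeps the two shares summing to 100
    }

# Made-up cluster words, for illustration only.
cluster_words = ["absolute", "joke", "playback", "error", "good", "enough", "fail"]
print(sentiment_tally(cluster_words))
```

Deriving the positive share as 100 minus the negative share keeps the two percentages consistent after rounding.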

Conclusion: Text analytics and machine learning methods can alert us to different levels of customer sentiment

The Optus Sport example shows that by accessing Twitter data and applying data science techniques, one can track and be alerted to changing levels of customer sentiment.

By tracking the number of tweets across time, we can be alerted to any significant changes to tweet volume.

Through tracking of hashtags and retweets, one can be alerted to any key issues that may have a VIRAL impact.

By using text analytics and unsupervised machine learning techniques such as clustering, we can uncover hidden patterns in tweet discussions, get a sense of the topics discussed (e.g. customers who are unhappy and need Optus to fix the World Cup coverage, and customers who are really unhappy and want a refund) and see whether they are positive or negative in sentiment. These free, open source techniques and tools are easily accessible on the Internet.

Finally, embedding analytics into your Twitter communication workflow can help enhance your customer engagement strategy. I’ve drawn up an example below. To discuss this further, feel free to leave a comment or send me an email at

Enhancing your twitter strategy with text analytics and machine learning
