Extracting Useful Data from Twitter for Methodological Evaluation – Part II

Posted By mcarter on Mar 14, 2013


“11.35: The hashtag RichardIII is now trending on Twitter.” So reported Telegraph Reporters on Sept 12, 2012, in a minute-by-minute timeline of the announcement that the University of Leicester archaeology team had discovered bones believed to be those of Richard III buried under a council parking lot.  Suffice it to say, this was a seminal event in archaeology: it was the first time that embedded reporters covered an archaeological event live and in real time.  Twitter brought that news to the world.

In the second part of our investigation into extracting useful data from Twitter for methodological evaluation, I’m going to use Topsy again to try to provide a view of digital media, archaeology and public engagement.  Does an event such as this help to expose the public to archaeology and archaeologists, or are these terms co-opted byproducts of a pop culture event?

To recap, two events occurred surrounding the discovery of Richard III.  The first happened on September 12, 2012, when the University of Leicester announced that they had discovered what they believed might be the bones of Richard III; that day saw almost 1,565 unique Tweets under the hashtag #richardIII.  The second event was the official news on February 4, 2013 that the archaeological team had confirmed the bones to be those of Richard III.  On that day, 66,696 Tweets were made worldwide.

A couple of things need to be considered with this query.  On Twitter, when users want to “home in” on a subject of interest, they tend to use a hashtag such as #richardiii.  Prior to the original announcement that Richard III’s bones had been discovered, the hashtag #richardiii was used by an assortment of users who primarily discussed the works of Shakespeare, specifically the play The Tragedy of Richard the Third.  In the course of the archaeological discovery, this hashtag was hijacked by people wanting to connect or Tweet about the discovery of the bones and the subsequent news related to it.  More importantly, however, that hashtag was not the only way people Tweeted about Richard III.

Keywords are important when mining Twitter data.  Using additional related terms such as “Richard III”, “King Richard III” and “King Richard”, a fuller picture begins to emerge of the extent of Twitter activity surrounding the archaeological event.  By combining the total Tweet counts of just these four terms (the hashtag plus the three phrases), the total number of Tweets jumps to 430,079 over a 24-hour period.
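One caveat when combining per-term counts is that the terms overlap: any Tweet containing “King Richard III” also contains “Richard III” and “King Richard”. The following is a minimal sketch, not Topsy’s actual method, showing how a simple sum of per-term counts can differ from the number of unique matching Tweets. The sample Tweets and the function names are invented for illustration, and matching is assumed to be a plain case-insensitive substring test.

```python
# Hypothetical sample tweets, invented purely for illustration.
TWEETS = [
    "They found King Richard III under a car park! #richardIII",
    "Reading about Richard III today",
    "King Richard was the last Plantagenet king",
    "Nothing to do with the dig at all",
]

# The four search terms from the post, lower-cased for matching.
TERMS = ["#richardiii", "richard iii", "king richard iii", "king richard"]


def per_term_counts(tweets, terms):
    """Count, for each term, how many tweets contain it (case-insensitive)."""
    lowered = [t.lower() for t in tweets]
    return {term: sum(term in text for text in lowered) for term in terms}


def unique_matches(tweets, terms):
    """Count tweets containing at least one term, each tweet counted once."""
    return sum(any(term in t.lower() for term in terms) for t in tweets)


counts = per_term_counts(TWEETS, TERMS)
summed = sum(counts.values())
unique = unique_matches(TWEETS, TERMS)

print(counts)
print("sum of per-term counts:", summed)
print("unique matching tweets:", unique)
```

In this toy sample the first Tweet matches all four terms, so the summed figure is larger than the number of distinct Tweets. Whether the 430,079 figure is affected in the same way depends on how the per-term counts were combined, which the Topsy output does not make clear.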


Essentially we are looking at the gross number of actual Tweets that contained the search terms above over a 24-hour period.  However, the story doesn’t stop there.  Inferences can be made about how many Twitter users were actually exposed to the Richard III terms above on Feb 4, 2013, by taking the gross number of followers of each Twitter user who posted any message containing “#richardIII”, “Richard III”, “King Richard III” or “King Richard”.  Using this methodology, Topsy estimates that 1,280,087,045 Twitter users were exposed to a Tweet of some sort about this archaeological event on Feb 4.


Topsy describes its methodology this way: “Topsy calculates exposure by summing the follower counts of all the authors of tweets that match the keywords being queried. This calculation returns overall gross exposure (vs. unduplicated net exposure) so multiple tweets from the same author or authors with common followers may result in audience duplication.”  To better understand the margin of error, Topsy would have to predict and/or calculate how many times the same Tweet was distributed by the same author.  With the search terms “#richardIII”, “Richard III”, “King Richard III” and “King Richard”, there is no clear indication of how much duplication the gross calculation contains.
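The difference between gross and net exposure can be illustrated with a small sketch. This is my reading of Topsy’s description, not its implementation: gross exposure is assumed to add an author’s follower count once per matching Tweet, while net exposure counts each distinct follower once. The author names, follower sets and function names are hypothetical.

```python
# Hypothetical follower sets, invented for illustration.
FOLLOWERS = {
    "alice": {"u1", "u2", "u3"},
    "bob": {"u2", "u3", "u4", "u5"},
}

# Author of each matching tweet; "alice" tweeted twice.
MATCHING_TWEET_AUTHORS = ["alice", "alice", "bob"]


def gross_exposure(authors, followers):
    """Sum follower counts once per matching tweet (duplicates included)."""
    return sum(len(followers[author]) for author in authors)


def net_exposure(authors, followers):
    """Count each distinct follower once, however many tweets reached them."""
    reached = set()
    for author in authors:
        reached |= followers[author]
    return len(reached)


gross = gross_exposure(MATCHING_TWEET_AUTHORS, FOLLOWERS)
net = net_exposure(MATCHING_TWEET_AUTHORS, FOLLOWERS)

print("gross exposure:", gross)
print("net exposure:", net)
```

Here the repeat Tweet and the two shared followers inflate the gross figure to twice the net figure. Computing net exposure requires the actual follower lists, which is presumably why only the gross figure is reported.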

Finally, one of the interesting elements, from an anthropological perspective, of this type of real-time, machine language data mining is the ability to estimate the gross number of Tweets by country of origin and the positive, neutral or negative value of the quantitized Tweet.  Let’s first look at the geographic makeup of Tweets over a 24-hour period on Feb 4, 2013.


Twitter can “geo-tag” a Tweet, and generally there is a 90% confidence that the country assigned to a Tweet is correct.  Topsy states: “The Geographic view shows country-level metrics at a high confidence and coverage rates. The confidence rate will be 90%, meaning that 90% of tweets that are geo-tagged by country are correct based on our validation methods. The targeted coverage will be 90%, meaning that 90% of tweets that come from Twitter will be geo-tagged at the country level at the 90% confidence rate.”  So when using this methodology, researchers must also be cognizant that “volume” is qualitative in nature and not quantitative.
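To see what the coverage and confidence rates imply together, here is a rough worked calculation. It assumes the two 90% figures apply uniformly and can simply be multiplied, which Topsy does not state; the function name and the sample total are my own.

```python
def expected_geotag_counts(total_tweets, coverage=0.90, confidence=0.90):
    """Estimate how many tweets get a country tag and how many tags are right.

    coverage:   share of all tweets that receive a country-level tag
    confidence: share of tagged tweets whose country is correct
    """
    tagged = total_tweets * coverage
    correct = tagged * confidence
    return round(tagged), round(correct)


tagged, correct = expected_geotag_counts(1000)
print("tagged:", tagged, "correctly tagged:", correct)
```

Under these assumptions, out of every 1,000 Tweets roughly 900 would carry a country tag and roughly 810 of those tags would be correct, so country-level totals are better read as indicative than exact.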

Going beyond the margins of error, however, it is interesting to see that the largest number of Tweets (328,340) was generated from the United States.  Next was the actual country of origin of the archaeological event, with 49,439 UK Tweets.  Surprisingly, Indonesia had the third largest number of original Tweets on the subject.  Next was France and then Canada.  The Canadian ranking of 5th was surprising, if only because the actual identification of Richard III’s remains would not have been possible without the DNA sample from Canadian Michael Ibsen, a 17th-generation nephew of Richard through the king’s older sister, Anne of York.

If you compare the top 5 Tweets listed beside the geographic total, 4 of the 5 original Tweets are from the UK and one is from the USA.  Unfortunately, 3 of those top 5 Tweets are jokes about Richard III’s situation, which brings us to the skewing factor.  If one dives down into the quantitative gross counts to examine the qualitative nature of the actual Tweets, a substantial number turn out to be original or retold jokes!  This was not lost on some, as a Feb 4th post in Maclean’s Magazine, published almost 16 hours after the original UK announcement, points out: “Richard III’s skeleton found; Twitter gets buried in jokes.”  Neither Topsy nor any other Twitter data mining tool set has a “no joke” filter, but there are some interesting observations that can help discern how to filter the jokes out of the data set.


As discussed in Part I in last week’s blog, Topsy and other data mining applications use Sentiment Analysis or natural language processing (NLP) to determine a quantitized value of the actual Tweet.  Topsy uses an NLP methodology that scores words with a value from 0 to 100.  As Joe Masciocco, Social Analytics Consultant over at Topsy, points out: “in layman’s terms, we have language coding specialists on staff.  We score every word that comes through within each tweet on a scale from 0-100 (very negative – very positive); we then take a look at how the words interact and score the tweet on a whole from 0-100.  This all happens in real time for all tweets.”  Hence Topsy quantitizes the content of the Tweet to determine its overall Sentiment (Driscoll et al, 2007).
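To make the idea of word-level scoring concrete, here is a deliberately simplified sketch. It is not Topsy’s algorithm: the tiny lexicon, the neutral default of 50, the cut-off values and the plain average are all assumptions of mine, and the step where Topsy looks “at how the words interact” is not modelled at all.

```python
import re

# Hypothetical word scores on a 0-100 scale (0 = very negative,
# 100 = very positive). A real lexicon would be far larger.
LEXICON = {"great": 90, "amazing": 95, "terrible": 10, "boring": 20}
NEUTRAL = 50  # assumed score for words not in the lexicon


def score_tweet(text):
    """Average the per-word scores; unknown words count as neutral."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return NEUTRAL
    scores = [LEXICON.get(word, NEUTRAL) for word in words]
    return sum(scores) / len(scores)


def label(score, low=40, high=60):
    """Map a 0-100 score to negative / neutral / positive (assumed cut-offs)."""
    if score < low:
        return "negative"
    if score > high:
        return "positive"
    return "neutral"


for tweet in ["Amazing discovery", "Terrible boring", "They found a king"]:
    print(tweet, "->", label(score_tweet(tweet)))
```

A scheme like this also hints at why jokes are hard to filter: a joke built from positive words scores as positive, so sentiment alone says nothing about whether a Tweet is serious.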

Again, there is no “joke” filter in NL processing; however, I did discover something interesting when reviewing the graph above the quantitative data displayed by Topsy.  By clicking on the end points of each graphed line, the user can get a listing of the top 5 positive Tweets.  When we go through all four search terms, in this small sample set it is almost exclusively the search term #RichardIII that reveals where the “jokesters” live!  It seems that by the end of Feb 4th #RichardIII had been co-opted yet again, but this time by people looking to plant or supplant a good joke!

Unfortunately, as with any interesting data, we have only scratched the surface.  In all the jumble of understanding how one archaeological event could potentially expose over 1.2 billion Twitter followers in a single day to archaeology, we also need to examine how archaeology and archaeologists were affected.  In Part III, I’ll compare our Richard III event alongside a mixed methods analysis of archaeology and archaeologists to see if there is a correlation between pop culture events and public engagement in archaeology.  I leave you with an article from the Washington Post that I found in a Tweet from an archaeologist the day after the big event: “On social media, archaeologists roll their eyes at Richard III skeleton discovery.”




Driscoll, D.L., Appiah-Yeboah, A., Salib, P. and Rupert, D.J., 2007. Merging Qualitative and Quantitative Data in Mixed Methods Research: How To and Why Not. Ecological and Environmental Anthropology 3(1): 19-28.
