We are interested in seeing what social media data can tell us about Qatari society at large. More concretely, we want to see if we can develop tools to take the digital pulse of a country. For example, where do different people go to shop? Or when and where do people go to exercise? The current demo is just one small step that gives users the opportunity to explore data to see life in Qatar with new eyes.
After you make a particular selection (or deselection), e.g. selecting only the "Shopping" topic with all languages, days and times, certain areas on the map light up in red. Red areas indicate "hot spots" with more activity than expected for the selected filters. "Cold spots" with less activity than expected are not marked on the map to avoid clutter. However, filters marked in cyan in the right sidebar indicate less activity than expected. For example, the area around Landmark Mall, at the intersection of the Doha Expressway and Al Markhiya Street, lights up in red because an unusually large fraction of the tweets matching the filter settings fall in this area, compared to the background distribution of all tweets. Similarly, for the same selection, the early evening (6pm-8pm) lights up in red because unusually many tweets during this time are related to shopping. When things are either "as expected" or there is not enough data to tell, we leave the corresponding area blank.
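To make the idea of "more than expected" concrete, here is a minimal sketch of how one map cell could be classified. It is an illustration, not the demo's actual code: the function name classify_cell, the normal-approximation z-score, and the thresholds z_threshold and min_expected are all assumptions. The expected count is the number of filtered tweets multiplied by the cell's share of all tweets (the background distribution).

```python
import math


def classify_cell(cell_filtered, total_filtered, cell_all, total_all,
                  z_threshold=2.0, min_expected=5.0):
    """Classify one map cell as 'hot', 'cold', or 'blank'.

    cell_filtered  -- tweets in this cell that match the current filters
    total_filtered -- all tweets that match the current filters
    cell_all       -- all tweets in this cell (background)
    total_all      -- all tweets overall (background)

    The thresholds are hypothetical values chosen for illustration.
    """
    if total_filtered == 0 or total_all == 0:
        return "blank"  # nothing to compare against

    # Share of the background distribution that falls in this cell.
    p = cell_all / total_all
    expected = total_filtered * p

    # Too little data to tell: leave the cell blank.
    if expected < min_expected or p >= 1.0:
        return "blank"

    # Normal approximation to the binomial spread around the expectation.
    std = math.sqrt(total_filtered * p * (1.0 - p))
    z = (cell_filtered - expected) / std

    if z > z_threshold:
        return "hot"
    if z < -z_threshold:
        return "cold"
    return "blank"  # activity is roughly as expected
```

In this sketch a "hot" cell would be drawn in red, while "cold" and "blank" cells would both stay unmarked on the map.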
For this demo we use public, geo-tagged tweets obtained through the Twitter APIs, using both the REST API and the Streaming API. In total, we have collected 1.8 million public tweets from 19,606 distinct users. Most of the data comes from Jan-Oct 2013, e.g. 178,480 tweets from Jul 2013. Only tweets that are geo-tagged with (latitude, longitude) coordinates are considered, and tweets that lie outside of Qatar are ignored. We will periodically add fresh data in the future.
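The geographic filter described above can be sketched as follows. This is only an illustration under stated assumptions: each tweet is taken to be a dict with a GeoJSON-style "coordinates" field (longitude first), and Qatar is approximated by a rough bounding box whose numbers are our own estimate. A real pipeline would more likely test against the actual border, since a box this coarse also covers some sea and neighbouring territory.

```python
# Rough bounding box around Qatar: (min_lat, max_lat, min_lon, max_lon).
# These numbers are an approximation chosen for illustration only.
QATAR_BBOX = (24.4, 26.2, 50.7, 51.7)


def extract_point(tweet):
    """Return (lat, lon) for a geo-tagged tweet, or None if it has no point."""
    geo = tweet.get("coordinates")
    if not geo or geo.get("type") != "Point":
        return None
    lon, lat = geo["coordinates"]  # GeoJSON order is longitude, latitude
    return lat, lon


def in_qatar(tweet, bbox=QATAR_BBOX):
    """True if the tweet is geo-tagged and falls inside the bounding box."""
    point = extract_point(tweet)
    if point is None:
        return False
    lat, lon = point
    min_lat, max_lat, min_lon, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def filter_tweets(tweets, bbox=QATAR_BBOX):
    """Keep only geo-tagged tweets that lie inside the bounding box."""
    return [t for t in tweets if in_qatar(t, bbox)]
```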
Currently, we detect topics using hand-compiled dictionaries for English, Arabic and Tagalog, the three most common languages in tweets from Qatar. For example, any tweet containing "swimming" would be marked as sports-related. Obviously, there are false positives such as "Things are going swimmingly" and false negatives such as "I was doing 200m of butterfly this morning". In the future we plan to improve the topic detection by integrating statistical language models such as Latent Dirichlet Allocation.
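A minimal sketch of dictionary-based topic detection is shown below. The tiny keyword lists and the function name detect_topics are made up for illustration and are not the demo's real dictionaries. We also assume plain substring matching, which reproduces both kinds of error mentioned above.

```python
# Hypothetical, tiny English dictionaries; the real ones are hand-compiled
# and also cover Arabic and Tagalog.
TOPIC_KEYWORDS = {
    "sports": {"swimming", "football", "gym"},
    "shopping": {"mall", "shopping", "discount"},
}


def detect_topics(text, dictionaries=TOPIC_KEYWORDS):
    """Return the set of topics whose keywords occur in the text."""
    lowered = text.lower()
    return {
        topic
        for topic, keywords in dictionaries.items()
        if any(keyword in lowered for keyword in keywords)
    }


# A true match, a false positive and a false negative.
print(detect_topics("Going swimming after work"))
print(detect_topics("Things are going swimmingly"))
print(detect_topics("I was doing 200m of butterfly this morning"))
```

The second tweet is tagged as sports only because "swimmingly" contains "swimming", and the third is missed because no keyword appears in it. Matching on whole words would remove the first error but not the second.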
We use a public tool for language detection. Note that language detection for very short tweets, or for tweets that contain only a URL, is difficult, and the tool will sometimes make mistakes. Also, Arabizi and other transliterations of non-Latin scripts are currently not supported by this tool.
In central Doha there are enough geo-tagged tweets to aggregate statistics for small geographic areas. In many parts of central Qatar the tweet volume is much lower. To still have enough data for statistical reasoning, we aggregate data over larger areas there. The exact recursive splitting algorithm uses quadtrees.
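The recursive splitting can be illustrated with a small quadtree sketch. This is not the demo's exact algorithm: the function name build_quadtree, the splitting rule (split a cell while it holds more than max_points tweets) and the parameter values are assumptions. The effect is the one described above: dense areas end up in small cells, sparse areas stay in large ones.

```python
def build_quadtree(points, bounds, max_points=4, max_depth=8, depth=0):
    """Recursively split a rectangle into quadrants.

    points -- list of (lat, lon) tuples inside the rectangle
    bounds -- (min_lat, min_lon, max_lat, max_lon)
    Returns a list of (bounds, points) leaf cells, including empty ones.
    max_points and max_depth are hypothetical tuning values.
    """
    # Stop splitting when the cell is sparse enough or already very small.
    if len(points) <= max_points or depth >= max_depth:
        return [(bounds, points)]

    min_lat, min_lon, max_lat, max_lon = bounds
    mid_lat = (min_lat + max_lat) / 2
    mid_lon = (min_lon + max_lon) / 2

    # Quadrant order: south-west, south-east, north-west, north-east.
    quadrants = [
        (min_lat, min_lon, mid_lat, mid_lon),
        (min_lat, mid_lon, mid_lat, max_lon),
        (mid_lat, min_lon, max_lat, mid_lon),
        (mid_lat, mid_lon, max_lat, max_lon),
    ]

    # Assign every point to exactly one quadrant.
    buckets = [[], [], [], []]
    for lat, lon in points:
        index = (2 if lat >= mid_lat else 0) + (1 if lon >= mid_lon else 0)
        buckets[index].append((lat, lon))

    leaves = []
    for quadrant, bucket in zip(quadrants, buckets):
        leaves.extend(
            build_quadtree(bucket, quadrant, max_points, max_depth, depth + 1)
        )
    return leaves
```

Statistics such as tweet counts per topic would then be aggregated per leaf cell rather than on a fixed grid.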
The lead scientist is Ingmar Weber (@ingmarweber) in QCRI's Social Computing group. Most of the implementation is done by Kiran Garimella (@gvrkiran). Two local students, Humaira Tasnim (@humairatasnim) and Abhay Valiyaveettil, contributed to this project during its early phase as part of their QCRI summer internship.