In July, the Defense Advanced Research Projects Agency (DARPA) issued an RFP looking for new Big Data tools to track social media postings and interactions, reflecting growing interest within government in using social media and open source data to "fill-in" and "complement" traditional data sources.
These publicly available data sources include social networking sites, such as Twitter, Facebook and LinkedIn; websites that contain publicly available or subscription-based content; and topic-specific online message boards and chat rooms. The government's focus is on using this public data to develop analytics that can anticipate how an adversary or potential friend "thinks" and "feels" about a particular situation, with the hope of being able to predict their behavior, actions and reactions.
There are significant challenges and points for consideration in developing analytics against social media and open source data:
Provenance and pedigree -- One of the major weaknesses with public data is that you cannot accurately confirm the identity of the sources and their motivations. From the outset you need to consider data provenance (the metadata that describes the source of the data) and make appropriate pedigree decisions (confidence in the data) when using this data in any analytic. Maintaining data provenance metadata and pedigree-based decision-making is not something you "bolt-onto" analytical software after the fact -- it is an integral part of the analytic that must be designed and included from the outset.
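As a rough illustration of carrying provenance and pedigree with the data from the outset, the sketch below attaches both to every observation and lets the pedigree weight an aggregate. The class names, fields, the 0-to-1 pedigree scale and the `min_pedigree` floor are all hypothetical choices made for this example, not an established scheme.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    """Metadata describing where an observation came from (hypothetical fields)."""
    source: str             # e.g. "twitter" or "message_board"
    account_id: str         # identifier as reported by the source, unverified
    collected_at: datetime  # when the item was collected


@dataclass(frozen=True)
class Observation:
    text: str
    sentiment: float        # illustrative score in [-1.0, 1.0]
    provenance: Provenance  # travels with the data, never added later
    pedigree: float         # assumed confidence in the data, in [0.0, 1.0]


def weighted_sentiment(observations, min_pedigree=0.2):
    """Return the pedigree-weighted mean sentiment.

    Observations whose pedigree falls below min_pedigree are ignored.
    Returns None when nothing usable remains.
    """
    usable = [o for o in observations if o.pedigree >= min_pedigree]
    total_weight = sum(o.pedigree for o in usable)
    if total_weight == 0:
        return None
    return sum(o.sentiment * o.pedigree for o in usable) / total_weight


def make_observation(text, sentiment, source, account_id, pedigree):
    """Convenience constructor that stamps the collection time."""
    prov = Provenance(source, account_id, datetime.now(timezone.utc))
    return Observation(text, sentiment, prov, pedigree)
```

Because the pedigree is a field of the observation itself, any downstream analytic can discount or discard low-confidence items without a separate lookup.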
Merging and re-using existing analytic techniques -- There are various initiatives such as the Mood of the Nation Project which have developed statistical machine-learning techniques for processing Twitter feeds to measure the Joy/Sadness/Anger/Fear level of an entire nation as major events unfold. These analytical techniques are often merged with other classic capabilities, such as entity extraction, relationship discovery and link analysis to offer greater insight into the data.
Natural language processing (NLP) -- For Latin-rooted languages, NLP techniques are relatively mature, although they still struggle with social language aspects such as sarcasm, metaphors, acronyms, Internet shorthand or intentional encoding to disguise the meaning. It is critical that more mature techniques be developed to handle non-Latin-based languages, as it is likely that future conflicts will take place in countries where these languages are used.
Data Integration -- It's very likely that social media and open source data will have to be integrated (or linked) with other, more trusted data sources to derive a complete picture. Various data integration approaches have been developed over time (e.g., canonical data model, uber data model, software integration layer, semantics-based), but this is a very complex problem at any scale. The lack of any unique identifiers (e.g., keys) in social media and open source data makes that data particularly difficult to integrate.
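To make the missing-key problem concrete, here is a minimal sketch of linking social media records to trusted records by approximate string similarity rather than by a shared identifier. It uses `difflib.SequenceMatcher` from the Python standard library. The record fields (`handle`, `name`, `location`, `record_id`), the equal weighting of name and location, and the 0.8 threshold are assumptions for illustration only; a production linkage approach would be considerably more involved.

```python
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(value.lower().split())


def similarity(a, b):
    """Similarity ratio in [0.0, 1.0] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_records(social, trusted, threshold=0.8):
    """Link each social record to its best-matching trusted record.

    The score is the mean of name and location similarity (an assumed,
    illustrative weighting). Records scoring below threshold stay unlinked.
    Returns a list of (handle, record_id, score) tuples.
    """
    links = []
    for s in social:
        best, best_score = None, 0.0
        for t in trusted:
            score = (similarity(s["name"], t["name"])
                     + similarity(s["location"], t["location"])) / 2
            if score > best_score:
                best, best_score = t, score
        if best is not None and best_score >= threshold:
            links.append((s["handle"], best["record_id"], round(best_score, 2)))
    return links
```

The score returned with each link can be retained as part of the pedigree of the integrated record, since an approximate match is itself a source of uncertainty.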
Predictive aspect -- Many social media analytics look to the past and measure sentiment, influence or the size of an organization of interest. What is really needed are analytics that can predict potential future events. For example, an analytic might anticipate how villagers and local militia will react to rising food prices when climate change brings drier local weather, which in turn produces a localized drought, a smaller harvest and a subsequent rise in flour prices. Sophisticated models are required for this kind of analytic, but therein lies the power of analytics.
Also, in line with the predictive aspect, social media can contain "side channel" clues to otherwise hidden activity, such as discussions or tweets from people witnessing something odd, like high foot traffic around a previously isolated house. These details can then be used to focus analytics on an area, person or group of interest.
Real-time analytics -- Just a few minutes after an event -- and sometimes even before an event happens -- social media is abuzz with associated details. Traditional batch-based analytics, or analytics that execute against data at rest, are not appropriate for such a scenario; a near-real-time, streaming execution model is required. Real-time analytics make it possible to witness an event as it unfolds, with a timeliness that at-rest analytics cannot match. Thus, social media analytics serve as a real-time indicator of where to look, as opposed to some form of at-rest data mining across the entire data spectrum, which often delivers its results too late.
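The contrast with batch processing can be sketched with a small streaming counter that updates as each post arrives and flags a burst of activity within a sliding time window. The class name, the window length and the alert threshold are hypothetical, and the sketch assumes timestamps (in seconds) arrive in non-decreasing order; it is only meant to show the shape of a streaming analytic.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts posts seen within the last window_seconds, one post at a time."""

    def __init__(self, window_seconds, alert_threshold):
        self.window_seconds = window_seconds
        self.alert_threshold = alert_threshold
        self._timestamps = deque()

    def observe(self, timestamp):
        """Record one post and report whether the window count is at or above
        the alert threshold.

        Assumes timestamps are in seconds and arrive in non-decreasing order.
        """
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop posts that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.alert_threshold


if __name__ == "__main__":
    counter = SlidingWindowCounter(window_seconds=60, alert_threshold=3)
    for t in (0, 10, 20, 200):
        print(t, counter.observe(t))
```

Each post is examined once and the state kept in memory is bounded by the window, so an alert can be raised while the event is still unfolding rather than after a later batch run.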
Careful forethought and planning regarding the above points can ensure that social media and open source analytics reach their maximum potential.
Louis Chabot, a professional software architect with more than 25 years of experience in information systems, leads the Big Data practice for Dynamics Research Corp. He can be reached at: