Data assessments: keeping big data for good and profitable purposes
--
Big data has given the social sciences the opportunity to analyze data from millions of people, and with it comes the need to establish standards for using that data toward specific purposes. Assessing the representativeness, distribution and applicability of data is becoming critical to ensuring its impact on development and humanitarian action, so that everyone in the world counts.
Descriptive assessments of data, inspired by the social sciences and their methodologies for defining study sample populations, should now be able to rely on analytical methodologies and processing tools. Conversely, thanks to big data, it is becoming possible to assess sample populations and estimate demographics quantitatively with high-resolution data.
This loop between the social sciences and the analytical sciences points to further steps in computational social science, such as computational demographics. Recent studies based on data sources such as mobile phones and social networks have shown the potential to produce timely census estimates and user profiles. While these examples are promising, several key points remain to be addressed:
Understanding the interactions between survey-based demographics and computational demographics.
Surveys are often used as ground-truth data to validate big data projects; however, big data provides finer resolution in places survey data cannot reach. The interactions between survey-based demographics (based on small validated data) and computational demographics (based on big data) have yet to be properly studied.
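One simple way these two kinds of demographics can interact is post-stratification: the demographic shares from a validated survey are used to reweight an unrepresentative big data sample. The sketch below is only an illustration under stated assumptions. The function name, the age groups and all of the numbers are hypothetical and do not come from any real dataset.

```python
def poststratification_weights(sample_counts, reference_shares):
    """Return one weight per group so the weighted sample matches the reference shares.

    sample_counts: number of big-data users observed in each group (hypothetical).
    reference_shares: population share of each group from a validated survey.
    """
    total = sum(sample_counts.values())
    weights = {}
    for group, reference_share in reference_shares.items():
        count = sample_counts.get(group, 0)
        if count == 0:
            # A group missing from the sample cannot be recovered by reweighting.
            weights[group] = None
            continue
        sample_share = count / total
        weights[group] = reference_share / sample_share
    return weights


# Hypothetical numbers, for illustration only.
sample_counts = {"18-29": 600, "30-49": 300, "50+": 100}
survey_shares = {"18-29": 0.30, "30-49": 0.40, "50+": 0.30}

weights = poststratification_weights(sample_counts, survey_shares)
for group, weight in weights.items():
    print(group, weight)
```

A weight above 1 marks a group that is under-represented in the big data sample relative to the survey, and a weight below 1 marks an over-represented group. Very large weights are a warning sign that the sample says little about that group.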
Integrating social, environmental, cultural and economic data into the analysis of behavioral patterns.
Data integration provides broader perspectives for interpreting patterns. Moreover, the intentionality of human behavior can hardly be captured in computational models, so the variability of behaviors within a population sample is always relevant. Defining invariants of human behavior in multi-source data problems is still a long-term challenge that will require large pools of comparative data.
Designing interfaces and visualization tools to represent data and its uses
Multidimensional representations of data and its uses, in different coordinate systems, are necessary to understand how data dimensions evolve and how they affect data innovation: for instance, how often data is applied to each Sustainable Development Goal (SDG), or how it is used to characterize different population groups.
Allowing overlapping and collision of data sources and uses
A necessary step is to accept some level of inaccuracy in data sources, since data innovation has to work with existing sources to assess potential uses and promote investment in data ecosystems. Also, the best way to optimize existing data may differ across uses or across impacts on each SDG, so overlaps and collisions are to be expected.
Defining metrics for spatial and temporal distributions of data and the necessary sampling rates.
The amount of data, distributed across space and time, that is needed to analyze patterns has to be progressively quantified. The problem is analogous to the sampling required (Nyquist frequency) to, for instance, send and receive an audio signal, where the sampling rate must exceed twice the highest frequency present in the signal. However, finding Nyquist frequencies in social systems is a complicated problem because of the different scales of interaction and the large variability of these systems.
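To make the sampling analogy concrete, the sketch below computes the frequency that a purely periodic pattern would appear to have when it is observed at a given sampling rate. It assumes an idealized single sinusoid, which real social data is not, and the function names and the daily-cycle example are illustrative assumptions rather than anything measured.

```python
def nyquist_rate(max_frequency):
    """Sampling rate that must be exceeded to capture max_frequency without aliasing."""
    return 2.0 * max_frequency


def apparent_frequency(true_frequency, sampling_rate):
    """Frequency a pure sinusoid appears to have after sampling (frequency folding).

    If sampling_rate is above the Nyquist rate, the result equals true_frequency.
    Otherwise the pattern is aliased to a lower, misleading frequency.
    """
    nearest_multiple = round(true_frequency / sampling_rate) * sampling_rate
    return abs(true_frequency - nearest_multiple)


# Hypothetical example: a daily activity cycle (1 cycle per day).
daily_cycle = 1.0

# Hourly observations: 24 samples per day, well above the Nyquist rate of 2 per day.
hourly = apparent_frequency(daily_cycle, sampling_rate=24.0)

# One observation every 18 hours: about 1.33 samples per day, below the Nyquist rate.
every_18_hours = apparent_frequency(daily_cycle, sampling_rate=24.0 / 18.0)

print(nyquist_rate(daily_cycle), hourly, every_18_hours)
```

In this idealized case the sparse schedule makes a daily rhythm look like a cycle that repeats roughly every three days, which is the kind of artifact that under-sampled behavioral data could produce.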
Scaling and deepening in any data ecosystem
Characterizing the availability of data as dense or sparse in each of its dimensions is necessary to adapt the analysis, understand the scope of data innovations and stimulate further technological development. Artificial intelligence, as a tool for deepening the impact of data, requires the proper data media to ensure a positive impact.
Demographics, and innovation based on behavioral analysis across demographics, can now be approached using big data such as Call Detail Records and social media data. Setting the right frameworks to ensure fairness, equality and impact for each specific application will determine the success of data innovation.
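A rough way to label a dataset as dense or sparse is to count how much of a space-time grid it actually covers. The sketch below assumes records have already been reduced to hypothetical (cell, time bin) pairs. The grid size, the records and the function names are illustrative assumptions, and any threshold for calling a dataset "sparse" would depend on the application.

```python
def coverage_density(records, n_cells, n_time_bins):
    """Fraction of (cell, time bin) combinations holding at least one record."""
    occupied = {(cell, time_bin) for cell, time_bin in records}
    return len(occupied) / (n_cells * n_time_bins)


def coverage_by_dimension(records, n_cells, n_time_bins):
    """Fraction of spatial cells, and of time bins, that have any record at all."""
    cells_seen = {cell for cell, _ in records}
    bins_seen = {time_bin for _, time_bin in records}
    return len(cells_seen) / n_cells, len(bins_seen) / n_time_bins


# Hypothetical records on a grid of 4 spatial cells by 4 time bins.
records = [(0, 0), (0, 1), (0, 1), (1, 0), (2, 3)]

density = coverage_density(records, n_cells=4, n_time_bins=4)
space_coverage, time_coverage = coverage_by_dimension(records, n_cells=4, n_time_bins=4)

print(density, space_coverage, time_coverage)
```

Here most cells and most time bins are touched at least once, yet only a quarter of the full grid is covered, so the data looks dense along each dimension separately but sparse when the dimensions are combined.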
References
UN Global Pulse Reports