Data have been collected and analyzed for millennia, but never before have these processes been so ubiquitous. Data journalism, with its focus on eye-catching visualizations and infographics, is transforming an industry from mere collection of information to effective presentation. Businesses rigorously analyze consumers’ browsing and purchasing histories to optimize sales and marketing. Likewise, the governance of a nation increasingly relies on the collection and analysis of data, energizing the field of government analytics. Federal and state agencies employ quantitative methods to conduct program evaluations. Policy debates overflow with statistical arguments. Campaigns mine voter data to micro-target potential supporters. Data-driven decision making holds the promise of making our government more efficient, more effective, and more responsive to our most critical needs.
The term “big data” is widely used to refer to the exponential growth in the size of datasets. A few years ago, a dataset with a few hundred thousand observations was considered large, particularly in the context of the traditional Neyman-Pearson framework of statistical significance. As the ease of data collection and storage has increased, so has the amount of available data. Computer-assisted programs collect a tremendous amount of data every second of every day. They scrape websites for text documents, track web users’ activity, gather location information from cell phones and cars, store satellite imagery, and collect data on commercial transactions. It is estimated that 2.5 quintillion bytes of data are created each day.
Big data is typically distinguished from other data by three key characteristics. First, big data is voluminous as a result of new data collection instruments and the ability to store the massive amount of collected data. Second, big data has a high velocity, meaning the data are collected and managed at high speeds. Third, big data has variety, meaning the data exist in various formats and structures. Owing to these three characteristics, big data poses unique challenges with respect to data management, analysis, and presentation.
In a governance context, the real value of data is the policy implications we can glean from the results of an analysis. In many instances, data have helped researchers debunk common myths or misperceptions about the true cause of an undesirable outcome. For example, many news commentators bemoan the increase in student debt levels over the past decades under the assumption that the typical student borrower is crushed by debt and hindered from achieving his or her personal and professional goals. The numbers, however, indicate that the increase in student debt is attributable to the increase in educational attainment and that these higher levels of education have led to higher lifetime earnings. In consequence, policy makers should focus less on policies aimed at all student borrowers and more on providing a safety net for the borrowers whose bet on higher education does not yield expected returns. A targeted approach may avoid the unintended consequences of a broad-based approach, which could exacerbate rather than solve the problems of over-borrowing by students and tuition inflation by universities. Accurate analytics regarding who is suffering under the current student loan framework (and why) can help policy makers craft laws and regulations that address the real problem.
For another example of the importance of careful use of analytics, consider the emphasis placed by many researchers on race when analyzing outcomes related to health and education. In the voluminous literature that attempts to estimate the causal effect of race, this characteristic is usually treated as a unidimensional, fixed variable. In short, race is generally considered to be a single (i.e., non-aggregate) measure. From a policy perspective, however, it may be more useful to treat race as a composite variable (like socioeconomic status)—a “bundle of sticks” that includes skin color, dialect, religion, neighborhood, and region of ancestry. The disaggregation of race into its constitutive components allows researchers to identify opportunities for effective policy formulation.
Excerpt from Analytics, Policy, and Governance, edited by Jennifer Bachner, Benjamin Ginsberg, and Kathryn Wagner Hill
Benjamin Ginsberg is David Bernstein Professor of Political Science and chair of the Center for Advanced Governmental Studies at Johns Hopkins. Kathy Wagner Hill is director of the Center for Advanced Governmental Studies at Johns Hopkins. Jennifer Bachner is director of the Master of Science in Government Analytics at Johns Hopkins.