It's time to wrap up NDIC17! What did we learn? What surprised you? What do we want to see more of? We'll use this time to solicit feedback and write a document that summarizes our conference experience. The exact schedule will be determined Friday afternoon, unconference style!
There’s a story in every data point, maybe even a big one
About a decade ago, it was rare to find someone in a newsroom who specialized in data. Today, you’ll find data journalists, computer-assisted reporters, or as some lovingly call them, geeks, in most major newsrooms. Yet data in news is as old as stock numbers and unemployment figures. What’s different now is that journalists, and therefore the public, are increasingly accessing these raw files and crunching the numbers with their own queries. Legal battles to get public data in a useful format aside, newsrooms are increasingly comfortable with mining data, mapping it, identifying methodological concerns, creating visualizations and, of course, translating it all into plain English for their audience. Data may answer (or start to answer) questions like “How many?” “How often?” “How much?” “Who?” The findings may identify, for instance, discrimination that officials deny is happening. They may shine a light on a dire safety situation that no one else, even the bosses who oversee the data, knew about. They can confirm and underscore what sources say anecdotally is happening. In every data set – from car crashes to arrest records to soldiers kicked out of the military for misconduct – there is a story, and not just the story of the people behind each point of data. Data can bring a reporter to a person who is representative of an important trend. In this way, journalists put a face and a narrative on data that audiences can understand.
Consumers to Citizens: Responsibility and Autonomy in Data Aggregation
The architecture of the Web and the predominant approaches of Big Data are inherently commercial in nature, but as distinctions between virtual and actual human communities continue to blur, the way we conceive of the digital self must evolve. We are arguably in a period of adolescence, relatively speaking, regarding our digital lives, and we are recognizing patterns of moral development in this realm that are analogous to those of our actual lives. In moving from structures of self-focus and economic exploitation to those of community building and engagement, some key principles of moral agency must serve as critical guides. Platform models for marketing and news delivery will be increasingly called upon to account both for their impacts on the development of the self and on community efficacy. This will mean taking seriously the notions of autonomous agency, communitarian values and human dignity. It will require more comprehensive policy discussions on the nature of harm in the digital world, the distinction between knowledge and information, and the need for a universal right to “be forgotten.”
Understanding the unique ethical challenges of data collection faced by social scientists
From the perspective of social scientists, “data” often exists as the words of the people with whom we interact. As a result, social scientists are frequently challenged not only with collecting data to inform their studies, but with doing so under less than ideal social, political, and/or economic conditions. This talk will detail the experiences and perspectives of a mixed-methods social scientist who has collected data under such conditions. Specifically, it will examine the challenges faced when gathering rich, qualitative data from coal-dependent communities in eastern Kentucky and southern West Virginia about a divisive form of coal extraction known as mountaintop removal. In addition to presenting the ethical dilemmas encountered in such studies, this talk will also engage topics such as consent, privacy, and researcher/subject power dynamics.
Data Integrity, Research Misconduct and the Impact of the Lab Environment
The mission of the U.S. Department of Health and Human Services’ Office of Research Integrity (ORI) is to promote the integrity of Public Health Service (PHS)-funded research, which it does through its education arm (Division of Education and Integrity) and its compliance arm (Division of Investigative Oversight). ORI reviews institutional research misconduct investigations into allegations of falsification, fabrication, and plagiarism in proposing, performing, or reviewing research, or in reporting research results. Through this work, ORI is presented with cases of research misconduct in which a clear nexus among the lack of data integrity, poor mentoring and/or laboratory management, and allegations of research misconduct is evident. Review of the primary research record, including data spreadsheets, in many of these cases reveals remarkably poor record keeping, data validation, and oversight. Our analysis of ORI’s research misconduct findings suggests that graduate students, postdoctoral fellows, and research associates are at higher risk for engaging in research misconduct, which emphasizes the importance of the laboratory training environment. The 2017 National Academies of Sciences, Engineering, and Medicine report on Fostering Integrity in Research provides institutions and federal agencies with a number of recommendations for addressing these important issues.
During the last half day of the conference, we'll be working on putting together a document about the themes and concepts that arose during the conference. Now that we've seen a day's worth of talks, let's have a preliminary discussion about what stood out.
- Brief introductory remarks
- Identify common themes
The Secondary Cities Initiative: Ensuring Data Integrity for Complex Geospatial Projects
Data integrity is essential for data sharing and access by multiple users. The Secondary Cities (2C) Initiative is a field-based initiative to map resiliency, human security, and emergency preparedness. Secondary cities are fast-growing urban areas experiencing unplanned growth and development. These cities are unique environments that provide regional services and are hubs of governance, transportation, and/or economies. Secondary cities are generally poorly mapped and underexamined. The 2C Initiative spans the globe with 10 projects demonstrating partnerships between universities, government, and non-governmental organizations (NGOs), promoting participatory mapping and data collection using open source tools, software, and platforms. Critical to this project is data integrity built upon sound scientific approaches, a comprehensive ethics framework, a common metadata structure, and quality assurance/quality control practices. This presentation will describe the context and implementation of ensuring data integrity in the 2C Initiative. Examples from 2C projects are provided that demonstrate the application of data integrity practices and innovation.
3 speakers, 15 minutes each, totally badass
Carolyn Broccardo - RePAIR Consensus Guidelines: Responsibilities of Publishers, Agencies, Institutions and Researchers in Protecting the Integrity of the Research Record
As scientists, we must be acutely aware of the societal impacts of our research data. When the public perception of the research process begins to falter, we have failed in our duty to clearly and honestly convey our work. Furthermore, researchers, institutions, agencies, and publishers or editors have complementary roles and responsibilities in maintaining the integrity of the research record—the very basis for our research communication. However, it can be exceedingly difficult to bring together the key players in this huge research enterprise. In order to bring clarity to this process, we generated guidelines to outline key stakeholder responsibilities when questions arise regarding the integrity of the research record, including possible research or publication misconduct. We identify common barriers to communication as well as potential solutions. The goal of this document is to foster effective communication at all levels of the research life cycle, extending from the researcher to the funding agency.
Wladimir Labeikovsky - A look at data management in a peer-to-peer future
We’ve turned the promise of the web as a platform for sharing scientific data and literature in a decentralized, robustly democratic fashion into a bazaar of largely non-interoperable silos controlled mostly by rent-seeking, scale-obsessed concerns. What would it look like if we redecentralized the web, or at least the parts of the web where we exchange scientific information? I’ll present a high-level, whirlwind tour of the current projects and applications (e.g. dat, IPFS and their brethren) that aim to rebuild the web in this fashion. How does a decentralized web augment efforts in data sharing and preservation? What are the pitfalls and costs of going peer-to-peer? How can data librarians and researchers contribute to get it right?
Cat Bens - Bias Control: Protecting Us From Ourselves
Not too long ago, Francis Bacon published his philosophical work Novum Organum Scientiarum (‘new instrument of science’), which advanced our understanding of the scientific method by focusing on empirical investigation. He noted that as humans we are programmed to pay more attention to evidence that agrees with our preconceptions and to reject evidence that doesn’t. If we want to learn more about the universe, he argued, we need to take this inherent tendency toward bias into account in the design of scientific experiments and focus on empirical investigation. It’s been 400 years since Novum Organum, and we as researchers still struggle to prevent non-random error from influencing our research planning, conduct, analysis, and reporting. This presentation will provide a quick reminder of the real danger of research bias and go over some of the techniques we can use to control or limit bias in research.
Sharing in a Gray Area: A Framework for Big Data Curation
Our world is data-driven. "Big data" can create insights to inform all kinds of efforts—from business strategy and predicting markets, to producing media stories, to academic research. For the purpose of this talk, we define "big data" as data about people that is collected as a corollary to other services. Examples include shopping behavior tracked by retailers, clickstream data tracked by websites, geospatial data generated from people's mobile devices, and social media posts. As big data is increasingly used to conduct research, so is it increasingly shared publicly—in response to requirements from funding agencies and academic journals, or simply in the spirit of Open Data. While the research community has developed guidelines for anonymization and sharing of human subjects data, anonymization and sharing of big data continues to inhabit a gray area. Is big data human subjects data, or is it already-existing data that can be shared freely? Our talk proposes a framework that will draw on case studies from the Dryad Repository to help guide data curators through ethical inquiry when assessing big data for the purpose of public archiving. The framework explores factors such as the sensitivity of the research being conducted; the context in which the data was collected; and the expectations of the users whose lives and actions constitute big data. This framework is not meant to provide hard and fast rules, but rather aims to improve practice and minimize risk for all humans involved in the open data ecosystem.
Data, data, everywhere: Reflections on data, ethics, and human community
Like Coleridge’s Ancient Mariner, we find ourselves surrounded by a sea (of data). The decisions we make as a human community about the collection, analysis, and subsequent use of data to inform both institutional decisions and public policy rest on presuppositions about what it means to be human and the extent to which data can inform our understanding of human community. It comes as no surprise that data use and misuse can have a profound impact on human community; rightly approached, data have the potential to inform public policy on issues ranging from educational practices, health information and public health surveillance, and behavioral economics, to virtually any aspect of human community. The human side of data seeks to remind us of those individuals from whom data are obtained as well as those who can be impacted by the interpretation and policy implications of data. Amidst this, one might suggest several “first principles” as reminders of the ethical landscape in which we operate:
1) A consistent moral anthropology that understands that no amount of data can wholly capture the human condition; data may help inform, but cannot exhaust, what it means to be human or to reside in human community. We must not allow data to blind us to the question of “being human.”
2) A consistent epistemology that understands that data per se are not knowledge (or “wisdom”), and that the exercises of data collection, data analytics, and the use of data to inform public policy are inherently moral exercises that can impact both communities and individuals within a community.
3) An operational transparency: we must, as a society, have nothing to hide about data collection and use.
4) An intellectual humility that understands both the benefits and limits of data: we must be aware that we may (arguably, inescapably will) be wrong about aspects of data collection and use, and must be willing to be corrected when we err.
Reflection about the place of data within human community, then, engages our deliberative, reflective, and prudential capacity in pursuit of how we ought to order our lives together, an exercise that virtually defines human community. In this sense, our approach to the ethics of data (even BIG data) remains an exercise of moral agency embedded within human community.
How Do We Measure Value in Data Reuse? Nuancing Ethical Data Sharing and Attribution for the Social Sciences and Indigenous Communities
As a result of the ‘open data’ movement, how data and information should be attributed and cited has become increasingly important. As data are reused in scientific analyses or decision-making, efforts have turned to crediting the data creator through mechanisms such as data citation and metrics of reuse, in order to ensure appropriate attribution to the original author. This increased focus on metrics and citation, however, needs to be carefully considered when it comes to social science data, local observations, and Indigenous Knowledge held by Indigenous communities. These diverse and sometimes sensitive data/information/knowledge sets often require deep nuance, thought, and compromise within the ‘open data’ framework, in order to address the confidentiality of research subjects and the ownership of data and information, often in a colonial context. These data and knowledge sets have been used to further specific causes that may not be beneficial to those who created and hold the knowledge. Furthermore, these datasets are often highly valuable to one or two villages, saving lives and retaining culture within them. In these cases, quantitative metrics of “data reuse” and citation do not adequately measure a dataset’s ‘value.’ In this talk, I will provide examples of datasets that are highly valuable to small communities from my research in the Arctic and US Southwest, and critically approach issues of data ownership and sovereignty when it comes to data sharing and reuse. Many of these datasets are not highly cited and do not have impressive quantitative metrics (e.g., number of downloads), but they have been incredibly valuable to the communities where the data/information/knowledge are held. These case studies include atlases of placenames held by elders in small Arctic communities, as well as databases of local observations of wildlife and sea ice in Alaska that are essential for sharing knowledge across multiple villages.
These examples suggest that a more nuanced approach to how data should be credited is needed when working with social science data and Indigenous Knowledge.