Ethical Data Glossary

The Ethical Data Glossary is a living document designed to be used as an educational resource for a broad public audience, including individuals with no prior knowledge of ethical data questions. It focuses on accessible definitions, avoiding technical jargon from data science or legal studies where possible.

Drawing on philosophy, history, and social studies of science, in line with the Ethical Data Initiative’s approach, the Glossary aims to foster understanding and awareness of ethical data concepts. 

Entries should be written in general, clear language. They should use an inclusive tone and have a practical focus as much as possible.

The ideal length of an entry is 100–200 words.  

Entries will link to existing, more technical vocabularies. 



Bias

Bias refers to a characteristic of a system that leads it to treat different elements, such as particular social groups, systematically unequally. Bias is unavoidable, as all human-made systems are biased to some extent, but its outcomes can be unfair or discriminatory when the bias is not explicitly disclosed and mitigated. Such problematic bias can have various causes: a flawed method of data collection, an unequal representation of groups within a dataset, a consistent pattern of error in processing data, or a persistent way of misinterpreting data that accords with racial, ageist, gender-based, or other forms of prejudice or stereotyping about a person or group. In an automated system, bias can be difficult to detect and may be reinforced, since the outcomes of such systems tend to hide any systematic distortion in their decision-making processes. Examples of bias in algorithms and platforms are widespread, including racial bias in algorithmic surveillance and policing, and gender and class bias in algorithms used for hiring. 

Other definitions: 

Big Data

“Big Data” usually refers to datasets that are very large in volume (the number of data inputs), diverse in variety (the different types of data), high in velocity (the speed at which data are generated, processed, and analysed), and subject to variability (inconsistency or fluctuation in data composition and structure), although the term is often used loosely to indicate only some of these features. Storing, handling, and analysing such extensive datasets requires substantial computing power, machine learning functions, and automation. As a result, big data pose many technical and social challenges. For example, their governance demands significant financial and human resources, as well as expertise to organise data so that they can be mobilised and used. Moreover, ensuring privacy and data protection standards has been a key challenge, as tracking the origins of collected data, and the rights attached to them, is complicated. 

Reference: Kitchin, Rob. 2025. Critical Data Studies: An A to Z Guide to Concepts and Methods. Polity Books (Open Access), 20–21. 

Other definitions: 

Curation

Curation refers to a set of practices and processes used to organise and manage data collections in ways that improve their quality, meaning, structure, or findability. Curation highlights how deeply data are embedded in the social contexts in which they are created and used: human judgement, labour, and values are required to organise, maintain, and make data meaningful. Data curation practices may include systematic organisation, selection, formatting, annotation, visualisation, preservation, storage, management strategy, or the provision of access to information. Curation points not only to the many points of interaction between humans and data, but also to how coordination and responsibility for data collection are distributed among users and data workers. Accordingly, curation has been a major concern for knowledge organisations such as research institutions, libraries, state agencies, and public administrations; it has also become better established and more visible as a profession in its own right, requiring specialised training and appropriate remuneration.  

Other definitions: 

Data Responsibility

Responsible data work involves efforts to account for the reasons, goals, and methods used when handling data, and to make such work responsive to social needs and potential implications. In practice, this involves: complying with legal obligations around data collection, management, and use; striving to follow ethical principles of good practice; and engaging in discussions around the scientific and social aspects of data management decisions. For instance, the CARE principles provide an ethical framework for engaging with data sovereignty, while guidance on “responsible AI” addresses the potential consequences of using training data within complex computational systems. Working responsibly with data means considering the potential for harm that could follow from choices made in collecting, storing, or interpreting data, and then attempting to mitigate those harms. For example, individuals can be harmed when their personal data are scraped from online platforms without their knowledge or consent; harm can also arise from using data produced by low-resourced groups without proper attribution; and energy- and material-intensive AI systems can harm the global environment, as well as communities affected by resource extraction. 

Other definitions: 

Raw Data

Data are usually described as “raw” when they have emerged from some process of acquisition or recording but have not yet been processed (e.g. computationally) for further use. The term “raw” is often taken to mean that the data are somehow natural, essential, or objective, in the sense of being free from human intervention. This view, however, is highly contested. From the perspective of Data Studies, data can never be raw in this sense, because gathering and recording data always entails choices and assumptions made by humans: decisions about what data to gather, what techniques and apparatus to use, and what format to use (numbers, images, text, etc.; stored digitally, in a notebook, and so on).  

Other definitions: 

Self-tracking

Self-tracking refers to the practice of measuring and recording aspects of our bodies, habits, or behaviours, and turning them into data that can be analysed, reflected upon, or shared. Although digital technologies have made it possible to measure more things, more quickly, self-tracking is not new. Historical studies show that self-measurement was (and still is) associated with notions of (self-)improvement and progress. In the past, individuals tracked not only their bodily health (or ill health) but even their moral behaviour; Benjamin Franklin, for example, kept a record of his sins.

Technological developments mean we can now measure aspects of ourselves in everyday life, such as steps, sleep quality, and blood pressure, more readily and more frequently than in the past. This can be seen as empowering; however, the increased accessibility of measurement and the advent of real-time tracking raise concerns about self-optimisation and how we think about our lives, bodies, and responsibilities. We may believe we can control these, but they are also shaped by outside influences and circumstances, notably broader ideas and standards about how we should live, what an “optimal” or “normal” person looks like, how such a person can be measured, and the expectation that we should strive to become this version of ourselves. The very act of recording activities and states of being, coupled with a lack of control over what is being monitored, may adversely affect health, for instance through excessive emphasis on diet. Moreover, self-tracking may become a form of surveillance, as we often have little information about, and control over, how our data are stored, analysed, and potentially reused by commercial companies.

Further sources:

Glossary contributors: Ana Sofia Acevedo Perez; Kenzi El Shaer; Kim Hajek; Sabina Leonelli; Paul Trauttmansdorff.