Handmade Dataset

Forthcoming chapter about Handmade Datasets in Decentering Ethics: AI Art as Method to be published by the end of 2025 by Open Humanities Press.

Excerpt Below:

There is pressure to collect more data, and to collect it more quickly, when working with Machine Learning. Training a Generative Adversarial Network (GAN), a type of machine learning model that generates output (e.g. images or text) by emulating its training data, can require thousands of images, while training a Diffusion model, a newer type of generative ML model, can require billions of images. Many Machine Learning models are trained on data that is scraped from the internet and processed by click-workers. Data is often scraped without the consent or knowledge of those who produce it. For example, IBM's Diversity in Faces dataset, a dataset of face images and related annotations, sourced its images from Flickr, a photo-sharing website popular in the mid-2010s, without the explicit permission or notification of photographers or their subjects (Harvey and LaPlace, "IBM Diversity in Faces," n.d.). IBM is currently facing a class action lawsuit alleging violation of the Illinois Biometric Information Privacy Act (Rizzi, "Class Action Accuses IBM," 2023).

The click-workers who are typically employed to annotate and process scraped data, whether through crowd-sourcing platforms like Amazon Mechanical Turk and Figure Eight or, in some cases, job-training initiatives like M2Work (a collaboration between Nokia and The World Bank), are often subject to difficult working conditions. Adrienne Williams, Milagros Miceli and Timnit Gebru write in their 2022 article "The Exploited Labor Behind Artificial Intelligence," "Data labeling interfaces have evolved to treat crowdworkers like machines, often prescribing them highly repetitive tasks, surveilling their movements and punishing deviation through automated tools" (Williams, Miceli, and Gebru, "The Exploited Labor Behind Artificial Intelligence," 2022). And as artist Mimi Onuoha points out in her work The Future is Here!, a video piece revealing the domestic spaces where this type of micro-work or gig-work is actually carried out, despite the 'future-facing' rhetoric around Machine Learning, click-work "rests on a history of labor that is pretty predictable and is actually in line with a lot of historic ways in which we’ve seen labor used" (Onuoha, "The Future is Here! An interview with Mimi Onuoha," 2022). In his 2021 article "Refugees help power machine learning advances at Microsoft, Facebook, and Amazon," researcher Phil Jones describes M2Work, writing, "dedicated to 'job creation' in the Global South, the World Bank undoubtedly sees Palestine’s 30% unemployment rate as an unmissable opportunity — an untapped source of cheap labor, readily brought into the sphere of global capital by the great telecom networks on which our brave “new economy” rests" (Jones, "Refugees, Machine Learning, and Big Tech," 2021).

I believe that handmade datasets can push back on pressures to scale and outsource labor through the inherent slowness involved in their making. Their relatively small size allows collectors and creators to fully review each datapoint, spending meaningful time learning about the content of their dataset. The slowness of the process allows one to properly set intentions, to consider issues of consent and ownership, and to plan for stewardship of the dataset as an archive.

While building a dataset and training a model from scratch is certainly not feasible or desirable in all cases, I've personally found going through this tedious process to be enlightening and critical to both the artwork I create and my own understanding of the data I am working with.



