# Datasets

A dataset on Mabyduck represents a structured collection of media files — audio, images, or videos — organized for subjective evaluation. Datasets are the foundation of every experiment: before you can run a study, you need to create and upload a dataset. For a detailed guide on how to structure your media files, see the [Datasets documentation](/datasets/).

## How it works

The API follows a three-step process to create a dataset:

```mermaid
flowchart LR
    A["Create a dataset"] --> B["Upload a file"] --> C["Poll status"] --> D["Dataset ready"]
```

1. **Create a dataset** — send a `POST` request that returns a signed upload URL.
2. **Upload a file** — upload a `.zip` archive or a `.txt` file of URLs to the signed URL.
3. **Poll status** — poll the status endpoint until the dataset is processed and ready.

Once a dataset reaches the `ready` state, it can be attached to experiments.

## Authentication

All API requests require an API key, which you can generate from your [project settings](https://app.mabyduck.com). Include it in the `Authorization` header of every request:

```
Authorization: Api-Key YOUR_API_KEY
```

## Creating a dataset

To create a dataset, send a `POST` request to the create dataset endpoint with a `name` and a `filename`. The `filename` determines how the dataset will be processed:

- **`.zip`** — a zip archive containing your organized media files.
- **`.txt`** — a plain text file with one URL per line, for externally hosted media.
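If your media files already sit in local source folders, you can build the archive with Python's standard library before creating the dataset. A minimal sketch (the `make_dataset_zip` helper and the folder layout are illustrative, not part of the API):

```python
import zipfile
from pathlib import Path


def make_dataset_zip(media_dir: str, out_path: str = "dataset.zip") -> str:
    """Zip a directory of source folders (e.g. source_1/method_a.wav) for upload."""
    media_root = Path(media_dir)
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for file in sorted(media_root.rglob("*")):
            if file.is_file():
                # Store paths relative to the media root, e.g. "source_1/method_a.wav"
                zf.write(file, str(file.relative_to(media_root)))
    return out_path
```

The resulting archive keeps one folder per source, which matches the structure described in the [Datasets documentation](/datasets/).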
```python
import requests

API_KEY = "YOUR_API_KEY"
PROJECT_ID = "YOUR_PROJECT_ID"

headers = {"Authorization": f"Api-Key {API_KEY}"}

response = requests.post(
    f"https://api.mabyduck.com/projects/{PROJECT_ID}/datasets/",
    headers=headers,
    json={"name": "My Dataset", "filename": "dataset.zip"},
)
dataset = response.json()
print(dataset["upload_url"])  # Signed URL for uploading
print(dataset["id"])          # Dataset ID
```

```bash
curl -X POST \
  'https://api.mabyduck.com/projects/{project_id}/datasets/' \
  -H 'Authorization: Api-Key YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{"name": "My Dataset", "filename": "dataset.zip"}'
```

The response includes:

- `id` — the dataset identifier.
- `upload_url` — a time-limited signed URL for uploading your file.
- `status_url` — the URL to check the processing status.

## Uploading the file

Upload your file to the `upload_url` returned in the previous step.

### Uploading a zip archive

A zip archive should contain your media files organized into folders. Each folder represents a *source*, and files within represent different *conditions*. See the [datasets documentation](/datasets/) for details on structuring your files.

```python
with open("dataset.zip", "rb") as f:
    response = requests.put(dataset["upload_url"], data=f)
```

```bash
curl --upload-file dataset.zip '{upload_url}'
```

### Uploading externally hosted files

For media hosted on external URLs (e.g., cloud storage), create your dataset with a `.txt` filename and upload a text file with one URL per line.
The last two path segments of each URL are interpreted as `source/filename`, following the same folder structure as local datasets:

```
https://cdn.example.com/source_1/method_a.wav
https://cdn.example.com/source_1/method_b.wav
https://cdn.example.com/source_2/method_a.wav
https://cdn.example.com/source_2/method_b.wav
```

```python
urls = [
    "https://cdn.example.com/source_1/method_a.wav",
    "https://cdn.example.com/source_1/method_b.wav",
    "https://cdn.example.com/source_2/method_a.wav",
    "https://cdn.example.com/source_2/method_b.wav",
]

# Create a dataset for external URLs
response = requests.post(
    f"https://api.mabyduck.com/projects/{PROJECT_ID}/datasets/",
    headers=headers,
    json={"name": "External Dataset", "filename": "urls.txt"},
)
dataset = response.json()

# Upload the URL list
response = requests.put(dataset["upload_url"], data="\n".join(urls))
```

## Polling the status

After uploading, the dataset is processed server-side (e.g., collecting stimulus durations). Poll the status endpoint until the status changes from `processing` to `ready`:

```python
from time import sleep

while True:
    response = requests.get(dataset["status_url"], headers=headers)
    status = response.json()
    if status["status"] != "processing":
        break
    sleep(1.0)

print(status["status"])  # "ready" or "error"
```

The possible statuses are:

| Status | Description |
| --- | --- |
| `draft` | Dataset created but file not yet uploaded |
| `processing` | File uploaded and being processed |
| `ready` | Processing complete — dataset can be used in experiments |
| `error` | Processing failed (check the `error` field for details) |

## Retrieving stimuli and stimulus groups

Once a dataset is ready, you can retrieve its details to inspect the resulting stimuli and stimulus groups:

```python
response = requests.get(
    f"https://api.mabyduck.com/projects/{PROJECT_ID}/datasets/{dataset['id']}/",
    headers=headers,
)
detail = response.json()

for stimulus in detail["stimuli"]:
    print(stimulus["filepath"], stimulus["group_id"])
```

Each
stimulus has:

- `filepath` — the path of the media file within the dataset (e.g., `source_1/method_a.wav`).
- `group_id` — an identifier grouping stimuli that belong to the same source folder.

Stimuli within the same group correspond to different conditions applied to the same source. In an experiment, stimuli from the same group are evaluated together — for example, as a single MUSHRA slate or a pairwise comparison.

## What's next?

Once your dataset is ready, you can create an experiment. See the [Experiments section](/api/tutorials/experiments/) for a step-by-step guide on creating and launching an experiment using the API.
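As a final illustration, the grouping described above is easy to work with client-side: bucketing the `stimuli` list by `group_id` yields, per source, the condition files that would be evaluated together. A small sketch (the `group_stimuli` helper is illustrative, not part of the API; the field names match the response shown earlier):

```python
from collections import defaultdict


def group_stimuli(stimuli):
    """Bucket stimuli by group_id; each bucket holds the conditions for one source."""
    groups = defaultdict(list)
    for stimulus in stimuli:
        groups[stimulus["group_id"]].append(stimulus["filepath"])
    return dict(groups)
```

Each resulting bucket corresponds to one evaluation unit — for example, one MUSHRA slate.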