# Datasets

A dataset on Mabyduck represents a structured collection of media files — audio, images, or videos — organized for subjective evaluation. Datasets are the foundation of every experiment: before you can run a study, you need to create and upload a dataset. For a detailed guide on how to structure your media files, see the [Datasets documentation](/datasets/).

## How it works

The API follows a three-step process to create a dataset:

```mermaid
flowchart LR
    A["Create a dataset"] --> B["Upload a file"] --> C["Poll status"] --> D["Dataset ready"]
```

1. **Create a dataset** — send a `POST` request that returns a signed upload URL.
2. **Upload a file** — upload a `.zip` archive or a `.txt` file of URLs to the signed URL.
3. **Poll status** — poll the status endpoint until the dataset is processed and ready.

Once a dataset reaches the `ready` state, it can be attached to experiments.

## Authentication

All API requests require an API key, which you can generate from your [project settings](https://app.mabyduck.com). Include it in the `Authorization` header of every request:

```
Authorization: Api-Key YOUR_API_KEY
```

## Creating a dataset

To create a dataset, send a `POST` request to the create dataset endpoint with a `name` and a `filename`. The `filename` determines how the dataset will be processed:

- **`.zip`** — a zip archive containing your organized media files.
- **`.txt`** — a plain text file with one URL per line, for externally hosted media.
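If your media files already sit in local source folders, you can build the archive with Python's standard library before creating the dataset. A minimal sketch (the `make_dataset_zip` helper and the folder layout are illustrative, not part of the API):

```python
import zipfile
from pathlib import Path


def make_dataset_zip(media_dir: str, out_path: str = "dataset.zip") -> str:
    """Zip a directory of source folders (e.g. source_1/method_a.wav) for upload."""
    media_root = Path(media_dir)
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for file in sorted(media_root.rglob("*")):
            if file.is_file():
                # Store paths relative to the media root, e.g. "source_1/method_a.wav"
                zf.write(file, str(file.relative_to(media_root)))
    return out_path
```

The resulting archive keeps one folder per source, which matches the structure described in the [Datasets documentation](/datasets/).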
```python
import requests

API_KEY = "YOUR_API_KEY"
PROJECT_ID = "YOUR_PROJECT_ID"

headers = {"Authorization": f"Api-Key {API_KEY}"}

response = requests.post(
    f"https://api.mabyduck.com/projects/{PROJECT_ID}/datasets/",
    headers=headers,
    json={"name": "My Dataset", "filename": "dataset.zip"},
)
dataset = response.json()
print(dataset["upload_url"])  # Signed URL for uploading
print(dataset["id"])          # Dataset ID
```

```bash
curl -X POST \
  'https://api.mabyduck.com/projects/{project_id}/datasets/' \
  -H 'Authorization: Api-Key YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{"name": "My Dataset", "filename": "dataset.zip"}'
```

The response includes:

- `id` — the dataset identifier.
- `upload_url` — a time-limited signed URL for uploading your file.
- `status_url` — the URL to check the processing status.

## Uploading the file

Upload your file to the `upload_url` returned in the previous step.

### Uploading a zip archive

A zip archive should contain your media files organized into folders. Each folder represents a *source*, and files within represent different *conditions*. See the [datasets documentation](/datasets/) for details on structuring your files.

```python
with open("dataset.zip", "rb") as f:
    response = requests.put(dataset["upload_url"], data=f)
```

```bash
curl --upload-file dataset.zip '{upload_url}'
```

### Uploading externally hosted files

For media hosted on external URLs (e.g., cloud storage), create your dataset with a `.txt` filename and upload a text file with one URL per line.
The last two path segments of each URL are interpreted as `source/filename`, following the same folder structure as local datasets:

```
https://cdn.example.com/source_1/method_a.wav
https://cdn.example.com/source_1/method_b.wav
https://cdn.example.com/source_2/method_a.wav
https://cdn.example.com/source_2/method_b.wav
```

```python
urls = [
    "https://cdn.example.com/source_1/method_a.wav",
    "https://cdn.example.com/source_1/method_b.wav",
    "https://cdn.example.com/source_2/method_a.wav",
    "https://cdn.example.com/source_2/method_b.wav",
]

# Create a dataset for external URLs
response = requests.post(
    f"https://api.mabyduck.com/projects/{PROJECT_ID}/datasets/",
    headers=headers,
    json={"name": "External Dataset", "filename": "urls.txt"},
)
dataset = response.json()

# Upload the URL list
response = requests.put(dataset["upload_url"], data="\n".join(urls))
```

## Polling the status

After uploading, the dataset is processed server-side (e.g., collecting stimulus durations). Poll the status endpoint until the status changes from `processing` to `ready`:

```python
from time import sleep

while True:
    response = requests.get(dataset["status_url"], headers=headers)
    status = response.json()
    if status["status"] != "processing":
        break
    sleep(1.0)

print(status["status"])  # "ready" or "error"
```

The possible statuses are:

| Status | Description |
| --- | --- |
| `draft` | Dataset created but file not yet uploaded |
| `processing` | File uploaded and being processed |
| `ready` | Processing complete — dataset can be used in experiments |
| `error` | Processing failed (check the `error` field for details) |

## Retrieving stimuli and stimulus groups

Once a dataset is ready, you can retrieve its details to inspect the resulting stimuli and stimulus groups:

```python
response = requests.get(
    f"https://api.mabyduck.com/projects/{PROJECT_ID}/datasets/{dataset['id']}/",
    headers=headers,
)
detail = response.json()

for stimulus in detail["stimuli"]:
    print(stimulus["filepath"], stimulus["group_id"])
```

Each
stimulus has:

- `filepath` — the path of the media file within the dataset (e.g., `source_1/method_a.wav`).
- `group_id` — an identifier grouping stimuli that belong to the same source folder.

Stimuli within the same group correspond to different conditions applied to the same source. In an experiment, stimuli from the same group are evaluated together — for example, as a single MUSHRA slate or a pairwise comparison.

## What's next?

Once your dataset is ready, you can create an experiment. See the [Experiments section](/api/tutorials/experiments/) for a step-by-step guide on creating and launching an experiment using the API.
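As a final illustration, the grouping described above is easy to work with client-side: bucketing the `stimuli` list by `group_id` yields, per source, the condition files that would be evaluated together. A small sketch (the `group_stimuli` helper is illustrative, not part of the API; the field names match the response shown earlier):

```python
from collections import defaultdict


def group_stimuli(stimuli):
    """Bucket stimuli by group_id; each bucket holds the conditions for one source."""
    groups = defaultdict(list)
    for stimulus in stimuli:
        groups[stimulus["group_id"]].append(stimulus["filepath"])
    return dict(groups)
```

Each resulting bucket corresponds to one evaluation unit — for example, one MUSHRA slate.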