A dataset on Mabyduck represents a structured collection of media files — audio, images, or videos — organized for subjective evaluation. Datasets are the foundation of every experiment: before you can run a study, you need to create and upload a dataset.

For a detailed guide on how to structure your media files, see the Datasets documentation.

How it works

The API follows a three-step process to create a dataset:

  1. Create a dataset — send a POST request that returns a signed upload URL.
  2. Upload a file — upload a .zip archive or a .txt file of URLs to the signed URL.
  3. Poll status — poll the status endpoint until the dataset is processed and ready.

Once a dataset reaches the ready state, it can be attached to experiments.
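The three steps can be sketched end to end. This is an illustrative helper, not part of the API client: the function name `upload_dataset` and the injectable `http` parameter are assumptions made for the sketch, and authentication uses the API key described below.

```python
import os

def upload_dataset(project_id, name, filepath, api_key, http=None):
    """Sketch of the three-step flow. `http` is an illustrative seam
    for testing; it defaults to the requests module."""
    if http is None:
        import requests
        http = requests
    headers = {"Authorization": f"Api-Key {api_key}"}
    # 1. Create the dataset; the response carries a signed upload URL.
    created = http.post(
        f"https://api.mabyduck.com/projects/{project_id}/datasets/",
        headers=headers,
        json={"name": name, "filename": os.path.basename(filepath)},
    ).json()
    # 2. Upload the local file to the signed URL.
    with open(filepath, "rb") as f:
        http.put(created["upload_url"], data=f)
    # 3. The caller then polls created["status_url"] until processing finishes.
    return created
```

Each step is covered in detail in the sections that follow.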

Authentication

All API requests require an API key, which you can generate from your project settings. Include it in the Authorization header of every request:

Authorization: Api-Key YOUR_API_KEY

Creating a dataset

To create a dataset, send a POST request to the create dataset endpoint with a name and a filename. The filename determines how the dataset will be processed:

  • .zip — a zip archive containing your organized media files.
  • .txt — a plain text file with one URL per line, for externally hosted media.

import requests

API_KEY = "YOUR_API_KEY"
PROJECT_ID = "YOUR_PROJECT_ID"

headers = {"Authorization": f"Api-Key {API_KEY}"}

response = requests.post(
    f"https://api.mabyduck.com/projects/{PROJECT_ID}/datasets/",
    headers=headers,
    json={"name": "My Dataset", "filename": "dataset.zip"},
)

dataset = response.json()
print(dataset["upload_url"])  # Signed URL for uploading
print(dataset["id"])          # Dataset ID

curl -X POST \
    'https://api.mabyduck.com/projects/{project_id}/datasets/' \
    -H 'Authorization: Api-Key YOUR_API_KEY' \
    -H 'Content-Type: application/json' \
    -d '{"name": "My Dataset", "filename": "dataset.zip"}'

The response includes:

  • id — the dataset identifier.
  • upload_url — a time-limited signed URL for uploading your file.
  • status_url — the URL to check the processing status.

Uploading the file

Upload your file to the upload_url returned in the previous step.

Uploading a zip archive

A zip archive should contain your media files organized into folders. Each folder represents a source, and files within represent different conditions. See the datasets documentation for details on structuring your files.

with open("dataset.zip", "rb") as f:
    response = requests.put(dataset["upload_url"], data=f)

curl --upload-file dataset.zip '{upload_url}'

Uploading externally hosted files

For media hosted on external URLs (e.g., cloud storage), create your dataset with a .txt filename and upload a text file with one URL per line. The last two path segments of each URL are interpreted as <source>/<condition>, following the same folder structure as local datasets:

https://cdn.example.com/source_1/method_a.wav
https://cdn.example.com/source_1/method_b.wav
https://cdn.example.com/source_2/method_a.wav
https://cdn.example.com/source_2/method_b.wav

urls = [
    "https://cdn.example.com/source_1/method_a.wav",
    "https://cdn.example.com/source_1/method_b.wav",
    "https://cdn.example.com/source_2/method_a.wav",
    "https://cdn.example.com/source_2/method_b.wav",
]

# Create a dataset for external URLs
response = requests.post(
    f"https://api.mabyduck.com/projects/{PROJECT_ID}/datasets/",
    headers=headers,
    json={"name": "External Dataset", "filename": "urls.txt"},
)
dataset = response.json()

# Upload the URL list
response = requests.put(dataset["upload_url"], data="\n".join(urls))
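
For illustration, the last-two-segment rule can be mirrored client-side to sanity-check a URL list before uploading it. The helper `source_and_condition` is hypothetical, not part of the API:

```python
from urllib.parse import urlsplit

def source_and_condition(url):
    # Illustrative helper: apply the server-side rule that the last
    # two path segments of a URL become <source>/<condition>.
    segments = urlsplit(url).path.strip("/").split("/")
    return segments[-2], segments[-1]
```

Running it over the example list above yields `("source_1", "method_a.wav")` for the first URL, so all four files resolve to two sources with two conditions each.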

Polling the status

After uploading, the dataset is processed server-side (e.g., collecting stimulus durations). Poll the status endpoint until the status is no longer processing:

from time import sleep

while True:
    response = requests.get(dataset["status_url"], headers=headers)
    status = response.json()

    if status["status"] != "processing":
        break

    sleep(1.0)

print(status["status"])  # "ready" or "error"

The possible statuses are:

  • draft — dataset created but file not yet uploaded.
  • processing — file uploaded and being processed.
  • ready — processing complete; the dataset can be used in experiments.
  • error — processing failed (check the error field for details).
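
In practice you may want the polling loop to bound its runtime and surface the error status explicitly. A minimal sketch, assuming a `fetch_status` callable that wraps the GET request to the dataset's status_url (the function names and timeout values here are illustrative):

```python
from time import monotonic, sleep

def wait_until_ready(fetch_status, timeout=600.0, interval=1.0):
    """Poll until the dataset leaves 'processing'; raise on error or timeout.
    fetch_status() should return the parsed status JSON."""
    deadline = monotonic() + timeout
    while monotonic() < deadline:
        status = fetch_status()
        if status["status"] == "ready":
            return status
        if status["status"] == "error":
            raise RuntimeError(f"Dataset processing failed: {status.get('error')}")
        sleep(interval)
    raise TimeoutError("Dataset did not become ready in time")
```

Injecting `fetch_status` keeps the retry logic separate from the HTTP call, which also makes the loop easy to test without a network.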

Retrieving stimuli and stimulus groups

Once a dataset is ready, you can retrieve its details to inspect the resulting stimuli and stimulus groups:

response = requests.get(
    f"https://api.mabyduck.com/projects/{PROJECT_ID}/datasets/{dataset['id']}/",
    headers=headers,
)
detail = response.json()

for stimulus in detail["stimuli"]:
    print(stimulus["filepath"], stimulus["group_id"])

Each stimulus has:

  • filepath — the path of the media file within the dataset (e.g., source_1/method_a.wav).
  • group_id — an identifier grouping stimuli that belong to the same source folder.

Stimuli within the same group correspond to different conditions applied to the same source. In an experiment, stimuli from the same group are evaluated together — for example, as a single MUSHRA slate or a pairwise comparison.
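
To see the groups explicitly, the stimulus list can be bucketed by group_id. The helper `by_group` is an illustrative client-side convenience, not an API endpoint:

```python
from collections import defaultdict

def by_group(stimuli):
    # Collect filepaths per group_id so each group's conditions
    # can be inspected side by side.
    groups = defaultdict(list)
    for stimulus in stimuli:
        groups[stimulus["group_id"]].append(stimulus["filepath"])
    return dict(groups)
```

Applied to the example dataset above, each group would contain one filepath per condition (e.g., method_a.wav and method_b.wav for the same source).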

What's next?

Once your dataset is ready, you can create an experiment. See the Experiments section for a step-by-step guide on creating and launching an experiment using the API.