An experiment on Mabyduck defines a subjective evaluation study — the type of test, the user interface, and the data to be evaluated. Each experiment is associated with one or more datasets and can have multiple jobs that define who the raters are and how much data each rater evaluates.

For a detailed guide on experiment types and configuration, see the Experiments documentation.

Key concepts

  • Experiment — defines the type of study (e.g., MUSHRA, pairwise comparison) and the datasets to evaluate.
  • Job — defines the rater pool, session count, and sampling strategy. One experiment can have multiple jobs.
  • Session — created when a rater participates in a study.
  • Slate — a group of stimuli evaluated together (e.g., one MUSHRA screen).
  • Stimulus — a single media item presented for evaluation within a slate.
  • Rating — a score assigned to a stimulus within a slate.

Workflow

The typical workflow for running an experiment via the API is:

  1. Create a dataset — upload your media files (see the Datasets section).
  2. Create an experiment — define the experiment type and attach datasets.
  3. Create a job — set up the rater pool, number of sessions, and strategy.
  4. Get costs — retrieve a cost estimate and a confirmation token.
  5. Launch — start the study using the cost token.
  6. Retrieve results — get slates, ratings, and computed metrics.

Authentication

All API requests require an API key, which you can generate from your project settings. Include it in the Authorization header of every request:

import requests

API_KEY = "YOUR_API_KEY"
PROJECT_ID = "YOUR_PROJECT_ID"

headers = {"Authorization": f"Api-Key {API_KEY}"}
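If you make many calls, you can attach the header to a `requests.Session` once so every request sent through it carries the key automatically. A small variant of the snippet above:

```python
import requests

API_KEY = "YOUR_API_KEY"
PROJECT_ID = "YOUR_PROJECT_ID"

# A Session re-uses the underlying connection and applies its
# default headers to every request made through it.
session = requests.Session()
session.headers.update({"Authorization": f"Api-Key {API_KEY}"})

# Requests made via `session` now carry the Authorization header, e.g.:
# session.get(f"https://api.mabyduck.com/projects/{PROJECT_ID}/experiments/")
```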

Creating an experiment

Prerequisite: You need at least one dataset in the ready state before creating an experiment. See the Datasets section for instructions on creating a dataset via the API.

To create an experiment, send a POST request with the experiment configuration. The easiest way to build the configuration JSON is to first set up an experiment through the Mabyduck UI and copy its configuration. See also the experiments API reference for details on the available fields.

DATASET_ID = "YOUR_DATASET_ID"

response = requests.post(
    f"https://api.mabyduck.com/projects/{PROJECT_ID}/experiments/",
    headers=headers,
    json={
        "name": "MUSHRA pilot study",
        "type": "mushra",
        "language": "en",
        "datasets": [DATASET_ID],
        "training_datasets": [],
        "config": {
            "mushra": {
                "labels": ["Excellent", "Good", "Fair", "Poor", "Bad"],
                "showReference": True,
                "hiddenReference": True,
                "showWaveform": "reference",
            }
        },
        "title": "",
        "question": "",
        "description": "",
        "introduction": "",
    },
)

experiment = response.json()
print(experiment["id"])

The type field determines the experiment type. Available types include mushra, acr_audio, acr_image, acr_video, pairwise_audio, pairwise_image, pairwise_video, binary_audio, binary_image, binary_video, and more.

The config field is specific to each experiment type. It controls the user interface — labels, reference display, interaction settings, and so on. The title, question, and description fields control the instructions displayed at the top of the experiment screen.
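The snippets in this guide omit error handling for brevity. In practice it is worth failing fast on non-2xx responses before reading the body. A minimal sketch; `check_response` is our own helper, not part of the API:

```python
import requests

def check_response(response: requests.Response) -> dict:
    """Raise an HTTPError for 4xx/5xx responses, else return the JSON body."""
    response.raise_for_status()
    return response.json()

# Usage:
# experiment = check_response(
#     requests.post(url, headers=headers, json=payload)
# )
```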

Creating a job

A single experiment can have multiple jobs, each targeting a different rater pool. Before creating a job, retrieve the available rater pools:

response = requests.get(
    f"https://api.mabyduck.com/projects/{PROJECT_ID}/experiments/{experiment['id']}/rater_pools/",
    headers=headers,
)
rater_pools = response.json()

The available rater pools can be seen on the rater pools API reference page. Each rater pool has the following structure:

{
    "id": "string",
    "kind": 0,
    "name": "string",
    "label": "string"
}
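Rather than hard-coding `rater_pools[0]`, you can select a pool by name. A small helper that assumes only the fields shown above (the pool name "crowd" in the usage comment is hypothetical):

```python
def find_rater_pool(rater_pools: list[dict], name: str) -> dict:
    """Return the first rater pool whose name matches, or raise if absent."""
    for pool in rater_pools:
        if pool["name"] == name:
            return pool
    raise ValueError(f"No rater pool named {name!r}")

# pool = find_rater_pool(rater_pools, "crowd")
```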

Then create a job with the desired rater pool and parameters:

response = requests.post(
    f"https://api.mabyduck.com/projects/{PROJECT_ID}/experiments/{experiment['id']}/jobs/",
    headers=headers,
    json={
        "rater_pool_id": rater_pools[0]["id"],
        "num_sessions": 10,
        "num_comparisons": 5,
        "num_training": 2,
        "max_repetitions": 1,
        "min_rest_time": 1,
        "strategy": "randomized",
        "note": "",
    },
)

job = response.json()
print(job["id"])

Key parameters:

  • num_sessions — how many times the task is performed (by different raters).
  • num_comparisons — how many slates each rater evaluates per session.
  • num_training — training examples shown before the actual evaluation.
  • strategy — how stimuli are sampled: randomized, lexicographic, active, or neighbor. See Strategies for details.
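As a rough sizing check, and assuming each session yields `num_comparisons` evaluated slates (training slates excluded), the job above collects:

```python
num_sessions = 10
num_comparisons = 5

# Each session evaluates `num_comparisons` slates, so the job above
# should yield roughly this many rated slates in total:
total_slates = num_sessions * num_comparisons
print(total_slates)  # 50
```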

Getting job costs

Before launching, retrieve a cost estimate for the job. This also returns a time-limited token needed to confirm the launch:

response = requests.get(
    f"https://api.mabyduck.com/projects/{PROJECT_ID}/experiments/{experiment['id']}/jobs/{job['id']}/costs/",
    headers=headers,
)
costs = response.json()

print(f"Cost: {costs['cost']} {costs['currency']}")
print(f"Per additional session: {costs['cost_per_additional_session']} {costs['currency']}")
print(f"Token expires in: {costs['token_expires_in']}s")

The response includes:

  • cost — the total cost for the job.
  • cost_per_additional_session — cost for each additional session.
  • currency — the currency code.
  • token — a time-limited token to confirm the cost when launching.

Launching the job

Launch the job by sending the cost token. This ensures you are charged the quoted price:

response = requests.post(
    f"https://api.mabyduck.com/projects/{PROJECT_ID}/experiments/{experiment['id']}/jobs/{job['id']}/launch/",
    headers=headers,
    json={"token": costs["token"]},
)

print(response.status_code)  # 200 on success

Once launched, the job is live and raters can begin participating.

Retrieving results

After raters complete sessions, you can retrieve the collected data.

Listing slates

Slates represent groups of stimuli that were evaluated together (e.g., one MUSHRA screen). Each slate contains ratings for the stimuli it presented:

response = requests.get(
    f"https://api.mabyduck.com/projects/{PROJECT_ID}/experiments/{experiment['id']}/slates/",
    headers=headers,
)

slates = response.json()
print(f"Total slates: {len(slates)}")

for slate in slates[:3]:
    for rating in slate["ratings"]:
        print(f"  {rating['stimulus']}: {rating['score']}")
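The per-slate ratings are enough to compute simple summary statistics yourself, for example a mean score per stimulus. A sketch assuming each rating carries the `stimulus` and `score` fields shown above:

```python
from collections import defaultdict

def mean_scores(slates: list[dict]) -> dict[str, float]:
    """Average the scores for each stimulus across all slates."""
    totals: dict[str, list[float]] = defaultdict(list)
    for slate in slates:
        for rating in slate["ratings"]:
            totals[rating["stimulus"]].append(rating["score"])
    return {stimulus: sum(scores) / len(scores)
            for stimulus, scores in totals.items()}
```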

Listing results and metrics

The results endpoint returns computed metrics for your experiment, such as mean opinion scores (MOS) or Elo ratings:

response = requests.get(
    f"https://api.mabyduck.com/projects/{PROJECT_ID}/experiments/{experiment['id']}/results/",
    headers=headers,
)

results = response.json()
for result in results:
    print(f"Metric: {result['kind']}")
    for score in result["scores"]:
        print(f"  {score}")

You can filter results by metric type using the metric query parameter (e.g., ?metric=elo or ?metric=mean). You can also retrieve project-wide metrics across multiple experiments.
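With requests, the metric filter can be passed via the `params` argument, which appends it to the query string. The placeholder IDs below stand in for your own values:

```python
import requests

# Build the filtered request; `params` becomes the ?metric=elo query string.
request = requests.Request(
    "GET",
    "https://api.mabyduck.com/projects/PROJECT_ID/experiments/EXPERIMENT_ID/results/",
    params={"metric": "elo"},
).prepare()
print(request.url)

# To actually send it:
# requests.get(url, headers=headers, params={"metric": "elo"})
```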