Analyze Your MaxDiff Experiment
- On the Results page, click the Preference Likelihood dropdown to switch between Preference Likelihood (#/screen), Average-based PL (50% baseline), and Utility Scores.
- Hover over the bars to see further statistical analysis.
- Click the hamburger menu to download a PNG, JPEG, PDF, or SVG vector image of the current data visualization.
Preference Likelihood (#/screen)
With Preference Likelihood (#/screen) selected, the baseline depends on the number of items per screen programmed in the MaxDiff. It represents the chance that an item would be selected from a random set of items, where the set size matches the number of items shown in each MaxDiff task respondents completed:
- 33% if 3 items per screen
- 25% if 4 items per screen
- 20% if 5 items per screen
In the example below, respondents were shown 5 alternatives per screen, so the baseline is the black line set at 20%.
Average-based PL (50% baseline)
For Average-based PL (50% baseline), the baseline is set at 50% — the probability an item would be chosen from among a set of two, no matter how many items per screen respondents interacted with.
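Both baselines described above are the chance of picking one item at random from a set of a given size. The short sketch below reproduces the percentages listed for 3, 4, and 5 items per screen, plus the head-to-head set of two. The function name `chance_baseline` is made up for illustration and is not part of the platform.

```python
def chance_baseline(set_size: int) -> float:
    """Chance that one particular item is picked at random from `set_size` items."""
    if set_size < 2:
        raise ValueError("a choice set needs at least two items")
    return 1 / set_size


# Preference Likelihood (#/screen): set size equals the items shown per screen.
for items_per_screen in (3, 4, 5):
    print(f"{items_per_screen} items per screen: {chance_baseline(items_per_screen):.0%}")

# Average-based PL: always a head-to-head set of two.
print(f"set of two: {chance_baseline(2):.0%}")
```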
Utility Scores
With Utility Scores selected, values are centered on zero so that items can be compared relative to one another. Zero represents average performance: the more positive an item's utility, the more it is preferred by respondents, and the more negative an item's utility, the less it is preferred.
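To illustrate what zero-centering means, here is a minimal sketch with made-up utilities for four hypothetical items. It subtracts the average so that the scores sum to zero, then ranks the items. The item names and numbers are assumptions for illustration only, and the platform's own estimation of utilities is not shown here.

```python
from statistics import mean

# Hypothetical raw utilities for four made-up items (not real results).
raw_utilities = {"Item A": 2.0, "Item B": 1.0, "Item C": 0.5, "Item D": -1.5}


def zero_center(utilities):
    """Shift utilities so their average is zero."""
    avg = mean(utilities.values())
    return {item: u - avg for item, u in utilities.items()}


centered = zero_center(raw_utilities)

# Rank from most preferred (most positive) to least preferred (most negative).
ranked = sorted(centered, key=centered.get, reverse=True)

for item in ranked:
    label = "above average" if centered[item] > 0 else "average or below"
    print(f"{item}: {centered[item]:+.2f} ({label})")
```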
MaxDiff FAQs
What is the best metric/output to use in my analysis?
There is not a single best metric — it is often a matter of personal preference.
Preference Likelihood scores are easier to interpret than Utility Scores because each value is a probability. With Preference Likelihood, each percentage represents the probability that an item would be most preferred out of a given set:
- If you prefer the given set to reflect the task respondents completed, use Preference Likelihood (#/screen).
- If you prefer the given set to reflect a head-to-head comparison of one item versus another, use Average-based PL (50% baseline).
Utility Scores give a quick, high-level view of performance:
- Positive values = above average
- Negative values = below average
- They also provide the overall rank order of items.
For significance testing between options, use Utility Scores with a t-test rather than comparing confidence intervals.
💡 Tip: Confidence intervals show the range where the true population parameter is likely to fall, while shading in the chart indicates pairwise statistical significance — which groups differ significantly from each other. These two measures don't always align because they convey different information. Rule of thumb: if two 95% confidence intervals just touch or slightly overlap, the difference is often still significant at the 5% level.
- Confidence intervals = one-sample population estimates
- Shading = tests whether the difference between two estimates is not zero
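The sketch below illustrates why overlapping confidence intervals and a significant pairwise difference can coexist. It uses two hypothetical estimates with made-up standard errors, and it simplifies in two ways that are assumptions of this example: it uses a normal approximation instead of a t-test, and it treats the two estimates as independent. It is not the platform's calculation.

```python
from math import sqrt
from statistics import NormalDist

Z_95 = NormalDist().inv_cdf(0.975)  # roughly 1.96


def confidence_interval(estimate, std_error):
    """95% confidence interval for a single estimate (normal approximation)."""
    return (estimate - Z_95 * std_error, estimate + Z_95 * std_error)


def difference_p_value(est_a, se_a, est_b, se_b):
    """Two-sided p-value for est_a - est_b, assuming independent estimates."""
    z = (est_a - est_b) / sqrt(se_a**2 + se_b**2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical scores and standard errors for two items (not real results).
ci_a = confidence_interval(3.0, 1.0)
ci_b = confidence_interval(0.0, 1.0)

intervals_overlap = ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]
p_value = difference_p_value(3.0, 1.0, 0.0, 1.0)

print(f"Intervals overlap: {intervals_overlap}")
print(f"p-value for the difference: {p_value:.3f}")
```

With these made-up numbers the two intervals overlap, yet the test on the difference falls below the 5% level. The standard error of a difference is smaller than the sum of the two individual margins, which is why the two views can disagree.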
The rank order of items varies across different metrics. Which metric should I use to report rank order?
We recommend using Utility Scores for the overall rank order of items. Without diving too far into the math, Utility Scores are preferred because they are the rawest form of the analysis and are normalized, whereas the Preference Likelihood metrics are transformations of them.
The baseline value of Preference Likelihood (#/screen) does not match the average of all PL values. Why is this, and what does this mean?
This comes from the mathematical transformation that converts raw utility scores into preference likelihood based on the number of items per screen. In short, values can move upward from the baseline more easily than downward in these calculations. When there is a clearer rank order and preference among the items, the average of these values will creep above the baseline value (the theoretical preference likelihood, which can be thought of simply as chance). If all items performed equally, the average of the Preference Likelihood scores would align more closely with the baseline value.
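The effect can be seen with a small numeric sketch. The exact transformation the platform applies is not given here, so this example assumes a commonly used logit-style rescaling, e^u / (e^u + k - 1), where u is a zero-centered utility and k is the number of items per screen. The utilities are made up for illustration.

```python
from math import exp
from statistics import mean


def preference_likelihood(utility, items_per_screen):
    """Assumed transform (not confirmed by the source): e^u / (e^u + k - 1)."""
    return exp(utility) / (exp(utility) + items_per_screen - 1)


k = 5
baseline = 1 / k  # 20% for 5 items per screen

# Hypothetical zero-centered utilities: a clear rank order vs. no preference at all.
spread_out = [-2.0, -1.0, 0.0, 1.0, 2.0]
all_equal = [0.0, 0.0, 0.0, 0.0, 0.0]

avg_spread = mean([preference_likelihood(u, k) for u in spread_out])
avg_equal = mean([preference_likelihood(u, k) for u in all_equal])

print(f"baseline:                  {baseline:.1%}")
print(f"average, clear rank order: {avg_spread:.1%}")
print(f"average, all items equal:  {avg_equal:.1%}")
```

Under this assumed transform, the utilities average to zero in both cases, but the preference likelihoods average above the 20% baseline only when the items are spread apart.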
Some items that perform below the baseline with Average-based PL (50% baseline) perform above the baseline with Preference Likelihood (#/screen). Why is this, and what does this mean?
Preference Likelihood values based on the number of items per screen commonly creep above the theoretical average, which lets each item move up slightly. This effect isn't observed with Average-based PL (50% baseline), due to the mathematical simplicity of that metric. Because the two metrics behave slightly differently with regard to the baseline, it isn't fair to compare an item's position relative to the baseline across the two. For those wanting to understand pure above-average and below-average performers, we recommend looking at Utility Scores (positive values = above average, negative values = below average, values tightly around 0 are about average).
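One hypothetical way this can arise is sketched below. It rests on two assumptions that are not confirmed here: that the transform is the logit-style rescaling e^u / (e^u + k - 1), and that each respondent's utility is transformed first and the results are then averaged. The respondent utilities are made up.

```python
from math import exp
from statistics import mean


def preference_likelihood(utility, set_size):
    """Assumed transform (not confirmed by the source): e^u / (e^u + k - 1)."""
    return exp(utility) / (exp(utility) + set_size - 1)


# Hypothetical zero-centered utilities for ONE item from two respondents:
# one dislikes it strongly, the other likes it. The mean utility is below zero.
respondent_utilities = [-2.0, 1.5]

head_to_head = mean([preference_likelihood(u, 2) for u in respondent_utilities])
per_screen = mean([preference_likelihood(u, 5) for u in respondent_utilities])

print(f"mean utility:           {mean(respondent_utilities):+.2f}")
print(f"set of two (50% base):  {head_to_head:.1%}")
print(f"set of five (20% base): {per_screen:.1%}")
```

In this made-up case the item sits below the 50% baseline in the head-to-head view but above the 20% baseline in the five-per-screen view, while its mean utility is negative. That is consistent with the recommendation to rely on Utility Scores when separating above-average from below-average items.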