Blog · July 16, 2025
Does Star Gender Affect How Audiences Rate Films?
A statistical investigation into gender bias in movie ratings using MovieLens 32M, IMDb data, a custom star-identification heuristic, and Cohen's d effect size measurement.
Is the screen neutral? The question sounds simple. The answer, when you put 32 million ratings under a statistical microscope, turns out to be more nuanced than intuition suggests — and more interesting.
This post summarizes the methodology and findings of a study I carried out jointly with Barbara, presented at Sapienza on 16 July 2025. We set out to test whether male and female voters rate films differently depending on the gender of the film’s lead star.
The data
We combined three datasets through a multi-stage pipeline:
- MovieLens 1M (2003) — voter ratings with demographic metadata: gender, age, occupation, zip code
- MovieLens 32M (2024) — 32 million ratings linked to IMDb identifiers
- IMDb Non-Commercial Datasets — cast lists and credit ordering
The result was four clean tables: Actors, Movies, Cast, and Ratings with voter demographics attached. Joining them required careful key matching across different identifier schemes and a thorough cleaning pass.
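To give a flavor of the key matching: MovieLens stores IMDb identifiers in `links.csv` as bare integers, while the IMDb datasets key titles by a `tconst` string such as `tt0114709`. The sketch below shows one way to bridge the two. The helper name `to_tconst`, the toy rows, and the placeholder `nm_*` person identifiers are assumptions made for illustration, not our actual pipeline code.

```python
def to_tconst(imdb_id: int) -> str:
    """Convert a MovieLens integer imdbId into IMDb's tconst format."""
    # IMDb prefixes identifiers with "tt" and pads them to at least 7 digits.
    return f"tt{int(imdb_id):07d}"


# Toy stand-in for MovieLens links.csv: movieId -> imdbId (read as an integer).
links = {1: 114709, 2: 113497}

# Toy stand-in for IMDb title.principals rows: (tconst, ordering, nconst).
# The nconst values here are placeholders, not real IMDb person identifiers.
principals = [
    ("tt0114709", 2, "nm_b"),
    ("tt0114709", 1, "nm_a"),
    ("tt0113497", 1, "nm_c"),
]

# Index the cast by tconst, then attach it to each MovieLens movieId.
cast_by_tconst = {}
for tconst, ordering, nconst in principals:
    cast_by_tconst.setdefault(tconst, []).append((ordering, nconst))

cast_by_movie = {
    movie_id: sorted(cast_by_tconst.get(to_tconst(imdb_id), []))
    for movie_id, imdb_id in links.items()
}

print(cast_by_movie)
```

Movies whose identifier finds no match end up with an empty cast list, which makes them easy to count and inspect during the cleaning pass.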
The structural asymmetry in credits
Before measuring any bias in ratings, we looked at how actors and actresses are distributed across credit positions. The picture is clear: on average, actresses are billed further down the cast list (higher ordering numbers) than actors. The KDE and CDF of average ordering by gender both show this shift, and the gap is clearly visible rather than marginal.
This matters because credit position is one of the inputs to the star identification heuristic — and because it’s a baseline finding in its own right about the industry, before any rating data enters the picture.
Finding the stars
The central independent variable is the gender of the film’s lead star. But what counts as a star?
Rather than using a subjective list, we applied a data-driven scoring function with two terms.
The first term is the actor’s average credit position across all their films — lower means consistently top-billed. The second term is a logarithmic penalty for appearing in few films: someone who topped the bill in one film doesn’t count the same as someone who did it across thirty. An actor qualifies as a star if their score falls below the 25th percentile threshold, yielding 3209 male and 1999 female stars.
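As a rough illustration of how such a score can be computed, here is a minimal sketch. The exact functional form is an assumption: the penalty `alpha / log(1 + n_films)`, the weight `alpha`, and the function names are hypothetical choices that merely match the description above (average ordering plus a logarithmic penalty for short filmographies), and they may differ from the formula used in the study.

```python
import math
from collections import defaultdict
from statistics import quantiles


def star_scores(cast_rows, alpha=1.0):
    """cast_rows: iterable of (actor_id, film_id, ordering) tuples."""
    orderings = defaultdict(list)
    for actor_id, _film_id, ordering in cast_rows:
        orderings[actor_id].append(ordering)

    scores = {}
    for actor_id, orders in orderings.items():
        # First term: average credit position (lower = consistently top-billed).
        mean_ordering = sum(orders) / len(orders)
        # Second term (ASSUMED form): penalty that shrinks as the film count grows.
        few_films_penalty = alpha / math.log(1 + len(orders))
        scores[actor_id] = mean_ordering + few_films_penalty
    return scores


def pick_stars(scores):
    """Return the actors whose score falls below the 25th percentile."""
    threshold = quantiles(list(scores.values()), n=4)[0]
    return {actor for actor, score in scores.items() if score < threshold}


# Toy cast table with made-up actors A to D.
rows = (
    [("A", f"a{i}", 1) for i in range(30)]    # top-billed in 30 films
    + [("B", "b0", 1)]                        # top-billed in a single film
    + [("C", f"c{i}", 5) for i in range(30)]  # fifth-billed in 30 films
    + [("D", f"d{i}", 3) for i in range(10)]  # third-billed in 10 films
)
scores = star_scores(rows)
stars = pick_stars(scores)
print(stars)
```

In this toy data only A clears the threshold: B shares A's average ordering but is pushed back by the single-film penalty.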
The heuristic passes a basic sanity check: among identified male stars are Humphrey Bogart, Elvis Presley, and Michael Douglas; among female stars, Judy Garland, Shirley Temple, and Barbra Streisand.
What we found
Films were grouped into four conditions by star gender (at least one male star, at least one female star, only male stars, only female stars). The conditions overlap: a film with only male stars also has at least one male star. Within each condition we compared the male and female voter rating distributions using Welch t-tests and Cohen’s d.
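Since the four conditions are not mutually exclusive, a single film can contribute to more than one comparison. A small sketch of the assignment logic makes this explicit; the function name, the label strings, and the `"M"`/`"F"` encoding are illustrative assumptions.

```python
def conditions(star_genders):
    """Return every condition a film falls into.

    star_genders: set of genders ("M" or "F") of the identified stars in the film.
    """
    labels = []
    if "M" in star_genders:
        labels.append("at_least_one_male")
    if "F" in star_genders:
        labels.append("at_least_one_female")
    if star_genders == {"M"}:
        labels.append("only_male")
    if star_genders == {"F"}:
        labels.append("only_female")
    return labels


print(conditions({"M"}))       # film whose stars are all male
print(conditions({"M", "F"}))  # film with stars of both genders
print(conditions(set()))       # film with no identified star
```

A film with no identified star falls into none of the conditions and is left out of the comparison.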
All four comparisons are statistically significant — but with 32 million ratings, significance is nearly guaranteed. What matters is effect size. Every Cohen’s d falls in the negligible range (below 0.2 by convention), with the largest value at 0.125 for the “only female stars” condition.
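To see why significance and effect size come apart at this scale, the sketch below computes the Welch t statistic and Cohen’s d (with a pooled standard deviation) on synthetic ratings. The two rating blocks are made up for illustration and are not drawn from our data; repeating them `k` times keeps the means and spread essentially fixed while growing the sample.

```python
import math


def mean_var(xs):
    """Return (mean, sample variance, n) for a list of numbers."""
    n = len(xs)
    m = sum(xs) / n
    return m, sum((x - m) ** 2 for x in xs) / (n - 1), n


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    ma, va, na = mean_var(a)
    mb, vb, nb = mean_var(b)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd


def welch_t(a, b):
    """Welch t statistic, which does not assume equal variances."""
    ma, va, na = mean_var(a)
    mb, vb, nb = mean_var(b)
    return (ma - mb) / math.sqrt(va / na + vb / nb)


# Synthetic 1-5 star ratings whose means differ by 0.125.
block_a = [1, 2, 3, 3, 4, 4, 5, 5]
block_b = [1, 2, 3, 3, 4, 4, 4, 5]

for k in (10, 10_000):
    a, b = block_a * k, block_b * k
    print(f"n={len(a):>6}  d={cohens_d(a, b):.3f}  t={welch_t(a, b):.2f}")
```

With 80 ratings per group the t statistic stays below 1; with 80,000 per group it climbs to roughly 20, while d barely moves from about 0.1. The t statistic scales with the square root of the sample size, whereas d does not, which is why we lean on d.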
The directional pattern is consistent and visible in the trend chart: across every threshold of stars considered (top 3 through top 21), female voters rate films with female stars higher than male voters do, and male voters rate films with male stars higher than female voters do. The lines are stable. The effect is real but small.
Negligible doesn't mean absent
A Cohen’s d of 0.125 falls below even the 0.2 that Cohen’s conventions label a small effect, but the direction of the difference is consistent across all star-count thresholds and all four experimental conditions. A pattern that shows up everywhere and doesn’t disappear as you vary the parameters is worth taking seriously, even if it won’t rewrite the industry.
Genre, occupation, and geography
Before reaching the bias analysis, we built three-way breakdowns of average ratings by genre, voter occupation, and voter gender. Some findings from that layer:
- Documentary and Film-Noir attract the highest average ratings; Comedy and Horror the lowest — consistent with selection effects (people who seek out documentaries rate them highly).
- Some occupation-genre cells show divergence of 0.5 or more between male and female voters; others are nearly identical. The pattern is heterogeneous rather than uniform.
- Geographically, California accounts for 18.5% of all ratings. Average ratings vary by state in the 3.39–3.81 range — regional taste differences that are real but modest.
The network layer
We built two networks from 500-actor and 500-voter random samples respectively.
The actor network’s centrality rankings produced the most interpretively interesting result. Betweenness centrality — which identifies actors who bridge different communities — is topped by Jack Nicholson, Robert De Niro, and Sean Connery, all confirmed stars by our heuristic. Eigenvector centrality — which identifies actors densely connected to other well-connected actors — is dominated by Frank Oz, Jerry Nelson, and Dave Goelz: the Muppet performers, who co-appeared across a dense cluster of films with each other. They score almost nothing on betweenness.
This divergence is the point. Being a “central” actor in a network means something completely different depending on whether you’re measuring bridge position or cluster embeddedness. Stars in the conventional sense — actors who connect diverse corners of the industry — show up in betweenness. Actors who worked extensively within a single production ecosystem show up in eigenvector. Neither metric is wrong; they answer different questions.
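The contrast is easy to reproduce on a toy graph. The sketch below implements both measures in plain Python (Brandes’ algorithm for betweenness, power iteration for eigenvector centrality) on a made-up co-appearance network: a tight four-person ensemble `e1` to `e4`, plus a `bridge` actor who links the ensemble to two otherwise separate pairs. All node names and edges are hypothetical; on real data a library such as NetworkX, with its `betweenness_centrality` and `eigenvector_centrality` functions, would be the more practical choice.

```python
import math
from collections import deque


def build_graph(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def betweenness(adj):
    """Unnormalized betweenness for an unweighted, undirected graph (Brandes)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                  # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):     # accumulate dependencies back toward s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}  # each pair was counted twice


def eigenvector(adj, iters=200):
    """Eigenvector centrality by power iteration on A + I."""
    x = {v: 1.0 for v in adj}
    for _ in range(iters):
        nxt = {v: x[v] + sum(x[w] for w in adj[v]) for v in adj}
        norm = math.sqrt(sum(val * val for val in nxt.values()))
        x = {v: val / norm for v, val in nxt.items()}
    return x


ensemble = ["e1", "e2", "e3", "e4"]
edges = [(u, v) for i, u in enumerate(ensemble) for v in ensemble[i + 1:]]
edges += [("bridge", "e1"), ("bridge", "x1"), ("bridge", "y1"),
          ("x1", "x2"), ("y1", "y2")]

adj = build_graph(edges)
bc, ev = betweenness(adj), eigenvector(adj)
for name in ("bridge", "e1", "e2"):
    print(f"{name:>6}  betweenness={bc[name]:.1f}  eigenvector={ev[name]:.3f}")
```

Here `bridge` has the highest betweenness, while ensemble members such as `e2` have zero betweenness yet outrank `bridge` on eigenvector centrality: the same split we saw between the stars and the Muppet performers.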
The voter network similarly revealed distinct taste-based clusters, confirming that the audience can be meaningfully segmented by rating behavior.
A methodological note
This is a correlational study. The finding that female voters rate films with only female stars higher than male voters do says nothing about the mechanism behind it. Plausible explanations include genuine preference differences, selection effects in who watches which films, and genre confounding. The Cohen’s d values give effect sizes; they don’t resolve causality.
The honest framing: we found a systematic directional pattern that is consistent and robust to parameter variation. Understanding why it exists requires a different kind of study design.
The full analysis was presented on 16 July 2025 at Sapienza.