11 Comments

Interesting piece, thanks. It also reminds me of some of the Taleb/Silver arguments from a few years back (https://nautil.us/nassim-talebs-case-against-nate-silver-is-bad-math-237369/). On the calibration point, it also feels like having a prediction interval would shed some light on the usefulness of the models: e.g. if my current 95% interval for the probability of a win is X-Y, then this uncertainty should ideally narrow over time, with my election-day probability still falling within that range (otherwise I've potentially been too overconfident or too reactive to news along the way).

Author · Sep 13 (edited)

Yes, I was reminded of that debate too - although my recollection was mostly that it seemed obvious Taleb was wrong: he was confusing a snapshot read ("what if the election were held today?") with a projection ("what if the election were held on election day?"). So I couldn't work out why he was picking the fight.

On the calibration - I think everyone has confidence intervals on their electoral college projections (e.g., 95% that Harris will land in the range 170-307 EV). And similarly for vote share. But as for "95% intervals for probabilities of a win" ... this sounds like appealing to distributions over probabilities, a concept I've always found troubling (though I've seen it as a motivation for the existence of Beta distributions). I rather feel as though probabilities of probabilities don't make sense, and should be replaceable by a point probability without loss of information, though I'm not sure how to make this precise. But it also feels as though this should be a solved question - so if you (or anyone else) knows of a good discussion, I'd really like to read it.
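(The nearest I can get myself: for a single binary event, if f(p) is your distribution over the win probability, the law of total probability gives

P(win) = ∫ p · f(p) dp = E[p]

so a one-off forecast collapses to the mean of the distribution, and the spread only starts to matter once you're updating on new information or making joint claims about several events.)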


Yes, framing it around a distribution of probabilities isn't very concrete (especially as a win is a binary outcome, whereas electoral college votes are a hard number that can be compared to predictions on the day). If I were approaching the problem fresh (on a Friday eve, having not thought about it nearly as deeply as you and others), it seems like calculating how much of the electoral-vote density lies above 270 would provide a sensible (and testable) point probability estimate? And with a preference for a prediction interval on votes that narrows over time, we'd expect this point probability to be fairly non-jumpy over time?

Author · Sep 13 (edited)

Exactly this. Simply summing up the distribution from 270 upwards is precisely how I extracted win probabilities for the Princeton Election Consortium model, https://paulmainwood.substack.com/p/us-election-2024-poll-aggregator. I then plotted them over time (you can see them in the plots above), and concluded from their massive jumpiness that they hadn't even come close to fixing their uncorrelated-errors problem. And while their CIs do narrow over time, they do so to such a ridiculous, overconfident degree that they're pretty much useless.
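(Roughly, in code - a sketch rather than the actual pipeline, and assuming the model hands you simulated draws of the candidate's EV total:)

```python
import numpy as np

def win_probability(ev_draws, threshold=270):
    """Point win probability from simulated electoral-vote totals.

    ev_draws: one simulated EV count per Monte Carlo run of the model.
    The win probability is just the share of the distribution at or
    above the 270-EV threshold.
    """
    return np.mean(np.asarray(ev_draws) >= threshold)

# e.g. p_win = win_probability(simulated_ev_totals)
# (simulated_ev_totals is a hypothetical array of draws)
```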


You are badly misunderstanding how calibration scores are computed. It simply isn't true that "you have only one point in time to check your results". Equally, there is nothing special about the last forecast before the event, and ignoring the forecast history over time leads to absurdities like the one you highlight here.

The usual practice is to average the model score (commonly the Brier score, though there are others that work) over all of each model's predictions over time. This is, for example, what Tetlock does in his forecasting experiments. If you do this, and if the event happened in the end, then Model 3 will score strictly worse than Model 1, even though they have the same final forecast. For Model 2 it's hard to tell just by eyeballing the graph, but it will probably score better than Model 1. This might seem counterintuitive, but it is appropriate because Model 2 spent much of the time saying that the event was very likely to happen. In the opposite scenario, where the event didn't happen, the order will be reversed, with Model 3 scoring the best, which again seems appropriate.

The thing to keep in mind here is that there is no way to tell which model was "right" from a single forecast. You seem to be arguing that Models 2 and 3 are clearly inferior to Model 1, but there is no way to know that from the information given. You certainly can't assume that, just because all of the models ended up at 53%, this was the "right" answer all along. You have to look at forecasts for a lot of different events. If Model 2 is always swingy, and if the swings are unjustified, then over a lot of events its score will suffer. If, on the other hand, it is responding to real information, then it should outperform the other models in the long run.
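Concretely, the averaging is nothing more than this (a minimal sketch, assuming one probability forecast per day and a binary outcome; the paths below are hypothetical stand-ins, not the actual models in the graph):

```python
import numpy as np

def mean_brier(forecasts, outcome):
    """Time-averaged Brier score of one model's forecast history.

    forecasts: sequence of win probabilities, one per day.
    outcome: 1 if the event happened, 0 if it didn't.  Lower is better.
    """
    return np.mean((np.asarray(forecasts, dtype=float) - outcome) ** 2)

# Hypothetical 100-day forecast paths, all ending at 0.53:
steady    = np.full(100, 0.53)                                        # flat the whole way
mostly_hi = np.r_[np.full(80, 0.90), np.linspace(0.90, 0.53, 20)]     # long stretch near 90%
mostly_lo = np.r_[np.full(80, 0.20), np.linspace(0.20, 0.53, 20)]     # long stretch near 20%

for name, path in [("steady", steady), ("mostly_hi", mostly_hi), ("mostly_lo", mostly_lo)]:
    print(name, round(mean_brier(path, 1), 3), round(mean_brier(path, 0), 3))
# If the event happened (outcome 1), the mostly-high path scores best and the
# mostly-low path worst; if it didn't happen, the ranking flips.
```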

Author

Model 1 predicts a candidate gets 310 EV from August to the end of September (~9 weeks), then adjusts to new polling and switches to 260 EV for the 5 weeks to election day.

Model 2 predicts 260 EV during all of Aug-Sep, then adjusts to the same new polling and switches to a prediction of 310 EV, which it retains for the 5 weeks to election day.

Actual result is 315 EV.

The approach you suggest would score Model 1 as far superior to Model 2.

2 hrs ago (edited 1 hr ago)

Are we looking at the same graph? The one I was looking at had probabilities, not EV counts.

NB: A model that predicts a candidate will get 310 EV is predicting a 100% chance of victory, as there is no way to lose with 310 EV. Conversely, a model that predicts a candidate will get 260 EV is predicting a 0% chance of victory, since 260 falls short of the 270 needed. A model that waffles between 0% and 100% will not score well under a quadratic scoring rule - certainly not as well as a model that predicts something in between all the time.
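To put numbers on the example above under that reading (one forecast per week, and the win actually happening): Model 1 averages (9 × 0 + 5 × 1) / 14 ≈ 0.36, Model 2 averages (9 × 1 + 5 × 0) / 14 ≈ 0.64, and a model that simply sat at 53% throughout averages (0.53 - 1)^2 ≈ 0.22 every week. So yes, Model 1 beats Model 2 despite switching away from the eventual winner at the end - but the steady in-between model beats both.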


I would have thought a reasonable polling average methodology should also be able to predict the outcome of polls, at least over short time horizons (e.g. those currently in the field but not released yet).

Author

I think this is the best way to think about the dynamic linear model-based ones (The Economist, FiveThirtyEight's new model, Data Diary). You are essentially selecting your latent state over time to be the one that best predicts the data coming in the next day.

And then the only thing special about 5 November is that it is an especially important poll.
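A toy version of that idea (very much a sketch under simplifying assumptions - a single national share following a random walk, observed through noisy polls - nothing like the published models' actual machinery):

```python
import numpy as np

def one_step_ahead(polls, state_var=0.05, obs_var=1.5):
    """Local-level dynamic linear model: filter the latent vote share and
    make a one-step-ahead prediction of each day's poll.

    polls: daily poll readings of one candidate's share (in points).
    state_var: assumed day-to-day variance of the latent share (random walk).
    obs_var: assumed sampling/house-effect variance of an individual poll.
    """
    mu, var = polls[0], obs_var          # initialise the state at the first poll
    predictions = []
    for y in polls[1:]:
        var += state_var                 # predict: latent share drifts, uncertainty grows
        predictions.append(mu)           # one-step-ahead forecast of today's poll
        gain = var / (var + obs_var)     # update: Kalman gain
        mu = mu + gain * (y - mu)
        var = (1 - gain) * var
    return np.array(predictions)

# How well these one-day-ahead predictions track the incoming polls is one way
# to judge the aggregator, with 5 November just the last (and most important)
# "poll" in the sequence.
```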


I think 538 previously (i.e. Silver Bulletin now) viewed this as something their model should do too.

Author · Sep 15 (edited)

Interestingly, this is exactly what adjustments like his "convention bounce" violate. That is, it's an additional adjustment to take account of things like goodwill towards one candidate after their convention, which are assumed a) to exist, and b) to be temporary - so the model "aims off" for several weeks, in order (in his view) to get a better read on the final 5 November result.

Of course this means it won't get the polls right in the interim, unless that adjustment is turned off again.
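Schematically, that kind of adjustment amounts to something like the following (hypothetical numbers and functional form, not Silver's actual parameterisation):

```python
def convention_adjusted_margin(observed_margin, days_since_convention,
                               bounce=3.0, half_life=14.0):
    """Toy 'convention bounce' adjustment: assume the observed margin is
    temporarily inflated after the candidate's convention and that the
    inflation decays away, so the model deliberately 'aims off' the
    current polls for a few weeks.

    observed_margin: current polling margin for the candidate (points).
    days_since_convention: days since their convention ended.
    bounce: assumed size of the bounce on day 0 (points) - hypothetical.
    half_life: assumed half-life of the bounce in days - hypothetical.
    """
    decay = 0.5 ** (days_since_convention / half_life)
    return observed_margin - bounce * decay

# Right after the convention the adjusted margin sits below the raw polls; a
# month or two later the assumed bounce has decayed to ~zero, which is why the
# model is expected to miss the interim polls but (on this view) land closer
# to the 5 November result.
```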
