Election forecasting

Election forecasting is the process of using AI algorithms to predict election outcomes from data sources such as polling agencies, social media, news outlets, and government records. Election forecasting is important for the following reasons:

It helps candidates plan their campaigns. Accurate forecasting identifies the issues that matter most to voters and the demographics most likely to vote in a candidate's favor. This allows candidates to tailor their messages and campaign strategies accordingly, making their campaigns more effective.
It provides insights into voter behavior. Forecasting can reveal voter preferences and behavior, helping parties and candidates understand why certain voters may be more likely to support them. This information can inform policy decisions, which can in turn sway voters in future elections.
It facilitates resource allocation. Forecasting can help political parties and candidates allocate resources such as funding, staff, and time more effectively. By identifying which regions and demographics are most likely to vote in their favor, they can focus their resources on those areas and maximize their impact.

Several data sources can be used to forecast elections (more details in subsequent sections):

Voter demographics: Attributes such as age, gender, ethnicity, and income can provide insight into the voting patterns of different groups.
Past election results: Historical election data shows how different regions or demographic groups have voted in the past, which can inform predictions about how they are likely to vote in the future.
Current polling data: Polls provide up-to-date information about the opinions and preferences of voters leading up to the election.

A number of publications use some or all of the above features to make their predictions.

Election forecasting is extremely hard because of the degree of uncertainty in the predictions. There have been notable successes and failures. For example, in 2008, FiveThirtyEight (a popular website which provides data analysis on sports and elections) correctly predicted the winner of the Obama vs McCain presidential election and called 49 of the 50 states correctly. However, they have also had some big misses, most notably the 2016 US Presidential election, where they gave Hillary Clinton a 71% chance of winning. Trump won that election with 304 electoral votes to Clinton’s 227. This shows that even experts in the field can get it wrong.

In this paper, we will create a blueprint to build an AI application for election forecasting including the problem definition, datafication, error analysis, payoffs, and evaluation and deployment analysis.

Problem Statement

Predict the outcome of an election (output) based on various inputs such as voter demographics, past election results, and current polling data (input). The prediction would be used by campaign managers to allocate resources and revise campaign messaging strategies.

Datafication

Input

The following are the inputs for an election prediction application:

Demographic data: This includes the distribution of age, gender, income, education, and race or ethnicity among voters. You can get this data from the US census: https://www.census.gov/data.html
Polling data: This includes data from polls conducted during the election campaign, such as the percentage of voters who support each candidate and the percentage of undecided voters. Many organizations conduct election polls, so it is important to know each poll's methodology and potential drawbacks. You can get this data from poll databases such as Reuters: https://polling.reuters.com/
Past election results: This includes data from previous elections, such as the percentage of votes each candidate received and changes in voter support over time.
Political and economic indicators: This includes factors such as GDP growth, unemployment rates, and approval ratings for political leaders. Source: https://www.govinfo.gov/app/collection/econi
Campaign spending: This includes data on how much money each candidate has spent on their campaign, as well as the amount of money spent by political action committees and other outside groups. The Federal Election Commission publishes campaign financing data: https://www.fec.gov/data/
Social media data: This includes data on how candidates are using social media to engage with voters, as well as sentiment analysis of social media posts related to the election.
Geographical data: This includes data on voting patterns and demographics in different regions of the country or constituency.

Output

There are a number of possible outputs for election forecasting:

Binary classification of the winner
Probability scores of each candidate
Odds of each candidate
Number of seats or electoral votes won by each candidate. This one can be especially tricky because the rules for allocating electoral votes can vary by state. Official results are typically made available by federal and state agencies. Here is one such source: https://www.usa.gov/election-results
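
The first three outputs are closely related: a probability can be converted to odds, and a binary call is just the candidate with the highest probability. The sketch below illustrates that relationship. The function names are made up for this example and are not part of any library.

```python
def probability_to_odds(probability: float) -> float:
    """Convert a win probability into odds in favor, expressed as a ratio to 1."""
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return probability / (1.0 - probability)


def odds_to_probability(odds: float) -> float:
    """Convert odds in favor (ratio to 1) back into a win probability."""
    if odds <= 0.0:
        raise ValueError("odds must be positive")
    return odds / (1.0 + odds)


def pick_winner(probabilities: dict[str, float]) -> str:
    """Binary-style output: the candidate with the highest win probability."""
    if not probabilities:
        raise ValueError("at least one candidate is required")
    return max(probabilities, key=probabilities.get)


if __name__ == "__main__":
    # Illustrative probabilities only, not a real forecast.
    forecast = {"Candidate A": 0.75, "Candidate B": 0.25}
    print(pick_winner(forecast))
    print(probability_to_odds(forecast["Candidate A"]))
```

For instance, a 71% win probability corresponds to odds of roughly 2.4 to 1 in favor. Collapsing that to a binary call discards the remaining 29%.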

Errors

Figure 1: President Truman holding a copy of the Chicago Daily Tribune with the incorrect headline "Dewey Defeats Truman"

This is a field which has been rife with famous errors. In the 1948 US presidential election, it was widely predicted that Thomas Dewey would defeat Harry Truman. However, the prediction was based on faulty polls: pollsters relied on quota sampling and stopped polling weeks before election day. An earlier example is the 1936 Literary Digest poll, which relied on sources such as telephone directories and car registration lists that overrepresented wealthy voters who tended to lean Republican.

More recently, there were a number of significant polling errors in the Brexit referendum. One key factor was social desirability bias, where voters were reluctant to reveal their true intentions because of the controversial nature of their opinion. In addition, the polls under-sampled older and less-educated voters, who were more likely to vote for Leave.

Since polling results are a major input to election forecasts, it is critical to get them right.

Another key challenge is the timing of the forecast. The earlier the forecast, the more useful it is to a candidate making resource decisions. However, earlier forecasts might miss crucial events which affect the election. For example, less than two weeks before the 2016 US Presidential election between Trump and Clinton, FBI Director James Comey announced that the FBI was reviewing newly discovered emails related to its investigation into Hillary Clinton's use of a private email server while she served as Secretary of State. This is an example of a factor which may have significantly affected the election and which earlier forecasts could not account for.

The output (forecasts) and the actual Y (results) can differ for the following reasons:

Forecasts are made before the election takes place and may not have all the information.
Forecasts may be based on polls which contain bias. For example, polls might overrepresent or underrepresent certain populations.
Forecasts may be based on questions such as "Do you approve of a certain candidate?", which is not the same as "Who will win the election?"
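
One common way to correct a poll that overrepresents or underrepresents certain groups is to reweight the responses so that each group counts in proportion to its share of the electorate (post-stratification). This technique is not described in this paper; the sketch below is only an illustration under stated assumptions. The group names, sample counts, and population shares are invented, and a real poll would use many more groups.

```python
def raw_support(sample: dict[str, tuple[int, int]]) -> float:
    """Unweighted support: total supporters divided by total respondents.

    sample maps a group name to (respondents, supporters).
    """
    respondents = sum(n for n, _ in sample.values())
    supporters = sum(s for _, s in sample.values())
    if respondents == 0:
        raise ValueError("sample has no respondents")
    return supporters / respondents


def poststratified_support(
    sample: dict[str, tuple[int, int]],
    population_share: dict[str, float],
) -> float:
    """Support reweighted so each group counts by its share of the electorate."""
    total_share = sum(population_share.values())
    estimate = 0.0
    for group, share in population_share.items():
        respondents, supporters = sample[group]
        if respondents == 0:
            raise ValueError(f"no respondents in group {group!r}")
        # Group-level support, weighted by the normalized population share.
        estimate += (share / total_share) * (supporters / respondents)
    return estimate


if __name__ == "__main__":
    # Hypothetical poll: younger voters are heavily oversampled.
    sample = {"under_50": (800, 480), "50_plus": (200, 80)}
    electorate = {"under_50": 0.5, "50_plus": 0.5}
    print(raw_support(sample))
    print(poststratified_support(sample, electorate))
```

In this made-up example the raw figure overstates support because the oversampled group is also the more favorable one. Reweighting only helps for the characteristics that are measured. It does not fix biases such as respondents misreporting their intentions.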

It is very common for different polling agencies to disagree with each other. To reduce the effect of errors in individual polls, a common strategy is to use a poll of polls as a feature in the model. More advanced practitioners might use a weighted average of polls, with weights based on each poll's historical accuracy and the soundness of its methodology.
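
The two aggregation strategies above can be sketched as follows. The Poll class and its accuracy_score field are hypothetical names introduced for this example. How such a score is derived from a pollster's track record and methodology is left open here, and the poll numbers are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Poll:
    pollster: str
    support: float         # share of respondents backing the candidate, 0 to 1
    accuracy_score: float  # hypothetical weight; higher means more trusted


def poll_of_polls(polls: list[Poll]) -> float:
    """Simple poll of polls: the unweighted mean of reported support."""
    if not polls:
        raise ValueError("at least one poll is required")
    return sum(p.support for p in polls) / len(polls)


def weighted_poll_average(polls: list[Poll]) -> float:
    """Weighted average: polls with a higher accuracy score count for more."""
    total_weight = sum(p.accuracy_score for p in polls)
    if total_weight <= 0.0:
        raise ValueError("accuracy scores must sum to a positive number")
    return sum(p.support * p.accuracy_score for p in polls) / total_weight


if __name__ == "__main__":
    # Invented polls for illustration only.
    polls = [
        Poll("Pollster A", support=0.52, accuracy_score=0.9),
        Poll("Pollster B", support=0.48, accuracy_score=0.6),
        Poll("Pollster C", support=0.50, accuracy_score=0.3),
    ]
    print(poll_of_polls(polls))
    print(weighted_poll_average(polls))
```

Either value can then be passed to the model as a single polling feature. In practice the weight might also account for sample size or how recent the poll is, which this sketch leaves out.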

Training dataset

We would look at historical election data to create the training dataset.

Historical election outcomes
Polling data for those elections
Candidate approval ratings prior to the elections
Voter demographic distribution at the time of election
Sub-regional election results at the time of the elections, if available (e.g. using state election results as a feature for federal elections)
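
One way to organize these items is as one record per past election and region, with the actual result as the label. The sketch below assumes a flat record layout. Every field name is hypothetical and chosen for illustration, and the values in the example are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElectionRecord:
    year: int
    region: str
    final_poll_share: float    # candidate's share in polls before the election
    approval_rating: float     # candidate approval rating before the election
    share_over_65: float       # one example of a demographic feature
    prior_vote_share: float    # party's vote share in the previous election
    actual_vote_share: float   # label: the historical outcome


# Columns passed to the model; the label is kept separate.
FEATURE_NAMES = (
    "final_poll_share",
    "approval_rating",
    "share_over_65",
    "prior_vote_share",
)


def to_features_and_label(record: ElectionRecord) -> tuple[list[float], float]:
    """Split one record into a feature vector and its label."""
    features = [getattr(record, name) for name in FEATURE_NAMES]
    return features, record.actual_vote_share


if __name__ == "__main__":
    # Invented values for illustration only.
    record = ElectionRecord(2000, "Region A", 0.48, 0.45, 0.15, 0.51, 0.49)
    print(to_features_and_label(record))
```

Keeping the year and region on each record makes it possible to hold out the most recent elections for validation rather than splitting the rows at random.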

Deployment dataset

We would collect the same features as in the training dataset. Here are some of the ways the deployment data might differ from the training data:

New polling sources which were not available in prior elections
New technology which might impact how polling occurs. E.g. the advent of the telephone and internet changed how election polling took place.
New data sources which were not available earlier.
New political parties which were not present earlier.
Lack of data on candidates. E.g. a new candidate without approval ratings.
Time when the forecast is being made. E.g. if the forecast is made right after a convention, the party hosting the convention has likely received a temporary bump in the polls.

Payoffs

The cost of inaccuracies is extremely high for election forecasts.

Incorrect allocation of resources: It is postulated that Trump won the 2016 election because of his targeted campaigning in the swing states of Wisconsin, Michigan, Ohio and Pennsylvania. It is possible that Clinton might have performed better had she focused more on these states.
Loss of credibility: A number of forecasting companies have lost credibility after wrong predictions.

Evaluation

The easiest way to model this problem is as a classification task with accuracy as the metric. However, this may not provide the entire picture. For example, we might be interested in the number of seats or electoral votes won by a candidate. In that case, the problem is a regression problem, and we might use metrics such as mean squared error or mean absolute error. In certain cases, it might be worthwhile to combine regression and classification: the regression predicts the magnitude of the victory, while the classification provides probabilities or odds. It is best practice to validate the model against as many sources as possible.
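
The three metrics mentioned above can be computed as in the following sketch. It uses plain Python so the definitions are visible, and the predictions and results in the example are invented.

```python
def accuracy(predicted: list[str], actual: list[str]) -> float:
    """Fraction of contests where the predicted winner matched the actual one."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def mean_absolute_error(predicted: list[float], actual: list[float]) -> float:
    """Average size of the miss, in the same units as the target (e.g. seats)."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def mean_squared_error(predicted: list[float], actual: list[float]) -> float:
    """Average squared miss; large errors are penalized more heavily."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


if __name__ == "__main__":
    # Classification view: who wins each contest (invented data).
    print(accuracy(["A", "B", "A", "A"], ["A", "B", "B", "A"]))
    # Regression view: how many seats a candidate wins (invented data).
    predicted_seats = [300.0, 250.0, 280.0]
    actual_seats = [310.0, 240.0, 300.0]
    print(mean_absolute_error(predicted_seats, actual_seats))
    print(mean_squared_error(predicted_seats, actual_seats))
```

Accuracy treats a narrow miss and a landslide miss the same way, which is one reason to report a regression metric alongside it.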

Deployment

Note that this is a prediction problem rather than an automation problem: the user gets a prediction which may not be correct. The user should therefore refresh the prediction as often as new data becomes available. The user should also follow best practices such as using error bounds instead of single point estimates for decision making. It is also important for the user to supplement the prediction with qualitative analysis (colloquially referred to as the eye test) to ensure better decision making.
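
As a minimal illustration of working with error bounds, the sketch below computes the sampling margin of error for a single poll using the normal approximation at roughly 95% confidence (z = 1.96). This is an assumption made for the example. It covers sampling error only, and the uncertainty of a full forecast is typically larger. The function names are hypothetical.

```python
import math


def support_interval(
    share: float, sample_size: int, z: float = 1.96
) -> tuple[float, float]:
    """Approximate confidence interval for a polled share of support."""
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be between 0 and 1")
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    # Normal approximation to the standard error of a proportion.
    margin = z * math.sqrt(share * (1.0 - share) / sample_size)
    return max(0.0, share - margin), min(1.0, share + margin)


def is_clear_lead(share: float, sample_size: int, threshold: float = 0.5) -> bool:
    """True only if the whole interval sits above the threshold."""
    low, _ = support_interval(share, sample_size)
    return low > threshold


if __name__ == "__main__":
    # Invented poll: 52% support among 1,000 respondents.
    print(support_interval(0.52, 1000))
    print(is_clear_lead(0.52, 1000))
```

With 1,000 respondents the margin is about three percentage points, so a 52% reading on its own does not establish a lead in a two-candidate race.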

Summary

Election forecasting is a common but very complicated application of AI. The field is rife with errors, and it is important to be aware of them when building a model. In this paper, we covered methods to build a dataset, common errors, and potential solutions to avoid those errors.

Updated: 24 February, 2023
Created: 24 February, 2023