Jeopardy! What, like it’s hard?: Using BERT Classification to Estimate Jeopardy Question Valuation

Awisk
7 min read · Apr 29, 2021

Introduction

Netflix recently graced us with old episodes of Jeopardy! from when Alex Trebek was still hosting. Being over a year into quarantine and having exhausted much of my television backlog, I decided to watch. These episodes brought back a certain nostalgia and an urge to prove that I could compete with the contestants (though I obviously could not). As I got almost every answer wrong, one after another, I wondered: are these questions too niche for your average contestant? How many people know the name of Franklin D. Roosevelt’s beloved Scottish Terrier (it’s Fala, btw)?

From this curiosity, I decided to see whether an NLP classifier could predict the value/difficulty level a question belongs to. In a CheatSheet article from 2020, the author mentions that the only check on whether a question is too difficult is the writer quizzing fellow question writers. While human confirmation is a valuable evaluation method, having an objective model assess the difficulty of a question can bring in a second perspective and encourage consistency.

RIP Alex Trebek

Data

Data Description

The data was obtained from data.world and was originally scraped from j-archive.com, which has archived Jeopardy! games from 1984 to 2020. The data used for this project spans 1985 to 2008.

Data can be found here.

Because this is a classification problem, values had to be assigned to specific labels. Typical values are $200, $400, $600, $800, and $1,000 in the first round of Jeopardy!, and they double to $400, $800, $1,200, $1,600, and $2,000 in Double Jeopardy! Additionally, there are Daily Doubles hidden within the board, which allow a player to wager whatever amount they want.

Data Pre-Processing

I removed all clues whose values didn’t conform to this standard structure, since their difficulty could not be assessed from their value. Initially, the labels were assigned based on question value alone, which left $400 and $800 with disproportionately more questions. After running the initial classifier, I decided to treat value as signifying difficulty only within its round, so that $200 maps to the lowest difficulty level in round 1, just as $400 does in round 2. This led to a much cleaner result, and these labels were used throughout the rest of the analysis.
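As a rough sketch of the relabeling, assuming a pandas DataFrame loaded from the data.world CSV (the exact file and column names here are my assumption, not a quote from the original code):

```python
import pandas as pd

# Hypothetical file/column names; the data.world dump uses similar fields.
df = pd.read_csv("JEOPARDY_CSV.csv")
df.columns = [c.strip() for c in df.columns]  # strip any stray whitespace in headers

# Map each standard dollar value to a within-round difficulty level (0-4).
value_to_level = {
    "Jeopardy!": {"$200": 0, "$400": 1, "$600": 2, "$800": 3, "$1,000": 4},
    "Double Jeopardy!": {"$400": 0, "$800": 1, "$1,200": 2, "$1,600": 3, "$2,000": 4},
}

def assign_level(row):
    # Daily Doubles and other non-standard values map to None and are dropped below.
    return value_to_level.get(row["Round"], {}).get(row["Value"])

df["level"] = df.apply(assign_level, axis=1)
df = df.dropna(subset=["level"])
```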

Additionally, because many questions on Jeopardy! cannot be answered without knowing the category, I prepended the category and the answer to the question itself for model training and prediction. Once all three parts were concatenated, the text hit the BERT sequence limit, so I created a new column that caps the word length.
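A minimal sketch of that concatenation and truncation step, continuing from the DataFrame above (the specific word cap is my assumption; the article does not state the exact cutoff):

```python
# Prepend category and answer to the question, then cap the word count
# so the combined text stays under the BERT sequence limit.
MAX_WORDS = 200  # illustrative cap, not the article's exact value

df["text"] = (
    df["Category"].astype(str) + " "
    + df["Answer"].astype(str) + " "
    + df["Question"].astype(str)
)
df["text_short"] = df["text"].apply(lambda t: " ".join(t.split()[:MAX_WORDS]))
```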

The count of questions per class was highly uneven, leading to a disproportionate number of predictions for the most common class. To fix this, I downsampled the other classes so that each class had an equal number of questions.
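One way to do this downsampling (a sketch, not necessarily the original code) is to sample each class down to the size of the smallest one:

```python
# Downsample so every difficulty level has the same number of questions.
min_count = df["level"].value_counts().min()
balanced = (
    df.groupby("level", group_keys=False)
      .apply(lambda g: g.sample(n=min_count, random_state=42))
      .reset_index(drop=True)
)
```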

Distribution of Questions by Value Before Processing

Methods

Logistic Regression

Once the data was formatted, I ran a logistic regression as a baseline model. I used an L2 penalty to minimize the influence of unnecessary tokens and a TF-IDF feature representation. These vectorized features measure how relevant a word is to the document it appears in by multiplying the number of times the word appears in that document by its inverse document frequency across the document set.

If you know how to run a Logistic Regression, this process takes only one extra step; a minimal sketch is shown below. There are other ways to vectorize a document, which are shown in the GitHub repository, though TF-IDF was a satisfactory, general option.
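Here is a hedged sketch of that baseline using scikit-learn, reusing the balanced DataFrame from above (the split ratio and vectorizer settings are my assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X_train, X_test, y_train, y_test = train_test_split(
    balanced["text_short"], balanced["level"].astype(int),
    test_size=0.2, random_state=42
)

# TF-IDF features feeding a logistic regression with an L2 penalty (sklearn's default).
clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(penalty="l2", max_iter=1000),
)
clf.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```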

Pre-Trained BERT Models

While the Logistic Regression run above is limited to learning from the data I gave it, there exists a collection of pre-trained models that are already trained on a large text corpus. We can leverage what these models have learned and fine-tune them on the data for our specific task.

BERT stands for Bidirectional Encoder Representations from Transformers and is trained for general contextual language understanding. BERT is a masked language model (MLM), meaning that 15% of the words in the input are masked (replaced with a [MASK] token) and the sequence is run through a bidirectional Transformer encoder that predicts what the masked words should be. Each model was run for 2 epochs, with default parameters for things like batch size, learning rate, etc.
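As a toy illustration of masked-token prediction (not part of my pipeline), Hugging Face’s fill-mask pipeline shows DistilBERT guessing a masked word; the example sentence is mine:

```python
from transformers import pipeline

# Toy illustration of masked language modeling with DistilBERT.
fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")
for pred in fill_mask("This game show was hosted by Alex [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```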

While these models are powerful, they do require a lot of computing power, so I would recommend using a GPU if possible. If that is not an option, you will need to pass use_cuda=False when instantiating your ClassificationModel. A minimal fine-tuning sketch follows the list below.

  1. DistilBERT: a distilled version of BERT that is smaller and faster to run on a more standard computing setup. It is trained on the same corpus as BERT.
  2. SciBERT: follows the same training process as BERT but is trained on a corpus of papers from semanticscholar.org. I used SciBERT to see whether a corpus of academic publishing might give better results for the typically academic questions found on Jeopardy!.
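A minimal fine-tuning sketch with the simpletransformers library, assuming the same train/test split as the baseline (the model names and argument choices beyond the 2 epochs are my assumptions):

```python
from simpletransformers.classification import ClassificationArgs, ClassificationModel
import pandas as pd

# simpletransformers expects a DataFrame with "text" and "labels" columns.
train_df = pd.DataFrame({"text": X_train, "labels": y_train}).reset_index(drop=True)
eval_df = pd.DataFrame({"text": X_test, "labels": y_test}).reset_index(drop=True)

model_args = ClassificationArgs(num_train_epochs=2)  # 2 epochs, everything else default

# DistilBERT; swap in ("bert", "allenai/scibert_scivocab_uncased") for SciBERT.
model = ClassificationModel(
    "distilbert",
    "distilbert-base-uncased",
    num_labels=5,
    args=model_args,
    use_cuda=True,  # set use_cuda=False if no GPU is available
)

model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(eval_df)
```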

Results and Discussions

Because the predicted labels are ordinal (they increase in order and are not independent), I visualize results in a confusion matrix so we can see whether misclassified observations at least land in neighboring classes. The y-axis shows the actual label, while the x-axis shows the predicted label.
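A sketch of how such a matrix can be plotted, here using the baseline classifier’s predictions (the row normalization and heatmap styling are my choices, not necessarily the originals):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Row-normalized confusion matrix: actual level on the y-axis, predicted on the x-axis.
preds = clf.predict(X_test)
cm = confusion_matrix(y_test, preds, normalize="true")

sns.heatmap(cm, annot=True, fmt=".2f", cmap="Blues",
            xticklabels=range(5), yticklabels=range(5))
plt.xlabel("Predicted level")
plt.ylabel("Actual level")
plt.show()
```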

Logistic Regression

While the Logistic Regression model predicted the extreme levels best (levels 0 and 4 had the highest percentages of correct predictions), it struggled to correctly predict levels 1, 2, and 3. The overall accuracy was 21.7%, only slightly better than random guessing, which would give an expected test accuracy of 20%.

Logistic Regression Results

DistilBERT

While the DistilBERT model did a better job of predicting the easiest and hardest questions, and even improved on level 3, it struggled to correctly identify medium-difficulty questions. Instead of correctly identifying level 2, the majority of predictions went to levels 0 and 3, and the same was true for level 1. Overall, test accuracy rose to 25.4%.

DistilBERT Results

SciBERT

SciBERT completely failed to learn the difficulty level of Jeopardy! questions. While scientific questions are consistently asked on Jeopardy!, there are also consistently questions about history, pop culture, geography, and more. A narrower corpus did not appear to capture the breadth of knowledge necessary for understanding Jeopardy!.

SciBERT Results

While it looks like DistilBERT performed the best, we saw a trend similar to the Logistic Regression: both struggled to correctly learn and identify the medium-difficulty questions. The difficulty of a question is also relative to the audience to which it is presented, which means a multitude of nuances are present in a classification problem like this.

While this exposed a limitation of the models, it also likely reflects a limitation of the underlying data. As mentioned before, although difficulty is intended to increase with question value/level, the only way this is enforced is through subjective quizzing of fellow question writers. It seems that, though the DistilBERT model could often tell the difference between an easy and a hard question, it could not identify a medium-difficulty question.

What’s Next

Other fun ideas to improve upon this work include training a neural network from scratch in order to customize the input to the model. This would allow for more potential features while keeping control over the model’s structure, and it would be a good comparison to the pre-trained BERT models. With more time, I would also have liked to add more features to the model, including but not limited to:

  1. Having a separate embedding input for the answer so that the answer can be analyzed in isolation.
  2. Doing unsupervised NLP work on the categories to create features representing each category type. This could surface patterns within a category to better capture difficulty within a subject.
  3. Adding a feature to represent the year the show aired in case there was a drift in subject matter or valuation standards.

Code is found at: https://github.com/awisk/JeopardyNLP

Now go out and WORK! *snap*

References:

  1. https://www.cheatsheet.com/entertainment/jeopardy-writers-trick-for-making-daily-doubles-easier-or-harder.html/
  2. https://github.com/allenai/scibert
  3. https://github.com/google-research/bert
  4. https://kavita-ganesan.com/news-classifier-with-logistic-regression-in-python/#.YIltBH1KjUJ
  5. https://data.world/sya/200000-jeopardy-questions/workspace/file?filename=JEOPARDY_CSV.csv
  6. https://simpletransformers.ai/docs/multi-class-classification/
  7. https://simpletransformers.ai/docs/usage/
