After concluding our research and review paper, we decided to proceed with our idea of designing and implementing an agent to play Blackjack. We chose to use the Q-learning algorithm, which is one example of temporal difference learning, itself a subfield of reinforcement learning.

Blackjack is a game with hundreds of possible variations, so we opted for one of the simplest for our project (although our program is designed such that more complex variations of Blackjack should be very easily accommodated). In our version, doubling down, splitting, and insurance are not options; the only legal actions are to hit or stand.

We decided that Q-learning and Blackjack would go well together: Blackjack hands can be modeled straightforwardly as a Markov Decision Process, and if you ignore suits and individual card identities (which we do) and consider only total point counts, the state space is very small (only a couple of thousand states). Given that, we suspected that a Q-learning agent should be able to perform very well.

Basic Q-learning is fundamentally table-driven. In our program, we maintain a hash table whose keys are Blackjack states represented as a class (for instance, ‘our point count is 15 and the dealer is showing an Ace’), and whose values are themselves small hash tables mapping each possible action (in this case, just HIT or STAND) to a float representing the program’s current estimate of that action’s value.
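
As an illustration of this structure, a minimal Python sketch of such a nested lookup table might look like the following; the BlackjackState fields and the default value of 0.0 are assumptions made for the sketch, not our actual class.

    from collections import defaultdict
    from dataclasses import dataclass

    # Hypothetical state class: the fields here are chosen for illustration only.
    @dataclass(frozen=True)
    class BlackjackState:
        player_total: int      # e.g. 15
        dealer_upcard: int     # 11 here stands for a showing Ace
        has_usable_ace: bool   # whether we hold an Ace currently counted as 11

    ACTIONS = ("HIT", "STAND")

    # Outer table: state -> inner table; inner table: action -> estimated value.
    # Unseen (state, action) pairs start at an assumed default of 0.0.
    q_table: dict[BlackjackState, dict[str, float]] = defaultdict(
        lambda: {action: 0.0 for action in ACTIONS}
    )

    # Example lookup: current belief about hitting on 15 against a dealer Ace.
    state = BlackjackState(player_total=15, dealer_upcard=11, has_usable_ace=False)
    value_of_hit = q_table[state]["HIT"]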

During each discrete step in a hand (i.e., each time cards are dealt), we query our agent for an action. It almost always returns the action with the maximum value in its lookup table, but occasionally returns a random action instead; the reason for this is to balance exploitation with exploration of the state space (this type of decision policy is known as ‘epsilon-greedy’, as opposed to more complex alternatives such as ‘softmax’). Once we know the agent’s action, we take it, and feed the resulting state to a method which updates the table using SARSA, a very minor modification of the basic Q-learning algorithm,

% Q(s,a) ← Q(s,a) + alpha [r + lambda Q(s', a') - Q(s, a)] %,

where %alpha% is the learning rate, %lambda% is the discount factor, i.e., the weight to place on ‘future’ values, %r% is the reward (which in our case is %0% unless the hand has ended, when it can also be %1% or %-1%), %s'% is the new state returned after we took our action, and %a'% is the action chosen in that new state. Essentially, the reward is folded into our table for the state we were in and the action we took, and as the agent learns, reward values percolate backwards through ‘time’, so that the values of early actions come to reflect what tends to happen later in the hand.
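
To make this selection-and-update step concrete, here is a minimal Python sketch of an epsilon-greedy action choice and the SARSA update above, reusing the q_table and ACTIONS from the earlier sketch; the function names and the epsilon, alpha, and lambda values are illustrative assumptions rather than our actual settings.

    import random

    # Assumed hyperparameters for the sketch; q_table and ACTIONS come from the sketch above.
    EPSILON = 0.05   # probability of taking a random, exploratory action
    ALPHA = 0.1      # learning rate (alpha)
    LAMBDA = 1.0     # weight placed on 'future' values (lambda)

    def choose_action(state):
        # Epsilon-greedy: usually exploit the best-known action, occasionally explore.
        if random.random() < EPSILON:
            return random.choice(ACTIONS)
        action_values = q_table[state]
        return max(action_values, key=action_values.get)

    def sarsa_update(state, action, reward, next_state, next_action, hand_over):
        # Q(s,a) <- Q(s,a) + alpha * [r + lambda * Q(s',a') - Q(s,a)]
        future = 0.0 if hand_over else q_table[next_state][next_action]
        current = q_table[state][action]
        q_table[state][action] = current + ALPHA * (reward + LAMBDA * future - current)

Here hand_over marks the end of a hand, where there is no %a'% and the future term drops out; mid-hand, the reward is %0% and only the estimated value of the next state-action pair feeds back.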

Ultimately, we found this method effective. Starting from a policy of always hitting (the worst possible strategy, with an average return of about -0.8 per hand), our agent progresses to an average return of -0.027 per hand given suitable training time, improving extremely rapidly at first and then tapering off logarithmically. This can be compared to the optimal strategy specified in the Blackjack literature, which we hardcoded into our program as an alternate agent and which almost exactly breaks even. We consider coming within 3% of the optimal strategy a success, and we attribute the remaining gap to the inherent randomness of Blackjack, which injects noise into the feedback that the basic Q-learning method is not well equipped to handle.