Briscola

Introduction

This environment represents the Briscola card game.

Briscola is a popular Italian card game; here we use the base two-player version. The game is played with a 40-card Italian deck: three cards are dealt to each player, and one card is revealed and placed, visible, at the bottom of the deck. Its suit is called the Briscola (trump) suit. In every hand each player plays one card; the two cards go to the player who played first, unless the second player played a higher card of the same suit or played a Briscola, in which case the second player takes the hand and plays first in the following one. The player who has scored more points at the end of the match wins. See the wiki entry for Briscola for all of the details.
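
To make the rule concrete, here is a minimal sketch (in Python) of the trick-resolution logic, assuming a card is a (suit, rank) pair with the usual Italian ranks 1-10 and the standard point values; the names and the representation are ours for illustration, not part of the environment.

  # A card is a (suit, rank) pair; ranks 1..10 with 8 = jack, 9 = knight, 10 = king.
  CARD_POINTS = {1: 11, 3: 10, 10: 4, 9: 3, 8: 2}   # every other rank is worth 0
  RANK_ORDER = [2, 4, 5, 6, 7, 8, 9, 10, 3, 1]      # weakest to strongest

  def trick_winner(first, second, briscola_suit):
      """Return 0 if the card played first takes the hand, 1 if the second does."""
      if second[0] == first[0]:
          # Same suit: the higher-ranked card wins.
          return 1 if RANK_ORDER.index(second[1]) > RANK_ORDER.index(first[1]) else 0
      if second[0] == briscola_suit:
          # The second player trumped with a Briscola.
          return 1
      # Different suit and not a Briscola: the first player keeps the hand.
      return 0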

The Briscola environment

The environment keeps track of the status of the game, the hands, the deck, and the score. To the agents, the status is described in a compact, abstracted way, since the full state space for an agent is far too big (about 10^55 different states). We designed a compact state-space representation of about 57'000 states.

The perceived variables are:

  • one variable for each of the three cards the agent owns (8 values)
      * no card (0)
      * zero-point card (1)
      * low-point card (2)
      * high-point card (3)
      * zero-point Briscola (4)
      * low-point Briscola (5)
      * three of Briscola (6)
      * the Top Briscola (7)
  • one variable describing the card played by the opponent (7 values)
      * no card (0)
      * zero-point card (1)
      * low-point card (2)
      * high-point card (3)
      * zero-point Briscola (4)
      * low-point Briscola (5)
      * high-point Briscola (6)
  • one boolean for each card in hand, telling whether that card would take the current hand if played
  • one boolean telling whether we are in the seventeenth hand (the hand which decides who draws the Briscola card)

The possible actions are simply which card to play (first, second, or third).

The cards in the agent's hand are sorted by the environment, so that different permutations of the same cards are not mapped to different states. As a consequence, many of the 57'000 states can never be reached, which reduces the learning time. A sketch of one possible encoding is given below.
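
Assuming a simple mixed-radix packing (one digit per variable), the variables above give 8 · 8 · 8 · 7 · 2 · 2 · 2 · 2 = 57'344 ≈ 57'000 distinct indices, consistent with the figure mentioned earlier. The function below is only our illustration, not the environment's actual code.

  def encode_state(hand_codes, opponent_code, wins_hand, seventeenth):
      # hand_codes: three values in 0..7 (the 8-valued abstraction above)
      # opponent_code: one value in 0..6
      # wins_hand: three booleans, one per card in hand; seventeenth: one boolean
      state = 0
      for code in sorted(hand_codes):        # sorting removes permutations
          state = state * 8 + code
      state = state * 7 + opponent_code
      for wins in wins_hand:
          state = state * 2 + int(wins)
      return state * 2 + int(seventeenth)    # 8**3 * 7 * 2**4 = 57'344 indices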

Fixed policy agent

We implemented a fixed-policy agent using our knowledge of the game mechanics; he wins, on average, 89.2 games out of 100 against a random-playing agent. The agent's policy is roughly the following:

When playing first, the agent plays the least valuable card he owns.

When playing second, the agent tries to take the hand without playing a Briscola. If he can't, he takes with a Briscola when the hand is worth many points; otherwise he tries to concede the hand to the opponent without giving away (many) points. At worst, the agent tries to lose as few points as he can.
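
A rough sketch of both rules, building on the trick_winner and CARD_POINTS helpers from the earlier sketch, could look as follows; the 10-point threshold used for "many points" is an illustrative choice of ours, not the agent's actual tuning.

  def points(card):
      return CARD_POINTS.get(card[1], 0)

  def beats(card, opponent_card, briscola_suit):
      # True if `card`, played second, would take the hand.
      return trick_winner(opponent_card, card, briscola_suit) == 1

  def play_first(hand, briscola_suit):
      # Lead with the least valuable card (ties broken arbitrarily here).
      return min(hand, key=points)

  def play_second(hand, opponent_card, briscola_suit):
      plain = [c for c in hand if c[0] != briscola_suit]
      trumps = [c for c in hand if c[0] == briscola_suit]
      # 1. Take the hand without spending a Briscola, if possible.
      takers = [c for c in plain if beats(c, opponent_card, briscola_suit)]
      if takers:
          return max(takers, key=points)
      # 2. Use a Briscola only when the hand is worth enough points.
      trump_takers = [c for c in trumps if beats(c, opponent_card, briscola_suit)]
      if trump_takers and points(opponent_card) >= 10:   # illustrative threshold
          return min(trump_takers, key=points)
      # 3. Otherwise give up the hand as cheaply as possible.
      losers = [c for c in hand if not beats(c, opponent_card, briscola_suit)]
      return min(losers or hand, key=points)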

The agent includes an exploration parameter, which can be set to any value between 0 (purely fixed-policy agent) and 1 (random agent). This was added because a strictly fixed policy could prevent some states from ever appearing, which would compromise the learning of his opponent.
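
The exploration parameter can then be mixed in as a simple random override of the fixed policy, for example (again only a sketch, reusing play_first and play_second from above; passing None as the opponent's card means we play first):

  import random

  # exploration = 0.0 reproduces the pure fixed policy, 1.0 a random agent.
  def fixed_policy_act(hand, opponent_card, briscola_suit, exploration=0.0):
      if random.random() < exploration:
          return random.choice(hand)
      if opponent_card is None:                # no card on the table: we lead
          return play_first(hand, briscola_suit)
      return play_second(hand, opponent_card, briscola_suit)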

Q-learning experiment results

Experiment graph over 3 million trials. All simulations are run against the fixed-policy agent.

Relaz 3kk.gif

The red line represents the average score of the fixed-policy agent (60 points means an even match).

The violet line represents the score of a random agent. The first part comes from a real simulation (hence the noise), while the second is simply its average score (computed over a large number of trials).

The orange line is the score of a Q-learning agent.

After roughly 220'000 trials the learning agent becomes as good as the fixed-policy one; given enough trials, he manages to become better than the fixed-policy agent. His final average score is 62.5, and his winning rate is 55%.


Parameters of the simulation

  parameter                value
  learning rate            0.03
  learning decay rate      0.001
  gamma                    0.95
  exploration              0.4
  exploration decay rate   0.001
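
For reference, here is a minimal sketch of a tabular Q-learning agent using the parameters in the table above; the exact form of the decay schedules is not documented on this page, so the 1 / (1 + rate · trial) schedule used below is only an assumption.

  import random
  from collections import defaultdict

  ALPHA0, ALPHA_DECAY = 0.03, 0.001           # learning rate and its decay rate
  GAMMA = 0.95
  EPS0, EPS_DECAY = 0.4, 0.001                # exploration and its decay rate
  N_ACTIONS = 3                               # play the first, second or third card

  Q = defaultdict(lambda: [0.0] * N_ACTIONS)  # state index -> one value per action

  def choose_action(state, trial):
      eps = EPS0 / (1.0 + EPS_DECAY * trial)          # decayed exploration
      if random.random() < eps:
          return random.randrange(N_ACTIONS)
      return max(range(N_ACTIONS), key=lambda a: Q[state][a])

  def update(state, action, reward, next_state, trial, terminal=False):
      alpha = ALPHA0 / (1.0 + ALPHA_DECAY * trial)    # decayed learning rate
      target = reward if terminal else reward + GAMMA * max(Q[next_state])
      Q[state][action] += alpha * (target - Q[state][action])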



In the following graph, we plot the average score of the learning agent against the fixed-policy one for different exploration values. The other parameters are fixed, as per the following table.

Relaz explor.gif

Parameters of the simulation

  parameter                value
  learning rate            0.2
  learning decay rate      0.01
  gamma                    0.95
  exploration              see the legend in the graph
  exploration decay rate   0.001

As we can see, changing the exploration rate does not noticeably affect the overall learning curve. This is mainly because the environment is very stochastic, so every state is visited sooner or later regardless of the exploration chance. Also, since the Q-table is given a high (optimistic) initialization, the agent explores the different actions by himself even without explicit exploration, as the snippet below illustrates.
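
In terms of the Q-learning sketch above, a high initialization just means starting every entry of the table above any score the agent can actually obtain; the value 120 (the total points in the deck) is our choice for illustration.

  # Optimistic start: every untried action looks better than anything seen so
  # far, so even the greedy choice keeps trying new actions until their values
  # are driven down by real experience.
  Q = defaultdict(lambda: [120.0] * N_ACTIONS)   # 120 = total points in the deck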



In the next graph, we plot the learning curves for different values of gamma. See the table below for the other parameters.

Relaz gamma.gif

Parameters of the simulation

  parameter                value
  learning rate            0.2
  learning decay rate      0.01
  gamma                    as per the legend in the graph
  exploration              0.4
  exploration decay rate   0.001

The simulations show that higher values of gamma give better learning. The zero-gamma experiment shows that a purely greedy (myopic) policy learns quickly but does not converge to a very good value, as can be expected.



Lastly, we plot the learning curves for different values of the learning rate (alpha).

Relaz alpha.gif

Parameters of the simulation

  parameter                value
  learning rate            as per the legend in the graph
  learning decay rate      0.01
  gamma                    0.95
  exploration              0.4
  exploration decay rate   0.001

This graph shows that a lower alpha gives slower but better learning.