How does a solver work, and what is meant by equilibrium?
Posted by WarLoGhE
Posted in Mid Stakes
Hi!
How does a solver work, and what is meant by equilibrium?
If you have an answer, I would very much appreciate it if you made it as simple as possible... :P
I'm quite a noob at theory things, but if I understand it correctly, it's the end point of two strategies perfectly exploiting each other, so neither party can increase their EV anymore by changing their strategy (exploiting).
To make it clearer, take rock, paper, scissors as an example: the equilibrium strategy (no one can gain an advantage by changing their strategy) is to choose each option randomly 1/3 of the time. If, for example, Player 1 picked scissors 40% of the time, Player 2 could gain an advantage by always choosing rock.
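A quick way to see the rock example: the short Python sketch below computes each pure action's EV against a given opponent mix. The +1/0/-1 payoffs and the 30/30/40 split (40% scissors, with the remaining 60% assumed to be split evenly between rock and paper) are illustrative assumptions.

```python
# Payoff to the row action against the column action: win = +1, tie = 0, loss = -1.
ACTIONS = ("rock", "paper", "scissors")
PAYOFF = [
    [0, -1, 1],   # rock: ties rock, loses to paper, beats scissors
    [1, 0, -1],   # paper: beats rock, ties paper, loses to scissors
    [-1, 1, 0],   # scissors: loses to rock, beats paper, ties scissors
]


def action_evs(opponent_mix):
    """EV of each pure action against an opponent mixing with these frequencies."""
    return {
        ACTIONS[i]: sum(PAYOFF[i][j] * opponent_mix[j] for j in range(3))
        for i in range(3)
    }


if __name__ == "__main__":
    # Equilibrium mix: every action has the same EV (0), so there is nothing to exploit.
    print(action_evs([1 / 3, 1 / 3, 1 / 3]))
    # Player 1 over-plays scissors (assumed 30% rock / 30% paper / 40% scissors).
    print(action_evs([0.3, 0.3, 0.4]))  # rock is best, about +0.1 per game
```

Against the 1/3 mix all three actions tie, which is exactly the indifference that makes it an equilibrium; once scissors is over-played, rock becomes the unique best (max-exploit) response.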
So when people throw around the phrase "the equilibrium play here is to..." or similar, they are saying "the perfect/unexploitable play here is to..."?
Well, it depends on its algorithm, but I think a solver does a huge number of max-exploit iterations until no further exploits are possible.
Unbeatable is the right word, at least in theory.
If you theoretically play a perfect GTO strategy, villain's EV will be the same no matter what he does.
I don't think that's correct: villain can't gain EV, but he can lose EV.
It's very far from true. Just Google libratus challenge.
https://en.wikipedia.org/wiki/Libratus
Hmm, which part is far from true? :D
Edited in a quote to clarify
Yep, I'm interested too, because the rock-paper-scissors equilibrium is well known, and the EV remains the same no matter what villain does.
In poker, if you are OTR and have a balanced betting range, villain's options (fold or call) yield the same EV no matter which one he chooses. So villain is indifferent between his options.
Not true for something like 80% to 95% of the range. There will be -EV hands, +EV hands and indifferent hands, so it's not as simple as you make it sound.
Kalupso, ok, ty for pointing that out... clearly I haven't understood what GTO means.
The video at 16 min shows quite clearly that not all hands have the same EV:
https://www.runitonce.com/poker-training/videos/hh-review-1-hand-history-review-nlhe-poker-strategy-pio-solver-nuno-alvarez/
Woah, ty for looking for the video. I'm going to watch it asap
Close. Actually, the solver can't use max exploits, as that would make the strategies just swing back and forth between extremes (i.e. 100% vs. 0% bluffs). Instead, the solver starts with an arbitrary strategy (it does not really matter which) and then shifts it slightly in the direction of the max exploit (for each combo). That happens for both players. Then the solver starts the next iteration with the new strategies and again shifts them slightly.
The adjustments get finer and finer, up to the point where the differences between the current strategies and the max-exploit-strategies are "close to zero". How close depends on the "accuracy" you're ready to accept.
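The swinging that full max exploits cause can be reproduced in rock, paper, scissors. The Python sketch below contrasts two update rules: both players jumping all the way to the max exploit each iteration, versus fictitious play, a textbook scheme in which each player's average strategy moves a step of size 1/(t+1) toward the max exploit of the opponent's average. Fictitious play is used here only as a stand-in for "shift slightly"; it is an assumption for illustration, not a claim about what any particular solver implements.

```python
ACTIONS = ("rock", "paper", "scissors")
PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]  # row action vs column action


def best_response(opponent_mix):
    """Pure max-exploit strategy against a mix (ties go to the first action)."""
    evs = [sum(PAYOFF[i][j] * opponent_mix[j] for j in range(3)) for i in range(3)]
    best = max(range(3), key=lambda i: evs[i])
    return [1.0 if i == best else 0.0 for i in range(3)]


def max_exploit_jumps(steps):
    """Both players jump fully to the max exploit of the other's current strategy."""
    p1 = p2 = [1.0, 0.0, 0.0]  # both start on pure rock
    played = []
    for _ in range(steps):
        p1, p2 = best_response(p2), best_response(p1)
        played.append(ACTIONS[p1.index(1.0)])
    return played


def fictitious_play(steps):
    """Each iteration moves both average strategies slightly toward the max exploit."""
    avg1 = [1 / 3] * 3  # arbitrary (here uniform) starting point
    avg2 = [1 / 3] * 3
    for t in range(1, steps + 1):
        br1, br2 = best_response(avg2), best_response(avg1)
        w = 1.0 / (t + 1)  # the steps get smaller and smaller
        avg1 = [(1 - w) * a + w * b for a, b in zip(avg1, br1)]
        avg2 = [(1 - w) * a + w * b for a, b in zip(avg2, br2)]
    return avg1, avg2


if __name__ == "__main__":
    print(max_exploit_jumps(6))        # cycles: paper, scissors, rock, paper, ...
    print(fictitious_play(20_000)[0])  # each frequency close to 1/3
```

The jump rule never settles: it cycles through pure strategies forever. The damped rule converges toward the 1/3 mix, and the remaining distance shrinks as the step size does, which matches the "accuracy you're ready to accept" trade-off.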
As Prank mentioned, that is not true. Simple example: your range consists of 25% nuts and 75% bluffcatchers. I bet pot with a GTO-balanced range of value (2nd nuts) and trash. Your GTO strategy (the best you can do) is to call with all of your nuts and one third of your bluffcatchers, for a total of 50%. Obviously, you could call with all of your bluffcatchers or fold them all, and your EV wouldn't change. But if you folded your entire range, your EV would clearly drop.
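The numbers in this example can be checked with a few lines of Python. Assumptions not stated in the post: pot = 1, bet = 1 (pot-sized), the bettor's balanced range is 2/3 value and 1/3 bluffs (the ratio that makes a bluffcatcher indifferent facing a pot bet), and the defender's nuts simply call rather than raise. EVs are measured relative to folding, as elsewhere in the thread.

```python
from fractions import Fraction

POT = Fraction(1)
BET = Fraction(1)              # pot-sized bet
VALUE_SHARE = Fraction(2, 3)   # assumed balanced betting range: 2/3 value, 1/3 bluffs


def call_ev(hand):
    """EV of calling (relative to folding, which is 0) for the defender's hand class."""
    if hand == "nuts":
        # The nuts beat both the value bets (2nd nuts) and the bluffs: win pot + bet.
        return POT + BET
    # A bluffcatcher loses the call against value and wins pot + bet against bluffs.
    return VALUE_SHARE * (-BET) + (1 - VALUE_SHARE) * (POT + BET)


def defender_ev(call_nuts, call_catchers, nuts_share=Fraction(1, 4)):
    """Range EV of the defender for given calling frequencies (25% nuts, 75% bluffcatchers)."""
    catcher_share = 1 - nuts_share
    return (nuts_share * call_nuts * call_ev("nuts")
            + catcher_share * call_catchers * call_ev("catcher"))


if __name__ == "__main__":
    print(Fraction(1, 4) + Fraction(3, 4) * Fraction(1, 3))  # calling frequency: 1/2
    print(defender_ev(1, Fraction(1, 3)))  # GTO defence: 1/2 pot
    print(defender_ev(1, 1))               # call every bluffcatcher: still 1/2
    print(defender_ev(1, 0))               # fold every bluffcatcher: still 1/2
    print(defender_ev(0, 0))               # fold the whole range: 0
```

The bluffcatchers are exactly indifferent (calling EV 0), so any frequency for them leaves the range EV unchanged, while folding the nuts throws away half a pot.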
BigFiszh
It was really helpful, ty
So lots of nice discussions, nicenice :D
But can someone explain how the solver works?
BigFiszh told you in the post above
As I mentioned, the solver starts with two arbitrary strategies. Then, for each combo, it compares the EVs of the different actions and starts shifting the frequencies slightly in the direction of max EV.
Let's take a very simple example: two players, one pot-sized bet (psb) left. V1 has 1 combo of nuts and 1 combo of air. V2 has 1 bluffcatcher combo. For the sake of simplicity, let's say V1 can only bet all-in or open-fold; after V1 bets, V2 can call or fold.
Now, let's play solver:
We start with 50% shove and 50% fold for each combo! So V1 shoves the nuts 50% of the time and open-folds 50%. Same for his bluffs: 50% shove, 50% open-fold. When V1 bets, V2 calls with his bluffcatcher 50% of the time and folds 50%. Let's call this strategy pair "I1" (for 1st iteration):
V1-nuts: 50% shove, 50% fold
V1-bluffs: 50% shove, 50% fold
V2-catcher: 50% call, 50% fold
Now, let's compare the EVs of each combo for I1 (p = pot):
V1-EV(nuts-shove) = 0.5p + 0.5(2p) = 1.5p
V1-EV(nuts-fold) = 0
V1-EV(bluff-shove) = 0.5p + 0.5(-p) = 0
V1-EV(bluff-fold) = 0
V2-EV(bluffcatch) = 0.5(-p) + 0.5(2p) = 0.5p
V2-EV(fold) = 0
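This EV bookkeeping is easy to script. Below is a minimal Python sketch that uses the same conventions as the post (p = pot = 1; open-folding and folding are worth 0; a called shove wins 2p with the nuts and loses p with a bluff). The function name `combo_evs` is only for illustration. Feeding it the strategy pairs from this walkthrough reproduces the EVs listed for I1, I2, I3 and I99.

```python
def combo_evs(nuts_shove, bluff_shove, call, p=1.0):
    """EV of each option in the toy river game for the current strategy pair.

    nuts_shove, bluff_shove: V1's shove frequency with each combo.
    call: V2's calling frequency with his bluffcatcher.
    Open-folding (V1) and folding (V2) are the 0 baseline.
    """
    fold = 1.0 - call
    shoves = nuts_shove + bluff_shove
    bluffcatch = 0.0
    if shoves > 0:
        # V2 only acts after a shove: weight V1's combos by how often each one shoves.
        bluffcatch = (nuts_shove * (-p) + bluff_shove * 2 * p) / shoves
    return {
        "V1 nuts-shove": fold * p + call * 2 * p,    # V2 folds: +p, V2 calls: +2p
        "V1 bluff-shove": fold * p + call * (-p),    # V2 folds: +p, V2 calls: -p
        "V2 bluffcatch": bluffcatch,                 # vs nuts: -p, vs bluff: +2p
    }


if __name__ == "__main__":
    iterations = {
        "I1": (0.50, 0.50, 0.50),
        "I2": (0.75, 0.50, 0.60),
        "I3": (1.00, 0.40, 0.65),
        "I99": (1.00, 0.50, 0.50),
    }
    for name, pair in iterations.items():
        print(name, {k: round(v, 3) for k, v in combo_evs(*pair).items()})
```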
Now, let's adjust in preparation for the 2nd iteration (I2):
V1: the EV of shoving the nuts is way better than folding, so let's increase it by 50% (from 50% to 75%). The EV of bluffing is equal to folding, so it remains unchanged.
V2: the EV of bluffcatching is better than folding, so let's increase it by 20% (from 50% to 60%).
So, new strategy-pair for I2 is:
V1-nuts: 75% shove, 25% fold
V1-bluffs: 50% shove, 50% fold
V2-catcher: 60% call, 40% fold
Again, comparing the EVs for I2:
V1-EV(nuts-shove) = 0.4p + 0.6(2p) = 1.6p
V1-EV(nuts-fold) = 0
V1-EV(bluff-shove) = 0.4p + 0.6(-p) = -0.2p
V1-EV(bluff-fold) = 0
V2-EV(bluffcatch) = 75/125(-p) + 50/125(2p) = 25/125p = 0.2p
V2-EV(fold) = 0
For I3, V1 will now shift his nuts drastically (say, to 100%) and slightly reduce his bluffs. V2 will slightly increase his calling frequency:
Strategy-pair for I3:
V1-nuts: 100% shove
V1-bluffs: 40% shove, 60% fold
V2-catcher: 65% call, 35% fold
EV comparison for I3:
V1-EV(nuts-shove) = 0.35p + 0.65(2p) = 1.65p
V1-EV(bluff-shove) = 0.35p + 0.65(-p) = -0.3p
V1-EV(bluff-fold) = 0
V2-EV(bluffcatch) = 100/140(-p) + 40/140(2p) = -20/140p
V2-EV(fold) = 0
Now, for I4, V1 will slightly reduce his bluffs again and V2 will reduce his calls... and after many more iterations (say, I99) we eventually get here:
V1-nuts: 100% shove
V1-bluffs: 50% shove, 50% fold
V2-catcher: 50% call, 50% fold
EV for I99:
V1-EV(nuts-shove) = 0.5p + 0.5(2p) = 1.5p
V1-EV(bluff-shove) = 0.5p + 0.5(-p) = 0
V1-EV(bluff-fold) = 0
V2-EV(bluffcatch) = 100/150(-p) + 50/150(2p) = 0
V2-EV(fold) = 0
And here we are. Neither player can increase his EV anymore by unilaterally deviating from his current strategy, which is the definition of a Nash equilibrium. :-)
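The Nash condition can also be checked mechanically: for each strategy pair, compute how much each player could gain by switching to his max-exploit response while the other player stays put. This Python sketch uses the post's conventions (both V1 combos equally likely, p = 1); `deviation_gain` is a hypothetical helper name. Both gains are positive for I1, I2 and I3 and exactly zero at I99.

```python
def deviation_gain(nuts_shove, bluff_shove, call, p=1.0):
    """How much V1 and V2 could each gain by best-responding to the other's strategy."""
    fold = 1.0 - call
    ev_nuts = fold * p + call * 2 * p      # V1 shoving the nuts (open-fold = 0)
    ev_bluff = fold * p - call * p         # V1 shoving air (open-fold = 0)
    # V1 holds each combo half the time; gain = best option minus the current mix.
    gain_v1 = 0.5 * (max(ev_nuts, 0.0) - nuts_shove * ev_nuts) + 0.5 * (
        max(ev_bluff, 0.0) - bluff_shove * ev_bluff
    )
    # V2 only gets to act when V1 shoves.
    shove_prob = 0.5 * (nuts_shove + bluff_shove)
    gain_v2 = 0.0
    if shove_prob > 0:
        ev_call = (nuts_shove * (-p) + bluff_shove * 2 * p) / (nuts_shove + bluff_shove)
        gain_v2 = shove_prob * (max(ev_call, 0.0) - call * ev_call)
    return gain_v1, gain_v2


if __name__ == "__main__":
    for name, pair in {"I1": (0.5, 0.5, 0.5), "I2": (0.75, 0.5, 0.6),
                       "I3": (1.0, 0.4, 0.65), "I99": (1.0, 0.5, 0.5)}.items():
        print(name, deviation_gain(*pair))
```

A strategy pair is a Nash equilibrium exactly when both numbers are zero: no unilateral deviation helps either player.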
BigFiszh
PS: If anybody wants to check/prove it: start Pio, create a tree (model) and then, without starting the solver, go to the browser tab. You'll see that the solver initializes every strategy equally, with exactly 1/X for each of the possible actions (1/2 for two possible actions, 1/3 for three, etc.). That is because the solver does not know ANYTHING about strategies. The solver does not "know" that folding the nuts makes no sense. It just tries and sees what happens. Then it adjusts. Like a child touching a hot stove. ;)
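The "try, see what happens, adjust" loop can be sketched in Python for the toy game from the walkthrough, starting, like Pio, from 1/X for every action. The update rule here is fictitious play: each average strategy moves a step of 1/(t+1) toward the current max exploit. It is a simple textbook scheme chosen for illustration, not what Pio actually runs; the function name and iteration count are assumptions.

```python
def solve_toy_game(iterations=20_000, p=1.0):
    """Fictitious play for the toy game: V1 (nuts / air) shoves or open-folds, V2 calls or folds."""
    # Uniform start: 1/2 for each of the two available actions.
    nuts, bluff, call = 0.5, 0.5, 0.5
    for t in range(1, iterations + 1):
        fold = 1.0 - call
        ev_nuts_shove = fold * p + call * 2 * p
        ev_bluff_shove = fold * p - call * p
        ev_call = (nuts * (-p) + bluff * 2 * p) / (nuts + bluff)
        # Max exploit for each combo: the action with the higher EV (ties go to folding).
        br_nuts = 1.0 if ev_nuts_shove > 0 else 0.0
        br_bluff = 1.0 if ev_bluff_shove > 0 else 0.0
        br_call = 1.0 if ev_call > 0 else 0.0
        # Shift only slightly toward the max exploit; the step shrinks every iteration.
        w = 1.0 / (t + 1)
        nuts += w * (br_nuts - nuts)
        bluff += w * (br_bluff - bluff)
        call += w * (br_call - call)
    return nuts, bluff, call


if __name__ == "__main__":
    nuts, bluff, call = solve_toy_game()
    print(f"nuts shove {nuts:.3f}, bluff shove {bluff:.3f}, call {call:.3f}")
```

Nothing in the loop "knows" poker: the nuts drift to 100% shove only because shoving keeps showing the higher EV, and bluffs and calls settle near 50% because that is where each option stops beating the other.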
I like this explanation, it makes you realise how powerful AI is:o it's like a superhuman
Thank you!!! I will have to spend some more time trying to understand it... but, thank you.
I might just try and write a short script (loop) in Python.. :P could be fun
Surely, you are not a BIG fish? I don't get it.
WarLoGhE, judging by your last post, it seems you are familiar with coding to some degree. If that is the case, you can think of how solvers approach the Nash equilibrium strategy as conceptually similar to how gradient descent works in machine learning.
Think of it like this: Imagine that you are blindfolded and placed randomly on a mountain. How do you go about finding the peak? The answer is to take one step in many different directions and with each step in a given direction, check the change in elevation. Once you determine which direction causes the largest increase in elevation, move one step in that direction and repeat the process. You will reach the peak when a step in any direction results in a drop in elevation.
The solver starts with a random strategy (typically an even distribution in frequency of raise / call / fold). It makes small adjustments to the frequencies of every action randomly (takes a step in a bunch of different directions), and then checks the change in EV with every adjustment (checks the change in elevation). The solver will keep adjusting in the direction that causes the highest EV increase and you will hit Nash Equilibrium (the peak of the mountain) when any adjustment to your current strategy results in a drop in EV (elevation).
In reality, we can never reach a true Nash equilibrium (solvers are, after all, just a toy-game representation of the entire game tree). But the solver will keep making adjustments with smaller and smaller increases in EV until you tell it to stop at a point that is acceptably close to equilibrium.
Hope this helps.
Obv. you did not read my post. Your explanation is not correct.
Hi BigFiszh,
Apologies, it seems I did not read your post before commenting. Would you mind telling me which part of my comment exactly is incorrect? Thanks.
It's only that sentence that is not correct. The solver does not adjust randomly, but it's dictated by EV (follow my explanation).
Hm, maybe my comment explained this in a way that was not entirely clear.
I completely agree with you that the direction in which the solver ultimately adjusts is the direction of highest EV, but the part of my comment you are referring to is not incorrect. With that part of my explanation, I was trying to get at the heart of how some of these optimization algorithms work: how exactly does the solver know which direction of adjustment results in the highest EV? These optimization algorithms are not bestowed with the knowledge that raising the nuts has a higher EV than folding the nuts. What the solver must do first is try every adjustment randomly (even increasing the folding frequency of the nuts), then check the EV of every adjustment. Once that is done, the solver goes "Ah! Increasing the raise frequency of the nuts results in the biggest EV change!", then finally adjusts in that direction and repeats the process. It is easy to see how this is analogous to being blindfolded on a mountain and trying to find the peak.
My comment about random adjustments was simply elaborating on how the solver finds the gradient of EV across all strategic adjustments.
If you keep reading a few sentences after the sentence that you quoted above, you will find that my comment is in line with your explanation.
I apologize if my original comment was slightly convoluted and required further explanation to be made clear.
In any case, my description is only an oversimplified / conceptually similar process to what a solver like PioSolver actually does algorithmically. If I recall correctly, in the original 2p2 post for the release of Pio, punter11235 stated that older, pre-release versions of Pio were using a type of CFR algorithm but they have since moved on to faster algos.
I still disagree. :) Follow my explanation and you will see that the only "random" decision of the solver happens at initialization. From there on, it's a directed adjustment.
It sounds as if we are on the same page nevertheless. You say the solver has to take random steps to see what works best. But there's no need to do that: the solver already knows the EV of each option in the current strategy set, and it simply does more often what has a higher EV and less often what works less well.
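The point that the solver already "knows the EV of each option" can be seen numerically: against a fixed opponent strategy, a combo's EV is a frequency-weighted sum of its actions' EVs, so nudging one frequency changes the total EV at a rate equal to that action's EV. Random probing only rediscovers numbers the solver already has. This Python sketch borrows the raise/call/fold numbers from the simplified example that follows and treats those action EVs as fixed, which is an assumption; the function names are hypothetical.

```python
def combo_ev(freqs, action_evs):
    """EV of a combo's mixed strategy: frequency-weighted sum of the action EVs."""
    return sum(freqs[a] * action_evs[a] for a in freqs)


def probe_directions(freqs, action_evs, step=1e-6):
    """The 'blindfolded on a mountain' approach: nudge each frequency and measure the slope."""
    base = combo_ev(freqs, action_evs)
    slopes = {}
    for action in freqs:
        nudged = dict(freqs)
        nudged[action] += step  # partial derivative with respect to this frequency
        slopes[action] = (combo_ev(nudged, action_evs) - base) / step
    return slopes


if __name__ == "__main__":
    freqs = {"raise": 0.25, "call": 0.25, "fold": 0.50}
    action_evs = {"raise": 50.0, "call": 10.0, "fold": 0.0}
    print(combo_ev(freqs, action_evs))          # 15.0
    print(probe_directions(freqs, action_evs))  # matches action_evs (up to rounding)
```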
(Simplified) example - let's say we are at iteration X:
Raise% = 25%, EV = 50
Call% = 25%, EV = 10
Fold% = 50%, EV = 0
Next iteration will then be:
Raise% = 75%, EV = ??
Call% = 20%, EV = ??
Fold% = 5%, EV = 0
No trial and error, just magnification of the current state.
Ahh I see what you're saying, BigFiszh. Yes, you are correct. I'm aware that solvers use a guided adjustment method and not a random one (to be honest random is quite inefficient) and the raise/call/fold frequencies in the example that you gave are close to consistent with the regret matching / CFR algorithm that is actually used in solvers and poker AIs. I ultimately wanted to avoid talking about that and give an overly simplified conceptual description but in the end I just ended up spreading a bit of misinformation on the topic.
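For reference, regret matching, the update at the core of the CFR family mentioned here, can be sketched in a few lines of Python. An action's regret is how much better it did than the current mix; the next strategy plays each action in proportion to its accumulated positive regret. With BigFiszh's simplified raise/call/fold numbers it moves in the same direction (toward raise), although a single textbook step from zero regrets goes all the way to 100% raise rather than producing his illustrative 75/20/5 split. The function name is hypothetical, and real solvers add many refinements.

```python
def regret_matching_step(freqs, action_evs, cumulative_regret):
    """One regret-matching update for a single combo.

    freqs: current action frequencies; action_evs: EV of each action this iteration;
    cumulative_regret: running regret totals, updated in place.
    Returns the next iteration's frequencies.
    """
    current_ev = sum(freqs[a] * action_evs[a] for a in freqs)
    for a in freqs:
        # How much better (or worse) this action did than the current mix.
        cumulative_regret[a] += action_evs[a] - current_ev
    positive = {a: max(r, 0.0) for a, r in cumulative_regret.items()}
    total = sum(positive.values())
    if total == 0:
        # No action has positive regret: fall back to a uniform strategy.
        return {a: 1.0 / len(freqs) for a in freqs}
    return {a: positive[a] / total for a in freqs}


if __name__ == "__main__":
    freqs = {"raise": 0.25, "call": 0.25, "fold": 0.50}
    action_evs = {"raise": 50.0, "call": 10.0, "fold": 0.0}
    regrets = {a: 0.0 for a in freqs}
    print(regret_matching_step(freqs, action_evs, regrets))  # all weight moves to raise
    print(regrets)  # raise +35, call -5, fold -15
```

Because regrets accumulate across iterations, later steps are damped by earlier ones. In CFR it is the average of these strategies over all iterations, not the latest one, that approaches equilibrium.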
Thanks for all the responses. I appreciate that someone is keeping my answers in check!
Cheers
What would actually be the outcome if a strong player battled the solver over a large sample? I always assumed anyone would be destroyed at a huge loss rate. Not so sure that's the case, though...
The solver creates the GTO strategy by max-exploiting back and forth until there are no exploits left: equilibrium. This is possible because the solver is clairvoyant to its opponent (itself). Actually, when the solver creates a solution, it is always clairvoyant to the opposing strategy. If we plug a spot into a solver, we either assume our opponent plays perfect equilibrium, or we use a node lock for how we think they deviate. Either way, when we solve it, the solver is clairvoyant both ways.
However, if we play vs a solver output (as you can in various software today), for example BTN RFI vs BB defend, the solver is just playing the output it got from reaching equilibrium against itself and is obviously NOT clairvoyant to OUR actual strategy. IF it were somehow clairvoyant to our strategy (all mixing errors etc.), it would absolutely destroy us by taking advantage of our deviations from equilibrium. However, it isn't. It only assumes we play perfectly and responds with a "static equilibrium strategy".
If we just play vs the static solver output (as you typically would in such software), the solver only gains EV when we make blunders, i.e. choose actions that have 0% frequency at equilibrium. Mixing errors at most cause small EV losses, when an option with slightly higher EV in a vacuum is played too infrequently, perhaps by losing value/bluff potential etc. in certain nodes on certain runouts(?). However, the solver wouldn't adjust to those imbalances, because it just assumes equilibrium ranges in all nodes.
A good reg can probably "survive" by making roughly 2bb/100 worth of mistakes (EV losses) in 6-max. Remember that in 6-max we only VPIP around 25% of hands, and many of those never reach the flop. So in 100 6-max hands, there are only really 10-15 hands where we risk an EV loss. If you want, you can import a hand sample into GTO software and see the average EV loss per hand.
Questions that come to my mind:
1) Would a solver be losing after rake if it reg-battled the top 5 regs in the NL500z pool?
1b) Would it even beat lower stakes with higher rake (let's ignore all the spots without an existing GTO solution)?
2) Let's imagine someone proposed a prop bet where we got a 7BB/100 head start over a big sample vs the solver. Would it be an easy win? Could we just play a "safe strategy" where we never worry about mixes but simply select an option that we are almost sure has at least some frequency? What I mean is: if we suspect a play is 85% option A, mixed with small frequencies of options B & C, but we are not sure, we just play option A 100% of the time to avoid a potential EV blunder. We would then get away with a very, very small EV BB/100 loss rate, and the head start would make it a profitable side bet?
2b) Does the solver gain EV by playing a more advanced strategy? For example, multiple sizes in spots where most people would use one size.
Wow, a long wall of text, and I asked myself what the point was. Until the questions came up.
Clarification: a solver that comes up with an "equilibrium" strategy will never win. He will play break-even - and if rake comes in it will lose.
Why? Imagine the solver is a boxer. He's fighting (or has planned his strategy to fight) an absolutely equally skilled boxer. He knows that as soon as he drops his guard to prepare a punch, the opponent will foresee it and punch first. What's the equilibrium strategy? No punches whatsoever. Both boxers will dance around each other (forever) without ever taking their hands off their faces.
Now, what would happen if Muhammad Ali fought such a "defender"? Nothing. He could punch against his opponent's guarding fists (imagine that had no impact at all), but that's it. No win for anybody.
Another example is RPS: if one player plays the optimal strategy of 1/3 each, you can play 100% paper, 50/50 paper/rock, dance around the table, sing your name or whatever else comes to your mind; the result will always stay the same. Add a "rake" for each draw and the solver will lose (as will his opponent).
Summary: a non-adjusting solver is a break-even defender (which, in a zero-sum game, means break-even for everybody, and in a raked environment, losing). That is also the root of the big misunderstanding most people have: assuming GTO is a (purely) defensive strategy. Which couldn't be further from the truth; you already nailed it in your initial sentences.
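The RPS claim can be checked exactly in Python: against the 1/3 mix, every opponent strategy has EV 0 before rake, and a draw happens exactly 1/3 of the time whatever the opponent does, so a per-draw rake makes both sides lose the same amount. The rake model (a fee `rake` charged to each player on every draw) is an assumption for illustration.

```python
from fractions import Fraction

PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]  # rock, paper, scissors (row vs column)
EQUILIBRIUM = [Fraction(1, 3)] * 3


def equilibrium_player_ev(opponent_mix, rake=Fraction(0)):
    """EV per game for the 1/3-each player; each player pays `rake` on a draw (assumed)."""
    ev = Fraction(0)
    for i, p_i in enumerate(EQUILIBRIUM):
        for j, q_j in enumerate(opponent_mix):
            ev += p_i * q_j * PAYOFF[i][j]
            if i == j:
                ev -= p_i * q_j * rake
    return ev


if __name__ == "__main__":
    # 100% rock, 100% paper, 50/50 rock/paper: always 0 before rake, -1/30 with a 1/10 rake.
    for mix in ([1, 0, 0], [0, 1, 0], [Fraction(1, 2), Fraction(1, 2), 0]):
        print(mix, equilibrium_player_ev(mix), equilibrium_player_ev(mix, rake=Fraction(1, 10)))
```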
”Other example is RPS: if one plays the optimal strategy of 1/3rd each, you can play 100% paper, 50/50 paper/stone, dance around the table, sing your name or whatever else comes to your mind, the result will always stay the same. Add a "rake" for each draw any the solver will lose (as well as his opponent).”
I'm a little surprised you said this, given your previous posts (which were very good). The quote is a common misunderstanding.
It's true for RPS, but in poker the equilibrium strategy can gain EV when the opponent takes an option with lower EV than the optimal option(s).
That is correct, but a) it's not true as a general statement and b) it has limited relevance:
1) Scenarios where it is "possible" to lose EV by deviating from the optimal strategy are limited to missing aggression.
2) It's possible in the other direction as well: if the human player is overaggressive, the solver can lose EV.
Examples:
Solver bets the river pot with optimal balance. Human player deviates from MDF and folds 100%. EV gain for solver = ZERO.
Human player bets pot on the river. Deviating from his optimal frequency, he bluffs his entire range (any two). Solver defends MDF. EV gain for solver = ZERO.
Human player has 6 value combos and 6 bluff combos. His "EQ-EV" (for betting pot with 6 + 3 combos) is 9/12 pot; the solver's EQ-EV is 3/12 pot. Human checks 100%. EV gain for solver = 3/12 pot.
Solver bets the river for psb with 6 value combos and 3 bluffs. Human player has 100% airballs (hence his EQ-EV is zero), but elects to 3-bet 100%. Solver defends MDF and calls with 3 value combos and folds the rest. EV loss for solver = full pot.
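The numbers in the third example can be verified in Python. Assumptions: pot = 1, a pot-sized bet, the solver calls 50% (MDF) with bluffcatchers that beat air and lose to value, checked value wins at showdown and checked air loses. Since the spot is constant-sum, the solver's EV is the pot minus the human's EV.

```python
from fractions import Fraction

POT = Fraction(1)
BET = Fraction(1)        # pot-sized bet
CALL = Fraction(1, 2)    # solver's calling frequency (MDF vs a pot bet)


def human_ev(value, air, value_bet, air_bet):
    """Human's average EV per combo (in pots) for a given bet/check split."""
    fold = 1 - CALL
    bet_value = fold * POT + CALL * (POT + BET)   # called value wins pot + call
    bet_air = fold * POT - CALL * BET             # called bluff loses the bet
    checked_value = POT                           # wins at showdown
    checked_air = Fraction(0)                     # loses at showdown
    total = (
        value_bet * bet_value
        + (value - value_bet) * checked_value
        + air_bet * bet_air
        + (air - air_bet) * checked_air
    )
    return total / (value + air)


if __name__ == "__main__":
    eq = human_ev(6, 6, value_bet=6, air_bet=3)      # bet 6 value + 3 bluffs
    check_all = human_ev(6, 6, value_bet=0, air_bet=0)
    print(eq, POT - eq)                              # 3/4 (= 9/12) and 1/4 (= 3/12)
    print("solver gain:", (POT - check_all) - (POT - eq))  # 1/4 (= 3/12 pot)
```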
That said, yes, my statement that a solver plays exactly break-even (in a non-raked environment) is not correct in its absoluteness, but I'd expect it to be very close.
A clarification: a human player WILL lose EV against an equilibrium output. It's just a matter of how little/how much.
Strong players will not be making “EV loss” decisions that often, whereas weaker regs will make more.
see above
The equilibrium strategy can never lose EV overall. A specific combo can "lose EV", but other combos will then gain EV and make up for at least what it lost.
I think you would be right if poker were a game starting on the river with simple ranges and an SPR of 1. To see that we can obviously lose EV, let's look at extremes; that's usually a good approach:
Extreme ex 1:
Solver opens BTN 2,5BB, Human calls with 3c3s in the BB.
Flop: 3h3d8s
Human checks, Solver bets 25%, Human folds.
This is clearly an EV loss. Calling and raising obviously have many BBs of EV for the human, and we chose an option with 0 EV. The solver gains EV by winning a pot it clearly should lose: at a minimum, what was in the pot after its bet.
Extreme ex 2)
Solver Opens 2,5BB OTB, Human calls BB with 7h2d.
Flop AsKd8s
Human checks, Solver bets 150% pot, human calls.
Also clearly an EV loss for the human and an EV gain for the solver. Even if the solver doesn't take advantage of the human's bad preflop ranges or terrible flop strategy, it's winning a lot just by playing normally.
Obviously this won’t happen in reality vs any human that knows the rules of poker. It’s just to illustrate a point. A realistic example could be something like this:
In this action line, if we bet small with or checked the Qs9s combo, we would lose about 0.1BB in EV, because it's a pure 2/3-pot bet.
Actually there’s a very good video on this topic here on RIO. I will put a link to it, and I really recommend it regardless…
My main question remains: what is a good estimate of a good pro's 6-max bb/100 loss rate vs a (non-clairvoyant) solver strategy?
https://www.runitonce.com/poker-training/videos/clairvoyance/