The Surprising Power of Reinforcement Learning (and Why It’s Everywhere)
John Stuart Mill in the 19th century, B.F. Skinner in the 20th, and Q-learning today all hinge on a simple, almost universal truth: What gets rewarded gets repeated. Whether it’s shaping behavior in people or machines, the logic remains the same—define the reward, and you define the outcome. This simple statement lies at the root of Reinforcement Learning (RL). I’ve written on Supervised Learning and Unsupervised Learning, and this post continues the learning sequence. I’m going to explain what Reinforcement Learning is, of course, as well as its key concepts, how it works, real-world applications, and more. As we explore this fascinating technology, you may find that you recognize examples of it all around you.
What is Reinforcement Learning?
You’re probably familiar with the practice of using a treat to encourage a dog (or a person) to do something. One of my favorite examples comes from my favorite sitcom, The Big Bang Theory. Sheldon is trying to change some of Penny’s behaviors by tossing her a chocolate when she does what he prefers — but not telling her why he’s rewarding her. He’s trying to condition her without her realizing it. On some level, she perceives the desired action, performs it, and receives the desired reward. In theory, if Sheldon started rewarding a different behavior with two chocolates instead of one, Reinforcement Learning predicts that Penny would shift toward that higher-reward behavior.
An RL agent learns how to act in an environment to maximize cumulative reward. We see RL agent behavior in each major character in the show. However, where the characters in The Big Bang Theory are all just trying to survive their relationships with each other, a machine has to be programmed to seek that maximization.
To put it in the most basic terms, a machine is programmed to accumulate reward. It does so through trial and error across the actions available to it, gaining or losing whatever the environment’s currency is. The RL agent learns to repeat actions that increase its running total and to avoid actions that decrease it.
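To see what “accumulate” means concretely: agents are usually scored on the discounted sum of their rewards, where a discount factor makes sooner rewards count more than later ones. A minimal sketch in Python, with made-up reward numbers:

```python
# Cumulative (discounted) return: the total an RL agent tries to maximize.
# gamma < 1 makes sooner rewards worth more than later ones.
def discounted_return(rewards, gamma=0.9):
    total = 0.0
    for step, reward in enumerate(rewards):
        total += (gamma ** step) * reward
    return total

# Hypothetical rewards from one episode: two small gains, a penalty, a big win.
print(discounted_return([1, 1, -5, 10]))  # prints roughly 5.14
```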
Key Components of Reinforcement Learning
- Agent: the learner or decision-maker
- Environment: the world in which the agent operates
- Actions: what the agent can do
- States: the situations in which the agent finds itself
- Rewards: the feedback the environment provides
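If it helps to see those five pieces as code, here is one way they might map onto a toy program. Everything here (the number-line world, the reward amounts) is invented purely for illustration:

```python
import random

# Environment: where the agent operates. This toy version is a walk along a
# number line; reaching +5 pays off, and falling to -5 costs.
class WalkEnvironment:
    def __init__(self):
        self.position = 0                    # State: the situation the agent is in

    def step(self, action):                  # Actions: -1 (left) or +1 (right)
        self.position += action
        if self.position == 5:               # Rewards: feedback from the environment
            reward = 10
        elif self.position == -5:
            reward = -10
        else:
            reward = 0
        return self.position, reward

# Agent: the learner or decision-maker. This one isn't learning yet; it just
# acts at random, which is effectively where every RL agent starts.
class RandomAgent:
    def choose_action(self, state):
        return random.choice([-1, +1])
```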
How it Works: The Reinforcement Learning Process
The agent observes the environment and chooses an action. At first, the agent doesn’t know much about its situation, like being dropped onto an island. It assesses the state it’s in – a location, a screen, a configuration – as well as the possible actions and obstacles. At this point, it doesn’t know the best thing to do; it’s figuring out what’s available to do. It has only these observable conditions to use to plan actions.
Because it’s a new environment, the RL agent doesn’t have a basis to determine “good” actions, “bad” actions, or “better” actions. It simply chooses an action and acts, and, as a result, it receives a reward of some type. Along with the reward, the agent receives a new state, because every action will introduce some change. Based on the results of the action and its new state, the agent updates its strategy, which is called a policy.
The agent now enters an OODA loop – Observe, Orient, Decide, Act. Only for the agent, we replace the Orient segment with a Recall segment. The agent takes in the environment and its current state. It recalls what it did in the past and what resulted. Over time, the agent builds a bank of actions that worked in one state but not in another. It becomes able to compare its current state and available actions with past states and actions. It selects the action that has produced the highest rewards in similar states, and it performs that action.
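A common concrete version of this loop is tabular Q-learning, where the agent’s “bank” of past results is a table holding one score per state-and-action pair. Here is a minimal sketch that reuses the toy WalkEnvironment from above; the learning rate and discount values are arbitrary choices, not prescriptions:

```python
from collections import defaultdict
import random

ACTIONS = [-1, +1]
q_table = defaultdict(float)   # the agent's memory: one score per (state, action)
alpha, gamma = 0.1, 0.9        # learning rate and discount factor (assumed values)

env = WalkEnvironment()
state = env.position
for _ in range(10_000):
    # Decide: pick the action with the best remembered score (random tie-break).
    action = max(ACTIONS, key=lambda a: (q_table[(state, a)], random.random()))
    # Act, then observe the reward and the new state.
    next_state, reward = env.step(action)
    # Recall/update: nudge this action's score toward the reward it just earned
    # plus the best score remembered for the state it landed in.
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    q_table[(state, action)] += alpha * (reward + gamma * best_next - q_table[(state, action)])
    # Start a fresh walk when an episode ends; otherwise continue from the new state.
    if abs(next_state) == 5:
        env = WalkEnvironment()
        state = env.position
    else:
        state = next_state
```

Notice that this version always picks the best-known action. That shortcut leads straight into the next topic.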
Exploration vs. Exploitation
To explain this concept clearly, we’ll need to use the dictionary definition of “exploit”. It means to make full use of and derive benefit from something. That’s very different from the more negative social meaning it often carries today. In reinforcement learning, “exploitation” simply means the agent is using what it has already learned to make the best possible choice. It’s not unethical or harmful—it’s just efficient.
Exploration, for an RL agent, means taking an action for which it has no historical justification. It doesn’t know what the outcome will be. Well, it doesn’t always know exactly what the outcome will be. The agent can recognize patterns and generalize from what it has experienced to estimate what might happen next. Although it doesn’t “infer” in a philosophical sense, it can make probabilistic inferences.
How does it work?
For example, the last time it walked into a pond in the game, some of its parts got damaged. Now it encounters a lake. Even though it hasn’t seen this lake before, the RL agent can conclude that damaged parts are a possibility again, and it might choose to avoid it. On the other hand, if the last time it entered a body of water it found a useful tool and took no damage, it might decide the lake is worth investigating again. The decision depends entirely on the rewards and penalties it has experienced in similar situations.
Given that Reinforcement Learning is all about getting rewards, then, how does the agent decide to enter the lake in the first place? We can see a similarity between the agent and humans who default to their comfort zones, can’t we? We humans often need some sort of incentive to entice us into unfamiliar territory. But then, we are a different sort of agent, with hopes and dreams, wishes and desires, which provide us with the motivation to get off the couch, so to speak.
What we do for the agent is provide it with a policy, or decision strategy, that includes a way to explore new actions (even if they’re risky or unknown) and a way to exploit known good actions. We may weight the incentives toward exploration in one phase of the experience and shift them toward exploitation in a later phase. Perhaps we include “curiosity bonuses” for trying new things. Or we can state it simply: with some probability, explore; otherwise, exploit (a rule sketched in code below).
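One standard way to encode “with some probability, explore; otherwise, exploit” is an epsilon-greedy policy. Here is a minimal sketch; the curiosity bonus is an assumed add-on for illustration rather than a fixture of the basic method:

```python
import random

def choose_action(state, q_table, actions, epsilon=0.1, visit_counts=None):
    # Explore: with probability epsilon, ignore history and try a random action.
    if random.random() < epsilon:
        return random.choice(actions)

    # Exploit: otherwise, pick the action with the best remembered score.
    def score(action):
        q = q_table.get((state, action), 0.0)
        # Optional "curiosity bonus": rarely tried actions get a small boost,
        # nudging the agent toward the unfamiliar even while exploiting.
        if visit_counts is not None:
            q += 1.0 / (1 + visit_counts.get((state, action), 0))
        return q

    return max(actions, key=score)
```

Decaying epsilon as training progresses matches the phased idea above: more exploration early on, more exploitation once the agent has banked some experience.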
Without exploration, the “learning” aspect of “Reinforcement Learning” would be pretty thin. Each decision provides more data for future decisions, but without the possibility of exploration, a run would either end quickly or repeat the same safe choices forever.
Real-World Applications
We don’t hear Netflix telling us that it uses Reinforcement Learning to help us enjoy our experience on the platform, but it does. We all know the algorithm is paying attention—but what most people don’t realize is that every click, pause, rewatch, or thumbs-up is a tiny lesson. You’re not just watching; you’re training the system to get better at pleasing you. The same is true for Spotify, TikTok, YouTube, and Amazon Prime Video – and any other platform where you may see something along the lines of “because you liked…”
There are examples in self-driving cars as well, and, fortunately, much of their “exploration” occurs in simulated environments rather than out on the road with the rest of us. Carrying those lessons forward – including crashes that were only virtual – helps a self-driving vehicle avoid making hazardous choices. Merging, changing lanes, and maintaining an appropriate speed for the conditions are all learned behaviors, and some of that training was done with RL.
Benefits and Challenges
RL agents provide a more autonomous learning experience than Supervised Learning, and RL offers adaptability and strong performance in dynamic environments. That’s sort of the point of it – the ability to learn and keep learning when conditions aren’t static. RL presents a future of possibilities where human intervention isn’t a necessary component for improvement.
As always, though, there are hurdles and limitations to what we should expect from RL agents. Reinforcement Learning can require long training times, because the agent improves through experience rather than from labeled feedback or a set of provided answers. The agent also relies on someone having defined the reward function well ahead of time, or at least before the agent gets to it. We’d like to think that we would know what an RL agent would prefer, but, truthfully, we may not. Additionally, in any situation where life and limb may be affected, erring on the side of caution is the obvious choice.
The Human Element in Reinforcement Learning
Machines may lean toward exploitation once they’ve found a strategy that works – but so do we. We play a card game called Rukus, in which players lay down pairs and can steal cards from other players’ played cards; the one who lays down all the cards in their hand wins the round. One young man figured out, a couple of hands in, that he should hold all his cards until he had a full hand to lay down. He won that hand soundly. I warned him, “Yeah, that works until it doesn’t.” The next hand, it didn’t, and he abandoned the strategy. The results were the teacher, and he learned.
But sometimes we simply opt for the familiar – the job we have, a safe conversation, the restaurant we always go to – because comfort feels better than the unknown. We get rewarded for predictability, and over time, we stop exploring.
Maybe it’s time to ask yourself:
- Have I been reinforcing old choices just because they’re comfortable?
- Is there unexplored potential I’ve been avoiding?
- What would “exploration” look like in this part of my life?
Machines learn best when they explore – even at the risk of failure. Maybe we do, too.
Your Turn
Now that you’ve seen some examples of Reinforcement Learning, where else do you notice it? I’d love to hear your stories – drop a comment below the “Related Posts” segment.
Here are a couple of places to learn more about Reinforcement Learning. It’s also a popular topic in Massive Open Online Courses (MOOCs) on sites like Udemy and Coursera.
Reinforcement Learning | GeeksforGeeks
Reinforcement learning – Wikipedia