Digging into RL 🤖

This week, we’re digging into optimization through reinforcement learning.

🧙‍♂️ Hello Fellow Magicians!

Last week, we took a high-level tour through some key frameworks and tools for tackling optimization problems. We briefly discussed:

  • Reinforcement Learning 🤖

  • Mathematical Optimization 🧮

  • Monte-Carlo Simulation 🎲

This week, we’re going to dig deeper into Reinforcement Learning!

But first - I’m excited to welcome our first sponsor of the newsletter - 1440! Here at NMJM, we approach AI, ML, and Data Science with a clear lens: no fluff, just our honest experiences. The 1440 newsletter approaches the news the same way - please check it out by following the “join for free today” link! It helps keep this newsletter free, too!

Seeking impartial news? Meet 1440.

Every day, 3.5 million readers turn to 1440 for their factual news. We sift through 100+ sources to bring you a complete summary of politics, global events, business, and culture, all in a brief 5-minute email. Enjoy an impartial news experience.

🤖💡 Reinforcement Learning (RL)

Reinforcement Learning (RL) is based on the idea that, through self-guided exploration, a system can learn a near-optimal solution to a problem described as a Markov Decision Process (MDP). An MDP is characterized by four key components:

  • State (S) 🗺️

  • Action (A) 🎮

  • Probability (P) 🎲

  • Reward (R) 🏆

A key constraint for a valid MDP is that the state must represent only the current state—no access to historical states. (Though you can be a bit sneaky 🕵️‍♂️ and store a replay buffer as part of the state… but we’ll save that trick for later!)
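In code, an MDP-style environment usually boils down to two calls: reset() to start an episode and return the initial state, and step(action) to sample the next state, reward, and terminal flag. Here’s a minimal sketch of that interface - loosely following the convention popularized by Gym-style libraries; the class and method names are just illustrative:

```python
from typing import Any, Tuple

class MDPEnvironment:
    """Minimal MDP interface: the agent only ever sees the current state."""

    def reset(self) -> Any:
        """Start a new episode and return the initial state S."""
        raise NotImplementedError

    def step(self, action: int) -> Tuple[Any, float, bool]:
        """Apply action A, sample the next state according to P,
        and return (next_state, reward R, terminal flag)."""
        raise NotImplementedError
```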

🐍🎮 Example MDP: The Classic Game of Snake

Let’s illustrate how to formulate an MDP using Snake, the beloved arcade game. 🕹️

Simple Rules of Snake:

  • The game board is a 100x100 grid. 🎯

  • Every time the snake eats an apple 🍎, it grows by 1 unit.

  • The game ends when the snake either goes off the board or collides with itself. 🚧

  • The player scores 1 point for each unit in the snake when the game ends. 🏁

Formulating Snake as an MDP (a quick code sketch follows this list):

  • State (S): A 100x100 matrix where each cell is one of:

    • 0 for empty ➖

    • 1 for a snake piece 🟩

    • 2 for the snake head 🐍

    • 3 for an apple 🍎

    • Plus one extra value, 0 or 1, flagging whether the state is terminal (game over). 🏁

  • Action (A): The actions are up, down, left, right, coded as 0, 1, 2, 3. ⬆️⬇️⬅️➡️

  • Probability (P): This represents the likelihood of transitioning from one state to another given an action. In RL, you often don’t need to know these exact probabilities beforehand—the system learns them. For example, when the action is "up," the snake’s head usually moves up one unit, but there’s a chance it could collide with itself, triggering a terminal state. 💥

  • Reward (R): One way to define the reward is by points earned at the end of the game: for most of the game, state-action pairs give 0 reward, but at the end, the reward equals the snake units collected. Alternatively, you might reward each time the snake eats an apple 🍎, encouraging the model to chase apples—but this could make it harder to learn to stay alive. Designing the reward function is both fun and crucial—a major area of RL innovation. 🎯💡
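To make that concrete, here’s a minimal sketch of the Snake MDP as a Python environment. The cell encoding and actions match the formulation above; the starting position, apple placement, and exact bookkeeping are my own illustrative choices, and I’m using the terminal-only reward described in the Reward bullet:

```python
import numpy as np

EMPTY, BODY, HEAD, APPLE = 0, 1, 2, 3                   # cell encoding from the State above
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

class SnakeEnv:
    """Snake as an MDP: the state is the grid plus a terminal flag."""

    def __init__(self, size=100, rng=None):
        self.size = size
        self.rng = rng or np.random.default_rng()

    def reset(self):
        self.snake = [(self.size // 2, self.size // 2)]  # head is snake[0]
        self.done = False
        self._place_apple()
        return self._observe()

    def _place_apple(self):
        while True:
            cell = (int(self.rng.integers(self.size)), int(self.rng.integers(self.size)))
            if cell not in self.snake:
                self.apple = cell
                return

    def _observe(self):
        grid = np.full((self.size, self.size), EMPTY, dtype=np.int8)
        for r, c in self.snake[1:]:
            grid[r, c] = BODY
        grid[self.snake[0]] = HEAD
        grid[self.apple] = APPLE
        return grid, int(self.done)

    def step(self, action):
        dr, dc = MOVES[action]
        head = (self.snake[0][0] + dr, self.snake[0][1] + dc)
        off_board = not (0 <= head[0] < self.size and 0 <= head[1] < self.size)
        if off_board or head in self.snake:
            self.done = True                             # off the board or self-collision
            return self._observe(), float(len(self.snake)), True  # terminal reward = snake length
        self.snake.insert(0, head)
        if head == self.apple:
            self._place_apple()                          # ate an apple: grow by keeping the tail
        else:
            self.snake.pop()                             # otherwise just move: drop the tail
        return self._observe(), 0.0, False
```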

Once you’re confident that your problem fits the MDP formulation, RL can help you learn an optimal strategy, or policy, through self-guided experimentation.
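And here’s what “learning a policy through self-guided experimentation” can look like in its simplest form: a tabular Q-learning loop with epsilon-greedy exploration. This is purely illustrative - a full 100x100 Snake board has far too many states for a table, so in practice you’d reach for function approximation (a neural network) - but the loop structure stays the same. It assumes the SnakeEnv-style interface sketched above:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1, n_actions=4):
    """Tabular Q-learning: learn a policy purely from self-guided experimentation."""
    Q = defaultdict(lambda: [0.0] * n_actions)          # Q[state][action]

    def key(state):
        grid, terminal = state
        return grid.tobytes(), terminal                 # hashable snapshot of the state

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            s = key(state)
            # Explore with probability epsilon, otherwise exploit the current estimate.
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[s][a])
            next_state, reward, done = env.step(action)
            ns = key(next_state)
            target = reward + (0.0 if done else gamma * max(Q[ns]))
            Q[s][action] += alpha * (target - Q[s][action])
            state = next_state
    return Q
```

Usage might look like `Q = q_learning(SnakeEnv(size=6), episodes=2000)` on a toy board; for anything realistic you’d swap the table for a neural network (deep Q-learning).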

✅ Advantages of RL:

  • Doesn’t require large, pre-existing datasets to learn. 📚❌

  • You don’t need a full understanding of the system being optimized. 🤔🔍

⚠️ Disadvantages of RL:

  • Requires the problem to be well-defined as an MDP. 📝

  • Designing an effective reward function can be tricky. 🎯🧐

  • Balancing exploration vs. exploitation is complex. 🤷‍♂️🔄

🛠️ Real-World Tools Using RL:

  • Warmy.io 🔥:

    I suspect Warmy uses RL in their email warm-up functionality. The reward could be a point for each email successfully delivered to the main inbox.

  • Google Ads 📈:

    Google Ads uses RL to learn the optimal times to show ads and the best user segments to show them to. Here, the reward would be something like a point for each conversion.
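I don’t have visibility into either company’s internals, but the simplest version of this “a point per conversion” idea is a multi-armed bandit. Here’s a toy epsilon-greedy sketch where each arm is a hypothetical ad-placement choice (say, a time-of-day and audience-segment pair) and get_conversion is a stand-in callback for observing whether an impression converted:

```python
import random

def epsilon_greedy_ads(arms, get_conversion, rounds=10_000, epsilon=0.1):
    """Toy epsilon-greedy bandit: each arm is an ad-placement choice
    (e.g., a time-of-day / audience-segment pair); reward is 1 per conversion."""
    counts = {arm: 0 for arm in arms}
    values = {arm: 0.0 for arm in arms}                 # running conversion-rate estimate
    for _ in range(rounds):
        if random.random() < epsilon:
            arm = random.choice(arms)                   # explore
        else:
            arm = max(arms, key=lambda a: values[a])    # exploit the best estimate so far
        reward = get_conversion(arm)                    # 1 if the impression converted, else 0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return values
```

In real use you’d plug in actual arms (e.g., ["morning/segment-A", "evening/segment-B"]) and wire get_conversion to real impression outcomes; production ad systems are far more sophisticated (contextual bandits, full RL), but the explore/exploit core is the same.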

✍️💭 Author’s Notes:

I often reach for RL when a problem is poorly defined, but the goal is relatively simple. For example, RL has been incredibly useful in online marketing, where the objective is clear: maximize conversions per dollar spent. It’s hard to define the exact equations behind why someone buys a product, but over time, RL models can learn strategies to maximize conversions through experimentation.

This is precisely what giants like Facebook and Google use in their ad platforms. If you’ve ever started a Google or Facebook ad campaign, you’ve probably noticed a “learning” phase in the first few hours or days—that’s RL in action, figuring out how to best optimize conversions from your ad! 📈🚀
