Delivering Your Marketplace's Orders in Time with Machine Learning

Our Recommendations on Building a Preparation Time Predictor

Posted by Mateo De Monasterio

on January 29, 2021 · 13 mins read

This post is an entry to our Three-Sided Marketplaces + Machine Learning series. In our introduction we mentioned one of the possible ways to improve a Marketplace: predicting the preparation time of an order. Having this information brings two great benefits to any Marketplace: users now know an estimate of when their order will arrive home and better logistics can be achieved since you can now intelligently assign couriers based on when the order will be ready.

This post can be read as a standalone entry, but feel free to check out the introduction: it'll tell you what Marketplaces are and how Machine Learning can optimize them!

Another key concept here is that we'll be working with Time Series data, a series of data points ordered in time. Once we have this data, it's possible to forecast its future values based on the previously observed ones. There's a lot to say about this topic (and we might do a follow-up post on Time Series forecasting later!) but for this entry, just knowing that Time Series data is a series of data points ordered in time is enough.

This post details some of the experiences we gained by solving a preparation time prediction problem in the food delivery business. We learned a lot about its inner workings and how to adapt its needs into a Machine Learning problem, and we'd like to share that with you!

The problem

First of all, let's lay out the problem clearly in the context of a food marketplace: once an order is confirmed and accepted by a restaurant, how long until it's ready and we can go pick it up? This means that we are going to focus on predicting the preparation time of an order, not on predicting its whole duration.

That could be obtained by adding our preparation time prediction to a travel time prediction (for example from any GPS service), since the courier just has to travel from point A to point B.

Considering the whole timeline of an order, we are trying to predict a variable amount of time that occurs during the red line: it starts when the order is confirmed and ends when the courier actually picks up the item. But a courier might wait more time if he arrived before the order was actually ready, or the order might spend a long time on the restaurant's counter if we pick it up late.

As we've mentioned, knowing the preparation time of an order would be great for logistics: with it we'd know an approximate time for when the couriers should be near the restaurants to pick up an order! In business terms, this translates into less time wasted waiting for orders that are not ready and ensuring that our couriers can actually fulfill them in time.

In Machine Learning terms this means that the y column we want to predict would be the preparation time (in any unit of time such as seconds or minutes) of an order, and we want it to be as close to the real thing as possible.

Two sides of the Marketplace can help us obtain data to solve this problem:

Restaurants notify that the food is ready for pick up.
Couriers get to the restaurant, and can notify us if they had to wait because the food wasn't actually ready.

But why try and predict this? In any normal food delivery Marketplace, restaurants notify the app when the food is ready, and that notification should always be accurate, right? What's the need for predicting the preparation time? If we get an order for a hamburger on Tuesday then it should take exactly the same time as it did on Monday. Sadly, this is not the case. Here are some reasons why using a Machine Learning solution is way better:

Adaptable to real world scenarios: the truth is that the same item will not always take the same preparation time. For example if the restaurant has too many orders going at the same time then even a simple order might take longer than expected.
Restaurants notify that the food is ready too late or too early: restaurants are not perfect. If they are part of our Marketplace, then they'll have to notify that the order is ready themselves. Even if our integrating app is great and has a very clear UI, this takes time. It turns out that some of them forget to notify that the order is ready to pick up, letting it get cold. We also found out that some even try to game the system by notifying that the order is ready immediately!

We want our prediction to be good even when restaurants don't notify that the order is done, and for it to adapt to different live conditions (current amount of orders, festivities, weather, etc). These two requirements can be met with the help of some feature engineering, data cleaning and adding the business logic to our Machine Learning system. These tools will help for all the restaurants in our Marketplace, but since each restaurant is its own world a unique Machine Learning model has to be trained for each of them. Let's take a deep dive into these ideas.

Feature Engineering

Good features are key to the performance of any Machine Learning system. In this case, they'll allow us to model what is going on in the real world when the order is placed at a restaurant. Has the restaurant been getting a lot of orders? How long have they been taking to prepare that dish in the past few hours? And what about last week? All of this information can be translated into features and used to train the Machine Learning algorithm. Here are some of the ones we had the most success with:

Past preparation time: lagged features are a classic staple of Time Series data. They transform the information of past events into a feature. Knowing how long the same order took in the past is a good indicator of how long it should take right now. Of course the lagged information we use as a feature should be related to the order somehow: getting how long it took the same day, same hour last week and how long it has been taking the past few hours is a good idea.

Averages are useful too: if no one has ordered that item in the past few hours or last week, the mean preparation time through the day might be the next best thing.

Amount of orders in the past few hours: if a restaurant is extremely busy then they'll probably take longer to cook their food. Adding the amount of orders they've had in the past few hours could be a good indicator of how busy they are at the moment.
Price and amount of the order: as the owner of the Marketplace, chances are that when an order is placed you'll know its price and how many items were in it. These can be great features: maybe the restaurant takes longer for each additional item in the order, or prioritizes expensive items by cooking them first. These features will help find any of these correlations.
Current and future weather: weather is a common cause for delays not only on the delivery side of things, but also on the preparation of an order. People tend to stay at home during days that will have bad weather, and more people staying at home means more orders coming in! This bigger-than-usual demand usually translates into longer preparation times.
Categorizing rush hours: whether the meal was made during breakfast, lunch or dinner can make the difference. This feature is simple enough, since it's easy to get the hour of when the order was placed and use that to define whether it was made during a rush hour or not.
Holidays feature: as with rush hours, it's possible to know if an order was made on a holiday. Instead of hard coding all possible holiday dates yourself, we recommend using a library such as holidays.
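As a rough illustration of the features above, here's a minimal pandas sketch for a single restaurant. The column names (confirmed_at, prep_minutes), the 3-hour window, the meal-period hours and the holiday_dates argument are all assumptions made for this example, not taken from our production system. Holiday dates are passed in as a plain set so the sketch stays self-contained; a library such as holidays could supply them. It also assumes the preparation time of every earlier order is already known when a new order comes in, which won't hold for orders that are still in the kitchen.

```python
from datetime import date

import pandas as pd


def add_order_features(orders: pd.DataFrame, holiday_dates: set) -> pd.DataFrame:
    """Add lagged, load, rush-hour and holiday features for one restaurant.

    Expects the hypothetical columns `confirmed_at` (datetime) and
    `prep_minutes` (float).
    """
    df = orders.sort_values("confirmed_at").reset_index(drop=True)
    by_time = df.set_index("confirmed_at")["prep_minutes"]

    # closed="left" keeps the current order out of its own window.
    recent = by_time.rolling("3h", closed="left")
    df["prep_mean_3h"] = recent.mean().to_numpy()
    df["orders_3h"] = recent.count().fillna(0).to_numpy()

    # Fallback when nothing was ordered recently: mean of all earlier orders.
    earlier_mean = by_time.expanding().mean().shift(1).to_numpy()
    df["prep_mean_3h"] = df["prep_mean_3h"].fillna(pd.Series(earlier_mean))

    # Meal periods; the hour ranges are assumptions and should match your market.
    hour = df["confirmed_at"].dt.hour
    df["meal_period"] = "other"
    df.loc[hour.between(7, 10), "meal_period"] = "breakfast"
    df.loc[hour.between(12, 14), "meal_period"] = "lunch"
    df.loc[hour.between(19, 22), "meal_period"] = "dinner"

    # Holiday flag based on the calendar date of the order.
    df["is_holiday"] = df["confirmed_at"].dt.date.isin(holiday_dates)
    return df


orders = pd.DataFrame({
    "confirmed_at": pd.to_datetime([
        "2021-01-01 12:00", "2021-01-01 12:30",
        "2021-01-01 13:00", "2021-01-02 20:00",
    ]),
    "prep_minutes": [10.0, 20.0, 30.0, 25.0],
})
features = add_order_features(orders, holiday_dates={date(2021, 1, 1)})
print(features)
```

The very first order has no history at all, so its prep_mean_3h stays empty; how to fill it (for example with a Marketplace-wide average) is a decision left to you.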

Cleaning up the restaurant data

A key part of the Machine Learning process is cleaning up the data available. Having more data is great, but feeding incorrect or dirty data to our system will just decrease its performance. Restaurants can accidentally or purposely feed us incorrect information. How can we detect if that's the case? Is it even possible to save that data?

A good first step when tackling a new problem is doing some good data exploration, and visualizations are key here. They'll help us understand what we're dealing with, and share those findings to then make informed decisions. But at first your visualizations might end up looking like this:

But not everything is lost! This is telling us we have some serious outliers in our data, both positive and negative. This calls for some data cleaning. Two things we found to work well are:

Discard extreme outliers, such as negative or incredibly big values. Whether due to a problem with the UI or a restaurant forgetting to report that the order was ready, the value is complete nonsense. Discarding it is the best choice here.
Set a threshold and clip values greater or smaller than those thresholds. It's not entirely unlikely for an order to take an extremely short or long time (e.g. an order taking 3 minutes or 90 minutes) without it being an error. These are still outliers, but the data is correct. Part of adapting our business logic into a good Machine Learning system is adapting these values to our business needs: some orders might take 3 minutes to prepare, but we might prefer our predictions to err on the side of overprediction in this case. This avoids couriers getting there extremely early if the prediction was wrong. The same idea works for orders which take too long: we might prefer to underpredict those cases so that we don't get there extremely late. Clipping helps with this: for example, if we define the interval [i, j], all values smaller than i get replaced by i, and all values bigger than j become j.

Let's do what we just said and discard those extreme outliers and use a clip of [5,45].

Much better! Now we know the data we'll use to train our algorithm will never be above 45 minutes or below 5 minutes. It seems like most of our orders fall around the 15-25 minute range.
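The two cleaning steps can be sketched in a few lines of NumPy. The [5, 45] clip comes from the example above; the 180-minute cutoff for "incredibly big" values is an assumption picked for illustration and should be tuned to your own data.

```python
import numpy as np


def clean_prep_times(minutes, max_valid=180.0, clip_low=5.0, clip_high=45.0):
    """Drop nonsensical values, then clip the rest to [clip_low, clip_high]."""
    minutes = np.asarray(minutes, dtype=float)
    # Negative or absurdly long durations are treated as reporting errors.
    valid = (minutes >= 0) & (minutes <= max_valid)
    # Remaining values are real but extreme: pull them into the interval.
    return np.clip(minutes[valid], clip_low, clip_high)


raw = [-12.0, 3.0, 18.0, 90.0, 2000.0]
cleaned = clean_prep_times(raw)
print(cleaned)  # [ 5. 18. 45.]
```

Here -12 and 2000 are discarded as errors, while 3 and 90 are kept but clipped to 5 and 45.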

Metrics and adapting the business logic

We already mentioned how by clipping values we were adapting the business logic into our Machine Learning system. Coming up with metrics that link the system to our business goals is also part of this process, as it helps us understand if and how the system is helping with the business. From experience we know that this step is as fruitful as, or even more fruitful than, some complex Machine Learning techniques.

There are lots of classic functions that can measure the error of a Machine Learning system. Mean Absolute Error, Mean Squared Error or Root Mean Squared Error work great to measure error when training our supervised regression algorithm, but it's hard to understand how these results actually impact our business. An easy and simple alternative is MAPE (Mean Absolute Percentage Error). It translates the error from an absolute number into a percentage. A MAPE of 30% means that, on average, our predictions are 30% off. We can then decide if this percentage is good or not for our business.
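For reference, MAPE is the mean of |actual - predicted| / |actual|, expressed as a percentage. A minimal NumPy sketch with made-up numbers follows; note that the metric is undefined when an actual value is zero, which the clipping described earlier rules out.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, as a percentage (100 = 100% off)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)


actual = [10.0, 20.0, 40.0]      # real preparation times, in minutes
predicted = [12.0, 15.0, 40.0]   # model predictions, in minutes
error = mape(actual, predicted)
print(f"MAPE: {error:.1f}%")     # relative errors 20%, 25% and 0% average to 15%
```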

MAPE is a classic metric that fits most regression problems, but taking the time to create a custom metric that fits our specific business needs is also a great idea. Do we prefer our couriers to get to their destination before the order is ready? Or maybe later? Do we want to know if our model is more accurate on expensive orders? These things and more can be measured with a custom metric, but it's up to us to build it. And they are useful for more things than just measuring how well the business is doing: we can train our Machine Learning system to optimize its results based on any of these metrics.

For example, let's say you have a business goal to never arrive 20 minutes late to pick up an order. If our predictions cause that, it means that our system is over-predicting the preparation time, which makes a courier arrive 20 minutes late. To mitigate this, we can punish those mistakes by weighting them more heavily in our loss function, and use that function to train our algorithm. This doesn't guarantee the best general performance for our system, but it'll help it fit your business goals much better.
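Before touching the training objective, it can help to measure how often that goal is actually violated. Here's a sketch of such a custom metric, under the simplifying assumption that the courier arrives exactly at the predicted time; the function name and the sample numbers are made up for illustration.

```python
import numpy as np


def late_pickup_rate(y_true, y_pred, max_late_minutes=20.0):
    """Share of orders where the prediction overshoots by more than the limit."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Positive overshoot: the food was ready before the predicted time.
    overshoot = y_pred - y_true
    return float(np.mean(overshoot > max_late_minutes))


actual = [10.0, 15.0, 30.0, 12.0]
predicted = [35.0, 14.0, 45.0, 40.0]
rate = late_pickup_rate(actual, predicted)
print(rate)  # 0.5: two of the four orders would be picked up over 20 minutes late
```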

This article by Prince Grover and Sourav Dey has much more information about how to wrap your head around custom loss functions and implement them.

And good news: many of the Machine Learning libraries that have a Scikit-Learn interface can easily do this. One example is the Scikit-Learn Wrapper interface for XGBoost. Based on the implementation from the article, this is how the code looks:

import numpy as np


def custom_asymmetric_objective(y_true, y_pred):
    """Compute asymmetric objective function punishing big over-predictions.

    Implementation based on: https://towardsdatascience.com/custom-loss-functions-for-gradient-boosting-f79c1b40466d
    """
    # Negative residuals are over-predictions (y_pred > y_true).
    residual = (y_true - y_pred).astype("float")
    # Over-predictions of more than 20 minutes weigh 10x in the gradient and hessian.
    grad = np.where(residual < -20, -2 * 10.0 * residual, -2 * residual)
    hess = np.where(residual < -20, 2 * 10.0, 2.0)
    return grad, hess

Using this function as the objective parameter will do the trick.
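To make it clearer what this objective optimizes, here's the loss that its gradient and hessian appear to correspond to: a squared error that counts 10 times more when we over-predict by more than 20 minutes. The asymmetric_loss function below is our reconstruction from the code above, written only for illustration, and the numeric check is just a sanity test of the gradient.

```python
import numpy as np


def asymmetric_loss(y_true, y_pred, limit=20.0, penalty=10.0):
    """Squared error, multiplied by `penalty` when over-predicting by more than `limit`."""
    residual = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    weight = np.where(residual < -limit, penalty, 1.0)
    return weight * residual ** 2


# Same absolute error of 25 minutes, in both directions.
over = float(asymmetric_loss(30.0, 55.0))   # predicted 25 minutes too long
under = float(asymmetric_loss(30.0, 5.0))   # predicted 25 minutes too short
print(over, under)  # the over-prediction costs 10x more

# Central finite difference of the loss with respect to the prediction.
eps = 1e-4
numeric_grad = float(
    (asymmetric_loss(30.0, 55.0 + eps) - asymmetric_loss(30.0, 55.0 - eps)) / (2 * eps)
)
analytic_grad = -2 * 10.0 * (30.0 - 55.0)   # same expression as in the objective
print(numeric_grad, analytic_grad)
```

With XGBoost's Scikit-Learn wrapper, the objective itself would then be passed as xgboost.XGBRegressor(objective=custom_asymmetric_objective).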

Tracking our growth with MLflow

Throughout this article we mentioned lots of possible ways to improve the performance of a preparation time prediction system: features, metrics, different ways of training our algorithm and even data cleanup. But how do we know that our new changes bring good results over time? Of course metrics are great ways of checking this, but looking at them once before deployment just isn't enough to determine if our changes made things better or worse. People often forget that Machine Learning systems are still software, and good Software Engineering means that the performance of software needs to be constantly monitored. We need to track our system through its lifecycle by multiple metrics. This is where MLflow comes in to help.

MLflow fits into the Machine Learning lifecycle by tracking its results, packaging the model to be easily reproducible, deploying it to any system and having a registry of past models. All of these are great features, but in this entry we'll focus on its tracking component. In our projects it helps us monitor if each of the features and changes we mentioned brought positive results or not. If you want to learn more about how MLflow works and what it can do, it has great documentation for each of its components.

The important thing you need to know about MLflow is that it has a tracking server that supports saving metrics, parameters and artifacts (any kind of file, for example a .csv or a plot image) and a UI to interact with this data. Having a UI is great because the person who programmed the system is no longer the only one who can check and analyze the results: now everyone on the team can do it, even if they are not so tech-savvy.

Here are some recommendations based on our experience:

A good experiment name will help you easily find the results you're looking for. In our case, since each restaurant got its own model and the experiment was run on an hourly basis, we used the following format: <EXPERIMENT_NAME>_<YYYY-MM-DD hh:ss>
Don't forget to use parameters to track which features and configs were used while training the model. Being able to turn your features on and off and modify them by config is always a great idea: track that information as parameters and see which config gets the best results!
Saving files as artifacts lets us visualize plots, raw data and even interactive .html files! When things go wrong, a good plot will let you visualize when and how your results started getting worse. In this problem, we found out that using a heat-map showing the performance of each of our metrics during the day was of great help. And to double-check that we properly cleaned and processed our data we stored a sample of the data, so that we can make sure it looks as it is supposed to.
Querying the results of your experiment locally with MLflow: using MLflow's UI can get a bit cumbersome once we want to do a more refined analysis of the results. It's easy to glance around quickly, but it's ideal to be able to do a deep analysis by looking at all the data points available to determine if our latest changes brought positive results or not. This is why MLflow allows for programmatic queries that can be done anywhere, even locally as long as you have access to the server. Using its search syntax you can get all the parameters and resulting metrics of your experiments!

There's even more to say about MLflow, but we'll leave the code samples and in-depth discussion for the next entry!

Some potential improvements

Sadly, we didn't get to do all the things we wanted to do. We lacked the time for some of them, and some others we found out about way too late! Here are some of the possible improvements that could boost any preparation time prediction solution:

More features: we listed just some of the features we've had good results with, but there are lots of things to try out! For example, if you had data telling you the amount of staff currently working in the kitchen, then that might be a good indicator of how long the preparation will take and would make a great feature.
Oversampling data: restaurants with a low volume of orders are wildcards. Since we don't get a lot of data to train our algorithm, the model might have really inconsistent predictions. It would be worth trying out oversampling when their datasets are really imbalanced.
Trying out different models: each restaurant is its own world, and our model has to adapt as well as possible to their reality to get the best results. We trained each model with the data of each restaurant, but maybe restaurants with a low volume of orders need a completely different model than the ones with a high volume of orders. Discarding and adding features and trying out different libraries and techniques are just some of the ways to custom fit our models for the best results.

Running this in production

Getting this to run reliably in production is another task entirely. We'll want our system to update its predictions fairly often, and since most of the features use recent data, updating it hourly makes sense. But that means that all data extraction, feature engineering, model training, data validation, tracking and saving of the results have to be done each hour! Of course running it manually each hour is a no-go, which means that we want an automated process that dependably does all of this. Using tools such as cron might be a good first solution, but it lacks the ability to re-run the whole process or just one of its steps, reports that notify us if something went wrong and a good UI for anyone to use.

To solve this, our go-to solution is Apache Airflow, which lets us schedule and monitor any of our workflows. Making our preparation time predictions on an hourly basis is just a config away, and anyone can re-run any failed process just by using its UI. We plan to talk more about workflow management in the future, and we'll tell you about our use cases and all the tips and tricks we've learned along the way!

References

Dey, Sourav. Grover, Prince. “Custom Loss Functions for Gradient Boosting”. Towards Data Science, URL

The source of the header picture is Wikimedia Commons:

The life of Benjamin Franklin: Holley, O. L. (Orville Luther), 1791-1861 Anderson, Alexander, 1775-1870 Benjamin Franklin Collection (Library of Congress) DLC. URL