SoaM is a library designed by Mutt. Its goal is to provide a “batteries included” framework for forecasting and anomaly detection pipelines. SoaM stems from multiple project experiences, hence the name: Son of a Mutt = SoaM.
We developed this library because we needed a strong, clear framework for forecasting projects and couldn’t find an open-source solution that fit all of our needs. As Mutters, one of our core values is being Data Nerds: we are always seeking ways to optimize our time and productivity when working with data. By developing SoaM, the same design and code can be reused across different projects whenever needed.
SoaM stands out from other solutions by providing an end-to-end framework for forecasting and anomaly detection products, from the extraction of the data to the reporting of the results. At every stage, SoaM provides plug-and-play components that help you reduce development time.
This long road started back on the 21st of April 2020 when a couple of Mutters had the bright idea to start developing this library and merged the first version of SoaM.
Since then, we have continually pivoted and iterated on the scope of the project but the main goal was always the same: to be the best open-source framework for time-series forecasting data products.
We built this project by setting ambitious goals for each Sprint and dividing the workload among the Mutters involved. We usually meet twice a week: Fridays are for planning and catchup, and Mondays are for live coding. During the week our priority is always external projects, but from time to time we find periods we can dedicate to contributing code to this library.
We also incentivize new Mutters to solve issues during their onboarding period so they can learn about our working methodologies and schemes in a safe environment and most importantly: meet new colleagues!
We can understand SoaM by breaking it down into four main features, where each of them represents a stage in a forecasting pipeline. We divide each stage into different conceptual steps, and we use Prefect tasks and flows to help us build these data pipelines.
This module will assist you with the extraction and aggregation of your data from any given SQL data source and provide you with a pandas DataFrame for easy manipulation.
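The extractor's output is a plain pandas DataFrame. Independently of SoaM, the same extract-then-manipulate pattern can be sketched with the standard library's sqlite3 module and pandas (the table, columns, and values here are invented for the example):

```python
import sqlite3

import pandas as pd

# In-memory database standing in for a real SQL data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (ds TEXT, y REAL)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?)",
    [("2021-01-01", 10.0), ("2021-01-02", 12.5), ("2021-01-03", 11.0)],
)

# Extract straight into a DataFrame, ready for manipulation.
df = pd.read_sql("SELECT ds, y FROM metrics ORDER BY ds", conn)
print(df.shape)  # (3, 2)
```

SoaM's extractor wraps this kind of query building and loading for you, so downstream steps only ever see a DataFrame.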
Once your data is loaded, SoaM provides you with out-of-the-box tools to perform preprocessing tasks such as any Scikit-Learn transformation (like the Min-Max Scaler) or even custom ones. Additionally, you can slice your loaded pandas DataFrame with our built-in modules.
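The Scikit-Learn part of this step works the same outside SoaM. A minimal sketch of min-max scaling a DataFrame column (the column name and values are made up):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"sales": [100.0, 150.0, 200.0, 125.0]})

# fit_transform rescales the column to the [0, 1] range.
scaler = MinMaxScaler()
df["sales_scaled"] = scaler.fit_transform(df[["sales"]])
print(df["sales_scaled"].tolist())  # [0.0, 0.5, 1.0, 0.25]
```

SoaM's Transformer wraps any such transformer object so it can run as one step of the pipeline.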
Since SoaM is intended for time-series forecasting, we already ship built-in forecasting methods such as Statsmodels’ Exponential Smoothing, Uber’s Orbit, and Facebook’s Prophet, for you to easily fit models and compare their performance. Of course, we don’t limit usability to just these models; you can always import your preferred forecasting algorithms as well.
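As an intuition for what the simplest of those models does, here is a from-scratch simple exponential smoothing recurrence in plain NumPy. The real Statsmodels, Orbit, and Prophet models are far richer; this only shows the core idea, with an invented toy series:

```python
import numpy as np

def simple_exp_smoothing(y, alpha=0.5):
    """One-step-ahead forecast via level_t = alpha * y_t + (1 - alpha) * level_{t-1}."""
    level = y[0]
    for value in y[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

series = np.array([10.0, 12.0, 11.0, 13.0])
forecast = simple_exp_smoothing(series, alpha=0.5)
print(forecast)
```

In SoaM you would instead hand a fitted-model wrapper (like SkProphet) to the Forecaster component, as shown in the example below.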
After you forecast future periods, you might want to persist the results somewhere, detect anomalies, backtest your experiments, plot your results, and even report them by email or Slack to your colleagues. Well… we understand this might be a lot to cover. The good news is that SoaM will also help you fulfill these tasks in the blink of an eye.
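The anomaly detection stage boils down to comparing observations against forecasts. A generic, library-free version of that idea, a residual z-score check with made-up numbers and a threshold chosen for the example, can be sketched as:

```python
import numpy as np

def flag_anomalies(actuals, predictions, threshold=3.0):
    """Flag points whose residual lies more than `threshold` std devs from the mean."""
    residuals = np.asarray(actuals) - np.asarray(predictions)
    z = (residuals - residuals.mean()) / residuals.std()
    return np.abs(z) > threshold

actuals = np.array([10.0, 11.0, 10.5, 30.0, 10.8])
predictions = np.full(5, 10.5)  # a flat toy forecast
print(flag_anomalies(actuals, predictions, threshold=1.5))
```

SoaM's own detection components encapsulate this comparison step so it plugs into the same pipeline as the forecaster.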
This is what a simple, generic use case of SoaM looks like:
```python
import pandas as pd
from copy import deepcopy
from pathlib import Path

from prefect import task
from sklearn.preprocessing import MinMaxScaler

from soam.constants import PLOT_CONFIG
from soam.core import SoamFlow
from soam.models.prophet import SkProphet
from soam.plotting.forecast_plotter import ForecastPlotterTask
from soam.reporting.mail_report import MailReportTask
from soam.utilities.utils import add_future_dates
from soam.workflow import Transformer
from soam.workflow.forecaster import Forecaster
from soam.workflow.time_series_extractor import TimeSeriesExtractor

# Number of future periods we will predict.
periods = 7

# 1. Extraction
# Extractor object initialization with the db client and the table name given.
extractor = TimeSeriesExtractor(db=db_client, table_name='table_name')

# 2. Pre-processing
# Transformer object initialization with a chosen Scikit-Learn transformation.
scaler = MinMaxScaler()
transformer = Transformer(transformer=scaler)

# add_future_dates appends the periods we will later predict;
# we create a Prefect task for this job.
@task()
def add_future_dates_task(df: pd.DataFrame):
    return add_future_dates(df, periods=periods, frequency="d")

# 3. Forecasting
# Forecaster object initialization with output_length equal to the number
# of periods we will predict.
my_model = SkProphet()
forecaster = Forecaster(my_model, output_length=periods)

# 4. Post-processing
# Generate a plot and send it by email.
# 4a. SoaM ForecastPlotterTask object initialization.
plot_config = deepcopy(PLOT_CONFIG)
forecast_plotter = ForecastPlotterTask(
    path=Path('img_path'),
    metric_name="metric_name",
    plot_config=plot_config,
)

# 4b. SoaM MailReportTask object initialization.
mail_report = MailReportTask(
    mail_recipients_list=["firstname.lastname@example.org"],
    metric_name="metric_name",
)

# 5. SoamFlow
# Using SoamFlow to run everything as a Prefect pipeline.
with SoamFlow(name="flow") as flow:
    # EXTRACTION
    df = extractor(build_query_kwargs=query_dict)

    # PRE-PROCESSING
    df['column_name'] = transformer(df['column_name'])
    df = add_future_dates_task(df=df)

    # FORECASTING
    predictions, time_series, model = forecaster(time_series=df)

    # PLOTTING
    forecast_plotter(time_series, predictions)

    # REPORTING
    mail_report(current_date="current_date", plot_filename='img_path.png')

# Execute the pipeline
flow.run()
```

Voilà! Now check your inbox and you should find an email that looks like the following:
Of course, SoaM is present in lots of our projects. A common use case is to easily bootstrap an anomaly detection pipeline for monitoring time-series data. In one such instance, we needed to quickly spin up a pipeline to detect anomalous values in business KPIs as part of the implementation of a DataOps platform to improve data pipeline creation and management across the whole organization. For this, we leveraged SoaM modules to extract time-series data from Amazon Redshift, train a model on the data, detect anomalies, plot the results, and send a report to a Slack channel. We successfully got this pipeline up and running with minimal custom code, just adding custom plot logic for styling purposes.
This way, SoaM significantly reduces our workload and structures our code the Mutt way. In the image below you can see how we receive this report on our Slack channel. We scheduled this pipeline to run daily with Apache Airflow and got positive results, enabling the client to make data-driven decisions with minimal development effort.
In another case, we needed to detect and report significant drifts in the input data of an automated ad budget allocator; this helped us reduce losses by adjusting the allocator’s parameters during volatile periods.
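A drift check like the one described can be as simple as comparing a recent window of the input against a historical baseline. A minimal sketch of that idea (the tolerance, data, and function are invented for illustration, not the production logic):

```python
import numpy as np

def mean_drift(baseline, recent, max_relative_shift=0.2):
    """Report whether the recent window's mean drifted beyond a relative tolerance."""
    baseline_mean = np.mean(baseline)
    shift = abs(np.mean(recent) - baseline_mean) / abs(baseline_mean)
    return shift > max_relative_shift

baseline = [100.0, 102.0, 98.0, 101.0]  # historical allocator inputs
recent = [130.0, 128.0, 135.0]          # latest window
print(mean_drift(baseline, recent))
```

Once a check like this fires, the same SoaM reporting components used above can push the alert to Slack or email.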
Have we caught your attention yet? Ready to start using SoaM? Relax… we’ve got you covered. Mutters understand the importance of clean and precise documentation, and SoaM was no exception.
First, you can take a peek at our Sphinx docs to have an overview of the project.
Anyway, if you made it this far you’re probably craving some code. If that’s the case, we have some fresh Jupyter Notebooks with examples, such as our Quickstart I and Quickstart II, where we take you for a ride on how to leverage SoaM tools effectively.
That’s not all: you might be wondering what an end-to-end product using SoaM looks like, or how to integrate it with other tools such as Apache Airflow. Well… we also have documentation for that. Check our End to End data product with SoaM post to see what that looks like.
You can install it directly from PyPI with a simple:

```shell
pip install soam
```
Check the installation section on our README for further information on installation and available extras.
While developing SoaM we applied practices to ensure code quality and robust deployments, guaranteeing a baseline of code quality and automating the deployment process. By these we mean:
- Static analysis and type checks with tools such as mypy, and others.
- Docstring controls to ensure a minimum percentage of docstring coverage.
One of our core values at Mutt is that we are an Open Team. We all make mistakes and need help fixing them. We foster psychological safety. When we don’t know something we ask for advice.
We expect everyone contributing to SoaM to follow this principle. Be kind, don’t be rude, keep it friendly; learn, teach, ask and help.
Please don’t hesitate to contribute to our library if you feel like it; we encourage you to do so. Check our Contributing Guidelines to see how to make useful contributions.
Regarding plans and possible future releases, some of the things in our scope include:
- Poetry: check our blogpost on Why Should You Use Poetry to Manage Python Dependencies.
- Muttlib, our utils library: check our blogpost on Muttlib Public Release.
There is a lot of hard work behind this library and a big team responsible for it. We acknowledge the effort of all the data nerds behind this development:
It has been a great experience to build SoaM together as a team and we are super proud of the result.
Thanks to all of them!
Tags: time-series forecasting, anomaly detection