If you are reading this article, then you’ve probably heard of DevOps: a set of good practices with the goal of making the development and deployment of code more automated, robust and reliable.
Working in the field of DevOps means that you’ll spend most of your time building tools so that you and your team can build better software: continuous integration and continuous deployment pipelines that make deploying software both easy and reliable, or Terraform to define and then maintain infrastructure as you would with code.
But it’s not only about building tools: it’s also a change in paradigm. This means that DevOps is also about changing the culture of your workplace towards embracing the DevOps principles we’ve previously mentioned.
What about DataOps? Just as Data Engineering applies Software Engineering principles to the field of Data, DataOps applies the DevOps principles to Data. Of course, applying these principles to Data brings its own unique problems and solutions: for example, being reliable now means being able to re-run your data pipelines multiple times, and you might want to use statistical methods to ensure the final quality of your data.
DataOps means having governance over your data storage and processes, ensuring its quality, making sure it’s available to your users and being able to easily develop tools with it.
In this post we’ll tell you how you can start applying DataOps with some of our favourite tools: DBT for data management and governance, Airbyte for data extraction, Great Expectations for data quality checks and our very own SoaM for anomaly detection.
DataOps embraces agile development practices to allow organizations to easily adapt their project pipelines to changes in data, priorities, or requirements. The volume of data and the complexity of systems have increased in giant leaps over the past years, making DataOps necessary to sustain efficiency and reduce pipelines’ time-to-value.
The processes, methodologies, and tools that DataOps puts in place spare data experts from manual, time-consuming testing and from reproducing environments by hand, leaving them with time to focus on productive and strategic objectives instead of micro-decisions, error tracking, or troubleshooting fixes.
Reducing development times for data pipelines and working with an organized data flow that is less vulnerable to errors means not only delivering valuable insights, but doing so while the data is still relevant. With constantly changing markets and business contexts, data has to be captured, transformed and analyzed as fast as possible.
Standardizing data processes, introducing automation practices, validating and monitoring data throughout the pipeline, and planning for rollback capabilities in every stage of a data pipeline are best practices that reduce errors. Even if errors do manage to sneak through controls, discovering them and the extent of their damage is easier.
DataOps plays a fundamental role in keeping data consistent, reliable, and top quality. This is also key to sustaining good relationships with clients, as data experts respond to the issues and system failures that may arise.
Organizing, optimizing, and testing the data pipeline provides much more than efficiency and agility. DataOps provides a clearer picture of communication and collaboration in any organization as well as a clear map of data flow in a business.
DataOps can also facilitate knowledge transfer both internally and externally with stakeholders.
A math error carried forward in a formula: we’ve all been there. Frustratingly, the method makes sense and we’re on track with the process, but the end result just isn’t correct, leaving us with few points or none at all. School or college tests aside, the same concept applies when building data pipelines. Mistakes in one step can lead to bad values that then propagate downstream.
If only we could go back in time… Well, we sort of can. Rollbacks are essentially about rewinding and finding a version of our data pipeline where things were still working correctly. In layman’s terms, if everything breaks you get a second chance, a do-over by going back to the last date in your memory where things were running smoothly.
The challenge lies in designing an idempotent system from the ground up. These systems allow you to re-run processes multiple times, having the same effect every time.
The solution is twofold. The first part is preventive: part of being able to find errors and travel back in the pipeline is being organized and clear. We always want to make sure that we’ve taken the time to think about and carefully design our data lineage. The clearer our lineage is, the easier it will be to find out the extent of the pipeline that was affected by an error.
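To make the lineage point concrete, here is a minimal sketch of how a recorded lineage can be used to find everything downstream of a faulty step. The table names and the `LINEAGE` mapping are hypothetical, invented only for illustration; in a real project this information would come from whatever tool tracks your lineage.

```python
from collections import deque

# Hypothetical lineage: each key feeds the tables listed in its value.
LINEAGE = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["daily_sales", "customer_ltv"],
    "daily_sales": ["sales_dashboard"],
    "customer_ltv": [],
    "sales_dashboard": [],
}


def downstream_of(lineage, start):
    """Return every table reachable from `start`, in breadth-first order."""
    affected = []
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


# If stg_orders received bad values, these tables need to be reprocessed.
affected_tables = downstream_of(LINEAGE, "stg_orders")
print(affected_tables)
```

The clearer and more complete the mapping, the more trustworthy the list of affected tables becomes.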
But even if our lineage is top of the line, we also need to implement a standard logic that will allow us to delete and reprocess old data in order to resolve the issue or issues, bringing the clusters back to a clean state.
The bottom line is that systems should be thought out from day one to be as rollback-friendly as possible. This will reduce costs and increase efficiency in the future if issues arise.
If all is well, you should be able to identify errors and the extent of their damage efficiently, delete the faulty data and bring the pipeline back to a clean, functioning state.
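One common way to get this delete-and-reprocess behaviour is to make every load replace a whole partition (for example, one day of data) inside a single transaction. The sketch below is only an illustration under assumptions: it uses an in-memory SQLite database, and the `daily_sales` table and `reload_partition` helper are hypothetical names, not part of any specific tool.

```python
import sqlite3


def reload_partition(conn, day, rows):
    """Replace all rows for `day` in one transaction, so re-runs are idempotent."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM daily_sales WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_sales (day, store, amount) VALUES (?, ?, ?)",
            [(day, store, amount) for store, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, store TEXT, amount REAL)")

rows = [("north", 120.0), ("south", 80.5)]
reload_partition(conn, "2021-06-01", rows)
reload_partition(conn, "2021-06-01", rows)  # re-run: same final state, no duplicates

count = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
print(count)
```

Because the delete and the insert happen together, running the job twice, or re-running it after fixing a bug, leaves the table in the same state as running it once.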
The purpose of data governance is to sustain the highest possible level of quality of a company’s data throughout the entire data pipeline. Different practices are applied in order to protect consistency and security as well as to remain compliant with regulations. Clear procedures and consistent data management result in improved analytics, decision making, and optimization efforts.
Governance also addresses a common challenge: making sure the correct people have access to the correct data. When working with external temporary hires or recently hired employees, you probably won’t want them to have the same permissions as your Chief Technology Officer (CTO). However, you might want them to have easy access to specific data so they can make modifications or create views.
We recommend using DBT: a tool we believe incentivizes you to employ software engineering best practices to ensure data quality.
DBT simplifies data transformation visualization with Jinja-templated SQL queries. Basically, it’s a great democratization tool with a language used by virtually all users (devs, analysts, product owners, etc). Additionally, it also lends a helping hand with the following benefits:
A big part of keeping the data pipeline clean, agile and efficient has to do with data validation and monitoring. It’s fundamental to be able to detect data issues as soon as they happen, and not further down the line when making corrections is harder, costlier, and more time-consuming.
Luckily, there are a few tools and practices that make it easier to detect these issues quickly. We usually recommend the following:
Bad data leads to bad results, which in turn can lead to costly decisions. Chances are you have a good general knowledge of your industry and have some backed assumptions about what to expect of your data. Why not validate it against those assumptions before working with it? If your data smells bad, or if something is icky or out of place, you want to know.
Instead of using DBT’s built-in capabilities for this purpose, we opted for another tool, Great Expectations. We chose it because it provides a great deal of built-in tests, profiling and a detailed User Interface to explore the results of running the tests, all with a relatively simple configuration.
Great Expectations is an open-source tool that facilitates testing data pipelines and solving pipeline debt. In short, pipeline debt is what happens when pipelines remain untested, undocumented, or unstable due to bad practices.
An expectation is an affirmation regarding a specific quality or characteristic of a dataset. For example, in a dataset of house prices, an expectation could look something like this:
“We expect the values in a column, say X, to be between 100 sq feet and 200 sq feet 65% of the time.”
This is, of course, extremely summarized. The bottom line is that the tool validates new input data against our established expectations.
Instead of dealing with bad data after it has propagated, validating your expectations beforehand will save you tons of debugging time as well as weird or bug-infested results.
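To show what checking the example expectation above involves, here is a plain-Python sketch. It is not Great Expectations code: `expect_values_between` is a hypothetical helper that imitates the idea behind the library’s `expect_column_values_to_be_between` expectation and its `mostly` argument, and the sample values are made up.

```python
def expect_values_between(values, min_value, max_value, mostly=1.0):
    """Pass if at least `mostly` of the non-null values lie in [min_value, max_value]."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        # Nothing to validate; treat an empty column as passing in this sketch.
        return {"success": True, "observed_fraction": None}
    in_range = sum(min_value <= v <= max_value for v in non_null)
    fraction = in_range / len(non_null)
    return {"success": fraction >= mostly, "observed_fraction": fraction}


# Hypothetical surface areas in square feet, including one missing value.
areas = [120, 150, 180, 199, 90, 250, 130, 175, None, 160]

# "Between 100 and 200 sq feet at least 65% of the time."
result = expect_values_between(areas, 100, 200, mostly=0.65)
print(result["success"])
```

In a real pipeline the library takes care of the bookkeeping (profiling, result storage, documentation), but the core of each expectation is a check of this kind run against new input data.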
Data Quality Checks and Anomaly Detection are closely related. The difference is that common Data Quality Checks usually rely on fixed values, e.g. there should be more than 0 transactions yesterday, or yesterday’s total transacted value should be greater than the 5th percentile of the total daily transacted values of the last 10 days.
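The two fixed checks just mentioned can be sketched in a few lines of standard-library Python. The function names and the daily totals are hypothetical, and the percentile uses the interpolation that `statistics.quantiles` applies with `method="inclusive"`; other tools may compute it slightly differently.

```python
from statistics import quantiles


def check_has_transactions(transaction_count):
    """There should be more than 0 transactions yesterday."""
    return transaction_count > 0


def check_total_above_p5(yesterday_total, last_10_totals):
    """Yesterday's total should exceed the 5th percentile of the last 10 days."""
    # With n=20 the cut points are 5%, 10%, ..., 95%; the first one is the 5th percentile.
    p5 = quantiles(last_10_totals, n=20, method="inclusive")[0]
    return yesterday_total > p5


# Hypothetical total transacted value for each of the last 10 days.
last_10_totals = [1000, 950, 1100, 1050, 900, 1200, 980, 1020, 1150, 1080]

print(check_has_transactions(42))
print(check_total_above_p5(930, last_10_totals))
```

Note that the threshold here is the same regardless of which day is being checked, which is exactly the limitation discussed next.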
Using Anomaly Detection for Data Quality Checks means using an ML model that learns a KPI’s usual behaviour and extrapolates what a “normal value” is according to its history. Therefore, this value adjusts to day of week, hour of day, and other characteristics of the KPI. For example, if the last Tuesday of the month is usually the busiest in terms of sales, then the threshold used to validate the KPI in the Data Quality Check will be higher, since that is the pattern the ML model learned from the past.
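As a rough illustration of a threshold that adapts to the KPI’s history, the sketch below learns a separate mean and standard deviation for each day of the week and flags values that fall too far from them. This is a deliberately simple statistical baseline, not an ML model and not how SoaM works; the function names and the synthetic sales history are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean, stdev


def fit_weekday_baseline(history):
    """history: list of (date, value). Returns {weekday: (mean, stdev)}."""
    by_weekday = defaultdict(list)
    for day, value in history:
        by_weekday[day.weekday()].append(value)
    # stdev needs at least two observations per weekday.
    return {
        wd: (mean(vals), stdev(vals))
        for wd, vals in by_weekday.items()
        if len(vals) >= 2
    }


def is_anomaly(baseline, day, value, k=3.0):
    """Flag `value` if it is more than k standard deviations from that weekday's mean."""
    mu, sigma = baseline[day.weekday()]
    return abs(value - mu) > k * sigma


# Synthetic history: 8 weeks of daily sales where Tuesdays are about twice as busy.
start = date(2021, 1, 4)  # a Monday
history = []
for i in range(56):
    day = start + timedelta(days=i)
    base = 200 if day.weekday() == 1 else 100
    history.append((day, base + ((i // 7) % 3) * 5))

baseline = fit_weekday_baseline(history)
next_monday = start + timedelta(days=56)
next_tuesday = start + timedelta(days=57)

print(is_anomaly(baseline, next_tuesday, 205))  # typical for a Tuesday
print(is_anomaly(baseline, next_monday, 205))   # far too high for a Monday
```

The same value is accepted on one day and flagged on another, which a single fixed threshold cannot do.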
To quickly implement these kinds of data checks, one tool we highly recommend is SoaM (Son Of A Mutt), our very own in-house development. SoaM is a library designed by Mutt Data with the purpose of providing a “batteries included” framework for forecasting and anomaly detection pipelines.
In line with DataOps agile development methodologies, this library seeks to optimize our time and productivity when working with data, allowing for reusable code across different projects.
SoaM stands out from other solutions by providing an end-to-end framework for forecasting and anomaly detection problems, from the extraction of the data to the reporting of the results. At every stage, SoaM provides plug-and-play components that reduce development time.
Son Of A Mutt is built upon four main features:
Extraction and aggregation of data from any given SQL data source into a pandas DataFrame for easy manipulation.
Pre-Processing tasks such as any Scikit-Learn transformation –like the Min-Max Scaler– or even custom ones. Additionally, you can merge, concat, and slice your loaded pandas DataFrame with our built-in modules.
Forecasting models built in to easily fit and compare model performance.
Post-Processing features to persist forecast results, detect anomalies, backtest experiments, plot and report results.
Interested? Lucky for you we’ve dedicated an entire blog post to SoaM, its beginnings, its features, and uses including steps on how to use it. You can check it out here.
Results are only as robust as the data behind them. We’ve covered employing data quality checks and keeping a lookout for anomalies and errors; however, we should also try to simplify and standardize data ingestion.
Working with large datasets from a variety of different data sources can be tough. It implies lots of coding hours: setting up connections between different sources with different ways to query their data, dealing with specific formats, and tons of other small, yet time-consuming and draining, differences that lead to lots of grunt work.
The load of this grunt work falls on the Data Engineers responsible for creating and maintaining ETL procedures. Not only is it time-consuming to set them up, but every time users of their dataset find use cases that were not contemplated, the Data Engineers have to manually create a new ETL job.
These tools provide standardized connectors for a great variety of data sources such as PostgreSQL or MySQL relational databases, MongoDB collections, Github commits, Asana tasks, Zendesk tickets, Google Analytics and Facebook Ads, as well as many CRM and user tracking tools.
Our recommended practice is to trigger data extraction jobs using modern schedulers like Airflow. We usually try to standardize on using pre-existing connectors from Airbyte, or develop our own following their conventions.
Our solutions with these tools are ELT (Extract Load Transform) instead of ETL. Extracted data is instantly loaded as raw data into its destination, creating a Single Source of Truth. Because cloud services like AWS have made storage, compute and bandwidth incredibly affordable, raw data can be stored, allowing Data Analysts and BI Professionals to create any transformation they want. The best part is that they’ll be able to use tools they already know, like SQL, to transform this data! No need for a Data Engineer to create a new ETL procedure each time a new visualization is needed.
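The ELT idea can be sketched end to end with an in-memory SQLite database standing in for the warehouse. Everything here is hypothetical (the `raw_payments` table, the `daily_revenue` view and the sample records); the point is only that the raw data lands untouched and the transformation is plain SQL that an analyst could write.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land the records as they arrive, with every field kept as text.
conn.execute("CREATE TABLE raw_payments (id TEXT, paid_at TEXT, amount TEXT)")
raw_records = [
    ("1", "2021-06-01T10:00:00", "19.90"),
    ("2", "2021-06-01T15:30:00", "5.10"),
    ("3", "2021-06-02T09:15:00", "12.00"),
]
conn.executemany("INSERT INTO raw_payments VALUES (?, ?, ?)", raw_records)

# Transform: a SQL view on top of the raw table, created only when someone needs it.
conn.execute(
    """
    CREATE VIEW daily_revenue AS
    SELECT substr(paid_at, 1, 10) AS day,
           ROUND(SUM(CAST(amount AS REAL)), 2) AS revenue
    FROM raw_payments
    GROUP BY day
    ORDER BY day
    """
)

daily = conn.execute("SELECT day, revenue FROM daily_revenue").fetchall()
print(daily)
```

If a new visualization needs a different aggregation, it is a matter of adding another view over `raw_payments` rather than building a new extraction job.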
We started working with ClassDojo in late 2020. For those unfamiliar with their work, ClassDojo is a school communication platform that teachers, students, and families use every day to build tight-knit communities by sharing what is being learned in the classroom through photos, videos and messages.
ClassDojo was facing a challenging opponent: technical debt. Their data platform at the time had ETL issues, duplicated extraction processes, troubled data ingestion and a general lack of testability, with no monitoring in place on input and output tables.
This meant that tracking errors and identifying their impact was no easy task and, once errors were found, there were no automated rollback procedures in place to fix them in an agile manner.
In order to solve these problems, Mutt Data started by implementing Airflow to manage their data pipelines, replacing their previous workflow scheduler with something more robust and scalable and preparing the playing field for implementing the following DataOps best practices based on the Modern Data Stack:
Working with ClassDojo’s teams was an amazing experience. The implemented changes allowed them to leverage their data, improving their data platform and producing real business value. Continuously seeking to improve, ClassDojo is forming an analytics engineering team to focus on data architecture and pipelines. Working on their capabilities will open the playing field to a variety of machine learning implementation opportunities, which we look forward to helping with.
Still Curious? You’re in luck! You can also hear ClassDojo’s side of the story in their latest blog entry: “How We Built A DataOps Platform”.
Once you’ve implemented these different DataOps practices and tools you’re ready to go and, if we may say so ourselves, much more agile. But before we go ahead and wave the checkered flag to indicate the end of this post, here are some DataOps goals you should be aiming for. An efficient and properly designed data pipeline should:
Be as simple as possible, making it easy for non-data engineering developers to create new steps in analytics pipelines.
Be deployment-ready: new steps should be efficiently deployed without causing crashes or data issues.
Consider possible errors, implementing standardized recovery tools and methodologies. Redeployments should be taken into account from day one so that teams are able to fix issues, and know how to, as soon as they are detected.
Be agile: the causes of errors should be quickly traceable, be it to a DAG, a table or a transformation, all the way to the owner of that implementation.
Be preventive, employing tests on both data and pipelines to avoid harming existing functionalities.
Be quick to detect and report data issues or anomalies throughout the data pipeline.
Be well documented so that data is easy to discover and track.
At Mutt Data we have experience working on projects employing DataOps techniques and methodologies. Need a hand implementing these tools in your own use case? From rollback-friendly data pipelines to robust and easily scalable machine learning models we adapt to your business needs in order to develop the custom-built solution your company needs.
We hope you’ve found this post useful, and at least mildly entertaining. If you like what you’ve read so far, have some mad dev skills, and like applying machine learning to solve tough business challenges, send us your resume here or visit our Lever account for current team openings!