The term "Data Stack" comes from "technology stack", meaning the combination of technologies used to make data useful and palatable to all parties involved. It mainly supports the storage, management, and access of data.
Every century has a precious new resource that fuels progress. Data is the 21st century's oil. The amount of data we can gather nowadays is a) unparalleled by anything we have seen before and b) extremely useful for catering to all kinds of business needs.
Have you ever felt like your phone listens to you because there is no way it can know your need for the product or service that just popped in front of your eyes? Well, it isn’t a spy. It’s well-used data.
Data in itself is not really that useful; think of it as a bunch of random ingredients. You will need much more than that to run a restaurant. That's why Data Stacks provide:
Warehousing: Technologies that allow all data to be stored in one place. The pantry for your ingredients.
Loading: The processes responsible for extracting data from the source systems and loading it into the warehousing system. In a way, these would be your ingredient providers.
Transforming: This stage is what makes data useful for consumption. Your chefs. They take raw data and transform it into something palatable.
Analytics: This is where you can start to derive meaningful insight from the process. Basically, serving it to the users. Bon appetit!
The whole process provides a main dish crafted to your taste and needs, aka, what is best for your business.
Data Stack is a generic term that is often used to signify a set of technologies used to build the analytical system of a company.
And they’ve always been around.
But now we also have the Modern Data Stack, which improves on the classic Data Stack in several key ways.
Before roughly 2012, data handling was an expensive and slow process. It also had some other problems:
On-Premises Storage meant two things: you had little space, and it was expensive. You had to be very careful with what you stored, because you couldn't waste money or space on useless data. Nowadays, though, we know every little fact is useful; it just has to be put in context and "cooked" correctly. These limitations were severe: it was like running a very small kitchen with barely any pantry space, forced to serve only tomato soup.
ETL (Extract, Transform, Load): extract data from external sources, transform it into a format suitable for storage and later insights, and load it into the database. This made sense when a small group (like IT) controlled all access to the organization's data.
Keeping with the gastronomical allegory: in the ETL method, your chefs had to gather the (limited) ingredients, chop them finely, and store them in your (small) pantry.
There is nothing wrong with this method; in fact, ETLs are still relevant and useful today. It just depends on what your business needs are.
For example, say the only purpose of your analytical system is to produce a report with the number of sales per day. Then the most logical and efficient approach is an ETL: extract the data from the CRM system and load just the daily sales totals.
This takes the least amount of space in your storage: an excellent method when computers and storage systems are really expensive.
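To make the pattern concrete, here is a minimal sketch of that daily-sales ETL in Python, using an in-memory SQLite database as a stand-in for the warehouse. The CRM rows, table, and column names are invented for illustration; the point is that the transformation happens *before* loading, so the warehouse only ever stores the small daily summary.

```python
import sqlite3

# -- Extract: raw sales pulled from the (simulated) CRM system --
crm_sales = [
    ("2023-05-01", "order-1", 120.0),
    ("2023-05-01", "order-2", 80.0),
    ("2023-05-02", "order-3", 200.0),
]

# -- Transform: reduce raw rows to one row per day, outside the warehouse --
daily_totals = {}
for day, _order_id, amount in crm_sales:
    daily_totals[day] = daily_totals.get(day, 0.0) + amount

# -- Load: only the compact summary ever reaches storage --
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales_per_day (day TEXT PRIMARY KEY, total REAL)")
warehouse.executemany(
    "INSERT INTO sales_per_day VALUES (?, ?)", sorted(daily_totals.items())
)

for row in warehouse.execute("SELECT day, total FROM sales_per_day ORDER BY day"):
    print(row)
```

Notice the trade-off: the raw order rows are discarded after the transform, which is exactly why a new question later requires a new pipeline.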
So, if your restaurant runs well with a limited menu, this will work perfectly for you. However, if you want to expand your database to know, say, the number of sales per day for each branch and customer, you'll need to write a new ETL that creates and updates a table with those characteristics.
What does this mean?
Using ETLs to store just the minimum amount of data takes fewer resources (and less money), but it limits the analysts' freedom unless a Data Engineer constantly develops new ETLs to supply the data they need.
If you wanted to pull data from the warehouse, you had to wait until someone who knew SQL decided to help you. Again: there is only one small ladder in the kitchen, and you can't use it to reach your pantry while someone else is using it.
The greatest weakness of ETLs and on-premises storage? They are not that simple to scale up. There is no problem with having a small pantry in your restaurant when you have few diners, but as soon as your business grows, you’ll be in dire need of expanding your storeroom.
That was the core problem with the old ways: the lack of scalability of the tools and software used.
Once your initial solution wasn't enough, rescaling was a very slow and expensive process: a manual job that required software engineers to work through the issues. And since to err is human, the result might still be faulty after all that crunching.
With the introduction of new technologies and innovations we were able to enter a new era of data handling.
Companies like Amazon, Google, and Microsoft now allow organizations to store all their data in one place, much more cheaply than before. Suddenly, your pantry and kitchen space doubled and tripled for pennies.
Remember when you had a small team controlling the organization's data? Well, now there are far more teams and users that need it. A small group (of chefs) can't process all the data and serve it up to them. Besides, using the same storage format for everything isn't efficient anymore. (You are not just making tomato soup now, stop trying to chop all the ingredients before storing them!)
An ELT process loads data into the warehouse before it is transformed, then lets business users transform it themselves as efficiently as they want. (No more chopped cheese: you can store it whole now and grate it later.)
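Here is the same toy scenario reworked as ELT, again sketched with an in-memory SQLite database standing in for the warehouse (table and column names are invented for illustration). The raw rows land in storage untouched, and each new question becomes a query rather than a new pipeline.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")

# -- Extract + Load: raw sales land in the warehouse with no preprocessing --
warehouse.execute("CREATE TABLE raw_sales (day TEXT, branch TEXT, amount REAL)")
warehouse.executemany(
    "INSERT INTO raw_sales VALUES (?, ?, ?)",
    [
        ("2023-05-01", "north", 120.0),
        ("2023-05-01", "south", 80.0),
        ("2023-05-02", "north", 200.0),
    ],
)

# -- Transform (later, on demand): any question becomes a query --
per_day = warehouse.execute(
    "SELECT day, SUM(amount) FROM raw_sales GROUP BY day ORDER BY day"
).fetchall()
per_branch = warehouse.execute(
    "SELECT branch, SUM(amount) FROM raw_sales GROUP BY branch ORDER BY branch"
).fetchall()

print(per_day)     # daily totals
print(per_branch)  # branch totals, derived from the same raw table
```

Because the raw table keeps everything, the per-branch breakdown cost nothing extra to answer; under ETL it would have required a whole new pipeline.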
The main advantages of a Data Stack like this are:
Reduced time: No more wasting time in the staging area of the warehouse. As we said before, scalability has been the motor behind most Data Warehouse improvements; no longer relying on elbow grease has made data handling much faster and more useful.
It is also important to choose wisely which kind of data storage is best for your company, the main options being Data Warehouses and Data Lakes.
Data Lakes are ponds, no pun intended, of unfiltered and raw data.
What's the difference between the two, you might ask? A Data Warehouse stores structured, already-processed data optimized for querying, while a Data Lake stores raw data in its native format: cheaper and more flexible, but requiring processing whenever you want to use it.
Again, the choice depends a lot on the kind of business you are running: it might be better for you to have faster yet unfiltered data, or a slower yet more organized flow of data.
Democratic usability: Everyone can transform data now. (Before, you had to know Java, Python, Scala, etc.)
Cost-effective: Thanks to the elasticity of cloud service providers, an ELT process can scale up or down according to the needs of each organization. (ETL was only affordable for big companies.)
Better Analytics: You can use the context better suited for your organization when analyzing data. This will give you better results.
Self-service: We talked about how data suddenly became available to everyone in your organization. This also leads to the democratization of analysis itself.
Now you can do as you please with all the data you have collected. Everything leads to this: the transformation of your data.
Let’s go back to the kitchen. You finally have a huge kitchen with immense pantries filled to the brim with ingredients and a full staff. Well, now you have to cook the dish and hope for the best.
The same happens with data: there is no single recipe that fits all needs. Each analysis demands different preparation steps and different quality standards. Once your data is prepared, it becomes the foundation of your analyses, predictions, and insights.
If done correctly, your data can become a five-star dish.
Think of your data stack as a layered cake. Not every layer has to be covered or be the same but it all stacks up to build a delicious dessert.
For example, you might be a small company that just needs to connect its data source to an analytics tool. No need for the rest of the transformation tools, but they help.
Data sources: Where does your data come from? Probably from your own production database, the logs of a web server, or a third-party application. You can (and should) have multiple data sources flowing into a central data storage.
Data integration: How data gets from place to place, i.e., from your data sources to your data storage. We recommend using Airbyte.
Data storage: This is where all the data from the data sources is collected and stored. Most of the time this means a data warehouse, but in some companies it can be a read-only replica of your database.
Data analytics: Where you can see the insights in the data. This is usually called data visualization, though it involves other representations too, such as dashboards and other monitoring tools. We recommend using Superset.
Data operationalization: Sometimes called "Reverse ETL", because it is the process of moving data from the warehouse back into third-party systems to make data operational.
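A minimal sketch of what Reverse ETL does, with an in-memory SQLite database standing in for the warehouse and a stub class standing in for a real third-party API; all names here are invented for illustration.

```python
import sqlite3

# -- The warehouse holds a modeled metric (e.g. customer lifetime value) --
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customer_ltv (customer TEXT, lifetime_value REAL)")
warehouse.executemany(
    "INSERT INTO customer_ltv VALUES (?, ?)",
    [("alice", 540.0), ("bob", 120.0)],
)

class FakeCRM:
    """Stand-in for a third-party operational system (e.g. a CRM's REST API)."""
    def __init__(self):
        self.records = {}

    def update_customer(self, customer, fields):
        self.records.setdefault(customer, {}).update(fields)

# -- Reverse ETL: read modeled data out of the warehouse and push it back
#    into the operational tool where business teams actually work --
crm = FakeCRM()
for customer, ltv in warehouse.execute(
    "SELECT customer, lifetime_value FROM customer_ltv"
):
    crm.update_customer(customer, {"lifetime_value": ltv})

print(crm.records)
```

In a real setup the `FakeCRM` stub would be replaced by calls to the third party's actual API, typically handled by a dedicated Reverse ETL tool rather than hand-written loops.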
To build a healthy and robust Data Stack, you’ll have to think of the right tool for each of the components we’ve mentioned before. You’ll probably have to choose a cloud services provider to scale each of these components to your needs. Thankfully, you can take care of storage with them too since it’s quite cheap.
Some of our favorite tools for building a modern Data Stack are Apache Airflow, dbt, and Airbyte.
But making a sturdy, cost-effective and scalable Data Stack might not be so easy. You’ll need a team of experienced Data Engineers, Cloud Architects and DevOps experts.
Our team of experts at MUTT DATA are experienced in building Data Stacks, and we always make sure that they fit the needs of our clients. For example, you can check out the latest DataOps solution we’ve built for ClassDojo.
We have discussed how you can build a Modern Data Stack, and how it can both decrease your costs and let more people make use of your data more quickly. Cloud computing, cloud storage, and modern data tools have helped solve the old scalability problem, but there are still some challenges that could change the Data Stacks we know today:
Data Governance: Now everyone can create, access, and transform data for their analyses. But that can also create chaos, with people accessing data they shouldn't or creating duplicate data. Big Data doesn't mean Bad Data! Some tools that can help with this are Airbnb's Dataportal and Netflix's Metacat.
Real-time data: the endless possibilities of real-time data are still unexplored by many. From real-time data analysis to process automation, having real-time data in all your systems might make a huge quality difference.
Data exploration for all: With a Modern Data Stack you just need to know SQL; no need for a full programming language. But we could take it a step further and make data exploration even more accessible to anyone. There are lots of BI tools around, but none has the same power as SQL yet.
We hope that after reading this post you've seen how the old Data Stack we used to know has evolved, and come to appreciate the multiple possibilities a Modern Data Stack can bring to your business.
Are you interested in building your own modern Data Stack? Or maybe take the one you already have to the next level? We at MUTT DATA can help you, contact us here!
Are you a Data Dev looking for a challenge? Come build the future of Data Stacks with us, we’re hiring! We’re always looking for curious Data Engineers, Data Scientists and Machine Learning Engineers to join our growing team. Apply here!
For further reading, you can find more information here: