For both new and experienced data practitioners, managing data is not a trivial matter. Organizations increasingly rely on efficient data collection, storage, and access so that cross-functional teams across the business can use data effectively.
The management of these vast quantities of data residing in a central data lake can be compared to the experience of budding data practitioners in university housing who are dealing with the dreaded shared fridge situation.
One solution is a data mesh. The concepts of ‘domain-driven ownership of data’, ‘data as a product’, ‘self-serve data platforms’, and ‘federated computational governance’ have existed for a while, but data mesh brought them under one roof.
For those new to data mesh, we will explore how it can address the issues associated with using a centralized data lake.
Who took my food?
The lessons we can learn from those fighting for space on the middle shelf are comparable to the main pain points of centralized data lakes, particularly concerning data privacy issues.
Imagine a house occupied by a family of four. This represents a smaller organization. It has a single fridge, shared by all family members, to store groceries. All the family members have their own tastes and dietary requirements. When one person takes control and manages shopping and meal planning, everything runs smoothly.
Now imagine a house of 40 university students attempting to share a single fridge. Initially, they think their fridge, being ten times the size a family of four needs, will be ideal. However, as with large-scale data lakes, scaling up capacity while increasing the number of users comes with challenges.
Because no one monitors the fridge, it is now full of moldy food that no one takes responsibility for, making it less useful and more confusing than it should be. The same happens with data: if you put everything into your data lake without controls, it gradually turns into a ‘data swamp’ that most users avoid.
Whose sandwich is going green?
When food is left in the fridge for too long, it’s up to the housekeeper to determine what to throw out. The housekeeper in the field of data is often the data engineer.
However, this is not practical. Data engineers would need a thorough understanding of every domain and use case context. Even then, it would be hard for them to scale when there are concurrent requests from various use case teams.
To solve this, a data mesh dictates that each data product has a clear data product owner who drives the day-to-day execution. This involves defining what the data is, fixing quality issues, and dedicating time exclusively to the domain.
With a data product owner for each product, smaller teams can work in parallel instead of relying on a single, overworked team. The owner can also apply a set of controls, including frequent monitoring, observation, and validation to confirm that the data is high quality and available.
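As a rough illustration of what such owner-defined controls could look like, here is a minimal Python sketch. The product name, owner name, fields, and the `validate_product` helper are all hypothetical and not taken from any specific tool; the point is only that each product carries a named owner and a check that owner can run regularly.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    """Result of one validation run for a single data product."""
    product: str
    owner: str
    row_count: int
    missing_by_field: dict[str, int]

    @property
    def passed(self) -> bool:
        # Passes only if there is data and no required value is missing.
        return self.row_count > 0 and not any(self.missing_by_field.values())


def validate_product(product, owner, rows, required_fields):
    """Count rows and missing required values (hypothetical check)."""
    missing = {
        field: sum(1 for row in rows if row.get(field) in (None, ""))
        for field in required_fields
    }
    return QualityReport(product, owner, len(rows), missing)


# Illustrative data only: the second order has no customer attached.
rows = [
    {"order_id": 1, "customer_id": "c-17", "amount": 42.0},
    {"order_id": 2, "customer_id": None, "amount": 13.5},
]
report = validate_product(
    "orders", "sales-domain-team", rows, ["order_id", "customer_id"]
)
print(report.passed, report.missing_by_field)
```

Because the report names the owner, a failed check has an obvious person to go to, which is exactly what the shared fridge lacks.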
Finding the fresh food you need
Another problem with storing everyone’s food in one location is that it can be challenging to find the things you want. Imagine you’ve just started making your favorite meal and you’ve got all the right ingredients. That’s when disaster strikes: you realize the milk has gone off.
Similarly, as a business user, you may have encountered situations where you’ve spent hours trying to source a key piece of information, only to realize that the data is several months out of date.
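One way to catch the "milk has gone off" problem early is to record when a data product was last refreshed and compare that against an agreed maximum age. The sketch below assumes a hypothetical `is_stale` helper and made-up dates; a real platform would read the last-updated timestamp from its own metadata.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_updated, max_age, now=None):
    """Return True if the data is older than the agreed maximum age."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated > max_age


# Illustrative values: data refreshed in February, checked in June,
# with a hypothetical freshness agreement of 30 days.
checked_at = datetime(2024, 6, 1, tzinfo=timezone.utc)
last_refresh = datetime(2024, 2, 1, tzinfo=timezone.utc)

if is_stale(last_refresh, timedelta(days=30), now=checked_at):
    print("Warning: this data is out of date, contact the product owner")
```

Surfacing this warning next to the data saves the business user from discovering the problem hours into their analysis.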
Cooking as a team
Now imagine three people in the dorm decide to cook together. One chooses the recipe, one buys the food, and the other cooks. However, by not working in unison, the ingredients end up scattered around the kitchen, slowing the cook down.
It’s the same in a business setting. Missing metadata for your data products, or unclear responsibilities among team members (for instance, the data product owner, data steward, and data custodian), will lead to problems in communication, alignment, and productivity. This is why an enterprise needs a combination of data governance, a marketplace, and a data catalog to ensure responsibilities are clear and data is managed correctly.
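To make the idea concrete, here is a small sketch of a catalog entry that refuses to register a data product unless its description and the three roles mentioned above are filled in. The `DataProduct` and `Catalog` classes, and all names in the example, are assumptions made for illustration rather than the API of any real catalog product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProduct:
    """Minimal catalog metadata for one data product (hypothetical schema)."""
    name: str
    domain: str
    description: str
    owner: str
    steward: str
    custodian: str
    tags: tuple[str, ...] = ()


class Catalog:
    REQUIRED = ("description", "owner", "steward", "custodian")

    def __init__(self):
        self._products = {}

    def register(self, product):
        # Reject entries whose description or role assignments are empty.
        missing = [f for f in self.REQUIRED if not getattr(product, f)]
        if missing:
            raise ValueError(f"{product.name}: missing {', '.join(missing)}")
        self._products[product.name] = product

    def search(self, keyword):
        # Match on name, description, or an exact tag.
        keyword = keyword.lower()
        return [
            p for p in self._products.values()
            if keyword in p.name.lower()
            or keyword in p.description.lower()
            or keyword in p.tags
        ]


catalog = Catalog()
catalog.register(DataProduct(
    name="orders",
    domain="sales",
    description="One row per confirmed customer order",
    owner="priya",
    steward="tom",
    custodian="platform-team",
    tags=("finance", "daily"),
))
print([p.name for p in catalog.search("order")])
```

The validation in `register` is the cooking-as-a-team rule in code: nobody starts until it is clear who chose the recipe, who bought the food, and who is cooking.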
With so many people living together, everyone will have their preferences on what they want to eat. However, there will also be opportunities to work together; the flexitarian can cook with the vegetarian on some days, and the person who only eats takeaway food will benefit from a dose of spinach.
This mirrors a business with its various departments. From marketing and sales to HR and finance, each division wants to manage and protect its own sensitive data but also be able to access shared information from across the business. Managing this process is crucial to performance, as each team needs a balance between autonomy and collaboration.
Some data is sensitive and can also be more vulnerable to theft. Depending on the task or project, you may want to grant read access only to a subset of individuals.
Like keeping food in the fridge of a shared dormitory, storing all the company data in one central location is likely to cause issues. It’s difficult to keep track of what data is present in the data lake, who it belongs to and who should have access to it.
In business, possible solutions include role-based access control (RBAC), data sensitivity tagging, data masking and anonymization. In brief, here’s what each of these means:
- RBAC, also known as role-based security, is a mechanism that restricts system access. Permissions and privileges are assigned to roles, and users gain access through the roles they hold, so only authorized users can reach the data.
- Data sensitivity tagging enables you to classify your organization’s data in a way that shows how sensitive the data is. This helps you reduce the risk of sharing information that should not be accessible outside the organization.
- Data masking is a way to create a fake but realistic version of your organizational data. The goal is to protect sensitive data while providing a functional alternative when real data is not needed, for example, in user training, sales demos, or software testing.
- Anonymization is the process of protecting private or sensitive information by erasing or encrypting identifiers that connect an individual to stored data. For example, you can run personally identifiable information such as names, social security numbers, and addresses through a data anonymization process that retains the data but keeps the source anonymous.
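The four techniques above can work together. The following Python sketch is a simplified illustration only: the column tags, roles, users, and helper functions are all hypothetical. Note two substitutions: the masking shown is partial masking (hiding all but the last few characters) rather than generating realistic fake values, and the keyed hash replaces identifiers with stable tokens, which is strictly pseudonymization rather than full anonymization.

```python
import hashlib
import hmac

# Data sensitivity tagging: each column carries a tag (hypothetical values).
COLUMN_TAGS = {"name": "confidential", "ssn": "restricted", "city": "internal"}

# RBAC: clearances attach to roles, and users hold roles.
ROLE_CLEARANCE = {
    "analyst": {"internal"},
    "hr_admin": {"internal", "confidential", "restricted"},
}
USER_ROLES = {"dana": {"analyst"}, "lee": {"hr_admin"}}


def allowed_tags(user):
    """Union of the sensitivity tags granted by all of the user's roles."""
    tags = set()
    for role in USER_ROLES.get(user, ()):
        tags |= ROLE_CLEARANCE.get(role, set())
    return tags


def mask(value, visible=4):
    """Partial masking: hide letters and digits except the last few."""
    text = str(value)
    head, tail = text[:-visible], text[-visible:]
    return "".join("X" if ch.isalnum() else ch for ch in head) + tail


def pseudonymize(value, key):
    """Replace an identifier with a keyed hash (stable, not reversible)."""
    return hmac.new(key, str(value).encode(), hashlib.sha256).hexdigest()[:16]


def read_row(user, row):
    """Return the row with columns the user is not cleared for masked."""
    allowed = allowed_tags(user)
    return {
        # Untagged columns default to the most restrictive tag.
        col: val if COLUMN_TAGS.get(col, "restricted") in allowed else mask(val)
        for col, val in row.items()
    }


row = {"name": "Ada Lovelace", "ssn": "123-45-6789", "city": "London"}
print(read_row("dana", row))  # name and ssn are masked for the analyst
print(read_row("lee", row))   # the HR admin sees the original values
print(pseudonymize(row["ssn"], key=b"demo-key-not-for-production"))
```

Treating untagged columns as restricted is a deliberate choice in this sketch: forgetting to tag a column then hides data instead of leaking it.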
Many organizations have invested in a central data lake and a data team, expecting to drive their business based on data. In practice, this central system can leave organizations struggling to enable and empower their employees to make the most informed and timely decisions possible. Centralized data platform architectures fail to deliver insights with the speed and flexibility that scaling organizations need. Data mesh serves as one of many solutions to these problems.
Special contributors include: Bruce Philp (Partner), Alex Arutyunyants (Senior Principal Data Engineer), Saravanakumar Subramaniam (Senior Principal Data Engineer), Mathieu Dumoulin (Senior Expert), Maheshwar Muralidharan (Specialist), and Danny Farah (Senior Data Engineer).