Unveiling Data Mesh architecture principles.

In the world of Big Data, it is necessary to choose the right architecture to allow for effective data management, scaling, and resource optimization. In recent years, Data Mesh architecture has become one of the leading concepts for Big Data. I hope that this article will help organizations looking for optimal data management solutions understand what Data Mesh is and decide whether it is right for them.

1. Evolution of data architecture
2. Data Lake
3. Data Lakehouse
4. Why Was the Data Mesh Concept Created?
5. Data Mesh Principles Explained
6. Data Mesh benefits
7. Data Mesh challenges
8. Data Mesh – is it for me?
9. Implementing Data Mesh strategy – how to get started?
10. Data Mesh Implementation – summary

Evolution of data architecture

A lot has changed in the world of digital solutions in the context of data analysis. However, from a historical point of view, one thing is still the same. In order not to disrupt the operation of systems or applications during the analysis, the operational data is copied. What is constantly evolving is the target form of analytical data.

Initially (at the end of the last century), the destination was a data warehouse powered by ETL processes. The data was extracted (Extract) from the source systems, transformed into the target form (Transform), and loaded (Load) into the target destination – most often to fact tables and dimension tables forming a structure resembling a snowflake, optimized for quick reading and conducting specific analyses.

Data Warehouse Challenges

However, this approach caused some problems. When creating a warehouse, it was necessary to determine in advance the requirements for the analyses that you wanted to conduct. ETL processes that transformed data cut out some of the attributes from the sources, so some of the information that could have been obtained was lost. Time-to-market was yet another issue. Firstly, it took a long time to get from the idea for an analysis to its implementation. Secondly, the data loading process itself often took a relatively long time, so the information was available with a delay. Scaling the solution was also problematic.

Data Lake

The above problems, the growing scale of data, the cheaper storage space, and the growing popularity of cloud solutions led to the creation of a new type of architecture – the Data Lake.

In this case, we store the data in a raw, often unstructured form, and transform it later for each target analytical solution. Thus, the ETL process turned into an ELT one. This approach means that it is no longer necessary to know the requirements in advance, and all attributes from the source data remain available.
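The difference between ETL and ELT can be illustrated with a minimal sketch. The records, the `revenue_per_customer` helper, and the dictionaries standing in for a warehouse and a lake are all hypothetical and only show where the transformation happens and what information survives loading.

```python
# Hypothetical raw records from an operational system (illustrative data only).
raw_orders = [
    {"order_id": 1, "customer": "A", "amount": 120.0, "channel": "web"},
    {"order_id": 2, "customer": "B", "amount": 80.0, "channel": "store"},
    {"order_id": 3, "customer": "A", "amount": 50.0, "channel": "web"},
]


def revenue_per_customer(rows):
    """Transform step: sum order amounts per customer."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


# ETL: transform first, then load only the result.
# The "channel" attribute never reaches the warehouse.
warehouse = {"revenue_per_customer": revenue_per_customer(raw_orders)}

# ELT: load the raw records unchanged and transform them later, on demand.
lake = {"orders_raw": list(raw_orders)}
elt_revenue = revenue_per_customer(lake["orders_raw"])

# A question nobody planned for ("how many orders per channel?") can still
# be answered from the lake, because no attribute was dropped at load time.
orders_per_channel = {}
for row in lake["orders_raw"]:
    orders_per_channel[row["channel"]] = orders_per_channel.get(row["channel"], 0) + 1
```

Both paths produce the same revenue figures, but only the ELT path keeps the raw attributes needed for the later per-channel question.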

Data Lake Challenges

However, the Data Lake architecture also has its drawbacks; for example, the complexity of data management, security problems, data standardization, and data quality issues. Without effective classification and access management, the data lake often turns into a “data swamp”, where we have the data, but nobody knows where to find it, whether or not it is complete and who is responsible for it.

Data Lakehouse

A Data Lakehouse is a data architecture concept that combines the advantages of both the Data Lake and the Data Warehouse. At the same time, this approach allows us to address some of the disadvantages of both of them. Without losing raw data, we try to simplify data management by using data structures such as tables or views. This facilitates data management and improves security through features such as column-level or row-level access controls.
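As a rough illustration of what column-level and row-level access controls mean, the sketch below filters an in-memory table. The `read_with_policy` function and the sample data are assumptions made for this example; real lakehouse platforms enforce such rules inside the query engine rather than in application code.

```python
def read_with_policy(rows, visible_columns, row_predicate):
    """Return only permitted rows, projected onto permitted columns."""
    return [
        {col: row[col] for col in visible_columns if col in row}
        for row in rows
        if row_predicate(row)
    ]


# Hypothetical customer table (illustrative data only).
customers = [
    {"customer_id": "C-1", "country": "PL", "email": "a@example.com"},
    {"customer_id": "C-2", "country": "DE", "email": "b@example.com"},
]

# An analyst who may see only Polish customers, and never the email column.
analyst_view = read_with_policy(
    customers,
    visible_columns=["customer_id", "country"],
    row_predicate=lambda row: row["country"] == "PL",
)
```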

In addition, the Data Lakehouse structures help to comply with data privacy regulations. With it, we can also standardize data processes and monitor the quality of data. There are also tools available to obtain ACID transactions, indexing, or caching of data, which means that we use a Data Lake as a Data Warehouse, keeping the advantages of both solutions.

Data Catalog

Regardless of the approach, we are still faced with the issue of the 3 Vs (Volume, Variety and Velocity) as the amount of available data – both in terms of volume and diversity – is growing very quickly.

If we want to keep control over all this, it becomes necessary to use a “data catalog”, a central location where information about the available data, its sources, structures, formats, meaning, and availability is stored. A data catalog allows users to easily search and discover datasets available in the organization. Thanks to the data catalog, users can quickly find the information they need and understand its context. In addition to the documentation role, these tools often also allow you to manage data access.
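To make the idea of a data catalog more concrete, here is a minimal in-memory sketch. The `CatalogEntry` and `DataCatalog` classes and the registered datasets are hypothetical; real catalog tools add lineage, access management, and persistent storage on top of this basic register-and-search behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    owner: str
    description: str
    data_format: str
    tags: list = field(default_factory=list)


class DataCatalog:
    """A central register of dataset metadata with keyword search."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, keyword):
        """Return names of datasets whose name, description or tags match."""
        keyword = keyword.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if keyword in e.name.lower()
            or keyword in e.description.lower()
            or any(keyword in tag.lower() for tag in e.tags)
        )


catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="crm.customers",
    owner="crm-team",
    description="Customer profiles and segments",
    data_format="parquet",
    tags=["customer", "pii"],
))
catalog.register(CatalogEntry(
    name="billing.invoices",
    owner="billing-team",
    description="Issued invoices with amounts per customer",
    data_format="parquet",
    tags=["finance"],
))

customer_datasets = catalog.search("customer")
```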

Revolutionizing Data Management with Data Mesh. Why Was the Data Mesh Concept Created?

What other problems may we encounter? Many other questions arise, such as:

Doesn’t one central data platform resemble a well-known monolithic system?
With the scale growing, does the data platform team have the time to integrate new data, or do they spend most of their time repairing the flows related to changes in existing data sources?
How well do these people know the source data and will they be able to model it well?
Who knows this data best?

Data Mesh Concept

The answer to the above challenges may be the Data Mesh architecture, an approach that was proposed by Zhamak Dehghani. Data Mesh assumes the decentralization of data management and treating data as a product. It addresses the challenges of managing data in large organizations, such as complexity, lack of scalability, and difficulties in maintaining data quality.

The approach is similar to decomposing a monolith into microservices. Data Mesh is based on a combination of three concepts (Product Thinking, Platform Thinking, and Domain-Driven Design) and applies them to analytical data.

Data Mesh definition

The concept’s author defines Data Mesh as

“A decentralized sociotechnical approach in managing and accessing data at scale”.

This tells us that it is not only about technical issues, where we scale or optimize machines, but also changing the approach of people in the organization. As a result, we want to be sure that if the organization’s complexity grows, we will still be able to quickly obtain value from the data.

It is worth emphasizing that Data Mesh is a concept unrelated to any specific technology and can be implemented in many different ways.

Data Mesh Principles Explained

Data Domain Ownership – decentralized and distributed responsibility for the data
Data as a Product
Self-Service Data Platform
Federated Data Governance

Data Domain Ownership, or the decentralization of analytical data

This principle refers to the decomposition of analytical data into business domains and, more importantly, shifting responsibility for them to domain teams. Consequently, people from a given business area have the responsibility for and ownership of analytical data from their domain, because they understand this data best.

Some data domains are based on operational data from the systems that produce the data, but others enrich their data with information from other domains. There can also be domains which combine and transform data (e.g. aggregate it) from other domains only.

Thanks to this decomposition, one central data team is no longer a bottleneck.

Data as a Product Principle

In the Data Mesh approach, data is treated as a product that we provide to (share with) others. This product has a product owner who is responsible for the quality, availability and security of the data. There is demand for the product outside the domain and there are “customers” interested in it – that is, recipients of our analytical data.

For example, a CRM domain has all customer information (their profile, change history, segment, etc.) and can share this information as a data product with the rest of the organization. This data can then be used by analysts to create dashboards for management, by marketing specialists to prepare campaigns, or as input to controlling reports.

Domain Data Team

The domain team addresses the needs of other domains by providing high-quality, well-documented, and reliable data. Information on products throughout the organization should be easily accessible so that anyone who needs them can find them quickly.

Contract

A data product is made available through one or more output ports that implement the data contract.

The data contract is a document defining the structure, format, semantics, quality and terms of use of data between a data provider (a given domain) and “customers” (other domains).

The contract is a communication tool that enables a common understanding of the structure and interpretation of data. Thanks to their structure, contracts can also serve as a basis for code generation, testing, schema validation, quality control, monitoring and access control.
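A data contract used for schema validation could look roughly like the sketch below. The `FieldSpec` and `DataContract` classes, the field names, and the `crm.customers` product are assumptions made for illustration; in practice contracts are often written in a declarative format such as YAML and cover semantics, quality, and terms of use as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str
    dtype: type
    required: bool = True


@dataclass(frozen=True)
class DataContract:
    product: str
    owner: str
    fields: tuple

    def validate(self, record):
        """Return a list of violations; an empty list means the record conforms."""
        errors = []
        for spec in self.fields:
            value = record.get(spec.name)
            if value is None:
                if spec.required:
                    errors.append(f"missing required field: {spec.name}")
                continue
            if not isinstance(value, spec.dtype):
                errors.append(f"{spec.name}: expected {spec.dtype.__name__}")
        return errors


# Hypothetical contract for a CRM data product.
contract = DataContract(
    product="crm.customers",
    owner="crm-team",
    fields=(
        FieldSpec("customer_id", str),
        FieldSpec("segment", str),
        FieldSpec("churn_score", float, required=False),
    ),
)

good_record = {"customer_id": "C-1", "segment": "retail"}
bad_record = {"customer_id": 42}

good_errors = contract.validate(good_record)
bad_errors = contract.validate(bad_record)
```

The producing domain could run such a check before publishing data, and consumers could run it in their tests, so both sides rely on the same definition.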

Self-Service Data Platform

Domain teams need an infrastructure and technical tools to create and share analytical data. That is why we need a shared data platform that can be used for all domains but respects their autonomy. It is known as a self-service platform because domain teams should be able to use it on their own to avoid creating another bottleneck.

By not implementing the principle of the self-service data platform, we could find ourselves in a situation in which each domain would build their own data platform.

Therefore, we should have a dedicated data platform team that provides domain-independent features, tools, and systems that enable all domains to create, run, and maintain interoperable data products.

Data platform

The data platform should make it possible to:

Build an analytical data model
Acquire and store data
Query the data (preferably combining data from different products) and create visualizations
Monitor data quality
Control access
Maintain common standards, policies or regulations (often automating these)
Make product information public so that other teams could discover it
Document product access and usage across domains

It can be very useful to create templates that can be applied by new teams creating their “data products”. This will save time and money.
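One possible shape for such a template is a descriptor with platform-wide defaults that each new team fills in. The field names, default values, and the `new_data_product` helper below are hypothetical and only sketch the idea.

```python
import copy

# Hypothetical platform-wide defaults for every new data product.
DATA_PRODUCT_TEMPLATE = {
    "name": None,
    "domain": None,
    "owner": None,
    "output_ports": [],
    "quality_checks": ["not_null_primary_key"],
    "retention_days": 365,
}


def new_data_product(name, domain, owner, **overrides):
    """Create a descriptor from the template, allowing only known overrides."""
    unknown = set(overrides) - set(DATA_PRODUCT_TEMPLATE)
    if unknown:
        raise KeyError(f"unknown template fields: {sorted(unknown)}")
    # Deep copy so that lists in the template are never shared between products.
    product = copy.deepcopy(DATA_PRODUCT_TEMPLATE)
    product.update(name=name, domain=domain, owner=owner)
    product.update(overrides)
    return product


customers_product = new_data_product(
    "customers", "crm", "crm-team", retention_days=90
)
```

Rejecting unknown fields keeps descriptors uniform across domains, which makes them easier to validate and publish automatically.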

Federated Governance

This principle is about establishing common standards, policies and rules (regarding e.g. modelling, data quality, security, and documentation) that define how domain teams are to create and share their products.

Often such standards are created by a team consisting of representatives of different domains. Unified interoperability rules that allow other domain teams to use data products are key. It is about determining:

a uniform way of accessing data
data exchange formats
global identifiers to combine data from different domains
the form of documentation describing the products
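The value of global identifiers can be shown with a small sketch in which two domains agree on a shared `customer_id`. The data, the field names, and the two products are hypothetical.

```python
# Hypothetical outputs of two data products owned by different domains.
crm_customers = [
    {"customer_id": "C-1", "segment": "retail"},
    {"customer_id": "C-2", "segment": "corporate"},
]
billing_invoices = [
    {"invoice_id": "I-10", "customer_id": "C-1", "amount": 120.0},
    {"invoice_id": "I-11", "customer_id": "C-1", "amount": 30.0},
    {"invoice_id": "I-12", "customer_id": "C-2", "amount": 900.0},
]

# Because both domains use the same customer_id, a consumer can join
# the products without any mapping table between local identifiers.
segment_by_customer = {c["customer_id"]: c["segment"] for c in crm_customers}

revenue_by_segment = {}
for invoice in billing_invoices:
    segment = segment_by_customer.get(invoice["customer_id"], "unknown")
    revenue_by_segment[segment] = revenue_by_segment.get(segment, 0.0) + invoice["amount"]
```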

It is also important to establish a common approach to implementing and enforcing the established rules and policies (e.g., in the context of using personal data or data retention).

A very important role of the data platform is to automate the implementation of the above arrangements as much as possible, so that the platform itself allows for their application.
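Automating governance rules is sometimes described as "policy as code". The sketch below assumes a hypothetical product descriptor and three example policies; the field names and the `check_policies` function are illustrative and not part of any specific platform.

```python
def check_policies(descriptor):
    """Return a list of policy violations for a data product descriptor."""
    violations = []
    if not descriptor.get("owner"):
        violations.append("owner is required")
    if not descriptor.get("documentation_url"):
        violations.append("documentation is required")
    if descriptor.get("contains_personal_data") and descriptor.get("retention_days") is None:
        violations.append("personal data requires a retention period")
    return violations


# Hypothetical descriptors submitted by two domain teams.
compliant = {
    "name": "crm.customers",
    "owner": "crm-team",
    "documentation_url": "https://example.com/docs/crm-customers",
    "contains_personal_data": True,
    "retention_days": 365,
}
non_compliant = {
    "name": "web.clickstream",
    "owner": "",
    "contains_personal_data": True,
}

compliant_result = check_policies(compliant)
non_compliant_result = check_policies(non_compliant)
```

A platform could run such checks automatically whenever a product is published, so that compliance does not depend on manual review.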

In conclusion, the most important principle is the division into domains. It allows a lot of work to be done in parallel and lets the organization keep scaling despite its growing complexity.

Wherever we distribute ownership and responsibility, there is a risk of chaos or anarchy, which applying the other three principles allows you to avoid. Thanks to them, it is possible to share data between domains, to cooperate, and prevent domain isolation.

Data Mesh benefits

Scalability and flexibility – the team can independently manage their data, which they know very well. This allows them to adapt faster to changing business needs.
Transferring responsibility for analytical data to their “producers” – no more searching for where the data comes from and which flow caused the problem.
Improvement of data quality – Data Mesh introduces data management standards and procedures, which translates into more accurate information for business units.
Facilitating cooperation and data exchange – despite data decentralization, Data Mesh allows you to maintain data consistency and integration by using data standards, APIs and services for communication between data teams.
Greater space for innovation and experimentation – data decentralization in Data Mesh allows different teams to experiment with data and create innovative solutions without having to rely on central data resources. This promotes creativity and allows you to come up with new ideas faster.
Facilitated data discovery – this is made possible by publishing data contracts and sharing a data catalog that contains information about the available data sets in the organization.
Better understanding of data connections – this is made possible thanks to breaking down a large, complex data model into many that are smaller and easier to manage.

Data Mesh challenges

Complexity of implementation – Data Mesh implementation can be complex and requires significant work in terms of transforming the organizational culture, adapting processes and changing infrastructure.
Performance issues – in the case of combining data from many different domains.
Duplication of data – it can cause problems with finding a single source of truth.
Lack of a consistent approach to technology – individual domains can implement data as a product in different technologies.
Additional responsibilities – system owners may not want to take on additional responsibilities related to building the analytical part.
Lack of a broader perspective – individual domains may focus strongly on their data and lose a broader overview of the entire organization’s data.
Duplication of solutions – deficiencies related to the data catalog or documentation may lead to creating similar solutions in different areas.
Risk of chaos – shifting responsibility for analytical data to source systems without technological support may cause more chaos.
The need to introduce all principles and organizational changes – investing only in data platform infrastructure or buying out-of-the-box tools supporting Data Mesh without organizational changes will not bring the expected results. Without shared global rules and standards, each domain will work in whatever way is most convenient for it. This is not necessarily good from the point of view of the entire organization. Non-compliance with the Data Mesh principles will simply result in areas that lack the data product approach and its advantages.

Data Mesh – is it for me?

You might be looking at the benefits of Data Mesh with enthusiasm. On the other hand, the list of potential issues and challenges is also long. However, many of them result from not implementing all the principles, or from implementing them at an insufficient level. These principles are the pillars of Data Mesh. The road to implementing the Data Mesh approach is neither short nor easy. But if your organization is faced with the challenges that led to the Data Mesh concept, it is worth considering.

It may be helpful to ask yourself the following questions:

Is my organization large and complex, with many different data sources?
Is my organization already Domain-Oriented?
Do we use modern data solutions, CI/CD, DevOps, data on cloud?
Do we use a Data-Driven strategy (ML and advanced analytics)?
Do we have management support?
Are we aware that the implementation of Data Mesh is a long-term process?
Do we have the technical capability to build a data platform?

If the answer to most of them is yes, then the Data Mesh approach is likely to be helpful.

Implementing Data Mesh strategy – how to get started

It is worth starting off the implementation of the Data Mesh approach with small specific data use cases, in teams that are interested and enthusiastic about this topic. If possible, it is worth using the existing technology and infrastructure to implement this approach to data.

Small steps will allow you to prove that this approach works and brings value to the organization. With use cases, it is easier to get management’s support and obtain an investment budget to implement a Data Mesh approach in other systems. It may also be helpful to create a team of data enablers that will support new domain teams by sharing examples, templates, or best practices.

Data Mesh implementation – summary

We all know the potential of data. With data, you discover new information, make better business decisions, understand trends, predict behaviors and create new products. Data is one of the most valuable resources in the world today, provided we can arrange it properly to extract information and value quickly and easily from it. For this, it is necessary to choose the right architecture that will allow you to effectively manage data, scale, optimize resources, and respond to business needs.

The Data Mesh architecture stands out among the available options and should be of interest to any large domain-oriented organization with Big Data from various sources.
