In its recently announced technology trends in data management, Gartner introduced the concept of “Data Fabric”. The source document is Top Trends in Data and Analytics for 2021: Data Fabric Is the Foundation (gartner.com).
According to Gartner, the data fabric approach can enhance traditional data management patterns and replace them with a more responsive approach. Because this is key to an enterprise data management strategy, this article looks at data fabric in more detail.
What is Data Fabric?
Today’s enterprise data stores and data volumes are growing rapidly. Data fabric aims to simplify the management of enterprise data sources and make it easier to extract insights from them. A data fabric has the following attributes:
- Connects to multiple data sources
- Provides data discovery of data sources
- Stores metadata and data catalog information about the data sources
- Data ingestion capabilities including data transformation
- Data lake and data storage options
- Ability to store multi-modal data, both structured and unstructured
- Ability to integrate data across clouds
- Built-in graph engine to link data and represent complex relationships
- Data virtualization to integrate with data that need not be physically moved
- Data governance and data quality management
- Built-in AI/ML engine for providing machine learning capabilities
- Ability to share the data both within enterprises and across enterprises
- Easy-to-configure workflows without much coding (low-code environment)
- Support for comprehensive use cases like Customer 360, Patient 360 and more
As is evident, data fabric aims to provide a superset of all the desired data management capabilities under a single unified platform, making it a strong candidate for the future of data management in enterprises.
While most of the above capabilities are already part of existing enterprise data management platforms, one important capability that sets a data fabric platform apart is data virtualization.
Data virtualization creates a data abstraction layer by connecting, gathering, and transforming data silos to support real-time and near real-time insights. It gives you direct access to transactional and operational systems in real time, whether on-premises or in the cloud.
One of the basic implementations of data virtualization is querying an external data source natively without actually moving the data. For example, a Hadoop HDFS data source can be queried from a data fabric platform so that the external data can be integrated with other data.
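To make the idea concrete, here is a minimal Python sketch of a virtual table. It is not how any particular data fabric product works. The `VirtualCatalog` class, the `read_external_clicks` reader, and the sample data are all hypothetical, and an in-memory CSV string stands in for a file that would really live in an external system such as HDFS. The point it illustrates is that the catalog stores only a reader function, so the external rows are fetched at query time and joined with local data without being copied into a local store.

```python
import csv
import io
from typing import Callable, Dict, Iterator

# Stand-in for a file that lives in an external system such as HDFS.
# The data is made up for illustration.
EXTERNAL_CSV = """customer_id,page,seconds
1,/home,12
2,/pricing,40
1,/pricing,75
"""

Row = Dict[str, str]


def read_external_clicks() -> Iterator[Row]:
    """Stream rows from the external source each time it is queried."""
    yield from csv.DictReader(io.StringIO(EXTERNAL_CSV))


class VirtualCatalog:
    """Maps table names to reader functions instead of stored copies."""

    def __init__(self) -> None:
        self._readers: Dict[str, Callable[[], Iterator[Row]]] = {}

    def register(self, name: str, reader: Callable[[], Iterator[Row]]) -> None:
        self._readers[name] = reader

    def scan(self, name: str) -> Iterator[Row]:
        # The external source is read only now, at query time.
        return self._readers[name]()


# A small "local" table held by the platform itself.
LOCAL_CUSTOMERS = {"1": "Asha", "2": "Ben"}


def seconds_per_customer(catalog: VirtualCatalog) -> Dict[str, int]:
    """Join local customer names with external click rows on the fly."""
    totals: Dict[str, int] = {}
    for row in catalog.scan("clicks"):
        name = LOCAL_CUSTOMERS[row["customer_id"]]
        totals[name] = totals.get(name, 0) + int(row["seconds"])
    return totals


catalog = VirtualCatalog()
catalog.register("clicks", read_external_clicks)
result = seconds_per_customer(catalog)
print(result)
```

In a real platform the reader would be a connector to the external system, and the join would be planned by a query engine rather than written by hand.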
While this kind of external data source access has been available for a while, data fabric also aims to solve the performance issues associated with data virtualization. Some of the techniques used by data fabric platforms are:
- Pushes some computations to the external source to optimize the overall query
- Scales out compute resources by providing parallelism
As explained earlier, another critical capability of data fabric platforms is the ability to integrate data from multiple cloud providers. This capability is at an early stage, as different cloud platforms have different architectures and there is no uniform way of connecting them. However, it is expected to mature over time.
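The two techniques above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a real connector. The `PARTITIONS` list is a hypothetical external source split into three partitions, `scan_partition` stands in for work executed at the source, and `federated_query` is a made-up name for the platform-side coordinator. The filter is evaluated inside each partition scan (pushdown), and the partitions are scanned concurrently with the standard library's `ThreadPoolExecutor` (parallelism).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Row = Dict[str, int]

# Hypothetical external source, split into partitions.
PARTITIONS: List[List[Row]] = [
    [{"order_id": 1, "amount": 40}, {"order_id": 2, "amount": 250}],
    [{"order_id": 3, "amount": 900}, {"order_id": 4, "amount": 15}],
    [{"order_id": 5, "amount": 300}],
]


def scan_partition(partition: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Stands in for work done at the source: only matching rows are returned."""
    return [row for row in partition if predicate(row)]


def federated_query(predicate: Callable[[Row], bool], workers: int = 3) -> List[Row]:
    """Push the predicate to every partition and scan the partitions in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as PARTITIONS.
        chunks = pool.map(lambda part: scan_partition(part, predicate), PARTITIONS)
        return [row for chunk in chunks for row in chunk]


large_orders = federated_query(lambda row: row["amount"] >= 250)
print([row["order_id"] for row in large_orders])
```

Without pushdown, all five rows would travel to the platform before being filtered. With it, only the three matching rows do, which is where most of the savings come from when the external data is large.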
Advanced Use Cases
Data fabric should support advanced use cases like Customer 360, Product 360, etc. These are comprehensive views of all the linkages between enterprise data, typically implemented using graph technologies. Since data fabric supports graph databases and graph queries as an inherent feature, these advanced linkages are part of the data fabric platform.
Data fabric platforms should also focus on data sharing, not only within the enterprise but also across enterprises. While a focus on API management helps with data sharing, this functionality has to be enhanced further, as data sharing also needs to take care of privacy and other data governance needs.
While earlier platforms similar to data fabric used the enterprise data warehouse as their backbone, data fabric uses a data lake as its backbone. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure it, and run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to guide better decisions.
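As a rough illustration of how graph linkages produce a 360 view, the following Python sketch walks a small in-memory graph. The node names, the `EDGES` list, and the `customer_360` function are hypothetical. A real platform would use a graph database and its query language rather than a hand-written breadth-first search.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple

# Hypothetical links between records held in different systems.
EDGES = [
    ("customer:42", "order:1001"),
    ("customer:42", "ticket:77"),
    ("order:1001", "product:sku-9"),
    ("customer:43", "order:1002"),
]


def build_graph(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from pairs of linked records."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for left, right in edges:
        graph[left].add(right)
        graph[right].add(left)
    return graph


def customer_360(graph: Dict[str, Set[str]], start: str) -> Set[str]:
    """Collect every record reachable from the starting customer (breadth-first)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen - {start}


view = customer_360(build_graph(EDGES), "customer:42")
print(sorted(view))
```

The traversal reaches the product through the order, even though the customer record has no direct link to it. Records that belong only to `customer:43` are left out.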
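One way to picture the privacy side of data sharing is a field-level policy applied before a record leaves the enterprise. The sketch below is an assumption about how such a policy could look, not a feature of any specific platform. `SHARE_POLICY`, `prepare_for_sharing`, the salt value, and the sample record are all hypothetical. A salted hash is only a simple form of pseudonymization and would not be sufficient on its own for every regulatory requirement.

```python
import hashlib
from typing import Any, Dict

# Hypothetical per-field sharing policy: keep, pseudonymize, or drop.
SHARE_POLICY = {
    "customer_id": "pseudonymize",
    "email": "drop",
    "country": "keep",
    "lifetime_value": "keep",
}


def pseudonymize(value: str, salt: str) -> str:
    """Replace an identifier with a salted SHA-256 digest, truncated for readability."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def prepare_for_sharing(record: Dict[str, Any], policy: Dict[str, str], salt: str) -> Dict[str, Any]:
    """Apply the policy field by field; fields not named in the policy are not shared."""
    shared: Dict[str, Any] = {}
    for field, value in record.items():
        action = policy.get(field, "drop")
        if action == "keep":
            shared[field] = value
        elif action == "pseudonymize":
            shared[field] = pseudonymize(str(value), salt)
    return shared


record = {
    "customer_id": "C-1001",
    "email": "asha@example.com",
    "country": "IN",
    "lifetime_value": 1840,
    "phone": "000-0000",
}
shared = prepare_for_sharing(record, SHARE_POLICY, salt="partner-a")
print(shared)
```

Defaulting unknown fields to "drop" means a newly added column is not shared until someone explicitly allows it.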
Data Fabric Players
At the time of writing this article, Gartner has not published a Magic Quadrant rating for data fabric platforms.
Based on industry trends, the following are a few of the major players:
- Denodo Technologies
- Cambridge Semantics
While a detailed explanation of the architecture of these platforms can be covered in a subsequent article, the sample building blocks of the Talend data fabric platform are illustrated in the diagram below.
Enterprises can also think of building their own data fabric platform by combining the best features of various individual components. For example, from the Microsoft ecosystem perspective:
- SQL Server Big Data Clusters has data virtualization capabilities
- Azure Purview has data governance and metadata management capabilities
- Azure Data Lake Storage provides data lake capabilities
- Azure Cosmos DB has a graph database engine
- Azure Data Factory has data integration features
- Azure Machine Learning and SQL Server have machine learning capabilities
However, strong products and platforms are yet to emerge for multi-cloud data management, especially for performance-focused data virtualization across cloud providers.