Modern Data Governance for Data and AI Initiatives

Balakumaran Vaithyalingam
Aug 19, 2023

It is estimated that at least 120 zettabytes of data will be created and consumed globally by the end of 2023. Expressed in bytes, that is 120 followed by 21 zeros. This volume provides more opportunities to derive valuable insights from data.

With the rapid adoption of generative AI, an organization's data is more valuable than ever before. With government regulation of data and AI on the rise, poor data governance or AI governance (which requires trust in data) can have an enormous impact, including hefty fines, damage to brand reputation, increased risk, and declining market share. At the same time, generative AI and advanced analytics require trusted and relevant data.

Low-quality data is a cost to the organization, whereas high-quality data delivers significant value.

Challenges with Traditional Data Governance Approach

Data governance has evolved over two decades, but the traditional approach has some major challenges:

  • Inability to meet data demands in a timely manner: a data governance implementation requires a multi-year journey
  • Manual and complex processes: traditional data management and data governance tools require significant manual effort to identify, prepare, and protect data for business needs. They involve too many personas and too many steps, and establishing a data governance program demands multiple skills
  • Challenges with data democratization (data access and data literacy) due to data silos, data spread across the hybrid cloud, complex data structures (semi-structured and unstructured), data volume growth, data in motion, and so on
  • It is practically impossible to catalog every data structure, and dark data increases risk
  • A complex data management landscape, with multiple copies and varying instances of the same data, is difficult to understand or prevent
  • Inadequate data protection to support growing data privacy regulations

The traditional data governance approach focuses on delivering data-driven insights but lacks digital transformation and acceleration capabilities such as self-service data consumption, automated data orchestration, automated governance policy enforcement, and automated data quality management.

Key Principles of Modern Data Governance

The above-mentioned traditional data governance challenges can be addressed through Modern Data Governance design principles. Often, this has been called “Data Governance 2.0”, “NextGen Data Governance” or “Smart Data Governance”.

Modern Data Governance improves on the traditional approach by applying data fabric architecture design principles, optimizing the alignment of People and Process, adopting federated governance, leveraging technology advancements (AI and ML), and implementing through an Agile framework.

The primary goal of modern data governance is to deliver value and comply with regulations at a faster speed to meet enterprise use cases and to support AI initiatives. The following diagram shows the need for modern data governance:

Figure 1 — Modern Data Governance

Data fabric is an architecture design that dynamically orchestrates data delivery pipelines with augmented business knowledge, self-service data access, community collaboration for all personas, and automated data protection, regardless of where the data resides.

When organizations adopt a data fabric architecture design for their data governance program, they can quickly see significant improvements in process and business outcomes.

1. Governance Policy:

Governance policies are objectives (mission and vision), standards, security controls, and obligations (regulations). Any data governance or data management initiative must be aligned with governance policies. Key data and AI initiatives require alignment of people, processes, and technology (platform), tailored specifically to the organization's governance policies. In traditional data governance, the policies often live in web portals or PDF documents. It is important to manage governance policies as business metadata within the data catalog and tie them to other business metadata (such as classifications, business taxonomy, data protection rules, and governance rules) and to technical metadata. This makes the policies easy to understand for everyone who deals with data and AI. Organizations should consider automated and uniform policy enforcement across their data management landscape to support data privacy and data sovereignty policies.
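To make the idea of policies as business metadata concrete, here is a minimal Python sketch. It is an illustration only: the `GovernancePolicy` and `ColumnMetadata` classes, the policy IDs, and the classification names are hypothetical and do not represent any catalog product's data model. It shows how a policy linked to a classification can be looked up from the technical metadata of a column.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GovernancePolicy:
    """A governance policy stored as business metadata (hypothetical model)."""
    policy_id: str
    title: str
    classifications: frozenset  # classifications the policy applies to


@dataclass
class ColumnMetadata:
    """Technical metadata for one column, tagged with classifications."""
    table: str
    column: str
    classifications: set = field(default_factory=set)


def policies_for(column, policies):
    """Return the policies whose classifications overlap the column's tags."""
    return [p for p in policies if p.classifications & column.classifications]


policies = [
    GovernancePolicy("POL-1", "Mask personal data", frozenset({"PII"})),
    GovernancePolicy("POL-2", "Keep EU data in region", frozenset({"EU_RESIDENT"})),
]
email = ColumnMetadata("customers", "email", {"PII"})
order_total = ColumnMetadata("orders", "total")

print([p.policy_id for p in policies_for(email, policies)])
print([p.policy_id for p in policies_for(order_total, policies)])
```

Because the link runs through classifications rather than through individual tables, a new column tagged `PII` picks up the policy without anyone editing the policy itself.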

Figure 2 — Aligning People, Process and Technology to achieve Governance Policy

Establish data governance policies to revisit data silos and data copies. A balanced approach to data centralization and decentralization is needed. Data governance policies should establish standards for data centralization and consider initiatives such as a data marketplace (data product hub) to improve collaboration and create quality data products for the data domains. Enterprise constituents across lines of business can use the data marketplace to save and make money.

2. Optimization of Technology (Platform):

a. Platform approach with the “quarter-inch hole” analogy: Traditional data governance tools must be integrated with one another, as shown below. This integration introduces further complexity in upgrades, security configurations, logging mechanisms, connectivity setup with the necessary data sources, and monitoring.

Figure 3 — Challenges with Tool based approach

Let’s look at the famous “quarter-inch hole” analogy. The drill bit is a feature, but what people want is a hole that fits their needs. If a user needs a quarter-inch hole, they don’t want a quarter-inch drill. Going one step further, some users don’t care about the drill or the hole size at all; they need an end product such as a shelf. In a real project, however, a simple misunderstanding of these requirements, combined with different tools for each persona, can cause confusion, unnecessary delays, and a serious financial and operational impact on a company’s business. In addition, the tool-based approach creates tiers of data competency and leads to different interpretations of the same data. Finding the right information at the right time is critical for users’ productivity.

Figure 4 — Quarter-inch analogy

Multiple, inconsistent, and siloed tools, together with differing skill levels, seriously affect an organization's ability to derive value from data and make protecting hybrid multi-cloud data expensive. People should consume data and insights and make decisions, and focus less on plumbing work (tool integration or software development). This is why organizations need to modernize their data architecture from a tool-based approach to a data fabric architecture with a platform-based approach.

IBM Cloud Pak for Data is a unified and collaborative data management platform for all users to connect, access, organize, refine, integrate, and consume business-ready data, with centrally enforced data protection policies to safeguard its use. IBM Cloud Pak for Data also enables all users to collaborate from a single, unified, browser-based interface that supports a large number of cloud-native microservices designed to work together. For example, users configure connectivity to a data source only once. That connection can then be leveraged for multiple purposes: metadata import, data profiling, data quality scorecards, data integration, data virtualization, self-service data refinery, report development, AI/ML development, data protection, and so on. The IBM Cloud Pak for Data platform approach therefore provides significant savings in IT administration compared with configuring the same database connectivity in multiple tools. Similar savings apply to security configurations (LDAP, SSO, SAML, etc.), backup and restore setup, and so on. Hence, organizations can save significant time and effort and consolidate administration costs through simplified provisioning and scaling, common security, logging, monitoring, platform connections, and seamless upgrades.
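The "configure once, reuse everywhere" idea can be sketched in a few lines of Python. This is a simplified illustration and not the IBM Cloud Pak for Data API: `ConnectionRegistry`, `import_metadata`, `profile_data`, and the host name are all hypothetical names chosen for the example.

```python
class ConnectionRegistry:
    """Hypothetical registry: each data source is configured exactly once."""

    def __init__(self):
        self._connections = {}

    def register(self, name, **settings):
        # Refuse duplicates so there is a single source of truth per source.
        if name in self._connections:
            raise ValueError(f"connection {name!r} is already registered")
        self._connections[name] = dict(settings)

    def get(self, name):
        return self._connections[name]


def import_metadata(registry, source):
    """One consumer of the shared connection (placeholder logic)."""
    conn = registry.get(source)
    return f"importing metadata from {conn['host']}:{conn['port']}"


def profile_data(registry, source):
    """A second consumer that reuses the same connection settings."""
    conn = registry.get(source)
    return f"profiling data on {conn['host']}:{conn['port']}"


registry = ConnectionRegistry()
registry.register("sales_db", host="db.example.internal", port=5432)

print(import_metadata(registry, "sales_db"))
print(profile_data(registry, "sales_db"))
```

With separate tools, each of these consumers would carry its own copy of the host, port, and credentials, and every change would have to be repeated in each tool.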

IBM Cloud Pak for Data provides additional capabilities to implement modern data governance:

b. The microservices design in IBM Cloud Pak for Data helps to exchange insights across multiple use cases. For example, data profiling insights are automatically passed to Match 360 services for automated data mapping and provide confidence indicators for match algorithm creation based on data frequency distribution, data availability, and so on. Similarly, the same data profiling insights are readily available for ETL developers to create effective DataStage flows and for business users to create data refinery flows through self-service. The automated exchange of valuable insights between microservices is essential for self-service data access and data consumption.

c. Active and automated metadata management: IBM Knowledge Catalog in Cloud Pak for Data provides a simple point-and-click self-service user interface (metadata enrichment). After the user selects one or more data sources, IBM Knowledge Catalog, powered by machine learning, automatically imports technical metadata, creates data profiling reports, creates data quality scorecards, performs data classification, links business metadata with technical metadata, and enforces data protection policies. More importantly, the same metadata enrichment process can be scheduled to run periodically to keep the metadata in the knowledge catalog active.
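As a rough illustration of what profiling and classification produce for a single column, consider the Python sketch below. It is not how IBM Knowledge Catalog works internally: the text describes machine-learning-based classification, whereas this example substitutes a simple regular-expression rule, and the function names and the 0.8 threshold are assumptions made for the example.

```python
import re
from collections import Counter

# Deliberately simple pattern; real classifiers are far more robust.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def profile_column(values):
    """Compute basic profile statistics for one column of values."""
    non_null = [v for v in values if v is not None and v != ""]
    return {
        "row_count": len(values),
        "completeness": len(non_null) / len(values) if values else 0.0,
        "distinct_count": len(set(non_null)),
        "top_values": Counter(non_null).most_common(3),
    }


def classify_column(values, threshold=0.8):
    """Tag the column 'email_address' if enough non-null values match."""
    non_null = [v for v in values if v is not None and v != ""]
    if not non_null:
        return None
    matches = sum(1 for v in non_null if EMAIL_PATTERN.match(str(v)))
    return "email_address" if matches / len(non_null) >= threshold else None


emails = ["a@example.com", "b@example.com", None, "c@example.com", "a@example.com"]
profile = profile_column(emails)
print(profile["completeness"], profile["distinct_count"])
print(classify_column(emails))
```

Running the same functions on a schedule, rather than once, is what keeps the resulting metadata "active" as the underlying data changes.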

d. IBM Cloud Pak for Data offers data virtualization to query data more easily and more securely across multiple sources, on the cloud or on-premises, without physically copying the data between systems. Because data is accessed through the virtual layer, data protection rules are enforced automatically wherever the data resides and whenever sensitive data is accessed.
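The following Python sketch illustrates the principle of enforcing protection rules at a single access layer. It is a toy model under stated assumptions, not the product's enforcement engine: `apply_protection`, `redact`, the rule structure, and the role names are hypothetical.

```python
def redact(value):
    """Replace every character of the value with an asterisk."""
    return "*" * len(str(value))


def apply_protection(rows, column_classes, rules, user_roles):
    """Mask classified columns unless the user holds an exempt role."""
    protected = []
    for row in rows:
        masked = {}
        for column, value in row.items():
            # Look up the rule through the column's classification, if any.
            rule = rules.get(column_classes.get(column))
            if rule and not (rule["exempt_roles"] & user_roles):
                masked[column] = rule["mask"](value)
            else:
                masked[column] = value
        protected.append(masked)
    return protected


rules = {"PII": {"mask": redact, "exempt_roles": {"privacy_officer"}}}
column_classes = {"email": "PII"}
rows = [{"id": 1, "email": "a@example.com"}]

print(apply_protection(rows, column_classes, rules, {"analyst"}))
print(apply_protection(rows, column_classes, rules, {"privacy_officer"}))
```

Since every query passes through the same function, the rule applies no matter which underlying source the rows came from, and the source rows themselves are never modified.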

e. Leveraging graph technologies in IBM Knowledge Catalog and Match 360 helps to simplify the creation, management, and visualization of complex data relationships.
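To show why a graph is a natural fit for data relationships, here is a minimal Python sketch. It is an in-memory illustration only and says nothing about how IBM Knowledge Catalog or Match 360 store their graphs: the `RelationshipGraph` class, the relation names, and the asset identifiers are hypothetical.

```python
from collections import defaultdict, deque


class RelationshipGraph:
    """Hypothetical directed graph of (source, relation, target) edges."""

    def __init__(self):
        self._edges = defaultdict(list)

    def relate(self, source, relation, target):
        self._edges[source].append((relation, target))

    def reachable(self, start):
        """Return every node reachable from start (breadth-first search)."""
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for _, target in self._edges.get(node, []):
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
        seen.discard(start)  # a cycle back to start should not count
        return seen


graph = RelationshipGraph()
graph.relate("term:Customer", "assigned_to", "table:crm.customers")
graph.relate("table:crm.customers", "feeds", "table:dw.dim_customer")
graph.relate("table:dw.dim_customer", "feeds", "report:churn")

print(sorted(graph.reachable("table:crm.customers")))
```

A question such as "what is affected if this table changes" becomes a traversal rather than a chain of joins across several tools.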

f. IBM Cloud Pak for Data offers Data Observability to proactively monitor end-to-end data health across the enterprise. Organizations with data observability can 1) easily understand data pipeline health, 2) alert on data latency issues, 3) check data for validity and completeness, 4) analyze data trends, and 5) provide lineage based on operational data flows for impact analysis.
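Two of these checks, latency and completeness, can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the product's observability service: the function names, the one-hour freshness window, and the 95% completeness threshold are hypothetical choices.

```python
from datetime import datetime, timedelta, timezone


def check_freshness(last_loaded_at, max_age, now):
    """True if the last load happened within the allowed age."""
    return now - last_loaded_at <= max_age


def check_completeness(rows, required_columns, min_ratio=0.95):
    """True if enough rows have all required columns populated."""
    if not rows:
        return False
    complete = sum(
        1 for row in rows
        if all(row.get(c) not in (None, "") for c in required_columns)
    )
    return complete / len(rows) >= min_ratio


def pipeline_alerts(last_loaded_at, rows, required_columns, max_age, now):
    """Collect alert messages for every failed health check."""
    alerts = []
    if not check_freshness(last_loaded_at, max_age, now):
        alerts.append("stale data")
    if not check_completeness(rows, required_columns):
        alerts.append("incomplete data")
    return alerts


now = datetime(2023, 8, 19, 12, 0, tzinfo=timezone.utc)
rows = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": None}]
alerts = pipeline_alerts(
    last_loaded_at=now - timedelta(hours=3),
    rows=rows,
    required_columns=["id", "email"],
    max_age=timedelta(hours=1),
    now=now,
)
print(alerts)
```

Passing `now` in explicitly, instead of reading the clock inside the check, keeps the checks deterministic and easy to test.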

3. People:

Data management is a cross-functional effort. Managing data requires both technical and non-technical skills and expertise, and people play key roles in a successful data governance implementation. In modern data governance, organizations must focus on shifting the culture from a governance structure or management hierarchy to a data governance community. Most users have a completely different primary job function in the organization (their “day job”), and their participation in data governance is often a secondary role or a voluntary contribution. In a typical marketplace, consumers shop for products and also contribute to the marketplace through ratings, reviews, and promoting products on social media. A data governance community works in a similar way: its members both consume data and contribute to it.

Treat data as an enterprise asset with unique properties, and make data quality a shared responsibility.

Executives and data governance sponsors should focus on building a culture that embraces governance, community practice, and motivation. It is equally important to celebrate key milestones and reward community contributors.

4. Process:

In any data governance program, the process steps are often defined in a RACI (Responsible, Accountable, Consulted, Informed) chart. It is important to establish the RACI in alignment with policies (objectives), people (roles), and technology.

Figure 5 — RACI

In modern data governance, revisit your existing RACI to ensure that the process is simple and as automated as possible. Reduce the number of process steps through automated discovery and automated scanners with active metadata management. Centralized data governance often leads to complex processes, so consider federated data governance to streamline process steps and promote crowd-sourcing and community collaboration.
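One way to revisit a RACI systematically is to hold it as data and validate it automatically. The Python sketch below is a hypothetical illustration: the process steps, the role names, and the rule that each step needs exactly one Accountable and at least one Responsible role are assumptions (a common RACI convention), not requirements stated in this article.

```python
# Each process step maps roles to one of the codes R, A, C, I.
raci = {
    "Define business term": {
        "Data Steward": "R", "Data Owner": "A", "Data Engineer": "C",
    },
    "Approve data access": {
        "Data Steward": "R", "Data Owner": "A", "Security Officer": "A",
    },
    "Publish data product": {
        "Data Owner": "A", "Business Analyst": "I",
    },
}


def validate_raci(chart):
    """Return a list of problems found in the RACI chart."""
    problems = []
    for step, assignments in chart.items():
        codes = list(assignments.values())
        accountable = codes.count("A")
        if accountable != 1:
            problems.append(
                f"{step}: expected exactly one Accountable, found {accountable}"
            )
        if "R" not in codes:
            problems.append(f"{step}: no Responsible role assigned")
    return problems


for problem in validate_raci(raci):
    print(problem)
```

A check like this can run whenever roles or process steps change, so gaps in ownership surface before they slow down a governance workflow.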

In summary, modern data governance with a data fabric architecture design using IBM Cloud Pak for Data provides the following benefits for business and technical users:

Figure 6 — Modern Data Governance Benefits

If you are interested in Modern Data Governance, please schedule a consultation with IBM Experts Lab or join the IBM Knowledge Catalog community to connect with experts and attend one of the upcoming community events.
