Successful DataOps Framework for your Business

Balakumaran Vaithyalingam
May 28, 2021

What is DataOps?

Data Operations (DataOps) is the orchestration of people, processes and platforms to deliver continuous, high-quality data to data citizens, focused on enabling collaboration across an organization to drive agility, speed and new initiatives at scale.

With a properly governed data pipeline, organizations can enable intelligent data fabrics that realize the value of their data while improving decisions and reducing the time needed to reach them. Additionally, businesses can comply with new privacy regulations and protect confidential data from unintended or inappropriate use.

Evolution of DataOps

DataOps is not a completely new way of managing data in the organization. In 1988, IBM researchers Barry Devlin and Paul Murphy introduced the concept of the “business data warehouse”. Since then, data management solutions have evolved in conjunction with project methodology to address the business challenges caused by inefficiencies in accessing, preparing, integrating, consuming and protecting data.

  • At the beginning of the 21st century, most organizations started implementing data warehouses for Business Intelligence (BI) and data mining. This introduced unexpected complexity, and significant effort was required to implement frequent changes and enhancements to the data integration process, data warehouse tables, data mart tables and BI reports. Hence, the concept of “Metadata Management” was introduced to support impact analysis and data lineage by capturing metadata for tables, views, files and data integration jobs in a central place, so that any change to the data sources or business rules could be applied with ease. During this period, data management projects relied heavily on Waterfall-based implementation, which worked for projects where the end state was clearly defined.
  • After a few years, businesses began to see value in technical metadata and expanded these capabilities by capturing business metadata and business lineage. Because Waterfall-based project management falters when requirements change continuously, organizations gradually moved to incremental and iterative Agile-based methodologies, which are better suited to projects where not all requirements are available upfront.
  • “Data Governance frameworks” were then introduced, and data quality, data profiling, and quality score became an essential part of data management principles.
  • After a few years, organizations were forced to create and manage centralized data catalogs to meet new compliance regulations such as the Basel Committee on Banking Supervision’s standard number 239 (BCBS 239) and the General Data Protection Regulation (GDPR), among others. During this same period, organizations started adopting DevOps design principles by bringing Development (Dev) and Operations (Ops) together with automation.
  • Traditional data governance processes were manual, and it became increasingly unwieldy for organizations to create and manage data catalogs as business processes multiplied and data volume, variety (structured, semi-structured and unstructured), veracity and velocity grew. A Machine Learning Data Catalog (MLDC) with data discovery became a necessity for organizations to build data catalogs with automatic linking of business, design, technical, quality and operational metadata. In effect, the ML-powered data catalog enabled a “Google-like search engine” for an organization’s data assets.
  • The term “DataOps” was first introduced by Lenny Liebmann, a contributing editor at InformationWeek, in a blog post on the IBM Big Data & Analytics Hub in June 2014. However, DataOps gained prominence when Gartner included it in its Hype Cycle for Data Management in 2018. The core principle of DataOps is to bring data governance frameworks, DevOps, Agile and Artificial Intelligence (AI)/Machine Learning (ML) principles together, focused on continuous delivery of the data pipeline. This requires a unified and collaborative data & AI platform for all data citizens to collect, integrate, organize, shape and consume business-ready data with active data protection policy enforcement.

The following chart depicts the evolution of DataOps over the last few decades:

Figure 1: Evolution of DataOps

People, Process, Platform and Policy

4 Ps for successful DataOps Implementation

Any successful DataOps implementation depends on People, Process, Platform, and Policy.

People: Each organization has unique needs where stakeholders in IT, data science, and the line of business add value to drive a successful business. DataOps teams should focus on aligning the delivery of necessary data with the value it can bring to the business.

Process: Well-defined processes improve the quality and speed of data flowing through the organization. Business processes should ensure that accurate and complete data is captured at the source, classified, retained, and disposed of. Leadership commitment is needed to support and sustain a data-driven vision across the business.

Platform: A DataOps toolchain built on an integrated and collaborative platform is critical to support AI-enabled automation in five areas that transform data pipelines:

  • Data curation and advanced data integration services
  • Metadata management
  • Data governance
  • Master data management
  • Self-service integration.

Policy: Policies are high-level goals, ideas, approaches, obligations and standards. They are designed to manage the creation, transformation and usage of data and related information owned by or in the care of your organization. Any DataOps initiative must take a policy-driven approach to define its focus areas and to measure and monitor the program successfully.

DataOps Framework

DataOps should be adopted by bringing DevOps, Agile and AI principles together, with a focus on continuous delivery of trusted, high-quality, business-ready data to all data citizens under enforced data protection policies. A DataOps roadmap should be established based on business outcomes and realized by implementing the required policies and data management capabilities.

The following chart shows the DataOps Framework with detailed process steps for Business and IT users:

Figure 2: DataOps Framework

DataOps implementation methodologies should focus on the innovation pipeline and the data value pipeline to rapidly provide benefits to the organization:

  • The innovation pipeline defines and governs the data management goals (objectives, standards, ideas, approaches, and obligations) through collaborative workflow by business users.
  • The data value pipeline is implemented through automated orchestration and accelerated development leveraging AI and machine learning by technical users.

Figure 3: DataOps Methodology (Innovation Pipeline and Data Value Pipeline)

People play a key role in a successful DataOps implementation. It is therefore important to define a RACI chart (Responsible, Accountable, Consulted and Informed) and establish the workflow governing the data pipeline. The following chart shows a sample RACI chart for the DataOps team. The actual roles, responsibilities, and FTE requirements vary based on the organization’s structure and goals.

Figure 4: DataOps RACI Chart (Sample)
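
A RACI chart can also be kept as data and checked automatically before it is published. The following is a minimal Python sketch of that idea. The activity and role names are hypothetical and are not taken from Figure 4, and the rule it checks (exactly one Accountable and at least one Responsible per activity) is a common RACI convention assumed here, not something this framework prescribes.

```python
from collections import Counter

VALID_CODES = {"R", "A", "C", "I"}

# Hypothetical activities and roles, for illustration only.
SAMPLE_RACI = {
    "Define data protection policies": {
        "Chief Data Officer": "A",
        "Data Steward": "R",
        "Data Engineer": "C",
        "Business Analyst": "I",
    },
    "Build data pipelines": {
        "Data Engineer": "R",
        "Data Steward": "C",
        "Chief Data Officer": "I",
    },
}


def validate_raci(chart):
    """Return a list of human-readable problems found in the chart."""
    problems = []
    for activity, assignments in chart.items():
        counts = Counter(assignments.values())
        unknown = sorted(set(counts) - VALID_CODES)
        if unknown:
            problems.append(f"{activity}: unknown codes {unknown}")
        # Assumed convention: exactly one Accountable role per activity.
        if counts["A"] != 1:
            problems.append(
                f"{activity}: expected exactly 1 Accountable, found {counts['A']}"
            )
        # Assumed convention: at least one Responsible role per activity.
        if counts["R"] < 1:
            problems.append(f"{activity}: no Responsible role assigned")
    return problems


if __name__ == "__main__":
    for problem in validate_raci(SAMPLE_RACI):
        print(problem)
```

In this sample, the second activity is deliberately left without an Accountable role so that the check has something to report. A team could run such a check whenever the chart changes, in the same way that code is linted before a release.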

DataOps Implementation

IBM Cloud Pak for Data is a cloud-native solution that enables organizations to put their data to work quickly and efficiently. Some of the key benefits of Watson Knowledge Catalog on IBM Cloud Pak for Data are:

  1. All personas can collaborate in a single unified platform: Chief Data Officer (CDO), Chief Security Officer (CSO), Governance, Risk and Compliance SMEs, business analysts, business users, data quality specialists, data curators, data engineers, data scientists and other data citizens.
  2. The IBM Cloud Pak for Data platform provides all the capabilities (integrated data management, data governance, data science, and analytics) necessary to deliver business-ready data that supports DataOps guiding principles.
  3. Manage end-to-end data workflows to ensure that your data is of high quality and delivers accurate, automated insights and decisions.
  4. IBM Cloud Pak for Data offers data virtualization to query data more easily and more securely across multiple sources, on cloud or on premises. Benefit from built-in models and accelerators for various industries including finance, insurance, healthcare, energy, utilities, and more.

Figure 5: IBM Watson Knowledge Catalog for DataOps

IBM Watson Knowledge Catalog on Cloud Pak for Data is a data catalog tool that powers intelligent, self-service discovery of data, models and more to support your DataOps initiatives. One of the common challenges organizations face is aligning business and IT so that they collaborate efficiently. IBM Watson Knowledge Catalog offers automated linking of governance artifacts and data assets with automated data protection rules enforcement.
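
To make the idea of automated data protection rule enforcement more concrete, here is a minimal sketch in plain Python. It does not use Watson Knowledge Catalog or any IBM API. The column classifications, rule names and helper functions are all hypothetical, and simply illustrate how a classification recorded in a catalog could drive masking before data reaches a consumer.

```python
import hashlib

# Hypothetical column classifications, as a data catalog might record them.
CLASSIFICATIONS = {"email": "PII", "ssn": "SENSITIVE", "country": "PUBLIC"}


def redact(value):
    """Replace every character of the value with an asterisk."""
    return "*" * len(str(value))


def pseudonymize(value):
    """Replace the value with a short, stable SHA-256 based token."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


# Hypothetical protection rules: classification -> masking function.
RULES = {"PII": pseudonymize, "SENSITIVE": redact}


def apply_rules(row, classifications=CLASSIFICATIONS, rules=RULES):
    """Return a copy of the row with the protection rules applied."""
    protected = {}
    for column, value in row.items():
        rule = rules.get(classifications.get(column))
        # Columns without a matching rule pass through unchanged.
        protected[column] = rule(value) if rule else value
    return protected


if __name__ == "__main__":
    record = {"email": "jane@example.com", "ssn": "123-45-6789", "country": "SG"}
    print(apply_rules(record))
```

One design choice to note: this sketch lets unclassified columns pass through unchanged. A stricter policy could instead mask anything that has no classification, which trades convenience for safety.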

IBM Cloud Pak for Data is the industry leader for the DataOps platform and offers all the capabilities necessary for managing data and AI in a modern architecture. If you are interested in DataOps, please schedule a consultation with IBM Experts Lab, or join the IBM DataOps Community to connect with experts and attend one of the upcoming DataOps Community events.
