Organizations are becoming more data-intelligent: they try to base as many decisions as possible, at as many levels as possible, on insights drawn from the data gathered by their different systems. They also hold large volumes of historical data that they want to digitize and put to use, which increases the variety of data that must be processed. As more systems are built and connected to fetch additional information, the volume of data has grown significantly, and these systems generate data at an ever-faster rate. Individuals across the organization want changes in data reported faster than ever before, to demonstrate quick responsiveness and agility.

The age of self-service BI that dawned on us no more than ten years ago is already behind us. Self-service BI still requires a very capable IT team to build the warehouses and marts that host data models meeting the requirements of a class of users chosen by top management based on importance and urgency. Those data models are designed and implemented around the set of metrics one wants to visualize as stories and dashboards. But what if end users don't have those questions ready to guide the IT team toward relevant models? What if they want to observe patterns in the data as it flows in, and only then decide what questions they want answered? Within an organization, and for the same product, the questions an engineer wants answered can differ from those relevant to the corresponding project manager, and differ again for a program manager, a director, or the CIO. We are now in an age of hundreds of analytics tools, platforms, and products that let users peek into data at various stages of the pipeline. What they don't do well enough is data governance and metadata management.

What is data governance, and how can it lead to organizational success?

An organization will thrive if its data users can find, understand, and trust their data to make better decisions and deliver better results. Data governance is a comprehensive framework for managing the availability, usability, integrity, and security of an organization's data. To address the most crucial aspect first, well-defined data governance policies ensure better security. According to a Gemalto survey, "nearly two-thirds (64%) of consumers surveyed worldwide say they are unlikely to shop or do business again with a company that had experienced a breach where financial information was stolen, and almost half (49%) had the same opinion when it came to data breaches where personal information was stolen." The damage to consumer perception can outweigh any other loss an organization might bear. There are concerns around security breaches, storage and archival of legacy data, customer consent, and third-party liabilities, some of them industry specific. Additionally, there are regulatory requirements that must be met per industry standards such as BCBS 239, CCAR, Solvency II, HIPAA, IDMP, and GDPR. Implementing the changes these requirements demand is an opportunity rather than a hindrance: it yields a higher return on investment because the data ends up stored in a more intelligent format. A data governance solution addresses these concerns, and coupled with effective data governance practices it lets an organization develop greater confidence in its data, which is a prerequisite to making data-driven business decisions.

For each governance activity, tools execute policies such as the following (a minimal sketch of one such policy, a data quality check, follows the list):

  1. Extract, transform, and load
  2. Data quality maintenance
  3. Master Data Management (MDM)
  4. Life-cycle management
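
As an illustration of item 2, here is a minimal sketch of a data quality policy in Python. The record fields (customer_id, ssn) and the SSN format rule are hypothetical, chosen only to show the shape of such a check, not taken from any particular tool:

```python
import re

# Hypothetical data-quality policy: every record must carry a non-empty
# customer id and a well-formed SSN before it is allowed downstream.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def passes_quality_policy(record: dict) -> bool:
    """Return True if the record satisfies the illustrative policy."""
    if not record.get("customer_id"):
        return False
    return bool(SSN_PATTERN.match(record.get("ssn", "")))

records = [
    {"customer_id": "C001", "ssn": "123-45-6789"},   # passes
    {"customer_id": "", "ssn": "987-65-4321"},       # fails: missing id
]
clean = [r for r in records if passes_quality_policy(r)]
print(len(clean))  # 1
```

A real data quality tool would express such rules declaratively, but the principle is the same: records that violate policy are quarantined before they reach consumers.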

These tools also cover security monitoring and metadata repositories. In its Q2 2017 data governance report, Forrester notes that these tools still have immense room for improvement, both in UIs that could support more user roles and stakeholders and in integration across varied platforms. A data value chain consists of the following entities: producer, publisher, consumer, and decision maker. Governance puts constraints on the publisher so that a dataset is delivered in a format acceptable to all consumers, and then defines the responsibilities of both parties to establish a protocol for each such dataset. Attributes such as the data's availability, what information it represents, whether it is derived or cleansed, its request latency, and its associated vocabulary decide these protocols. Zoomed out, these responsibilities and protocols translate into the following action items (a sketch of such a dataset protocol follows the list):

  • Manage data policies and rules
  • Discover and document data sources
  • Manage compliance with evolving regulations
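
A minimal sketch of such a dataset protocol, expressed as a Python data structure. All field names here (name, derived, max_latency_seconds, vocabulary) are hypothetical stand-ins for whatever a real governance contract between publisher and consumers would specify:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DatasetProtocol:
    """Hypothetical contract a publisher agrees to for one dataset."""
    name: str
    description: str            # what information the data represents
    derived: bool               # derived/cleansed rather than raw?
    max_latency_seconds: int    # acceptable request latency
    vocabulary: List[str] = field(default_factory=list)  # associated terms

    def validate(self) -> None:
        # Governance-side sanity checks before the dataset is published.
        if not self.name:
            raise ValueError("dataset must be named for discovery")
        if self.max_latency_seconds <= 0:
            raise ValueError("latency budget must be positive")

# Example: a cleansed customer-orders dataset with a one-minute latency budget.
orders = DatasetProtocol(
    name="customer_orders_clean",
    description="De-duplicated customer orders from the OLTP system",
    derived=True,
    max_latency_seconds=60,
    vocabulary=["customer", "order", "SKU"],
)
orders.validate()
```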

Data Governance through Metadata Management

Per Gartner, by 2020, 50% of information governance initiatives will be enacted with policies based on metadata alone. Metadata provides the information needed to make sense of data (e.g., datasets and images), concepts (e.g., classification methods), and real-world entities (e.g., people, products, and places). There are three types of metadata: descriptive metadata, which describes a source for discovery and identification; structural metadata, which describes data models and reference data; and administrative metadata, which provides information that helps manage and monitor that source. A variety of specifications and frameworks define how metadata should be managed for optimal data governance.
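
The three-way split can be made concrete with a small sketch in Python. The class and field names below are illustrative only and do not follow any particular metadata standard:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class DescriptiveMetadata:
    title: str            # supports discovery and identification
    keywords: List[str]

@dataclass
class StructuralMetadata:
    schema: Dict[str, str]     # data model: column name -> type
    reference_data: List[str]  # e.g. code lists the columns draw from

@dataclass
class AdministrativeMetadata:
    owner: str            # who manages/monitors the source
    retention_days: int
    last_audited: str

@dataclass
class SourceMetadata:
    descriptive: DescriptiveMetadata
    structural: StructuralMetadata
    administrative: AdministrativeMetadata

customers = SourceMetadata(
    DescriptiveMetadata("Customer master", ["customer", "CRM"]),
    StructuralMetadata({"id": "string", "ssn": "string"},
                       ["ISO 3166 country codes"]),
    AdministrativeMetadata("data-platform-team", 2555, "2017-09-30"),
)
```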

Metadata Management on Hadoop

Hadoop, because of its distributed computing prowess, is now the centerpiece of many an organization's data strategy. It allows data users to perform descriptive, predictive, and prescriptive analytics in real time. Many organizations have implemented, or are implementing, the quintessential data lake, which can ingest, store, transform, and help analyze data, and which establishes circular connections between data sources, publishers, and consumers. With many such data streams flowing into and within the system, taking care of the attributes called out above, such as security, availability, and integrity, becomes urgent. Hortonworks Data Platform (HDP) has a combination of tools which, configured together, can provide the complete data governance picture through a metadata-based approach. Apache Atlas, part of HDP, is an application that allows exchange of metadata with other tools and processes, whether inside or outside the Hadoop stack. It thus provides platform-independent governance controls that effectively address compliance requirements, and it works in tandem with other HDP tools such as Ranger, Falcon, and Kafka to complete the data governance package. The services include the abilities to capture data lineage, perform agile data modeling through a type system that allows custom metadata structures in a hierarchical taxonomy, expose a REST API for flexible HTTP access to the various services on HDP, and import/export metadata from current tools and to downstream systems.
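
For instance, captured lineage can be read back over that REST API. The sketch below is an assumption-laden illustration: the server address atlas.example.com:21000, the basic-auth credentials, and the table default.customer_orders@cluster1 are placeholders, not from the article; the endpoint paths follow the Atlas v2 REST API:

```python
import requests
from requests.auth import HTTPBasicAuth

# Placeholder server and credentials; adjust for a real cluster.
ATLAS = "http://atlas.example.com:21000/api/atlas/v2"
AUTH = HTTPBasicAuth("admin", "admin")

# Look up a Hive table entity by its unique qualified name.
resp = requests.get(
    f"{ATLAS}/entity/uniqueAttribute/type/hive_table",
    params={"attr:qualifiedName": "default.customer_orders@cluster1"},
    auth=AUTH,
)
resp.raise_for_status()
guid = resp.json()["entity"]["guid"]

# Fetch the upstream/downstream lineage graph for that entity.
lineage = requests.get(f"{ATLAS}/lineage/{guid}", auth=AUTH).json()
for g, e in lineage.get("guidEntityMap", {}).items():
    print(g, e["typeName"], e["attributes"].get("qualifiedName"))
```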

Metadata Management of the Entire Data Infrastructure

GGK has designed and implemented Hadoop clusters for several customers and integrated them with existing business intelligence infrastructure. We also help maintain and evolve those clusters to meet changing data flow requirements. An implementation generally starts with a few data sources and specific ingestion, transformation, and presentation requirements. The scope grows as the organization onboards newer data sources, more users, and more consuming applications, any of which could sit outside the Hadoop infrastructure. To ensure consistency, integrity, and availability of data to authenticated and authorized users as data moves between systems, GGK leverages the suite of applications available out of the box on Hortonworks Data Platform. We have built a communication layer that allows disparate systems like Oracle, Cassandra, Kafka, and Hive to exchange metadata that can be governed centrally using Apache Atlas.
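
As a sketch of how such a communication layer might push metadata from a non-Hadoop system into Atlas, the snippet below assumes a custom entity type named oracle_table (with name, qualifiedName, and owner attributes) has already been registered with Atlas, and uses the Atlas v2 entity-creation endpoint; the server address, credentials, and table details are placeholders:

```python
import requests
from requests.auth import HTTPBasicAuth

ATLAS = "http://atlas.example.com:21000/api/atlas/v2"
AUTH = HTTPBasicAuth("admin", "admin")

# Assumption: the "oracle_table" entity type was registered beforehand
# (e.g. via POST {ATLAS}/types/typedefs); attribute names are illustrative.
entity = {
    "entity": {
        "typeName": "oracle_table",
        "attributes": {
            "qualifiedName": "ORCL.SALES.CUSTOMERS@dc1",
            "name": "CUSTOMERS",
            "owner": "sales_dba",
        },
    }
}

# Create (or update, keyed on qualifiedName) the entity in Atlas so the
# Oracle table is governed alongside Hive, Kafka, and other sources.
resp = requests.post(f"{ATLAS}/entity", json=entity, auth=AUTH)
resp.raise_for_status()
print(resp.json().get("guidAssignments"))
```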

The image above gives a view of metadata management in the age of data lakes, which act as both source and destination of data. Atlas is an open-source framework and repository for metadata management developed for the Hadoop stack. Its flexibility allows exchange of metadata with tools and processes within and outside that stack, such as SQL Server or Oracle. Ranger, which takes care of authorization and auditing (with authentication handled through cluster Kerberization and Apache Knox), also extends the data governance features provided by Atlas using tags. A tag is a label, for example PII, which can be placed on any field, such as SSN. It is at the granularity of these tags that the entire governance infrastructure can establish data movement and monitoring controls. Falcon provides features for data lifecycle management, compliance (lineage and audit), replication, and archival. The fourth pillar of data governance, which may or may not be part of the Hadoop infrastructure itself, is a dataflow integration and workflow suite. Put together, these provide a complete picture that helps create a self-service data marketplace within the organization, irrespective of the number of publishers and consumers. It allows the data-driven organization to take on an ever-changing environment of vocabularies, taxonomies, and coding schemes.
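
To make the tagging step concrete, here is a minimal sketch against the Atlas v2 REST API. It assumes a classification named PII has already been defined in Atlas, a server at atlas.example.com:21000 with basic-auth credentials, and a known GUID for the SSN column entity; all of these are placeholders:

```python
import requests
from requests.auth import HTTPBasicAuth

ATLAS = "http://atlas.example.com:21000/api/atlas/v2"
AUTH = HTTPBasicAuth("admin", "admin")

# Placeholder: the GUID of the column entity holding SSN values.
guid = "replace-with-ssn-column-guid"

# Attach the PII classification (tag) to that entity.
resp = requests.post(
    f"{ATLAS}/entity/guid/{guid}/classifications",
    json=[{"typeName": "PII"}],
    auth=AUTH,
)
resp.raise_for_status()
```

Once the tag is attached in Atlas, Ranger's tag-sync can propagate it, so that a single tag-based policy (for example, restricting access to anything tagged PII) applies wherever the tag appears.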