I feel it is important to talk about different approaches to Data Lineage that are used by data governance vendors today. Because when you talk about metadata, you very often think about simple things — tables, columns, reports. But data lineage is more about logic. It is more about programming code in any form.
It can literally be anything that allows you to somehow move your data from one place to another, or to transform and modify it. So, what are your options for understanding that logic? One option is not to read the code at all. No, I am not crazy! There are products that build lineage information without actually touching your code. They read metadata about tables, columns, reports, etc.
They profile data in your tables too. And then they use all that information to create lineage based on similarities. Tables and columns with similar names, or columns with very similar data values, are examples of such similarities. And if you find a lot of them between two columns, you link them together in your data lineage diagram. And to make it even cooler, vendors usually call it AI (another buzzword I hate very much). There is one great thing about this approach: if you watch data only and not algorithms, you do not care about technologies, and it is no big deal if the customer uses Teradata, Oracle or MongoDB with Java on top of it.
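To make the idea concrete, here is a minimal sketch of similarity-based linking. It is not how any particular vendor does it. The function names, the equal weighting of the two scores and the 0.6 threshold are all assumptions chosen for illustration, and the column samples are made up.

```python
from difflib import SequenceMatcher
from itertools import product


def name_similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] of how alike two column names are, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_overlap(a: set, b: set) -> float:
    """Jaccard overlap of two sets of profiled values."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def guess_links(source: dict, target: dict, threshold: float = 0.6) -> list:
    """Link source and target columns whose combined score reaches the threshold.

    `source` and `target` map column names to a sample of profiled values.
    The 50/50 weighting and the threshold are arbitrary illustrative choices.
    """
    links = []
    for (s_name, s_vals), (t_name, t_vals) in product(source.items(), target.items()):
        score = (0.5 * name_similarity(s_name, t_name)
                 + 0.5 * value_overlap(set(s_vals), set(t_vals)))
        if score >= threshold:
            links.append((s_name, t_name, round(score, 2)))
    return sorted(links, key=lambda link: -link[2])


# Hypothetical profiled samples from two systems.
crm = {"customer_id": [1, 2, 3, 4], "country": ["DE", "FR", "US"]}
dwh = {"cust_id": [1, 2, 3, 4], "country_code": ["DE", "FR", "US"], "amount": [10.5, 20.0]}
links = guess_links(crm, dwh)
```

Here `customer_id` is linked to `cust_id` and `country` to `country_code`. Notice that the code never looks at how the data actually got from one table to the other, so a link is only a guess: two unrelated columns that both hold small integers would score just as well.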
But on the other hand, this approach is not very accurate, the performance impact can be significant (you work with data), and data privacy is at risk (you work with data). A lot of details are also missing, such as transformation logic, which customers request very often, and lineage is limited to the database world, ignoring the application part of your environment.
Talking to application owners, data stewards and data integration specialists should give you fair but often contradictory information about data movements in your organization.
I will get straight to the point here: trying to analyze the technical flows manually is destined to fail. Once you consider the complexity of the code, and especially the need to reverse engineer the existing code, this becomes extremely time consuming. Sooner or later, such manually managed lineage falls out of sync with the actual data transfers within the environment, and you end up with lineage that you cannot actually trust.
Do you know the story of Theseus and the Minotaur? The Minotaur lives in a labyrinth, and so does Ariadne, who is in charge of the labyrinth. Ariadne gave Theseus a ball of thread to help him navigate the labyrinth by tracing his path back. This approach is a little bit similar: the data leaves a thread behind it, just like Theseus. It looks great, but it works well only as long as a transformation engine controls every movement of data.
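A toy version of such an engine shows both the appeal and the limit. This is a sketch under my own assumptions: `TrackingEngine` and its methods are hypothetical and do not mirror any real product. The point is only that the thread is laid down while a step runs, so a step that does not run leaves no thread.

```python
class TrackingEngine:
    """Toy transformation engine that logs a lineage edge for every step it runs."""

    def __init__(self):
        self.datasets = {}
        self.lineage = []  # (source, target, step name), recorded at execution time

    def load(self, name, rows):
        self.datasets[name] = list(rows)

    def run(self, step_name, source, target, transform, only_if=lambda rows: True):
        """Apply `transform` to `source` and store the result as `target`.

        The lineage edge is written only when the step actually executes.
        """
        rows = self.datasets[source]
        if not only_if(rows):
            return False  # step skipped: no data movement, hence no lineage
        self.datasets[target] = [transform(row) for row in rows]
        self.lineage.append((source, target, step_name))
        return True


engine = TrackingEngine()
engine.load("orders_raw", [{"amount": 100}, {"amount": 250}])

engine.run("add_vat", "orders_raw", "orders_gross",
           lambda row: {**row, "gross": row["amount"] * 1.2})

# A rule that fires only for rare, very large orders never runs on this data,
# so the engine has nothing to record for it.
engine.run("flag_large", "orders_gross", "orders_flagged",
           lambda row: {**row, "flag": True},
           only_if=lambda rows: any(r["amount"] > 10_000 for r in rows))
```

After these two calls, `engine.lineage` holds a single edge, from `orders_raw` to `orders_gross`. The `flag_large` rule exists in the code, but the recorded lineage knows nothing about it, and neither does it know about anything that moves data without going through `run`.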
A good example is a controlled environment like Cloudera. But if anything happens outside its walls, lineage is broken. It is also important to realize that lineage is there only if the transformation logic is executed. But think about all the exceptions or rules that apply only once every couple of years.
Data lineage is defined as the life cycle of the data. It shows the complete data flow from origin to destination.
Data lineage is the process of understanding, documenting and visualizing data from its origin to its consumption. This life cycle includes all the transformations applied to the dataset between origin and destination. Data lineage gives the user a better understanding of what happened to the data throughout its life cycle. It also enables companies to trace errors, implement process changes and carry out system migrations while saving time and resources.
Data lineage helps the user make sure that the data comes from a reliable source, that transformations are done appropriately, and that the data is loaded correctly to the designated location. Data lineage plays an important role where key decisions rely on accurate information. Without appropriate technology and processes in place, tracking data can be virtually impossible or at the very least a costly and time-consuming endeavor.
Data lineage enables the tracking of the data stream from both endpoints to ensure the data is accurate and consistent. It allows the user to follow the data in both directions, forward and backward, between its origin and its destination. An ETL job is a function that extracts data from a defined data source and puts it into another location after applying some transformation to the collected data. Lineage also enables us to check for changes in data fields, such as columns being deleted, renamed or added.
This is called impact analysis. When dealing with complex reports, it helps identify the data source that should be used in a report. To play the role of a data steward, a person needs to know everything about the data being used in an organization. Data lineage helps that person identify the least and most used data assets in an ETL job.
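Following the data in both directions comes down to walking a graph of lineage edges. The sketch below assumes a small hand-written edge list. The column and report names are invented, and a real lineage store would of course hold far more than five edges.

```python
from collections import deque

# Hypothetical column-level lineage: each edge points from a source to its consumer.
EDGES = [
    ("crm.customers.gender", "dwh.dim_customer.gender"),
    ("shop.orders.amount", "dwh.fact_sales.sales"),
    ("dwh.dim_customer.gender", "report.sales_by_gender"),
    ("dwh.fact_sales.sales", "report.sales_by_gender"),
    ("dwh.fact_sales.sales", "report.monthly_revenue"),
]


def reachable(start, edges, forward=True):
    """Return every node reachable from `start`, following edges forward or backward."""
    neighbours = {}
    for source, target in edges:
        key, value = (source, target) if forward else (target, source)
        neighbours.setdefault(key, []).append(value)
    seen, queue = set(), deque([start])
    while queue:
        for nxt in neighbours.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def schema_changes(old_columns, new_columns):
    """Compare two column lists and report what was added or removed."""
    old, new = set(old_columns), set(new_columns)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


# Forward: what is impacted if shop.orders.amount changes?
impacted = reachable("shop.orders.amount", EDGES)
# Backward: where does report.sales_by_gender get its data from?
origins = reachable("report.sales_by_gender", EDGES, forward=False)
changes = schema_changes(["id", "amount", "gender"], ["id", "amount", "sex"])
```

A change to `shop.orders.amount` reaches the sales fact column and both reports, while walking backward from `report.sales_by_gender` returns all four upstream columns. Note that this simple comparison of column lists sees a rename as one removal plus one addition. Telling the two cases apart needs extra information that the sketch does not have.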
Data lineage provides transparency to the user who is responsible for a particular data asset. Data lineage helps a business user find the reports based on particular data fields or columns. Example: a data source includes data fields named sales and gender, and the user needs to find the reports that are based on these data fields.
Data lineage can help the business user check whether the data is accurate. When we need to troubleshoot a wrong report, lineage can help us identify which processes and jobs are involved in creating that particular report.
When some jobs have failed, data lineage can help us find the affected target tables and fields that are used in the reports.
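Both troubleshooting questions can be sketched against a job catalogue. Everything here is assumed for illustration: the `JOBS` and `REPORTS` dictionaries, the job names and the table names are hypothetical, and in practice this information would come from the ETL tool's metadata.

```python
# Hypothetical ETL job catalogue: each job reads some tables and writes others.
JOBS = {
    "load_orders": {"reads": ["shop.orders"], "writes": ["stg.orders"]},
    "load_customers": {"reads": ["crm.customers"], "writes": ["stg.customers"]},
    "build_sales": {"reads": ["stg.orders", "stg.customers"], "writes": ["dwh.fact_sales"]},
}
# Hypothetical mapping of reports to the tables they query.
REPORTS = {
    "monthly_revenue": ["dwh.fact_sales"],
    "customer_list": ["stg.customers"],
}


def jobs_behind(report):
    """Walk backward from a report and collect every job that feeds it."""
    found, visited = set(), set()
    pending = list(REPORTS[report])
    while pending:
        table = pending.pop()
        if table in visited:
            continue
        visited.add(table)
        for name, job in JOBS.items():
            if table in job["writes"]:
                found.add(name)
                pending.extend(job["reads"])
    return found


def reports_hit_by(failed_job):
    """Walk forward from a failed job and collect the tables and reports it affects."""
    tables, pending = set(), list(JOBS[failed_job]["writes"])
    while pending:
        table = pending.pop()
        if table in tables:
            continue
        tables.add(table)
        for job in JOBS.values():
            if table in job["reads"]:
                pending.extend(job["writes"])
    reports = {name for name, used in REPORTS.items() if tables & set(used)}
    return tables, reports
```

For a wrong `monthly_revenue` report, `jobs_behind("monthly_revenue")` returns all three jobs as candidates to inspect. If `load_orders` fails, `reports_hit_by("load_orders")` returns the two affected tables and the one report that depends on them, while `customer_list` is left out because nothing it reads was touched.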
One of the questions lineage answers is who is using the data and where. When we have a visual of the data lineage, it is easy to find the answers to such questions.
From the data lineage graph, we can track this and find out who is using this data. Some parameters also need to be defined at the time of data creation. The data owner has the responsibility to store the data in the appropriate location and to grant access to the data.
Data lineage is an essential component in all business metadata management.
Often overlooked, the value of data lineage can be seen in many areas. There is a growing interest in data lineage for many reasons, across all areas of the enterprise data management community, especially as business metadata becomes more necessary to non-IT professionals. There are several groups of stakeholders within any company that might be interested in data lineage. Formerly, only the Information Technology (IT) department understood the concept of data lineage and its value.
As the explosion of data has affected every business area, business stakeholders have embraced the need for data lineage. Stakeholders in finance and risk have become the biggest data lineage enthusiasts. End-to-end data flows illustrate where the data originated, where it is stored and used, and how it is transformed as it moves inside and between diverse processes and systems. Therefore, these terms often are used interchangeably.
Data lineage is a description of the path along which data flows from the point of its origin to the point of its use. Still, the definitions say nothing about documenting data lineage. To understand the way to document this movement, it is important to know the components that constitute data lineage.
Data lineage components
The same guides clarify the components of data lineage.
TOGAF 9. Rather, it refers to the concept of a data lifecycle. Many specialists consider data lineage as the ultimate remedy to meet these requirements. All conclusions about the necessity of data lineage are based on careful investigation of legislation requirements and consequent matching of these requirements to the data management methods and techniques, with data lineage forming part of it. Very often, a company deals with different types of business changes, such as changes in information needs and requirements, changes in application landscape, organizational changes etc.
As an example, consider a change in a database of a business application. Usually, data is transformed and processed through a chain of applications, as noted in the figure. For convenience, the chain consists of just a few applications, but in reality, especially in large companies, such chains consist of dozens of applications.
In this case, data lineage will ease the impact analysis of the change. For example, if changes touch information and reporting requirements (the end point of the chain in Figure 1), professionals will need to use root-cause analysis to assess which data is required to produce this new information, where the data should come from and how it should be transformed.
In such a case, a root-cause analysis will be much easier to do if the data lineage is already recorded. Usually, knowledge about data processing is kept in the minds of professionals or, in the best-case scenario, on local computers in the form of Word or Excel documents.
In his presentation he discussed the importance of data lineage and how it has become an essential tool for enterprises to gain the most value from their data for everyone in the enterprise, at all levels. It used to be really hard to get CEOs to care about data.
Rowlands promised big rewards for those who can confidently deliver accurate results. Data moves. It changes. It gets misunderstood.
Rowlands presented seven critical areas to examine, showing how those areas might be classified according to four different phases of development: chaotic, initial, progressive and dynamic. And you have to be able to do that in a repeatable, reliable, defensible, and increasingly accelerated manner. The scary thing about regulation to me is not that there is a demand for credibility, but there is a demand for on-demand demonstration of credibility.
Terminology can vary from department to department, so putting data terminology and historical changes to terminology in one central place is important. Knowing what a term means and agreeing on what the term means are two different steps, he said. Rowlands advocates creation of Reference Data on spreadsheets, in a managed data store, or ideally, a managed collaborative process.
The objective here is to be able to look at the end-to-end flow, he said. There are now many tools available to ensure Data Quality. Rowlands advocates documentation to understand the roles of all involved. Having a workflow that automates flagging, investigation and resolution of issues is useful.
You cannot understand your data unless you can pull off this little trick. How did the information get onto that report?
In the first three Data Lineage articles, we discussed why we need data lineage, what data lineage actually is, and what the key legislative requirements for data lineage are.
In this article, I would like to discuss and give my answer to the most complicated question: how should data lineage be documented? Before you even start thinking about documenting data lineage, there are a few crucial decisions to be made. Horizontal data lineage represents the path along which data flows from its point of origin to the point of its usage. Horizontal data lineage can be documented on different data model levels, such as conceptual, logical and physical.
Figure 1.
Usually, all companies start their journey with descriptive data lineage. What does descriptive data lineage mean? It means that you describe data lineage manually using one application or another. There are some well-known data governance applications, such as Axon by Informatica or Collibra. Regardless of the tooling you choose, descriptive data lineage has several common features.
Automated data lineage means that you automate the recording of metadata at the physical level of data processing using one of the applications available on the market.
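To give a feel for what recording metadata at the physical level involves, here is a deliberately naive sketch that pulls table-level lineage out of a single SQL statement with regular expressions. It is an illustration only and not how commercial scanners work. The function name and the example statement are my own assumptions, and real tools rely on full SQL parsers for each dialect.

```python
import re

TARGET = re.compile(r"\binsert\s+into\s+([\w.]+)", re.IGNORECASE)
SOURCES = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def table_lineage(sql: str) -> list:
    """Return (source_table, target_table) pairs for one INSERT ... SELECT statement.

    Deliberately naive: it does not handle subqueries, CTEs, quoted
    identifiers, comments or dialect-specific syntax.
    """
    target = TARGET.search(sql)
    if target is None:
        return []
    return [(source, target.group(1)) for source in SOURCES.findall(sql)]


# Hypothetical statement taken from an ETL script.
statement = """
INSERT INTO dwh.fact_sales (order_id, amount, country)
SELECT o.id, o.amount, c.country
FROM stg.orders o
JOIN stg.customers c ON c.id = o.customer_id
"""
edges = table_lineage(statement)
```

For this statement the result is two edges, from `stg.orders` and from `stg.customers` to `dwh.fact_sales`. Even this small example hints at why automation is hard: column-level lineage, nested queries and every additional technology need far more than two regular expressions.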
You can find an extended list of providers of such a solution on metaintegration. The company provides meta integration components to major providers of the metadata lineage function. Of course, these kinds of solutions sound very attractive. But before choosing which one you want to use, keep in mind the following.
Different groups of stakeholders have different requirements for data lineage. There are at least two key stakeholder groups: IT technical professionals and business users such as financial and business controllers, business analysts and auditors.
The key expectations of business users are the ability to follow changes in data values and the ability to get historical information on data processing up to months in the past. Automated data lineage is basically data processing design documentation. Strictly speaking, data lineage has nothing to do with such requirements.
Our data landscape today and why it is a problem for an Information Architect
Many companies, especially in Financial Services, Healthcare, and others, have a hugely scattered application landscape.
From front to back office systems, over several data warehouses, organizations have many local and global single points of the truth and a vast diversity of business information reporting tools ranging from plain good-old MS Excel to the more popular BI tools like Qlik and Tableau.
Very often, none of these systems are adequately documented, and even if there is documentation, it is often outdated. Sounds familiar, right? The trend with cloud data warehouses, software-as-a-service, big data, the internet of things is certainly not going in the direction of a consolidation and centralization of multiple data sources into one single data location.
How does an IA tackle this problem today, and why this is not working
So how does a regular Information Architect tackle this challenge to create a nice, easy-to-navigate, easy-to-understand, easy-to-maintain, easy-to-document, and more importantly, easy-to-consume architectural picture of this application and data chaos? Well, probably one step at a time and one data flow at a time.
Do you start at the end with the reports? But which reports first? For financial institutions, it makes sense to start with your compliance report models.
Healthcare institutions might start with the systems that provide an adequate picture of patient history. For other industries, it will be other starting points for sure. A popular approach is to use a Critical Data Elements methodology.
So first our IA will spend numerous days, weeks, and months investigating and talking to the different SMEs of all those different systems and business processes. He will capture all of this information and write it down in another file somewhere on the network.
As a next step, our IA will pick one stream and he will design an elaborate architectural picture of different systems and applications interacting with each other, including how the data flows from these systems to the different data warehouses and how the data warehouses feed the different reporting tools and how those tools produce hundreds of reports.
Hopefully he will use supporting classical data lineage tools, as there are many on the market, to automate some of that work. Next, our IA will publish these architectural beauties and distribute them in a read-only PDF format to the different business users and analysts within different departments, and ultimately he will find out that nobody uses them. Why? Because everybody has a different background, a different vocabulary (business versus technical language), and a different need for granularity of information (management wants a high-level picture, a mortgage loan specialist is looking for a more detailed picture, and an auditor wants to see it all and be able to go into the nitty-gritty details).
Even the DBA needs to understand the context for data. And even when the architectural pictures are good enough, the consumers are faced with the traditional governance challenges.
How many of those architectural pictures are just consuming disk space?
Data lineage is critical to regulatory compliance, cost-effective data management, and arming the business with accurate and timely information on which to make decisions.
It is also challenging to implement, difficult to sustain and often suffers from a lack of management buy-in and funding. I have also seen firms using Excel, but this is not the best approach. That said, getting data lineage right comes with numerous challenges. Stephen Veasey, CEO at 3d Innovations, noted problems around managing large volumes of data, legacy environments, disparate systems, mixed data formats, data quality, data ownership and extracting useful information.
Responsibility for these challenges and solutions, he said, should lie with the chief data officer. Bucosky suggested that with components of lineage in various parts of large organisations, a centralised metadata group could provide expertise and guidance.
To achieve this, data governance is a necessary partner to lineage, providing an underlying understanding of data for both technology and business teams, and ensuring data is used appropriately. Business buy-in is best achieved by demonstrating the benefits of applying lineage to an important business project, gaining a beachhead and moving on to develop a broader and sustainable solution.