Data Lineage: Where did this data come from?

Understanding where data comes from or answering simple questions such as “who has modified this value and why?” remains a major challenge.

Read this article and learn about:

  • Why buy-side firms (and regulators) increasingly focus on data lineage
  • How data lineage concretely impacts buy-side firms
  • Why firms must combine multiple approaches to improve the status quo
  • What best practices can be leveraged to start building a culture of trust in your data
Olivier Kenji Mathurin, Head of Strategic Research, AIM Software

Data has become as complex as it is vital to a financial institution's success. Asset management firms rely on the accuracy, quality and correlation of their data for trading, accounting and risk mitigation.

A typical situation in an organization

Picture this: An auditor calls and asks, “Can you justify the prices used for last week's NAV calculation?” The analyst traces the process back, starting from the NAV report delivered last week, through the multiple systems involved – including reporting, data warehouse, portfolio management and accounting – to the system that collects asset valuations from different pricing sources and selects the correct price.

The information is all there but it takes time to investigate.

On that particular day, the Hong Kong stock market was closed because a typhoon was striking the city; a pricing exception was raised for each instrument domiciled there. The team of analysts therefore decided to export the suspect records from the system into a spreadsheet, copy and paste the valuations from the most liquid market, and then reimport the corrected values into the system – ensuring the NAV cut-off deadline was respected.

The analyst reports back to the auditor, having spent several hours investigating, calling, emailing and reviewing technical log files to understand what happened to that price on that day.

This type of request is far from unusual. All recent regulations, from Dodd-Frank to EMIR, IFRS, AIFMD, Solvency II and MiFID II, have put a major focus on data transparency. Requests from regulators and client auditors who demand “as-of” determination – how the data was arrived at, and the source and calculation methods used – have become part of the daily routine. One global asset manager I work with reported more than 14 due-diligence meetings conducted every year by a single European regulator – each one requiring deep-dive reviews of the pricing methods in use and of historical information.

Why is it so hard?

Understanding data lineage is typically hampered by three main issues:

  • Distance to access the information: Massive amounts of market data flow into a buy-side organization. The data is manipulated, extracted and reworked by different functions and within different systems. Although the information exists, it is often spread across different systems in different organizational silos. It is also captured in technical log files and databases that only IT staff can access, so it has to be extracted before business users can work with it.
  • Gaps in ownership of data, standards and enforcement processes: With dozens of applications and myriad repositories and data models, the challenge of data lineage is more than daunting. Without mature data governance to cover these gaps, access to data lineage information will not improve.
  • Usage of spreadsheets: Spreadsheets are popular due to their ease of use when executing complex business processes. However, they are also difficult to control, and typically run outside of data management processes and controls. The risk that mission-critical information is lost or altered remains a real concern to operational managers and, increasingly, to regulators.

The visible cost of a lack of data lineage is the amount of time spent on data forensics. A case study from IDC[i], Data Lineage Management: Impact and Value, shows that data stewards can spend 30-50% of their time on data forensics when responding to requests from business users.

Firms critically need to be able to answer these requests quickly, and to explain what happened to a particular portfolio or for a specific client.

One thing is clear: new regulations and revisions will continue to pressure firms for more granular transparency on the data reported – thus increasing costs for financial services firms.

  • ESMA has already announced[ii] that upcoming revisions of EMIR and MiFID will look to improve the quality of the data reported.
  • Basel III’s FRTB will require firms using internal models to hold high-quality, granular historical data and to keep track of long risk factor histories.
  • IFRS-9 will introduce comprehensive data requirements, with specific needs to source loan origination information.

[Figure: Regulatory requirements for data lineage]


The hidden costs of unknown data provenance

Beyond the direct costs of data forensics, consider the impacts on the organization – and the related costs and risks – when data provenance and data quality controls are not known:

  • Redundant data control activities: The same or duplicated controls are often performed several times in different departments, because there is no shared view of the controls previously applied on the data element received.
  • Incorrect bookings: One asset manager I know of spends several hours per month correcting bookings because the data is not fit for purpose. These corrections also involve further data forensics activities.
  • Data quality streams in analytics or reporting initiatives: Projects such as risk data aggregation, customer analytics, data warehouses, reporting or IBOR will often embed a data quality stream to ensure at the point of consumption that data is quality controlled prior to usage – even though it has probably already been controlled before.
  • Increased data costs: Different departments often decide to acquire data directly from the vendor, even when that data is probably already available to them.
  • Market data usage and compliance risk: With stricter usage agreements, data vendors demand ever more detail on data usage and distribution. Inability to relate data provenance to usage exposes the firm to difficult contract negotiations and to compliance risks that can incur massive additional data costs.
  • Accuracy of analytics and models: Difficulty investigating why models produce sub-optimal outcomes – in particular when back-testing data contained look-ahead bias.
  • Client reporting: Inability to determine the data provenance of the reported values, or delays in reporting it, can lead to client risk and reputational risk.
  • Slowdown of growth and M&A initiatives: Lack of data provenance information makes it significantly harder to integrate data sets from another entity.

What can be done?

For financial services firms, the complexity doesn’t lie in the need to collect more information; the new challenge in the next few years will revolve around managing data lineage. It becomes less about how much metadata we collect and more about how the information can be more easily accessed and integrated.

According to the EDM Council, over 80% of buy-side firms are still at the early stages of understanding data lineage[iii]. Data governance remains a challenge, but major steps have been made in recognizing it as a priority; and on the technology side, EDM platforms, data lineage tools and data governance tools are converging to cover the needs of both understanding data flows and offering investigation features.

From the conversations I’ve had over the last few months, I’ve observed a set of recurring practices that are commonly adopted to make decisive progress.

  1. Take an end-to-end business process perspective

    Documenting the end-to-end data flows remains a priority for many buy-side firms, but this does not constitute the entire solution. Modern lineage technology can scan ETL jobs and SQL queries, but the resulting lineage diagrams show wall-to-wall flows that non-technical users find unusable.

    The ideal solution should provide end-to-end transparency on the process steps that have changed a particular data record, from the source to the use of the data. It would act as an entry point for analysis: how a price has been selected and controlled, or all prices for all holdings of a portfolio. It should also make it possible to understand the changes applied to reference data records such as portfolios, holdings, instruments, index compositions, corporate actions, legal entities, etc.

  2. Easier access to data lineage information by business users

    As the data management function continues to migrate from IT into corporate functions (i.e. IT/Ops, enterprise data management and risk), it is important to give business users access to lineage information in a meaningful way, with the independence to reach it as soon as necessary.

    This can only be achieved through applications designed with this business context in mind. User interface screens must deliver relevant information from the right business perspective, such as data lineage information for a given set of portfolio holdings, or for all portfolios of a given client. It is also about exploring the information through visual charts and dashboards, and using the company’s glossary of business terms and concepts.

  3. Adopt a “Data Quality Firewall” approach

One global insurance company I work with has set up a global data management organization to service internal clients and subsidiaries globally. The global team ensures the quality of reference data before it reaches daily operations and systems.

This “data quality firewall” approach relies on a central EDM platform that enforces data policies and ensures data quality controls are executed ex-ante rather than ex-post – an important step towards mature data governance.
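To make the ex-ante idea concrete, here is a minimal Python sketch of controls that run before data is released to downstream systems. It is an illustration under stated assumptions, not a description of any particular EDM platform: the `PriceRecord` type, the two controls and the `firewall` function are hypothetical names, and the sample instruments and values are invented.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class PriceRecord:
    """Hypothetical incoming record; the field names are assumptions."""
    instrument_id: str
    price: Optional[float]
    source: str


# A control returns an error message, or None when the record passes.
Control = Callable[[PriceRecord], Optional[str]]


def price_is_present(rec: PriceRecord) -> Optional[str]:
    return None if rec.price is not None else "missing price"


def price_is_positive(rec: PriceRecord) -> Optional[str]:
    if rec.price is not None and rec.price <= 0:
        return "non-positive price"
    return None


def firewall(
    records: List[PriceRecord], controls: List[Control]
) -> Tuple[List[PriceRecord], List[Tuple[PriceRecord, List[str]]]]:
    """Run every control before release (ex-ante).

    Records that pass all controls are accepted; the others are held
    back as exceptions together with the reasons they failed.
    """
    accepted = []
    exceptions = []
    for rec in records:
        failures = [m for m in (c(rec) for c in controls) if m is not None]
        if failures:
            exceptions.append((rec, failures))
        else:
            accepted.append(rec)
    return accepted, exceptions


# Invented sample data, for illustration only.
records = [
    PriceRecord("HK0001", None, "VendorA"),
    PriceRecord("US0001", 101.5, "VendorB"),
    PriceRecord("DE0001", -3.0, "VendorA"),
]
accepted, exceptions = firewall(records, [price_is_present, price_is_positive])
```

In this sketch only the accepted records would flow on to daily operations, while the exceptions carry their failure reasons with them, so the outcome of each control stays attached to the record it was applied to.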

The central platform should systematically record all changes to the data. It should also keep track of changes in context, data policies, data quality controls and parameters linked to the portfolio – which provides important information when business users investigate past data.
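One way to picture such a record of changes is an append-only log that can answer “as-of” questions. The Python sketch below is a simplified illustration, assuming every change is stored with who made it, why and when; the `Change` and `ChangeLog` names, their fields and the sample values and dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, List, Optional


@dataclass(frozen=True)
class Change:
    """One immutable entry in the log; the field names are assumptions."""
    record_id: str
    field: str
    new_value: Any
    changed_by: str
    reason: str
    changed_at: datetime


class ChangeLog:
    """Append-only log: entries are added, never updated or deleted."""

    def __init__(self) -> None:
        self._changes: List[Change] = []

    def record(self, change: Change) -> None:
        self._changes.append(change)

    def history(self, record_id: str, field: str) -> List[Change]:
        """All changes to one field of one record, oldest first."""
        matching = [
            c for c in self._changes
            if c.record_id == record_id and c.field == field
        ]
        return sorted(matching, key=lambda c: c.changed_at)

    def as_of(self, record_id: str, field: str, when: datetime) -> Optional[Change]:
        """The latest change made at or before `when`, or None if there is none."""
        earlier = [c for c in self.history(record_id, field) if c.changed_at <= when]
        return earlier[-1] if earlier else None


# Invented sample data, for illustration only.
log = ChangeLog()
log.record(Change("HK0001", "price", 12.40, "vendor-feed",
                  "daily load", datetime(2016, 8, 1, 18, 0)))
log.record(Change("HK0001", "price", 12.55, "analyst",
                  "market closed; price taken from most liquid market",
                  datetime(2016, 8, 2, 16, 30)))

used_for_nav = log.as_of("HK0001", "price", datetime(2016, 8, 2, 17, 0))
```

With a structure like this, the question “who has modified this value and why?” becomes a lookup: `used_for_nav` holds the analyst's override together with its stated reason, and `history` returns the full sequence of changes for the same price.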

Conclusion: Towards a culture of trust in data

Regulation will continue to change, analytic requirements will evolve and business activities will advance to handle more complex products.

The only way firms can prepare for this constant change is to strategically design and plan for a consistent way to manage data flows, enforce data governance standards, and make the lineage information actionable to the rest of the organization – ensuring that the technical implementation of the data fabric in the organization is built intelligently to scale both wide and deep.

The right solution will cherry-pick technical assets and allow different lines of business to add the processes, data sets and policies run by the organization. Enabling customizable views that combine both business and technical information is critical to accessing data lineage information and using it effectively, and it is the next step towards establishing data as a trusted asset in the organization.


About the author

Olivier Kenji Mathurin leads the Research Lab of AIM Software, where he is primarily responsible for investigating industry challenges and leveraging AIM Software's community of clients and partners to identify new product opportunities.

Olivier has over 10 years of experience as a consultant, business analyst and project manager, bringing transformational initiatives to market, with the majority of his career focused in the financial services industry.

Mr Mathurin holds an MSc degree in Management of Innovation projects from the University of Technology of Compiègne (France). He is a regular guest speaker at the Vienna University of Economics & Business (Austria).



[i] “Data Lineage Management: Impact and Value”, IDC – April 2016

[ii] “First steps in the review of EMIR, the European derivatives regulation”, ESMA – May 2015

[iii] “2015 Data Management Industry Benchmark Report”, EDM Council – November 2015