Data lineage concepts
Data lineage is the concept of tracking and observing data flowing through a data pipeline. You can use data lineage to understand data sources, troubleshoot job failures, manage PII, and ensure compliance with data regulations.
Lineage on Astronomer
In the Cloud UI, the Lineage tab renders the lineage metadata generated by your DAGs as a dynamic graph. For more information on using the Lineage tab, see Data lineage.
Astro leverages the OpenLineage open source standard to emit lineage metadata. OpenLineage standardizes the definition of data lineage, the metadata that makes up lineage data, and the approach for collecting lineage data from external systems. It also defines a formalized specification for data lineage.
Core concepts
The following terms are used frequently when discussing data lineage and OpenLineage with Astro.
- Integration: A means of gathering lineage data from a source system such as a scheduler or data platform. For example, the OpenLineage Airflow integration allows lineage data to be collected from Airflow DAGs. For a full list of integrations, see the OpenLineage documentation.
- Extractor: In the openlineage-airflow package, an extractor is a module that gathers lineage metadata from a specific hook or operator. For example, extractors exist for the PostgresOperator and SnowflakeOperator, meaning that if openlineage-airflow is installed and configured for your Airflow environment, then lineage data is generated automatically from those operators when your DAG runs. An extractor must exist for a specific operator to get lineage data from it.
- Job: A process that consumes or produces datasets. In the context of Airflow, an OpenLineage job corresponds to a task in your DAG as long as your task is an instance of an operator with an extractor. Jobs can also represent work completed in other applications that emit lineage data, such as a Spark job or a dbt model. Jobs appear as nodes on your lineage graphs in the lineage UI.
- Dataset: Any collection of data that your jobs interact with. For example, a dataset can correspond to a table in your database or a set of data that you run a Great Expectations check on. A dataset is typically registered as part of your lineage data when a job that writes to it completes, for example when data is inserted into a table.
- Run: An instance of a job where lineage data is generated. In the context of the Airflow integration, an OpenLineage run is generated with each DAG run.
- Facet: A piece of lineage metadata about a job, dataset, or run. Facets are often named for the entity they describe, for example a “job facet”.
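To make the relationship between these terms concrete, the following is a minimal sketch of how a job, a run, datasets, and facets fit together in a single lineage event. It builds a plain Python dictionary roughly shaped like an OpenLineage run event; it is not how openlineage-airflow itself constructs events. The function name, namespace, job name, dataset names, and producer URL are all hypothetical.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_name, inputs, outputs, namespace="example_namespace"):
    """Assemble a dict shaped like an OpenLineage run event (illustrative only)."""

    def dataset(name):
        # A dataset is identified by a namespace and a name, and can carry facets.
        return {"namespace": namespace, "name": name, "facets": {}}

    return {
        "eventType": "COMPLETE",  # state of the run when the event is emitted
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # Run: one instance of the job, identified by a unique ID.
        "run": {"runId": str(uuid.uuid4()), "facets": {}},
        # Job: the process that consumes and produces datasets.
        "job": {"namespace": namespace, "name": job_name, "facets": {}},
        "inputs": [dataset(n) for n in inputs],
        "outputs": [dataset(n) for n in outputs],
        # Hypothetical identifier for whatever emitted the event.
        "producer": "https://example.com/hypothetical-producer",
    }


event = build_run_event(
    job_name="example_dag.load_orders",  # hypothetical DAG and task names
    inputs=["public.raw_orders"],
    outputs=["public.orders"],
)
print(json.dumps(event, indent=2))
```

In this sketch the facets are left empty. In practice, a facet would hold metadata such as a dataset's schema or the SQL that a job ran.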
OpenLineage and Airflow
Using OpenLineage with Airflow gives you insight into your complex data ecosystems and can lead to better data governance. Airflow is a natural place to integrate data lineage because it touches and moves data across many parts of an Organization.
The following are the primary capabilities that OpenLineage with Airflow provides:
- Quickly locate the cause of task failures by identifying issues in upstream datasets. For example, you might see that a task failed because an upstream job outside of Airflow failed to populate a particular dataset.
- Easily see the affected area of any job failures or changes to data by visualizing the relationship between jobs and datasets.
- Identify where key data is used in jobs across an Organization.
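The idea of an "affected area" can be illustrated with a small graph traversal. The sketch below assumes lineage is available as a mapping from each job to the datasets it reads and writes, then walks that mapping to find every job downstream of a failure. The job names, dataset names, and the downstream_jobs helper are hypothetical, and this is not how the Lineage tab is implemented.

```python
from collections import deque

# Hypothetical lineage: each job maps to the datasets it reads and writes.
JOBS = {
    "extract_orders": {"inputs": [], "outputs": ["raw_orders"]},
    "clean_orders": {"inputs": ["raw_orders"], "outputs": ["orders"]},
    "daily_revenue": {"inputs": ["orders"], "outputs": ["revenue_report"]},
    "load_customers": {"inputs": [], "outputs": ["customers"]},
}


def downstream_jobs(failed_job, jobs):
    """Return jobs reachable from failed_job through shared datasets, in BFS order."""
    # Index each dataset by the jobs that read it.
    consumers = {}
    for name, io in jobs.items():
        for ds in io["inputs"]:
            consumers.setdefault(ds, []).append(name)

    affected, seen = [], {failed_job}
    queue = deque([failed_job])
    while queue:
        current = queue.popleft()
        # Follow each output dataset to the jobs that consume it.
        for ds in jobs[current]["outputs"]:
            for consumer in consumers.get(ds, []):
                if consumer not in seen:
                    seen.add(consumer)
                    affected.append(consumer)
                    queue.append(consumer)
    return affected


print(downstream_jobs("extract_orders", JOBS))  # ['clean_orders', 'daily_revenue']
```

Running the same traversal in the opposite direction, from a job's inputs back to the jobs that produce them, would give the upstream view used to locate the cause of a failure.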
Integrating OpenLineage with Airflow provides the following benefits:
- Allows you to quickly recover from complex failures. The faster you can identify the problem and the affected area, the sooner you can resolve it and prevent erroneous decisions based on bad data.
- Makes it easier for teams in your Organization to work together. Being able to visualize the full scope of where and how a dataset is used reduces the time you spend on analysis.
- Supports compliance with data regulations by helping you understand where your data is used.