What is a Data Catalog?
In one sentence: A data catalog is a detailed index of all the data available in an organization.
The value of a data catalog lies in the high-level information about each data source, table, and column. How is it used? What does it mean? Is there anything strange you should know about this column because of how its usage has evolved?
For example, a data catalog would help you answer the question:
- What data sources do I have that contain people?
or tell you:
- The `lname` field is used to store first and last name because the legacy system limited first name to only 2 characters and doesn’t support compound primary keys
Optionally, it might answer the questions:
- How many characters are allowed in the `first_name` field?
- How many rows are in the `Users` table?
A data catalog doesn’t necessarily need to contain this technical detail, since that information can be queried from your database directly. Querying the database directly is exactly what we’ll do in our next post, which shows a data catalog implementation.
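As a rough sketch of what "queried from your database directly" can look like, the snippet below answers the two optional questions above using Python's built-in `sqlite3` module. The in-memory database is only a stand-in for a real data platform, and the `Users` table layout, the sample rows, and the helper names `declared_length` and `row_count` are hypothetical. On a platform that exposes an information schema, the same idea would be a query against its column metadata rather than SQLite's `PRAGMA table_info`.

```python
import re
import sqlite3

# In-memory database standing in for a real data platform (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Users ("
    "id INTEGER PRIMARY KEY, first_name VARCHAR(50), lname VARCHAR(100))"
)
conn.executemany(
    "INSERT INTO Users (first_name, lname) VALUES (?, ?)",
    [("Al", "Al Smith"), ("Jo", "Jo Jones")],  # hypothetical sample rows
)


def declared_length(conn, table, column):
    """Return the declared character length of a column, or None.

    SQLite records the declared type (e.g. VARCHAR(50)) but does not
    enforce the length; other databases do.
    """
    # Table names here are trusted, hard-coded identifiers, not user input.
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    for _cid, name, col_type, *_rest in conn.execute(f"PRAGMA table_info({table})"):
        if name == column:
            match = re.search(r"\((\d+)\)", col_type)
            return int(match.group(1)) if match else None
    return None


def row_count(conn, table):
    """Return the number of rows currently in the table."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


print(declared_length(conn, "Users", "first_name"))  # 50
print(row_count(conn, "Users"))  # 2
```

Because both answers come from the database at the moment you ask, there is nothing to keep in sync by hand.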
If the right information is included, a data catalog is beneficial to anyone who interacts with the data, from stakeholders to product teams to engineers.
Implementing a Data Catalog
Spreadsheets: intuitive, but disconnected
One of the fastest ways to get started with a data catalog is with a shared spreadsheet. Online collaborative tools like Google Sheets are a natural fit for a data catalog.
The problem with a spreadsheet is that it can easily sprawl over time and get out of date quickly. The more technical information you include, the larger the task of keeping it updated, not to mention the copies and versions that proliferate.
On-platform data catalogs with Databricks
An ideal place for your data catalog to live is right next to your data, for a few reasons:
- easy to cross-reference the data itself
- enhance existing data platform UIs (Databricks “comment”/Description, for example)
- up-to-date column information, pulled from the database information schema
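One way to picture these points is a catalog that stores only the human-written descriptions and reads column names and types from the database each time it is viewed. The sketch below is a minimal illustration under stated assumptions, not a description of any particular platform's feature: it uses Python's built-in `sqlite3` as a stand-in for the data platform, and the `descriptions` dictionary, the `Users` table, and the functions `live_columns` and `catalog_view` are all hypothetical.

```python
import sqlite3

# In-memory database standing in for the data platform (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (id INTEGER, first_name TEXT, lname TEXT)")

# Human-written context, keyed by (table, column). Hypothetical content.
descriptions = {
    ("Users", "lname"): "Stores first and last name (legacy workaround).",
    ("Users", "nickname"): "Preferred display name.",  # column no longer exists
}


def live_columns(conn, table):
    """Map column name -> declared type, read from the database itself."""
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    rows = conn.execute(f"PRAGMA table_info({table})")
    return {row[1]: row[2] for row in rows}


def catalog_view(conn, table, descriptions):
    """Merge live column metadata with descriptions and report drift."""
    columns = live_columns(conn, table)
    entries = [
        {
            "column": name,
            "type": col_type,
            "description": descriptions.get((table, name)),
        }
        for name, col_type in columns.items()
    ]
    # Descriptions whose column has disappeared from the table.
    stale = sorted(
        col for (tbl, col) in descriptions if tbl == table and col not in columns
    )
    return entries, stale


entries, stale = catalog_view(conn, "Users", descriptions)
for entry in entries:
    print(entry)
print("Descriptions for missing columns:", stale)
```

Columns with a `None` description are candidates for documentation, and the `stale` list flags descriptions that have outlived their column, which is the kind of drift a disconnected spreadsheet tends to hide.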
Databricks is our data platform of choice, so in our next post, we’ll show you how we keep our data catalog close to our data. Leveraging features like Databricks Apps allows us to do this while maintaining usability for a variety of internal users.
Disambiguation: Other terms with “catalog” in the name, defined by Databricks:
- Unity Catalog: a Databricks-branded product that handles governance of data and AI on their platform.
- catalog: “Catalogs are the first layer in Unity Catalog’s three-level namespace (catalog.schema.table-etc)”