Why You Need a Data Catalog, and Why We Host Ours Within Databricks

Part 1: What is a data catalog?
Author: Cameron Avelis
Published: May 8, 2025

What is a Data Catalog?

In one sentence: A data catalog is a detailed index of all the data available in an organization.

The value of a data catalog lies in the high-level information it holds about each data source, table, and column. How is it used? What does it mean? Is there anything strange you should know about this column because of how its usage has evolved?

For example, a data catalog would help you answer the question:

  • What data sources do I have that contain data about people?

or tell you:

  • The lname field is used to store both first and last name, because the legacy system limited first name to only 2 characters and doesn’t support compound primary keys.
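To make the two examples above concrete, here is a minimal sketch of what a catalog entry could look like as a data structure. The class names, the tag-based lookup, and the sample tables are hypothetical and only illustrate the idea; they are not the implementation described in this series.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnEntry:
    """Human-written knowledge about one column (hypothetical structure)."""
    name: str
    description: str
    caveats: str = ""


@dataclass
class TableEntry:
    """Human-written knowledge about one table or data source."""
    name: str
    description: str
    tags: set[str] = field(default_factory=set)
    columns: list[ColumnEntry] = field(default_factory=list)


def tables_tagged(catalog: list[TableEntry], tag: str) -> list[str]:
    """Return the names of all tables carrying the given tag."""
    return [t.name for t in catalog if tag in t.tags]


# Made-up sample entries, for illustration only.
catalog = [
    TableEntry(
        name="Users",
        description="One row per registered user.",
        tags={"people"},
        columns=[
            ColumnEntry(
                name="lname",
                description="Stores first and last name together.",
                caveats="The legacy system limited first name to 2 characters.",
            )
        ],
    ),
    TableEntry(name="Invoices", description="One row per invoice.", tags={"billing"}),
]

# "What data sources do I have that contain people?"
print(tables_tagged(catalog, "people"))  # ['Users']
```

The point of the sketch is that the valuable fields (description, caveats, tags) are the ones a person has to write; nothing here duplicates what the database already knows.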

Optionally, it might answer the questions:

  • How many characters are allowed in the first_name field?
  • How many rows are in the Users table?

The catalog doesn’t necessarily need to contain this kind of technical detail, since that information can be queried from your database directly. This is exactly what we’ll do in our next post, which shows a data catalog implementation.
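As a rough illustration of querying technical detail directly, the sketch below uses an in-memory SQLite database as a stand-in for a real data platform. The table definition and sample rows are made up, and PRAGMA table_info is SQLite-specific; other platforms expose comparable metadata through their own information schema.

```python
import sqlite3

# In-memory SQLite database standing in for the real data platform.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (first_name VARCHAR(2), lname VARCHAR(100))")
conn.executemany(
    "INSERT INTO Users VALUES (?, ?)",
    [("Al", "Alice Smith"), ("Bo", "Bob Jones")],  # made-up sample rows
)

# "How many characters are allowed in the first_name field?"
# Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
column_types = {
    row[1]: row[2] for row in conn.execute("PRAGMA table_info(Users)")
}

# "How many rows are in the Users table?"
row_count = conn.execute("SELECT COUNT(*) FROM Users").fetchone()[0]

print(column_types["first_name"])
print(row_count)
```

Because both answers come from the database at the moment you ask, there is nothing to keep in sync by hand.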

If the right information is included, a data catalog benefits anyone who interacts with the data, from stakeholders to product teams to engineers.

Implementing a Data Catalog

Spreadsheets: intuitive, but disconnected

One of the fastest ways to get started with a data catalog is a shared spreadsheet. Online collaborative tools like Google Sheets are a natural fit for a data catalog.

The problem with a spreadsheet is that it can easily sprawl over time and quickly gets out of date. The more technical information you include, the larger the task of keeping it updated, not to mention the copies and versions that proliferate.

On-platform data catalogs with Databricks

An ideal place for your data catalog to live is right next to your data, for a few reasons:

  • it is easy to cross-reference the data itself
  • it can enhance existing data platform UIs (the Databricks “comment”/Description field, for example)
  • column information stays up to date, because it is pulled from the database information schema
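The last point can be sketched as a merge between live column metadata and human-written notes. This is a minimal illustration under assumptions: the function name, the tuple layout, and the sample rows are hypothetical, and in practice the live rows would come from a query against the platform's information schema rather than a hard-coded list.

```python
def merge_catalog(live_columns, notes):
    """Combine live column metadata with human-written notes.

    live_columns: iterable of (table, column, data_type) tuples, shaped like
                  rows that an information schema query might return.
    notes:        dict mapping (table, column) to a description string.
    Returns (merged_rows, stale_keys).
    """
    live_keys = set()
    merged = []
    for table, column, data_type in live_columns:
        key = (table, column)
        live_keys.add(key)
        merged.append(
            {
                "table": table,
                "column": column,
                "data_type": data_type,             # always taken from the database
                "description": notes.get(key, ""),  # empty means undocumented
            }
        )
    # Notes that refer to columns which no longer exist in the database.
    stale = sorted(set(notes) - live_keys)
    return merged, stale


# Made-up sample data, for illustration only.
live = [("users", "first_name", "string"), ("users", "lname", "string")]
notes = {
    ("users", "lname"): "Stores first and last name together.",
    ("users", "nickname"): "Note for a column that has since been dropped.",
}

merged, stale = merge_catalog(live, notes)
undocumented = [row["column"] for row in merged if not row["description"]]
print(undocumented)  # ['first_name']
print(stale)         # [('users', 'nickname')]
```

Keeping the technical fields sourced from the database means the only things that can go stale are the human-written notes, and those can be flagged automatically.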

Databricks is our data platform of choice, so in our next post, we’ll show you how we keep our data catalog close to our data. Leveraging features like Databricks Apps allows us to do this while maintaining usability for a variety of internal users.

Note

Disambiguation: Other terms with “catalog” in the name, defined by Databricks:

  • Unity Catalog: a Databricks-branded product that handles governance of data and AI on the Databricks platform.
  • catalog: “Catalogs are the first layer in Unity Catalog’s three-level namespace (catalog.schema.table-etc)”
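To show where the lowercase “catalog” sits in that three-level namespace, here is a small sketch that splits a fully qualified name into its parts. The helper and the example name main.sales.users are hypothetical, and the parser deliberately ignores quoted identifiers that contain dots.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    catalog: str  # first level: the "catalog" in the Databricks sense
    schema: str   # second level
    table: str    # third level (a table, view, or similar object)


def parse_table_name(full_name: str) -> TableName:
    """Split 'catalog.schema.table' into its three levels.

    Simplification: assumes no part contains a dot (no quoted identifiers).
    """
    parts = full_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {full_name!r}")
    return TableName(*parts)


name = parse_table_name("main.sales.users")  # hypothetical table name
print(name.catalog)  # main
```

In this sense a “catalog” is only a container for schemas, which is quite different from the data catalog this post describes.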