Why This Post Exists
Organizations drown in undocumented data spread across systems with no shared understanding of what it means → A data catalog provides a single, searchable index that makes every dataset discoverable and understandable to stakeholders, product teams, and engineers.
The Problem
As organizations accumulate data across dozens of systems — claims databases, CRMs, legacy platforms — no one has a reliable answer to basic questions like “what tables contain customer information?” or “what does the lname field actually store?” Tribal knowledge fills the gap until the person who holds it leaves, and then entire teams are stuck reverse-engineering schemas.
Spreadsheet-based catalogs are the common first attempt, but they sprawl, fall out of date, and proliferate into conflicting copies. The catalog becomes yet another unreliable data source.
The Solution
This post explains what a data catalog is, why every data-driven organization needs one, and why hosting it directly on your data platform (in this case, Databricks) solves the staleness and disconnection problems that plague spreadsheet approaches. By keeping the catalog next to the data, you can pull column-level metadata automatically from the information schema and surface it through tools like Databricks Apps.
Example Use Cases
- Onboarding a new engineer who needs to find patient eligibility tables without asking five people.
- Scoping an analytics feature by browsing available data sources in one place.
- Understanding a cryptic column name that hides a legacy data quirk.
- Auditing which datasets contain PII across the organization.
What is a Data Catalog?
In one sentence: A data catalog is a detailed index of all the data available in an organization.
The value in a data catalog is the high-level information on each data source/table/column. How is it used? What does it mean? Is there anything strange you should know about this column because of how its usage has evolved?
For example, a data catalog would help you answer the question:
- Which data sources do I have that contain information about people?
or tell you:
- The lname field is used to store First and Last name because the legacy system limited first name to only 2 characters and doesn’t support compound primary keys
Optionally, it might answer the questions:
- How many characters are allowed in the first_name field?
- How many rows are in the Users table?
It doesn’t necessarily need to contain technical detail, since that information can be queried from your database directly. This is exactly what we’ll do in our next post showing a data catalog implementation.
If the right information is included, a data catalog benefits anyone who interacts with the data, from stakeholders to product teams to engineers.
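To make the idea concrete, here is a minimal sketch of what a catalog entry could hold and how the "people" question above might be answered. The `CatalogEntry` class, its fields, the tag names, and the sample tables are all hypothetical illustrations, not a prescribed design.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One hypothetical catalog record: human context, not technical detail."""
    source: str                      # system the data lives in, e.g. a CRM
    table: str
    column: str
    description: str                 # what the column means
    notes: str = ""                  # quirks from how usage has evolved
    tags: set[str] = field(default_factory=set)


# Illustrative entries only; the names are made up for this sketch.
CATALOG = [
    CatalogEntry("crm", "users", "lname",
                 "Stores first and last name together.",
                 notes="Legacy system limited first name to 2 characters.",
                 tags={"people", "pii"}),
    CatalogEntry("crm", "users", "first_name",
                 "First name, truncated by the legacy system.",
                 tags={"people", "pii"}),
    CatalogEntry("claims", "claim_lines", "amount",
                 "Billed amount for a claim line."),
]


def tables_tagged(catalog, tag):
    """Return the sorted, de-duplicated 'source.table' names carrying a tag."""
    return sorted({f"{e.source}.{e.table}" for e in catalog if tag in e.tags})


print(tables_tagged(CATALOG, "people"))  # ['crm.users']
```

Note that the entry carries meaning and history but no data types or row counts. Those can be looked up from the database when needed.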
Implementing a Data Catalog
Spreadsheets: intuitive, but disconnected
One of the fastest ways to get started with a data catalog is a shared spreadsheet. Online collaborative tools like Google Sheets are a natural fit.
The problem with a spreadsheet is that it sprawls over time and goes out of date quickly. The more technical information you include, the bigger the job of keeping it updated, and copies and versions proliferate along the way.
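One way to see how quickly a hand-maintained sheet drifts is to compare the columns it documents against the columns that actually exist. The sketch below assumes both sides have already been loaded as (table, column) pairs. The function name and the sample data are hypothetical.

```python
def catalog_drift(documented, actual):
    """Compare documented (table, column) pairs against the real schema.

    Returns a dict with:
      - "undocumented": columns that exist but are missing from the sheet
      - "stale": columns the sheet describes that no longer exist
    """
    documented, actual = set(documented), set(actual)
    return {
        "undocumented": sorted(actual - documented),
        "stale": sorted(documented - actual),
    }


# Made-up sample data standing in for a spreadsheet export and a schema dump.
sheet_rows = [("users", "lname"), ("users", "fax_number")]
schema_rows = [("users", "lname"), ("users", "first_name")]

print(catalog_drift(sheet_rows, schema_rows))
```

Every technical field copied into the sheet adds another thing this check would have to cover, which is why the upkeep grows with the level of detail.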
On-platform data catalogs with Databricks
An ideal place for your data catalog to live is right next to your data, for a few reasons:
- it is easy to cross-reference the data itself
- it can enhance existing data platform UIs (the Databricks “comment”/Description field, for example)
- column information stays up to date because it is pulled from the database information schema
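The last point can be sketched as follows. On Databricks with Unity Catalog, column metadata is exposed through `information_schema.columns`. The query text below assumes the `system` catalog's view and a schema named `crm`, and both names are placeholders for illustration. The merge step is plain Python, so it runs without a workspace. The `enrich` function, the `descriptions` dict, and the sample rows are hypothetical.

```python
# Query text only; on Databricks it could be passed to spark.sql(...).
# The catalog and schema names here are assumptions for illustration.
COLUMNS_QUERY = """
SELECT table_name, column_name, data_type
FROM system.information_schema.columns
WHERE table_schema = 'crm'
ORDER BY table_name, ordinal_position
"""


def enrich(technical_rows, descriptions):
    """Attach hand-written descriptions to rows from the information schema.

    technical_rows: iterable of dicts with table_name, column_name, data_type
    descriptions:   dict keyed by (table_name, column_name)
    Columns with no description get an empty string, which makes gaps
    in the catalog easy to spot.
    """
    return [
        {**row, "description": descriptions.get(
            (row["table_name"], row["column_name"]), "")}
        for row in technical_rows
    ]


# Stand-in for the query result so the sketch runs anywhere.
rows = [
    {"table_name": "users", "column_name": "lname", "data_type": "STRING"},
    {"table_name": "users", "column_name": "first_name", "data_type": "STRING"},
]
descriptions = {("users", "lname"): "Stores first and last name together."}

for entry in enrich(rows, descriptions):
    print(entry)
```

With this split, the technical columns always come from the platform, and only the descriptions need manual upkeep.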
Databricks is our data platform of choice, so in our next post, we’ll show you how we keep our data catalog close to our data. Leveraging features like Databricks Apps allows us to do this while maintaining usability for a variety of internal users.
Disambiguation: Other terms with “catalog” in the name, defined by Databricks:
- Unity Catalog: Databricks-branded product which handles governance of data and AI on their platform.
- catalog: “Catalogs are the first layer in Unity Catalog’s three-level namespace (catalog.schema.table-etc)”
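As a small illustration of the three-level namespace, the sketch below splits a fully qualified name into its parts. The helper and the example name `main.crm.users` are hypothetical, and the code assumes plain identifiers with no backtick-quoted names containing dots.

```python
from typing import NamedTuple


class QualifiedName(NamedTuple):
    catalog: str
    schema: str
    name: str   # table, view, or other third-level object


def parse_qualified_name(full_name: str) -> QualifiedName:
    """Split 'catalog.schema.name' into its three levels.

    Assumes plain identifiers (no backtick-quoted names containing dots).
    """
    parts = full_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.name, got {full_name!r}")
    return QualifiedName(*parts)


print(parse_qualified_name("main.crm.users"))
```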