Draft

Document data extraction on Databricks

ai
llm
openai
databricks
playbook
Author

Cameron Avelis

Published

June 23, 2025

Data extraction from documents on Databricks

Purpose

This notebook shows how to use multimodal LLMs to extract data from documents on Databricks and return structured outputs defined by Pydantic models. It uses an externally hosted foundation model (gpt-4.1-mini), served through a Databricks model serving endpoint that is configured with an OpenAI API key.

Setup

  • Configure a model serving endpoint in your Databricks workspace for gpt-4.1-mini, using your OpenAI API key.
  • Create a .env file with the keys shown in .env.example, including the model serving endpoint.
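
Before running the rest of the notebook, it can help to confirm that the settings are actually present. The sketch below assumes the three variable names that the notebook reads from the environment further down; missing_settings is a hypothetical helper written for illustration, not part of any library.

```python
import os

# Variable names the notebook reads from the environment (loaded from .env).
REQUIRED_KEYS = ("DATABRICKS_TOKEN", "DATABRICKS_BASE_URL", "DATABRICKS_MODEL_NAME")


def missing_settings(env=None, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `env`.

    `env` defaults to the process environment; passing a dict makes the
    function easy to test without touching real credentials.
    """
    env = os.environ if env is None else env
    return [key for key in required if not env.get(key)]
```

Calling missing_settings() after load_dotenv() returns an empty list when everything is configured, and otherwise names the keys that still need a value.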

Notes

All documents used are publicly available samples or were freely provided by our colleagues.

At the time of writing, I found that none of the models included with Databricks could handle both multimodal inputs and structured outputs, which is why this notebook uses an external OpenAI model.

Claude Sonnet 4 declined to extract personal information from the passport sample.

%%capture
%pip install -U mlflow openai
dbutils.library.restartPython()
%%capture
%pip install python-dotenv 
from dotenv import load_dotenv


# Load environment variables from a .env file
load_dotenv()
import mlflow
mlflow.openai.autolog()
from openai import OpenAI
from documents import PassportBasic, PassportFull, i797
import os
import base64
from IPython.display import Image, display


DATABRICKS_TOKEN = os.environ.get("DATABRICKS_TOKEN")
DATABRICKS_BASE_URL = os.environ.get("DATABRICKS_BASE_URL")
DATABRICKS_MODEL_NAME = os.environ.get("DATABRICKS_MODEL_NAME")

client = OpenAI(
    api_key=DATABRICKS_TOKEN,  # your Databricks personal access token
    base_url=DATABRICKS_BASE_URL,
)

# Function to encode image to Base64
def encode_image_to_base64(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

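def extract_and_show_document(document, pydantic_model=None):
    """Send the image at `document` to the model, display it, and print the reply.

    If `pydantic_model` is given, the reply is parsed into that model;
    otherwise the free-text reply is printed.
    """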
    base64_document = encode_image_to_base64(document)
    completion_args = {
        "messages": [
            {
                "role": "user",
                "content": [
                    { "type": "text", "text": "What's in this image?" },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_document}",
                        },
                    },
                ],
            }
        ],
        "model": DATABRICKS_MODEL_NAME,
    }
    
    if pydantic_model is not None:
        completion_args["response_format"] = pydantic_model    
    chat_completion = client.beta.chat.completions.parse(**completion_args)

    display(Image(document))
    response = chat_completion.choices[0].message
    if pydantic_model is not None:
        print(response.parsed.model_dump_json(indent=4))
    else:
        print(response.content)
    return response
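
The function above hard-codes image/jpeg in the data URL, which matches the .jpg samples used here. If other image formats are passed in, one option is to derive the MIME type from the file name. This is a minimal sketch using only the standard library; build_data_url is a hypothetical helper and is not part of the notebook.

```python
import base64
import mimetypes


def build_data_url(image_path, image_bytes=None):
    """Build a base64 data URL whose MIME type is guessed from the file extension.

    `image_bytes` is optional; when omitted, the file at `image_path` is read.
    """
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognised image type: {image_path}")
    if image_bytes is None:
        with open(image_path, "rb") as image_file:
            image_bytes = image_file.read()
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string could be used as the "url" value in the image_url message part. Guessing from the extension is a simplification: it trusts the file name rather than inspecting the file contents.
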
response = extract_and_show_document("samples/passport-iran.jpg", PassportBasic)

{
    "passport_type": "P",
    "name": "FATEMEH",
    "number": "S00002812"
}
Trace(trace_id=tr-200998bdb997128540addb630b475405)
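
PassportBasic is imported from a local documents module that is not shown in this notebook. A model consistent with the output above might look like the following. The field names come from the printed JSON; the types and descriptions are assumptions, so treat this as a hypothetical reconstruction rather than the actual definition.

```python
from pydantic import BaseModel, Field


class PassportBasic(BaseModel):
    """Hypothetical reconstruction; the real class lives in the documents module."""

    # Field names mirror the JSON printed above; types are assumed to be strings.
    passport_type: str = Field(description="Document type code, e.g. 'P'")
    name: str = Field(description="Name as printed on the passport")
    number: str = Field(description="Passport number")


# Round-trip the output shown above through the model to check the shape.
sample = PassportBasic.model_validate_json(
    '{"passport_type": "P", "name": "FATEMEH", "number": "S00002812"}'
)
print(sample.number)
```

Because all three fields are required, a response that omits any of them fails validation instead of silently producing a partial record.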
response = extract_and_show_document("samples/passport-iran.jpg", PassportFull)

{
    "passport_type": "P",
    "name": "FATEMEH",
    "number": "S00002812",
    "country_code": "IRN",
    "nationality": "IRANI",
    "date_of_birth": "11/02/1979",
    "sex": "F",
    "place_of_birth": "TEHRAN",
    "date_of_issue": "09/10/2014",
    "date_of_expiration": "09/10/2019",
    "issuing_authority": "ISLAMIC REPUBLIC OF IRAN"
}
Trace(trace_id=tr-4310486b3f416dc0cc366e5193325604)
response = extract_and_show_document("samples/i-797c-sample.jpg", i797)

{
    "receipt_number": "WAC-1",
    "case_type": "I130 PETITION FOR ALIEN RELATIVE",
    "receipt_date": "2010-11-01",
    "priority_date": "2010-10-28",
    "notice_date": "2015-01-29",
    "page": "1 of 1",
    "petitioner": "John Doe",
    "beneficiary": "Jane Doe",
    "notice_type": "Approval Notice",
    "section": "Brother or sister of US Citizen, 203(a)(4) INA"
}
Trace(trace_id=tr-649f64874663ac3411d9ad40408c02f5)
response = extract_and_show_document("samples/diploma-bachelors-sample.jpg")

This image shows a degree certificate from the University of Pune (formerly University of Poona). The certificate certifies that Piraee Mehdi Farhad, whose mother's name is Vida, has been examined and found duly qualified for the degree of Bachelor of Science in Computer Science. It mentions that the degree was placed in the First Class in April 2006. The certificate is dated 3rd April 2007 and is signed by the Vice-Chancellor. The certificate also bears the seal of the university.
Trace(trace_id=tr-1e29adb770cfcf57b8da67a70bff3b5a)
from pydantic import BaseModel
from datetime import date

class Diploma(BaseModel):
    name: str
    degree: str
    field: str
    institution: str
    date_issued: date

response = extract_and_show_document("samples/diploma-bachelors-sample.jpg", Diploma)

{
    "name": "Piraee Mehdi Farhad",
    "degree": "Bachelor of Science",
    "field": "Computer Science",
    "institution": "University of Pune",
    "date_issued": "3411-04-03"
}
Trace(trace_id=tr-d35107f4ea8ba2719b8fa57792a3aeae)
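
The date_issued value above (3411-04-03) is a well-formed date, so it passes the type check on the Diploma model, but it disagrees with the 3rd April 2007 date given in the free-text description of the same image. Structured output constrains the shape of the response, not its accuracy. One way to catch this kind of error is a range check after parsing. The sketch below is only one possible check; is_plausible_issue_date and the 1900 lower bound are illustrative assumptions.

```python
from datetime import date


def is_plausible_issue_date(value, earliest=date(1900, 1, 1), latest=None):
    """Return True if `value` lies between `earliest` and `latest` inclusive.

    `latest` defaults to today, since a document cannot have been issued
    in the future.
    """
    latest = date.today() if latest is None else latest
    return earliest <= value <= latest


print(is_plausible_issue_date(date(3411, 4, 3)))  # False: the year is in the future
print(is_plausible_issue_date(date(2007, 4, 3)))  # True
```

In the notebook, this could be applied to response.parsed.date_issued, flagging the document for manual review when the check fails.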