%%capture
%pip install -U mlflow openai
dbutils.library.restartPython()
Data extraction from documents on Databricks
Purpose
This notebook is an example of using multimodal LLMs to extract data from documents on Databricks, returning structured outputs defined by pydantic models. It uses a recent foundation model (gpt-4.1-mini) behind a Databricks model serving endpoint, which needs only an API key to set up.
Setup
- Configure a model serving endpoint in your Databricks instance for gpt-4.1-mini using your OpenAI API key
- Create a .env file with the keys shown in .env.example, including the model serving endpoint
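A quick check that the .env file is complete can save a confusing authentication error later. The sketch below is an assumption rather than part of the original notebook: the key names are the three that the code cells read with os.environ.get, and missing_env_keys is a hypothetical helper.

```python
import os

# Key names taken from the os.environ.get calls used later in this notebook.
REQUIRED_KEYS = ("DATABRICKS_TOKEN", "DATABRICKS_BASE_URL", "DATABRICKS_MODEL_NAME")


def missing_env_keys(env=None):
    """Return the required keys that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

Calling missing_env_keys() right after load_dotenv() and raising an error if the list is non-empty would surface a missing key before any request is sent.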
Notes
All documents used are publicly available samples or were freely provided by our colleagues.
I found that none of the Databricks-included models could handle both multimodal inputs and structured outputs.
Claude 4 Sonnet would not extract personal information from the passport sample.
%%capture
%pip install python-dotenv
from dotenv import load_dotenv
# Load environment variables from a .env file
load_dotenv()
import mlflow
mlflow.openai.autolog()
from openai import OpenAI
from documents import PassportBasic, PassportFull, i797
import os
import base64
from IPython.display import Image, display
DATABRICKS_TOKEN = os.environ.get("DATABRICKS_TOKEN")
DATABRICKS_BASE_URL = os.environ.get("DATABRICKS_BASE_URL")
DATABRICKS_MODEL_NAME = os.environ.get("DATABRICKS_MODEL_NAME")
client = OpenAI(
    api_key=DATABRICKS_TOKEN,  # your personal access token
    base_url=DATABRICKS_BASE_URL,
)
# Function to encode image to Base64
def encode_image_to_base64(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')
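The request built in this notebook labels every image as image/jpeg, which matches the .jpg samples used here. For other image formats, one possible variation is to guess the MIME type from the file extension. This is a sketch under that assumption; image_to_data_url is a hypothetical helper, not something the notebook defines.

```python
import base64
import mimetypes


def image_to_data_url(image_path):
    """Build a base64 data URL whose MIME type is guessed from the file extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {image_path}")
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string could be passed as the "url" value of an image_url content part. Whether a given model accepts formats other than JPEG is not something this notebook tests.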
def extract_and_show_document(document, pydantic_model=None):
    """Send an image to the model, display it, and print the model's reply.

    If pydantic_model is given, the reply is parsed into that model;
    otherwise the free-text reply is printed.
    """
    base64_document = encode_image_to_base64(document)
    completion_args = {
        "messages": [
            {
                "role": "user",
                "content": [
                    { "type": "text", "text": "What's in this image?" },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_document}",
                        },
                    },
                ],
            }
        ],
        "model": DATABRICKS_MODEL_NAME,
    }
    # Request structured output only when a schema is supplied
    if pydantic_model is not None:
        completion_args["response_format"] = pydantic_model
    chat_completion = client.beta.chat.completions.parse(**completion_args)
    display(Image(document))
    response = chat_completion.choices[0].message
    if pydantic_model is not None:
        print(response.parsed.model_dump_json(indent=4))
    else:
        print(response.content)
    return response
response = extract_and_show_document("samples/passport-iran.jpg", PassportBasic)
{
"passport_type": "P",
"name": "FATEMEH",
"number": "S00002812"
}
Trace(trace_id=tr-200998bdb997128540addb630b475405)
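The documents module that provides PassportBasic is not shown in this notebook. Judging only from the three fields in the output above, a model of roughly this shape would produce it. The class name PassportBasicSketch and the field descriptions are assumptions for illustration; the real definition may differ.

```python
from pydantic import BaseModel, Field


class PassportBasicSketch(BaseModel):
    """Hypothetical stand-in for documents.PassportBasic, inferred from the output."""

    passport_type: str = Field(description="Document type code printed on the passport")
    name: str = Field(description="Holder's name as printed")
    number: str = Field(description="Passport number")


# Round-trip the JSON printed above through the sketch model.
parsed = PassportBasicSketch.model_validate_json(
    '{"passport_type": "P", "name": "FATEMEH", "number": "S00002812"}'
)
```

Field descriptions are included in the JSON schema that pydantic generates, so they may help guide the model, though that effect is not measured here.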
response = extract_and_show_document("samples/passport-iran.jpg", PassportFull)
{
"passport_type": "P",
"name": "FATEMEH",
"number": "S00002812",
"country_code": "IRN",
"nationality": "IRANI",
"date_of_birth": "11/02/1979",
"sex": "F",
"place_of_birth": "TEHRAN",
"date_of_issue": "09/10/2014",
"date_of_expiration": "09/10/2019",
"issuing_authority": "ISLAMIC REPUBLIC OF IRAN"
}
Trace(trace_id=tr-4310486b3f416dc0cc366e5193325604)
response = extract_and_show_document("samples/i-797c-sample.jpg", i797)
{
"receipt_number": "WAC-1",
"case_type": "I130 PETITION FOR ALIEN RELATIVE",
"receipt_date": "2010-11-01",
"priority_date": "2010-10-28",
"notice_date": "2015-01-29",
"page": "1 of 1",
"petitioner": "John Doe",
"beneficiary": "Jane Doe",
"notice_type": "Approval Notice",
"section": "Brother or sister of US Citizen, 203(a)(4) INA"
}
Trace(trace_id=tr-649f64874663ac3411d9ad40408c02f5)
response = extract_and_show_document("samples/diploma-bachelors-sample.jpg")
This image shows a degree certificate from the University of Pune (formerly University of Poona). The certificate certifies that Piraee Mehdi Farhad, whose mother's name is Vida, has been examined and found duly qualified for the degree of Bachelor of Science in Computer Science. It mentions that the degree was placed in the First Class in April 2006. The certificate is dated 3rd April 2007 and is signed by the Vice-Chancellor. The certificate also bears the seal of the university.
Trace(trace_id=tr-1e29adb770cfcf57b8da67a70bff3b5a)
from pydantic import BaseModel
from datetime import date
class Diploma(BaseModel):
    name: str
    degree: str
    field: str
    institution: str
    date_issued: date
response = extract_and_show_document("samples/diploma-bachelors-sample.jpg", Diploma)
{
"name": "Piraee Mehdi Farhad",
"degree": "Bachelor of Science",
"field": "Computer Science",
"institution": "University of Pune",
"date_issued": "3411-04-03"
}
Trace(trace_id=tr-d35107f4ea8ba2719b8fa57792a3aeae)
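The date_issued value above, 3411-04-03, is a syntactically valid date, so it passes the plain date type even though the free-text description of the same image dated the certificate 3rd April 2007. A plausibility check on the model could flag such values. The following is a minimal sketch, assuming pydantic v2; CheckedDiploma and its validator are hypothetical additions, not part of the original notebook.

```python
from datetime import date

from pydantic import BaseModel, ValidationError, field_validator


class CheckedDiploma(BaseModel):
    """Same fields as Diploma, plus a check that the issue date is not in the future."""

    name: str
    degree: str
    field: str
    institution: str
    date_issued: date

    @field_validator("date_issued")
    @classmethod
    def date_not_in_future(cls, value: date) -> date:
        # Reject dates later than today; a diploma cannot be issued in the future.
        if value > date.today():
            raise ValueError(f"date_issued {value} is in the future")
        return value
```

One way to use it is to re-validate an extracted result, for example CheckedDiploma.model_validate(response.parsed.model_dump()), and catch ValidationError to retry or flag the document for review. How the parse call itself surfaces a validator failure if the checked model is passed directly as the response format is not verified here.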