Summarize Intercom Data

To improve retrieval for the RAG chatbot, we generate summaries of Intercom chat data with an LLM. This initial implementation uses a generic summarization prompt: the summaries capture key details but are not optimized for any specific evaluation metric.

Install Libraries

This notebook has been tested on Databricks Runtime 16.2 ML and on Serverless (Environment version 2).

%load_ext autoreload
%autoreload 2
# To disable autoreload, run %autoreload 0

%%capture
%pip install databricks-langchain

Imports and Variables

%run ./00_setup

import os
import sys
# Add the project root to sys.path to make raglib importable
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))

from raglib.shell.call_models import _llm_summarize_chat

Summarize Chats in Silver Layer

Each cleaned conversation is summarized with Llama 3.3 70B.
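The body of `_llm_summarize_chat` lives in `raglib` and is not shown in this notebook. As a rough illustration only, a generic summarization call might look like the sketch below. The prompt text, the helper names `build_summary_messages` and `summarize_chat_sketch`, and the endpoint name `databricks-meta-llama-3-3-70b-instruct` are all assumptions, not the project's actual implementation.

```python
from typing import List, Tuple

# Hypothetical generic instruction; the real prompt in raglib is not shown here.
GENERIC_SUMMARY_INSTRUCTION = (
    "You summarize customer support conversations. "
    "Write a concise summary that captures the customer's issue, "
    "the key details discussed, and the resolution or next steps."
)


def build_summary_messages(conversation: str) -> List[Tuple[str, str]]:
    """Return (role, content) pairs for a generic summarization request."""
    return [
        ("system", GENERIC_SUMMARY_INSTRUCTION),
        ("human", f"Conversation:\n{conversation.strip()}\n\nSummary:"),
    ]


def summarize_chat_sketch(
    conversation: str,
    endpoint: str = "databricks-meta-llama-3-3-70b-instruct",  # assumed endpoint name
) -> str:
    """Summarize one conversation through a Databricks model serving endpoint."""
    # Imported lazily so the prompt helper stays usable without the package.
    from databricks_langchain import ChatDatabricks

    llm = ChatDatabricks(endpoint=endpoint, temperature=0.0)
    response = llm.invoke(build_summary_messages(conversation))
    return response.content
```

Keeping prompt construction separate from the model call makes it easier to swap in a tuned prompt later without touching the calling code.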

# Load chat data from Unity Catalog.
# Note: collect() returns a list of Row objects on the driver, not a DataFrame.
silver_conversations_df = spark.read.table(f"{UC_NAME}.{SCHEMA_NAME_SILVER}.{SILVER_CLEANED_CONVERSATIONS_TABLE_NAME}").collect()


summarized_df_as_list = []
# TODO: Parallelize this loop for large datasets. Do not use Spark or pandas parallel
# processing: if results are not stored immediately, it may inadvertently make excessive LLM calls.
for row in silver_conversations_df:
  summarized_df_as_list.append({
    "id": row.id,
    "cleaned_conversation": row.cleaned_conversation,
    "summary": _llm_summarize_chat(row.cleaned_conversation),
  })
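One possible way to address the TODO above is a driver-side thread pool, which keeps each LLM call explicit and submitted exactly once. This is a minimal sketch, not the project's implementation: `summarize_in_parallel`, its `on_result` callback, and the default `max_workers` value are hypothetical, and it assumes the summarization function is safe to call from multiple threads.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List, Optional


def summarize_in_parallel(
    rows: Iterable[Dict[str, str]],
    summarize_fn: Callable[[str], str],
    max_workers: int = 8,  # assumed value; tune to the endpoint's rate limits
    on_result: Optional[Callable[[Dict[str, str]], None]] = None,
) -> List[Dict[str, str]]:
    """Summarize rows concurrently, calling summarize_fn exactly once per row.

    Each finished record is passed to on_result right away, so a caller can
    persist it before the remaining calls complete.
    """
    rows = list(rows)
    results: List[Dict[str, str]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the row it was created from.
        futures = {
            pool.submit(summarize_fn, row["cleaned_conversation"]): row
            for row in rows
        }
        for future in as_completed(futures):
            row = futures[future]
            record = {
                "id": row["id"],
                "cleaned_conversation": row["cleaned_conversation"],
                "summary": future.result(),  # re-raises if summarize_fn failed
            }
            if on_result is not None:
                on_result(record)
            results.append(record)
    return results
```

In this notebook, the collected rows could be converted with `Row.asDict()` and passed in together with `_llm_summarize_chat`. Results arrive in completion order, not input order, so downstream code should rely on `id` and not on position.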


summarized_df = spark.createDataFrame(summarized_df_as_list)

# Store the summarized chats back in Unity Catalog
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {UC_NAME}.{SCHEMA_NAME_SILVER}")
(
  summarized_df.write
  .option("delta.enableChangeDataFeed", "true")
  .mode("overwrite")
  .saveAsTable(f"{UC_NAME}.{SCHEMA_NAME_SILVER}.{SILVER_SUMMARIZED_CONVERSATIONS_TABLE_NAME}")
)