Harness the Power of LangChain in Microsoft Fabric for Advanced Document Summarization

Author(s):
Amir Jafari, Senior Product Manager in Azure Data.
Sheryl Zhao, Principal Applied Scientist in Azure Data.
Mark Hamilton, Senior Software Engineer in Azure Data.
Nellie Gustafsson, Principal PM Manager in Azure Data.

In our previous blog, we showcased the capability of Microsoft Fabric and SynapseML to utilize large language models (LLMs) for efficient question and answer tasks on PDF documents. In this blog post, we aim to provide a deeper understanding of the capabilities of Microsoft Fabric and SynapseML by specifically focusing on the process of document summarization and organization at scale through integration with LangChain.

LangChain is an open-source library designed to enable users to build complex, dynamic applications that leverage the capabilities of LLMs. Its primary purpose is to serve as an orchestration tool for prompts, streamlining the process of chaining prompts together interactively and effectively to achieve the desired outcomes. To enable seamless scaling of LangChain’s execution on large datasets, we have integrated the framework with SynapseML. This integration leverages the Apache Spark distributed computing framework to process millions of data points with the LangChain framework on Microsoft Fabric. Using a table of arXiv links as the document source, the process employs the LangChain Transformer to extract key information from each linked paper and streamline the organization and summarization of the documents.

Following this guide, you can easily embark on your own journey to harness the potential of Microsoft Fabric and SynapseML LLM capabilities to effectively summarize and organize your own documents. You can also access and run the LangChain Notebook.

Step 1: Import the required libraries

Before we move forward with document summarization and organization, we first need to install SynapseML on the cluster, install LangChain, and then import the essential libraries from LangChain, Spark, and SynapseML.

# Install SynapseML on your cluster
%%configure -f
{
  "name": "synapseml",
  "conf": {
      "spark.jars.packages": "com.microsoft.azure:synapseml_2.12:0.11.1-10-5e9c0c19-SNAPSHOT,org.apache.spark:spark-avro_2.12:3.3.1",
      "spark.jars.repositories": "https://mmlspark.azureedge.net/maven",
      "spark.jars.excludes": "org.scala-lang:scala-reflect,org.apache.spark:spark-tags_2.12,org.scalactic:scalactic_2.12,org.scalatest:scalatest_2.12,com.fasterxml.jackson.core:jackson-databind",
      "spark.yarn.user.classpath.first": "true",
      "spark.sql.parquet.enableVectorizedReader": "false",
      "spark.sql.legacy.replaceDatabricksSparkAvro.enabled": "true"
  }
}
# Install LangChain
%pip install openai langchain pdf2image pdfminer.six pytesseract unstructured

# Import required libraries
import os, openai, langchain, uuid
from langchain.llms import AzureOpenAI, OpenAI
from langchain.agents import load_tools, initialize_agent, AgentType
from langchain.chains import TransformChain, LLMChain, SimpleSequentialChain
from langchain.document_loaders import OnlinePDFLoader
from langchain.tools.bing_search.tool import BingSearchRun, BingSearchAPIWrapper
from langchain.prompts import PromptTemplate
import pyspark.sql.functions as f
from synapse.ml.cognitive.langchain import LangchainTransformer
from synapse.ml.core.platform import running_on_synapse, find_secret

Step 2: Provide the keys for Azure OpenAI to authenticate the applications

To authenticate Azure OpenAI applications, you need to provide the respective API keys. In particular, you need to set the deployment_name, openai_api_base, and openai_api_key variables to match those for your Azure OpenAI service. You also need a Bing Search subscription key for the web search tool. Here is an example of how you can provide the keys in Python code.

# Provide keys for Azure OpenAI services
openai_api_key = find_secret("openai-api-key")
openai_api_base = "https://synapseml-openai.openai.azure.com/"
openai_api_version = "2022-12-01"
openai_api_type = "azure"
deployment_name = "text-davinci-003"
bing_search_url = "https://api.bing.microsoft.com/v7.0/search"
bing_subscription_key = find_secret("bing-search-key")

os.environ["BING_SUBSCRIPTION_KEY"] = bing_subscription_key
os.environ["BING_SEARCH_URL"] = bing_search_url
os.environ["OPENAI_API_TYPE"] = openai_api_type
os.environ["OPENAI_API_VERSION"] = openai_api_version
os.environ["OPENAI_API_BASE"] = openai_api_base
os.environ["OPENAI_API_KEY"] = openai_api_key

Provide:

  • your Bing Subscription Key for “BING_SUBSCRIPTION_KEY”
  • your Bing Search URL for “BING_SEARCH_URL”
  • your OpenAI API type for “OPENAI_API_TYPE”
  • your OpenAI API version for “OPENAI_API_VERSION”
  • your OpenAI API base for “OPENAI_API_BASE”
  • your OpenAI API key for “OPENAI_API_KEY”

We then initialize an instance of the Azure OpenAI class to create a language model using the keys provided above.

# Initialize Azure OpenAI class
llm = AzureOpenAI(
    deployment_name=deployment_name,
    model_name=deployment_name,
    temperature=0.1,
    verbose=True,
)

Step 3: Basic Usage of LangChain Transformer

We will begin by illustrating the fundamental usage of a simple chain that defines a data processing pipeline, using the LangChain library and the OpenAI API to generate definitions for input words.

This is achieved by creating a prompt template, setting up an LLMChain, and configuring a LangChain transformer to execute the processing chain on the data while interfacing with the OpenAI API. Note that chains allow you to combine multiple components into a single, coherent application.

# Create prompt template
copy_prompt = PromptTemplate(
    input_variables=["technology"],
    template="Define the following word: {technology}",
)

# Set up an LLMChain
chain = LLMChain(llm=llm, prompt=copy_prompt)

# Configure LangChain transformer
transformer = (
    LangchainTransformer()
    .setInputCol("technology")
    .setOutputCol("definition")
    .setChain(chain)
    .setSubscriptionKey(openai_api_key)
    .setUrl(openai_api_base)
)

We create a DataFrame with labeled technology names and use the previously defined pipeline to transform it and generate word definitions. The DataFrame contains three rows, each with two columns, label and technology, corresponding to the following data:

  • Row 1: label=0, technology="docker"
  • Row 2: label=1, technology="spark"
  • Row 3: label=2, technology="python"

# Construct a test DataFrame
df = spark.createDataFrame(
    [(0, "docker"), (1, "spark"), (2, "python")], ["label", "technology"]
)
display(transformer.transform(df))

We first save the LangChain transformer to a temporary directory, then load the saved transformer and apply it to the DataFrame to generate word definitions. Note that LangChain serialization only works for chains that don’t have memory.

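# Save the LangChain transformer to a temporary directory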
temp_dir = "tmp"
if not os.path.exists(temp_dir):
    os.mkdir(temp_dir)
path = os.path.join(temp_dir, "langchainTransformer")
transformer.save(path)

# Load the LangChain transformer
loaded = LangchainTransformer.load(path)
display(loaded.transform(df))

Step 4: Using LangChain for large-scale literature review

We will now create a sequential chain for extracting structured information from an arXiv link as the source document. Specifically, we will employ LangChain to extract the paper’s title, author information, and a summary of its content. Subsequently, we will set up and utilize a web search tool to locate recent papers authored by the primary author.

To outline, our sequential chain comprises the following steps:

  1. Transform Chain: Extract Paper Content from an arXiv Link
  2. LLMChain: Summarize the Paper, extract paper title and authors
  3. Transform Chain: Generate the prompt
  4. Agent with Web Search Tool: Use Web Search to find recent papers by the first author

Specifically, as shown below, we define functions that extract content from the PDFs behind the arXiv links and generate prompts for extracting specific information from paper descriptions. Note that as part of the prompt, we use web search to find the three most recent papers authored by the paper’s first author.

# Extract content from PDF
def paper_content_extraction(inputs: dict) -> dict:
    arxiv_link = inputs["arxiv_link"]
    loader = OnlinePDFLoader(arxiv_link)
    pages = loader.load_and_split()
    return {"paper_content": pages[0].page_content + pages[1].page_content}

# Generate prompt
def prompt_generation(inputs: dict) -> dict:
    output = inputs["Output"]
    prompt = (
        "find the paper title, author, summary in the paper description below, output them. After that, Use websearch to find out 3 recent papers of the first author in the author section below (first author is the first name separated by comma) and list the paper titles in bullet points: <Paper Description Start>\n"
        + output
        + "<Paper Description End>."
    )
    return {"prompt": prompt}

These processing chains are designed to work together in sequence. The first chain extracts content from the document using the provided arXiv link, the second chain summarizes the extracted content, and the third chain generates a prompt from the summarized content using the prompt_generation function defined above. This series of transformations can be used to automate the process of extracting key information from documents, including academic papers.

# Chain to extract content
paper_content_extraction_chain = TransformChain(
    input_variables=["arxiv_link"],
    output_variables=["paper_content"],
    transform=paper_content_extraction,
    verbose=False,
)

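# Chain to summarize the paper and extract the title and authors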
paper_summarizer_template = """You are a paper summarizer, given the paper content, it is your job to summarize the paper into a short summary, and extract authors and paper title from the paper content.
Here is the paper content:
{paper_content}
Output:
paper title, authors and summary.
"""
prompt = PromptTemplate(
    input_variables=["paper_content"], template=paper_summarizer_template
)
summarize_chain = LLMChain(llm=llm, prompt=prompt, verbose=False)

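# Chain to generate the web-search prompt from the summary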
prompt_generation_chain = TransformChain(
    input_variables=["Output"],
    output_variables=["prompt"],
    transform=prompt_generation,
    verbose=False,
)

We can now extend the data processing pipeline by incorporating web search functionality using the BingSearchAPIWrapper and a language model agent.

The sequential_chain represents the entire data processing pipeline, where data flows through the chains in a sequential manner. It starts with extracting content from the document through the arXiv link, summarizing the content, generating a prompt, performing a web search using the Bing API, and reacting to the search results using a language model agent.

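# Set up the Bing web search tool and a zero-shot agent that can call it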
bing = BingSearchAPIWrapper(k=3)
tools = [BingSearchRun(api_wrapper=bing)]
web_search_agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=False
)

# Create a sequential chain
sequential_chain = SimpleSequentialChain(
    chains=[
        paper_content_extraction_chain,
        summarize_chain,
        prompt_generation_chain,
        web_search_agent,
    ]
)

Step 5: Apply the LangChain transformer to perform the workload at scale

We can now use the defined pipeline to process a DataFrame containing links to different documents, e.g., arXiv links of academic papers. We create a DataFrame named paper_df containing four rows, each with two columns: label and arxiv_link, where each row corresponds to a different arXiv paper link. We then use the LangchainTransformer data processing pipeline defined in Step 4 to extract information such as document title, authors, summary, and recent documents by the first author.

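# Create a DataFrame of arXiv paper links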
paper_df = spark.createDataFrame(
    [
        (0, "https://arxiv.org/pdf/2107.13586.pdf"),
        (1, "https://arxiv.org/pdf/2101.00190.pdf"),
        (2, "https://arxiv.org/pdf/2103.10385.pdf"),
        (3, "https://arxiv.org/pdf/2110.07602.pdf"),
    ],
    ["label", "arxiv_link"],
)

# Construct LangChain transformer using the paper summarizer chain defined above
paper_info_extractor = (
    LangchainTransformer()
    .setInputCol("arxiv_link")
    .setOutputCol("paper_info")
    .setChain(sequential_chain)
    .setSubscriptionKey(openai_api_key)
    .setUrl(openai_api_base)
)

# Extract paper information from arXiv links, the paper information needs to include:
# paper title, paper authors, brief paper summary, and recent papers published by the first author
display(paper_info_extractor.transform(paper_df))

And now we have built a document summarization framework using Microsoft Fabric and SynapseML.

Note that while this blog used arXiv links as its guide, you can follow the steps outlined above to configure and curate other types of documents of your choice, e.g., PubMed articles, where you could distill extensive medical studies into concise, informative summaries.
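As a rough sketch of such an adaptation, reusing the Step 4 pipeline for a different corpus can be as simple as swapping the links in the source DataFrame. The URLs below are hypothetical placeholders, and the input column keeps the name arxiv_link because that is the variable the sequential chain expects:

# A minimal sketch of pointing the same pipeline at a different document
# source. The URLs are placeholders - substitute links that resolve to
# the PDFs you want to summarize.
doc_df = spark.createDataFrame(
    [
        (0, "https://example.org/medical-study-1.pdf"),
        (1, "https://example.org/medical-study-2.pdf"),
    ],
    ["label", "arxiv_link"],  # keep the column name the Step 4 chain expects
)

# Reuse the sequential chain from Step 4 unchanged; OnlinePDFLoader
# downloads whatever PDF the link resolves to.
doc_info_extractor = (
    LangchainTransformer()
    .setInputCol("arxiv_link")
    .setOutputCol("paper_info")
    .setChain(sequential_chain)
    .setSubscriptionKey(openai_api_key)
    .setUrl(openai_api_base)
)
display(doc_info_extractor.transform(doc_df))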

Get started with Microsoft Fabric 

Microsoft Fabric is currently in preview. Try out everything Fabric has to offer by signing up for a free trial—no credit card information required! 

If you want to learn more about Microsoft Fabric, check out the learning resources below.

Learning Resources

To help you get started with Microsoft Fabric, there are several resources we recommend: 

  • Microsoft Fabric Learning Paths: experience a high-level tour of Microsoft Fabric and how to get started 
  • Microsoft Fabric Tutorials: get detailed tutorials with a step-by-step guide on how to create an end-to-end solution in Microsoft Fabric. These tutorials focus on a few different common patterns including a Lakehouse architecture, data warehouse architecture, real-time analytics, and data science projects. 
  • Microsoft Fabric Documentation: read Fabric docs to see detailed documentation for all aspects of Microsoft Fabric. 
