Docling Loader
Overview
Docling parses PDF, DOCX, PPTX, HTML, and other formats into a rich unified representation, including document layout, tables, and more, making documents ready for generative AI workflows like RAG.
Docling Loader, presented in this notebook, seamlessly integrates Docling into LangChain, enabling you to:
- use various document types in your LLM applications with ease and speed, and
- leverage Docling's rich representation for advanced, document-native grounding.
In the sections below, we showcase Docling Loader's usage, covering document-loading specifics as well as demonstrating an end-to-end RAG pipeline.
This notebook provides a quick overview for getting started with the Docling document loader. For detailed documentation of all Docling Loader features and configurations, head to the API reference.
Integration details
Class | Package | Local | Serializable | JS support |
---|---|---|---|---|
DoclingLoader | langchain_community | ✅ | ❌ | ❌ |
Loader features
Source | Document Lazy Loading | Native Async Support |
---|---|---|
DoclingLoader | ✅ | ❌ |
Setup
Installation
To use the Docling document loader, you will need to have docling installed in addition to langchain-community:
%pip install -qU docling langchain-community
Initialization
Now we can instantiate our loader and load documents.
By default, DoclingLoader loads each input document as a LangChain Document with Markdown content (more options can be found in the "Deep Dive" section further below).
from langchain_community.document_loaders import DoclingLoader
FILE_PATH = "https://arxiv.org/pdf/2408.09869"
loader = DoclingLoader(file_path=FILE_PATH)
Load
docs = loader.load()
print(f"{docs[0].page_content[:200]=}")
docs[0].page_content[:200]='## Docling Technical Report\n\nVersion 1.0\n\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla '
print(docs[0].metadata)
{'source': 'https://arxiv.org/pdf/2408.09869'}
Lazy Load
Documents can also be loaded in a lazy fashion:
doc_iter = loader.lazy_load()
for doc in doc_iter:
    pass # you can operate on `doc` here
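For example, a minimal sketch of processing documents in batches as they are produced (the batch size and the indexing step are illustrative):

pages = []
for doc in loader.lazy_load():
    pages.append(doc)
    if len(pages) >= 10:
        # index or otherwise process the accumulated batch, then release it
        pages = []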
Deep Dive
Initialization
The general syntax of DoclingLoader initialization is as follows (also see the API reference):
loader = DoclingLoader(
    file_path=FILE_PATH,
    ### OPTIONAL PARAMS: ###
    converter=...,  # any specific Docling converter to use
    convert_kwargs=...,  # any specific kwargs for conversion execution
    export_type=...,  # export mode: Markdown (default) or doc-chunks
    md_export_kwargs=...,  # any specific Markdown export kwargs (for Markdown mode)
    chunker=...,  # any specific Docling chunker to use (for doc-chunks mode)
)
DoclingLoader can be instantiated in two different modes:
- Markdown mode: for each input doc, outputs a LangChain Document with the Markdown representation of the input doc. This is the default mode, implicitly used in the steps above.
- Doc-chunks mode: for each input doc, outputs the doc chunks produced by the chunker (by default Docling's layout-aware chunking) as LangChain Documents.
In the subsections below we explore both modes in more detail.
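As an illustration, a doc-chunks instantiation with an explicit chunker might look as follows. This is a sketch assuming docling's HybridChunker (importable from docling.chunking in recent docling versions); the variable name chunking_loader is purely illustrative:

from docling.chunking import HybridChunker  # assumes a recent docling version

chunking_loader = DoclingLoader(
    file_path=FILE_PATH,
    export_type=DoclingLoader.ExportType.DOC_CHUNKS,
    chunker=HybridChunker(),  # any Docling chunker can be passed here
)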
Document preparation using Markdown mode
Following up on the steps further above, now that the docs have been loaded, any built-in (or custom) LangChain splitter can be used to split them. For example, below we show a possible splitting with a MarkdownHeaderTextSplitter:
%pip install -qU langchain-text-splitters
from langchain_text_splitters import MarkdownHeaderTextSplitter
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header_1"), ("##", "Header_2"), ("###", "Header_3")],
)
md_splits = [split for doc in docs for split in splitter.split_text(doc.page_content)]
for d in md_splits[:2]:
print(f"{d.metadata=}, {d.page_content=}")
d.metadata={'Header_2': 'Docling Technical Report'}, d.page_content='Version 1.0 \nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar \nAI4K Group, IBM Research Ruschlikon, Switzerland'
d.metadata={'Header_2': 'Abstract'}, d.page_content='This technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'
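If some of the resulting splits are too long for the embedding model, they can optionally be chunked further, e.g. with a RecursiveCharacterTextSplitter; the chunk sizes below are illustrative:

from langchain_text_splitters import RecursiveCharacterTextSplitter

char_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
final_splits = char_splitter.split_documents(md_splits)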
Document preparation using doc-chunks mode
The doc-chunks mode directly returns the document chunks, including rich metadata such as page numbers and bounding-box information.
loader = DoclingLoader(
    file_path=FILE_PATH,
    export_type=DoclingLoader.ExportType.DOC_CHUNKS,
)
doc_splits = loader.load()
for d in doc_splits[:2]:
print(f"{d.metadata=}, {d.page_content=}")
d.metadata={'source': 'https://arxiv.org/pdf/2408.09869', 'dl_meta': {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/0', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'page_header', 'prov': [{'page_no': 1, 'bbox': {'l': 17.088111877441406, 't': 583.2296752929688, 'r': 36.339778900146484, 'b': 231.99996948242188, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 38]}]}], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}}, d.page_content='arXiv:2408.09869v3 [cs.CL] 30 Aug 2024'
d.metadata={'source': 'https://arxiv.org/pdf/2408.09869', 'dl_meta': {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/2', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 282.772216796875, 't': 512.7218017578125, 'r': 328.8624572753906, 'b': 503.340087890625, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 11]}]}], 'headings': ['Docling Technical Report'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}}, d.page_content='Version 1.0'
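The grounding information can be read directly from that metadata. For example, a minimal sketch (based on the dl_meta structure shown above) printing the page number and bounding box of each provenance item:

for d in doc_splits[:2]:
    for item in d.metadata["dl_meta"]["doc_items"]:
        for prov in item["prov"]:
            print(f"page {prov['page_no']}, bbox {prov['bbox']}")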
RAG example
In this section we put together a demo RAG pipeline and run it using the documents loaded above.
%pip install -qU langchain langchain-huggingface langchain-milvus
import json
import os
from pathlib import Path
from tempfile import mkdtemp
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import PromptTemplate
from langchain_huggingface import HuggingFaceEndpoint
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_milvus import Milvus
# https://github.com/huggingface/transformers/issues/5486:
os.environ["TOKENIZERS_PARALLELISM"] = "false"
QUESTION = "Which are the main AI models in Docling?"
PROMPT = PromptTemplate.from_template(
"Context information is below.\n---------------------\n{context}\n---------------------\nGiven the context information and not prior knowledge, answer the query.\nQuery: {input}\nAnswer:\n",
)
HF_EMBED_MODEL_ID = "BAAI/bge-small-en-v1.5"
HF_LLM_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
embedding = HuggingFaceEmbeddings(model_name=HF_EMBED_MODEL_ID)
llm = HuggingFaceEndpoint(repo_id=HF_LLM_MODEL_ID)
def run_rag(documents, embedding, llm, question, prompt):
    def clip_text(text, threshold=100):
        return f"{text[:threshold]}[...]" if len(text) > threshold else text

    milvus_uri = str(Path(mkdtemp()) / "docling.db")  # or set as needed
    vectorstore = Milvus.from_documents(
        documents,
        embedding,
        connection_args={"uri": milvus_uri},
        drop_old=True,
    )
    retriever = vectorstore.as_retriever()
    question_answer_chain = create_stuff_documents_chain(llm, prompt)
    rag_chain = create_retrieval_chain(retriever, question_answer_chain)
    resp_dict = rag_chain.invoke({"input": question})

    answer = clip_text(resp_dict["answer"], threshold=200)
    print(f"Question:\n{resp_dict['input']}\n\nAnswer:\n{json.dumps(answer)}")
    for i, doc in enumerate(resp_dict["context"]):
        print()
        print(f"Source {i+1}:")
        print(f" text: {json.dumps(clip_text(doc.page_content, threshold=200))}")
        for key in doc.metadata:
            if key != "pk":
                val = doc.metadata.get(key)
                clipped_val = clip_text(val) if isinstance(val, str) else val
                print(f" {key}: {clipped_val}")
RAG using Markdown mode
Below we run the RAG pipeline passing it the output of the Markdown mode (after splitting):
run_rag(
    documents=md_splits,
    embedding=embedding,
    llm=llm,
    question=QUESTION,
    prompt=PROMPT,
)
Question:
Which are the main AI models in Docling?
Answer:
"The main AI models in Docling are DocLayNet and TableFormer. DocLayNet is a layout analysis model that is an accurate object-detector for page elements, and TableFormer is a state-of-the-art table str[...]"
Source 1:
text: "As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis m[...]"
Header_2: 3.2 AI models
Source 2:
text: "This technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layo[...]"
Header_2: Abstract
Source 3:
text: "Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can provide a base for detailed e[...]"
Header_2: 5 Applications
Source 4:
text: "Docling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecogni[...]"
Header_2: 6 Future work and contributions
RAG using doc-chunks mode
Below we run the RAG pipeline passing it the output of the doc-chunks mode.
Notice how the sources now also contain document-level grounding (e.g., page number or bounding-box information):
run_rag(
    documents=doc_splits,
    embedding=embedding,
    llm=llm,
    question=QUESTION,
    prompt=PROMPT,
)
Question:
Which are the main AI models in Docling?
Answer:
"The main AI models in Docling are a layout analysis model, an accurate object-detector for page elements, and TableFormer, a state-of-the-art table structure recognition model. These models are develo[...]"
Source 1:
text: "As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis m[...]"
dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/34', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 3, 'bbox': {'l': 107.07593536376953, 't': 406.1695251464844, 'r': 504.1148681640625, 'b': 330.2677307128906, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 608]}]}], 'headings': ['3.2 AI models'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}
source: https://arxiv.org/pdf/2408.09869
Source 2:
text: "With Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition[...]"
dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/9', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 107.0031967163086, 't': 136.7283935546875, 'r': 504.04998779296875, 'b': 83.30133056640625, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 488]}]}], 'headings': ['1 Introduction'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}
source: https://arxiv.org/pdf/2408.09869
Source 3:
text: "Docling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecogni[...]"
dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/60', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 106.92281341552734, 't': 323.5386657714844, 'r': 504.00347900390625, 'b': 258.76641845703125, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 543]}]}], 'headings': ['6 Future work and contributions'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}
source: https://arxiv.org/pdf/2408.09869
Source 4:
text: "This technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layo[...]"
dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/6', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 142.92593383789062, 't': 364.814697265625, 'r': 468.3847351074219, 'b': 300.651123046875, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 431]}]}], 'headings': ['Abstract'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}
source: https://arxiv.org/pdf/2408.09869
API reference
For detailed documentation of all DoclingLoader features and configurations, head to the API reference.
Related
- Document loader conceptual guide
- Document loader how-to guides