Docling Loader
Overview
Docling parses PDF, DOCX, PPTX, HTML, and other formats into a rich unified representation, including document layout, tables, and more, making documents ready for generative AI workflows like RAG.
Docling Loader, presented in this notebook, seamlessly integrates Docling into LangChain, enabling you to:
- use various document types in your LLM applications with ease and speed, and
- leverage Docling's rich representation for advanced, document-native grounding.
In the sections below, we showcase Docling Loader's usage, covering document-loading specifics as well as demonstrating an end-to-end RAG pipeline.
This notebook provides a quick overview for getting started with the Docling document loader. For detailed documentation of all Docling Loader features and configurations, head to the API reference.
Integration details
Class | Package | Local | Serializable | JS support |
---|---|---|---|---|
DoclingLoader | langchain_community | ✅ | ❌ | ❌ |
Loader features
Source | Document Lazy Loading | Native Async Support |
---|---|---|
DoclingLoader | ✅ | ❌ |
Setup
Installation
To use the Docling document loader, you will need to have docling installed in addition to langchain-community:
%pip install -qU docling langchain-community
Initialization
Now we can instantiate our loader and load documents.
By default, DoclingLoader loads each input document as a LangChain Document with Markdown content (more options can be found in the "Deep Dive" section further below).
from langchain_community.document_loaders import DoclingLoader
FILE_PATH = "https://arxiv.org/pdf/2408.09869"
loader = DoclingLoader(file_path=FILE_PATH)
Load
docs = loader.load()
print(f"{docs[0].page_content[:200]=}")
docs[0].page_content[:200]='## Docling Technical Report\n\nVersion 1.0\n\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla '
print(docs[0].metadata)
{'source': 'https://arxiv.org/pdf/2408.09869'}
Lazy Load
Documents can also be loaded in a lazy fashion:
doc_iter = loader.lazy_load()
for doc in doc_iter:
    pass # you can operate on `doc` here
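For example, a minimal sketch of processing documents in batches as they are produced (the batch size and the indexing step are illustrative):

pages = []
for doc in loader.lazy_load():
    pages.append(doc)
    if len(pages) >= 10:
        # index or otherwise process the accumulated batch, then release it
        pages = []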
Deep Dive
Initialization
The general syntax of DoclingLoader initialization is as follows (also see the API reference):
loader = DoclingLoader(
    file_path=FILE_PATH,
    ### OPTIONAL PARAMS: ###
    converter=...,  # any specific Docling converter to use
    convert_kwargs=...,  # any specific kwargs for conversion execution
    export_type=...,  # export mode: Markdown (default) or doc-chunks
    md_export_kwargs=...,  # any specific Markdown export kwargs (for Markdown mode)
    chunker=...,  # any specific Docling chunker to use (for doc-chunks mode)
)
DoclingLoader can be instantiated in two different modes:
- Markdown mode: for each input doc, outputs a LangChain Document with the Markdown representation of the input doc. This is the default mode, implicitly used in the steps above.
- Doc-chunks mode: for each input doc, outputs the doc chunks produced by the chunker (by default Docling's layout-aware chunking) as LangChain Documents.
In the subsections below we explore both modes in more detail.
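As an illustration, a doc-chunks instantiation with an explicit chunker might look as follows. This is a sketch assuming docling's HybridChunker (importable from docling.chunking in recent docling versions); the variable name chunking_loader is purely illustrative:

from docling.chunking import HybridChunker  # assumes a recent docling version

chunking_loader = DoclingLoader(
    file_path=FILE_PATH,
    export_type=DoclingLoader.ExportType.DOC_CHUNKS,
    chunker=HybridChunker(),  # any Docling chunker can be passed here
)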
Document preparation using Markdown mode
Following up on the steps further above, now that the docs have been loaded, any built-in (or custom) LangChain splitter can be used to split them. For example, below we show a possible splitting with a MarkdownHeaderTextSplitter:
%pip install -qU langchain-text-splitters
from langchain_text_splitters import MarkdownHeaderTextSplitter
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header_1"), ("##", "Header_2"), ("###", "Header_3")],
)
md_splits = [split for doc in docs for split in splitter.split_text(doc.page_content)]
for d in md_splits[:2]:
print(f"{d.metadata=}, {d.page_content=}")
d.metadata={'Header_2': 'Docling Technical Report'}, d.page_content='Version 1.0 \nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar \nAI4K Group, IBM Research Ruschlikon, Switzerland'
d.metadata={'Header_2': 'Abstract'}, d.page_content='This technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'
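If some of the resulting splits are too long for the embedding model, they can optionally be chunked further, e.g. with a RecursiveCharacterTextSplitter; the chunk sizes below are illustrative:

from langchain_text_splitters import RecursiveCharacterTextSplitter

char_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
final_splits = char_splitter.split_documents(md_splits)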
Document preparation using doc-chunks mode
The doc-chunks mode directly returns the document chunks, including rich metadata such as page numbers and bounding-box information.
loader = DoclingLoader(
    file_path=FILE_PATH,
    export_type=DoclingLoader.ExportType.DOC_CHUNKS,
)
doc_splits = loader.load()
for d in doc_splits[:2]:
print(f"{d.metadata=}, {d.page_content=}")
d.metadata={'source': 'https://arxiv.org/pdf/2408.09869', 'dl_meta': {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/0', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'page_header', 'prov': [{'page_no': 1, 'bbox': {'l': 17.088111877441406, 't': 583.2296752929688, 'r': 36.339778900146484, 'b': 231.99996948242188, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 38]}]}], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}}, d.page_content='arXiv:2408.09869v3 [cs.CL] 30 Aug 2024'
d.metadata={'source': 'https://arxiv.org/pdf/2408.09869', 'dl_meta': {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/2', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 282.772216796875, 't': 512.7218017578125, 'r': 328.8624572753906, 'b': 503.340087890625, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 11]}]}], 'headings': ['Docling Technical Report'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}}, d.page_content='Version 1.0'
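The grounding information can be read directly from that metadata. For example, a minimal sketch (based on the dl_meta structure shown above) printing the page number and bounding box of each provenance item:

for d in doc_splits[:2]:
    for item in d.metadata["dl_meta"]["doc_items"]:
        for prov in item["prov"]:
            print(f"page {prov['page_no']}, bbox {prov['bbox']}")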
RAG example
In this section we put together a demo RAG pipeline and run it using the documents loaded above.
%pip install -qU langchain langchain-huggingface langchain-milvus
import json
import os
from pathlib import Path
from tempfile import mkdtemp
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import PromptTemplate
from langchain_huggingface import HuggingFaceEndpoint
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_milvus import Milvus
# https://github.com/huggingface/transformers/issues/5486:
os.environ["TOKENIZERS_PARALLELISM"] = "false"
QUESTION = "Which are the main AI models in Docling?"
PROMPT = PromptTemplate.from_template(
"Context information is below.\n---------------------\n{context}\n---------------------\nGiven the context information and not prior knowledge, answer the query.\nQuery: {input}\nAnswer:\n",
)
HF_EMBED_MODEL_ID = "BAAI/bge-small-en-v1.5"
HF_LLM_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
embedding = HuggingFaceEmbeddings(model_name=HF_EMBED_MODEL_ID)
llm = HuggingFaceEndpoint(repo_id=HF_LLM_MODEL_ID)
def run_rag(documents, embedding, llm, question, prompt):
    def clip_text(text, threshold=100):
        return f"{text[:threshold]}[...]" if len(text) > threshold else text

    milvus_uri = str(Path(mkdtemp()) / "docling.db")  # or set as needed
    vectorstore = Milvus.from_documents(
        documents,
        embedding,
        connection_args={"uri": milvus_uri},
        drop_old=True,
    )
    retriever = vectorstore.as_retriever()
    question_answer_chain = create_stuff_documents_chain(llm, prompt)
    rag_chain = create_retrieval_chain(retriever, question_answer_chain)
    resp_dict = rag_chain.invoke({"input": question})

    answer = clip_text(resp_dict["answer"], threshold=200)
    print(f"Question:\n{resp_dict['input']}\n\nAnswer:\n{json.dumps(answer)}")
    for i, doc in enumerate(resp_dict["context"]):
        print()
        print(f"Source {i+1}:")
        print(f" text: {json.dumps(clip_text(doc.page_content, threshold=200))}")
        for key in doc.metadata:
            if key != "pk":
                val = doc.metadata.get(key)
                clipped_val = clip_text(val) if isinstance(val, str) else val
                print(f" {key}: {clipped_val}")
RAG using Markdown mode
Below we run the RAG pipeline passing it the output of the Markdown mode (after splitting):
run_rag(
    documents=md_splits,
    embedding=embedding,
    llm=llm,
    question=QUESTION,
    prompt=PROMPT,
)
Question:
Which are the main AI models in Docling?
Answer:
"The main AI models in Docling are DocLayNet and TableFormer. DocLayNet is a layout analysis model that is an accurate object-detector for page elements, and TableFormer is a state-of-the-art table str[...]"
Source 1:
text: "As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis m[...]"
Header_2: 3.2 AI models
Source 2:
text: "This technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layo[...]"
Header_2: Abstract
Source 3:
text: "Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can provide a base for detailed e[...]"
Header_2: 5 Applications
Source 4:
text: "Docling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecogni[...]"
Header_2: 6 Future work and contributions
RAG using doc-chunks mode
Below we run the RAG pipeline passing it the output of the doc-chunks mode.
Notice how the sources now also contain document-level grounding (e.g., page number or bounding-box information):
run_rag(
    documents=doc_splits,
    embedding=embedding,
    llm=llm,
    question=QUESTION,
    prompt=PROMPT,
)
Question:
Which are the main AI models in Docling?
Answer:
"The main AI models in Docling are a layout analysis model, an accurate object-detector for page elements, and TableFormer, a state-of-the-art table structure recognition model. These models are develo[...]"
Source 1:
text: "As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis m[...]"
dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/34', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 3, 'bbox': {'l': 107.07593536376953, 't': 406.1695251464844, 'r': 504.1148681640625, 'b': 330.2677307128906, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 608]}]}], 'headings': ['3.2 AI models'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}
source: https://arxiv.org/pdf/2408.09869
Source 2:
text: "With Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition[...]"
dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/9', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 107.0031967163086, 't': 136.7283935546875, 'r': 504.04998779296875, 'b': 83.30133056640625, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 488]}]}], 'headings': ['1 Introduction'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}
source: https://arxiv.org/pdf/2408.09869
Source 3:
text: "Docling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecogni[...]"
dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/60', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 106.92281341552734, 't': 323.5386657714844, 'r': 504.00347900390625, 'b': 258.76641845703125, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 543]}]}], 'headings': ['6 Future work and contributions'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}
source: https://arxiv.org/pdf/2408.09869
Source 4:
text: "This technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layo[...]"
dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/6', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 142.92593383789062, 't': 364.814697265625, 'r': 468.3847351074219, 'b': 300.651123046875, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 431]}]}], 'headings': ['Abstract'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}
source: https://arxiv.org/pdf/2408.09869
API reference
For detailed documentation of all DoclingLoader features and configurations, head to the API reference.
Related
- Document loader conceptual guide
- Document loader how-to guides