Docling: Extracting Structured Data from PDFs and Documents

Part 1 on integrating a chatbot with your personal document collection

Feb 03, 2026

∙ Paid

Many of my newsletter subscribers ask how I keep up with new research in time series and data science.

Every week, new papers, blog posts, and tutorials come out. Keeping track of them, organizing them, and deciding what is worth reading feels almost impossible. It’s a constant stream of information—99% noise and 1% real value that is easy to miss.

LLMs and chatbots can help with this problem. They are good at reading and summarizing information. At the same time, there is a lot of hype, and it’s not always clear what actually works or how to use these tools well.

That is why I decided to publish my own document intelligence app that I use to store, catalog, and archive useful information. We will go through it as a capstone project, made of 3 parts:

Docling: A library for extracting text, images, and tables from documents into structured and searchable data.
Langchain + Chroma: Making documents searchable with Vector Stores and Retrieval
Interactive Application: Wrap everything into a simple Streamlit app.

Don’t worry if some of these terms are unfamiliar. We will go through everything step by step. Let’s get started!

✅ Find the notebook for this article here: Language Models (Project 2)

Preliminaries

Modern document systems turn messy files into useful answers - but it’s not straightforward.

The idea is this: documents are split into chunks, embedded into vectors, and stored. When you ask a question, the system retrieves the most relevant chunks and passes them to an LLM that produces the final answer. You may have heard of this as RAG, which stands for Retrieval-Augmented Generation.

With RAG, you avoid hallucinations by getting answers based on your local knowledge base rather than on what an LLM has learned. We’ll describe this in more detail in Part 3, but this is an overview of what it looks like:

**Figure 1:** *Basic Retrieval Augmented Generation (RAG) analysis workflow (Image by author)*

Building basic document retrieval is relatively easy and has been so since GPT-4 came out in 2023. The challenging part is extracting text correctly, since real-world documents contain handwritten notes, tables, images, captions, and broken layouts that are hard to parse and extract correctly.

This is where Docling comes in, as it is built to handle complex documents and produce clean, structured data.

What is Docling? Why Another Document Parser?

Docling is currently the most complete and structure-aware document parsing library available.

AI Horizon Forecast

Docling: Extracting Structured Data from PDFs and Documents

Part 1 on integrating a chatbot with your personal document collection

Preliminaries

What is Docling? Why Another Document Parser?

This post is for paid subscribers