Simplifying Document Processing with Docling for AI Applications
By Tamanna
NextGen_Outlier 1
What is Docling?
Open-source Python library by IBM for document processing
Parses diverse formats:
Documents: PDF, DOCX, PPTX, XLSX, HTML
Images: PNG, JPEG, TIFF
Audio: WAV, MP3
Designed for generative AI workflows (e.g., RAG, chatbots)
Key benefits:
Advanced parsing (layouts, tables, formulas)
Local execution for secure data processing
Seamless AI framework integrations
Installing Docling
Requirements: a recent Python 3 release (the steps below use 3.11; check the Docling page on PyPI for the current minimum), macOS/Linux/Windows
Steps:
i. Create virtual environment: python3.11 -m venv myenv
ii. Activate: source myenv/bin/activate (Windows: myenv\Scripts\activate)
iii. Install: pip install docling
iv. Verify: docling --version
Optional: GPU acceleration through PyTorch, which Docling's layout and table models run on
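The verify step can also be run from inside Python, which is convenient in a notebook or a CI job. This is a minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration and is not part of Docling.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str = "docling"):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("docling is not installed in this environment; run: pip install docling")
    else:
        print(f"docling {found} is installed")
```

Run it with the virtual environment activated; otherwise it reports on whichever interpreter happens to be first on the PATH.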
Key Features of Docling
Multi-Format Parsing: PDF, DOCX, images, audio
Advanced PDF Understanding: Layouts, tables, formulas
Unified Format: DoclingDocument for consistency
Export Options: Markdown, HTML, JSON, DocTags
Security: Local execution for air-gapped environments
OCR: Supports scanned PDFs and images
AI Integrations: LangChain, LlamaIndex, Crew AI, Haystack
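To make the unified format and export options concrete, the sketch below writes one converted document out as both Markdown and JSON. It assumes the `export_to_markdown()` and `export_to_dict()` methods of DoclingDocument; the helper name `export_document` and the file naming are illustrative choices, not something the library prescribes.

```python
import json
from pathlib import Path


def export_document(document, out_dir, stem="output"):
    """Write Markdown and JSON exports of a converted document; return both paths.

    `document` is expected to be a DoclingDocument (for example
    `converter.convert(...).document`), but any object that provides
    `export_to_markdown()` and `export_to_dict()` will work.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Markdown is the usual choice for feeding text to an LLM pipeline.
    md_path = out_dir / f"{stem}.md"
    md_path.write_text(document.export_to_markdown(), encoding="utf-8")

    # JSON keeps the full document structure for later processing.
    json_path = out_dir / f"{stem}.json"
    json_path.write_text(
        json.dumps(document.export_to_dict(), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return md_path, json_path
```

With Docling installed, a call would look like `export_document(DocumentConverter().convert("sample.pdf").document, "exports", stem="sample")`.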
Example - Converting PDF to Markdown
Code:
from docling.document_converter import DocumentConverter
source = "sample.pdf"
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())
Converts the PDF's text and tables to Markdown; images become placeholders unless an image export mode is set
Ideal for feeding into AI pipelines
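The single-file example extends naturally to a folder of PDFs. The sketch below assumes DocumentConverter's `convert_all()` method and the `input.file` attribute on each result; the helper name `convert_folder` is hypothetical. The converter is passed in as an argument so the function can be tried with any object that has the same shape.

```python
from pathlib import Path


def convert_folder(converter, in_dir, out_dir):
    """Convert every PDF in `in_dir` to a Markdown file in `out_dir`.

    `converter` is expected to be a docling DocumentConverter; only its
    `convert_all()` method is used. Returns the list of files written.
    """
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    pdfs = sorted(in_dir.glob("*.pdf"))

    written = []
    for result in converter.convert_all(pdfs):
        # Name each output after its source file: report.pdf -> report.md
        target = out_dir / f"{result.input.file.stem}.md"
        target.write_text(result.document.export_to_markdown(), encoding="utf-8")
        written.append(target)
    return written
```

With Docling installed, the call would be `convert_folder(DocumentConverter(), "pdfs", "markdown")`. As written, a file that fails to convert is expected to raise and stop the batch, so wrap the loop in error handling if partial results are acceptable.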
Docling Workflow
Pipeline, shown as a flowchart:
Input Document → Docling Parse → DoclingDocument → Export (Markdown, JSON) → AI Frameworks → Vector Store → LLM Query → Output (Answers)
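The later stages of this flow (export, vector store, query) can be illustrated without any external service. The sketch below is a toy: it stands in word counts for a real embedding model and a sorted list for a real vector store, and it stops at retrieval rather than calling an LLM. All function names are illustrative, and the sample Markdown is invented to stand in for Docling's export.

```python
import math
import re
from collections import Counter


def chunk_markdown(markdown: str):
    """Split exported Markdown into paragraph-sized chunks on blank lines."""
    return [part.strip() for part in re.split(r"\n\s*\n", markdown) if part.strip()]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks, k: int = 1):
    """Return the k chunks most similar to the query (the vector-store lookup)."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vec, embed(chunk)), reverse=True)
    return ranked[:k]


# Invented sample text standing in for result.document.export_to_markdown()
markdown = """# Quarterly report

Revenue grew by twelve percent compared with the previous quarter.

## Risks

Supply chain delays remain the main risk for the next quarter."""

chunks = chunk_markdown(markdown)
top = retrieve("What is the main risk?", chunks, k=1)
print(top[0])  # the supply-chain paragraph
```

In a real pipeline the retrieved chunks would be placed in the LLM prompt as context for the final answer.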
Integration with LangChain
Preprocess documents for Retrieval-Augmented Generation (RAG)
Workflow: Convert → Load → Index → Query
Code:
from docling.document_converter import DocumentConverter
from langchain_community.vectorstores import FAISS  # used in the indexing step
# Convert the PDF with Docling
converter = DocumentConverter()
result = converter.convert("report.pdf")
# Save the Markdown export so LangChain can load it
with open("output.md", "w", encoding="utf-8") as f:
    f.write(result.document.export_to_markdown())
# Load into LangChain and query
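The slide stops at the comment about loading and querying. One possible continuation is sketched below, assuming the `langchain-community` and `faiss-cpu` packages and an embeddings object chosen by the reader. `split_by_heading` and `build_index` are hypothetical helper names; `FAISS.from_texts` is the LangChain call that builds the index. The splitter is deliberately simple and would treat a `#` comment inside a code block as a heading.

```python
def split_by_heading(markdown: str):
    """Group Markdown lines into sections, starting a new one at each heading."""
    sections, current = [], []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [section for section in sections if section]


def build_index(markdown_path, embeddings):
    """Index the sections of a Markdown file in FAISS.

    The FAISS import is done lazily so the splitter above stays usable
    without LangChain installed. `embeddings` is any LangChain Embeddings
    object supplied by the caller (an assumption; the slides name none).
    """
    from langchain_community.vectorstores import FAISS

    with open(markdown_path, encoding="utf-8") as f:
        sections = split_by_heading(f.read())
    return FAISS.from_texts(sections, embeddings)
```

After `index = build_index("output.md", embeddings)`, a query such as `index.similarity_search("key findings", k=3)` returns the closest sections, which can then be passed to an LLM.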
Integration with LlamaIndex
Use DoclingReader for document loading
Build vector index for querying
Code:
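from llama_index.core import VectorStoreIndex
from llama_index.readers.docling import DoclingReader
# Parse the PDF with Docling and load it as LlamaIndex documents
reader = DoclingReader()
documents = reader.load_data("report.pdf")
# Build the index with the embedding model configured in LlamaIndex settings
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What are the key findings?")
print(response)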
Practical Use Cases
Enterprise RAG: Searchable database from PDFs
Research Assistant: Extract tables/formulas from papers
Audio Transcription: Convert meeting recordings to text
Secure Processing: Handle sensitive data locally
Docling vs. Other Tools
Feature            Docling     LangChain   LlamaIndex
Document Parsing   Advanced    Basic       Moderate
OCR Support        Extensive   Limited     Limited
AI Integrations    Multiple    Extensive   Focused
Local Execution    Yes         Yes         Yes
Conclusion
Docling simplifies document processing for AI
Key strengths: Multi-format parsing, OCR, secure execution
Integrates with LangChain, LlamaIndex, Crew AI, Haystack
Get started: pip install docling
Thank you!!
