Jul 15, 2025

How to go from S3 to MongoDB with no code using Unstructured

Unstructured

LLM

Modern AI systems—whether you're building Retrieval-Augmented Generation (RAG), semantic search, document intelligence, or agentic workflows—depend on transforming unstructured content into structured, vectorized knowledge. But connecting your cloud storage to a vector database like MongoDB Atlas typically means writing custom code, standing up infrastructure, and managing orchestration.

This tutorial shows you how to skip all of that.

With Unstructured's Workflow builder, you can ingest PDFs, Word docs, HTML, and more from Amazon S3, apply powerful parsing and metadata enrichment, generate embeddings, and send the results directly into MongoDB, all from your browser. Whether you're prototyping a chatbot or deploying a production-grade search system, this is the fastest way to turn unstructured data into AI-ready vectors.

You'll walk through building a full document ingestion pipeline from S3 to MongoDB, entirely in the UI. No orchestration code required.

What You'll Learn

In this tutorial, we'll show you how to:

  • Pull documents from an S3 bucket

  • Partition text from PDFs, DOCX, HTML, and other document types

  • Apply enrichments (image captions, table summaries, etc.)

  • Generate embeddings

  • Push data into MongoDB for RAG or search applications

Step 1: Connect Your S3 Bucket in Unstructured

🔑 Retrieve AWS Security Credentials

  1. Navigate to the top bar in AWS and click your account ID in the top right

  2. Scroll down to Security Credentials

  3. Scroll to the Access keys section and click Create access key

  4. You'll receive an Access Key ID and Secret Access Key

  5. Click Download .csv file to keep a local copy of the keys for reference

🪣 Create a New S3 Bucket

This bucket will contain your input documents.

  1. In the AWS Console, go to Amazon S3 → Buckets, then click Create bucket

  2. Use a name like nicks-demo-s3-bucket

  3. Keep Block all public access checked

  4. Leave all other settings as default

  5. Click Create bucket

📄 Upload Your Files

  1. Locate your new bucket in the list and click its name

  2. Click the Upload button

  3. Click Add files and select the documents you want to upload (PDF, DOCX, HTML, JPEG, etc.)

  4. Copy the full Destination URI (e.g., s3://nicks-demo-s3-bucket)

  5. Scroll down and click Upload
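If you would rather script the upload than click through the console, the steps above can be sketched in Python. This is a minimal sketch, not part of the original walkthrough: `plan_uploads`, `upload`, and `EXAMPLE_EXTENSIONS` are hypothetical names, the extension set only mirrors the file types mentioned in this tutorial, and the upload itself assumes `boto3` is installed and AWS credentials are configured locally.

```python
from pathlib import Path

# Extensions named in this tutorial; an assumption, not the full supported list.
EXAMPLE_EXTENSIONS = {".pdf", ".docx", ".html", ".jpeg"}


def plan_uploads(paths, bucket, prefix=""):
    """Map local file paths to the s3:// URIs they would be uploaded to."""
    plan = {}
    for path in map(Path, paths):
        if path.suffix.lower() not in EXAMPLE_EXTENSIONS:
            continue  # skip file types this sketch does not know about
        plan[str(path)] = f"s3://{bucket}/{prefix}{path.name}"
    return plan


def upload(paths, bucket, prefix=""):
    """Upload the planned files. Needs boto3 and working AWS credentials."""
    import boto3  # imported lazily so plan_uploads works without boto3

    s3 = boto3.client("s3")
    for local_path, uri in plan_uploads(paths, bucket, prefix).items():
        key = uri.removeprefix(f"s3://{bucket}/")
        s3.upload_file(local_path, bucket, key)


if __name__ == "__main__":
    # Dry run: print where each file would land without touching AWS.
    for src, dest in plan_uploads(["q1.pdf", "notes.txt"], "nicks-demo-s3-bucket").items():
        print(src, "->", dest)
```

Calling `plan_uploads` first gives you a dry run of the destination URIs before any file leaves your machine.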

🔒 Set S3 Bucket Permissions

  1. Navigate to your bucket

  2. Select the Permissions tab at the top

Note: If you're using access keys tied to an account with full S3 read/write permissions, you can leave the bucket policy blank.
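If your access keys do not already carry S3 permissions, a bucket policy is one way to grant them. The sketch below builds a read-only policy document in Python. It rests on assumptions: that a source connector only needs to list the bucket and read objects, and that the principal ARN shown (a made-up account ID and user name) is replaced with your own. `read_only_bucket_policy` is a hypothetical helper, not an AWS or Unstructured API.

```python
import json


def read_only_bucket_policy(bucket, principal_arn):
    """Return a bucket policy letting one principal list and read objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowListBucket",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:ListBucket",
                # ListBucket applies to the bucket itself
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Sid": "AllowGetObject",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:GetObject",
                # GetObject applies to the objects inside the bucket
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


if __name__ == "__main__":
    policy = read_only_bucket_policy(
        "nicks-demo-s3-bucket",
        "arn:aws:iam::123456789012:user/example-user",  # placeholder principal
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the Bucket policy editor on the Permissions tab.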

Step 2: Create a New S3 Connector in Unstructured

  1. Go to platform.unstructured.io or your organization's tenant address

  2. In the left sidebar, click Connectors

  3. Click + New, ensure Source is selected, and choose Amazon S3

  4. Set a name like nicks-test-s3-connector

  5. Fill in the Bucket URI, AWS Key, and AWS Secret Key

  6. Check Recursive if you want to ingest nested folders

  7. Leave Custom URL blank

  8. Click Save and Test

  9. Upon success, you'll see a confirmation message

Step 3: Set Up MongoDB Atlas

🧾 Create an Account

Visit MongoDB Atlas and sign up.

📁 Create a New Project

  1. Go to Projects → New Project

  2. Name your project (Example: nicks-test-mongodb)

  3. On Add Members and Set Permissions, leave the defaults and click Create Project

🔐 Whitelist Unstructured's IP Addresses

We need to allow Unstructured's connectors to communicate with your MongoDB cluster.

  1. Go to the Project Home Page

  2. In the left-hand sidebar, navigate to Security → Network Access

  3. Click Add IP Address under the IP Access List section

  4. To fetch Unstructured's IPs, run the following command in your terminal:

    curl -s https://assets.p6m.u10d.net/publicitems/ip-prefixes.json | jq -r '.prefixes[].ip_prefix'
  5. Alternatively, manually add each of these IPs one at a time in CIDR notation:

    • 104.42.153.20/30

    • 104.45.176.240/30

    • 20.23.19.236/30

    • 20.88.104.236/30

  6. Paste each into the IP input box. Leave other fields default, then click Confirm

    After saving, you should see all IPs listed under the access list
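If `jq` is not available, the same extraction can be sketched in Python. The JSON shape (a `prefixes` array whose entries have an `ip_prefix` field) is inferred from the `jq` filter above; `extract_prefixes` and `fetch_prefixes` are hypothetical helper names, and the sample document simply reuses the four CIDR blocks listed in this step.

```python
import ipaddress
import json
import urllib.request


def extract_prefixes(document):
    """Return the CIDR blocks found under prefixes[].ip_prefix, validated."""
    prefixes = []
    for entry in document["prefixes"]:
        # ip_network raises ValueError on malformed CIDR notation
        network = ipaddress.ip_network(entry["ip_prefix"])
        prefixes.append(str(network))
    return prefixes


def fetch_prefixes(url):
    """Download the prefix list from the given URL and extract its CIDR blocks."""
    with urllib.request.urlopen(url) as response:
        return extract_prefixes(json.load(response))


# Sample document built from the addresses listed in this tutorial.
SAMPLE = {
    "prefixes": [
        {"ip_prefix": "104.42.153.20/30"},
        {"ip_prefix": "104.45.176.240/30"},
        {"ip_prefix": "20.23.19.236/30"},
        {"ip_prefix": "20.88.104.236/30"},
    ]
}

if __name__ == "__main__":
    for cidr in extract_prefixes(SAMPLE):
        print(cidr)
```

Validating each entry with `ipaddress` catches a mistyped CIDR block before you paste it into the Atlas access list.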

🗂 Deploy a New Cluster

  1. In the Project Home Page, go to Database → Clusters

  2. Click Build a Cluster

  3. Select the M10 tier

    Important: Unstructured connectors do not currently support M0 and Flex clusters, because MongoDB uses a weaker encryption standard on those tiers.

  4. Choose your preferred cloud provider (AWS, Google Cloud, Azure). For this demo:

    • Cloud Provider: Azure

    • Tier: M10

  5. (Optional) Set Cluster Name to Cluster0

  6. Ensure Quick setup → Automate security setup is checked and Preload sample dataset is unchecked

  7. Click Create Deployment

👤 Create a Database User

  1. A window will prompt you to create credentials

  2. Click Create Database User

  3. Record your username and password — you'll use them to configure the Unstructured connector



    You should see confirmation that a database was added

🔗 Get the MongoDB URI

  1. Under Choose a connection method, select Drivers

  2. In Connecting with MongoDB Driver, scroll to section 3: Add your connection string into your application code

  3. Copy the URI — you'll need it for Unstructured setup

  4. Click Done to finish
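The copied URI contains placeholders for your database credentials. If the password contains characters such as `@` or `/`, they need to be percent-encoded before the URI will parse. Here is a minimal sketch of filling in the credentials safely, assuming an SRV-style connection string; `build_atlas_uri` is a hypothetical helper and the host name shown is a made-up placeholder, not a real cluster.

```python
from urllib.parse import quote_plus


def build_atlas_uri(username, password, host):
    """Insert percent-encoded credentials into an SRV-style connection string."""
    user = quote_plus(username)
    secret = quote_plus(password)
    return f"mongodb+srv://{user}:{secret}@{host}/"


if __name__ == "__main__":
    # Placeholder host; use the host from the URI that Atlas gave you.
    print(build_atlas_uri("testuser", "p@ss/word", "cluster0.example.mongodb.net"))
```

Keep the resulting string out of version control, since it contains the password.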

👤 (Optional) Manually Add a New Database User

If you skipped the earlier prompt to create a user during the database setup:

  1. Go to Database Access → Add New Database User

  2. Fill in the fields:

    • Username: testuser

    • Password: testpassword

    • Role: Read and Write to any database

📚 Add Your Own Data

  1. Navigate to Database → Clusters

  2. Select your cluster

  3. At the top, click Collections → Add My Own Data

  4. Set:

    • Database name: your choice

    • Collection name: your choice

    • Preferences: Clustered Index Collection

  5. Click Create

🎉 You should now see your cluster and collection listed.

Step 4: Create a MongoDB Destination Connector in Unstructured

The official guide, Unstructured Docs - MongoDB Destination, gives a full walkthrough of this setup. The steps below cover the same ground:

  1. Log in at platform.unstructured.io or your organization's tenant address

  2. In the left sidebar, click Connectors

  3. Click + New

  4. Set a name for the connector

  5. Make sure Destination is selected and choose MongoDB

  6. Click Continue

  7. Enter the database and collection names you created for your cluster in MongoDB Atlas

  8. Add the connection string for your cluster (the URI you copied in Step 3)

  9. Click Test Connection; you should get a success message

Step 5: Create a Workflow in Unstructured

  1. From the main dashboard, click Workflows → New Workflow

  2. Select Build it for me

  3. Name your workflow, choose the previously created source and destination connectors, then click Continue

  4. Use the automatic partitioning strategy and the default embedding model and size

  5. Leave other settings default, then click Complete

Optional: Adjust the Embedder

  1. Go to the Embedder segment of your workflow

  2. Click the gear icon in the top right

  3. Choose your embedding model

Step 6: Run & Test the Workflow

▶️ Full Run

  1. Go to the Workflows page

  2. Click Run next to your workflow

  3. Use the Schedule tab to automate runs

📄 Upload a Sample Document

  1. In your workflow, go to the Source segment

  2. Upload a single document

  3. Click the Results </> icon above the segment to inspect JSON output at every stage
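When inspecting the JSON output, a quick tally of element types and a check for missing vectors can help confirm the pipeline behaved as expected. The sketch below assumes each element is a JSON object with a `type` field and, after the embedding stage, an `embeddings` list. The sample elements are invented for illustration and `summarize` is a hypothetical helper.

```python
from collections import Counter


def summarize(elements):
    """Count elements per type and how many lack a non-empty embedding."""
    type_counts = Counter(element["type"] for element in elements)
    missing = sum(1 for element in elements if not element.get("embeddings"))
    return dict(type_counts), missing


# Invented sample data shaped like the workflow's JSON output.
SAMPLE_ELEMENTS = [
    {"type": "Title", "text": "Q1 Report", "embeddings": [0.1, 0.2]},
    {"type": "NarrativeText", "text": "Sales grew.", "embeddings": [0.3, 0.4]},
    {"type": "Table", "text": "Region Sales", "embeddings": []},
]

if __name__ == "__main__":
    counts, missing = summarize(SAMPLE_ELEMENTS)
    print(counts)
    print("elements without embeddings:", missing)
```

A nonzero count of elements without embeddings is worth checking before you rely on vector search over the collection.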

Step 7: Get More from Your Workflow

🪄 Partitioning Strategy

The default auto strategy detects structure (titles, tables, images) and selectively applies VLM parsing. See the Unstructured documentation for more detail.

🖼️ Image Description Enrichment

Generates human-readable captions for diagrams, photos, and visual elements.

Useful for:

  • Instruction manuals with schematics

  • Research reports with charts

  • Scanned docs with key visual content

📊 Table Summary Enrichment

Converts tables to natural language summaries (e.g., 'North America leads in Q1 sales').

Ideal for:

  • Financial reports

  • Policy documents

  • Scanned PDFs with tables

🛠️ Additional Options

  • Table-to-HTML Enrichment

  • Named Entity Recognition (NER)

  • Chunking: by title, character, page, or similarity

  • Contextual Chunking: prepend summaries to chunks
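To make the last two options concrete, here is a toy illustration of character-based chunking followed by contextual chunking. It is a conceptual sketch only and does not reproduce Unstructured's implementation; `chunk_by_characters` and `add_context` are hypothetical names, and the summary string would normally come from a model rather than being hard-coded.

```python
def chunk_by_characters(text, max_chars):
    """Split text into consecutive pieces of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def add_context(chunks, summary):
    """Prepend the same document-level summary to every chunk."""
    return [f"{summary}\n\n{chunk}" for chunk in chunks]


if __name__ == "__main__":
    chunks = chunk_by_characters("abcdefghij", 4)
    for chunk in add_context(chunks, "Summary: the first ten letters."):
        print(repr(chunk))
```

Prepending context makes each chunk longer, so the chunk size may need to shrink to stay within the embedding model's input limit.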

Conclusion

And that's it!

You now have a fully automated pipeline from S3 documents to enriched, vectorized content in MongoDB, built entirely in Unstructured's UI. Whether you are launching a RAG system or indexing internal files, this is a fast, reliable starting point.

This no-code approach eliminates the complexity of building custom data pipelines while providing enterprise-grade capabilities for document processing and vector generation.
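Once the data is in MongoDB, a RAG or search application would typically query it with an Atlas `$vectorSearch` aggregation stage. The sketch below only builds the pipeline. It assumes you have created a vector search index on the collection (the index name here is hypothetical), that vectors are stored in a field named `embeddings`, and that chunk text lives in a field named `text`. Check those names against the documents in your own collection.

```python
def vector_search_pipeline(query_vector, index_name, path="embeddings",
                           limit=5, num_candidates=100):
    """Build an aggregation pipeline for Atlas Vector Search."""
    if num_candidates < limit:
        raise ValueError("num_candidates must be at least limit")
    return [
        {
            "$vectorSearch": {
                "index": index_name,          # name of your vector search index
                "path": path,                 # field that holds the vectors
                "queryVector": list(query_vector),
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            # Return only the text and the similarity score
            "$project": {
                "_id": 0,
                "text": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


if __name__ == "__main__":
    # The query vector must come from the same embedding model as the workflow.
    print(vector_search_pipeline([0.1, 0.2, 0.3], "example_vector_index"))
```

With a driver such as PyMongo, the returned list would be passed to the collection's `aggregate` method.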
