Daft

Technology, Information and Internet

San Francisco, CA 2,264 followers

Distributed query engine providing simple and reliable data processing for any modality and scale.

About us

Daft is a high-performance data engine providing simple and reliable data processing for any modality and scale, from local to petabyte-scale distributed workloads. The core engine is written in Rust and exposes both SQL and Python DataFrame interfaces as first-class citizens. Solving the fundamental challenge of working with multimodal data at scale and powering the next generation of AI applications, we are eliminating the traditional barriers and redefining how developers interact with multimodal data. Try Daft today: pip install daft

Industry
Technology, Information and Internet
Company size
11-50 employees
Headquarters
San Francisco, CA

Updates

  • Daft

    When my fleet of flapjack-flipping humanoids finishes uploading its data to CAIOS, I need a data engine that can help me align all of my robotics data. Think 10 TB of video, 500 GB of sensor telemetry, and a whole lot of flapjacks.

    My video and sensor data come in separate payloads, so the two streams don't always align. My robot's eyeballs record video at 30 fps while super-duper-flapjack-flipping motor actuators run at 100 Hz. How do I query my video and sensor data so my video frames align against my flapjack actuator signals in order? 🦾 🥞

    You guessed it 👉️ an ASOF join: matching each frame to the most recent sensor reading at or before that timestamp.

    Every major DataFrame library that supports ASOF joins is single-node only. Spark technically has `merge_asof` but rewrites it to a correlated subquery under the hood. If you're running TB-scale pipelines, none of these work.

    Daft now has native distributed ASOF joins. Here's what it took to get there:

    - V1 (sort + two pointers): 133s on 100M rows, 3.75x slower than pandas; string comparisons dominated the sort
    - V2 (hash grouping): 36s; hash by entity key first, then sort only the integer timestamp column within each group
    - V3 (streaming probe): 21s, 4 GB peak; right table streams in parallel batches, so parallelism is now batch count, not entity cardinality
    - V4 (range partitioned + distributed): 283s → 31s scaling from 2 to 8 nodes, skew-resistant via carryovers

    Daft is the first distributed DataFrame library with a native ASOF join built in. Link in the comments.
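
To make the ASOF semantics concrete, here is a minimal single-node sketch in plain Python. It is not Daft's implementation or API: the `asof_join` helper, the millisecond timestamps, and the sample values are all assumptions chosen for illustration. It matches each 30 fps frame to the latest 100 Hz reading at or before the frame's timestamp using a binary search.

```python
from bisect import bisect_right


def asof_join(frame_timestamps, readings):
    """Match each frame to the latest reading at or before its timestamp.

    frame_timestamps: iterable of timestamps (hypothetical unit: milliseconds).
    readings: list of (timestamp, value) pairs, sorted by timestamp.
    Returns (frame_ts, reading_ts, value) tuples; the reading fields are
    None when no reading exists at or before the frame.
    """
    reading_ts = [ts for ts, _ in readings]
    matched = []
    for t in frame_timestamps:
        # Index of the rightmost reading whose timestamp is <= t.
        i = bisect_right(reading_ts, t) - 1
        if i >= 0:
            matched.append((t, readings[i][0], readings[i][1]))
        else:
            matched.append((t, None, None))
    return matched


# Illustrative data: four frames at 30 fps, eleven samples at 100 Hz.
frame_ts = [round(k * 1000 / 30) for k in range(4)]   # 0, 33, 67, 100 ms
sensor = [(k * 10, f"s{k}") for k in range(11)]       # 0, 10, ..., 100 ms
matched = asof_join(frame_ts, sensor)
```

Each frame picks up the reading just behind it (the 33 ms frame pairs with the 30 ms sample), which is the "most recent at or before" rule from the post. A distributed version has to handle matches that cross partition boundaries, which is outside what this sketch covers.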

  • Daft

    daft.VideoFile is perfect for Physical AI.

    Open X-Embodiment aggregates over a million episodes. DROID alone runs 350+ hours of multi-camera 60 fps footage. That's hundreds of millions of frames in a single dataset, and most action-model training doesn't need them all. daft.VideoFile decodes only the slice you describe.

    - read_video_frames: filter on keyframes; supports S3, GCS, & YouTube URLs.
    - video_metadata: resolution, fps, duration, frame count from file headers.
    - video_frames(start_time, end_time): decode a 10-second window from a 90-minute file.

    Frames land as Image columns in the same DataFrame. Feed them to a vision model, compute embeddings, and write to Iceberg. Link in the comments.
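
The arithmetic behind "decode only the slice" can be sketched in a few lines of Python. This does not call Daft; `frame_window` is a hypothetical helper, and the 60 fps rate for the 90-minute file is an assumption borrowed from the DROID figure above.

```python
import math


def frame_window(start_time, end_time, fps):
    """Frame indices whose timestamps (index / fps) fall in [start_time, end_time)."""
    first = math.ceil(start_time * fps)
    stop = math.ceil(end_time * fps)
    return range(first, stop)


fps = 60                                   # assumed frame rate
total_frames = 90 * 60 * fps               # frames in a 90-minute file
window = frame_window(600.0, 610.0, fps)   # a 10-second window starting at minute 10
fraction = len(window) / total_frames      # share of the file that needs decoding
```

Under these assumptions the window covers 600 of 324,000 frames, so a time-bounded read touches well under 1% of the file.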

  • Daft

    So it turns out I'm not the only one who builds on @daftengine 😆 In fact, there's a TON of projects that leverage Daft natively to power their AI & data processing.

    Daft is the Data Engine for AI.
    > I say it because it's true.
    > I keep saying it because the Daft community keeps giving back!

    Check out all these projects! (link in the comments)

  • Daft

    Probably my favorite episode yet! Just finished filming our latest episode of Zero Shot Espresso with Daniel Imberman, who is an Apache Airflow PMC member, developed the K8s Executor, and now helps technical teams ship production AI as a consultant. We chatted about how open-source software is changing in the AI era, what it's like running a solo consulting business, and the biggest difference between senior and principal engineers. Episode goes live in 2 weeks! In the meantime, check out our previous episodes: https://lnkd.in/guXZe9Ms

  • Daft

    30 contributors. 41 new features. That's a new record! 🏆️ 💪

    Daft v0.7.10 ships improvements across distributed joins, duplicate detection, and observability:

    • Distributed asof joins: temporal accuracy without shuffle penalties
    • SimHash duplicate detection: near-duplicate matching at document scale
    • 8 new temporal functions: date math that works correctly
    • C++ extension support: custom aggregations and transformations
    • Enhanced Paimon integration: improved metadata and read performance
    • Dashboard improvements: query heartbeats, task lifecycle events, runtime stats

    A record number of contributors shipped this together. Link in the comments.
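
For readers unfamiliar with SimHash, the sketch below shows the general technique in plain Python. It is not Daft's implementation: the whitespace tokenizer, the 64-bit BLAKE2b token hash, and the function names are assumptions made for illustration.

```python
import hashlib

BITS = 64  # fingerprint width, matching the 8-byte token hash below


def simhash(text):
    """Compute a 64-bit SimHash fingerprint over whitespace-separated tokens."""
    counts = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


doc_a = "Daft is a data engine for multimodal data"
doc_b = "daft is a data engine for multimodal data"   # differs only in case
distance = hamming_distance(simhash(doc_a), simhash(doc_b))
```

Documents that share most of their tokens tend to produce fingerprints that differ in only a few bits, so near-duplicates can be flagged by thresholding the Hamming distance instead of comparing full texts pairwise.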

  • Daft

    🚀 Daft Community Extensions have landed

    The fastest H3 geospatial indexing in Daft wasn't written by the Daft team. Developed by longtime contributor Garret Weaver, daft-h3 runs 3–16x faster than simply wrapping h3-py in a UDF. That speedup is thanks to Daft's new Native ABI Extensions powered by Apache Arrow's C Data Interface.

    Daft is the data engine for AI, and AI workloads are inherently domain-specific. Daft extensions give contributors the freedom to build domain-specific functionality while still benefiting from Daft’s execution model.

    Up until now, the most common way for the community to extend Daft has been Python UDFs (daft.func / daft.cls). For lower-level vectorized performance, Native ABI extensions can add high-performance scalar functions, aggregate functions, Python expression wrappers, and extension-backed datatypes.

    The Daft ecosystem is growing rapidly and we want to support that growth with transparent open source stewardship. daft-h3 is joined by daft-html, daft-geo, and daft-lance as the first community packages in the wild. Link in the comments.

  • Daft

    Most image embedding pipelines are actually two pipelines stitched together.

    Script one: PySpark reads images from S3, resizes them, joins with metadata, writes to Delta Lake.
    Script two: PyTorch loads ResNet, generates embeddings on GPU, writes back to Delta Lake.

    This works. But it means two frameworks with different dependency management, two execution environments with different hardware requirements, and serialization overhead every time data crosses the boundary between them.

    The alternative is collapsing both into one pipeline:

    👉 One script: Daft handles S3 reads, image resizing, joins, and embedding generation in a single pipeline
    👉 @daft.cls for GPU inference: load ResNet once, batch 64 images at a time, Daft manages GPU placement
    👉 Native image type: col("image").resize(224, 224) instead of PIL inside a UDF
    👉 Same write target: write_deltalake() at the end, no intermediate staging

    The blog walks through both approaches side by side with full code. The PySpark + PyTorch version is ~80 lines across two scripts. The Daft version is ~50 lines in one. Fewer systems, fewer failure points, shorter path from raw images to usable embeddings. Link in the comments.
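
The "load once, batch 64 at a time" idea can be illustrated without any framework. The sketch below is plain Python, not the `@daft.cls` API: the `Embedder` class, its toy "model", and the fake images are all stand-ins invented for this example.

```python
class Embedder:
    """Loads a stand-in model once, then embeds inputs in fixed-size batches."""

    load_count = 0  # tracks how many times the (expensive) load ran

    def __init__(self, batch_size=64):
        self.batch_size = batch_size
        self.batches_run = 0
        self.model = self._load_model()

    def _load_model(self):
        # Stand-in for loading ResNet weights: a function mapping an "image"
        # (here, a list of numbers) to a two-element "embedding".
        Embedder.load_count += 1
        return lambda image: [sum(image) / len(image), max(image)]

    def __call__(self, images):
        embeddings = []
        for start in range(0, len(images), self.batch_size):
            batch = images[start:start + self.batch_size]
            embeddings.extend(self.model(img) for img in batch)
            self.batches_run += 1
        return embeddings


images = [[i, i + 1, i + 2] for i in range(150)]   # 150 fake images
embedder = Embedder(batch_size=64)
embeddings = embedder(images)
```

The model is constructed once in `__init__` and reused across the three batches (64, 64, and 22 images), which is the property that makes stateful inference cheaper than reloading weights per call.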

  • Daft

    Talent-maxxing!

    Eventual was ranked #47 globally on Paraform’s Talent Density Index. What I liked most about this wasn’t the ranking itself, but how they define it: not by who looks impressive on paper, but by who’s actually developing people the market is fighting for. A friend put it better than I could: “Honestly, it’s a testament to the talent you’re recruiting and fostering.” Feels right. Grateful to be building alongside this team. Link in comments.

  • Daft

    daft.File is a lazily evaluated file reference. When you call daft.from_files(), nothing downloads: you get lightweight references that defer all I/O until a UDF explicitly opens them.

    👇️ The pattern: filter by metadata first (path, size, MIME type), then open only the files that survive your predicates. Cheap operations narrow the set. Expensive operations run on the survivors.

    The same .open() and .to_tempfile() interface works for PDFs, Python source, audio, and video. This is Week 2 of the daft.File series. Blog link in the comments.
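
The filter-then-open pattern can be sketched with the standard library alone. This is not the daft.File API: `LazyFile`, `from_paths`, and the `opened` log are hypothetical names, and the files are created in a temporary directory purely for the demonstration.

```python
import os
import tempfile
from dataclasses import dataclass

opened = []  # records which paths actually had their contents read


@dataclass(frozen=True)
class LazyFile:
    """A lightweight reference: path and size only, no content loaded."""
    path: str
    size: int

    def open(self):
        opened.append(self.path)
        return open(self.path, "rb")


def from_paths(paths):
    # Only a stat call per file here; contents stay on disk.
    return [LazyFile(p, os.path.getsize(p)) for p in paths]


with tempfile.TemporaryDirectory() as tmp:
    contents = {
        "a.pdf": b"%PDF-small",
        "b.pdf": b"%PDF" + b"x" * 5000,
        "notes.txt": b"hello",
    }
    for name, data in contents.items():
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(data)

    refs = from_paths(sorted(os.path.join(tmp, n) for n in contents))
    # Cheap metadata predicates first: extension and size.
    survivors = [r for r in refs if r.path.endswith(".pdf") and r.size < 1000]
    # Expensive I/O only on the survivors.
    payloads = []
    for ref in survivors:
        with ref.open() as fh:
            payloads.append(fh.read())
```

Three references are created, but only one file is ever opened, which is the point of deferring I/O until after the predicates run.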

  • Daft

    Ok yeah, embeddings and RAG are so 2024, but did you know embeddings can work across modalities? Here's a quick lesson 👇️

    Multimodal embeddings project different data types into a shared vector space so a text query can find the right image, and a chest X-ray can match a doctor's note, without modality-specific pipelines.

    The challenge is that most frameworks treat this as a multi-tool problem. One system for preprocessing, another for embedding generation, another for retrieval. Every boundary is a serialization cost and an ops burden.

    Our new tutorial walks through the full pipeline with CLIP and Daft:

    👉 CLIP contrastive learning: dual encoders trained to align image-text pairs in a shared space
    👉 @daft.cls for stateful inference: load the model once, batch 16 images at a time, run on GPU if available
    👉 cosine_distance for retrieval: cross-join query embeddings against the image index, sort, return top-K
    👉 One script, one framework: no handoff between preprocessing and ML stages

    The full example is ~60 lines. Runnable in a Colab notebook. 🔗 in the comments.
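
The retrieval step (cross-join, cosine distance, sort, top-K) can be shown in plain Python. The vectors below are made-up three-dimensional stand-ins for real CLIP embeddings, and `cosine_distance` and `top_k` are illustrative helpers, not Daft functions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_k(query_vec, index, k):
    """Return the k index keys closest to the query, nearest first."""
    ranked = sorted(index.items(), key=lambda item: cosine_distance(query_vec, item[1]))
    return [name for name, _ in ranked[:k]]


# Toy image index and text queries living in the same (fake) vector space.
image_index = {
    "cat.jpg": [1.0, 0.0, 0.0],
    "dog.jpg": [0.8, 0.6, 0.0],
    "car.jpg": [0.0, 0.0, 1.0],
}
queries = {
    "a photo of a cat": [1.0, 0.1, 0.0],
    "a red sports car": [0.0, 0.1, 1.0],
}

# Cross-join every query against every image, then keep the top 2 per query.
results = {text: top_k(vec, image_index, k=2) for text, vec in queries.items()}
```

Because text and image vectors share one space, the same distance function ranks images for a text query with no modality-specific logic.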
