Gema Ramirez-Sanchez’s Post

CEO, Prompsit Language Engineering

1mo Edited

Release of the massive HPLT v3.0 multilingual dataset! 🚀

October is back and so are HPLT datasets (we've been doing this for three consecutive years now!). This time it is my honour, on behalf of the HPLT team, to announce the release of the massive HPLT v3.0 multilingual dataset, which can be considered a major upgrade for large-scale multilingual corpora.

Accounting for 29 billion documents, 198 language-script combinations and 112 trillion characters, v3.0 shows significant gains over v2, driven by several improvements, including a new global deduplication process:

✅ Unique content boosted from 52% to 73% on average.
✅ Data substance and robustness remain high, with better extraction and improved language identification.
✅ Increased variety and better representativity of natural web content.

This release provides a cleaner, more robust dataset for building powerful LLMs and machine translation systems, including a myriad of low- to medium-resourced languages. And we have not said our last word: more data is coming soon, because we are already working on it.

Special thanks to all the collaborators and funding bodies, including the European Union's Horizon Europe programme and UK Research and Innovation.

🔗 Explore and download the data: https://lnkd.in/dv5mqVP3
🔎 [NEW] See the analysis and evaluation highlights on our website post: https://lnkd.in/duGAeMTu

#HPLT #NLProc #AI #Datasets #MachineTranslation #MultilingualNLP #LanguageTechnology #OpenData #Data4LLMs

1 Comment

Andrey Kutuzov

Associate professor in NLP - University of Oslo

1mo

Great release for great #NLProc :)


More Relevant Posts

Abhyudaya Avasthi

Senior Quantitative Analyst
1mo
Wikipedia’s knowledge graph is becoming AI-ready. The Wikidata Embedding Project, a collaboration between Wikimedia Deutschland, Jina.AI, and DataStax, is turning more than 120 million Wikidata entities into semantic vector embeddings.

Why it matters: Until now, tapping into Wikidata’s structured knowledge required complex SPARQL queries or keyword search. With embeddings, large language models can query by meaning, not just text.

How it works:
- Each Wikidata entry (labels, descriptions, aliases, claims) is converted into high-dimensional vectors using Jina’s multilingual model
- Vectors are hosted in DataStax Astra DB, enabling hybrid search (keyword + semantic) across English, French, and Arabic, with more languages planned
- Developers can test it freely via Toolforge

Deeper integration:
- Supports the Model Context Protocol (MCP), letting LLMs query Wikidata as a live, semantic knowledge source
- Enables retrieval-augmented generation (RAG) and GraphRAG: multi-hop reasoning, fact verification, explainability, and hallucination reduction

Why it’s a breakthrough:
- Open-source and continuously updated by the global Wikidata community
- A transparent alternative to closed Big Tech knowledge APIs
- Strengthens AI grounding, amplifies underrepresented facts, and makes knowledge retrieval more context-aware and trustworthy

This is a step toward a more reliable, explainable, and democratic AI ecosystem.

#AI #OpenSource #Wikipedia
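To make the "query by meaning, not just text" idea concrete, here is a minimal sketch of hybrid (keyword + semantic) ranking. It is not the project's API: the entity ids, texts, two-dimensional vectors, and the `alpha` blending weight are all hypothetical, and a real system would obtain vectors from an embedding model and query a vector database.

```python
import math

# Hypothetical toy entities; real vectors would come from an embedding model.
ENTITIES = [
    {"id": "E1", "text": "english writer and humorist", "vec": [1.0, 0.0]},
    {"id": "E2", "text": "capital city of germany", "vec": [0.0, 1.0]},
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_score(query, text):
    """Fraction of query words that literally appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def hybrid_search(query, query_vec, entities, alpha=0.5):
    """Rank entity ids by a blend of semantic and keyword scores (alpha is assumed)."""
    scored = []
    for entity in entities:
        semantic = cosine(query_vec, entity["vec"])
        lexical = keyword_score(query, entity["text"])
        scored.append((alpha * semantic + (1 - alpha) * lexical, entity["id"]))
    return [entity_id for _, entity_id in sorted(scored, reverse=True)]


# "famous author" shares no words with either text, yet its (toy) vector
# points toward E1, so the semantic part of the score decides the ranking.
ranked = hybrid_search("famous author", [0.9, 0.1], ENTITIES)
```

In this toy setup a keyword-only search would score both entities at zero, while the blended score still surfaces E1.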
Ilyas DAHAOUI

Data scientist & AI Research | AI & Digital Transformation Consultant | Driving Enterprise Value with Data Science | Ex-Renault Group
3w Edited
Recent advances in language models have demonstrated remarkable capabilities in text generation and contextual understanding, yet these models remain fundamentally static: once trained, their weights no longer change. This limitation prevents models from adapting to new knowledge, specific contexts, or emerging tasks, reducing their ability to generalize and stay up to date.

In the paper “Self-Adapting Language Models (SEAL)”, the authors propose a framework that enables models to self-adapt autonomously, without requiring costly external fine-tuning or additional human annotations. The method relies on generating self-edits, which are revised or enriched versions of the model’s own outputs that serve as synthetic data for internal supervised fine-tuning.

For each input, the model first produces an initial response and then generates a self-edit that can correct errors, rephrase the output, enrich content, or adjust training parameters. These self-edits are evaluated using task-specific metrics, and only those deemed effective are incorporated into the model’s weights, enabling persistent updates.

This approach establishes a loop of continuous self-improvement, allowing the model to generate progressively more accurate and contextually appropriate responses while retaining prior knowledge.

#SelfLearningAI #FewShotLearning #SyntheticData #ReinforcementLearning
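The propose, evaluate, keep-if-effective loop described in this post can be sketched in a few lines. This is a toy illustration only, not the SEAL algorithm from the paper: the "model" is a single number, a "self-edit" is a random nudge, and the names `self_edit_loop`, `toy_propose`, `toy_apply`, and `toy_score` are all hypothetical.

```python
import random


def self_edit_loop(weights, propose_edit, apply_edit, score, rounds=20, seed=0):
    """Keep a proposed edit only if it improves the task score."""
    rng = random.Random(seed)
    best = score(weights)
    for _ in range(rounds):
        edit = propose_edit(weights, rng)        # the model suggests a change to itself
        candidate = apply_edit(weights, edit)    # stand-in for a fine-tuning step
        candidate_score = score(candidate)       # task-specific metric
        if candidate_score > best:               # only effective edits persist
            weights, best = candidate, candidate_score
    return weights, best


# Toy stand-ins (assumptions): the model is one float and the task is "be close to 3.0".
def toy_propose(weights, rng):
    return rng.uniform(-1.0, 1.0)


def toy_apply(weights, edit):
    return weights + edit


def toy_score(weights):
    return -(weights - 3.0) ** 2


final_weights, final_score = self_edit_loop(0.0, toy_propose, toy_apply, toy_score)
```

Because rejected edits are discarded, the score can never get worse from one round to the next, which is the property the post describes as persistent, filtered updates.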
Blitz Consulting & Coaching

431 followers
2w
Retrieval augmented generation (RAG) enhances large language models (LLMs) by providing them with relevant external context. For example, when using a RAG system for a question-answer (QA) task, the LLM receives a context that may be a combination of...

Deeper insights into retrieval augmented generation: The role of sufficient context research.google
Dhanya LK

Assistant Professor at Mar Baselios College of Engineering and Technology (Autonomous)
3d
Our chapter titled “Comparative Performance of Machine Learning Algorithms in Detecting Offensive Speech in Malayalam–English Code-Mixed Data” has been published in the Springer book Advances in Distributed Computing and Machine Learning (Lecture Notes in Networks and Systems, Vol. 427).

This research explores how effectively various machine learning models identify offensive and hateful content in code-mixed Malayalam–English social media data, a growing area of concern in multilingual online communication. The study contributes to advancing AI-based content moderation and fostering responsible digital discourse in regional languages.

🔗 Read more: https://lnkd.in/g6PKC74p

#Springer #MachineLearning #OffensiveSpeechDetection #CodeMixedData #NaturalLanguageProcessing #AI #ResearchPublication #AcademicResearch #MultilingualNLP #DigitalSafety

Comparative Performance of Machine Learning Algorithms in Detecting Offensive Speech in Malayalam-English Code-Mixed Data link.springer.com

Ryan Coles, PhD

Leading Global Progress through Scientific Exploration, Business Acumen, & Friendship
3w
Just read a sharp new paper: “LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings.”

The authors show that large language models can replicate human purchase intent nearly as reliably as real survey panels by generating free-text responses and mapping them to Likert scales via semantic similarity.

This is more than a methodological trick. It points to a future where synthetic consumers become part of the innovation process, testing ideas, narratives, and designs before a single ad dollar is spent.

As someone building ventures and studying how AI reorganizes firms, I see this as a glimpse of what’s coming: organizational architectures where insight itself becomes machine-generated.

#AI #Innovation #ConsumerResearch #Strategy #Entrepreneurship

https://lnkd.in/eschhFw8

LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings arxiv.org
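The idea of mapping a free-text answer onto a Likert scale by semantic similarity can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the anchor statements are hypothetical, and a bag-of-words vector stands in for the sentence-embedding model a real pipeline would use.

```python
import math
from collections import Counter

# Hypothetical anchor statements, one per Likert point.
ANCHORS = {
    1: "i would definitely not buy this",
    2: "i would probably not buy this",
    3: "i am not sure whether i would buy this",
    4: "i would probably buy this",
    5: "i would definitely buy this",
}


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def likert_distribution(response):
    """Turn similarities to each anchor into a probability distribution over ratings."""
    sims = {k: cosine(embed(response), embed(text)) for k, text in ANCHORS.items()}
    total = sum(sims.values())
    if total == 0:
        # No overlap with any anchor: fall back to a uniform distribution.
        return {k: 1.0 / len(ANCHORS) for k in ANCHORS}
    return {k: s / total for k, s in sims.items()}


def expected_rating(response):
    """Probability-weighted mean rating on the 1 to 5 scale."""
    return sum(k * p for k, p in likert_distribution(response).items())


dist = likert_distribution("i would definitely buy this")
```

Returning a distribution rather than a single number keeps the uncertainty of an ambiguous answer visible; the expected rating is then just a summary of it.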

GyaanSetu AI

256 followers
6d
🧠 Building an Enterprise-Grade Grammar API with AI/ML Integration

How I Built a Production-Ready Grammar Checking API with Multi-Language Support, LLM Integration, and Advanced NLP

TL;DR: I built an open-source, production-ready Grammar API with multi-agent AI systems, LLM integration, and enterprise features like PII detection, intelligent caching, and multi-language support. This article walks through the architecture, design decisions, and how you can use or contribute to it.

GitHub: https://lnkd.in/gGt2yqHk

As developers, we've all used grammar checking tools. But when you need to integrate grammar checking into your application at scale, you quickly run into limitations:
- Language Barriers: Most APIs support only English or a handful of languages
- No AI Integration: Traditional rule-based checkers miss context and nuance
- Privacy Concerns: Sending sensitive data to third-party APIs isn't always an option
- Cost at Scale: Per-request pricing models become expensive fast
- Limited Customization: One-size-fits-all solutions don't fit ente

https://lnkd.in/gmQmsJEh
Bayes Labs

2,657 followers
3d
Research Paper Highlights: "Fast Thinking for Large Language Models" by Haoyu Zheng et al.

Reasoning-oriented Large Language Models (LLMs) rely on step-by-step token generation, often enhanced by Chain-of-Thought (CoT) reasoning for complex tasks. However, this process is inefficient due to long reasoning traces that increase latency and token usage. This research introduces Latent Codebooks for Fast Thinking, a framework that enables concise and efficient reasoning without explicit token generation.

Challenges:
- High latency and cost from generating long CoT reasoning traces.
- Dependence on large-scale fine-tuning or RL, which limits scalability.
- Overthinking and inefficiency in cases where simpler reasoning suffices.

Key Takeaways:
- Introduces a codebook of discrete strategy priors, learned from concise CoT sketches during training.
- Enables fast inference using compact continuous thinking vectors instead of explicit reasoning tokens.
- Proposes GainRouter, a routing mechanism that dynamically switches between fast and explicit reasoning modes.
- Achieves competitive or superior accuracy across reasoning benchmarks while reducing inference cost and token usage.

This work underscores the potential of efficient and controllable reasoning in LLMs by combining strategic latent guidance with adaptive routing. Latent Codebooks for Fast Thinking marks a step toward faster, cost-effective reasoning models without compromising performance.

Further reading: https://lnkd.in/gz3jd5rX
Information MDPI

An Open Access journal by MDPI
3w
🧐 The paper addresses the challenge of reliably evaluating the performance of large language models (LLMs) in tasks where outputs vary across executions due to their non-deterministic nature.

🧠 Traditional evaluation methods typically involve running the model multiple times, averaging the results, and providing confidence intervals. However, the authors argue that these confidence intervals may not be trustworthy when the number of runs is limited, which is often the case due to computational costs.

👆 To overcome this limitation, they propose a novel methodology that captures intra-run variability by analysing predictions at the instance level across multiple executions. This allows for the computation of more robust and reliable confidence intervals when a gold standard is available.

🌟 A key advantage of their approach is efficiency: it requires fewer full model runs to provide accurate estimates of performance variability, reducing both time and resource consumption. In their experiments, the proposed method achieved complete empirical coverage (100%) of plausible performance outcomes with as few as three runs, whereas conventional methods only achieved up to 63% coverage even with eight runs.

✉️ The study thus contributes an effective and computationally efficient framework for measuring the reliability of LLM performance, addressing a critical gap in current evaluation practices.

Read #NewPaper "On Measuring Large Language Models Performance with Inferential Statistics" by Jesus Mª Fraile Hernández and Anselmo Peñas.

See more details at: https://lnkd.in/dgizEFRx

#performance_evaluation #confidence_intervals #LLMs
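One generic way to use instance-level predictions from a handful of runs is a bootstrap that resamples both the test instance and the run it came from. The sketch below is an assumption-laden illustration of that general idea, not the authors' method: the function name, the 0/1 correctness matrix, and the percentile interval are all choices made here for the example.

```python
import random


def instance_level_interval(correct, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap over instances and runs.

    correct[r][i] is 1 if run r answered instance i correctly (against the
    gold standard), else 0. Returns (lower, upper) bounds on accuracy.
    """
    rng = random.Random(seed)
    n_runs, n_items = len(correct), len(correct[0])
    scores = []
    for _ in range(n_boot):
        hits = 0
        for _ in range(n_items):
            i = rng.randrange(n_items)   # resample a test instance
            r = rng.randrange(n_runs)    # take that instance's outcome from a random run
            hits += correct[r][i]
        scores.append(hits / n_items)
    scores.sort()
    lower = scores[int((1 - level) / 2 * n_boot)]
    upper = scores[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


# Toy data (hypothetical): three runs over ten instances.
runs = [
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    [1, 0, 0, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 0, 1, 0, 0, 1, 1, 1, 1],
]
low, high = instance_level_interval(runs)
```

With only three runs, a run-level interval would be built from three accuracy values; working at the instance level gives the resampling thirty outcomes to draw on instead.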
kreaviz.com

4 followers
3w
LFM2-8B-A1B: a new generation of Edge AI for mobile

The new Liquid AI model runs on just 1.5 billion active parameters, even though its total count reaches 8.3 billion. Think of it as a hybrid engine: most of the system stays asleep until needed. The result? Lower memory usage, less power consumption, and performance comparable to dense 3–4B models.

A hybrid architecture
LFM2 blends convolutional blocks with grouped-query attention (GQA): 18 convolution units and 6 attention blocks. This hybrid setup allows it to handle short-range language patterns while understanding context up to 32k tokens. In short:
➡️ Speed meets reasoning: lightweight, yet capable.

💻 Built for devices, not the cloud
It’s not meant to replace GPT-4. It’s designed to thrive where most LLMs choke: on laptops, tablets, and phones. Quantized versions already run comfortably on high-end consumer devices. Perfect for:
- AI agents
- RAG pipelines
- Creative writing
- Conversational systems

📊 Benchmark results
- MMLU: 64.84 (≈ Llama-3.2-3B)
- GSM8K: 84.38
- HumanEval+: 69.5%
It won’t dethrone Qwen3-4B-Instruct, but it’s faster and lighter, which makes it ideal for local and offline AI systems.

🌍 Data & languages
Training mix: 🟩 75% English 🟨 20% multilingual 🟦 5% code
Supports 8 major languages: EN, AR, ZH, FR, DE, JP, KR, ES.
License: LFM Open License v1.0, open and developer-friendly.

Your turn
Do you see the future in hybrid Edge AI models that are small, fast, and context-aware? Or do you still believe “more parameters = better intelligence”?

#AI #EdgeAI #LLM #ArtificialIntelligence #MachineLearning #DeepLearning #Technology #Innovation #Qwen #Gemma #LiquidAI #RAG #AIagents #opensource

🔗 Further Reading & Documentation
- LFM2-8B-A1B on Hugging Face: model card, weights, and usage instructions 👉 https://lnkd.in/eKGCJEDk
- Liquid AI Blog, “LFM2-8B-A1B – An Efficient On-device Mixture-of-Experts”: official announcement and benchmarks 👉 https://lnkd.in/ecyCmKHM
- Transformers Docs (Hugging Face): developer documentation for integrating LFM2 👉 https://lnkd.in/d3MjaBXR
- Liquid AI Models Overview: all LFM and LFM2 versions explained 👉 https://lnkd.in/d7fVxc2P
- GGUF Weights (LFM2-8B-A1B): quantized versions for local / llama.cpp setups 👉 https://lnkd.in/dmAFJHWe
- Research Paper (arXiv): Liquid: Language Models are Scalable Multi-modal Generators 👉 https://lnkd.in/d6njmwBd

Liquid: Language Models are Scalable and Unified Multi-modal Generators arxiv.org
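The "most of the system stays asleep" point, with about 1.5B of 8.3B parameters (roughly 18%) active per token according to the post, is what top-k expert routing in a mixture-of-experts layer produces. Here is a minimal sketch of that routing idea; the four toy experts, the gate scores, and `k=2` are hypothetical and say nothing about LFM2's actual layer layout.

```python
import math


def top_k_route(gate_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)
    exps = {i: math.exp(gate_logits[i] - peak) for i in top}   # subtract peak for stability
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, gate_logits, k=2):
    """Run only the selected experts; the others are never evaluated."""
    weights = top_k_route(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Toy experts (assumptions): each is a tiny function instead of a neural sub-network.
EXPERTS = [
    lambda x: x + 1,
    lambda x: 2 * x,
    lambda x: -x,
    lambda x: x * x,
]
GATE_LOGITS = [0.1, 2.0, -1.0, 2.0]   # in a real model these come from a learned router

routing = top_k_route(GATE_LOGITS, k=2)
output = moe_forward(3.0, EXPERTS, GATE_LOGITS, k=2)
```

Here only experts 1 and 3 run, each with weight 0.5, so half of the toy "parameters" stay idle for this input.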
Anis Aknouche

Data Scientist @ Zeenea part of Actian | Industrial PhD on Smart Metadata Management Systems @ LIP6
4w Edited
The second part of the LLM and KG Patterns series is out. In this article, written together with Ole Olesen-Bagneux, we explore the concept of the LLM-augmented KG, examining how large language models can enhance, extend, and interact with structured knowledge graph representations. Stay tuned for the upcoming article “Synergized LLM and KG”.

#LLM #KG #AI

https://lnkd.in/ePae_g5G

LLM-augmented KG: Large Language Model (LLM) And Knowledge Graph (KG) Patterns (Part 2/3) dataintelligenceplatform.substack.com
