Gema Ramirez-Sanchez’s Post

CEO, Prompsit Language Engineering

Release of the massive HPLT v3.0 multilingual dataset! 🚀

October is back and so are HPLT datasets (we've been doing this for three consecutive years now!). This time it is my honour, on behalf of the HPLT team, to announce the release of the massive HPLT v3.0 multilingual dataset, a major upgrade for large-scale multilingual corpora.

With 29 billion documents, 198 language-script combinations and 112 trillion characters, v3.0 shows significant gains over v2, driven by several improvements, including a new global deduplication process:

✅ Unique content boosted from 52% to 73% on average.
✅ Data substance and robustness remain high, with better extraction and improved language identification.
✅ Increased variety and better representativeness of natural web content.

This release provides a cleaner, more robust dataset for building powerful LLMs and machine translation systems, including for a myriad of low- to medium-resourced languages. And we have not said our last word: more data is coming soon, and we are already working on it.

Special thanks to all the collaborators and funding bodies, including the European Union's Horizon Europe programme and UK Research and Innovation.

🔗 Explore and download the data: https://lnkd.in/dv5mqVP3
🔎 [NEW] See the analysis and evaluation highlights on our website post: https://lnkd.in/duGAeMTu

#HPLT #NLProc #AI #Datasets #MachineTranslation #MultilingualNLP #LanguageTechnology #OpenData #Data4LLMs

Andrey Kutuzov

Associate professor in NLP - University of Oslo

1mo

Great release for great #NLProc :)
