Posts

Showing posts with the label Patents

New SureChEMBL announcement

We have some very exciting news to report: the new SureChEMBL is now available! Hooray! What is SureChEMBL, you may ask. Good question! In our portfolio of chemical biology services, alongside our established database of bioactivity data for drug-like molecules (ChEMBL), our dictionary of annotated small molecule entities (ChEBI), and our compound cross-referencing system (UniChem), we also deliver a database of annotated patents! Almost 10 years ago, EMBL-EBI acquired the SureChem system of chemically annotated patents and made it freely accessible in the public domain as SureChEMBL. Since then, our team has continued to maintain and deliver SureChEMBL. However, this has become increasingly challenging due to the complexities of the underlying codebase. We were awarded a Wellcome Trust grant in 2021 to completely overhaul SureChEMBL, with a new UI, backend infrastructure, and new f...

SureChEMBL: A New Hope

SureChEMBL has disrupted the field of patent chemistry by liberating chemical structures and knowledge locked in text and images, and by making the compound-patent associations freely and fully searchable and accessible on a daily basis to everyone: academics, IP professionals, content providers, software vendors, biotechs, small and big pharma, and related chemical industries. The speed, scale and scope of the data are unprecedented for a public resource. SureChEMBL has been around for less than two years; during this time, it has evolved into a full-blown chemistry resource provided by the EMBL-EBI: the SureChEMBL interface was revamped and released last year, including combined keyword and structure-based queries against the annotated patent corpus. All chemistry is integrated with UniChem and there are several ways to access the data in bulk, including flat files and a data client. Very soon, the data will be fully integrated and avai...

Advanced keyword and structure searches with SureChEMBL

Previously in the SureChEMBL series, we described how to access SureChEMBL data in bulk, offline and locally. So, you may ask, what is the point of using the SureChEMBL web interface? Well, how about the unprecedented functionality that allows you to submit very granular queries by combining: i) Lucene fields against full-text and bibliographic metadata and ii) advanced structure query features against the annotated compound corpus, both at the same time? Let’s see each one separately first:

Lucene-powered keyword searching

You may use the main text box for simple keyword-based patent searches, such as ‘Apple’, ‘diabetes’ or even ‘chocolate cake’ (the patent corpus as a recipe book is a new use-case here). You will get a lot of results and probably a lot of noise. With Lucene fields, you can slice and dice a query by indicating specific patent sections and bibliographic metadata, such as date/year of filing or publication, assignee, patent classification code,...

Paper: Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents

Our collaborators at GSK have just published an Open Access paper in the Journal of Cheminformatics. It is a comparative study of the quality of chemistry extraction from patent documents and includes patent chemistry sources derived by automated text-mining, such as SureChEMBL and the IBM/NIH data set. Among other things, the paper provides a useful detailed overview of SureChEMBL's chemistry annotation specifications. While conducting this study, we realised that this task is far from trivial for several reasons: The patent corpus is inherently noisy, ambiguous and error-rich. There are diverse use cases and accuracy expectations when it comes to chemistry extracted from patents. Not all the chemistry found in a patent document is of equal importance. Compound standardisation variants such as stereoisomers, tautomers, salts and mixtures are always an issue. There is a distinct lack of an open Gold Standard when it comes to standardised chemistry extracted fro...

Accessing SureChEMBL data in bulk

It is the peak of the summer (at least in this hemisphere) and many of our readers/users will be on holiday, perhaps on an island enjoying the sea. Luckily, for the rest of us there is still the 'sea' of SureChEMBL data waiting to be enjoyed and explored for hidden 'treasures' (let me know if I pushed this analogy too far). See here and here for a reminder of what SureChEMBL is and what it does. This wealth of (big) data can be accessed via the SureChEMBL interface, where users can submit quite sophisticated and granular queries by combining: i) Lucene fields against full-text and bibliographic metadata and ii) advanced structure query features against the annotated compound corpus. Examples of such queries will be the topic of a future post. Once the search results are back, users can browse through and export the chemistry from the patent(s) of interest. In addition to this functionality, we've been receiving user requests for local (behind the ...

The SureChEMBL map file is out

As many of you know, SureChEMBL taps into the wealth of knowledge hidden in patent documents. More specifically, SureChEMBL extracts and indexes chemistry from the full-text patent corpus (EPO, WIPO and USPTO; JPO titles and abstracts only) by means of automated text- and image-mining, on a daily basis. We recently hosted a webinar about it, which turned out to be very popular; for those who missed it, the video and slides are here. Besides the interface, SureChEMBL compound data can be accessed in various ways, such as UniChem and PubChem. The full compound dump is also available as a flat file download from our FTP server. Since the release of the SureChEMBL interface last September, we have received numerous requests for a way to access compound and patent data in batch. Typical use-cases would include retrieving all compounds for a list of patent IDs, or vice versa, retrieving all patents from which one or more compounds have been extracted. As a...
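The two batch use-cases mentioned above (compounds for a list of patent IDs, and patents for a given compound) amount to a pair of lookup tables built from compound-patent pairs. Below is a minimal sketch in Python. It assumes a tab-separated layout with a compound ID in the first column and a patent ID in the second; that layout, the `build_indexes` helper and the sample IDs are all illustrative assumptions, not the documented format of the actual map file.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows: compound ID <TAB> patent ID.
# ASSUMPTION: the real map file layout may differ; adjust column positions.
SAMPLE = (
    "SCHEMBL1\tUS-0000001-A1\n"
    "SCHEMBL2\tUS-0000001-A1\n"
    "SCHEMBL1\tEP-0000002-A1\n"
)


def build_indexes(handle):
    """Return (patent -> compounds, compound -> patents) lookup tables."""
    by_patent = defaultdict(set)
    by_compound = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        compound_id, patent_id = row[0], row[1]
        by_patent[patent_id].add(compound_id)
        by_compound[compound_id].add(patent_id)
    return by_patent, by_compound


by_patent, by_compound = build_indexes(io.StringIO(SAMPLE))

# All compounds extracted from one patent.
compounds = sorted(by_patent["US-0000001-A1"])
# All patents in which one compound was found.
patents = sorted(by_compound["SCHEMBL1"])

print(compounds)
print(patents)
```

For a real file, the same function could be given an open file handle instead of the in-memory sample. Sets are used so that repeated compound-patent pairs collapse into a single association.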