Published March 17, 2025 | Version v2 (expanded version of Zenodo record 14080821)
Dataset Open

Pool PaRTI protein sequence embeddings and residue importance scores for ESM-2 650M and protBERT

Description

This dataset is version 2 of the record at zenodo.org/records/14080821 and accompanies the paper "Pool PaRTI: A PageRank-Based Pooling Method for Identifying Critical Residues and Enhancing Protein Sequence Representations."

For two protein language models (PLMs), ESM-2 650M and protBERT, and for more than 20,000 UniProt proteins (covering all Homo sapiens proteins), we provide

1) the protein sequence embeddings generated by Pool PaRTI, and

2) the importance weights that Pool PaRTI assigns to each residue of every protein, in the npz files.

The individual proteins are indexed by their UniProt accession codes. To generate sequence embeddings or residue importance values for sequences not in the dataset, use the code in the repository linked below.

github.com/Helix-Research-Lab/Pool_PaRTI.git
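For proteins already in the dataset, the npz files can be read with NumPy. The following is a minimal sketch, not the authors' code. It assumes that each archive maps UniProt accession codes directly to arrays, which this record does not spell out, so inspect the archive keys before relying on it. The function names, the file name, and the example accession in the comments are illustrative only.

```python
import numpy as np


def load_entry(npz_path, accession):
    """Return the array stored under a UniProt accession in an npz archive.

    Assumption (not confirmed by this record): the archive keys are
    UniProt accession codes, one array per protein.
    """
    with np.load(npz_path) as archive:
        if accession not in archive.files:
            raise KeyError(f"{accession} not found in {npz_path}")
        return archive[accession]


def top_residues(importance, k=10):
    """Return 1-based positions of the k highest-weighted residues, best first."""
    importance = np.asarray(importance).ravel()
    # Stable sort on the negated weights keeps the earlier residue first on ties.
    order = np.argsort(-importance, kind="stable")[:k]
    return (order + 1).tolist()


# Hypothetical usage; replace the file name with a file downloaded from this record.
# weights = load_entry("importance_weights.npz", "P04637")
# print(top_residues(weights, k=5))
```

If the key lookup fails, listing `archive.files` for a downloaded file shows how its entries are actually named.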

You can also reach out to the authors for any clarification.

Files

Files (629.2 MB)

Checksum, Size
md5:888138f164cde8cfc826097c77da8c7b, 125.0 MB
md5:abdc96fd20262d7bc8392d8d45aab508, 222.6 MB
md5:9e411be0250af6303017dc37a2cdcf5d, 166.4 MB
md5:4c95a1f8e48114badbbee41b2f0a6b29, 115.2 MB
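
The MD5 checksums listed above can be used to confirm that a download completed without corruption. The sketch below uses only Python's standard `hashlib` module. The helper names and the local file name in the comment are illustrative, since the file names are not shown in this listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 with an expected value, with or without an 'md5:' prefix."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Hypothetical usage; the local file name is a placeholder.
# ok = matches_checksum("downloaded_file.npz", "md5:888138f164cde8cfc826097c77da8c7b")
```

A mismatch usually means the download was truncated or altered, and the file should be fetched again.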

Additional details

Dates

Available: 2024-10-05 (paper available on bioRxiv)

References

  • Tartici, Alp, Gowri Nayar, and Russ B. Altman. "Pool PaRTI: A PageRank-based Pooling Method for Robust Protein Sequence Representation in Deep Learning." bioRxiv (2024): 2024-10.