Pool PaRTI protein sequence embeddings and residue importance scores for ESM-2 650M and protBERT
Description
This dataset is version 2 of zenodo.org/records/14080821 for the paper titled "Pool PaRTI: A PageRank-Based Pooling Method for Identifying Critical Residues and Enhancing Protein Sequence Representations."
For two protein language models (PLMs), ESM-2 650M and protBERT, and for more than 20,000 UniProt proteins (covering all Homo sapiens proteins), we present
1) the protein sequence embeddings generated by Pool PaRTI
2) the importance weights that Pool PaRTI assigns to each residue of every protein, in the npz files.
The individual proteins are indexed by their UniProt accession codes. If you need sequence embeddings or residue importance values for sequences that are not in the dataset, please follow the instructions in the repository linked below to generate them.
github.com/Helix-Research-Lab/Pool_PaRTI.git
You can also reach out to the authors for any clarification.
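As a starting point for exploring the downloaded files, the sketch below lists the arrays in an npz archive and ranks residues by importance. It assumes, based only on the description above, that each archive is keyed by UniProt accession code and that an importance entry is a 1-D array with one weight per residue; the internal layout has not been verified here. The function names `summarize_npz` and `top_residues` are hypothetical helpers, not part of the Pool PaRTI repository.

```python
import numpy as np


def summarize_npz(path, accession=None):
    """Return {key: (shape, dtype)} for arrays stored in an .npz archive.

    Assumption: keys are UniProt accession codes (as the description
    suggests). If `accession` is given, only that entry is summarized.
    """
    with np.load(path) as archive:
        keys = [accession] if accession is not None else archive.files
        summary = {}
        for key in keys:
            array = archive[key]  # arrays are read lazily, one per key
            summary[key] = (array.shape, str(array.dtype))
        return summary


def top_residues(weights, k=10):
    """Return the 1-based positions of the k highest-weighted residues.

    Assumption: `weights` is a 1-D array with one importance value per
    residue, in sequence order.
    """
    weights = np.asarray(weights)
    order = np.argsort(weights)[::-1]  # indices sorted by descending weight
    return (order[:k] + 1).tolist()
```

Calling `summarize_npz` on one of the downloaded archives first is a cheap way to check whether these assumptions hold before writing any downstream analysis.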
Files
Total size: 629.2 MB

| MD5 checksum | Size |
|---|---|
| md5:888138f164cde8cfc826097c77da8c7b | 125.0 MB |
| md5:abdc96fd20262d7bc8392d8d45aab508 | 222.6 MB |
| md5:9e411be0250af6303017dc37a2cdcf5d | 166.4 MB |
| md5:4c95a1f8e48114badbbee41b2f0a6b29 | 115.2 MB |
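The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using only the Python standard library. Because the file names are not shown in this listing, it checks a file's digest against the set of four listed checksums rather than against a specific name; `md5_of_file` and `matches_record` are hypothetical helper names.

```python
import hashlib

# MD5 checksums copied from the file listing of this record.
LISTED_MD5 = {
    "888138f164cde8cfc826097c77da8c7b",
    "abdc96fd20262d7bc8392d8d45aab508",
    "9e411be0250af6303017dc37a2cdcf5d",
    "4c95a1f8e48114badbbee41b2f0a6b29",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path):
    """Return True if the file's MD5 equals one of the listed checksums."""
    return md5_of_file(path) in LISTED_MD5
```

Reading in chunks keeps memory use flat even for the larger archives, which exceed 200 MB.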
Additional details
Dates
- Available
- 2024-10-05: paper available on bioRxiv
Software
- Repository URL
- https://github.com/Helix-Research-Lab/Pool_PaRTI.git
- Programming language
- Python, Shell
References
- Tartici, Alp, Gowri Nayar, and Russ B. Altman. "Pool PaRTI: A PageRank-based Pooling Method for Robust Protein Sequence Representation in Deep Learning." bioRxiv (2024): 2024-10.