DEGRONOPEDIA is an online resource for inspecting and visualizing known degron motifs in the proteomes of selected model organisms, as well as in a user-submitted sequence or structure. It also provides:
structural context (solvent-accessibility, intrinsically disordered region, secondary structure) for each degron motif found
post-translational modifications and mutations occurring within each degron motif and its flanking regions
comprehensive visualization of the location of degron motifs in the sequence with structural data, post-translational modifications and mutations
evolutionary conservation of degron motifs among orthologous proteins
degron context according to the tripartite degron model1
Gravy hydrophobicity index for N-/C-terminus of the query
experimental or Machine Learning-predicted Protein Stability Index (PSI) of N-/C-terminus of the query (provides insight into protein termini stability)
simulations of proteolytic cleavage followed by screening for degron motifs in each newly emerged N-/C-termini after proteolysis
The maintenance of proteostasis requires the degradation of damaged or unwanted proteins and plays a key role in cell function, the growth of organisms and, ultimately, their viability. The ubiquitin-proteasome system (UPS) manages protein degradation through a process known as ubiquitination, in which a small ubiquitin protein is attached to its target. Ubiquitination is mediated by an enzymatic cascade involving ubiquitin-activating enzymes (E1), ubiquitin-conjugating enzymes (E2) and ubiquitin ligases (E3). The proteasome complex recognizes ubiquitinated proteins and, through proteolysis, degrades them into short peptides that can be further processed.
A degron is a short linear motif on a protein of interest (POI) that is recognized by E3 ubiquitin ligases. The N- and C-termini of a protein may act as degron sites, but internal degrons, often located within intrinsically disordered regions (IDRs), are also possible.
Figure 1. Scheme of a degron site. E2 - ubiquitin-conjugating enzyme, E3 - ubiquitin ligase, Ub - ubiquitin, POI - protein of interest.
UniProt ID - UniProt ID of a protein (min. 50 amino acids long) from the canonical UniProt proteome of one of the 11 selected model organisms: H. sapiens, M. musculus, R. norvegicus, D. rerio, D. melanogaster, C. elegans, S. cerevisiae, S. pombe, A. thaliana, O. sativa, Z. mays
Sequence - protein sequence in FASTA format, between 50 and 40,000 amino acids long, containing only 20 canonical amino acids
Structure - protein monomer structure in the PDB format
with only one model and one chain
with continuous numbering starting from 1 (to avoid inconsistency with overlaying data on the sequence; you can easily renumber your structure here)
between 50 and 40,000 amino acids long
containing 20 canonical amino acids only
not exceeding 5 MB size
Structure + UniProt ID - protein monomer structure in the PDB format meeting the criteria described above, submitted together with a UniProt ID; if the UniProt ID matches the structure and is present in our database, post-translational modifications, mutations and all other data related to the query are displayed - it mimics the "Query by UniProt ID", but calculations are performed for the structure submitted by the user
Note: Regardless of the query type, it is possible to submit only one protein at a time.
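The sequence requirements listed above (a single FASTA record, 50 to 40,000 residues, 20 canonical amino acids only) can be checked before submission. The following is a minimal sketch, not the server's own validation code; the function names `parse_single_fasta` and `validate_sequence` are hypothetical, and upper-casing the input is an assumption.

```python
# Hypothetical pre-submission check; not part of DEGRONOPEDIA itself.
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 canonical amino acids


def parse_single_fasta(text):
    """Return the sequence of a FASTA text that holds exactly one record."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA header line starting with '>'")
    if sum(ln.startswith(">") for ln in lines) != 1:
        raise ValueError("only one protein can be submitted at a time")
    # Assumption: lower-case residues are accepted and upper-cased.
    return "".join(lines[1:]).upper()


def validate_sequence(seq, min_len=50, max_len=40000):
    """Return a list of problems; an empty list means the sequence passes."""
    problems = []
    if not min_len <= len(seq) <= max_len:
        problems.append(f"length {len(seq)} is outside {min_len}-{max_len}")
    bad = sorted(set(seq) - CANONICAL)
    if bad:
        problems.append("non-canonical residues: " + "".join(bad))
    return problems
```

An empty list returned by `validate_sequence` means that neither the length nor the alphabet rule is violated.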
Calculate degron conservation - applicable when passing a UniProt ID; calculates degron conservation scores based on pre-calculated Multiple Sequence Alignments (MSAs) obtained from the eggNOG5 database; see also Degron conservation scores
Run Machine Learning - applicable to any input type; predicts the Protein Stability Index (PSI) for the N-/C-terminus of the query; see also Machine Learning
Predict disordered regions based on - applicable to any input type; lets you choose between the pLDDT/LDDT values of the corresponding structure model and the IUPred3 software for predicting disordered regions. Please note that when querying by structure, it is up to the user to indicate the correct type of the submitted structure (in the "The submitted structure is" panel); otherwise, the disordered regions may be defined erroneously; see also How is disorder predicted?
Define custom degron motifs - applicable to any input type; lets you define custom degron motifs, either as exact sequence motifs or as regular expressions
Define custom proteolytic sites - applicable to any input type; lets you define custom proteolytic sites, either as a sequence motif with an indicated cleavage site or as the index of the cleavage site
Upload custom Multiple Sequence Alignment - applicable to any input type; lets you upload a Multiple Sequence Alignment (MSA) of the query protein's orthologs in FASTA format to check the conservation of degrons among them
Pass UniProt ID of the structure - applicable when querying by structure; lets you additionally provide a UniProt ID; if it matches the structure and is present in our database, the server displays post-translational modifications, mutations and all other data associated with the query - it mimics the "Query by UniProt ID", but calculations are performed for the structure submitted by the user
Definition: the maximum sequence distance to regions upstream and downstream of the degron motif to be considered as flanking
Applies: to all query types
Unit: aa
Default value: 20
Allowed values: 5-40
Figure 2. Scheme of the degron flanking region in sequence (defined as 20 aa).
Definition: the maximum structural distance to residues around the degron motif to be considered as flanking (note that such residues are not necessarily close in sequence to the degron motif)
Applies: to query by UniProt ID or structure
Unit: Å
Default value: 20
Allowed values: 5-40
Figure 3. Scheme of the degron flanking region in structure (defined as 20 Å).
Definition: the maximum sequence distance to regions upstream and downstream of the degron motif to be included in the degron mean disorder score
Applies: to all query types
Unit: aa
Default value: 10
Allowed values: 1-20
Figure 4. Scheme of the region to calculate degron disorder (defined as 10 aa).
Definition: the maximum sequence distance to regions upstream and downstream of the secondary degron (K/C/T/S) to be included in the secondary degron mean disorder score
Applies: to all query types
Unit: aa
Default value: 3
Allowed values: 1-15
Figure 5. Scheme of the region to calculate secondary degron disorder (defined as 3 aa).
Definition: the minimum sequence distance from the secondary degron (K/C/T/S) to the continuous intrinsically disordered region (IDR) of a defined length (see Minimum continuous IDR length) to consider it as a tertiary degron
Applies: to all query types
Unit: aa
Default value: 10
Allowed values: 5-40
Figure 6. Scheme of different IDRs and their relation to the secondary degron (with both Minimum IDR distance from the secondary degron (K/C/T/S) and Minimum continuous IDR length defined as 10 aa).
Definition: the minimum number of subsequent (in sequence) disordered residues to be considered as an intrinsically disordered region (IDR)
Applies: to all query types
Unit: aa
Default value: 10
Allowed values: 5-40
Example: when defined as 10 aa, at least 10 disordered residues must appear one after another in the sequence to be recognized as an IDR
See also: How is disorder predicted?
Definition: the threshold below which a residue is considered disordered based on its pLDDT/LDDT (predicted Local Distance Difference Test/Local Distance Difference Test) confidence score. A structure whose disorder is to be predicted from pLDDT/LDDT scores must be either an AlphaFold2 or a RoseTTAFold model.
Applies: to query by UniProt ID or structure
Unit: %; pLDDT scores (present in AlphaFold2 models) are in the range of 0-100 and LDDT scores (present in RoseTTAFold models) are in the range of 0-1, so in order to handle both cases, this parameter is defined as %
Default value: 70
Allowed values: 40-90
Example: when defined as 70%, all residues with a mean pLDDT/LDDT score below 70/0.7 (AlphaFold2 model/RoseTTAFold model) are considered disordered (the score should be the same for every atom of a residue; nevertheless, the residue mean over all atoms is always calculated)
See also: How is disorder predicted?
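The percentage-based threshold can be illustrated with a short sketch that handles both score scales. This is only an illustration under stated assumptions: the function names and the `model` labels (`"alphafold2"`, `"rosettafold"`) are hypothetical and do not come from the server's code.

```python
from statistics import mean


def residue_confidence(atom_scores):
    """Mean of the per-atom scores (B-factor column) of one residue."""
    return mean(atom_scores)


def is_disordered_by_confidence(score, threshold_pct=70, model="alphafold2"):
    """Apply the % threshold to a pLDDT (0-100) or LDDT (0-1) score."""
    if not 40 <= threshold_pct <= 90:
        raise ValueError("threshold must be within the allowed range 40-90")
    if model == "alphafold2":      # pLDDT is already on a 0-100 scale
        cutoff = threshold_pct
    elif model == "rosettafold":   # LDDT is on a 0-1 scale
        cutoff = threshold_pct / 100
    else:
        raise ValueError("model must be 'alphafold2' or 'rosettafold'")
    # Scores strictly below the cutoff count as disordered.
    return score < cutoff
```

Expressing the threshold in % lets the same parameter value serve both model types; only the cutoff is rescaled.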
Definition: the threshold above which the residue is considered as disordered based on predictions obtained from the IUPred3 software (sequence-based predictions)
Applies: to all query types
Unit: %
Default value: 50
Allowed values: 40-90
Example: when defined as 50%, all residues with the score predicted by IUPred3 above this value are considered as disordered
See also: How is disorder predicted?
Definition: the threshold below which the residue is considered as buried based on its Relative Solvent Accessibility (RSA) calculated with the DSSP software and normalized using the Sander method
Applies: to query by UniProt ID or structure
Unit: %
Default value: 20
Allowed values: 5-60
Example: when defined as 20%, all residues with the RSA value below 0.2 are considered as buried
See also: How are buried residues defined?
Definition: the maximum distance, counted upstream or downstream from the ends of the degron motif in the query, within which the motif may occur in an ortholog in the Multiple Sequence Alignment (MSA) and still be considered evolutionarily conserved
Applies: to all query types
Unit: aa
Default value: 5
Allowed values: 5-40
See also: Degron conservation scores
Figure 7. Scheme of the occurrence of the degron motif in the MSA and its recognition as conserved with respect to the distance from the degron motif in the query (defined as 5 aa). Gaps in the alignment within the degron motifs are marked with horizontal dashes.
Depending on the input type, different granularity of degron-related output information is provided.
Query by UniProt ID - provides the most comprehensive information about degron motifs in a query, including full tripartite degron model, as additional information about evolutionary conservation, structural context, post-translational modifications or mutations, is superimposed on the degron data
Sequence - gives the least amount of output information compared to query by UniProt ID or structure, as no structure or post-translational modifications/mutations/experimental proteolytic site data are available (although information on disordered regions is present as predicted based on the query sequence using the IUPred3 software)
Structure - provides a moderate amount of information, including a full tripartite degron model, but not as complete as the query by UniProt ID, because experimental data such as post-translational modifications or mutations are unavailable
Structure + UniProt ID - provides the same information as the "Query by UniProt ID"
Figure 8. The amount of output information depends on the type of input.
Figure 9. Comparison of the result information obtained upon different query types in the DEGRONOPEDIA.
Regardless of the query type, it is possible to download an xlsx file with all the data, divided into separate sheets.
Look for the icon at the top of each result page.
A regular expression is a search pattern used to screen text for the presence of matching fragments; see examples below.
[AVP]x[ST][ST][ST]
means that there are 5 characters in the pattern
first character: A or V or P
second character: any (x indicates any character)
third character: S or T
fourth character: S or T
fifth character: S or T
F[^P]{3}W[^P]{2,3}[VIL]
means that there are eight or nine characters in the pattern
Note: {} brackets indicate the number of occurrences.
first character: F
second character: any except P
third character: any except P
fourth character: any except P
fifth character: W
sixth character: any except P
seventh character: any except P
eighth character: either one more character other than P, or V or I or L, in which case it is the final character
ninth character - only if the eighth character did not end the pattern: V or I or L
FSDLWKLL
the motif has to exactly match the pattern
^M{0,1}([ED])x
means that there are two or three characters in the pattern
Note: {} brackets indicate the number of occurrences.
Note 2: ^ indicates that the pattern has to match the very beginning of the sequence
first character: M occurs or does not occur
second character: E or D
third character: any
KxxR$
means that there are four characters in the pattern
Note: $ indicates that the pattern has to match the very end of the sequence.
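The example patterns above can be tried out locally with Python's standard `re` module. This is a sketch only: the helper names are hypothetical, and translating the `x` wildcard used on this page into the regular-expression dot is an assumption about how such motifs map onto standard regex syntax (the sequence is expected in upper case, so a lower-case `x` is never a residue).

```python
import re


def motif_to_regex(motif):
    """Compile a motif, treating lower-case 'x' as 'any character'."""
    return re.compile(motif.replace("x", "."))


def find_motifs(motif, sequence):
    """Return (start, end, match) tuples with 1-based inclusive positions."""
    pattern = motif_to_regex(motif)
    return [(m.start() + 1, m.end(), m.group())
            for m in pattern.finditer(sequence)]


hits = find_motifs("F[^P]{3}W[^P]{2,3}[VIL]", "AAFSDLWKLLAA")
```

Note that `finditer` reports non-overlapping matches only; overlapping occurrences of a motif would need a lookahead-based search.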
Gravy (grand average of hydropathy) hydrophobicity index is calculated by adding the hydropathy value for each residue and dividing by the length of the sequence3. Its higher values indicate that a sequence is more hydrophobic. In DEGRONOPEDIA, it is calculated for N-terminus (first 15 aa) and C-terminus (last 15 aa) of the input sequence.
Interpretation: hydrophobic regions often determine the specificity for recognition by chaperones and protein quality control E3s, but they are less likely to be recognized by cullin-RING E3 ligases4-7.
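The calculation described above can be sketched as follows, using the Kyte-Doolittle hydropathy scale3 and the 15-residue termini mentioned in the text. The function names are hypothetical and this is not the server's implementation.

```python
# Kyte-Doolittle hydropathy values for the 20 canonical amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """Sum of hydropathy values divided by the sequence length."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def terminal_gravy(sequence, n=15):
    """GRAVY of the first and of the last n residues of the sequence."""
    return gravy(sequence[:n]), gravy(sequence[-n:])
```

For example, a sequence that starts with 15 isoleucines and ends with 15 arginines yields the two extremes of the scale, 4.5 for the N-terminus and -4.5 for the C-terminus.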
The reported Protein Stability Index (PSI) values come from two large-scale studies that used the Global Protein Stability technology to measure the stability of 23-residue-long peptides spanning the N- and C-termini of the human proteome8-9.
The PSI values for the N-terminus provide information about the experimental stability of the first 23 residues/24 residues of the protein, depending on whether PSI was measured for the cleaved initiator methionine or not, respectively (for more on the co-translational cleavage of methionine, when it occurs and the associated Ac/N-degron pathway, see, for example, this review).
The PSI value for the C-terminus provides information about the experimental stability of the last 23 residues of the protein. Information on experimental PSI values is currently only available for human proteins.
Interpretation: the higher the PSI value, the more stable the terminus is. Please refer to the provided distributions of the experimental data for each terminus.
Figure 10. Visualization of experimental PSI values for N-/C-termini in the context of experimental data distribution. For the N-termini, the experimental PSI was measured in ranges of 1-6, while for the C-termini in ranges of 1-4.
Note: PSI is reported by the identity of the N-/C-terminus with the experimental data, not by the name of the protein due to possible inconsistencies in nomenclature (e.g., human protein A has an experimental C-terminal PSI value, but we are querying human protein B, whose name is absent from the experimental dataset, but which has an identical C-terminal peptide to protein A - so we report an identical C-terminal PSI value for protein B to that of protein A).
Machine Learning models for predicting Protein Stability Index (PSI) values for the N-/C-terminus of a query were developed based on experimental stability datasets for a 24/23-mer covering the N-/C-terminus of the human proteome (see N-/C-termini stability data) using the CatBoost regressor method.
The performance of the final models was evaluated using the testing set and an R2 coefficient, reaching the values of 0.796/0.812 for the N-terminus with initiator methionine cleaved/not cleaved, respectively, and 0.815 for the C-terminus (the highest possible value of R2 coefficient is 1).
See our publication for more details.
We recommend running N-/C-termini stability predictions only on proteins from higher mammals, as our models were trained on human protein stability datasets.
Interpretation: the higher the PSI value, the more stable the terminus is. Please refer to the provided distributions of the experimental data for each terminus.
Figure 11. Visualization of experimental PSI values for N-/C-termini in the context of experimental data distribution. For the N-termini, the experimental PSI was measured in ranges of 1-6, while for the C-termini in ranges of 1-4.
Note: If the initiator methionine is absent, only one PSI value is predicted for the case when it is cleaved.
Users interested in using our Machine Learning models to perform high-throughput predictions of protein N-/C-termini stability can use our standalone software available at github.com/filipsPL/degronopedia-ml-psi.
Upon Multiple Sequence Alignment (MSA) availability, four different degron conservation scores can be calculated (see scheme below), providing insight into the degron motif conservation among orthologs.
Figure 12. Scheme on calculating different degron conservation scores. Maximum distance from the degron motif in the query to its orthologs to consider them as conserved was defined as 5 aa. Gaps in the alignment within the degron motifs are marked with horizontal dashes.
DEGRONOPEDIA integrates pre-calculated MSAs of orthologs, obtained from the eggNOG5 database, at various evolutionary distances, which are available when querying by UniProt ID or by structure + UniProt ID. Regardless of the query type, the user can also submit their custom MSA in FASTA format containing no more than 200 sequences to check the conservation of each degron motif found in the query.
Table 1. Pre-calculated MSAs of orthologs at various evolutionary distances from the eggNOG5 database available for selected model organisms. NCBI taxonomy identifiers are given in parentheses.
Requires calculations to be performed on an AlphaFold2/RoseTTAFold model - the B-factor column in the PDB file must contain valid pLDDT/LDDT scores.
The pLDDT (Predicted Local Distance Difference Test; ranges 0-100) or LDDT (Local Distance Difference Test; ranges 0-1) score estimates the accuracy of the modeled residues, and those with pLDDT values above 70 are generally expected to be well modeled, while pLDDT below this value correlates with disordered regions10.
Note: When uploading a RoseTTAFold model, we recommend using a model obtained from a local RoseTTAFold run, as its B-factor column contains the LDDT scores. Please do not directly upload a RoseTTAFold model obtained from the ROBETTA server, as its B-factor column holds the estimated RMSD error (although it is possible to convert these values to LDDT scores using, e.g., PHENIX software).
Applies to any query type, as IUPred3 predicts disorder based on the query sequence.
IUPred3 predicts a disorder score, ranging from 0 to 1, for each amino acid in the sequence. The default threshold above which the residue can be considered as disordered is 0.5, according to the authors of this tool, but it can be adjusted in the IUPred3 disorder threshold parameter. You can read more about IUPred3 here.
Note: IUPred3 is run as a standalone tool using default settings (long disorder is predicted).
An intrinsically disordered region (IDR) is defined as a continuous stretch of at least the minimum number (set in the Minimum continuous IDR length parameter) of consecutive residues considered disordered. Which residues count as disordered depends on the disorder prediction method chosen by the user and on the corresponding threshold, set in either the pLDDT/LDDT disorder threshold or the IUPred3 disorder threshold parameter.
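The definition above amounts to finding runs of consecutive disordered residues that reach the minimum length. A minimal sketch, with hypothetical function names and IUPred3-style scores (0 to 1) used as example input:

```python
def iupred_flags(scores, threshold_pct=50):
    """Mark residues whose score is above the threshold as disordered."""
    return [s > threshold_pct / 100 for s in scores]


def find_idrs(disordered, min_length=10):
    """Return 1-based inclusive (start, end) of runs >= min_length."""
    idrs = []
    run_start = None
    for i, flag in enumerate(disordered):
        if flag and run_start is None:
            run_start = i                      # a new run begins here
        elif not flag and run_start is not None:
            if i - run_start >= min_length:    # run ended; long enough?
                idrs.append((run_start + 1, i))
            run_start = None
    # Handle a run that extends to the end of the sequence.
    if run_start is not None and len(disordered) - run_start >= min_length:
        idrs.append((run_start + 1, len(disordered)))
    return idrs
```

With the default of 10 aa, a stretch of 12 disordered residues is reported as an IDR, whereas a stretch of 4 is not.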
Relative solvent accessibility (RSA) of a protein residue is a measure of its solvent exposure.
It is calculated using the DSSP software and normalized using the Sander method.
A residue is considered buried if its RSA value is below the threshold defined in the Buried residue threshold parameter.
Interpretation: RSA values range from 0 to 1, where lower values indicate more buried residues.
literature - manually compiled datasets of non-canonical ubiquitination and arginylation
In total, DEGRONOPEDIA provides up to 32 different PTMs.
What is the importance of a degron being a phosphodegron?
A phosphodegron contains one or more phosphorylated residues, which may modulate the degron's accessibility. See review on this topic.
Guharoy and colleagues1 suggested a tripartite degron model where the primary degron is a short linear motif recognized by an E3 ligase, localized preferentially within an intrinsically disordered region (IDR) of the protein. The secondary degron refers to lysines to which ubiquitin may be attached, and the tertiary degron is an IDR close to the secondary degron, which acts as an unfolding seed initiating proteasome-dependent protein degradation.
The secondary and tertiary degrons are suggested to play subsidiary roles that affect ubiquitin signaling - the lack of a component of the tripartite degron model, e.g., an IDR near a ubiquitinated lysine, can result in non-proteolytic ubiquitination functions.
Figure 13. The tripartite degron model. Note that in our implementation, the secondary degron may not only be lysine (K), as ubiquitination can also occur on cysteines (C), serines (S) or threonines (T)2.
in sequence - amino acids downstream and upstream of the ends of the degron motif within the distance defined in the Degron flanking region in sequence parameter. Relevant to all query types.
in structure - residues around the degron motif (excluding the degron motif itself) within the structural distance defined in the Degron flanking region in structure parameter. Relevant to queries by UniProt ID or structure.
DEGRONOPEDIA reports not only all degron motifs present in the query protein, but also secondary and tertiary degrons according to the tripartite degron model.
We consider not only lysines (K) as potential secondary degrons but also cysteines (C), threonines (T) and serines (S), since ubiquitination may occur on these amino acids2.
Secondary degrons (also referred to as K/C/T/S) are searched for within the degron flanking regions in sequence and structure.
Note: Secondary degrons within the degron flanking regions in structure are NOT searched for when querying by sequence.
Figure 14. Scheme on search location for secondary degrons.
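The sequence-based part of this search can be sketched as a scan of the upstream and downstream flanking regions for K, C, T and S residues. The function name is hypothetical, positions are 1-based, and clipping the flanks at the sequence ends is an assumption.

```python
def secondary_degrons_in_sequence(sequence, motif_start, motif_end,
                                  flank=20, residues="KCTS"):
    """Return (position, residue) pairs found in the sequence flanks.

    motif_start and motif_end are 1-based inclusive positions of the
    degron motif; the motif itself is not part of the flanking region.
    """
    first = max(1, motif_start - flank)            # clip at the N-terminus
    last = min(len(sequence), motif_end + flank)   # clip at the C-terminus
    hits = []
    for pos in range(first, last + 1):
        if motif_start <= pos <= motif_end:
            continue                               # skip the motif itself
        if sequence[pos - 1] in residues:
            hits.append((pos, sequence[pos - 1]))
    return hits
```

The structure-based search would instead need residue coordinates and a distance cutoff in Å, which is why it is unavailable for sequence-only queries.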
Tertiary degrons are searched within the distance from secondary degrons defined in the Minimum IDR distance from the secondary degron (K/C/T/S) parameter. Tertiary degrons close in sequence are reported for all query types (as intrinsically disordered regions (IDRs) can be predicted from the query sequence using IUPred3), but those close in structure are reported only for query by UniProt ID or structure.
Note: Only the closest tertiary degron to each secondary degron is reported (both in terms of sequence and structure).
Figure 15. Scheme on reporting the tertiary degrons.
Protein turnover can be regulated by various proteolytic enzymes that cleave the protein, leading to the emergence of new N- and C-termini that can act as degrons11.
DEGRONOPEDIA simulates the cleavage of a query based on a user-defined cleavage motif/site, experimentally validated cleavage sites derived from the MEROPS database (the largest resource of experimental proteolysis data) as well as from the literature, and predicted cleavage sites for 35 different proteolytic enzymes using the Pyteomics module, which implements the PeptideCutter Expasy web server cleavage prediction rules. Each newly emerged N-/C-terminus is then screened for the presence of degron motifs.
Note 1: When a custom cleavage site is defined by index, e.g. as 80, the cleavage occurs after the given position (see picture below).
Note 2: Degrons are searched for in the emerged peptides provided that they are at least 50 amino acids long.
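The two notes above can be illustrated for a single index-defined cleavage site. This is a sketch with hypothetical function names; it covers only the splitting and the length filter, not the degron screening itself or the handling of several sites at once.

```python
def cleave_after(sequence, site):
    """Split the sequence after the 1-based position `site`."""
    if not 1 <= site < len(sequence):
        raise ValueError("site must leave residues on both sides of the cut")
    return sequence[:site], sequence[site:]


def termini_to_screen(sequence, site, min_length=50):
    """Return the emerged peptides that are long enough to be screened."""
    n_fragment, c_fragment = cleave_after(sequence, site)
    screened = {}
    if len(n_fragment) >= min_length:
        # The N-terminal fragment carries the newly emerged C-terminus.
        screened["new C-terminus"] = n_fragment
    if len(c_fragment) >= min_length:
        # The C-terminal fragment carries the newly emerged N-terminus.
        screened["new N-terminus"] = c_fragment
    return screened
```

For a 120-residue query cleaved after position 80, only the first fragment (80 aa) passes the 50 aa filter; the second fragment (40 aa) is too short to be screened.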
Since degrons act as a binding site for various E3 ubiquitin ligases, we report the E3s known to interact with the query based on interactome data from the:
Guharoy, M., Bhowmick, P., Sallam, M. & Tompa, P. Tripartite degrons confer diversity and specificity on regulated protein degradation in the ubiquitin-proteasome system. Nat Commun 7, 10239 (2016).
Squair, D. R. & Virdee, S. A new dawn beyond lysine ubiquitination. Nat Chem Biol 18, 802-811 (2022).
Kyte, J. & Doolittle, R. F. A simple method for displaying the hydropathic character of a protein. J Mol Biol 157, 105-132 (1982).
Hickey, C. M., Breckel, C., Zhang, M., Theune, W. C. & Hochstrasser, M. Protein quality control degron-containing substrates are differentially targeted in the cytoplasm and nucleus by ubiquitin ligases. Genetics 217, iyaa031 (2021).
Kats, I. et al. Mapping Degradation Signals and Pathways in a Eukaryotic N-terminome. Molecular Cell 70, 488-501.e5 (2018).
Stefanovic-Barrett, S. et al. MARCH6 and TRC8 facilitate the quality control of cytosolic and tail-anchored proteins. EMBO Rep 19, (2018).
Culver, J. A., Li, X., Jordan, M. & Mariappan, M. A second chance for protein targeting/folding: Ubiquitination and deubiquitination of nascent proteins. BioEssays 44, 2200014 (2022).
Koren, I. et al. The Eukaryotic Proteome Is Shaped by E3 Ubiquitin Ligases Targeting C-Terminal Degrons. Cell 173, 1622-1635.e14 (2018).
Timms, R. T. et al. A glycine-specific N-degron pathway mediates the quality control of protein N-myristoylation. Science 365, eaaw4912 (2019).
Tunyasuvunakool, K. et al. Highly accurate protein structure prediction for the human proteome. Nature 596, 590-596 (2021).
Varshavsky, A. N-degron and C-degron pathways of protein degradation. Proc. Natl. Acad. Sci. U.S.A. 116, 358-366 (2019).