What is ProteDNA?
Summary:
Predicting which residues of a protein interact with DNA has attracted considerable research interest in recent years. Because the tertiary structures of a large number of transcription factors (TFs) have been reported to be mostly disordered (1), sequence-based analysis that identifies the residues of a highly disordered TF that play key roles in DNA interaction is essential for a comprehensive picture of how the TF functions. Protein-DNA interactions involve two types of binding mechanism: specific and non-specific. Specific binding occurs between protein sidechains and nucleotide bases, while non-specific binding occurs between protein sidechains and the DNA sugar/phosphate backbone. In molecular biology, specific binding corresponds to sequence-specific recognition of a gene and is therefore essential for correct gene regulation.
This article presents the design of ProteDNA, a sequence-based predictor for identifying the specific binding residues in a TF. ProteDNA is distinctive in employing a hybrid approach aimed at achieving superior performance. The LIBSVM package (2) makes the predictions for residues that PSIPRED (3) predicts to be in an alpha-helix or coil segment of secondary structure, whereas the SSEA (Secondary Structure Element Alignment) mechanism proposed by Gewehr and Zimmer (4) makes the predictions for residues that PSIPRED predicts to be in a beta-sheet segment.
To evaluate the performance of ProteDNA, we first created a data set of 228 TF-DNA complexes extracted from the 691 protein-DNA complexes that Ofran et al. (5) collected from the Protein Data Bank (PDB) (6).
In building this data set, we excluded the complexes in the Ofran collection that do not contain a TF, and then queried the Pfam server (7) to exclude the complexes in which no polypeptide segment lies within a DNA-binding domain predicted by Pfam. From the 228 TF-DNA complexes, we randomly selected 30 to form the independent testing data set and ensured that no remaining complex used for training shares more than 25% sequence identity with a protein chain in the independent testing data set. In the independent test, ProteDNA delivered an overall sensitivity of 59.5%, specificity of 98.8%, precision of 77.4%, and accuracy of 96.3%. Furthermore, ProteDNA delivers much higher precision than existing predictors of DNA-binding residues. In this respect, note that ProteDNA is the only predictor designed to identify specific binding residues. We emphasize precision because it gives the biochemist a level of confidence when designing an experiment to confirm whether a predicted binding residue really interacts with the DNA.
ProteDNA is available at http://serv.csbb.ntu.edu.tw/ProteDNA/ as well as at http://bio222.esoe.ntu.edu.tw/ProteDNA/. It has been up and running for 3 months and has been used by a group of 30 graduate students enrolled in a bioinformatics class. Users outside our laboratory have submitted around 300 protein chains.
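The hybrid design described in the summary, with one predictor for helix and coil residues and another for beta-sheet residues, can be sketched as a simple per-residue dispatch. This is a minimal illustration, not the ProteDNA implementation. It assumes the secondary structure arrives as a three-state string (H for helix, E for strand, C for coil, the states PSIPRED reports). The names `route_predictions`, `toy_svm` and `toy_ssea` are hypothetical, and the two toy predictors are placeholders that stand in for the LIBSVM and SSEA components without reproducing them.

```python
from typing import Callable, List

# Hypothetical signature: (sequence, residue index) -> True if predicted to bind.
ResiduePredictor = Callable[[str, int], bool]

SVM_STATES = frozenset("HC")  # helix and coil residues go to the SVM branch
SSEA_STATES = frozenset("E")  # strand residues go to the SSEA branch


def route_predictions(
    sequence: str,
    ss_states: str,
    svm_predictor: ResiduePredictor,
    ssea_predictor: ResiduePredictor,
) -> List[bool]:
    """Return one binding call per residue, chosen by its secondary-structure state."""
    if len(sequence) != len(ss_states):
        raise ValueError("sequence and secondary-structure string differ in length")
    calls = []
    for index, state in enumerate(ss_states):
        if state in SVM_STATES:
            calls.append(svm_predictor(sequence, index))
        elif state in SSEA_STATES:
            calls.append(ssea_predictor(sequence, index))
        else:
            raise ValueError(f"unexpected secondary-structure state: {state!r}")
    return calls


# Toy stand-ins so the sketch runs; they are NOT LIBSVM or SSEA.
def toy_svm(sequence: str, index: int) -> bool:
    return sequence[index] in "RK"  # arbitrary rule chosen for illustration only


def toy_ssea(sequence: str, index: int) -> bool:
    return False  # placeholder: never predicts binding


if __name__ == "__main__":
    # Made-up sequence and secondary-structure string, for illustration only.
    print(route_predictions("MKRAEL", "CHHEEC", toy_svm, toy_ssea))
```

Keeping the two predictors behind the same call signature means either branch can be replaced or retrained without touching the routing logic.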
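For readers less familiar with the four measures reported in the summary, the sketch below computes them from per-residue labels using the standard confusion-matrix definitions. The function names and the example labels are hypothetical and chosen only for illustration. They are not ProteDNA data and do not reproduce the figures above.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(predicted: Sequence[bool], actual: Sequence[bool]) -> Tuple[int, int, int, int]:
    """Count (TP, FP, TN, FN) over per-residue binding labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual labels differ in length")
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    return tp, fp, tn, fn


def _ratio(numerator: int, denominator: int) -> float:
    """Divide safely; an undefined ratio is reported as NaN."""
    return numerator / denominator if denominator else float("nan")


def binding_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard definitions of the four measures named in the summary."""
    return {
        "sensitivity": _ratio(tp, tp + fn),  # binding residues that were found
        "specificity": _ratio(tn, tn + fp),  # non-binding residues left alone
        "precision": _ratio(tp, tp + fp),    # predicted binders that are real
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
    }


if __name__ == "__main__":
    # Hypothetical labels for ten residues (True = specific binding residue).
    actual = [True, True, True, False, False, False, False, False, False, False]
    predicted = [True, True, False, True, False, False, False, False, False, False]
    for name, value in binding_metrics(*confusion_counts(predicted, actual)).items():
        print(f"{name}: {value:.3f}")
```

When binding residues are a small minority of all residues, specificity and accuracy are dominated by the many non-binding residues, which is one reason precision is the more informative figure for choosing which predictions to test experimentally.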
Keyterms: protein-DNA binding, specific binding, transcription factor, prediction
References
1. Liu, J., Perumal, N.B., Oldfield, C.J., Su, E.W., Uversky, V.N. and Dunker, A.K. (2006) Intrinsic disorder in transcription factors. Biochemistry, 45, 6873-6888.
2. Chang, C.-C. and Lin, C.-J. (2001) LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
3. McGuffin, L.J., Bryson, K. and Jones, D.T. (2000) The PSIPRED protein structure prediction server. Bioinformatics, 16, 404-405.
4. Gewehr, J.E. and Zimmer, R. (2006) SSEP-Domain: protein domain prediction by alignment of secondary structure elements and profiles. Bioinformatics, 22, 181-187.
5. Ofran, Y., Mysore, V. and Rost, B. (2007) Prediction of DNA-binding residues from sequence. Bioinformatics, 23, i347-353.
6. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E. (2000) The Protein Data Bank. Nucleic Acids Res, 28, 235-242.
7. Finn, R.D., Mistry, J., Schuster-Bockler, B., Griffiths-Jones, S., Hollich, V., Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R. et al. (2006) Pfam: clans, web tools and services. Nucleic Acids Res, 34, D247-251.