The DomFOLD Method for the Prediction of Protein Domain Boundaries
------------------------------------------------------------------

Version 2.0 (March 2009)
(c) Liam J. McGuffin

Description
-----------

DomFOLD v 2.0 combines the output from DomSSEA, DISOPRED and HHsearch to form a consensus domain prediction.

References
----------

This software is free and you may copy it or use it in any other applications, so long as it is properly referenced. Please cite the following references for DomSSEA, DISOPRED, HHsearch and PDP:

Marsden, R., McGuffin, L. J. & Jones, D. T. (2002) Rapid protein domain assignment from amino acid sequence using predicted secondary structure. Protein Science, 11, 2814-2824.

Ward, J. J., Sodhi, J. S., McGuffin, L. J., Buxton, B. F. & Jones, D. T. (2004) Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J. Mol. Biol., 337, 635-645.

Söding, J. (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics, 21, 951-960.

Alexandrov, N. & Shindyalov, I. (2003) PDP: protein domain parser. Bioinformatics, 19, 429-30.

This version of DomFOLD is also dependent on PSI-BLAST, PSIPRED and MODELLER. Please also cite the appropriate references for these tools.

Installation
------------

No installation is required for the DomFOLD program itself after you have downloaded the file. The program is provided in the form of an executable jar file (DomFOLD.jar) and is designed to run on Linux operating systems. This version of the program has been tested on recent versions of Ubuntu and CentOS, but it should work on most versions of Linux that have bash installed.

Requirements (you may already have many of these programs installed):

1. A recent version of Java (java.com/getjava/).

2. The PDP program (ftp://ftp.ncifcrf.gov/pub/SARF2/PDP/pdp.gz).

3. A recent version of PSI-BLAST (ftp://ftp.ncbi.nih.gov/blast/executables/LATEST/) and a sequence database (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz).
   It is recommended that you filter your sequence database using pfilt (http://bioinf.cs.ucl.ac.uk/downloads/pfilt/pfilt.c) prior to running PSI-BLAST. The steps are shown below:

   Make sure your C compiler works (e.g. in Ubuntu):
       sudo apt-get install build-essential

   Compile pfilt:
       cc -O pfilt.c -lm -o pfilt

   Run pfilt on the sequence database:
       pfilt nr > nrfilt

   Then format your database using the formatdb program, which is part of the blast package:
       formatdb -i nrfilt -o T -t nrfilt

4. The PSIPRED program (http://bioinf.cs.ucl.ac.uk/downloads/psipred/).

5. The HHsearch program (e.g. ftp://toolkit.lmb.uni-muenchen.de/HHsearch/HHsearch1.5.0/hh_1.5.0.linux64.tar.gz), a compatible template library (e.g. ftp://toolkit.lmb.uni-muenchen.de/HHsearch/databases/pdb70_7Feb09.hhm.tar.gz) and a calibration file (ftp://toolkit.lmb.uni-muenchen.de/HHsearch/databases/cal.hmm).

   You may need to concatenate the hhm files for the template library:
       gunzip pdb70_7Feb09.hhm.tar.gz
       tar xvf pdb70_7Feb09.hhm.tar
       find -name \*.hhm -exec cat {} > pdb70_7Feb09.hhm.tmp \;
       find -name \*.hhm -exec rm {} \;
       mv pdb70_7Feb09.hhm.tmp pdb70_7Feb09.hhm

   You will also need to edit the following Perl scripts:

   hhmakemodel.pl - you will need to edit the headers of this script according to Johannes Soeding's instructions and make sure you set the relevant line in the script to:
       my $pdbdir="./";

   addpsipred.pl - you will need to edit the headers of this script to point towards your installations of PSIBLAST and PSIPRED, e.g.
       my $ncbidir="/home/liam/programs/blast-2.2.19/bin"; # Put the directory path with the BLAST executables
       my $perl="/home/liam/programs/bin"; # Put the directory path where reformat.pl is lying
       my $dummydb="./"; # Put the name given to the dummy blast directory (or leave this name)
       my $psipreddir="/home/liam/programs/psipred"; # Put the directory path with the PSIPRED executables

   Finally, download this file - ftp://toolkit.lmb.uni-muenchen.de/HHsearch/HHsearch1.2.0/alignblast.pl - and make sure it is executable:
       chmod +x alignblast.pl

   All HHsearch related binaries should be placed in the same directory. The database file can be in a separate directory.

6. A recent version of MODELLER (http://salilab.org/modeller/) and a license key (http://salilab.org/modeller/registration.html).

7. The DISOPRED program (http://bioinf.cs.ucl.ac.uk/downloads/DISOPRED/).

8. An internet connection. The program downloads PDB files to use with MODELLER. Please make sure wget is installed on your machine (e.g. in Ubuntu):
       sudo apt-get install wget

Running the program
-------------------

You can edit the shell script (DomFOLD.sh) or you can follow the steps below.

1. Set the environment variables for the PDP, PSIBLAST, HHsearch, MODELLER and DISOPRED executables and databases.

   For example, if your PDP executable is located in "/home/Liam/programs/bin/", then enter the following command:
       export PDP=/home/Liam/programs/bin/pdp

   Likewise, set up the environment variables for the following:
       export PSIBLAST=/home/Liam/programs/blast-2.2.19/bin/blastpgp
       export BLAST_DB=/home/Liam/data/blastdb/nrfilt
       export HHSEARCH_HOME=/home/Liam/programs/bin/    <-- Note that this directory should hold all HHsearch related binaries
       export HHSEARCH_CAL=/home/Liam/data/hhsearch/cal.hhm
       export HHSEARCH_DB=/home/Liam/data/hhsearch/hhsearch_db.hhm
       export KEY_MODELLER9v5=XXXXXXXXXXXX    <-- Note you should obtain a license key for MODELLER before attempting to run DomFOLD.
       export MODELLER=/usr/bin/mod9v5
       export DISOPRED=/home/Liam/programs/disopred2/bin/disopred
       export DISOPRED_DB=/home/Liam/programs/disopred2/data/

   Please check that these paths are correct for your installation before proceeding. You may want to set up a shell script (see DomFOLD.sh) for this or add the lines to your .bashrc or .bash_profile file.

2. (optional) Set the environment variable for Java, if you have not installed it system wide, e.g.
       export JAVA_HOME=/home/Liam/jdk1.6.0/

3. Run DomFOLD.

   For example, if your target is called "T0417", the sequence file is "/home/liam/T0417.fasta" and your output directory is "/home/liam/T0417_output/", then enter the following:
       $JAVA_HOME/bin/java -jar DomFOLD.jar T0417 /home/liam/T0417.fasta /home/liam/T0417_output/

   Or, if you have java installed system wide:
       java -jar DomFOLD.jar T0417 /home/liam/T0417.fasta /home/liam/T0417_output/

   The sequence file should be in FASTA format.

   IMPORTANT: Please also note that you should use FULL PATHS for your input file and output directory, and that the output directory should end with a "/".

Output
------

A number of different output files are produced in the output directory (e.g. "/home/liam/T0417_output/") and a log of the progress is written to the screen as standard output. Please check all of the output files in each directory. If you have any output files with zero bytes, you may not have set up your environment variables correctly. Please make sure that you have set the paths correctly before emailing me with any problems.

A description of the output files follows:

1. The final DomFOLD output file - this file will consist of the target name plus "_DomFOLD.out", e.g. "T0417_DomFOLD.out". This file conforms to the CASP DP data format (http://predictioncenter.org/casp8/index.cgi?page=format#DP).

2. The initial DomFOLD output file - this file will consist of the target name plus ".pred.domout", e.g. "T0417.pred.domout".
   This file contains the individual domain predictions made using the DomSSEA method, the DISOPRED putative flexible domain linker regions and the HHsearch multi-template model parsed with PDP, together with a consensus prediction from all three of these methods.

3. The DISOPRED output files, e.g. "T0417.disopred", "T0417.horiz_d".

4. The multi-template model built using HHsearch and MODELLER, e.g. "_multi_HHsearch_TS1".

5. A time-stamped subdirectory of data will also be created. This directory will contain the output files from PSIBLAST, HHsearch, PSIPRED and MODELLER.

Troubleshooting
---------------

Email me: l.j.mcguffin@reading.ac.uk

I will try to respond to your issue as soon as I can!

Thanks,

Liam
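
Appendix: checking your setup (sketch)
--------------------------------------

The advice in the Output section (zero-byte output files usually mean an environment variable is wrong) can be checked before and after a run with a few lines of shell. The following is only a minimal sketch and is not part of the DomFOLD distribution: the function names check_domfold_env and list_empty_outputs are hypothetical, and it assumes that PDP, PSIBLAST, MODELLER and DISOPRED point at executable files, as in the example export lines in "Running the program".

```shell
#!/bin/sh
# Hypothetical helper functions (NOT shipped with DomFOLD).
# Variable names are those listed in "Running the program".

# Report every DomFOLD variable that is unset or empty, and every
# executable path that does not point at an executable file.
# Returns 0 if no problems were found, non-zero otherwise.
check_domfold_env() {
    problems=0
    for name in PDP PSIBLAST BLAST_DB HHSEARCH_HOME HHSEARCH_CAL \
                HHSEARCH_DB KEY_MODELLER9v5 MODELLER DISOPRED DISOPRED_DB; do
        # Look up the value of the variable whose name is in $name
        eval "value=\${$name}"
        if [ -z "$value" ]; then
            echo "NOT SET: $name"
            problems=$((problems + 1))
        fi
    done
    # Assumption: these four variables hold paths to executables
    for name in PDP PSIBLAST MODELLER DISOPRED; do
        eval "value=\${$name}"
        if [ -n "$value" ] && [ ! -x "$value" ]; then
            echo "NOT EXECUTABLE: $name=$value"
            problems=$((problems + 1))
        fi
    done
    [ "$problems" -eq 0 ]
}

# List zero-byte files below an output directory (first argument).
list_empty_outputs() {
    find "$1" -type f -size 0
}
```

To use it, save the functions to a file of your choice, source that file in the shell where you set the environment variables, and run check_domfold_env before starting DomFOLD. After a run, list_empty_outputs /home/liam/T0417_output/ prints any empty output files, including those in the time-stamped subdirectory.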