The DISOclust Method for the Prediction of Intrinsic Protein Disorder
---------------------------------------------------------------------

Version 1.1 (Feb 2009)
(c) Liam J. McGuffin

Description
-----------

DISOclust v 1.1 makes use of the ModFOLDclust method in order to analyse the
variation in 3D models built using HHsearch alignments. Further accuracy may be
gained by placing additional 3D models of the target protein in the output
directory.

References
----------

This software is free and you may copy it or use it in any other applications,
so long as it is properly referenced.

Please cite the following reference for DISOclust:

McGuffin, L. J. (2008) Intrinsic disorder prediction from the analysis of
multiple protein fold recognition models. Bioinformatics, 24, 1798-804.

This version of DISOclust is dependent on the following tools: PSI-BLAST,
PSIPRED, TMscore, HHsearch, MODELLER and DISOPRED. Please also cite the
appropriate references for these tools.

Installation
------------

No installation is required for the DISOclust program itself after you have
downloaded the file. The program is provided in the form of an executable jar
file (DISOclust.jar) and is designed to run on Linux operating systems. This
version of the program has been tested on recent versions of Ubuntu and CentOS,
but it should work on most versions of Linux that have bash installed.

Requirements (you may already have many of these programs installed):

1. A recent version of Java (java.com/getjava/).

2. The TMscore program (http://zhang.bioinformatics.ku.edu/TM-score/). Please
   ensure the TMscore program is working on your system before attempting to
   run ModFOLDclust. Ensure that you have the correct 32bit/64bit version for
   your hardware and that the TMscore file is made executable:
       chmod +x TMscore

3. A recent version of PSI-BLAST
   (ftp://ftp.ncbi.nih.gov/blast/executables/LATEST/) and a sequence database
   (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz).
   It is recommended that you filter your sequence database using pfilt
   (http://bioinf.cs.ucl.ac.uk/downloads/pfilt/pfilt.c) prior to running
   PSI-BLAST. Steps are shown below:

   Make sure your C compilers work (e.g. in Ubuntu):
       sudo apt-get install build-essential

   Compile pfilt (the maths library is linked after the source file so that
   the command also works with linkers that resolve libraries in order):
       cc -O pfilt.c -lm -o pfilt

   Run pfilt on the sequence database:
       pfilt nr > nrfilt

   Then format your database using the formatdb program, which is part of the
   blast package:
       formatdb -i nrfilt -o T -t nrfilt

4. The PSIPRED program (http://bioinf.cs.ucl.ac.uk/downloads/psipred/).

5. The HHsearch program
   (e.g. ftp://toolkit.lmb.uni-muenchen.de/HHsearch/HHsearch1.5.0/hh_1.5.0.linux64.tar.gz),
   a compatible template library
   (e.g. ftp://toolkit.lmb.uni-muenchen.de/HHsearch/databases/pdb70_7Feb09.hhm.tar.gz)
   and a calibration file
   (ftp://toolkit.lmb.uni-muenchen.de/HHsearch/databases/cal.hmm).

   You may need to concatenate the hhm files for the template library:
       gunzip pdb70_7Feb09.hhm.tar.gz
       tar xvf pdb70_7Feb09.hhm.tar
       find . -name \*.hhm -exec cat {} \; > pdb70_7Feb09.hhm.tmp
       find . -name \*.hhm -exec rm {} \;
       mv pdb70_7Feb09.hhm.tmp pdb70_7Feb09.hhm

   You will also need to edit the following perl scripts:

   hhmakemodel.pl - you will need to edit the headers of this script according
   to Johannes Soeding's instructions and make sure you set the relevant line
   in the script to:
       my $pdbdir="./";

   addpsipred.pl - you will need to edit the headers of this script to point
   towards your installations of PSIBLAST and PSIPRED, e.g.
       my $ncbidir="/home/liam/programs/blast-2.2.19/bin"; # Put the directory path with the BLAST executables
       my $perl="/home/liam/programs/bin"; # Put the directory path where reformat.pl is lying
       my $dummydb="./"; # Put the name given to the dummy blast directory (or leave this name)
       my $psipreddir="/home/liam/programs/psipred"; # Put the directory path with the PSIPRED executables

   Finally download this file -
   ftp://toolkit.lmb.uni-muenchen.de/HHsearch/HHsearch1.2.0/alignblast.pl
   and make sure it is executable:
       chmod +x alignblast.pl

   All HHsearch related binaries should be placed in the same directory. The
   database file can be in a separate directory.

6. A recent version of MODELLER (http://salilab.org/modeller/) and a license
   key (http://salilab.org/modeller/registration.html).

7. The DISOPRED program (http://bioinf.cs.ucl.ac.uk/downloads/DISOPRED/).

8. An internet connection. The program downloads PDB files to use with
   MODELLER. Please make sure wget is installed on your machine (e.g. in
   Ubuntu):
       sudo apt-get install wget

Running the program
-------------------

You can edit the shell script (DISOclust.sh) or you can follow the steps below.

1. Set the environment variables for the TMscore, PSIBLAST, HHsearch, MODELLER
   and DISOPRED executables and databases. For example, if your TMscore
   executable is located in "/home/Liam/programs/bin/", then enter the
   following command:
       export TMSCORE=/home/Liam/programs/bin/TMscore

   Likewise, set up the environment variables for the following:
       export PSIBLAST=/home/Liam/programs/blast-2.2.19/bin/blastpgp
       export BLAST_DB=/home/Liam/data/blastdb/nrfilt
       export HHSEARCH_HOME=/home/Liam/programs/bin/ # Note that this directory should hold all HHsearch related binaries
       export HHSEARCH_CAL=/home/Liam/data/hhsearch/cal.hhm
       export HHSEARCH_DB=/home/Liam/data/hhsearch/hhsearch_db.hhm
       export KEY_MODELLER9v5=XXXXXXXXXXXX # Note you should obtain a license key for MODELLER before attempting to run DISOclust.
       export MODELLER=/usr/bin/mod9v5
       export DISOPRED=/home/Liam/programs/disopred2/bin/disopred
       export DISOPRED_DB=/home/Liam/programs/disopred2/data/

   Please check these paths are correct for your installation before
   proceeding. You may want to set up a shell script (see DISOclust.sh) for
   this or add the lines to your .bashrc or .bash_profile file.

2. (optional) Set the environment variable for Java, if you have not installed
   it system wide, e.g.
       export JAVA_HOME=/home/Liam/jdk1.6.0/

3. Run DISOclust. For example, if your target is called "T0417", the sequence
   file is "/home/liam/T0417.fasta" and your output directory is
   "/home/liam/T0417_output/", then enter the following:
       $JAVA_HOME/bin/java -jar DISOclust.jar T0417 /home/liam/T0417.fasta /home/liam/T0417_output/

   Or, if you have java installed system wide:
       java -jar DISOclust.jar T0417 /home/liam/T0417.fasta /home/liam/T0417_output/

   Please ensure that any models you supply are provided as separate files in
   PDB format. The sequence file should be in FASTA format.

   IMPORTANT: Please also note that you should use FULL PATHS for your input
   file and output directory, and the output directory should end with a "/".

Output
------

A number of different output files are produced in the output directory (e.g.
"/home/liam/T0417_output/") and a log of the progress is written to the screen
as standard output. Please check all of the output files in each directory. If
you have any output files with zero bytes you may not have set up your
environment variables correctly. Please make sure that you have set the paths
correctly before emailing me with any problems.

A description of the output files follows:

1. The final DISOclust output file - this file will consist of the target name
   plus ".disoclust2", e.g. "T0417.disoclust2". This file conforms to the CASP
   DR data format (http://predictioncenter.org/casp8/index.cgi?page=format#DR).

2. The initial DISOclust output file - this file will consist of the target
   name plus ".disoclust1", e.g. "T0417.disoclust1". This file conforms to the
   CASP DR data format
   (http://predictioncenter.org/casp8/index.cgi?page=format#DR). You may
   combine the output scores from this file with those from another disorder
   prediction method, in order to add value to predictions. An increase in the
   AUC score of approx. 3-5% can be gained from adding information from this
   file to the scores produced by most other methods (McGuffin, 2008).

3. The QMODE2 output file from ModFOLDclust - this file will consist of the
   target name plus "_ModFOLDclust.out", e.g. "T0417_ModFOLDclust.out". This
   file conforms to the CASP QA QMODE2 data format
   (http://predictioncenter.org/casp8/index.cgi?page=format#QA).

4. The sorted data file - this file will consist of the target name plus
   "_ModFOLDclust.sort", e.g. "T0417_ModFOLDclust.sort". This file contains
   the same data as the QMODE2 file but without the headers and in a more
   convenient machine readable format. Also in this file are the scores for
   the comparisons of your models with models produced using HHsearch. This
   file is worth checking as a better model may have been made!

5. The DISOPRED output files - e.g. T0417.disopred, T0417.horiz_d

6. If you include your own models in the output directory, then you will also
   receive the following:

   B-factor files - these have the extension "*.bfact", e.g.
   "nFOLD3_TS1.bfact". These files contain your original model with the
   predicted per-residue error entered into the B-factor column. If you open
   these files using PyMOL or RasMol you can colour your models according to
   the predicted errors with the b-factor/temperature colouring options.

   Gnuplot files - these have the extension "*.gnuplot", e.g.
   "nFOLD3_TS1.gnuplot".
   These files contain data for each model which can be plotted using
   gnuplot, for example using the following script:

       set terminal postscript color
       set output "nFOLD3_TS1.ps"
       set boxwidth 1
       set style fill solid 0.25 border
       set ylabel "Predicted residue error (Angstroms)"
       set xlabel "Residue number"
       set yrange [0:15]
       set yzeroaxis
       unset key
       set datafile missing "NaN"
       plot "nFOLD3_TS1.gnuplot" using 1:2 with boxes,\
            "nFOLD3_TS1.gnuplot" using 1:3 with points
       quit

7. A time-stamped subdirectory of data will also be created. This directory
   will contain the output files from PSIBLAST, HHSEARCH, PSIPRED and
   MODELLER.

Troubleshooting
---------------

Email me: l.j.mcguffin@reading.ac.uk

I will try to respond to your issue as soon as I can!

Thanks,

Liam
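
Supplementary note on step 1 of "Running the program": the sketch below is not
part of DISOclust. It is a minimal bash function, with the hypothetical name
check_disoclust_env, that reports which of the environment variables named in
this README are unset or point to a path that does not exist. It assumes bash
(for indirect variable expansion) and treats BLAST_DB and KEY_MODELLER9v5 as
values that only need to be set, since the first is a database name and the
second is a license key rather than a path.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; not shipped with DISOclust.
# Prints one line per problem and returns non-zero if any problem was found.

check_disoclust_env() {
    local problems=0 name value

    # Variables that should hold the path of an existing file or directory.
    for name in TMSCORE PSIBLAST HHSEARCH_HOME HHSEARCH_CAL HHSEARCH_DB \
                MODELLER DISOPRED DISOPRED_DB; do
        value="${!name:-}"              # indirect expansion: value of $name
        if [ -z "$value" ]; then
            echo "NOT SET: $name"
            problems=$((problems + 1))
        elif [ ! -e "$value" ]; then
            echo "MISSING PATH: $name=$value"
            problems=$((problems + 1))
        fi
    done

    # Variables that only need a non-empty value (assumption: BLAST_DB is a
    # database name and KEY_MODELLER9v5 is a license key, not a path).
    for name in BLAST_DB KEY_MODELLER9v5; do
        if [ -z "${!name:-}" ]; then
            echo "NOT SET: $name"
            problems=$((problems + 1))
        fi
    done

    [ "$problems" -eq 0 ]
}
```

One possible use, once the function has been pasted into your shell or into
your own copy of DISOclust.sh, is to run DISOclust only when the check passes:
    check_disoclust_env && java -jar DISOclust.jar T0417 /home/liam/T0417.fasta /home/liam/T0417_output/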
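
Supplementary note on the "Output" section: output files with zero bytes
usually point to a problem with the environment variables. The sketch below,
again not part of DISOclust, lists every empty file under an output directory,
including the time-stamped subdirectory. The function name list_empty_outputs
is hypothetical; it relies only on the standard find options -type and -size.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not shipped with DISOclust.
# Prints the path of every zero-byte regular file below the given directory.

list_empty_outputs() {
    local outdir="$1"
    # -size 0 matches files that occupy zero blocks, i.e. empty files.
    find "$outdir" -type f -size 0 -print
}
```

For example, after a run has finished:
    list_empty_outputs /home/liam/T0417_output/
If nothing is printed, no empty output files were found.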