#include
#include
#include
#include
#include "myutil.h"
/*
BAYESFST.C Created JAN05 - this version OCT06 - is an updated version
of NEWFST.C (described below) incorporating potential correlations in
Fst between adjacent markers due to tight linkage (causing linkage
disequilibrium). There is also a user option (set in input file) for
inclusion or exclusion of the interaction term, gamma.
(NEWFST.C last version OCT03 - similar to FSTMET.C but incorporates
new functional form for Fst, no longer includes a "total" population,
and updates the population allele proportions (p) with Dirichlet
proposal subject to a minimum, and assuming a uniform prior.)
The program is written in ANSI C, and should compile under most
compilers (PC and Unix). Compile via
CC bayesfst.c myutil.c -o bayesfst -lm
for example (substituting your compiler name for CC if
required). The file myutil.c must be available (e.g. in the same
directory) and it calls a file INTFILE.
Run the program via
BAYESFST inputfilename outputfilename logfilename
e.g. BAYESFST freq.inp fst.out fst.log
(substitute your executable filename for BAYESFST if you compile with a
different executable name). The file INTFILE must be in the directory
from which the program is run.
1. Overview
This is a Metropolis algorithm for estimating F_ST, the correlation of
two homologous genes sampled within the same subpopulation, relative
to the total population. Among other uses, F_ST is important in
calculating forensic match probabilities, as matches may be more
common between individuals in the same subpopulation (due to recent
shared ancestry of genes in the DNA profile), see Balding & Nichols
(1994, Forensic Science International 64, pp125-140). F_ST is also
interesting as an indicator of possible selection, see Lewontin &
Krakauer (1973, Genetics 74:175-195) and citations thereof, especially
Beaumont & Balding (2004, Molecular Ecology, 13:969-980).
2. Method
The method focuses on obtaining a sample from the posterior
(post-data) distribution of F_ST given the data (allele counts at
up to several thousand loci in several subpopulations), and
a prior (pre-data) distribution for F_ST.
The program uses a Metropolis algorithm in order to construct a Markov
chain whose stationary distribution is the posterior of F_ST. For more
details on Metropolis algorithms and other Markov chain Monte Carlo
(MCMC) methods see e.g. Gamerman (1997) "Markov Chain Monte
Carlo. Stochastic simulation for Bayesian inference" publ. Chapman &
Hall.
The proposal distribution used for generating new values of the model
parameters a_i, b_j and g_ij (g_ij is optional; see below for full
model) is Normal, centred on the current value (of a_i, b_j or g_ij
as appropriate), with standard deviation usa, usb, and usg.
The method combines information over loci and subpopulations in order
to simultaneously estimate F_ST at the i'th locus in the j'th
subpopulation (for all i and j). In order to do this, a hierarchical
model is implemented for F_ST^{i,j}
F_ST^{i,j} = exp(a_i + b_j + g_ij)/(1 + exp(a_i + b_j + g_ij))
where a_i and b_j are locus and population parameters respectively,
and g_ij are locus/population-specific parameters. Priors are placed
on the "hyperparameters" a_i, b_j and g_ij, which then imply a prior
for F_ST. The g_ij are present to account for cases when the model
just involving a_i and b_j does not fit well - the fit of the model
without g_ij can be investigated by inspecting the posterior density
of g_ij (if 0 is heavily supported for the g_ij, then the reduced
model without g_ij may be appropriate). Prior beliefs that the g_ij
are 0 (and the reduced model if appropriate) can readily be specified:
set the prior mean to 0, with a small standard deviation.
Alternatively, to reduce the computation time a 'gamma switch' can be
set to 0 in the input file which fixes each gamma term to zero.
The b_j and g_ij are assumed here to have independent normal priors.
The prior for a_i is also Normal but we incorporate the possibility
of a prior correlation between the values neighbouring loci. The
correlation is implemented via a simple autoregressive model, such
that the prior mean for a_i is equal to ca_{i-1}, where c is a
correlation parameter between 0 and 1. Currently there is a
common correlation parameter for all marker pairs, but it would be
easy to modify the program to read in a distinct correlation value
for each pair, to reflect knowledge about levels of linkage disequilibrium.
Allele frequencies in each subpopulation are assumed unknown and are modelled
by a Dirichlet distribution with parameters (1/F_ST -1)*p_k, where p_k
are the "total" (i.e. at a higher level than subpopulation) population
relative frequencies. The p_k for each locus are assumed unknown with
a jointly uniform prior distribution.
For further details see:
* DJ Balding, M Greenhalgh & RA Nichols (1996).
Population genetics of STR loci in Caucasians
International Journal of Legal Medicine 108, pp 300-305
* DJ Balding & RA Nichols (1997)
Significant genetic correlations among Caucasians at forensic DNA loci
Heredity 78, pp 583-589
* DJ Balding (2003)
Likelihood-based inference for genetic correlation coefficients.
Theor. Pop. Biol. 63, pp 221-230.
* MA Beaumont & DJ Balding (2004)
Identifying adaptive genetic divergence among populations from genome
scans, Molecular Ecology, 13: 969-980.
User-defined constants:
The following constants can be defined by the user (some may not need
changing from their defaults for most cases).
maxall
This is the maximum number of distinct allele-types allowed
at any locus (can be much greater than the actual number).
maxpop
This is the maximum number of subpopulations allowed.
maxloc
This is the maximum number of loci allowed in the dataset.
nout
This is the number of values to be simulated from the posterior
distribution of F_ST^{i,j}. Usually this number should not be
less than 1000 - more values will provide smoother estimation
of the posterior density (and will more accurately estimate
the posterior quantiles of the distribution).
kp
This is the thinning interval for the Metropolis algorithm.
The thinning interval is applied in order to reduce autocorrelation
in the output values (see literature on MCMC). Higher values
reduce correlation but increase compute time (approximately linearly).
numit - THIS SHOULD NOT BE CHANGED
The total number of iterations of the algorithm to perform,
after the burn-in. This is a function of the thinning interval
and the number of output values required.
acrt
The number of outputs of acceptance rates to the log file.
discard
The number of iterations discarded as ``burn-in''.
amu, asi
The mean and standard deviation of the Normal distribution
implemented as a prior for each a_i (similarly for bmu, bsi, gmu,
gsi for b_j and g_ij respectively)
cor
The correlation parameter, to incorporate prior information on the level
of correlation between adjacent markers
srk
The 'shrinkage value' for the sd of a_i. Conditional on the value of
a_{i-1}, the prior mean and standard deviation of a_i are cor*a_{i-1}
and srk*asi (in the case i=1, we proceed as if c=0). The unconditional
mean and standard deviation are then 0 and asi, irrespective of the value
of c.
With these constants, a Metropolis algorithm is run for "numit+discard"
iterations, the first "discard" of which are thrown away. Subsequently,
every "kp" iteration is output.
4. Input format
The input file should be a space or tab-delimited ASCII (i.e. text)
file. First, there are three header lines, containing:
1. interaction term (gamma) switch: 0 - for no interaction term
1 - to include interaction term
2. the number of subpopulations (npop)
3. the number of loci (nloc)
Then for locus i (i=1,..,nloc) there are npop+1 input lines, the first
containing the number of alleles at that locus (numall[i]) and the
remaining lines each contain the allele counts.
For example, for 3 subpops and 2 loci, with 2 and 4 alleles, where we
do not want to include the interaction gamma in the analysis, the input
should look like:
0
3
2
2
45 5
23 27
10 40
4
10 10 10 20
5 5 25 15
17 19 0 14
Spaces and blank lines can be included as desired. Unlike FSTMET.C, in this
program it is assumed that there is no data directly sampled from the "total"
population.
5. Output format
The output file contains the log posterior value then all the a_i,
followed by all the b_j and then all the g_ij (if included). These values
can be read into a statistical package, and posterior density curves for
F_ST for each locus i and population j can be estimated (using the model
for F_ST given in section 2 above).
The log file contains information about the specific run of
the program on the dataset, plus information about the dataset.
Furthermore, summary statistics (mean and quantiles) are
provided for the posterior distribution of each F_ST^{i,j}.
These can only be reasonably interpreted if convergence
and mixing are not thought to be a problem.
6. Important note
Before running this program, you should familiarise yourself with the
method, including convergence diagnostics. Convergence of the MCMC algorithm
should ALWAYS be checked. For example, the
output from this program (the a_i, b_j and g_ij) can be analysed in
the S-Plus based suite of functions CODA (e.g. Best, Cowles & Vines,
1995, CODA manual v.3, MRC Biostatistics Unit Cambr., see also v0.4
http://www.mrc-bsu.cam.ac.uk/bugs/documentation/contents.shtml).
At the very least, this program should be run several times on the
same dataset to ensure that approximately the same results are
obtained with different starting values for the algorithm (the random
number seeds are intialised via the system time, hence each run will
automatically use a different starting value without further action
from the user).
7. Credits and contact
The program has evolved from an original Pascal program written by
David Balding. The C version was coded by Karen Ayres, now at the
University of Reading, who also made substantial improvements. The
most recent changes have been implemented by Paul O'Reilly (Imperial
College).
For further details about the method, please contact
David Balding
Dept. Epidemiology and Public Health
Imperial College, St Mary's Campus,
Norfolk Place
London W2 1PG, UK
E-mail: d.balding@ic.ac.uk
*/
/* USER-DEFINED CONSTANTS */
#define maxall 50 /* maximum number of alleles allowed at a locus */
#define maxpop 20 /* maximum number of populations allowed in the data */
#define maxloc 1200 /* maximum number of loci allowed in the data set */
#define nout 2001L /* number of output values required */
/* PRIORS FOR a_i, b_j AND g_ij - NORMAL DISTRIBUTION (mu,si^2)*/
#define amu 0.0 /* (expected) mean of Normal distribution for prior of a_i */
#define asi 1.0 /* (expected) sd of Normal distribution for prior of a_i */
#define bmu -2.0 /* mean of Normal distribution for prior of b_j */
#define bsi 1.8 /* sd of Normal distribution for prior of b_j */
#define gmu 0.0 /* mean of Normal distribution for prior of g_ij */
#define gsi 0.5 /* sd of Normal distribution for prior of g_ij */
/* CONSTANT FOR PROPOSAL DISTRIBUTION */
#define usi 0.5 /* scale param for normal updates */
#define psi 1000 /* scale param for Dirichlet updates of p_ij */
/* CORRELATION PARAMETER (FIXED) BETWEEN ADJACENT MARKERS */
#define cor 0 /* the value of cor determines the shrinkage value
associated with the sd of each a_i so that the
overall - expected - sd is asi (defined above).
The shrinkage value (srk) is defined below */
#define srk sqrt(1-cor*cor) /* shrinkage value for sd of each a_i (see above) */
/* DECLARE VARIABLES */
typedef double pvec[maxpop*maxloc+1];
int i, j, k, nloc, npop, gammaswitch, n, c, Illegal, kp, numit, acrt, discard, r;
int apop[maxloc], numall[maxloc], popsum[maxloc][maxpop], x[maxloc][maxpop][maxall];
long ix, iy, iz, l1, jmp1, jmp2;
double ftp, rtp, z, newz, u, v, usa, usb, usg, p[maxloc][maxall], oldp[maxall];
double fst[nout], mnf, tmpd, lp[maxloc], newlp[maxloc], pa[maxloc+1], newpa[maxloc+1], pg[maxloc], newpg[maxloc];
pvec a, b, g, newa, newb, newg;
char *flnm;
FILE *inp, *out1, *out2, *out3;
/* PROTOTYPES */
void inproc();
void pargen(int i,double sc);
double rnorm();
double WichHill();
double ldnorm(double lx,double mu,double si);
double loggf(double xx);
double lmd(double f0,int i,int j);
double f(double w);
double ldiri(int k, double par[], double vec[]);
/* FUNCTIONS */
double WichHill()
{
/* Wichmann & Hill random number generator (Algorithm AS183 in Applied
Statistics 31 (1982))
*/
ix=(171*ix)%30269;
iy=(172*iy)%30307;
iz=(170*iz)%30323;
return fmod((ix/30269.0)+(iy/30307.0)+(iz/30323.0),1.0);
}
void inproc()
{
/* READ IN DATA FROM INP, AND OUTPUT VALUES TO LOG FILE */
/* READ IN DATA */
fscanf(inp,"%d\n%d\n%d",&gammaswitch,&npop,&nloc);
if (nloc>maxloc){
printf("ERROR - number of loci %d greater than maxloc\nIncrease value of constant maxloc\n",nloc);
exit(1);
}
if (npop>maxpop){
printf("ERROR - number of populations %d greater than maxpop\nIncrease value of constant maxpop\n",npop);
exit(1);
}
for(i=0;imaxall){
printf("ERROR - number of alleles at locus %d is %d, greater than maxall=%d\nIncrease value of maxall\n",i,numall[i],maxall);
exit(1);
}
for (j=0;jfst[jmp1]){
tmpd=fst[l1];
fst[l1]=fst[jmp1];
fst[jmp1]=tmpd;
}
}
}
/* CALCULATE PERCENTILES BY SETTING i'th ORDER STATISTIC EQUAL TO
(i-0.5)/n AND USING LINEAR INTERPOLATION
*/
fprintf(out1,"|Prior |%7f|",mnf);
/* 2.5th */
u=0.025*nout+0.5;
k=(int)(u);
if (k==u){
fprintf(out1,"%8f|",fst[k]);
}
else {
fprintf(out1,"%8f|",fst[k]+fmod(u,1.0)*(fst[k+1]-fst[k]));
}
/* 5th */
u=0.05*nout+0.5;
k=(int)(u);
if (k==u){
fprintf(out1,"%8f|",fst[k]);
}
else {
fprintf(out1,"%8f|",fst[k]+fmod(u,1.0)*(fst[k+1]-fst[k]));
}
/* 25th */
u=0.25*nout+0.5;
k=(int)(u);
if (k==u){
fprintf(out1,"%8f|",fst[k]);
}
else {
fprintf(out1,"%8f|",fst[k]+fmod(u,1.0)*(fst[k+1]-fst[k]));
}
/* 50th */
u=0.5*nout+0.5;
k=(int)(u);
if (k==u){
fprintf(out1,"%8f|",fst[k]);
}
else {
fprintf(out1,"%8f|",fst[k]+fmod(u,1.0)*(fst[k+1]-fst[k]));
}
/* 75th */
u=0.75*nout+0.5;
k=(int)(u);
if (k==u){
fprintf(out1,"%8f|",fst[k]);
}
else {
fprintf(out1,"%8f|",fst[k]+fmod(u,1.0)*(fst[k+1]-fst[k]));
}
/* 95th */
u=0.95*nout+0.5;
k=(int)(u);
if (k==u){
fprintf(out1,"%8f|",fst[k]);
}
else {
fprintf(out1,"%8f|",fst[k]+fmod(u,1.0)*(fst[k+1]-fst[k]));
}
/* 97.5th */
u=0.975*nout+0.5;
k=(int)(u);
if (k==u){
fprintf(out1,"%8f|\n",fst[k]);
}
else {
fprintf(out1,"%8f|\n",fst[k]+fmod(u,1.0)*(fst[k+1]-fst[k]));
}
fprintf(out1,"-----------------------------------------------------------------------------------\n\n");
fprintf(out1,"****** Dataset ******\n\n");
fprintf(out1,"Number of loci read in = %d\n",nloc);
fprintf(out1,"Number of subpopulations read in = %d\n\n",npop);
fprintf(out1,"Data ...\n");
fprintf(out1,"Loc Pop Sample size Allele counts\n");
for (i=0;i=1);
return u1*sqrt(-2*log(w)/w);
}
double loggf(double xx)
{
/* Adapted from the underlying algorithm in Numerical Recipes in C (1988)
[by Press et al., published by Cambridge University Press], for
calculating the log of the gamma function
*/
double lg=(xx+0.5)*log(xx+5.5)-xx-5.5, const1=1.000000000190015;
const1+= 76.1800917297146/(xx+1.0);
const1+= 24.01409824083091/(xx+3.0);
const1+= 0.001208650973866179/(xx+5.0);
const1-= 86.50532032941677/(xx+2.0);
const1-= 1.2317395572450155/(xx+4.0);
const1-= 0.000005395239384953/(xx+6.0);
lg+= log(2.5066282746310005*const1/xx);
return lg;
}
double lmd(double f0,int i,int j)
{
/* modified version of function by M.A. Beaumont for evaluating
multinomial-Dirichlet likelihood for locus i, population j
(excluding constant)
*/
int k;
double t1,a;
for (k=0,t1=0.0;k0){
p[i][k]+=(x[i][j][k]+1.0)/((popsum[i][j]+numall[i])*(npop-apop[i]));
}
fprintf(out1,"%7.3f",p[i][k]);
}
fprintf(out1,"\n");
}
fprintf(out1,"\n");
fprintf(out1,"****** Acceptance rates from algorithm ******\n\n");
fflush(out1);
/******************************************************/
/*BELOW: VERSION OF ALGORITHM WITH GAMMA TERM EXCLUDED*/
/******************************************************/
/* INTERACTION TERM SWITCH */
if (gammaswitch==0){
/* INITIALISE HYPERPARAMETER VECTORS (a, b) - GENERATE FROM PRIOR */
a[0]=amu+rnorm()*asi;
for(i=1;i0&&i0){
lp[i]+= lmd(f(a[i]+b[j]),i,j); /* LIKELIHOOD */
}
}
z+= lp[i];
}
/* RUN METROPOLIS ALGORITHM (for numit+discard iterations) */
usa=asi*usi;
usb=bsi*usi/sqrt(nloc);
jmp1=0; jmp2=0;
for (l1=-discard;l10){
newlp[i]+= lmd(f(a[i]+newb[j]),i,j);
}
}
newz+= newlp[i];
}
/* DECIDE WHETHER TO ACCEPT/REJECT CANDIDATE VECTORS */
if ((Illegal == 0) && (log(WichHill())0&&i0){
newlp[i]+= lmd(f(newa[i]+b[j]),i,j);}
}
/* DECIDE WHETHER TO ACCEPT/REJECT CANDIDATE VECTORS */
if ((Illegal == 0) && (log(WichHill())0) pa[i]= ldnorm(a[i],cor*a[i-1],srk*asi);
}
}
/* OUTPUT - **** THESE VALUES SHOULD BE CHECKED FOR CONVERGENCE**** */
if(l1 > 0){
if ((l1/kp)*kp==l1){
fprintf(out2,"%7.0f",z);
for (i=0;i0) && ((l1+discard)/acrt)*acrt==(l1+discard))
{ /* OUTPUT ACCEPTANCE RATE TO LOG FILE */
fprintf(out1,"Iter: %8d, Accpt. rate: %12.6f%12.6f\n",l1,((double)(jmp1)/acrt),((double)(jmp2)/acrt/nloc));
jmp1=0; jmp2=0; fflush(out1);
}
}
fclose(out2);
fclose(out1);
closegfsr();
}
/******************************************************************/
/* END OF ALGORITHM IF INTERACTION TERM, GAMMA, HAS BEEN EXCLUDED */
/******************************************************************/
/******************************************************/
/*BELOW: VERSION OF ALGORITHM WITH GAMMA TERM INCLUDED*/
/******************************************************/
/* INTERACTION TERM SWITCH */
else if (gammaswitch==1) {
/* INITIALISE HYPERPARAMETER VECTORS (a, b, g) - GENERATE FROM PRIOR */
a[0]=amu+rnorm()*asi;
for(i=1;i0&&i0){
lp[i]+= lmd(f(a[i]+b[j]+g[i*npop+j]),i,j); /* LIKELIHOOD */
}
}
z+= lp[i];
}
/* RUN METROPOLIS ALGORITHM (for numit+discard iterations) */
usa=asi*usi;
usb=bsi*usi/sqrt(nloc);
usg=gsi*usi/sqrt(npop);
jmp1=0; jmp2=0;
for (l1=-discard;l10){
newlp[i]+= lmd(f(a[i]+newb[j]+g[i*npop+j]),i,j);
}
}
newz+= newlp[i];
}
/* DECIDE WHETHER TO ACCEPT/REJECT CANDIDATE VECTORS */
if ((Illegal == 0) && (log(WichHill())0&&i0){
newlp[i]+= lmd(f(newa[i]+b[j]+newg[i*npop+j]),i,j);}
}
/* DECIDE WHETHER TO ACCEPT/REJECT CANDIDATE VECTORS */
if ((Illegal == 0) && (log(WichHill())0) pa[i]= ldnorm(a[i],cor*a[i-1],srk*asi);
}
}
/* OUTPUT - **** THESE VALUES SHOULD BE CHECKED FOR CONVERGENCE**** */
if(l1 > 0){
if ((l1/kp)*kp==l1){
fprintf(out2,"%7.0f",z);
for (i=0;i0) && ((l1+discard)/acrt)*acrt==(l1+discard))
{ /* OUTPUT ACCEPTANCE RATE TO LOG FILE */
fprintf(out1,"Iter: %8d, Accpt. rate: %12.6f%12.6f\n",l1,((double)(jmp1)/acrt),((double)(jmp2)/acrt/nloc));
jmp1=0; jmp2=0; fflush(out1);
}
}
fclose(out2);
fclose(out1);
closegfsr();
}
}