EVOLUTION-MANAGER

Edit File: geneticmaps.Rnw

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Karl W. Broman
% The new multiple-qtl mapping functions
%
% This is an "Sweave" document; see the corresponding PDF.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass[12pt]{article}

\usepackage{times}
\usepackage{amsmath}
\usepackage{url}
\usepackage{color}

% revise margins
\setlength{\headheight}{0.0in}
\setlength{\topmargin}{-0.25in}
\setlength{\headsep}{0.0in}
\setlength{\textheight}{9.00in}
\setlength{\footskip}{0.5in}
\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}
\setlength{\textwidth}{6.5in}

\setlength{\parskip}{6pt}
\setlength{\parindent}{0pt}

\newcommand{\code}{\texttt}
\newcommand{\lod}{\text{LOD}}

% make "Figure" within figure captions a small font
\renewcommand{\figurename}{\small Figure}

\begin{document}

\SweaveOpts{prefix.string=Figs/fig, eps=TRUE}
\setkeys{Gin}{width=\textwidth} %% <- change width of figures
% Try to get the R code from running into the margin
<<echo=FALSE>>=
options(width=87, digits=3, scipen=4)
set.seed(61777369)
@

% function (to be used later) to round numbers in a nicer way.
<<myround,echo=FALSE,results=hide>>=
library(broman) # includes myround() and holmans triangle stuff
@

% Change S input/output font size
\DefineVerbatimEnvironment{Sinput}{Verbatim}{fontsize=\footnotesize, baselinestretch=0.75, formatcom = {\color[rgb]{0, 0, 0.56}}}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{fontsize=\footnotesize, baselinestretch=0.75, formatcom = {\color[rgb]{0.56, 0, 0}}}

\begin{center}

\vspace*{1in}

\textbf{\Large Genetic map construction with R/qtl}

\bigskip \bigskip

{\large Karl W. Broman}

\bigskip \bigskip

{\color{blue} \tt \bfseries University of Wisconsin-Madison \\
Department of Biostatistics \& Medical Informatics \\
Technical Report \# 214 \\[24pt]
4 November 2010

\color{red} Revised 21 March 2012: \color{black} some function names
have changed \\
(e.g., \verb|plot.geno| is now \verb|plotGeno|)
}

\end{center}

\bigskip \bigskip

\textbf{\sffamily Abstract}: Genetic map construction remains an important
prerequisite for quantitative trait loci analysis in
organisms for which genomic sequence is not available.  Surprisingly
little has been written about the construction of genetic maps,
particularly regarding the many practical issues involved.  We provide
an overview of the process, including a description of the key
facilities in the R/qtl software.  The process is illustrated through
simulated data on an F$_2$ intercross derived from two inbred strains.

\vfill

\noindent \emph{email}: \verb|broman@wisc.edu|

\newpage

\textbf{\sffamily Introduction}

Mouse geneticists no longer have to be concerned with genetic map
construction, as the available genomic sequence provides the physical
locations and order of genetic markers.  Scientists interested in
mapping quantitative trait loci (QTL) in other organisms, however,
often must first spend considerable time identifying polymorphic
markers and then constructing a genetic map specifying the chromosomal
locations of the markers.

Surprisingly little has been written about the construction of genetic
maps.  Certainly, papers describing seminal genetic maps include some
description of the methods used, but students desire a general
overview of the process along with a discussion of key issues that
arise and how to overcome them.  I attempt to provide such an overview
here, along with a description of the key facilities in R/qtl
(Broman et al., \emph{Bioinformatics}
18:889--890, 2003; see \url{https://rqtl.org}) for genetic map construction.  While I
describe the use of R/qtl for this purpose in considerable detail, I
hope that my general comments are useful for, and that the R code
is not an obstacle to, readers interested in the process generally or
in other software.

R/qtl is principally aimed at the analysis of simple crosses
(particularly backcrosses and intercrosses) derived from a pair of
inbred strains.  R/qtl includes the ability to analyze doubled
haploids or a panel of recombinant inbred lines or even a phase-known
four-way cross, but here I will focus on the simple case of an
intercross between two inbred strains.

\bigskip
\textbf{\sffamily Experimental design issues}
\nopagebreak

Before proceeding with our discussion of genetic map construction, let
us first describe some of the design issues that a scientist should
consider before embarking on such a project.

One must first ask: what sort of cross should be performed?  As I
have in mind a QTL mapping application, I would suggest using the
same cross (that is, the same material) for both constructing a
genetic map and for the subsequent QTL mapping.  It is likely the case
that one will need a larger cross for the QTL mapping than for
map construction, and also one probably should use a larger
set of markers for map construction than for QTL mapping.  Thus the
ideal strategy may be to produce a large cross, for QTL mapping, but
genotype only a portion of that cross on the full set of markers.
Once the genetic map is constructed, one may then identify a smaller
set of markers, which cover the genome as evenly as possible, to
genotype on the remainder of the cross.

In spite of the fact that map construction often requires fewer
individuals than QTL mapping, as a general rule one should use more
than the minimal number of individuals for map construction.  Issues
such as genotyping errors and segregation distortion are more easily
overcome with a larger cross population.

Similarly, one should aim for a much larger set of genetic markers
than simple calculations about genome size might indicate to be
sufficient.  In organisms with a mature complement of genomic
resources, one may choose an ideal set of markers that evenly cover
the genome and that are all well behaved and easily genotyped.  In the
\emph{de novo\/} construction of a genetic map, however, markers will
be developed at random, and so some regions will be densely covered
with markers and other regions will be quite thin with markers.
Variation in recombination rate (as well as in marker density and
polymorphism) will exacerbate this issue.  Moreover, one often finds
that some markers are very difficult to genotype, and rather than
spend considerable time optimizing such markers (such as by
redesigning primers or changing PCR conditions), it would be most
expedient if such poorly behaved markers could simply be omitted.  If
one has at hand more markers than are really necessary, it is less of
a concern to be dropping 10\% of them.

In the actual genotyping, one should always include DNA from the
parental lines and F$_1$ individuals as controls, preferably on each
plate.  Ideally, include the \emph{actual\/} parents and F$_1$
individuals; if residual heterozygosity is identified, the genotypes
of these progenitors will be extremely useful for making sense of the
data.  Additional blind duplicates are useful to give some sense of
the overall genotyping error rate.

\bigskip
\textbf{\sffamily Load the data}
\nopagebreak

I will consider, as an illustration, a set of simulated data.  Real
data might be preferred, but by considering simulated data, I can be
sure that they feature all of the various issues that I wish to
illustrate.  Moreover, these features can be made quite striking, so
that there will be little question here of what to do.  In practice,
aberrant features in the data will often be less than clear and may
require additional genotyping, or at least a reconsideration of the
raw genotyping results, before the appropriate action is clear.

The data set is called \code{mapthis} and is included in
R/qtl version 1.19-10 and above.  This is an intercross between two
inbred lines, with 300 individuals genotyped at 100 markers.  There is
just one ``phenotype,'' which contains individual identifiers.  The
100 markers were chosen at random from five chromosomes of length 200,
150, 100, 75 and 50~cM, respectively.  The chromosomal locations of
the markers are to be treated as unknown, though the marker names
do contain this information: marker \code{C2M4} is the fourth marker on
chromosome~2.   (Note that the chromosomes are all autosomes.  I'm
not considering the X chromosome in this tutorial.)

To load the data, in R, one first needs to load the R/qtl package,
using the function \code{library()}.  The data set is then loaded via
\code{data()}.

% load the data
<<loaddata>>=
library(qtl)
data(mapthis)
@

In practice, one would use the function \code{read.cross()} to load a
data set into R.  See the help file (type \code{?read.cross}),
look at the sample data sets at \url{https://rqtl.org/sampledata},
and consider Chapter~2 of Broman and Sen (2009) \emph{A Guide to
QTL Mapping with R/qtl} (\url{https://rqtl.org/book}).

In assembling the sort of comma-delimited file that R/qtl can read,
one should assign all markers to the same chromosome (which can be
named whatever is convenient, ``\code{1}'' or ``\code{un}'' or
whatever).  There is no need to assign map positions to markers.

A data file for the \code{mapthis} data is at
\url{https://rqtl.org/tutorials/mapthis.csv}.  It can be read into
R as follows.  (Note the use of \code{estimate.map=FALSE}.  Markers
will be assigned dummy locations, with no attempt to estimate
inter-marker distances.)

% read the data (illustration)
<<readdata,eval=FALSE>>=
mapthis <- read.cross("csv", "https://rqtl.org/tutorials", "mapthis.csv",
                      estimate.map=FALSE)
@

\bigskip
\textbf{\sffamily Omit individuals and markers with lots of missing data}
\nopagebreak

After importing a new data set, the first thing I look at is a summary
of the cross.

% summary
<<summarycross>>=
summary(mapthis)
@
I look to make sure that the cross type (in this case, an F$_2$
intercross) was determined correctly and that the numbers of
individuals and markers is as expected.  The function
\code{summary.cross()} also performs a variety of basic checks of the
integrity of the data, and so I'd pay attention to any warning or
error messages.

Next, I look at the pattern of missing data, through the function
\code{plotMissing()}.

% plot missing data pattern
<<plotmissing,eval=FALSE>>=
plotMissing(mapthis)
@

\begin{figure}
\centering
<<plotmissingplot,fig=TRUE,echo=FALSE,height=3.5>>=
par(mar=c(4.1,4.1,0.6,1.1))
plotMissing(mapthis, main="")
@
\caption{Pattern of missing genotype data in the \code{mapthis} dataset.  Black
  pixels indicate missing genotypes.\label{fig:plotmissing}}
\end{figure}

The result (in Fig.~\ref{fig:plotmissing}) indicates several
individuals with a great deal of missing data (horizontal lines), as well as several
markers with a great deal of missing data (vertical lines).

The function \code{ntyped()} provides the numbers of genotyped markers
for each individual (or the number of genotyped individuals for each
marker).  Let us plot these.  (And note that there is a related
function, \code{nmissing()}, which provides the number of missing
genotypes for each individual or marker.)

% plot of number of genotypes
<<plotntyped,eval=FALSE>>=
par(mfrow=c(1,2), las=1)
plot(ntyped(mapthis), ylab="No. typed markers", main="No. genotypes by individual")
plot(ntyped(mapthis, "mar"), ylab="No. typed individuals",
     main="No. genotypes by marker")
@

\begin{figure}
\centering
<<plotntypedplot,fig=TRUE,echo=FALSE,height=3>>=
par(mfrow=c(1,2), las=1, cex=0.8)
plot(ntyped(mapthis), ylab="No. typed markers", main="No. genotypes by individual")
plot(ntyped(mapthis, "mar"), ylab="No. typed individuals",
     main="No. genotypes by marker")
@
\caption{Plot of number of genotyped markers for each individual (left
  panel) and number of genotyped individuals for each marker (right
  panel).\label{fig:plotntyped}}
\end{figure}

As seen in Fig.~\ref{fig:plotntyped}, there are six individuals
missing almost all genotypes, and there are four markers that are
missing genotypes at about half of the individuals.  Such appreciable
missing data often indicates a problem (either bad DNAs or
difficult-to-call markers) and can lead to difficulties in the genetic
map construction, and so it is best, at this stage, to omit them,
though one might consider adding them back in later.

To omit the individuals with lots of missing genotype data, we may use
the function \code{subset.cross()}, as follows.

% drop individuals with missing data
<<dropind>>=
mapthis <- subset(mapthis, ind=(ntyped(mapthis)>50))
@

To omit the markers with lots of missing data, we first need to
identify the names of the markers.  We then use \code{drop.markers()}.

% drop markers with missing data
<<dropmarkers>>=
nt.bymar <- ntyped(mapthis, "mar")
todrop <- names(nt.bymar[nt.bymar < 200])
mapthis <- drop.markers(mapthis, todrop)
@

%We now have 294 individuals and 96 markers:
%% nind; totmar
%<<nindtotmar>>=
%nind(mapthis)
%totmar(mapthis)
%@

\bigskip
\textbf{\sffamily Identify duplicate individuals}
\nopagebreak

I find it useful to compare the genotypes between all pairs of
individuals, in order to reveal pairs with unusually similar
genotypes, which may indicate sample duplications or monozygotic
twins.  In either case, we will want to remove one individual from
each pair.  Such duplicates are not common, but they are also not
rare.

We use the function \code{comparegeno} to compare the genotypes for
all pairs of individuals.  The output is a matrix, whose contents are
the proportions of markers at which the pairs have matching
genotypes.

% compare individuals' genotypes
<<comparegeno, eval=FALSE>>=
cg <- comparegeno(mapthis)
hist(cg[lower.tri(cg)], breaks=seq(0, 1, len=101), xlab="No. matching genotypes")
rug(cg[lower.tri(cg)])
@

\begin{figure}
\centering
<<comparegenoplot,fig=TRUE,echo=FALSE,height=3>>=
cg <- comparegeno(mapthis)
par(mar=c(4.1,4.1,0.1,0.6),las=1)
hist(cg[lower.tri(cg)], breaks=seq(0, 1, len=101), xlab="No. matching genotypes",
     main="")
rug(cg[lower.tri(cg)])
@
\caption{Histogram of the proportion of markers for which pairs of
  individuals have matching genotypes.\label{fig:comparegenoplot}}
\end{figure}

As seen in Fig.~\ref{fig:comparegenoplot}, a pair of F$_2$ siblings
typically share genotypes at
\Sexpr{round(mean(cg[lower.tri(cg)])*100,-1)}\% of the markers.  But
there are some pairs with well over 90\% matching genotypes.  We may
identify these pairs as follows.

% identify matching pairs
<<matchingpairs>>=
wh <- which(cg > 0.9, arr=TRUE)
wh <- wh[wh[,1] < wh[,2],]
wh
@

We may inspect the genotype matches for these pairs with the following.

% matching genotypes
<<matchinggenotypes>>=
g <- pull.geno(mapthis)
table(g[144,], g[292,])
table(g[214,], g[216,])
table(g[238,], g[288,])
@

As seen above, in each case, the pairs have matching genotypes at all
but 2 or 3 markers.  Ideally, one would go back to the records to try
to assess the source of these problems (e.g., are the pairs from the
same litters?).  Here, we will simply omit one individual from each
pair.  But first we will omit the genotypes that mismatch, as these are
indicated to be errors in one or the other individual (or both).  The
R code is a bit complicated.

% zero out the mismatches
<<dropmismatches>>=
for(i in 1:nrow(wh)) {
  tozero <- !is.na(g[wh[i,1],]) & !is.na(g[wh[i,2],]) & g[wh[i,1],] != g[wh[i,2],]
  mapthis$geno[[1]]$data[wh[i,1],tozero] <- NA
}
@

Now, we omit one individual from each pair.

% omit individuals
<<omitdup>>=
mapthis <- subset(mapthis, ind=-wh[,2])
@

It's also useful to look for duplicate markers (that is, markers with
identical genotypes).  This is particularly true for the case of very
large sets of markers; multiple markers with identical genotypes will
invariably map to the same location, and so one might as well thin out
the markers so that there are no such duplicates, as the extra markers
simply slow down all of our analyses.  The function
\code{findDupMarkers()} may be used to identify markers with matching
genotypes.  (Note that the function \code{drop.dupmarkers()} is for
dropping markers with matching \emph{names}, and considers the
genotypes only in order to establish consensus genotypes across the
multiple markers with the same name).  Here, though, there are no
markers with matching genotypes.

% look for duplicate markers
<<finddupmar>>=
print(dup <- findDupMarkers(mapthis, exact.only=FALSE))
@

\bigskip
\textbf{\sffamily Look for markers with distorted segregation patterns}
\nopagebreak

We next study the segregation patterns of the markers.  We expect the
genotypes to appear with the frequencies 1:2:1.  Moderate departures
from these frequencies are not unusual and may indicate the presence
of partially lethal alleles.  Gross departures from these frequencies often
indicate problematic markers that should be omitted, at least
initially: monomorphic markers (that is, where the two parental lines
actually have the same allele), or markers that are especially
difficult to call (e.g., AA are often called as AB).  We use the
function \code{geno.table} to inspect the segregation patterns.
It calculates the genotype frequencies and also a P-value for a test
of departure from the 1:2:1 expected ratios.  We will focus on those
markers that show significant distortion at the 5\% level, after a Bonferroni
correction for the multiple tests.

% inspect segregation patterns
<<genotable>>=
gt <- geno.table(mapthis)
gt[gt$P.value < 0.05/totmar(mapthis),]
@
% $

The first of these markers, \code{C4M2}, is not terrible.  The others
appear to be monomorphic with a few errors (\code{C2M9} and
\code{C2M15}) or have one genotype that is quite rare, which likely
indicates difficulties in genotyping.  It would be best to omit the
worst of these markers.

% drop bad markers
<<dropbadmarkers>>=
todrop <- rownames(gt[gt$P.value < 1e-10,])
mapthis <- drop.markers(mapthis, todrop)
@
% $

\bigskip
\textbf{\sffamily Study individuals' genotype frequencies}
\nopagebreak

Just as we expect the markers to segregate 1:2:1, we expect the
individuals to have genotype frequencies that are in approximately
those proportions.  Studying such genotype frequencies may help to
identify individuals with high genotyping error rates or some other
labeling or breeding mistake.

There's no R/qtl function to to pull out the genotype frequencies by
individual, but we can write a bit of R code to do so.  It is a bit
nasty, with calls to \code{apply()}, \code{table()}, \code{factor()},
\code{colSums()} and \code{t()}, but hopefully the reader can figure
this out after a bit of exploration and introspection.

<<genofreqbyind,eval=FALSE>>=
g <- pull.geno(mapthis)
gfreq <- apply(g, 1, function(a) table(factor(a, levels=1:3)))
gfreq <- t(t(gfreq) / colSums(gfreq))
par(mfrow=c(1,3), las=1)
for(i in 1:3)
  plot(gfreq[i,], ylab="Genotype frequency", main=c("AA", "AB", "BB")[i],
       ylim=c(0,1))
@

\begin{figure}
\centering
<<plotgenofreqbyind,fig=TRUE,echo=FALSE,height=3.5>>=
g <- pull.geno(mapthis)
gfreq <- apply(g, 1, function(a) table(factor(a, levels=1:3)))
gfreq <- t(t(gfreq) / colSums(gfreq))
par(mfrow=c(1,3), las=1)
for(i in 1:3)
  plot(gfreq[i,], ylab="Genotype frequency", main=c("AA", "AB", "BB")[i],
       ylim=c(0,1))
@
\caption{Genotype frequencies by individual.\label{fig:plotgenofreqbyind}}
\end{figure}

The results in Fig.~\ref{fig:plotgenofreqbyind} do not indicate any
particular problems, though the small number of short chromosomes result
in considerable variability, including one individual with
no BB genotypes, and the frequencies of
AB genotypes varies from
\Sexpr{round(min(gfreq[2,])*100)}--\Sexpr{round(max(gfreq[2,])*100)}\%.
However, if there were an individual with $\sim$ 100\% AA or BB
genotypes (like one of the parental strains), we would see it here.

A more fancy, and potentially more clear view of these genotype
frequencies is obtained by representing them as points in an
equilateral triangle (see Fig.~\ref{fig:triangleplot}).  For any point
within an equilateral triangle, the sum of the distances to the three
sides is constant, and so one may represent a trinomial distribution as
a point within the triangle.  I won't show the code, but note that the
red X in the center of the figure corresponds to the expected genotype
frequencies (1/4, 1/2, 1/4).  Each black dot corresponds to an
individual's genotype frequencies.  There's one point along the right
edge; this corresponds to the individual with no BB genotypes.  Again,
the small size of the genome results in enormous variation, and so no
one individual stands out as unreasonable.

\begin{figure}
\centering
<<triangleplot,fig=TRUE,echo=FALSE,height=3,results=hide>>=
par(mar=rep(0.1,4), pty="s")
triplot(labels=c("AA","AB","BB"))
tripoints(gfreq, cex=0.8)
tripoints(c(0.25, 0.5, 0.25), col="red", lwd=2, cex=1, pch=4)
@
\caption{Genotype frequencies by individual, represented on an
  equilateral triangle.  The red X indicates the expected frequencies
  of 1:2:1.\label{fig:triangleplot}}
\end{figure}

\bigskip
\textbf{\sffamily Study pairwise marker linkages; look for switched alleles}
\nopagebreak

We are now in a position to begin the genetic map construction.  We
start by assessing the linkage between all pairs of markers.  The
function \code{est.rf()} is used to
estimate the recombination fraction between each pair and to calculate
a LOD score for a test of rf = 1/2.  But first note that, in the presence
of appreciable segregation distortion (which is not the case for these
data), unlinked markers can appear to be linked.  For example,
consider the following 2$\times$2 table of two-locus genotypes in a backcross.

\begin{center}
  \begin{tabular}{cccc} \hline
     & \multicolumn{2}{c}{\textbf{Marker 2}} \\ \cline{2-3}
    \textbf{Marker 1} & AA & AB & overall\\ \hline
    AA    &  243 & 27 & 270 \\
    AB    &   27 &  3 &  30 \\ \hline
  overall &  270 & 30 & 300 \\ \hline
  \end{tabular} \end{center}

In this case, the two markers are segregating independently but show
severe segregation distortion. (They segregate 9:1 rather than  1:1.)
The usual estimate of the recombination fraction is (27+27)/300 =
0.18, and the LOD score for the test of rf = 1/2 is $\sim$28.9.

If segregation distortion is rampant in a dataset (and such things do
happen), the usual tests of pairwise linkage will give distorted
results, and so it is best to instead use a simple chi-square or
likelihood ratio test, to assess the association between markers.  The
function \code{markerlrt()} behaves just like \code{est.rf()}, but
uses a general likelihood ratio test in place of the usual test of
pairwise linkage.

Now back to the data, which are not so badly behaved.

% pairwise linkages
<<pairwiselinkage>>=
mapthis <- est.rf(mapthis)
@
\begin{Schunk}
\begin{Soutput}
Warning message:
In est.rf(mapthis) : Alleles potentially switched at markers
  C3M16 C2M16 C1M2 C3M9 C2M14 C1M24 C1M1 C2M12 C1M36 C3M1 C2M25 C1M22 C5M5 C5M7 C1M17 C5M1
  C3M5 C1M15 C2M24 C2M17 C1M23 C5M6 C1M16 C3M2 C3M10 C3M6 C2M13
\end{Soutput}
\end{Schunk}

Note the warning message, which indicates that there are numerous
markers with likely switched alleles (A $\leftrightarrow$ B).  This is
indicated through pairs of markers that are strongly indicated to be
associated but have estimated recombination fractions $\gg$ 1/2.
The \code{checkAlleles()} function gives more detailed information on
this issue.

% checkAlleles
<<checkAlleles>>=
checkAlleles(mapthis, threshold=5)
@

The final column in the output for a marker, \code{diff.in.max.LOD}, is the
difference between the maximum LOD score for the cases where the
estimated recombination fraction is $>$ 1/2 and the maximum LOD score
for the cases where the estimated recombination fraction is $<$ 1/2.
There are a large number of markers that are tightly associated with other
markers, but with recombination fractions well above 1/2, which
indicates that some markers likely have their alleles switched.

A plot of the LOD scores against the estimated recombination fractions
for all marker pairs will give another good view of this problem.  We
use the function \code{pull.rf()} to pull out the recombination
fractions and LOD scores as matrices.

<<lodvrf,eval=FALSE>>=
rf <- pull.rf(mapthis)
lod <- pull.rf(mapthis, what="lod")
plot(as.numeric(rf), as.numeric(lod), xlab="Recombination fraction", ylab="LOD score")
@

\begin{figure}
\centering
<<lodvrfplot,fig=TRUE,echo=FALSE,height=3>>=
rf <- pull.rf(mapthis)
lod <- pull.rf(mapthis, what="lod")
par(mar=c(4.1,4.1,0.6,0.6), las=1, cex=0.8)
plot(as.numeric(rf), as.numeric(lod), xlab="Recombination fraction", ylab="LOD score")
@
\caption{Plot of LOD scores versus estimated recombination fractions
  for all marker pairs.\label{fig:lodvrf}}
\end{figure}

As seen in Fig.~\ref{fig:lodvrf}, there are many marker pairs with
large LOD scores but estimated recombination fractions $\gg$ 1/2.

One solution to this problem is to form initial linkage groups,
ensuring that markers with rf $>$ 1/2 are placed in different groups.
If all goes well, each chromosome will come out as a pair of linkage
groups: one containing markers with correct alleles and another
containing markers with switched alleles.

We use the function \code{formLinkageGroups()} to infer linkage
groups.  It uses the pairwise linkage results from \code{est.rf()}
(or, pairwise association information from \code{markerlrt()}).
The function has two main arguments: \code{max.rf} and
\code{min.lod}.  Two markers will be placed in the same linkage groups
if they have estimated recombination fraction $\le$ \code{max.rf} and
LOD score $\ge$ \code{min.lod}.  The linkage groups are closed via the
transitive property.  That is, if markers a and b are linked and
markers b and c are linked, then all three are placed in the same
linkage group.  I generally start with \code{min.lod} relatively large
(say 6 or 8 or even 10).  The appropriate value depends on the number
of markers and chromosomes (and individuals).  The aim is to get as
many truly linked markers together, but avoid completely putting
unlinked markers in the same linkage group.  It is usually easier to
combine linkage groups after the fact rather than to have to split
linkage groups apart.

<<forminitialgroups>>=
lg <- formLinkageGroups(mapthis, max.rf=0.35, min.lod=6)
table(lg[,2])
@

The output of \code{formLinkageGroups()} is a matrix with two columns:
the initial linkage group or chromosome for each marker, and then the
assigned linkage group, as inferred from the pairwise linkage
information.  The inferred linkage groups are numbered in decreasing
order of size (so that linkage group 1 has the largest number of
markers).  Here we see that \Sexpr{length(table(lg[,2]))} linkage
groups were inferred, with the last \Sexpr{sum(table(lg[,2])==1)}
groups having just one marker.  One may play with \code{max.rf} and
\code{min.lod} until the result is about as expected.  Here, I was
hoping for about 10 linkage groups (since there are five chromosomes but
each will likely be split in two due to the allele switching problem),
and so the results seem okay.

Since we are happy with the results, we can reorganize the markers
into these inferred linkage groups.  We do so with the same function,
\code{formLinkageGroups()}, via the argument \code{reorgMarkers}.

<<reorganizemarkers>>=
mapthis <- formLinkageGroups(mapthis, max.rf=0.35, min.lod=6, reorgMarkers=TRUE)
@

A plot of the pairwise recombination fractions and LOD scores may
indicate how well this worked.

<<plotrf,eval=FALSE>>=
plotRF(mapthis, alternate.chrid=TRUE)
@

\begin{figure}
\centering
<<plotrfplot,fig=TRUE,echo=FALSE,height=5.5>>=
par(mar=c(4.1,4.1,2.1,2.1), las=1)
plotRF(mapthis, main="", alternate.chrid=TRUE)
@
\caption{Plot of estimated recombination fractions (upper-left
  triangle) and LOD scores (lower-right triangle) for all pairs of
  markers.  Red indicates linked (large LOD score or small
  recombination fraction) and blue indicates not linked (small LOD
  score or large recombination fraction).\label{fig:plotrf}}
\end{figure}

The result, in Fig.~\ref{fig:plotrf}, looks about as expected.  The
markers within linkage group 1 are linked to each other.  The pattern
within the group is a bit random, but that is because we have not yet
ordered the markers in any way.

Note that linkage groups 4 and 5 are
associated with each other, in that they have large LOD scores (the
lower-right rectangle for groups 4 and 5 is largely red), but
their recombination fractions are not small (the upper-left rectangle
for groups 4 and 5 is strongly blue).  I would infer from this that
these markers belong to the same chromosome, but the alleles in one or
the other group are switched.  We can't know which is correct, and
ideally one would have parental genotype data to help fix these
problems, but it is not unreasonable, for now, to assume that the
larger group of markers corresponds to the ones with the correct
alleles.

Similarly, groups 6 and 7 belong together, group 8 belongs to
group 2, and groups 9--11 belong to group 1.

We can look at this more clearly by picking out a marker from one
group and studying the recombination fractions and LOD scores for that
marker against all others.  Let us pick the third marker in linkage
group 4.  We create the objects \code{rf} and \code{lod} again, using
\code{pull.rf()}.  These objects have class \code{"rfmatrix"}, and so \code{plot}
below actually goes to the function \code{plot.rfmatrix()}, which
plots the values for a single marker in a display similar to a set of
LOD curves.

<<plotrfonemarker,eval=FALSE>>=
rf <- pull.rf(mapthis)
lod <- pull.rf(mapthis, what="lod")
mn4 <- markernames(mapthis, chr=4)
par(mfrow=c(2,1))
plot(rf, mn4[3], bandcol="gray70", ylim=c(0,1), alternate.chrid=TRUE)
abline(h=0.5, lty=2)
plot(lod, mn4[3], bandcol="gray70", alternate.chrid=TRUE)
@

\begin{figure}
\centering
<<plotrfonemarkerplot,fig=TRUE,echo=FALSE,height=4>>=
par(mar=c(4.1,4.1,1.1,0.6), las=1)
<<plotrfonemarker>>
@
\caption{Plot of estimated recombination fractions (top panel) and
  LOD scores (bottom panel) for the marker \code{\Sexpr{mn4[3]}} against all
  other markers.\label{fig:plotrfonemarker}}
\end{figure}

As seen in Fig.~\ref{fig:plotrfonemarker}, the marker
\code{\Sexpr{mn4[3]}} is strongly associated with markers in both
linkage groups 4 and 5, but the estimated recombination fractions
between \code{\Sexpr{mn4[3]}} and the markers on linkage group 4 are
all $<$ 1/2, while the estimated recombination fractions
between \code{\Sexpr{mn4[3]}} and the markers on linkage group 5 are
all $>$ 1/2.  We can see the problem even more clearly by inspecting
a few tables of two-locus genotypes, produced by
\code{geno.crosstab()}.

<<genocrosstab>>=
geno.crosstab(mapthis, mn4[3], mn4[1])
mn5 <- markernames(mapthis, chr=5)
geno.crosstab(mapthis, mn4[3], mn5[1])
@

It is clear that, if the genotypes for \code{\Sexpr{mn4[3]}} are
correct, then the A and B alleles for \code{\Sexpr{mn5[1]}} are
switched.

In practice, I will look at enormous numbers of these sorts of
two-locus genotype tables in order to figure out what is going on.
But these data are (by design) quite clean, and so we will simply
trust that the alleles need to be switched for all of the markers on
linkage groups 5 and 7--11.  The function \code{switchAlleles()} is
convenient for performing the allele switching.

<<switchalleles>>=
toswitch <- markernames(mapthis, chr=c(5, 7:11))
mapthis <- switchAlleles(mapthis, toswitch)
@

Now when we revisit the plot of recombination fractions and LOD
scores, we will see a quite different picture.  (Note that we need to
re-run \code{est.rf()} after having run \code{switchAlleles()}.)

<<plotrfagain,eval=FALSE>>=
mapthis <- est.rf(mapthis)
plotRF(mapthis, alternate.chrid=TRUE)
@

\begin{figure}
\centering
<<plotrfagainplot,fig=TRUE,echo=FALSE,height=5.5>>=
mapthis <- est.rf(mapthis)
par(mar=c(4.1,4.1,2.1,2.1), las=1)
plotRF(mapthis, main="", alternate.chrid=TRUE)
@
\caption{Plot of estimated recombination fractions (upper-left
  triangle) and LOD scores (lower-right triangle) for all pairs of
  markers, after having switched alleles for many markers.  Red
  indicates linked (large LOD score or small recombination fraction)
  and blue indicates not linked (small LOD score or large
  recombination fraction).\label{fig:plotrfagain}}
\end{figure}

As seen in Fig.~\ref{fig:plotrfagain}, the LOD scores between marker pairs
are unchanged, but now the recombination fractions are small
(indicated in red) for marker pairs with evidence of association
(large LOD scores, also in red).

It is useful to revisit the plot of LOD scores versus recombination
fractions for all pairs.

<<lodvrfagain,eval=FALSE>>=
rf <- pull.rf(mapthis)
lod <- pull.rf(mapthis, what="lod")
plot(as.numeric(rf), as.numeric(lod), xlab="Recombination fraction", ylab="LOD score")
@

\begin{figure}
\centering
<<lodvrfagainplot,fig=TRUE,echo=FALSE,height=3>>=
rf <- pull.rf(mapthis)
lod <- pull.rf(mapthis, what="lod")
par(mar=c(4.1,4.1,0.6,0.6), las=1, cex=0.8)
plot(as.numeric(rf), as.numeric(lod), xlab="Recombination fraction", ylab="LOD score")
@
\caption{Plot of LOD scores versus estimated recombination fractions
  for all marker pairs, after alleles at many markers have been
  switched.\label{fig:lodvrfagain}}
\end{figure}

Now we see (in Fig.~\ref{fig:lodvrfagain}) no estimated recombination
fractions that are much above 1/2, and certainly no large
recombination fractions with large LOD scores.  In fact, the largest
LOD score for marker pairs with estimated recombination fraction $>$
1/2 is \Sexpr{round(max(lod[!is.na(rf) & rf>0.5]), 2)}.

\bigskip
\textbf{\sffamily Form linkage groups}
\nopagebreak

We now should have the genotype data in good shape and can finally
proceed to the actual map construction.  First, we again attempt to
infer the linkage groups, with the hope that we'll come away with
exactly five.

<<formgroupsagain>>=
lg <- formLinkageGroups(mapthis, max.rf=0.35, min.lod=6)
table(lg[,2])
@

Right; we've got five groups.  Now we reorganize the markers again.

<<reorganizemarkersagain>>=
mapthis <- formLinkageGroups(mapthis, max.rf=0.35, min.lod=6, reorgMarkers=TRUE)
@

It is useful to plot the pairwise recombination fractions and LOD
scores again.

<<plotrfyetagain,eval=FALSE>>=
plotRF(mapthis)
@

\begin{figure}
\centering
<<plotrfyetagainplot,fig=TRUE,echo=FALSE,height=5.5>>=
mapthis <- est.rf(mapthis)
par(mar=c(4.1,4.1,1.6,1.6), las=1)
plotRF(mapthis, main="")
@
\caption{Plot of estimated recombination fractions (upper-left
  triangle) and LOD scores (lower-right triangle) for all pairs of
  markers, after markers have been placed in their final linkage
  groups.  Red indicates linked (large LOD score or small
  recombination fraction) and blue indicates not linked (small LOD
  score or large recombination fraction).\label{fig:plotrfyetagain}}
\end{figure}

As seen in Fig.~\ref{fig:plotrfyetagain}, we have five clear linkage
groups, with markers within a group linked to one another, and markers
in distinct groups showing no evidence of linkage.  The results
couldn't possibly be cleaner (and they wouldn't be, if these were real
data).

\bigskip
\textbf{\sffamily Order markers on chromosome~5}
\nopagebreak

We now turn to the problem of ordering markers within each linkage
group.  (We could go ahead and call them chromosomes at this point.)
I always start with the chromosome with the fewest markers, as there
are fewer possible orders and so the process is quicker.  I like to
see some progress before I move to the larger chromosomes.

The function \code{orderMarkers()} may be used to establish an initial
order.  It picks an arbitrary pair of markers, and then adds an
additional marker (chosen at random), one at a time, in the best
supported position.  With the argument \code{use.ripple=TRUE}, the
function \code{ripple()} is used after the addition of each marker, to
consider all possible orders in a sliding window of markers, to
attempt to improve marker order.  The argument \code{window} defines
the length of the window.  Larger values will explore more possible
orders but will require considerably more computation time.  The value
\code{window=7} is usually about the largest one would ever consider;
use of \code{window=8} will take so long that one would not want to
sit and wait.  Note that the other arguments to \code{orderMarkers}
(e.g., \code{error.prob} and \code{map.function}) are passed to
the function \code{est.map()} for estimation of the genetic map with
the final order that is chosen.  Also note that \code{ripple()}, in
this case, chooses among orders in order to minimize the number of
obligate crossovers.  More on this below.

<<orderchrfive,eval=FALSE>>=
mapthis <- orderMarkers(mapthis, chr=5)
@
<<orderchrfiverun,echo=FALSE>>=
file <- "Rcache/order5.RData"
if(file.exists(file)) {
  load(file)
} else {
<<orderchrfive>>
  save(mapthis, file=file)
}
@

We may use \code{pull.map()} to inspect the result.

<<chrfivemap>>=
pull.map(mapthis, chr=5)
@

Since the marker names were chosen to indicate the true marker order,
we can see that we got the order exactly right.  (Though note that the
second marker, \code{C5M2} was omitted at some point during the course
of our analysis.)  Of course, we wouldn't know this, and so we should
make an attempt to explore alternate orders, in case another order
might be seen to be an improvement.

We use \code{ripple()} to explore alternate orders.  As mentioned
above, it considers all possible orders of markers in a sliding window
of fixed length, with the argument \code{window} defining the width of
the window.  If \code{window} is equal to the number of markers on the
chromosome, than all possible orders are considered.

The quickest approach is to count the number of obligate crossovers
for each order.  The good orders generally are those that result in
the smallest numbers of crossovers.  The more refined, but
considerably slower approach, is to compare the likelihoods of the
different orders.  (The likelihood for a given marker order is the
probability of the data assuming that order is correct and plugging in
estimates of the inter-marker distances.)  We do this with
\code{method="likelihood"} in \code{ripple()}.  We may also indicate a
genotyping error probability (through \code{error.prob}).

While we could start by comparing orders to minimize the number of
obligate crossovers (with \code{method="countxo"}, the default, in
\code{ripple()}), this was already done when we called
\code{orderMarkers()}, and it is not necessary to run it again.
Nevertheless, the results will indicate how close, in terms of number
of crossovers, the next-best marker order is to the inferred one.

<<ripplechr5run,echo=FALSE,results=hide>>=
file <- "Rcache/rip5.RData"
if(file.exists(file)) {
  load(file)
} else {
rip5 <- ripple(mapthis, chr=5, window=7)
save(rip5, file=file)
}
@
<<ripplechr5,eval=FALSE>>=
rip5 <- ripple(mapthis, chr=5, window=7)
@
\begin{Schunk}
\begin{Soutput}
   13680 total orders
\end{Soutput}
\end{Schunk}
<<summaryripple5>>=
summary(rip5)
@

As seen above, 13,680 marker orders were considered.  (There are
$9!/2$ = 181,440 total marker orders.)  On my computer, the code above
took about 1.5 seconds.  If we'd looked at all possible orders (that
is, with \code{window=9}), it would take about 14 seconds, but the
results---I checked---are the same.)  Switching markers 8 and 9
results in one additional obligate crossover.  If, instead, one switches
markers 5 and 6, there are two additional obligate crossovers.

It is good to also study the likelihood of different orders, though we
will want to greatly reduce the \code{window} argument, so that it can
be accomplished in a reasonable amount of time.

<<ripplechr5likrun,echo=FALSE,results=hide>>=
file <- "Rcache/rip5lik.RData"
if(file.exists(file)) {
  load(file)
} else {
rip5lik <- ripple(mapthis, chr=5, window=4, method="likelihood",
                  error.prob=0.005)
save(rip5lik, file=file)
}
@
<<ripplechr5lik,eval=FALSE>>=
rip5lik <- ripple(mapthis, chr=5, window=4, method="likelihood",
                  error.prob=0.005)
@
\begin{Schunk}
\begin{Soutput}
   114 total orders
\end{Soutput}
\end{Schunk}
<<summaryripple5lik>>=
summary(rip5lik)
@

The result (on my computer) took about two minutes.  We see that
switching markers 8 and 9 has a LOD score (that is, log$_{10}$
likelihood, relative to the initial order) of
\Sexpr{round(rip5lik[2,"LOD"], 1)}, which indicates that it is slightly
preferred.  However, the estimated chromosome length is slightly
longer (\Sexpr{round(rip5lik[2,"chrlen"], 1)} versus
\Sexpr{round(rip5lik[1,"chrlen"], 1)}~cM).

We know, from the marker names, that the initial order was the true
order, but in practice we would not have such information, and we
would probably want to switch markers 8 and 9.  Usually we are looking
to have the estimated chromosome length be as short as possible, but
if we trust the likelihood calculation, we should go with the
alternate order.  Of course, the two orders are not really
distinguishable; we can't really say, on the basis of these data,
whether the correct order is the first or the second.

Note that the results here can be sensitive to the assumed genotyping
error rate.  (I chose \code{error.prob=0.005} above, because the
data were simulated with this error rate.)  The function
\code{compareorder()} can be used to compare an initial order to a
fixed alternative order.  We can use this to quickly inspect how
sensitive the results are to the assumed error rate.

<<compareorder>>=
compareorder(mapthis, chr=5, c(1:7,9,8), error.prob=0.01)
compareorder(mapthis, chr=5, c(1:7,9,8), error.prob=0.001)
compareorder(mapthis, chr=5, c(1:7,9,8), error.prob=0)
@

These results indicate that with smaller assumed genotyping error rates, the evidence in
favor of switching markers 8 and 9 increases somewhat.  Note also that
the map length increases quite a bit.

If we were looking at these data blindly, I would likely go with
switching markers 8 and 9, so let's go ahead and do that here.  We use
\code{switch.order()} to do so.  It takes the same sort of arguments
as \code{est.map()}, as after the marker order is switched,
\code{est.map()} is called to revise the estimated map for the chromosome.

<<switchorder>>=
mapthis <- switch.order(mapthis, chr=5, c(1:7,9,8), error.prob=0.005)
pull.map(mapthis, chr=5)
@

Note that the map is slightly smaller than what we had seen above,
after running \code{orderMarkers()}, as we had used the default
value for \code{error.prob} in that function, and now we are using
\code{error.prob=0.005}.  Also note that markers \code{C5M10} and
\code{C5M9} are quite close together and are separated from the next
marker by \Sexpr{round(diff(pull.map(mapthis, chr=5)[[1]][7:8]),1)}~cM.
This explains why it is difficult to assess the appropriate order for
these two markers.

\bigskip
\textbf{\sffamily Order markers on chromosome~4}
\nopagebreak

We now turn to chromosome~4.  First, we run \code{orderMarkers()} and print out the estimated map.

<<orderchrfour,eval=FALSE>>=
mapthis <- orderMarkers(mapthis, chr=4)
pull.map(mapthis, chr=4)
@
<<orderchrfourrun,echo=FALSE>>=
file <- "Rcache/order4.RData"
if(file.exists(file)) {
  load(file)
} else {
<<orderchrfour>>
  save(mapthis, file=file)
}
pull.map(mapthis, chr=4)
@

The marker names tell us the true order, and so we see that the
inferred order is correct except that markers \code{C4M10} and
\code{C4M9} are switched.  Let us run \code{ripple()}, to see which
orders have similar numbers of obligate crossovers.

<<ripplechr4run,echo=FALSE,results=hide>>=
file <- "Rcache/rip4.RData"
if(file.exists(file)) {
  load(file)
} else {
rip4 <- ripple(mapthis, chr=4, window=7)
save(rip4, file=file)
}
@
<<ripplechr4,eval=FALSE>>=
rip4 <- ripple(mapthis, chr=4, window=7)
@
\begin{Schunk}
\begin{Soutput}
   18000 total orders
\end{Soutput}
\end{Schunk}
<<summaryripple4>>=
summary(rip4)
@

The order with markers 9 and 10 switched gives the same number of
obligate crossovers, and so we can not distinguish between these two
marker orders.  We turn to the likelihood comparison.

<<ripplechr4likrun,echo=FALSE,results=hide>>=
file <- "Rcache/rip4lik.RData"
if(file.exists(file)) {
  load(file)
} else {
rip4lik <- ripple(mapthis, chr=4, window=4, method="likelihood",
                  error.prob=0.005)
save(rip4lik, file=file)
}
@
<<ripplechr4lik,eval=FALSE>>=
rip4lik <- ripple(mapthis, chr=4, window=4, method="likelihood",
                  error.prob=0.005)
@
\begin{Schunk}
\begin{Soutput}
   132 total orders
\end{Soutput}
\end{Schunk}
<<summaryripple4lik>>=
summary(rip4lik)
@

There is reasonably good evidence to switch markers 9 and 10 (which is
nice, as this results in the true marker order).

<<switchmarkers4>>=
mapthis <- switch.order(mapthis, chr=4, c(1:8,10,9), error.prob=0.005)
pull.map(mapthis, chr=4)
@

\bigskip
\textbf{\sffamily Order markers on chromosome~3}
\nopagebreak

Turning to chromosome~3, we again run \code{orderMarkers()}.  These
calculations are starting to take quite a bit of time.  It would all
be faster if we reduced the argument \code{window} to a smaller value
(say 4 rather than 7), but then not as many alternate orders will be
explored and so we may not identify the best order.

<<orderchrthree,eval=FALSE>>=
mapthis <- orderMarkers(mapthis, chr=3)
pull.map(mapthis, chr=3)
@
<<orderchrthreerun,echo=FALSE>>=
file <- "Rcache/order3.RData"
if(file.exists(file)) {
  load(file)
} else {
<<orderchrthree>>
  save(mapthis, file=file)
}
pull.map(mapthis, chr=3)
@

Note, from the marker names, that the marker order is the true order.
The whole chromosome is flipped, but we have no information, from the
genotype data, to orient the chromosome.  Let us again use
\code{ripple()} to study alternate orders.

<<ripplechr3run,echo=FALSE,results=hide>>=
file <- "Rcache/rip3.RData"
if(file.exists(file)) {
  load(file)
} else {
rip3 <- ripple(mapthis, chr=3, window=7)
save(rip3, file=file)
}
@
<<ripplechr3,eval=FALSE>>=
rip3 <- ripple(mapthis, chr=3, window=7)
@
\begin{Schunk}
\begin{Soutput}
   39600 total orders
\end{Soutput}
\end{Schunk}
<<summaryripple3>>=
summary(rip3)
@

The next-best order, with markers 2 and 3 switched, results in an
additional \Sexpr{diff(rip3[1:2,"obligXO"])} obligate crossovers.
We turn to the likelihood comparison.

<<ripplechr3likrun,echo=FALSE,results=hide>>=
file <- "Rcache/rip3lik.RData"
if(file.exists(file)) {
  load(file)
} else {
rip3lik <- ripple(mapthis, chr=3, window=4, method="likelihood",
                  error.prob=0.005)
save(rip3lik, file=file)
}
@
<<ripplechr3lik,eval=FALSE>>=
rip3lik <- ripple(mapthis, chr=3, window=4, method="likelihood",
                  error.prob=0.005)
@
\begin{Schunk}
\begin{Soutput}
   222 total orders
\end{Soutput}
\end{Schunk}
<<summaryripple3lik>>=
summary(rip3lik)
@

The next-best order (with markers 2 and 3 switched) is considerably
worse than our initial order.

\bigskip
\textbf{\sffamily Order markers on chromosome~2}
\nopagebreak

We turn to chromosome~2, beginning with \code{orderMarkers()}.

<<orderchrtwo,eval=FALSE>>=
mapthis <- orderMarkers(mapthis, chr=2)
pull.map(mapthis, chr=2)
@
<<orderchrtworun,echo=FALSE>>=
file <- "Rcache/order2.RData"
if(file.exists(file)) {
  load(file)
} else {
<<orderchrtwo>>
  save(mapthis, file=file)
}
pull.map(mapthis, chr=2)
@

Again, as seen from the marker names, the inferred order is the true
one.  Let us run \code{ripple()} to inspect alternate orders.

<<ripplechr2run,echo=FALSE,results=hide>>=
file <- "Rcache/rip2.RData"
if(file.exists(file)) {
  load(file)
} else {
rip2 <- ripple(mapthis, chr=2, window=7)
save(rip2, file=file)
}
@
<<ripplechr2,eval=FALSE>>=
rip2 <- ripple(mapthis, chr=2, window=7)
@
\begin{Schunk}
\begin{Soutput}
   78480 total orders
\end{Soutput}
\end{Schunk}
<<summaryripple2>>=
summary(rip2)
@

The next-best order, with markers 3 and 4 switched, has
\Sexpr{diff(rip2[1:2,"obligXO"])} additional obligate crossovers. We
turn to the likelihood comparison.

<<ripplechr2likrun,echo=FALSE,results=hide>>=
file <- "Rcache/rip2lik.RData"
if(file.exists(file)) {
  load(file)
} else {
rip2lik <- ripple(mapthis, chr=2, window=4, method="likelihood",
                  error.prob=0.005)
save(rip2lik, file=file)
}
@
<<ripplechr2lik,eval=FALSE>>=
rip2lik <- ripple(mapthis, chr=2, window=4, method="likelihood",
                  error.prob=0.005)
@
\begin{Schunk}
\begin{Soutput}
   384 total orders
\end{Soutput}
\end{Schunk}
<<summaryripple2lik>>=
summary(rip2lik)
@

The next-best order, in terms of likelihood, has markers 21 and 22
switched, and is considerably worse than our initial order.

It is interesting to compare the numbers of obligate crossovers for
different orders to the LOD scores.  We have LOD scores for a much
smaller number of orders (\Sexpr{nrow(rip2lik)} versus
\Sexpr{nrow(rip2) %/% 1000},\Sexpr{nrow(rip2) %% 1000}), since we had used a smaller value for
\code{window}, but we can pull out the obligate crossover counts
for those orders that we evaluated in terms of likelihood.  It is a
bit tricky to line up the orders.

<<comparexo2lik,eval=FALSE>>=
pat2 <- apply(rip2[,1:24], 1, paste, collapse=":")
pat2lik <- apply(rip2lik[,1:24], 1, paste, collapse=":")
rip2 <- rip2[match(pat2lik, pat2),]
plot(rip2[,"obligXO"], rip2lik[,"LOD"], xlab="obligate crossover count",
     ylab="LOD score")
@

\begin{figure}
\centering
<<comparexo2likplot,fig=TRUE,echo=FALSE,height=3>>=
par(las=1, mar=c(4.1,4.1,1.1,0.1), cex=0.8)
<<comparexo2lik>>
@
\caption{Comparison of the number of obligate crossovers to the LOD
  score (relative to our inferred order), for chromosome~2 marker orders explored
  via \code{ripple()}.\label{fig:comparexo2lik}}
\end{figure}

As seen in Fig.~\ref{fig:comparexo2lik}, there is a clear negative relationship
between the crossover counts and the likelihoods, though the relationship is
not perfect, particularly for the orders that are not well supported.

\bigskip
\textbf{\sffamily Order markers on chromosome~1}
\nopagebreak

Finally, we turn to chromosome~1, first running \code{orderMarkers()}.

<<orderchrone,eval=FALSE>>=
mapthis <- orderMarkers(mapthis, chr=1)
pull.map(mapthis, chr=1)
@
<<orderchronerun,echo=FALSE>>=
file <- "Rcache/order1.RData"
if(file.exists(file)) {
  load(file)
} else {
<<orderchrone>>
  save(mapthis, file=file)
}
pull.map(mapthis, chr=1)
@

Again, the inferred order is precisely the true order.  I was quite
surprised to see how well \code{orderMarkers()} performed with these
data.  Of course, the data are quite clean (by design) and comprise quite a
large number of individuals.  In practice, one can't expect
\code{orderMarkers()} to perform so well,
and it is worthwhile to study the estimated map and the
pairwise linkage information closely.  We will do so in a moment, but
first let's complete our analysis of chromosome~1 by studying the results of
\code{ripple}.

<<ripplechr1run,echo=FALSE,results=hide>>=
file <- "Rcache/rip1.RData"
if(file.exists(file)) {
  load(file)
} else {
rip1 <- ripple(mapthis, chr=1, window=7)
save(rip1, file=file)
}
@
<<ripplechr1,eval=FALSE>>=
rip1 <- ripple(mapthis, chr=1, window=7)
@
\begin{Schunk}
\begin{Soutput}
   117360 total orders
\end{Soutput}
\end{Schunk}
<<summaryripple1>>=
summary(rip1)
@

Two alternate orders (switching markers 28 and 29, or switching
markers 31 and 32) have 2 additional obligate crossovers, compared
to our initial order.  We turn to the likelihood comparison.

<<ripplechr1likrun,echo=FALSE,results=hide>>=
file <- "Rcache/rip1lik.RData"
if(file.exists(file)) {
  load(file)
} else {
rip1lik <- ripple(mapthis, chr=1, window=4, method="likelihood",
                  error.prob=0.005)
save(rip1lik, file=file)
}
@
<<ripplechr1lik,eval=FALSE>>=
rip1lik <- ripple(mapthis, chr=1, window=4, method="likelihood",
                  error.prob=0.005)
@
\begin{Schunk}
\begin{Soutput}
   546 total orders
\end{Soutput}
\end{Schunk}
<<summaryripple1lik>>=
summary(rip1lik)
@

The next best order (switching markers 28 and 29) results in a
moderately large decrease in likelihood, and so we stick with the
initial order produced by \code{orderMarkers()}.

Let us return to the question of how to assess the quality of the
results of \code{orderMarkers()}, aside from our investigations with
\code{ripple()}.  The \code{ripple()} analysis is excellent for
comparing nearby marker orders.  And for chromosomes with a modest
number of markers (like chromosome~5, here), one can consider all
possible orders exhaustively.  But with many markers (as with
chromosome~1), we can investigate only a small proportion of the possible
orders.  For example, with 33 markers (as on chromosome~1), there are
$33!/2 \approx 10^{36}$ possible marker orders, and we investigated
only 117,360 of them using \code{ripple} with \code{window=7}.

There are several things to look it, in assessing whether gross
changes in marker order are necessary.  (Ideally, R/qtl would include
a function providing a more complete investigation of marker order,
perhaps via simulated annealing or another randomized optimization
algorithm, and we do hope to implement such a feature in the future.)
First, we should look at the actual map: are there large gaps between
markers, indicating adjacent markers that are only weakly linked?
Studying the map locations with \code{pull.map()}, as above, is hard
when there are lots of markers.  The function \code{summaryMap()}
provides a summary of the average inter-marker distance and the
largest gap on each chromosome.

<<summarymap>>=
summaryMap(mapthis)
@
<<savesummarymap,echo=FALSE,results=hide>>=
firstsummary <- summaryMap(mapthis)
@

We see that chromosome~1 has a
\Sexpr{round(summaryMap(mapthis)[1,"max.spacing"],1)}~cM gap; the
other chromosomes have gaps no larger than 25~cM.  The large gap on
chromosome~1 is suspicious, but it is not terrible.

In addition to this summary, it is good to also plot the map.
We use the argument \code{show.marker.names=TRUE} so that marker names
are included.  Most of them are unreadable, because the markers are
densely spaced, but at least we learn the identity of markers that are
surrounded by large gaps.

<<plotmap,eval=FALSE>>=
plotMap(mapthis, show.marker.names=TRUE)
@

\begin{figure}
\centering
<<plotmapplot,fig=TRUE,echo=FALSE,height=3>>=
par(las=1, mar=c(4.1,4.1,1.1,0.1), cex=0.8)
plotMap(mapthis, main="", show.marker.names=TRUE)
@
\caption{Plot of the estimated genetic map, after the markers on each
  chromosome have been ordered.\label{fig:plotmap}}
\end{figure}

As seen in Fig.~\ref{fig:plotmap}, there are some large gaps on
chromosome~1 and some smaller gaps on the other chromosomes.  They
deserve further investigation (and we will do so in the next section),
but they aren't particularly troubling.  With gross mistakes in marker
order, one will often see much larger gaps, say $>$ 100~cM.

Finally, we inspect the pairwise linkage information again.

<<plotrfonemoretime,eval=FALSE>>=
plotRF(mapthis)
@

\begin{figure}
\centering
<<plotrfonemoretimeplot,fig=TRUE,echo=FALSE,height=5.5>>=
par(mar=c(4.1,4.1,1.6,1.6), las=1)
plotRF(mapthis, main="")
@
\caption{Plot of estimated recombination fractions (upper-left
  triangle) and LOD scores (lower-right triangle) for all pairs of
  markers, after ordering the markers on each chromosome.
  Red indicates linked (large LOD score or small
  recombination fraction) and blue indicates not linked (small LOD
  score or large recombination fraction).\label{fig:plotrfonemoretime}}
\end{figure}

The pairwise linkage information in Fig.~\ref{fig:plotrfonemoretime}
is close to what we want to see: nearby markers show clear association
and no distant markers show any association: red along the diagonal,
dissipating to blue away from the diagonal.

If there were gross problems with marker order, we might see groups of
distantly placed markers that are more highly associated than more closely
placed markers.  Just as an example, suppose we split chromosome~1
into three pieces and moved the latter piece into the middle.  Let's
look at the pairwise linkage information following such a
re-ordering.

<<plotrfafterreorder,eval=FALSE>>=
messedup <- switch.order(mapthis, chr=1, c(1:11,23:33,12:22),
                         error.prob=0.005)
plotRF(messedup, chr=1)
@

\begin{figure}
\centering
<<plotrfafterreorderplot,fig=TRUE,echo=FALSE,height=3>>=
par(mar=c(4.1,4.1,1.6,1.6), las=1, pty="s", cex=0.8)
messedup <- switch.order(mapthis, chr=1, c(1:11,23:33,12:22),
                         error.prob=0.005)
plotRF(messedup, chr=1, main="")
@
\caption{Plot of estimated recombination fractions (upper-left
  triangle) and LOD scores (lower-right triangle) for all pairs of
  markers on chromosome~1, after messing up the order of the markers.
  Red indicates linked (large LOD score or small
  recombination fraction) and blue indicates not linked (small LOD
  score or large recombination fraction).\label{fig:plotrfafterreorder}}
\end{figure}

Fig.~\ref{fig:plotrfafterreorder} (on
page~\pageref{fig:plotrfafterreorder}) displays the sort of pattern one should
expect with gross problems in marker order: the markers in the first and last
segments are more tightly associated to each other than they are to
the markers in the middle segment.

A plot of the genetic map (see Fig.~\ref{fig:plotmapmessedup}, page~\pageref{fig:plotmapmessedup}) shows a
50~cM gap between the first and middle segments and an 80~cM gap
between the middle and last segments.

<<plotmapmessedup,eval=FALSE>>=
plotMap(messedup, show.marker.names=TRUE)
@

\begin{figure}
\centering
<<plotmapmessedupplot,fig=TRUE,echo=FALSE,height=3>>=
par(las=1, mar=c(4.1,4.1,1.1,0.1), cex=0.8)
plotMap(messedup, main="", show.marker.names=TRUE)
@
\caption{Plot of the estimated genetic map, after messing up the order
  of markers on chromosome~1.\label{fig:plotmapmessedup}}
\end{figure}

The map is not as telling as the pairwise linkage information, but
these are the sorts of things to look at, in trying to decide whether
there are gross problems that need to be corrected.

If features such as those in Fig.~\ref{fig:plotrfafterreorder} and
\ref{fig:plotmapmessedup} were seen, one should identify the segments
of markers that need to be moved around, and then use
\code{switch.order()} to reorganize the markers.  One might first use
\code{compareorder()} to compare the current order to the reorganized
one, to see that the new order gave a clear improvement in likelihood.
In addition, one will often need to alternate between \code{ripple()}
and \code{switch.order()} until the final marker order is established.

\bigskip
\textbf{\sffamily Drop one marker at a time}
\nopagebreak

The large gaps in the genetic map on chromosome~1 remain a concern.
While such gaps may indicate problems with the order of markers, they
might also indicate a high genotyping error rate at certain markers.
If an individual marker is more prone to genotyping errors than
others, it would often (in the sort of analyses performed above) be
placed at one end of the chromosome or the other, but in some cases
(particularly if the genotyping error rate is not high and there are a
large number of individuals in the cross) it may be placed at
approximately the correct position but result in reasonably large gaps
surrounding the marker.

One approach for identifying such problematic markers is to drop one
marker at a time and investigate the change in chromosome length and
the change in log likelihood.  This analysis may be accomplished with
the function \code{droponemarker()}.

<<droponemarker,eval=FALSE>>=
dropone <- droponemarker(mapthis, error.prob=0.005)
@
<<droponemarkerrun,echo=FALSE,results=hide>>=
file <- "Rcache/dropone.RData"
if(file.exists(file)) {
  load(file)
} else {
  dropone <- droponemarker(mapthis, error.prob=0.005)
  save(dropone, file=file)
}
@

The results are of the same form as a genome scan by QTL mapping.  (In
particular, they have class \code{"scanone"}, like an object produced
by the \code{scanone()} function.)  We may plot the results as follows.

<<plotdropone,eval=FALSE>>=
par(mfrow=c(2,1))
plot(dropone, lod=1, ylim=c(-100,0))
plot(dropone, lod=2, ylab="Change in chromosome length")
@

\begin{figure}
\centering
<<plotdroponeplot,fig=TRUE,echo=FALSE,height=4>>=
par(mar=c(4.1,4.1,1.6,0.1), mfrow=c(2,1), cex=0.8)
plot(dropone, lod=1, ylim=c(-100,0))
plot(dropone, lod=2, ylab="Change in chr length (cM)")
@
\caption{Results of dropping one marker at a time.  The top panel
  contains LOD scores; positive values would indicate that dropping a
  marker improves the likelihood.  The bottom panel indicates the
  decrease in estimated chromosome length (in cM) following dropping
  a marker.\label{fig:plotdropone}}
\end{figure}

As seen in top panel of Fig.~\ref{fig:plotdropone}, there is no one
marker for which its omission results in an increase in likelihood,
but as seen in the bottom panel, there are a number of markers that
give an appreciable decrease in chromosome length, of 15--25 cM, when
omitted.  Markers at the ends of chromosomes will often result in a
smaller estimated chromosome length, but that is just because such
terminal markers hang off some distance from the rest of the
markers, and so these changes can often be discounted.  Interior markers
that result in a big change, and there appears to be one on each of
chromosomes 1, 2 and 3, might indicate error-prone markers that are
best omitted.

One may identify the marker on each chromosome whose omission results
in the largest decreases in chromosome length as follows.

<<worstmarkers>>=
summary(dropone, lod.column=2)
@

For chromosomes 4 and 5, these are terminal markers.  The markers on
chromosomes 1, 2 and 3 are all interior markers.  One should probably
study the pairwise linkage between these markers and
surrounding markers before proceeding, but we will go ahead and remove
these markers without any further investigations.

<<dropbadmarkers>>=
badmar <- rownames(summary(dropone, lod.column=2))[1:3]
mapthis <- drop.markers(mapthis, badmar)
@

One should re-estimate the genetic map.  We use \code{replace.map()}
to insert it into the cross object.

<<reestimatemap>>=
newmap <- est.map(mapthis, error.prob=0.005)
mapthis <- replace.map(mapthis, newmap)
summaryMap(mapthis)
@
<<savenewsummary,echo=FALSE,results=hide>>=
secondsummary <- summaryMap(mapthis)
@

Removing those three markers resulted in a decrease in the overall map
length from \Sexpr{round(firstsummary["overall","length"])}~cM to
\Sexpr{round(secondsummary["overall","length"])}~cM.

\bigskip
\textbf{\sffamily Look for problem individuals}
\nopagebreak

Now that we have the markers in their appropriate order, it is a good
idea to return to the question of whether there are particular
individuals whose data are problematic.  One should ask: if a
particular individual showed considerable genotyping errors or did not
actually belong to the cross under investigation (e.g., through some
sort of labeling or breeding error), what sort of aberrations might be
seen in the data?  Above, we studied the amount of missing genotype
data for each individual as well as the individuals' genotype
frequencies.  Another feature to investigate is the observed number of
crossovers in each individual.  These may be counted with the function
\code{countXO()}.  (A related function \code{locateXO()} will return
the estimated locations of all crossovers.)

<<countxo,eval=FALSE>>=
plot(countXO(mapthis), ylab="Number of crossovers")
@

\begin{figure}
\centering
<<countxoplot,fig=TRUE,echo=FALSE,height=3>>=
par(mar=c(4.1,4.1,0.6,0.6), cex=0.8)
<<countxo>>
thecounts <- countXO(mapthis)
worst <- rev(sort(thecounts, decreasing=TRUE)[1:2])
@
\caption{Numbers of observed crossovers in each individual.\label{fig:countxo}}
\end{figure}

The crossover counts in Fig.~\ref{fig:countxo} clearly indicate two
problematic individuals, with \Sexpr{worst[1]} and \Sexpr{worst[2]}
crossovers; the other individuals have
\Sexpr{min(thecounts)}--\Sexpr{max(thecounts[thecounts < worst[1]])}
crossovers.  We should remove these individuals.

<<drophighxoind>>=
mapthis <- subset(mapthis, ind=(countXO(mapthis) < 50))
@

Ideally, we would now revisit the entire process again; in particular,
after removing these problematic individuals, is there evidence that
marker order needs to be changed?  Let us at least look at chromosome 5.

<<rip5again>>=
summary(rip <- ripple(mapthis, chr=5, window=7))
summary(rip <- ripple(mapthis, chr=5, window=2, method="likelihood",
                      error.prob=0.005))
@

There is good evidence for switching markers 8 and 9, which actually
brings us back to the true order of the markers.

<<switchchr5again>>=
mapthis <- switch.order(mapthis, chr=5, c(1:7,9,8), error.prob=0.005)
pull.map(mapthis, chr=5)
@

I investigated the other four chromosomes similarly, and there was no
evidence for further changes in marker order, so I won't present the
results here.  We should, finally, re-estimate the genetic map.

<<reestmapagain>>=
newmap <- est.map(mapthis, error.prob=0.005)
mapthis <- replace.map(mapthis, newmap)
summaryMap(mapthis)
@
<<savethirdsummary,echo=FALSE,results=hide>>=
thirdsummary <- summaryMap(mapthis)
@

The overall map length has decreased further, from
\Sexpr{round(secondsummary["overall","length"])}~cM to
\Sexpr{round(thirdsummary["overall","length"])}~cM.

\bigskip
\textbf{\sffamily Estimate genotyping error rate}
\nopagebreak

Above, I had generally assumed a genotyping error rate of 5/1000 (in
estimating map distances and in likelihood calculations comparing
different marker orders); but I cheated somewhat in using this value,
as the genotype data were simulated with this error rate.

One can actually estimate the genotyping error rate from the data, as
the function \code{est.map()} not only estimates the inter-marker
distances, but also calculates the log likelihood for each
chromosome.  Thus, if we run \code{est.map()} with different assumed
values for the genotyping error rate (specified with the
\code{error.prob} argument), one can identify the maximum likelihood
estimate of the error rate.

<<studyerrorrate,eval=FALSE>>=
loglik <- err <- c(0.001, 0.0025, 0.005, 0.0075, 0.01, 0.0125, 0.015, 0.0175, 0.02)
for(i in seq(along=err)) {
  cat(i, "of", length(err), "\n")
  tempmap <- est.map(mapthis, error.prob=err[i])
  loglik[i] <- sum(sapply(tempmap, attr, "loglik"))
}
lod <- (loglik - max(loglik))/log(10)
@
<<runstudyerrorrate,echo=FALSE,results=hide>>=
file <- "Rcache/errorrate.RData"
if(file.exists(file)) {
  load(file)
} else {
<<studyerrorrate>>
  save(err, lod, file=file)
}
@

The code is a bit tricky, mostly because the log likelihoods (and
note that they are on the natural log scale) are included as
``attributes'' to each chromosome component in the output from
\code{est.map()}.  We use \code{sapply()} to pull those out, and then
we add them up.  We finally convert them to the log$_{10}$ scale and
re-center them so that the maximum is 0.

We may plot the log$_{10}$ likelihood as follows.

<<ploterrorratelik,eval=FALSE>>=
plot(err, lod, xlab="Genotyping error rate", xlim=c(0,0.02),
     ylab=expression(paste(log[10], " likelihood")))
@

\begin{figure}
\centering
<<ploterrorratelikplot,fig=TRUE,echo=FALSE,height=3>>=
par(mar=c(4.1,4.1,0.6,0.6), las=1)
<<ploterrorratelik>>
@
\caption{The log$_{10}$ likelihood for the genotyping error
  rate.\label{fig:errorratelik}}
\end{figure}

The log$_{10}$ likelihood in Fig.~\ref{fig:errorratelik} indicates
that the MLE is approximately 0.005.  Error rates of 0.0025 and 0.0075
have log$_{10}$ likelihoods that are 3 less than that of the MLE.  We
might investigate additional assumed error rates, or even use the R
function \code{optimize()} to refine our estimate, but we won't pursue
that here.

\bigskip
\textbf{\sffamily Look for genotyping errors}
\nopagebreak

While the methods for estimating inter-marker distances allow for the
presence of genotyping errors at a fixed rate, it is nevertheless
worthwhile to look for, and ideally correct, potential genotyping
errors in the data.  Such errors may be identified through apparent
tight double-crossovers, with a single marker being out of phase with
its adjacent markers.

The most convenient approach for identifying such double-crossovers is
to calculate genotyping error LOD scores, first developed by Lincoln
and Lander (\emph{Genomics\/} 14:604--610, 1992).  The LOD score
compares the likelihood for a genotype being in error versus it not
being in error.   R/qtl uses a modified calculation of such
genotyping error LOD scores, with all genotypes
except that being considered assumed to be strictly correct.

The error LOD scores are calculated with \code{calc.errorlod()}.  One
must assume a genotyping error rate, but the results are almost
identical for a wide range of values.

<<errorlod,eval=FALSE>>=
mapthis <- calc.errorlod(mapthis, error.prob=0.005)
@
<<runerrorlod,echo=FALSE,results=hide>>=
file <- "Rcache/errorlod.RData"
if(file.exists(file)) {
  load(file)
} else {
<<errorlod>>
  save(mapthis, file=file)
}
@

The function \code{top.errorlod()} produces a list of the genotypes
with the largest error LOD scores.  One may generally focus on those
with quite large values, say at least 4--5.  Here will we look at just
those genotypes with error LOD $\ge$ 6; this is indicated with the
argument \code{cutoff}.

<<toperrorlod>>=
print(toperr <- top.errorlod(mapthis, cutoff=6))
@

There are \Sexpr{nrow(toperr)} genotypes that meet this criterion.
Let us look at the genotypes that were flagged on chromosome 1, using
the function \code{plotGeno()}.  The argument \code{cutoff} indicates
a threshold for flagging genotypes as potential errors, based on their
error LOD scores.  If we had used the argument \code{include.xo=TRUE}
(which is the default), inferred locations of crossovers would be
displayed; we suppress that here to get a more clean figure.

<<plotgeno,eval=FALSE>>=
plotGeno(mapthis, chr=1, ind=toperr$id[toperr$chr==1],
          cutoff=6, include.xo=FALSE)
@

\begin{figure}
\centering
<<plotgenoplot,fig=TRUE,echo=FALSE,height=3>>=
par(mar=c(4.1,4.1,0.6,0.6), las=1, cex.axis=0.9)
plotGeno(mapthis, chr=1, ind=toperr$id[toperr$chr==1], main="", cex=0.8,
          include.xo=FALSE, cutoff=6)
@
\caption{Genotypes on chromosome 1 for individuals with some potential
  errors flagged by red squares.  White, gray and black circles correspond to AA, AB
  and BB genotypes, respectively.\label{fig:plotgeno}}
\end{figure}

The results are in Fig.~\ref{fig:plotgeno}.  All of the flagged
genotypes are cases with an exchange from one homozygote to the other
and then back again (thus, two crossovers in each interval flanking a
single marker).

One might zero out these suspicious genotypes (that is, make them
missing).  Even better would be to revisit the raw genotyping
information, or even re-genotype these instances.  But we are talking
about just \Sexpr{nrow(toperr)} genotypes out of
\Sexpr{sum(ntyped(mapthis)) %/% 1000},\Sexpr{sum(ntyped(mapthis)) %% 1000},
and with our allowance for genotyping
errors in the map estimation, they have little influence on the
results.

If we did wish to delete these genotypes, we could do so as follows.

<<dropgenotypes>>=
mapthis.clean <- mapthis
for(i in 1:nrow(toperr)) {
  chr <- toperr$chr[i]
  id <- toperr$id[i]
  mar <- toperr$marker[i]
  mapthis.clean$geno[[chr]]$data[mapthis$pheno$id==id, mar] <- NA
}
@
%$

\bigskip
\textbf{\sffamily Revisit segregation distortion}
\nopagebreak

Finally, let us return to an investigation of segregation distortion in these data.

<<segdis,eval=FALSE>>=
gt <- geno.table(mapthis, scanone.output=TRUE)
par(mfrow=c(2,1))
plot(gt, ylab=expression(paste(-log[10], " P-value")))
plot(gt, lod=3:5, ylab="Genotype frequency")
abline(h=c(0.25, 0.5), lty=2, col="gray")
@

\begin{figure}
\centering
<<plotsegdis,fig=TRUE,echo=FALSE,height=5>>=
gt <- geno.table(mapthis, scanone.output=TRUE)
par(mar=c(4.1,4.1,0.6,0.6), las=1, mfrow=c(2,1), cex=0.8)
plot(gt, ylab=expression(paste(-log[10], " P-value")))
plot(gt, lod=3:5, ylab="Genotype frequency")
abline(h=c(0.25, 0.5), lty=2, col="gray")
@
\caption{Evidence for segregation distortion: $-$log$_{10}$ P-values from
  tests of 1:2:1 segregation at each marker (top panel) and the
  genotype frequencies at each marker (bottom panel, with black, blue
  and red denoting AA, AB and BB genotypes, respectively).\label{fig:segdis}}
\end{figure}

The top panel of Fig.~\ref{fig:segdis} contains $-$log$_{10}$ P-values
from tests of 1:2:1 segregation at each marker.  The bottom panel in
Fig.~\ref{fig:segdis} contains the observed genotype frequencies at
each marker (with black, blue and red corresponding to AA, AB and BB
genotypes, respectively).

The greatest departure from 1:2:1
segregation is on chromosome 4, with somewhat more AA genotypes and
somewhat fewer BB genotypes.   If we apply a Bonferroni correction for
the \Sexpr{totmar(mapthis)} tests (\Sexpr{totmar(mapthis)} is the
total number of markers we have retained in the data), we would look
for P $\ge$ 0.05/\Sexpr{totmar(mapthis)} which corresponds to
$-$log$_{10}$ P $\ge$ \Sexpr{round(-log10(0.05/totmar(mapthis)),2)},
and there is one marker on chromosome 4 that exceeds this.

There are also some departures from 1:2:1 segregation on chromosomes 2
and 3, but these appear to be within the range of what would be
expected by chance; the evidence for a real departure from normal
segregation is not strong.

The aberrant segregation pattern on chromosome 4 is not too worrisome.
Multipoint estimates of genetic map distances are little affected by
segregation distortion, and the pattern of distortion on the
chromosome indicates rather smooth changes in genotype frequency.
More worrisome would be a single distorted marker in the midst of other
markers with normal segregation, which would indicate genotyping
errors rather than, for example, the presence of partially lethal
alleles.

And so, finally, we're done.  Let us plot the final map.

<<plotfinalmap,eval=FALSE>>=
plotMap(mapthis, show.marker.names=TRUE)
@

\begin{figure}
\centering
<<plotfinalmapplot,fig=TRUE,echo=FALSE,height=3>>=
par(las=1, mar=c(4.6,4.6,0.6,0.6), cex=0.8)
plotMap(mapthis, main="", show.marker.names=TRUE)
@
\caption{Plot of the final estimated genetic map\label{fig:plotfinalmap}}
\end{figure}

The map in
Fig.~\ref{fig:plotfinalmap} is not pretty, as most of the marker names
are obscured.  R/qtl does not produce production-quality figures
automatically; I always go through quite a few extra contortions within R to
produce a figure suitable for a paper.

\bigskip
\textbf{Discussion}
\nopagebreak

The process of genetic map construction seems to be at least 90\% data
diagnostics.  While many might find that frustrating, for me that is a
large part of what makes it fun: it is interesting detective work.  I
have been involved in the construction of genetic maps for humans,
mice, dogs, zebrafish, mosquitoes, and sea squirts.  Each project was
different, with its own special issues that needed to be overcome, and
such issues can seldom be anticipated in advance.

The general strategy is to think about what sorts of things might be
going wrong with data, and then what sorts of features of the data
(summary statistics or plots) might indicate the presence of such
problems.  There are a variety of things to check routinely, and note
that the particular order in which these checks are performed is often
important.

In the procedures described above, and in forming these simulated
data, I attempted to cover most of the possible problems that might be
expected to arise.  With the simple intercross considered here, and
particularly as the data were simulated so that the problems would
generally be quite clear, the decisions about how to proceed were easy
to make.  In practice, potential problems in data will often be more
murky, and careful judgment calls will need to be made and frequently
revisited, ideally with careful consideration of raw genotyping data
and other records, and perhaps even with some additional rounds of
genotyping or even the redesign of genotyping assays.  Moreover, with
more complex experiments, such as an outcross or the combined analysis
of multiple crosses, additional issues will arise that we have not
touched on here.  But the overall strategy, laid out above, can be
applied quite generally.

\newpage
{\small
\noindent \textbf{\sffamily R/qtl functions useful for genetic map construction} \\
\noindent \begin{tabular}{ll}
\hspace*{25mm} & \hspace*{103mm} \\ \hline
plotMissing & Plot pattern of missing genotypes \\
ntyped & Count number of typed markers for each individual \\
nmissing & Count number of missing genotypes for each individual \\
subset.cross & Pull out a specified set of chromosomes and/or
individuals from a cross \\
drop.markers & Remove a list of markers \\
pull.markers & Drop all but a selected set of markers \\
drop.nullmarkers & Remove markers without data \\
comparegeno & Count proportion of matching genotypes between all pairs
of individuals \\
findDupMarkers & Find markers with identical genotype data \\
drop.dupmarkers & Drop duplicate markers \\
geno.table & Create table of genotype distributions \\
est.rf & Estimate pairwise recombination fractions \\
markerlrt & General likelihood ratio test for association between
marker pairs \\
checkAlleles & Identify markers with potentially switched alleles \\
pull.rf & Pull out the pairwise recombination fractions or LOD scores
as a matrix \\
formLinkageGroups & Partition markers into linkage groups \\
plotRF & Plot recombination fractions \\
markernames & Pull out the marker names from a cross \\
plot.rfmatrix & Plot a slice through the pairwise recombination
fractions or LOD scores \\
geno.crosstab & Create cross-tabulation of genotypes at two markers \\
switchAlleles & Switch alleles at selected markers \\
orderMarkers & Find an initial order for markers within chromosomes \\
ripple & Assess marker order by permuting groups of adjacent
markers \\
summary.ripple & Print summary of ripple output \\
est.map & Estimate genetic map \\
pull.map & Pull out the genetic map from a cross \\
compareorder & Compare two orderings of markers on a chromosome \\
switch.order & Switch the order of markers on a chromosome \\
summaryMap & Print summary of a genetic map \\
plotMap & Plot genetic map(s) \\
droponemarker & Drop one marker at a time from a genetic map \\
replace.map & Replace the genetic map of a cross \\
countXO & Count number of obligate crossovers for each individual \\
locateXO & Estimate locations of crossovers \\
calc.errorlod & Calculate Lincoln \& Lander (1992) error LOD scores \\
top.errorlod & List genotypes with highest error LOD values \\
plotGeno & Plot genotypes on a particular chromosomes for selected
individuals \\ \hline
%
tryallpositions & Test all possible positions for a marker \\
allchrsplits & Test all possible splits of a chromosome into two
pieces \\
movemarker & Move a marker from one chromosome to another \\
convert.map & Change map function for a genetic map \\
shiftmap & Shift starting points in genetic maps \\
rescalemap & Rescale genetic map \\
\hline
\end{tabular}
}

\end{document}