Man:gem-mapper

From Algorithm Development Wiki

NAME

gem-mapper -- A fast and accurate mapper for short genomic reads

SYNOPSIS

gem-mapper [OPTIONS] -I index_file [-i input_file] [-o output_prefix]

DESCRIPTION

The following options are available:

File input/output

-I index_file (string, mandatory)
The indexed reference you intend to use for mapping -- see gem-indexer, option -o.
-C|--emulate-complement
Specifies that although your archive does not index the reverse-complement, the reads should be mapped to it pretending that it does (this is accomplished by first mapping the read and then mapping its reverse complement; hence, if the archive is not too large it is twice as expensive as mapping to an archive which indexes both strands -- see the complete GEM documentation at http://gemlibrary.sourceforge.net). This option is normally not needed if you generated your index correctly, since gem-indexer records in each archive the parameters which were used to create it, and this information is automatically taken into account when gem-mapper loads the index. In particular, you do not need to specify this option if you generated the index with
gem-indexer --complement emulate
you only need this option if you used
gem-indexer --complement no
and you now wish the archive to behave as if it had been generated with the option --complement emulate.
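The emulation described above (map the read, then map its reverse complement) can be sketched as follows. This is an illustration of the idea only, not GEM's implementation; map_forward is a hypothetical stand-in for a mapper that searches the forward strand only.

```python
# Sketch only: illustrates the idea behind --emulate-complement, not GEM's code.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(read):
    """Return the reverse complement of a nucleotide sequence."""
    return read.translate(_COMPLEMENT)[::-1]


def emulate_complement(read, map_forward):
    """Map a read, then its reverse complement, with a forward-only mapper.

    map_forward is a hypothetical callable returning a list of positions.
    Each hit is tagged with the strand it was found on.
    """
    forward_hits = [(pos, "+") for pos in map_forward(read)]
    reverse_hits = [(pos, "-") for pos in map_forward(reverse_complement(read))]
    return forward_hits + reverse_hits
```

Because the mapper is called twice per read, the cost is roughly double that of a single pass, which matches the remark on cost made above.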
-i input_file (string, default=stdin)
In single-end mode: The name of the file containing the reads you intend to map. It must follow the multi-FASTA/multi-FASTQ syntax, with the additional limitation that each sequence (that is, each read) must be presented on a single line. In paired-end mode: Same as above, but with the additional requirement that the two ends of each read must be presented interlaced, that is with alternating first and second end.
-1 end_1_input_file -2 end_2_input_file (strings)
In paired-end mode: The name of the files containing the reads you intend to map (end_1_input_file for the first end, and end_2_input_file for the second end). Both files must follow the same multi-FASTA/multi-FASTQ syntax, with the additional limitation that each sequence (that is, each end) must be presented on a single line.
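If the two ends are stored in separate files but a single interlaced input is wanted for -i, the records can be alternated with a few lines of Python. The sketch below assumes 4-line FASTQ records (header, sequence, "+", qualities) with each sequence already on a single line; the function names are illustrative and not part of GEM.

```python
from itertools import islice, zip_longest


def fastq_records(lines):
    """Yield 4-line FASTQ records (header, sequence, '+', qualities)."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(end_1_lines, end_2_lines):
    """Alternate records: first end, second end, first end, and so on."""
    pairs = zip_longest(fastq_records(end_1_lines), fastq_records(end_2_lines))
    for rec_1, rec_2 in pairs:
        if rec_1 is None or rec_2 is None:
            raise ValueError("the two inputs contain different numbers of reads")
        yield from rec_1
        yield from rec_2
```

Both arguments can be open file objects, so the result can be written out line by line without loading either file into memory.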
--granularity size (non-negative integer, default=10,000 lines when reading from stdin or 2,500,000 B when reading from file)
Sets the number of lines/bytes that are going to be buffered from the input for each thread. If the input comes from stdin, size is considered as the number of lines. Otherwise, if the input is a file, size is taken as a number of bytes.
--sequence-range initial,final (non-negative integers, default=1,EOF)
Specifies the range of reads in the input that are going to be processed. Initial and final reads are included in the interval. Numbering begins at 1.
-o output_file (string, default=stdout)
The name of the file containing the mappings. The order of the reads in the output is guaranteed to be the same as that of the initial input. The output format is proprietary to GEM (see the complete GEM documentation at http://gemlibrary.sourceforge.net for a precise definition). It can be converted to SAM using gem-2-sam, although the conversion is lossy.

Qualities

-q|--quality-format format (ignore or offset-33 or offset-64, mandatory if FASTQ input)
If the input file is in multi-FASTQ format --that is, it contains quality information for each read-- specifies how such qualities should be interpreted. Please note that it is impossible from the point of view of the programmer to distinguish whether the qualities in a FASTQ file follow the offset-33 or the offset-64 encoding, so this option is mandatory. If you are not sure about which option to choose, please ask the people who generated the data: specifying the wrong option will greatly reduce the quality of your mapping. If you say ignore here, the mapping will be done assuming a flat quality model (that is, the maximum number of mismatches allowed will be that specified by the option -m, and no additional mismatch will be allowed in bad-quality bases).
--gem-quality-threshold quality_threshold (non-negative integer, default=26, that is, error=0.002)
In case of a mapping which takes qualities into account (as deduced from a FASTQ input in Phred or Solexa format), consider as good-quality the bases having a quality score above this threshold, and as bad-quality those having a quality score below this threshold. It should be noted that the interpretation of the specified quality threshold in terms of an error probability differs for Phred and Solexa scales when q<7, or error>=0.17; aside from that, the program takes care to automatically convert this value to the correct ASCII encoding (which again differs depending on whether Phred or Solexa conventions are being used in the input).
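The relation between a quality threshold and an error probability can be checked with the standard Phred and Solexa definitions. The sketch below uses those textbook formulas and the two offsets named by -q; it is not taken from the GEM sources, and the function names are illustrative.

```python
def phred_error(q):
    """Error probability for a Phred score: p = 10^(-q/10)."""
    return 10 ** (-q / 10)


def solexa_error(q):
    """Error probability for a Solexa score: q = -10*log10(p / (1 - p))."""
    return 1 / (1 + 10 ** (q / 10))


def threshold_char(q, offset):
    """ASCII character encoding score q under offset 33 or 64."""
    return chr(q + offset)


for q in (26, 7, 0):
    # The two scales nearly coincide for high scores and diverge for low ones.
    print(q, round(phred_error(q), 4), round(solexa_error(q), 4))
```

With these formulas the default threshold of 26 corresponds to an error probability of about 0.0025 on either scale, the figure that the option summary quotes as 0.002. It is encoded as ";" with offset 33 and as "Z" with offset 64.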

Single-end alignment

--mismatch-alphabet replacement_characters (string, default=ACGT)
Specifies the set of characters which are valid replacements in case of mismatch. Note that if you would like to consider Ns in the reference as wildcards, you should specify ACGNT here; otherwise, the mapper will never return positions in the reference containing Ns.
-m max_mismatches|%_mismatches (non-negative integer or fraction, default=0.04)
The maximum number of nucleotide substitutions allowed while mapping each read. It is always guaranteed that, however other options are chosen, all the matches up to the specified number of substitutions will be found by the program. In case qualities are being taken into account (multi-FASTQ input), this parameter assumes the meaning of the maximum number of mismatches which can be found in high-quality bases.
If an integer number is specified, that fixed number of mismatches is used for each read; if a floating point between 0 and 1 is given, it is assumed to be a fraction, and the number of mismatches used during the search will be (depending on the read length l) floor(%_mismatches * l).
-e max_edit_distance|%_differences (non-negative integer or fraction, default=0.00)
The maximum number of edit operations allowed while verifying candidate matches by dynamic programming. Saying
-e 0
disables the possibility of finding alignments with indels (however, "big" indels might still be found, see option --max-big-indel-length below).
If an integer number is specified, that fixed number of edit operations is used for each read; if a floating point between 0 and 1 is given, it is assumed to be a fraction, and the number of edit operations used during the search will be (depending on the read length l) floor(%_differences * l).
--min-matched-bases number|% (non-negative integer or fraction, default=0.80)
This parameter limits the number of deletions that can occur in the read (if there are too many deletions, the quality of the alignment will be questionable). The default says that at least 80% of the bases must be mapped, i.e. no more than 20% of the bases can be deleted.
If an integer number is specified, that fixed number of bases is used for each read; if a floating point between 0 and 1 is given, it is assumed to be a fraction, and the number of bases used during the search will be (depending on the read length l) floor(% * l).
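The rule shared by -m, -e and --min-matched-bases (a fixed count, or floor(fraction * l) for a read of length l) can be written out as below. How gem-mapper itself tells an integer from a fraction is not stated here; the sketch assumes that any value that parses as an integer is a fixed count, and resolve_threshold is an illustrative name.

```python
import math


def resolve_threshold(value, read_length):
    """Turn an option value such as '2' or '0.04' into a per-read count."""
    try:
        return int(value)  # plain integer: the same count for every read
    except ValueError:
        fraction = float(value)  # assumption: anything else is a fraction
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("a fractional value must lie between 0 and 1")
    return math.floor(fraction * read_length)


# Defaults of -m, -e and --min-matched-bases applied to a 100-base read.
for option, default in (("-m", "0.04"), ("-e", "0.00"), ("--min-matched-bases", "0.80")):
    print(option, resolve_threshold(default, 100))
```

For a 76-base read the default -m 0.04 gives floor(3.04) = 3 mismatches, so short reads are held to a proportionally tighter absolute bound than long ones.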
--max-big-indel-length number (non-negative integer, default=15)
The GEM mapper implements a special algorithm that, in addition to ordinary matches, is sometimes able to find a single long indel (in particular, a long insertion in the read). This option specifies the maximum allowed size for such long indel.
-s|--strata-after-best number (non-negative integer, default=0)
A stratum is a set of matches all having the same string distance from the query. The GEM mapper is able to find not only the matches belonging to the best stratum (i.e., the best matches having minimum string distance from the query) but also additional sets of matches (the next-to-best matches, the next-to-next-to-best matches, and so on) having alignment score worse than that of the best matches. This parameter determines how many strata should be explored after the best one (for example,
--strata-after-best 1
will list all the best and all the second best matches).
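The notion of strata can be made concrete with a small grouping sketch. It assumes that strata are counted among the non-empty ones, which the wording above suggests but does not state outright; select_strata and the (position, distance) tuples are illustrative and do not reflect GEM's output format.

```python
from collections import defaultdict


def select_strata(matches, strata_after_best=0):
    """Group (position, distance) matches into strata and keep the best ones.

    Returns a list of (distance, positions) pairs, best stratum first,
    containing the best stratum plus strata_after_best further strata.
    """
    by_distance = defaultdict(list)
    for position, distance in matches:
        by_distance[distance].append(position)
    kept = sorted(by_distance)[: strata_after_best + 1]
    return [(distance, by_distance[distance]) for distance in kept]


matches = [(100, 1), (250, 0), (300, 1), (420, 3)]
print(select_strata(matches))      # best stratum only
print(select_strata(matches, 1))   # best and second-best strata
```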
--fast-mapping number (non-negative integer, default=false)
Activates fast mapping modes, whereby the aligner does not align "hard" reads (that is, the usually few reads which would require too large a computational budget). Other reads are aligned as in the normal modes. The parameter number defines the computational budget (and hence
--fast-mapping 0
will be the cheapest fast mode,
--fast-mapping 1
the next-to-cheapest, and so on).
--unique-mapping (default=false)
Activates a fast mapping mode that only aligns reads mapping to the reference once. Other reads are flagged as multiply mapping and not aligned.
--allow-incomplete-strata number|% (non-negative integer or fraction, default=0.00)
Lists additional matches lying outside the strata requested by the user, at the mapper's discretion. In principle, when this option is set many more matches with a possibly very high number of errors (and hence with a possibly questionable quality) can be found.

Selecting alignments for output (single-end mode) or pairing (paired-end mode)

-d|--max-decoded-matches number|all (non-negative integer, default=20)
In single-end mode: The GEM mapper always provides a complete count of all the existing matches up to the selected number of mismatches; however, not all matches are printed, since only a few will be needed for the typical application. This option allows you to fine-tune this behaviour. You should specify all only if for some reason you already know that the maximum number of matches has a reasonable bound (which is not the case for typical mammalian genomes).
In paired-end mode: As above, but controls the alignments which are passed on to the pairing stage rather than to the printing stage.
-D|--min-decoded-strata number (non-negative integer, default=1)
In single-end mode: On some occasions (when maximum sensitivity is desirable) it might be useful to be sure that all the matches belonging to a number of strata are always output, irrespective of their number. By default, the first stratum is always printed in full. If max_decoded_matches is greater than the number of matches belonging to the strata that must be printed, additional strata are possibly printed.
In paired-end mode: As above, but controls the alignments which are passed on to the pairing stage rather than to the printing stage.

Paired-end alignment

-p|--paired-end-alignment (default=false)
Activates paired-end alignment (single-end alignment is performed otherwise).
-b|--map-both-ends (default=false)
Selects between the two possible workflows for paired-end alignment.
If --map-both-ends is specified, both ends are mapped separately, and then the program tries to pair the returned single-end matches based on the constraints imposed by relative distance and orientation. If no paired match for both ends can be found, the mapper tries to extend the single-end matches previously obtained for either end by dynamic programming. This procedure returns all the accurate results derived from independent single-end alignment of both ends, plus all the matches such that only one end is mapping as a single end, and the other end can be recovered by extending the first one using more permissive alignment parameters.
If --map-both-ends is not specified, only one end is mapped, and then the program tries to extend through dynamic programming the matches for the first end to the second end. If no match is found, the second end is mapped, and an extension of the matches thus found to the first end is attempted. As with the first workflow, and no matter whether the match for a given end is found during the mapping or the dynamic programming step, this second pairing approach too is guaranteed to find all the pairs within a given string distance; however, in many situations it turns out to be more efficient, as typically one has to single-end map only one of the two ends. On the other hand the --map-both-ends approach, despite being slower, can be used to retrieve pairs when one of the two ends contains more errors.
--min-insert-size number (default=0)
Specifies the minimum acceptable insert size for the pair. If the leftmost end aligns at position pos_lo and the rightmost end aligns at position pos_hi, the insert size is computed as pos_hi-pos_lo. This definition is different from the conventions adopted by other mappers, and is likely to change in future releases.
--max-insert-size number (default=1000)
Specifies the maximum acceptable insert size for the pair. If the leftmost end aligns at position pos_lo and the rightmost end aligns at position pos_hi, the insert size is computed as pos_hi-pos_lo. This definition is different from the conventions adopted by other mappers, and is likely to change in future releases.
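The insert size test defined by --min-insert-size and --max-insert-size amounts to the check below. The sketch treats both bounds as inclusive, following the [min_insert_size,max_insert_size] notation used for --max-matches-per-extension; the function names are illustrative.

```python
def insert_size(pos_a, pos_b):
    """Insert size as defined above: pos_hi - pos_lo."""
    pos_lo, pos_hi = sorted((pos_a, pos_b))
    return pos_hi - pos_lo


def pair_is_acceptable(pos_a, pos_b, min_insert_size=0, max_insert_size=1000):
    """True if the pair's insert size lies within the allowed range."""
    return min_insert_size <= insert_size(pos_a, pos_b) <= max_insert_size


print(insert_size(1500, 1200))          # ends may be given in either order
print(pair_is_acceptable(1200, 1500))   # within the default 0..1000 range
print(pair_is_acceptable(100, 1500))    # 1400 exceeds the default maximum
```

Note that this definition measures the distance between the two alignment positions only, so it does not include the length of the rightmost end.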
-E max_edit_distance|%_differences (non-negative integer or fraction, default=0.08)
The maximum number of edit operations allowed while extending the alignment of one end to the other one by dynamic programming. Saying
-E 0
disables extension (however, paired-end matches might still be found by simple pairing if both ends have been mapped separately at the beginning of the workflow, see option -b above).
If an integer number is specified, that fixed number of edit operations is used for each read; if a floating point between 0 and 1 is given, it is assumed to be a fraction, and the number of edit operations used during the search will be (depending on the read length l) floor(%_differences * l).
--max-extendable-matches number|all (non-negative integer, default=20)
Selects the maximum number of alignments found during the mapping of one end that can be extended to the other end with dynamic programming.
--max-matches-per-extension number (default=1)
Selects how many extensions per match should be attempted. As the extension step by dynamic programming tries to find a solution in the [min_insert_size,max_insert_size] range, it might be that the first extension is not the best one, resulting in a bias in the distance between ends (systematically too short) each time one of the two ends is aligned by dynamic programming. Hence, when maximum precision is essential, one should specify a number >1 here, depending on the insert size.
--unique-pairing (default=false)
Like the corresponding single-end option --unique-mapping, activates a paired-end mapping mode that only aligns reads mapping to the reference once. Other reads are flagged as multiply mapping and not aligned.

Miscellaneous

-T|--threads thread_number (non-negative integer, default=1)
The number of threads to be started.
-v|--verbose (default=false)
Enable additional logging messages.
-h|--help
Prints help information and exits without performing other actions.

EXAMPLES

To be completed.
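Until this section is filled in, the sketch below shows how typical invocations might be assembled from the options documented above. All file names (index.gem, reads.fastq, and so on) are hypothetical placeholders, the chosen option values are only examples, and the combination of flags has not been checked against a gem-mapper binary.

```python
import shlex


def single_end_command(index_file, reads_file, output_file, threads=1):
    """Single-end mapping of a FASTQ file with offset-33 qualities."""
    return ["gem-mapper", "-I", index_file, "-i", reads_file, "-o", output_file,
            "-q", "offset-33", "-m", "0.04", "-e", "0.04", "-T", str(threads)]


def paired_end_command(index_file, end_1, end_2, output_file, max_insert_size=1000):
    """Paired-end mapping with the two ends in separate files."""
    return ["gem-mapper", "-I", index_file, "-p", "-1", end_1, "-2", end_2,
            "-o", output_file, "-q", "offset-33",
            "--max-insert-size", str(max_insert_size)]


# Print the command lines; file names are placeholders.
print(shlex.join(single_end_command("index.gem", "reads.fastq", "out.map", threads=4)))
print(shlex.join(paired_end_command("index.gem", "end_1.fastq", "end_2.fastq", "out.map")))
```

Each list can be passed directly to subprocess.run on a system where gem-mapper is installed; printing it first makes it easy to inspect the exact command line.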

AUTHORS

Paolo Ribeca <paolo.ribeca@gmail.com>

SEE ALSO

gem-indexer, gem-rna-mapper, and the GEM website.
