gem-indexer -- The GEM indexer for genomes and proteins (protein indexing disabled in this release)
gem-indexer [OPTIONS] -i input_file -o output_prefix
The following options are available:
- -i input_sequences (string, mandatory)
- The name of the multi-FASTA file to be indexed.
- -o index_prefix (string, mandatory)
- The prefix of the generated archive. The full name will be index_prefix.gem.
- -c|--content-type content_type (dna or protein, default=dna)
- The type of data contained in the multi-FASTA file to be indexed.
- --force-fm-general-index (default: deduced from content)
- For the sake of optimization, the GEM system can produce more than one type of archive (all of them, as of this writing, variations of the FM-index). In particular, a specially optimized kind of archive, fm-dna, is supported; it is suitable for indexing DNA sequences (supporting the nucleotide characters 'A', 'C', 'G', 'T', an unknown character 'N' and an inter-sequence separator character ' '). Another archive format, fm-general, can index strings containing arbitrary characters, at the cost of reduced performance compared with DNA-only archives. By default, the program uses an fm-dna index when the content is dna and an fm-general index when the content is protein, but --force-fm-general-index can be specified to index DNA with the generic index (obtaining reduced performance).
- --filter-function filter_function (iupac-dna, iupac-colorspace-dna or none, defaults: iupac-dna for DNA, none for protein)
- In the case of DNA indexing, it is advisable to pre-filter the contents of the sequence file: for example, it might contain IUPAC wildcards for ambiguous bases. As explained above, for optimization reasons the specialized fm-dna archives can only contain the characters 'A', 'C', 'G', 'T' and 'N'; hence, the iupac-dna filter maps all the ambiguities to 'N'. The filter iupac-colorspace-dna must be used when indexing a reference for subsequent colorspace mapping.
- Same as --filter-function iupac-colorspace-dna.
- --strip-unknown-bases-threshold unknown_bases_threshold (disable or non-negative integer, default=50)
- Reference genomes frequently contain many very long stretches of Ns (typically because some regions of the genome are known to exist but have not yet been sequenced). You might wish to strip such regions from your archive if they are longer than unknown_bases_threshold; this makes the index more compact. The genomic coordinates remain the same whether or not this option is used.
- --complement-size-threshold complement_threshold (non-negative integer, default=2000000000, that is 2GB)
- How about mapping to the reverse strand of a genome? Indexing both strands usually guarantees the fastest mapping, provided that your reference genome is not larger than approximately 1GB (on modern architectures index access typically starts to be degraded by cache-miss effects when the generated archive is larger than 2GB, but the exact value can vary depending on your machine; this option lets you set the value that suits yours). Alternatively, you can index only one strand, in which case mapping to the reverse complement is achieved by mapping both the read and its reverse complement to the index; this produces the most compact indices at the price of slower mapping, but is recommended for genomes larger than 1GB, since in that case the performance loss will be moderate. By default, if you do not explicitly pass a value for complement to the option --complement below, a single-stranded archive will be generated if the final size of the two-stranded index is greater than complement_threshold; otherwise, a double-stranded index will be generated.
- --complement complement (yes, emulate or no)
- How about mapping to the reverse strand of a genome? If you specify --complement yes you will obtain an archive indexing both strands of the genome; this is the option which guarantees the fastest mapping, provided that your reference genome is not larger than approximately 1GB (on modern architectures index access typically starts to be degraded by cache-miss effects when the generated archive is larger than 2GB, but the exact value can vary depending on your machine). If you choose --complement emulate only one strand will be indexed, and mapping to the reverse complement will be achieved by mapping both the read and its reverse complement to the index; this option produces the most compact indices at the price of slower mapping, but is recommended for genomes larger than 1GB, since in that case the performance loss will be moderate. Finally, in some situations you might legitimately wish to index only one strand, and this is achieved by specifying --complement no.
- -m|--max-memory memory_threshold (force-external-memory, unlimited or non-negative integer)
- To generate the Burrows-Wheeler transform of a file a large amount of memory is required. If your reference is large (say, more than 512MB/1GB) and you do not have enough RAM available, you can still generate the index with slower algorithms (using disk space in the case of protein content, and a slower algorithm requiring little memory when content is dna). This option specifies the maximum amount of RAM which can be used by the program during BWT computation: in case more RAM is required than memory_threshold, the BWT is generated using slower algorithms.
- --sampling-rate sampling_rate (non-negative integer, default=32)
- Controls the sampling/compression rate of the FM index. The smaller this number, the larger the generated index (but only marginally: the size of most of the index is actually independent of the compression ratio) and the better the performance (but only for some of the index operations). Values which are not powers of 2 are rounded to the closest one. Choices like 64, 32 or 16 should be appropriate, depending on the desired performance.
- Keep the files generated during the intermediate stages of the computation, in particular the ones (if any) related to the generation of the Burrows-Wheeler transform. Although this option can be useful for debugging, it is probably not suitable for production, since the temporary files are bulky.
- --mm-tmp-prefix swap_prefix (string, default='/tmp/mm_new-')
- Choose the prefix of the on-disk swap files which the memory manager can generate if there is not enough RAM available. If this happens, you are better off using the option --max-memory force-external-memory to generate the Burrows-Wheeler transform directly on disk with specialized algorithms, which is much faster. Useful only in very peculiar cases.
- Verify the correctness of the produced index by running extensive (and expensive) tests. Very slow.
- Be (very) verbose during all stages of index generation. Useful mainly for debugging purposes.
- Print help information and exit without doing anything else.
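The pre-filtering and N-stripping steps described for --filter-function and --strip-unknown-bases-threshold can be illustrated with a small Python sketch. This is not GEM's implementation: the function names, the upper-casing of the input and the handling of boundaries are assumptions made purely for illustration. It maps IUPAC ambiguity codes to 'N' and then splits a sequence around runs of 'N' longer than a threshold, keeping the original offset of each retained segment so that genomic coordinates stay unchanged.

```python
import re

# IUPAC nucleotide ambiguity codes (everything except A, C, G, T and N).
IUPAC_AMBIGUOUS = "RYSWKMBDHV"

# Translation table sending every ambiguity code (either case) to 'N'.
_TO_N = str.maketrans(
    IUPAC_AMBIGUOUS + IUPAC_AMBIGUOUS.lower(),
    "N" * (2 * len(IUPAC_AMBIGUOUS)),
)


def filter_iupac_dna(sequence: str) -> str:
    """Map IUPAC ambiguity codes to 'N' (hypothetical stand-in for iupac-dna).

    Upper-casing the result is an assumption of this sketch.
    """
    return sequence.translate(_TO_N).upper()


def split_on_long_n_runs(sequence: str, threshold: int):
    """Drop runs of 'N' longer than `threshold`.

    Returns a list of (offset, segment) pairs, where `offset` is the position
    of the segment in the ORIGINAL sequence, so coordinates are preserved.
    """
    segments = []
    start = 0
    for match in re.finditer(r"N+", sequence):
        if match.end() - match.start() > threshold:
            if match.start() > start:
                segments.append((start, sequence[start:match.start()]))
            start = match.end()
    if start < len(sequence):
        segments.append((start, sequence[start:]))
    return segments


if __name__ == "__main__":
    filtered = filter_iupac_dna("acgtRYNNNNNacNNgt")
    print(filtered)
    print(split_on_long_n_runs(filtered, threshold=3))
```

Short runs of 'N' (at or below the threshold) are kept inside their segment, which mirrors the description above: only stretches longer than unknown_bases_threshold are candidates for stripping.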
Supposing that your genome (in multi-FASTA format) is contained in the file my_genome.fas, the basic example is as simple as
gem-indexer -i my_genome.fas -o my_indexed_genome
This will produce a file named my_indexed_genome.gem, which contains the GEM archive to be used later by other programs.
You need to supply additional options only rarely. A more complicated example could be the following one. You have the chicken genome, which is 1.1GB, stored in the file chicken.fas, and you intend to index both the forward and the reverse complement strand. This would normally not happen with the standard choice of parameters, since the size of the generated two-stranded archive is 2.2GB, which is greater than the default value of 2GB for complement_threshold. You can then force the generation of a two-stranded index by using either
gem-indexer -i chicken.fas -o chicken --complement-size-threshold 3000000000
or
gem-indexer -i chicken.fas -o chicken --complement yes
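The default strand decision behind the chicken example can be written down as a short Python sketch. It is only an illustration of the rule stated in the option descriptions, under two loudly stated assumptions: the size of the two-stranded index is approximated as twice the genome size (as in the 1.1GB to 2.2GB figure above), and the single-stranded default is taken to be the emulate mode. The function name and the "auto" value are hypothetical and are not part of the GEM command line.

```python
DEFAULT_COMPLEMENT_THRESHOLD = 2_000_000_000  # the documented default, 2GB


def choose_strand_mode(genome_size: int,
                       complement: str = "auto",
                       threshold: int = DEFAULT_COMPLEMENT_THRESHOLD) -> str:
    """Return 'yes', 'emulate' or 'no' for a genome of `genome_size` bases.

    `complement="auto"` is a hypothetical placeholder meaning that no explicit
    value was given on the command line.
    """
    if complement in ("yes", "emulate", "no"):
        # An explicit choice always wins over the size-based default.
        return complement
    # Assumption: a two-stranded index is roughly twice the genome size.
    two_stranded_size = 2 * genome_size
    # Assumption: the single-stranded default corresponds to 'emulate'.
    return "emulate" if two_stranded_size > threshold else "yes"


if __name__ == "__main__":
    chicken = 1_100_000_000
    print(choose_strand_mode(chicken))                           # default threshold
    print(choose_strand_mode(chicken, threshold=3_000_000_000))  # raised threshold
    print(choose_strand_mode(chicken, complement="yes"))         # explicit choice
```

Both workarounds shown above lead to the same outcome in this model: raising the threshold above 2.2GB or passing an explicit yes yields a two-stranded index.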
Paolo Ribeca mailto:firstname.lastname@example.org.