The Institute for Systems Biology COSEG Download

COSEG is a program which automatically identifies repeat subfamilies using significant co-segregating ( 2-3 bp ) mutations.

Prerequisites
  1. Unix system with RepeatMasker 3.2.0 or higher installed
    • A perl script included in the COSEG package depends on the RepeatMasker 3.2.0 (and higher) release. RepeatMasker installation instructions can be found here
  2. Sequence Search Engine
    COSEG uses alignment data generated by a sequence search engine. Cross_match and WUBlast are supported by COSEG.
  3. GraphViz Program [optional]
    COSEG outputs tree data in the GraphViz format. Installing the GraphViz program will allow you to convert this format into many different graphics file formats for viewing.
Installation
  1. Download COSEG
    Latest Released Version: 11/5/2015: coseg-0.2.2.tar.gz
    Previous Version: 8/3/2010: coseg-0.2.1.tar.gz
  2. Unpack Distribution
    Unpack the distribution in your home directory or in a temporary location ( ie. /tmp ).
    • cd /mytmp/location/
    • gunzip coseg-#.#.#.tar.gz
    • tar xvf coseg-#.#.#.tar
  3. Build
    To compile the C program run:
    • cd coseg
    • make
    • Edit preprocessAlignments.pl: Change line 114:
                 use lib "/usr/local/RepeatMasker";
               
      To point to your local RepeatMasker installation.
  4. Test
    See tutorial below for running COSEG on the provided example datasets.
  5. Install
    • Edit the Makefile and change INSTALLDIR to the location where you want the files installed.
    • run: "make install"
ALU Example Run
  1. Given the sample ALU dataset provided by Alkes Price in the original distribution of his code ( files have been renamed ):

    • ALU.seqs: Human sequence data from alignment to AluSx consensus )
    • ALU.ins : Insertion sequences from the alignments
    • ALU.cons: AluSx consensus

    Note: that positions 116-135 (inclusive) are lowercase while all remaining positions are uppercase. This is how we encode positions within the consensus that have lower quality and should not be considered in this analysis.

  2. Run analysis:
    runcoseg.pl -d -m 50 -c ALU.cons -s ALU.seqs -i ALU.ins
    NOTE: In this run we must tell coseg to treat lower case characters in the consensus as a blacklist designation by using the "-d" flag.

  3. Create png/svg files:

    Coseg outputs a visualization of the subfamily tree in both GraphViz and SVG formats. The SVG format can be viewed directly in any web browser. The GraphViz format needs to be converted to a graphics format ( ie. PNG ) using the GraphViz command line tools ( http://www.graphviz.org/ ) or loaded using the graphViz webapp ( https://mdaines.github.io/viz.js/ ). The SVG output may be resized by changing the first line in the SVG file to reflect the output size you would like. For example the default is to display the entire graph in the current size of the web browser window:

              <svg height="100%" width="100%" viewbox="0 0 90.625 146.25">
           
    Changing height/width to:
              <svg height="147" width="91" viewbox="0 0 90.625 146.25">
           
Output Files
The following files are named after the aligned sequence file name as a prefix.
  • *.log -- A log of mutation sites found and the order of the clustering.
  • *.subfamililies.seq -- name, count, P-value and consensus sequence of each subfamily found by our algorithm. (For subfamilies not in the original scaffold, we also include in parentheses the P-value of the scaffold subfamily from which it is derived).
  • *.assign -- for each of the elements, lists the subfamily to which the algorithm has assigned it.
  • *.tree.svg -- evolutionary tree of the subfamilies, in SVG format.
  • *.tree.viz -- evolutionary tree of the subfamilies, in GRAPHVIZ format.
Input Files
  • *.cons -- Consensus File: The consensus used to generate the aligned seqeunce file. This is in fasta format.
  • *.seqs -- Aligned Sequences File: Near full length aligned sequences to the consensus file ( one per line ). Deletions are represented by 1 or more "-" characters. Insertions of any length are represented by a single "+" character.
  • *.ins -- Inserts File: Inserts pulled from the alignments are stored in this file. The aligned sequence file ordering is used in this file. Each insert is encoded as:
                [<position>:<sequence>][<position>:<sequence>]..
    
        ie.
                 3:AA 10:AGTTTA
        
    A script is provided to build these files automatically given either wu-blast or cross_match alignment files -- see below.
Running Using Your Own Data
  1. Cross_match a reference sequence against a genome or database.
           cross_match line1copies consensus -M 25p41g.matrix
                       -gap_init -25 -gap_ext -5 -minscore 200
                       -minmatch 6 -alignments -bandwidth 50 -word_raw > LINE1
           
    The example file LINE1 included in this distribution was created using the command line above and can be used directly in the following steps.

  2. Determine consensus range to use for analysis ( ie. 298 - 797 bp )

  3. Create input files to alkes programs:
           preprocessAlignments.pl -maxEdgeGap 10
                                    -minConsRange 298
                                    -maxConsRange 797
                                    -alignments LINE1
    
           This will create 3 new files: LINE1.seqs
                                         LINE1.ins
                                         LINE1.cons
           
    NOTE: Use the -w flag to preprocessAlignments if you use WUBlast to perform the alignments.

  4. Run analysis:
           runcoseg.pl -t -m 50 -c LINE1.cons -s LINE1.seqs
                       -i LINE1.ins
           
    NOTE: In this example we use two new flags. The first "-a" indicates we want to use the newer pvalue calculation developed by Andy Siegel. The second flag "-t" indicates we want to use 3 bp co-segregating mutations as well as 2bp co-segregating mutations when developing subfamilies.

  5. Open up a web browser and point it at the file LINE1.seqs.tree.svg. Most browsers support zooming in on svg files. If you want to render the SVG file larger by default simply edit the *.svg file and change the line:
              <svg height="100%" width="100%" viewbox="0 0 90.625 146.25">
         
    to reflect a fixed size for the graph. The "viewbox" values give the absolute size of the drawing so a 1:1 scale would be:
              <svg height="145.25" width="90.625" viewbox="0 0 90.625 146.25">
         

  6. If you would like to produce a SVG file without node label or using either the divergence, P-value or subfamily size as the label, simply rerun ./postprocess.pl without the "-l" flag or with "-l div", "-l pv", "-l c" accordingly.
Experimental
Using the refiner contained in the RepeatModeler package one may submit subfamily members for consensus refinement. There is an experimental script called refineConsSeqs.pl that will run this analysis:
  1. From the directory where your coseg results can be found run:
              refineConsSeqs.pl -subConsFile mycosegrun.subfamilies.seq
    
    Where "mycosegrun" is the prefix of the coseg run.
Release Notes
coseg-0.2.2
  • Create a *.svg graph file without the need to download/use GraphViz. The layout is handled by an adaptation of algorithm developed by Atze van der Ploeg. The SVG file produced supports various labeling options and subfamily details displayed when a node is hovered over.
  • Changed the default colormap for the graph output. Now warm colors denote more diverged subfamilies in the tree while cooler colors represent younger subfamilies. To restore the original color scheme use the new "-o" flag to coseg.
  • Added parameter to control the minimum distance between diagnostic sites. Now the user can override the historic value of 10 using the -u flag.
  • Improved error reporting when there is a mismatch between an individual sequence length and the consensus length in the input files.
  • Fixed a bug that caused coseg to segfault.
  • Added experimental script refineConsSeqs.pl. This script uses the RepeatModeler application to build and refine the consensus sequences for each subfamily.
coseg-0.2.1
  • Improved code documentation
  • Single mutation significance cutoff ( SIGMATHRESH ) was pre-calculated for Alkes Alu analysis and hardcoded. This version calculates the correct sigma cutoff using the length of the input sequence.
  • Fixed bug with implementation of Siegel's pValue calculation which caused a segfault -- found by Neal Platt.
  • Switched default pvalue method to Andy Siegel's method and provided a new "-k" switch to use Alkes Price's method.
  • Fixed bug where the program was exiting when calculations fell below the precision of the machine ( epsilon ). Message given was "Below epsilon..." and the runcoseg.pl script moved on even though coseg failed.

coseg-0.2.0

  • First release of COSEG.
COSEG is licensed under the Open Source License v2.1.
Institute for Systems Biology
This server is made possible by funding from the National Human Genome Research Institute (NIGRI grant # RO1 HG002939).