IsoInfer V0.8

Source Code

Compilation

Install the following C/C++ libraries: glpk, gsl and QuadProg++. If you cannot install those packages in the standard system directories but install them at /your/installed/path/ for example, you have to modify the environment variables LD_LIBRARY_PATH and CXXFLAGS by :
```
export CXXFLAGS="-g -O3 -I/your/installed/path/include -L/your/installed/path/lib" 
export LD_LIBRARY_PATH="/your/installed/path/lib:"$LD_LIBRARY_PATH 
```
If the compiler complains that it cannot find the library even you installed them by default, you should also find the installed path of these libraries and specify the two environment variables above manually. This may happen on QuadProg installed on Ubuntu by "apt-get install" command.
Compile the graphlib package in the source code. The compilation follows the standard configure, make process by executing the following commands in sequence:
```
./configure
make
```
Compile the isoinfer package in the source code. The compilation follows the standard configure, make process by executing the following commands in sequence:
```
./configure
make
```

Note that it is necessary to keep the graphlib and isoinfer under the same directory. Environment variable LD_LIBRARY_PATH should be exported like the first step every time when you open a new shell to run the program. I suggest you putting it in the .bashrc file in your home directory.

Manual

Usage:

Jobs:

-h	Print help information
-ext_junc_ref	Extract junction ref sequence. Involved parameters are -rstart, -bound, -grange, -tsspas, -ref and -read_info.
-predict	Infer isoforms. Involved parameters are -bound, -grange, -tsspas, -read_info, -conf_level, -intron_exp, -min_exp, -ps, -min_dup, -bse and -o.

Options:

Parameter	Range	Default Value	Description
-rstart	number	0	For job -ext_junc_ref, the parameter specifies the start position of the first neocliotide of a chromosome. This parameter is to make sure that the coordinations used in the program is consistent with the coordinations provided by -bound, -grange and -tsspas.
-bound	file	N/A	Boundary file. The format of the file is : chromosome strand position type Every two consecutive fields are separated by a single TAB. Each line corresponds to a boundary with a certain type. The position is always the position of the first base of an exon or intron. 'type' are binary. type = 0 for intron->exon, type = 1 for exon->intron. It is possible that a boundary is both type 0 and type 1. In this case, provide two lines for this boundary, with one line for type 0 and another line for type 1. If type information is unavailable, set types as 0 for all the boundaries.
-grange	file	N/A	Gene range file. The format of the file is : gene_name chromosome strand start_position end_position Every two consecutive fields are separated by a single TAB. Each line corresponds to a gene.
-tsspas	file	N/A	TSS and PAS file. The format of the file is : gene_name TSSs PASs Every two consecutive fields are separated by a single TAB. gene_name should be consistent with the gene range file (specified by -grange). TSSs or PASs are sepereted by commas. In each line, an isoform starting from one element in TSSs must end with some element in PASs on the same line. For a gene, multiple lines may be provided. There is no constraint on different lines.
-ref	file	N/A	Reference sequence in a single file.
-read_info	file	N/A	A file storing the basic read information. The format of this file is: mapping_file 0/1 [end_len] cross_strength noise_level total_read_cnt distribution_type definition_of_a_distribution The format for the mapping_file is : chromosome strand start_positions end_positions Every two consecutive fields are separated by a single TAB. Each line corresponds to the mapping information of a read. Each read could be mapped to multiple segments of the reference sequence. 'start_positions' ('end_positions) are all the start (end) positions, separated by commas, of all the segments. This format is similar to the BED format of USCS. For example, a paired-end read with end length 50 is mapped to the RefSeq. The first end of this paired-end read is mapped to segments: [300,330) and [700,720) on the positive strand of the chromosome chr1 and the second end of this paired-end read is mapped to segments: [1300,1315) and [2000,2035) on the positive strand of the chromosome chr1. Then the mapping file for this paired-end read should contain two consecutive lines: chr1 + 300,700 330,720 chr1 + 1300,2000 1315,2035 The second term in a read info file indicates whether the read is paired-end or not. If it is paired-end, set 1 here. Otherwise 0 should be set. If it is 1, then the first item in the following line is the length of each end of the paired-end read. The lengths of the two ends of a paired-end read is supposed to be the same. If the read is not paired-end, then the following line starts from cross_strength. In the second line of a read_info file, cross_strength is the minimum number of base pairs of agreement between a read and an exon when the read is considered as partially aligned to the exon. This parameter influence all the stuff related to junctions. noise_level is the estimated noise level in RPKM. When doing job -predict, this parameter will be used to reduce noise. In the RNA-seq experiment conducted in (Nat Methods. 2008 Jul;5(7):621-8), about 7% short reads are mapped to introns and intergenic regions, while introns and intergenic regions comprise more than 98% of the entire mouse genome. By taking into consideration of the fact that the size of the mouse genome is about 2.5G, the noise level could be estimated as less than 0.03 RPKM. total_read_cnt should be the total number of successfully mapped short reads. This number would not be known before the reads are actually loaded by IsoInfer. Therefore, put an arbitrary number here initially. The program will update this number automatically. Currently, three type of distributions are supported: Constant: The definition_of_a_distribution should be the_constant Gaussian distribution: The definition_of_a_distribution should be mean standard_deviation Customized (Under construction): The definition_of_a_distribution should be value_cnt value probability value probability ... For example, for paired-end reads with span distribution N(300, 30^2) and end length 20. If the mapping file is "my_map_file", cross strength is 3, noise level is 1 RPKM, the number of total reads is 10M. Then the read info file should be: my_map_file 1 20 3 1 10000000 1 300, 30 After one read info, another one could be followed in the same file. On job -ext_junc_ref, only the first read info in the file is effective. If, on some job, not all the information in the read info is usefully, then the unused items can be set to any value.
-s	T/F	F	Whether are the operations strand specific?
-ins	file	N/A	A file containing instances.
-min_exp	number	0	The minimum expression level in RPKM. By default, this parameter will not have influence on the result. The greater the value is, the higher sensitivity / lower precision the result achieves. For example, this parameter could be set as 1.
-intron_exp	number	3	When doing job -predict, a segment with expression level below this parameter will be considered as an intron. A carefully selected value for this parameter is critical. If this value is 0, then all the segments will considered as expressed, which will introduce noise segments in the following isoform predictions. If this value is too high, many expressed segments will be considered as introns, which will lower the sensitivity. The sequence bias, noise, multireads and inaccuracy of exon-intron boundary should be taken into consideration in the setting of -intron_exp. By our tests, 3~5 is a reasonable value for this parameter.
-min_dup	number	1	A junction is covered if at least -min_dup reads cover this junction.
-bse	T/F	T	Use the TSS/PAS information or not.
-ps	number	7	Partition size. On whole mouse genome, the isoform inference process (Step4 in the following example) costs about 10 minutes on a standard PC with this default parameter. A larger value is supposed to lead to better results.
-conf_level	number in [0,1]	0.05	Set the confidence level.
-o	file	N/A	A file for output For job -ext_junc_ref, each junction forms two consecutive lines in the file. The odd line is the junction ID which is in the form of : chromosome\|position1\|position2\|cross_len\|Junc The following schematic graph defines position1 and position2 for a junction The even line is the concatenation of sequences [position1, position1+cross_len) and [position2, position2+cross_len) For job -predict, each line in the output is an predicted isoforms in a format similar to UCSC known genes: ID chromosome strand start_position end_position exon_start_positions exon_end_positions

Example:

The following example is based on single-end short reads. In the following example, an example read_info file and several useful scripts are provided. The usages of all the scripts are straightforward. Please read the script for the usages.

Use a script knownGeneExtractor to extract the required files needed by -bound, -grange and -tsspas from a knownGene table downloaded from UCSC
```
./knownGeneExtractor knownGene
```
Modify the the example read_info file as you want.
Extract junction sequences non-strand-specifically
```
isoinfer -ext_junc_ref -s F -rstart 0 -bound Bound -grange GeneRange -tsspas TSSPAS -ref refseq -read_info read_info -o juncref
```
Note that the known gene table and the reference sequence you downloaded should be consistent. Only the read length and cross strength in the read_info file is used in this step.
Use Bowtie to map single-end short reads to the reference sequence and junction sequences. You can use the script tranMappedRefReads to extract the mapping information of reads to the reference sequence from the default output of bowtie. You can use the script tranMappedJuncReads to extract the mapping information of reads to junction sequences from the default output of bowtie. Then put the output of these two scripts together, e.g. into file "mapped_reads".

Predict isoforms. Set -min_exp to 0.1 and all other parameters to default.

isoinfer -predict -bound Bound -grange GeneRange -tsspas TSSPAS -min_exp 0.1 -read_info read_info -o results