LSC

  • LSC is a long read error correction tool.
    It offers fast correction with high sensitivity
    and good accuracy.

Latest News: Major update, version 2.0.

Manual

Installation

No explicit installation is required for LSC. You may save the LSC code to any location, and you can optionally add the bin/ directory to your PATH.
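For example, in bash (assuming you saved LSC to /path/to/LSC-2.0):
    $ export PATH=/path/to/LSC-2.0/bin:$PATH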

You need Python 2.7 installed on your computer, which is included by default with most Linux distributions. Please see the LSC requirements for more details.

Using LSC

Firstly, see the tutorial on how to use LSC on some example data.

In order to use LSC on your own data, you will need to decide whether a full run or a parallelized run would be more efficient. If you are running a large dataset (tens of millions of short reads and hundreds of thousands of long reads, or larger), we highly recommend a parallelized run. A step-by-step guide for both normal and parallelized execution can be found in the tutorial. To access help regarding the command line options, use the -h option with runLSC.py.

Executing runLSC.py

"runLSC.py" is the main program in the LSC package. Output is written to the "--output" folder. Details of the output files are described in file formats. Its options are described here and can be accssed using the -h option when running runLSC.py:

$ LSC-2.0/bin/runLSC.py -h
usage: runLSC.py [-h] [--long_reads LONG_READS]
                 [--short_reads [SHORT_READS [SHORT_READS ...]]]
                 [--short_read_file_type {fa,fq,cps}] [--threads THREADS]
                 [--tempdir TEMPDIR | --specific_tempdir SPECIFIC_TEMPDIR]
                 [-o OUTPUT]
                 [--mode {0,1,2,3} | --parallelized_mode_2 PARALLELIZED_MODE_2]
                 [--aligner {hisat,bowtie2}] [--sort_mem_max SORT_MEM_MAX]
                 [--minNumberofNonN MINNUMBEROFNONN] [--maxN MAXN]
                 [--error_rate_threshold ERROR_RATE_THRESHOLD]
                 [--short_read_coverage_threshold SHORT_READ_COVERAGE_THRESHOLD]
                 [--long_read_batch_size LONG_READ_BATCH_SIZE]
                 [--samtools_path SAMTOOLS_PATH]

LSC 2.0: Correct errors (e.g. homopolymer errors) in long reads, using short
read data

optional arguments:
  -h, --help            show this help message and exit
  --long_reads LONG_READS
                        FASTAFILE Long reads to correct. Required in mode 0 or
                        1. (default: None)
  --short_reads [SHORT_READS [SHORT_READS ...]]
                        FASTA/FASTQ FILE Short reads used to correct the long
                        reads. Can be multiple files. If choice is cps reads,
                        then there must be 2 files, the cps and the idx file
                        following --short_reads. Required in mode 0 or 1.
                        (default: None)
  --short_read_file_type {fa,fq,cps}
                        Short read file type (default: fa)
  --threads THREADS     Number of threads (Default = cpu_count) (default: 0)
  --tempdir TEMPDIR     FOLDERNAME where temporary files can be placed
                        (default: /tmp)
  --specific_tempdir SPECIFIC_TEMPDIR
                        FOLDERNAME of exactly where to place temporary
                        folders. Required in mode 1, 2 or 3. Recommended for
                        any run where you may want to look back at
                        intermediate files. (default: None)
  -o OUTPUT, --output OUTPUT
                        FOLDERNAME where output is to be written. Required in
                        mode 0 or 3. (default: None)
  --mode {0,1,2,3}      0: run through, 1: Prepare homopolymer compressed long
                        and short reads. 2: Execute correction on batches of
                        long reads. Can be superseded by --parallelized_mode_2
                        where you will only execute a single batch. 3: Combine
                        corrected batches into a final output folder.
                        (default: 0)
  --parallelized_mode_2 PARALLELIZED_MODE_2
                        Mode 2, but you specify a single batch to execute.
                        (default: None)
  --aligner {hisat,bowtie2}
                        Aligner choice. hisat parameters have not been
                        optimized, so we recommend bowtie2. (default: bowtie2)
  --sort_mem_max SORT_MEM_MAX
                        -S option for memory in unix sort (default: None)
  --minNumberofNonN MINNUMBEROFNONN
                        Minimum number of non-N characters in the compressed
                        read (default: 40)
  --maxN MAXN           Maximum number of Ns in the compressed read (default:
                        None)
  --error_rate_threshold ERROR_RATE_THRESHOLD
                        Maximum percent of errors in a read to use the
                        alignment (default: 12)
  --short_read_coverage_threshold SHORT_READ_COVERAGE_THRESHOLD
                        Minimum short read coverage to do correction (default:
                        20)
  --long_read_batch_size LONG_READ_BATCH_SIZE
                        INT number of long reads to work on at a time. This is
                        a key parameter to adjusting performance. A smaller
                        batch size keeps the sizes and runtimes of
                        intermediate steps tractable on large datasets, but
                        can slow down execution on small datasets. The default
                        value should be suitable for large datasets. (default:
                        500)
  --samtools_path SAMTOOLS_PATH
                        Path to samtools; by default it is assumed to be
                        installed. If not specified, the included version
                        will be used.
                        (default: samtools)

The required options differ depending on the run mode, but the most basic way to run LSC is an end-to-end run that does not save temporary files:
    $ LSC-2.0/bin/runLSC.py --long_reads myLR.fa --short_reads mySR.fa --output myoutputdir
    
You may also execute LSC step by step. To do this, you must specify a temporary directory that will not be deleted, using --specific_tempdir:
    $ LSC-2.0/bin/runLSC.py --mode 1 --long_reads myLR.fa --short_reads mySR.fa \
      --specific_tempdir mytempdir
    
    $ LSC-2.0/bin/runLSC.py --mode 2 --specific_tempdir mytempdir
    
    $ LSC-2.0/bin/runLSC.py --mode 3 --specific_tempdir mytempdir --output myoutdir
    
Alternatively, a parallelized workflow can be used by replacing the --mode 2 parameter with --parallelized_mode_2 X, where X is the batch number, and executing the command once for each batch. To find the number of batches after running --mode 1, look in mytempdir/batch_count. You will need to execute batches 1 through batch_count, inclusive.
    $ LSC-2.0/bin/runLSC.py --mode 1 --long_reads myLR.fa --short_reads mySR.fa \
      --specific_tempdir mytempdir
    
Now the compressed long and short reads are ready for analysis.
    $ cat mytempdir/batch_count
    
This will tell you the number of batches, X. Now run the following for each batch number from 1 to X (or loop over the batches, as sketched below):
    $ LSC-2.0/bin/runLSC.py --parallelized_mode_2 X --specific_tempdir mytempdir
    
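On a single machine you can also run all batches sequentially with a small shell loop (a sketch; on a cluster you would instead submit one job per batch number):
    $ for i in $(seq 1 $(cat mytempdir/batch_count)); do
          LSC-2.0/bin/runLSC.py --parallelized_mode_2 $i --specific_tempdir mytempdir
      done
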
Finally, you can combine all the outputs back together:
    $ LSC-2.0/bin/runLSC.py --mode 3 --specific_tempdir mytempdir --output myoutdir
    
Please refer to the Tutorial for an example.

Input files

LSC accepts one long-read sequence file (to be corrected) and one or more short-read sequence files as input. The input files can be in standard FASTA or FASTQ format. Gzipped inputs are supported.
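For example, a run with gzipped FASTQ short reads might look like this (file names are placeholders):
    $ LSC-2.0/bin/runLSC.py --long_reads myLR.fa --short_reads mySR.fq.gz \
      --short_read_file_type fq --output myoutputdir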

Note: As part of the LSC algorithm, homopolymer-compressed short-read sequences are generated before alignment. If you have already run LSC with the same SR dataset, you can skip this step by reusing the previously generated homopolymer-compressed SR files. (You can find SR.fa.cps and SR.fa.idx in the temp folder.) Please keep in mind that if you are using the cps and idx SR files, you will need to specify both their locations as two parameters following --short_reads.
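For example (file names are placeholders):
    $ LSC-2.0/bin/runLSC.py --long_reads myLR.fa --short_reads SR.fa.cps SR.fa.idx \
      --short_read_file_type cps --output myoutputdir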

Output files

There are four output files in the output folder: corrected_LR.fa, corrected_LR.fq, full_LR.fa, and uncorrected_LR.fa.

The quality (error rate) of each corrected read in corrected_LR.fq depends on its SR coverage; quality values use the Sanger standard (Phred+33) encoding.

Reference: LSC paper
* Error probability is modeled with a logarithmic function fitted to the real-data error probabilities computed in the paper.
SR Coverage    Error Probability*
 0             0.275
 1             0.086
 2             0.063
 3             0.051
 4             0.041
 5             0.034
 6             0.028
 7             0.023
 8             0.018
 9             0.014
10             0.011
11             0.008
12             0.005
13             0.002
>= 14          ~0.000

Note: Any part of a corrected_LR sequence without short read coverage is assigned the default 27.5% error rate (the coverage-0 row above). If the input LRs are in FASTQ format, the original quality values are not used here.
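For illustration only (this is not LSC's internal code), an error probability from the table above maps to a FASTQ quality character via the Phred transform Q = -10 log10(p), with Sanger encoding adding 33 to Q; the cap of Q = 40 below is an assumption:

    import math

    def phred_char(error_prob, max_q=40):
        # Phred transform: Q = -10 * log10(p), capped at max_q
        q = max_q if error_prob <= 0 else min(max_q, -10 * math.log10(error_prob))
        # Sanger (Phred+33) encoding: ASCII character 33 + Q
        return chr(33 + int(round(q)))

    print(phred_char(0.275))  # SR coverage 0  -> Q of about 6
    print(phred_char(0.011))  # SR coverage 10 -> Q of about 20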

Module: filter_corrected_reads.py

In addition to the quality information in the corrected_LR.fq file, you can also select corrected LR sequences with a higher percentage of SR-covered length using the filter_corrected_reads.py script in the bin folder.

LSC_bin_path/filter_corrected_reads.py <SR_covered_length_threshold> <corrected_LR.fa or fq file> > <output_file>

Example:     python bin/filter_corrected_reads.py 0.5 output/corrected_LR.fa > output/corrected_LR.filtered.fa

You can also select the "best" reads for your downstream analysis by mapping the corrected LRs to the reference genome or annotation (for RNA-seq analysis), then filtering the reads by mapping score or percentage of base match (e.g., "identity" in BLAT).
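For example, one possible identity filter (a sketch; reference.fa, the file names, and the 0.90 threshold are placeholders, not part of LSC) uses BLAT's headerless psl output, where column 1 is the number of matching bases and column 11 is the query length:
    $ blat reference.fa output/corrected_LR.fa -noHead aligned.psl
    $ awk '$1 / $11 >= 0.90' aligned.psl > high_identity.psl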

Short read-Long read Aligner

LSC uses a short read aligner in the first step. By default, Bowtie2 is used. You can also use BWA, Novoalign, or RazerS (v3) for this step.

Default aligner settings are:

    BWA : -n 0.08 -o 10 -e 3 -d 0 -i 0 -M 1 -O 0 -E 1 -N
    Novoalign* : -r All -F FA -n 300 -o sam
    RazerS3 : -i 92 -mr 0 -of sam
You can change these settings through the .cfg file. Please refer to the aligners' manuals for more details.
* Note: Novoalign has a limitation on read length. If you are using LSC with Novoalign, please make sure your short read lengths do not exceed its maximum threshold.

The following figures compare LSC correction results with the different supported aligners. The identity metric is defined as number-of-matches / error-corrected-read-length after aligning reads to the reference genome using BLAT.

[Figures: per-aligner identity comparison. Data-set: description not shown.]
Based on your system configuration, you can select the aligner that best fits your CPU and memory resources.
The table below was derived experimentally by running LSC with different aligners on the above-mentioned data-set.

 Aligner    CPU    Memory
 BWA        Less   Less
 Bowtie2    More   Less
 RazerS3    More   More

Short-read coverage depth (SCD)

LSC uses the consensus of short-read mapping results to correct long read sequences. With high SR coverage, the pile of SRs mapped to a LR segment significantly increases running time and memory usage in the correction step while contributing repetitive (redundant) information. By setting the SCD parameter in the run.cfg file, LSC uses a probabilistic algorithm to randomly select a bounded number of SR alignment results for each LR region, in order to maintain an expected SR coverage depth of SCD. This eliminates high memory peaks in the correction step caused by piles of SRs mapped to high-coverage or repetitive regions. Based on our experiments on multiple datasets, setting SCD = 20 gave results comparable to SCD = -1 (using all alignment results, i.e., without any coverage bound).
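For example, assuming run.cfg uses simple key = value lines (the exact syntax is not confirmed here), the bounded-coverage setting from the text above would be:
    SCD = 20
and SCD = -1 would disable the bound and use all alignment results.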

Execution Time

The following CPU and execution times are indicative, measured using LSC 0.2.2 and LSC 1.alpha on our clusters with six threads. These figures will differ greatly based on your system configuration.

100,000 PacBio long reads X 64 million 75bp Illumina short reads (Dataset)