LSC

  • LSC is a long read error correction tool.
    It offers fast correction with high sensitivity
    and good accuracy.

Latest News: Major update version 2.0 ...

Tutorial

This tutorial will help you get started with LSC by demonstrating how to error-correct a sample set of 500k PacBio long reads using 1 million short reads of length 75 bp. If you run into any problems while following these steps, please don't hesitate to contact us.

Step 1 - Download and extract the example files

Download the example:

Extract the example to an empty folder of your choice. After extraction, the folder should contain the following files and folders:

dn800c9107:example moo$ ls
data	bin		run.cfg

This tutorial uses the OS X version. However, the steps are the same for all versions.
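If you would like to confirm the extraction before continuing, the listing above can be checked with a few lines of Python. This is only a minimal sketch: the three entry names are taken from the listing, while the helper name `missing_entries` is made up for this illustration and is not part of LSC.

```python
from pathlib import Path

# Entries shown in the directory listing above.
EXPECTED = ("data", "bin", "run.cfg")


def missing_entries(example_dir):
    """Return the expected entries that are absent from example_dir."""
    root = Path(example_dir)
    return [name for name in EXPECTED if not (root / name).exists()]


if __name__ == "__main__":
    # Run from inside the extracted example folder.
    missing = missing_entries(".")
    if missing:
        print("missing: " + ", ".join(missing))
    else:
        print("all expected entries are present")
```

An empty result means the folder matches the listing; anything else usually points to an incomplete download or extraction.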

Step 2 - Examine the example directory contents

Before we continue, it will be helpful to learn the purpose of each file in this example. When you run LSC on your data, all of these files can be in separate locations if you wish.
run.cfg
This is the most important file. It is a text file that contains the path to your sequencer reads and the configuration settings. Please see .cfg file format for details. It is simple to edit and you will need to edit it once for each data-set.
data directory
This directory contains all of the sequencer reads in the example. For your own data, this directory can be anywhere and may be read-only. The example provides a long-read file, LR.fa, and a short-read file, SR.fa.
bin directory
This directory stores all of the LSC binaries. It is important that all the binaries are in the same location. No installation is required! Simply copy this directory to a location convenient for you.
temp directory
This is a temporary directory created during the execution of LSC. The results of the initial short-read mapping are stored here, so this directory can become quite large.
Note: You can use the '-clean_up' option in the run.cfg file to limit the intermediate files that LSC keeps after run-time for later reference.
output directory
This directory stores all the useful output files after executing LSC. It is also created during the execution of LSC.
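Since run.cfg holds the paths to your reads, it can be worth checking those paths before starting a long run. The sketch below assumes that run.cfg consists of `key = value` lines and that lines starting with `#` are comments. Both points are guesses based on the settings LSC echoes at start-up, so please rely on the .cfg file format page for the actual syntax. The key names `LR_pathfilename` and `SR_pathfilename` are taken from that echoed output; the function names are hypothetical.

```python
import os


def read_cfg(path):
    """Parse 'key = value' lines into a dict (assumed format, see note above)."""
    settings = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines, assumed comment lines, and lines without '='.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def missing_read_files(settings, base_dir="."):
    """Return the configured read files that cannot be found under base_dir."""
    missing = []
    for key in ("LR_pathfilename", "SR_pathfilename"):
        value = settings.get(key)
        if value is None or not os.path.isfile(os.path.join(base_dir, value)):
            missing.append(key)
    return missing
```

For the example, `missing_read_files(read_cfg("run.cfg"))` would be called from the example folder, because the read paths in the example are relative.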

Step 4 - Run LSC on the example data

Only one command is needed to start LSC.

Make sure your terminal's working directory is the example folder, then type the following on one line:

./bin/runLSC.py run.cfg
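If you prefer to start LSC from a script, for example to keep the screen output for later, the same command can be wrapped with Python's standard `subprocess` module. This is a sketch rather than an official LSC interface: the helper names and the log file name `lsc_run.log` are hypothetical, and only the command itself comes from this tutorial.

```python
import subprocess
from pathlib import Path


def lsc_command(cfg="run.cfg", runner="./bin/runLSC.py"):
    """Build the argument list for the command shown above."""
    return [runner, cfg]


def run_lsc(example_dir, cfg="run.cfg", log_name="lsc_run.log"):
    """Run LSC inside example_dir and save its screen output to a log file.

    log_name is a hypothetical file name chosen for this sketch.
    """
    log_path = Path(example_dir) / log_name
    with open(log_path, "w") as log:
        # cwd makes the relative paths in the command and in run.cfg
        # resolve against the example folder.
        result = subprocess.run(
            lsc_command(cfg),
            cwd=example_dir,
            stdout=log,
            stderr=subprocess.STDOUT,
        )
    return result.returncode
```

A return code of 0 conventionally indicates success; whether LSC follows that convention in every failure case is not stated here, so it is worth reading the saved log as well.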

You should then see some output:


=== Welcome to LSC 0.3 ===
['python_path ', ' /usr/bin/python']
['mode ', ' 0']
['LR_pathfilename ', ' data/LR.fa']
['LR_filetype ', ' fa']
['SR_pathfilename ', ' data/SR.fa']
['SR_filetype ', ' fa']
['I_nonredundant ', ' N']
['Nthread1 ', ' 7']
['Nthread2 ', ' 7']
['temp_foldername ', ' temp']
['output_foldername ', ' output']
['Lpseudochr ', ' 50000000']
['LgapInpseudochr ', ' 100']
['I_RemoveBothTails ', ' Y']
['MinNumberofNonN ', ' 39']
['MaxN ', ' 1']
['clean_up ', ' 0']
['aligner ', ' bowtie2']
['novoalign_options ', '  -r All -F FA  -n 300 -o sam -o']
['bwa_options ', '  -n 0.08 -o 10 -e 3 -d 0 -i 0 -M 1 -O 0 -E  1 -N']
['bowtie2_options ', ' --end-to-end -a -f -L 15 --mp 1,1 --np 1 --rdg 0,1 --rfg 0,1 --score-min L,0,-0.08 --no-unal']
['razers3_options ', ' -i 92 -mr 0 -of sam']
=== sort and uniq SR data ===
0:00:08.609253
===split SR:===
0:00:08.976453
===compress SR.aa:===
0:00:04.194401
finsish genome
0:00:04.264694
0:00:04.273020
0:00:04.327496
finsish genome
finsish genome
0:00:04.331644
0:00:04.341139
0:00:04.348802
finsish genome
finsish genome
finsish genome
finsish genome
0:00:14.166704
===RemoveBothTails in LR:===
0:00:16.357049
===compress LR:===
/usr/bin/python ../bin_review_3/FASTA2fa.py temp/Notwotails_LR.fa temp/LR.fa
rm temp/Notwotails_LR.fa
0:00:16.831772
===compress LR:===
/usr/bin/python ../bin_review_3/compress.py -MinNonN=0 -MaxN=10000 fa temp/LR.fa temp/LR.fa.
0:00:08.897705
finsish genome
rm temp/LR.fa
0:00:26.352285
===poolchr LR:===
finsish genome
0:00:26.865437
===bowtie2 index pseudochr:===
Settings:
  Output files: "temp/pseudochr_LR.fa.cps.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
 . . . . 
Total time for backward call to driver() for mirror index: 00:00:16
0:00:57.808296
===bowtie2 SR.aa.cps:===
134942 reads; of these:
  134942 (100.00%) were unpaired; of these:
    115074 (85.28%) aligned 0 times
    4408 (3.27%) aligned exactly 1 time
    15460 (11.46%) aligned >1 times
14.72% overall alignment rate
. . . 
16.32% overall alignment rate
0:09:28.006754
===samParser SR.aa.cps.nav:===
0:10:10.478844
===convertNAV SR.aa.cps.nav:===
0:10:18.218553
===merge_mapping_file SR.aa.cps.nav.map:===
#####write LR_SR_mapping to file:0:00:00.592847
0:10:20.628160
===cat SR.aa.cps:===
0:10:20.809205
===cat SR.aa.idx:===
0:10:21.019699
===cat SR.??.cps.sam :===
0:10:21.938583
===cat SR.??.cps.nav :===
0:10:22.068057
===split LR_SR.map:===
0:10:22.390992
===write LR_SR.map.??_tmp:===
finish loading files
0:00:02.571950
0:00:00.880853
0:00:00.929013
0:00:00.881578
0:00:01.012865
0:00:00.755201
0:00:00.788939
0:00:00.909538
0:10:31.767625
===correct.py LR_SR.map.??_tmp :===
0:10:31.794182
0:19:52.385551
===cat full_LR_SR.map.fa :===
===cat corrected_LR_SR.map.fa :===
===cat corrected_LR_SR.map.fq :===
===cat uncorrected_LR_SR.map.fa :===

At this point, feel free to take a break. After about 3-4 minutes the mapping and error correction will be completed.
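To see where the time goes in a run, the screen output above can be summarized with a short script. The sketch below assumes that stage names appear on lines wrapped in `===` and that the last time stamp printed under a stage is the one of interest; that reading is inferred from the sample output and is not a documented log format. The function names are hypothetical.

```python
import re
from datetime import timedelta

# '===stage name:===' or '=== stage name ===', as in the output above.
HEADER = re.compile(r"^===\s*(.*?)\s*:?\s*===$")
# Time stamps such as '0:09:28.006754'.
STAMP = re.compile(r"^(\d+):(\d{2}):(\d{2})\.(\d+)$")


def parse_stamp(text):
    """Convert 'H:MM:SS.ffffff' to a timedelta, or return None if no match."""
    match = STAMP.match(text.strip())
    if match is None:
        return None
    hours, minutes, seconds, fraction = match.groups()
    micro = int(fraction.ljust(6, "0")[:6])
    return timedelta(hours=int(hours), minutes=int(minutes),
                     seconds=int(seconds), microseconds=micro)


def stage_times(lines):
    """Return (stage name, last time stamp seen under that stage) pairs."""
    stages = []
    for raw in lines:
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            stages.append([header.group(1), None])
            continue
        stamp = parse_stamp(line)
        if stamp is not None and stages:
            stages[-1][1] = stamp
    return [tuple(stage) for stage in stages]
```

Saving the screen output to a file and passing its lines to `stage_times` gives one entry per stage, which makes it easy to spot the slow steps (short-read alignment and correction in the sample above).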

Step 5 - Examining the output

All of the output from LSC is automatically copied to the "output" directory. After this execution, it should contain:
dnab4167d9:output moo$ ls
corrected_LR.fa	corrected_LR.fq uncorrected_LR.fa full_LR.fa
Each output file is described in more detail on the manual page:
 corrected_LR
As long as short reads (SR) are mapped to a long read, the long read can be corrected in the SR-covered regions (please see the paper for more details). The sequence from the left-most SR-covered base to the right-most SR-covered base is written to the file corrected_LR_SR.map
 full_LR
Although the terminal sequences are uncorrected, they are concatenated with the corrected sequence (corrected_LR_SR.map.fa) to form a "full" sequence. This sequence therefore covers the same length as the raw read and is written to the file full_LR_SR.map.fa
 uncorrected_LR
This is the negative control. uncorrected_LR_SR.map.fa contains the region from the left-most SR-covered base to the right-most SR-covered base (the same region as in corrected_LR_SR.map.fa) but without error correction. Thus, it consists of fragments of the raw reads.
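A quick way to get a feel for these files is to count the reads and bases in each one. The sketch below assumes ordinary FASTA input (header lines starting with `>`, sequences possibly wrapped over several lines) and uses the file names from the directory listing above; the function names are hypothetical and are not part of LSC.

```python
def fasta_lengths(path):
    """Return {header: sequence length} for a FASTA file.

    Records sharing an identical header would overwrite each other here.
    """
    lengths = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:]
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


def summarize(path):
    """Return (number of reads, total number of bases) for a FASTA file."""
    lengths = fasta_lengths(path)
    return len(lengths), sum(lengths.values())


if __name__ == "__main__":
    # File names as shown in the output directory listing above.
    for name in ("corrected_LR.fa", "uncorrected_LR.fa", "full_LR.fa"):
        try:
            reads, bases = summarize("output/" + name)
            print(name, reads, "reads,", bases, "bases")
        except FileNotFoundError:
            print(name, "not found")
```

From the descriptions above, one would expect full_LR.fa to hold more bases than corrected_LR.fa, since it also includes the terminal sequences.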

Step 6 - Learning how to apply this tutorial to your own data

See the Using LSC section of the manual.