SAMtools의 일반 명령
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ samtools --help
Program: samtools (Tools for alignments in the SAM format)
Version: 1.3-20-gd49c73b (using htslib 1.3-35-g26b3085)
Usage: samtools <command> [options]
Commands:
-- Indexing
dict create a sequence dictionary file
faidx index/extract FASTA
index index alignment
-- Editing
calmd recalculate MD/NM tags and '=' bases
fixmate fix mate information
reheader replace BAM header
rmdup remove PCR duplicates
targetcut cut fosmid regions (for fosmid pool only)
addreplacerg adds or replaces RG tags
-- File operations
collate shuffle and group alignments by name
cat concatenate BAMs
merge merge sorted alignments
mpileup multi-way pileup
sort sort alignment file
split splits a file by read group
quickcheck quickly check if SAM/BAM/CRAM file appears intact
fastq converts a BAM to a FASTQ
fasta converts a BAM to a FASTA
-- Statistics
bedcov read depth per BED region
depth compute the depth
flagstat simple stats
idxstats BAM index stats
phase phase heterozygotes
stats generate stats (former bamcheck)
-- Viewing
flags explain BAM flags
tview text alignment viewer
view SAM<->BAM<->CRAM conversion
depad convert padded BAM to unpadded BAM
2. Bam과 Sam의 상호 변환:
samtools view -h file.bam > file.sam
samtools view -b -S file.sam > file.bam
예:
samtools view -h NA12878.bam >NA12878_2.sam
samtools view -b -S NA12878.sam > NA12878_2.bam
3. 색인 index 구축:
4
samtools index sorted.bam
인스턴스:4
samtools index NA12878.bam
NA12878. 생성bam.bai, 구체적:어떻게 보는지 모르겠어요.bai 파일???4. 통계 flagstat:
samtools flagstat NA12878Merge.bam
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ samtools flagstat NA12878Merge.bam
1058 + 72 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
118 + 32 duplicates
1058 + 72 mapped (100.00% : 100.00%)
1058 + 72 paired in sequencing
516 + 26 read1
542 + 46 read2
1048 + 64 properly paired (99.05% : 88.89%)
1048 + 64 with itself and mate mapped
10 + 8 singletons (0.95% : 11.11%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
큰 파일:
hadoop@Mcnode6:~/cloud/adam/xubo/1000genomes/GIH/NA21144/alignment$ samtools flagstat NA21144.alt_bwamem_GRCh38DH.20150718.GIH.low_coverage.cram
247384356 + 0 in total (QC-passed reads + QC-failed reads)
2090839 + 0 secondary
9082475 + 0 supplementary
5797777 + 0 duplicates
246212003 + 0 mapped (99.53% : N/A)
236211042 + 0 paired in sequencing
118105521 + 0 read1
118105521 + 0 read2
226096416 + 0 properly paired (95.72% : N/A)
233890591 + 0 with itself and mate mapped
1148098 + 0 singletons (0.49% : N/A)
4379358 + 0 with mate mapped to a different chr
1561618 + 0 with mate mapped to a different chr (mapQ>=5)
5. 파일 merge 병합하기
bam 파일:
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ samtools merge NA12878Merge.bam NA12878.bam NA12878_2.bam
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ ls
aln.sorted.bam cramtools-3.0.jar NA12878_2.bam NA12878_2.sam NA12878.bam NA12878.bam.bai NA12878Merge.bam NA12878.sam
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ ll
total 4676
drwxrwxr-x 2 hadoop hadoop 4096 3 9 20:41 ./
drwx------ 13 hadoop hadoop 4096 3 7 22:07 ../
-rw-rw-r-- 1 hadoop hadoop 64321 3 9 20:29 aln.sorted.bam
-rw-rw-r-- 1 hadoop hadoop 3978747 3 8 22:01 cramtools-3.0.jar
-rw-rw-r-- 1 hadoop hadoop 64321 3 9 20:27 NA12878_2.bam
-rw-rw-r-- 1 hadoop hadoop 255818 3 9 16:06 NA12878_2.sam
-rw-rw-r-- 1 hadoop hadoop 64321 3 9 16:00 NA12878.bam
-rw-rw-r-- 1 hadoop hadoop 200 3 9 16:04 NA12878.bam.bai
-rw-r--r-- 1 hadoop hadoop 4096 3 9 16:01 .NA12878.bam.swp
-rw-rw-r-- 1 hadoop hadoop 75781 3 9 20:41 NA12878Merge.bam
-rw-r--r-- 1 hadoop hadoop 255818 11 16 2014 NA12878.sam
bam 통계:
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ samtools flagstat NA12878.bam
529 + 36 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
59 + 16 duplicates
529 + 36 mapped (100.00% : 100.00%)
529 + 36 paired in sequencing
258 + 13 read1
271 + 23 read2
524 + 32 properly paired (99.05% : 88.89%)
524 + 32 with itself and mate mapped
5 + 4 singletons (0.95% : 11.11%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ samtools flagstat NA12878Merge.bam
1058 + 72 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
118 + 32 duplicates
1058 + 72 mapped (100.00% : 100.00%)
1058 + 72 paired in sequencing
516 + 26 read1
542 + 46 read2
1048 + 64 properly paired (99.05% : 88.89%)
1048 + 64 with itself and mate mapped
10 + 8 singletons (0.95% : 11.11%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
sam 파일:
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ samtools merge NA12878Merge.sam NA12878.sam NA12878_2.sam
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ ls
aln.sorted.bam cramtools-3.0.jar NA12878_2.bam NA12878_2.sam NA12878.bam NA12878.bam.bai NA12878Merge.bam NA12878Merge.sam NA12878.sam
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ ll
total 5188
drwxrwxr-x 2 hadoop hadoop 4096 3 9 20:44 ./
drwx------ 13 hadoop hadoop 4096 3 7 22:07 ../
-rw-rw-r-- 1 hadoop hadoop 64321 3 9 20:29 aln.sorted.bam
-rw-rw-r-- 1 hadoop hadoop 3978747 3 8 22:01 cramtools-3.0.jar
-rw-rw-r-- 1 hadoop hadoop 64321 3 9 20:27 NA12878_2.bam
-rw-rw-r-- 1 hadoop hadoop 255818 3 9 16:06 NA12878_2.sam
-rw-rw-r-- 1 hadoop hadoop 64321 3 9 16:00 NA12878.bam
-rw-rw-r-- 1 hadoop hadoop 200 3 9 16:04 NA12878.bam.bai
-rw-r--r-- 1 hadoop hadoop 4096 3 9 16:01 .NA12878.bam.swp
-rw-rw-r-- 1 hadoop hadoop 75781 3 9 20:41 NA12878Merge.bam
-rw-rw-r-- 1 hadoop hadoop 521677 3 9 20:44 NA12878Merge.sam
-rw-r--r-- 1 hadoop hadoop 255818 11 16 2014 NA12878.sam
sam 파일 통계:
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ samtools flagstat NA12878Merge.sam
1058 + 72 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
118 + 32 duplicates
1058 + 72 mapped (100.00% : 100.00%)
1058 + 72 paired in sequencing
516 + 26 read1
542 + 46 read2
1048 + 64 properly paired (99.05% : 88.89%)
1048 + 64 with itself and mate mapped
10 + 8 singletons (0.95% : 11.11%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ samtools flagstat NA12878.sam
529 + 36 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
59 + 16 duplicates
529 + 36 mapped (100.00% : 100.00%)
529 + 36 paired in sequencing
258 + 13 read1
271 + 23 read2
524 + 32 properly paired (99.05% : 88.89%)
524 + 32 with itself and mate mapped
5 + 4 singletons (0.95% : 11.11%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
6.정렬sort:
samtools sort -o NA12878.Sorted.bam NA12878.bam
7. 인덱스 통계 idxstats:
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ samtools index aln.sorted.bam
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ samtools idxstats aln.sorted.bam
20 63025520 565 0
* 0 0 0
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ ls
aln.sorted.bam aln.sorted.bam.bai cramtools-3.0.jar NA12878_2.bam NA12878_2.sam NA12878.bam NA12878.bam.bai NA12878Merge.bam NA12878Merge.sam NA12878.sam NA12878.Sorted.bam
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam/resources$ samtools view -b -S small.sam >small.bam
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam/resources$ ls
artificial.fa fastq_sample1.fq interleaved_fastq_sample1.ifq proper_pairs_2.fq small_realignment_targets.pileup
artificial.README.txt fastq_sample2.fq interleaved_fastq_sample1.ifq.output reads-0-2-0 small_realignment_targets_README.txt
artificial.realigned.sam fastq_sample3.fq interleaved_fastq_sample2.ifq reads12_diff1.sam small_realignment_targets.sam
artificial.sam fastq_sample4.fq interleaved_fastq_sample2.ifq.output reads12.sam small.sam
bqsr1-ref.observed features interleaved_fastq_sample3.ifq reads21.sam small.vcf
bqsr1.sam gencode.v19.pc_transcripts.250k.fa.gz interleaved_fastq_sample3.ifq.output single_fastq_sample1.fq.output sorted.sam
bqsr1.snps hg19.chrM.2bit interleaved_fastq_sample4.ifq single_fastq_sample2.fq.output tags.sam
bqsr1.vcf Hs_Ensembl_example_genes.gtf interleaved_fastq_sample4.ifq.output single_fastq_sample3.fq.output test.conf
chr20.250k.fa.gz human_g1k_v37_chr1_59kb.2bit multiline_fastq.fq single_fastq_sample4.fq.output test_rowgroup_rangeindex.1.txt
dict_with_accession.dict human_g1k_v37_chr1_59kb.fasta ordered.sam small.bam unmapped.sam
example_intervals.list improper_pairs_1.fq parquet_lister_dir_empty small_missing.vcf unordered.sam
fastq_noqual.fq improper_pairs_2.fq proper_pairs_1.fq small_realignment_targets.intervals unsorted.sam
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam/resources$ samtools sort -o smallSort.bam small.bam
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam/resources$ samtools index smallSort.bam
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam/resources$ samtools idxstats smallSort.bam
1 249250621 20 0
2 243199373 0 0
* 0 0 0
7. 깊이 depth:
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam/resources$ samtools depth --help
depth: unrecognized option '--help'
Usage: samtools depth [options] in1.bam [in2.bam [...]]
Options:
-a output all positions (including zero depth)
-a -a (or -aa) output absolutely all positions, including unused ref. sequences
-b <bed> list of positions or regions
-f <list> list of input BAM filenames, one per line [null]
-l <int> read length threshold (ignore reads shorter than <int>)
-d/-m <int> maximum coverage depth [8000]
-q <int> base quality threshold
-Q <int> mapping quality threshold
-r <chr:from-to> region
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
--reference FILE
Reference sequence FASTA FILE [null]
The output is a simple tab-separated table with three columns: reference name,
position, and coverage depth. Note that positions with zero coverage may be
omitted by default; see the -a option.
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam/resources$ samtools depth small.bam
1 26472784 1
1 26472785 1
1 26472786 1
1 26472787 1
1 26472788 1
1 26472789 1
1 26472790 1
1 26472791 1
1 26472792 1
1 26472793 1
1 26472794 1
1 26472795 1
1 26472796 1
1 26472797 1
1 26472798 1
1 26472799 1
1 26472800 1
1 26472801 1
1 26472802 1
1 26472803 1
1 26472804 1
1 26472805 1
1 26472806 1
1 26472807 1
1 26472808 1
1 26472809 1
1 26472810 1
1 26472811 1
1 26472812 1
1 26472813 1
1 26472814 1
1 26472815 1
1 26472816 1
1 26472817 1
1 26472818 1
1 26472819 1
1 26472820 1
1 26472821 1
1 26472822 1
1 26472823 1
1 26472824 1
1 26472825 1
1 26472826 1
1 26472827 1
1 26472828 1
1 26472829 1
1 26472830 1
1 26472831 1
1 26472832 1
1 26472833 1
1 26472834 1
1 26472835 1
1 26472836 1
1 26472837 1
1 26472838 1
1 26472839 1
1 26472840 1
1 26472841 1
1 26472842 1
1 26472843 1
1 26472844 1
1 26472845 1
1 26472846 1
1 26472847 1
1 26472848 1
1 26472849 1
1 26472850 1
1 26472851 1
1 26472852 1
1 26472853 1
1 26472854 1
1 26472855 1
1 26472856 1
1 26472857 1
1 26472858 1
8.fa 파일에 색인faidx 만들기
faidx
samtools faidx <ref.fasta> [region1 [...]]
Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create <ref.fasta>.fai on the disk. If regions are specified, the subsequences will be retrieved and printed to stdout in the FASTA format.
The input file can be compressed in the BGZF format.
The sequences in the input file should all have different names. If they do not, indexing will emit a warning about duplicate sequences and retrieval will only produce subsequences from the first sequence with the duplicated name.
hadoop@Mcnode1:~/cloud/adam/xubo/code/samtools/examples$ ls
00README.txt ex1.fa ex1.sam.gz toy.fa toy.sam
hadoop@Mcnode1:~/cloud/adam/xubo/code/samtools/examples$ samtools faidx ex1.fa
hadoop@Mcnode1:~/cloud/adam/xubo/code/samtools/examples$ ls
00README.txt ex1.fa ex1.fa.fai ex1.sam.gz toy.fa toy.sam
9.text 비교 차트 tview:
samtools tview [-p chr:pos] [-s STR] [-d display] <in.sorted.bam> [ref.fasta]
Text alignment viewer (based on the ncurses library). In the viewer, press `?' for help and press `g' to check the alignment start from a region in the format like `chr10:10,000,000' or `=10,000,000' when viewing the same reference sequence.
Options:
-d display
Output as (H)tml or (C)urses or (T)ext
-p chr:pos
Go directly to this position
-s STR
Display only alignments from this sample or read group
실행:
hadoop@Mcnode1:~/cloud/adam/xubo/code/samtools/examples$ samtools view -h -S toy.sam >toy.bam
hadoop@Mcnode1:~/cloud/adam/xubo/code/samtools/examples$ samtools sort -o toy.sorted.bam toy.bam
hadoop@Mcnode1:~/cloud/adam/xubo/code/samtools/examples$ samtools tview toy.bam toy.fa
Cannot read index for 'toy.bam'.
hadoop@Mcnode1:~/cloud/adam/xubo/code/samtools/examples$ samtools index toy.sorted.bam
hadoop@Mcnode1:~/cloud/adam/xubo/code/samtools/examples$ ls
00README.txt ex1.fa ex1.fa.fai ex1.sam.gz toy.bam toy.fa toy.fa.fai toy.sam toy.sorted.bam toy.sorted.bam.bai toy.sorted.sam
hadoop@Mcnode1:~/cloud/adam/xubo/code/samtools/examples$ samtools tview toy.sbam toy.fa
toy.sam toy.sorted.bam toy.sorted.bam.bai toy.sorted.sam
hadoop@Mcnode1:~/cloud/adam/xubo/code/samtools/examples$ samtools tview toy.sbam toy.fa
toy.sam toy.sorted.bam toy.sorted.bam.bai toy.sorted.sam
hadoop@Mcnode1:~/cloud/adam/xubo/code/samtools/examples$ samtools tview toy.sorted.bam toy.fa
1 11 21 31 41 51 61 71 81 91 101 111 121 131 141 151 161 171
AGCATGTTAGATAA****GATA**GCTGTGCTAGTAGGCAG*TCAGCGCCATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
........ .... ......K.K......K. ..........
........AGAG....***... ,,,,, ,,,,,,,,,
......GG**....AA
..C...**** ...**...>>>>>>>>>>>>>>T.....
10.bam/sam/cram 파일을fasta/fastq 파일로 변환하기
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ samtools fasta -s NA12878.fa NA12878.sam
[M::bam2fq_mainloop_singletontrack] discarded 565 singletons
[M::bam2fq_mainloop_singletontrack] processed 565 reads
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ ls
aln.sorted.bam cramtools-3.0.jar NA12878_2.sam NA12878.bam.bai NA12878Merge.bam NA12878.sam resources
aln.sorted.bam.bai NA12878_2.bam NA12878.bam NA12878.fa NA12878Merge.sam NA12878.Sorted.bam
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ vi NA12878.fa
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ samtools fastq -s NA12878.fq NA12878.sam
[M::bam2fq_mainloop_singletontrack] discarded 565 singletons
[M::bam2fq_mainloop_singletontrack] processed 565 reads
hadoop@Mcnode1:~/cloud/adam/xubo/data/data_HDFS/adam$ more NA12878.fq
@20FUKAAXX100202:3:6:15018:84106/2
ACCCAAATCTAATCAAGGCTCCCACTCTAACTCCCAAGCTCTAGGATATACCAAGGACAAAGGAAGATCATGAAATACCACCATGGGGATTCAATCAGCAA
+
?@BBBCEEDFEFEEEFDEEFEEEEBFEDEFCFDDEEFEDFDFEEEFEEEECEEFEEFCEFDEEFFEFEDEEEFFFDECEDCEFEEDDFFBFEFGEAEDCCC
@20GAVAAXX100126:8:62:5578:2527/1
TTGCTGATTGAATCCCCATGGTGGTATTTCATGATCTTCCTTTGTCCTTGGTATATCCTAGAGCTTGGGAGTTAGAGTGGGAGCCTTGATTAGCTTTGGGT
+
1;@3@?<>>?<;:=;==?=>9=?9>9===>=<<;;@@BAABAA>B>4?><8<:=<==;>9:@?>=;<7+/;:/80?5>;-(-;89<::8::8:(1,/:438
@20FUKAAXX100202:4:47:20584:49257/2
CCAAATCTAATCAAGGCTCCCACTCTAACTCCCAAGCTCTAGGATATACCAAGGACAAAGGAAGATCATGAAATACCACCATGGGGATTCAATCAGCAAAT
+
?ACDBBCEDFEDEFEEEFEDBECFBFEFCFDEEEFEDFDFEEEFEEEECEEFEEFCEFFEEFFEFEDEAEFFFAECEFCDFEEFBFFDBEEC:@6A?C4>B
@20GAVAAXX100126:7:47:4730:37293/2
CCAAATCTAATCAAGGCTCCCACTCTAACTCCCAAGCTCTAGGATATACCAAGGACAAAGGAAGATCATGAAATACCACCATGGGGATTCAATCAGCAAAT
+
?BB@BCBFDDECC=E@@DB;BDCFDE<<AEB@B>BADD>?C?EDEB>@AC=<?=DAE?E=CAC?;<C=@ADD?ACACCAC>:>4=B676<17@@<:AA<;6
@20GAVAAXX100126:5:46:21151:39489/1
ATTTGCTGATTGAATCCCCATGGTGGTATTTCATGATCTTCCTTTGTCCTTGGTATATCCTAGAGCTTGGGAGTTAGAGTGGGAGCCTTGATTAGATTTGG
+
A=DED=FEFEFEFFEEEEEFEEEEEEEEEFFDEEDFEDFFDDFFFDDDDFFEEDEEEEDDFEEFEEFFEEEFEDFEEFECEEEFEEEFFE>BB>BBB=<9>
@20FUKAAXX100202:6:2:2264:85470/1
CAAATCTAATCAAGGCTCCCACTCTAACTCCCAAGCTCTAGGATATACCAAGGACAAAGGAAGATCATGAAATACCACCATGGGGATTCAATCAGCAAATT
+
ABDFEEFEGEEFFEFEFEEEFCCDFEFCFEEEEFEDFDF?EEFCEEEBCEFECFCEFFEDFFBE=CF=CFEFDEB@D=CDDC>D:?;;=@DACECAD5A>9
@20FUKAAXX100202:7:24:21050:192748/1
CAAATCTAATCAAGGCTCCCACTCTAACTCCCAAGCTCTAGGATATACCAAGGACAAAGGAAGATCATGAAATACCACCATGGGGATTCAATCAGCAAATT
+
ABEFEEFFGFEFFEEEFEEEFCFDFEFCFEEEEFEDFDFEEEFEEEECDEFEEFCEFFEEFFEFCCFBE@FGDEAAEBCDC@:FBFBB?EDEEDCCDCB=<
@20FUKAAXX100202:6:48:15418:77607/1
AAATCTAATCAAGGCTCCCACTCTAACTCCCAAGCTCTAGGATATACCAAGGACAAAGGAAGATCATGAAATACCACCATGGGGATTCAATCAGCAAATTC
+
ACDEDFEGEEFGEEEFEEEECFEFEFCFEEEEFEDFDFEEEFEEEECEEFEEFCEFFEEFFEFEDEEEFFFEECEFCEFFEFFFDDFEFGFEFFEEFECCB
@20GAVAAXX100126:3:65:13772:1294/1
AAATCTAATCAAGGCTCCCACTCTAACTCCCAAGCTCTAGGATATACCAAGGACAAAGGAAGATCATGAAATACCACCATGGGGATTCAATCAGCAAATTC
+
자세한 조작은 참조【1】
참조 자료:
【1】 http://www.bbioo.com/lifesciences/40-113338-1.html
【2】 http://www.htslib.org/doc/samtools.html
【3】