I loaded TogoVar into SQLite and tried searching it

SQLite3 bioinformatics

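This is a small tip. (A memo to myself.)

TogoVar is a website, something like a dictionary, that catalogs genetic variants found in Japanese people.

I looked for a Web API on the site, but there does not seem to be one. So I decided to load the distributed tsv files into SQLite instead.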

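Downloading the files

Download the tsv files from the site and decompress them. There is a whole series of files, so use a command or a script.
I normally use Ruby, so it looks something like this: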

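# Download and decompress the per-chromosome tsv files (needs wget and gunzip).
base = "https://togovar.biosciencedbc.jp/public/release/current"
[*1..22, "X", "Y", "MT"].each do |i|
  puts i
  %w[frequency molecular_annotation].each do |kind|
    file = "chr_#{i}_#{kind}.tsv.gz"
    system("wget", "#{base}/#{file}") or abort "download failed: #{file}"
    system("gunzip", file)
  end
end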

SQLite

Set up SQLite.

sqlite3 togo_var.sqlite

.separator "\t"

This sets the column separator to a tab.

.import some_nice.tsv some_nice_table

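With that, .import can load a tsv file. The schema ends up as shown below. If you import a file without defining the table beforehand, every column type becomes TEXT. If you want floating-point values and the like to be handled properly, you may need to declare the columns as INTEGER, REAL, and so on. Here I prepared two tables, freq and anno.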

CREATE TABLE freq(
  "tgv_id" TEXT,
  "rs" TEXT,
  "variant_type" TEXT,
  "chr" TEXT,
  "position_grch37" TEXT,
  "ref" TEXT,
  "alt" TEXT,
  "symbol" TEXT,
  "jga_ngs_allele_alt" TEXT,
  "jga_ngs_allele_total" TEXT,
  "jga_ngs_alt_allele_freq" TEXT,
  "jga_ngs_genotype_alt_alt" TEXT,
  "jga_ngs_genotype_ref_alt" TEXT,
  "jga_ngs_genotype_ref_ref" TEXT,
  "jga_ngs_qc_status" TEXT,
  "jga_snp_allele_alt" TEXT,
  "jga_snp_allele_total" TEXT,
  "jga_snp_alt_allele_freq" TEXT,
  "jga_snp_genotype_alt_alt" TEXT,
  "jga_snp_genotype_ref_alt" TEXT,
  "jga_snp_genotype_ref_ref" TEXT,
  "jga_snp_qc_status" TEXT,
  "3.5kjpn_allele_alt" TEXT,
  "3.5kjpn_allele_total" TEXT,
  "3.5kjpn_alt_allele_freq" TEXT,
  "3.5kjpn_genotype_alt_alt" TEXT,
  "3.5kjpn_genotype_ref_alt" TEXT,
  "3.5kjpn_genotype_ref_ref" TEXT,
  "3.5kjpn_qc_status" TEXT,
  "hgvd_allele_alt" TEXT,
  "hgvd_allele_total" TEXT,
  "hgvd_alt_allele_freq" TEXT,
  "hgvd_genotype_alt_alt" TEXT,
  "hgvd_genotype_ref_alt" TEXT,
  "hgvd_genotype_ref_ref" TEXT,
  "hgvd_qc_status" TEXT,
  "exac_total_allele_alt" TEXT,
  "exac_total_allele_total" TEXT,
  "exac_total_alt_allele_freq" TEXT,
  "exac_total_genotype_alt_alt" TEXT,
  "exac_total_genotype_ref_alt" TEXT,
  "exac_total_genotype_ref_ref" TEXT,
  "exac_total_qc_status" TEXT,
  "exac_african_allele_alt" TEXT,
  "exac_african_allele_total" TEXT,
  "exac_african_alt_allele_freq" TEXT,
  "exac_african_genotype_alt_alt" TEXT,
  "exac_african_genotype_ref_alt" TEXT,
  "exac_african_genotype_ref_ref" TEXT,
  "exac_eastasian_allele_alt" TEXT,
  "exac_eastasian_allele_total" TEXT,
  "exac_eastasian_alt_allele_freq" TEXT,
  "exac_eastasian_genotype_alt_alt" TEXT,
  "exac_eastasian_genotype_ref_alt" TEXT,
  "exac_eastasian_genotype_ref_ref" TEXT,
  "exac_finnish_allele_alt" TEXT,
  "exac_finnish_allele_total" TEXT,
  "exac_finnish_alt_allele_freq" TEXT,
  "exac_finnish_genotype_alt_alt" TEXT,
  "exac_finnish_genotype_ref_alt" TEXT,
  "exac_finnish_genotype_ref_ref" TEXT,
  "exac_european_allele_alt" TEXT,
  "exac_european_allele_total" TEXT,
  "exac_european_alt_allele_freq" TEXT,
  "exac_european_genotype_alt_alt" TEXT,
  "exac_european_genotype_ref_alt" TEXT,
  "exac_european_genotype_ref_ref" TEXT,
  "exac_latino_allele_alt" TEXT,
  "exac_latino_allele_total" TEXT,
  "exac_latino_alt_allele_freq" TEXT,
  "exac_latino_genotype_alt_alt" TEXT,
  "exac_latino_genotype_ref_alt" TEXT,
  "exac_latino_genotype_ref_ref" TEXT,
  "exac_other_allele_alt" TEXT,
  "exac_other_allele_total" TEXT,
  "exac_other_alt_allele_freq" TEXT,
  "exac_other_genotype_alt_alt" TEXT,
  "exac_other_genotype_ref_alt" TEXT,
  "exac_other_genotype_ref_ref" TEXT,
  "exac_southasian_allele_alt" TEXT,
  "exac_southasian_allele_total" TEXT,
  "exac_southasian_alt_allele_freq" TEXT,
  "exac_southasian_genotype_alt_alt" TEXT,
  "exac_southasian_genotype_ref_alt" TEXT,
  "exac_southasian_genotype_ref_ref" TEXT
);
CREATE TABLE anno(
  "tgv_id" TEXT,
  "rs" TEXT,
  "chr" TEXT,
  "position_grch37" TEXT,
  "ref" TEXT,
  "alt" TEXT,
  "symbol" TEXT,
  "transcript_id" TEXT,
  "consequence" TEXT,
  "sift_qualitative_prediction" TEXT,
  "sift_value" TEXT,
  "polyphen2_qualitative_prediction" TEXT,
  "polyphen2_value" TEXT
);
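
If you would rather have typed columns than all TEXT, the CREATE TABLE statement can be generated from a file's header line. The following is only a rough sketch: guess_type and create_table_sql are hypothetical helpers, and the suffix rules are my own guess from the column names above, not something TogoVar documents.

```ruby
# Hypothetical helper: guess a SQLite column type from a column name.
# The rules are an assumption based on the column names in the schema above.
def guess_type(column)
  case column
  when /_alt_allele_freq\z/, /\A(sift|polyphen2)_value\z/ then "REAL"
  when /\Aposition_/, /_allele_(alt|total)\z/, /_genotype_/ then "INTEGER"
  else "TEXT"
  end
end

# Build a CREATE TABLE statement from a tab-separated header line.
def create_table_sql(table, header_line)
  columns = header_line.chomp.split("\t")
  defs = columns.map { |c| %Q[  "#{c}" #{guess_type(c)}] }
  "CREATE TABLE #{table}(\n#{defs.join(",\n")}\n);"
end

# Example with a shortened header (the real files have many more columns).
header = %w[tgv_id rs chr position_grch37 jga_ngs_allele_alt jga_ngs_alt_allele_freq].join("\t")
puts create_table_sql("freq", header)
```

Note that a declared type in SQLite is only an affinity: values that do not look numeric, such as empty fields, are still stored as text.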

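Now that the schema is fixed, import the files one after another. Because the tables already exist, the first row (header) of each file is inserted as data too, so it has to be deleted afterwards. Alternatively, stripping the header from the tsv files before importing may also be a good approach.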

.import chr_1_frequency.tsv freq
.import chr_2_frequency.tsv freq
.import chr_3_frequency.tsv freq
.import chr_4_frequency.tsv freq
.import chr_5_frequency.tsv freq
.import chr_6_frequency.tsv freq
.import chr_7_frequency.tsv freq
.import chr_8_frequency.tsv freq
.import chr_9_frequency.tsv freq
.import chr_10_frequency.tsv freq
.import chr_11_frequency.tsv freq
.import chr_12_frequency.tsv freq
.import chr_13_frequency.tsv freq
.import chr_14_frequency.tsv freq
.import chr_15_frequency.tsv freq
.import chr_16_frequency.tsv freq
.import chr_17_frequency.tsv freq
.import chr_18_frequency.tsv freq
.import chr_19_frequency.tsv freq
.import chr_20_frequency.tsv freq
.import chr_21_frequency.tsv freq
.import chr_22_frequency.tsv freq
.import chr_X_frequency.tsv freq
.import chr_Y_frequency.tsv freq
.import chr_MT_frequency.tsv freq
.import chr_1_molecular_annotation.tsv anno
.import chr_2_molecular_annotation.tsv anno
.import chr_3_molecular_annotation.tsv anno
.import chr_4_molecular_annotation.tsv anno
.import chr_5_molecular_annotation.tsv anno
.import chr_6_molecular_annotation.tsv anno
.import chr_7_molecular_annotation.tsv anno
.import chr_8_molecular_annotation.tsv anno
.import chr_9_molecular_annotation.tsv anno
.import chr_10_molecular_annotation.tsv anno
.import chr_11_molecular_annotation.tsv anno
.import chr_12_molecular_annotation.tsv anno
.import chr_13_molecular_annotation.tsv anno
.import chr_14_molecular_annotation.tsv anno
.import chr_15_molecular_annotation.tsv anno
.import chr_16_molecular_annotation.tsv anno
.import chr_17_molecular_annotation.tsv anno
.import chr_18_molecular_annotation.tsv anno
.import chr_19_molecular_annotation.tsv anno
.import chr_20_molecular_annotation.tsv anno
.import chr_21_molecular_annotation.tsv anno
.import chr_22_molecular_annotation.tsv anno
.import chr_X_molecular_annotation.tsv anno
.import chr_Y_molecular_annotation.tsv anno
.import chr_MT_molecular_annotation.tsv anno
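
Typing fifty .import lines by hand is tedious, so the list can be generated. A minimal sketch follows, assuming the freq and anno tables already exist and that each header row starts with the literal text tgv_id (which is what the column names above suggest). import_commands is a hypothetical helper name.

```ruby
CHROMOSOMES = [*1..22, "X", "Y", "MT"].freeze

# Build the sqlite3 shell commands that import every chromosome file
# and then delete the header rows that were imported as data.
def import_commands(chromosomes = CHROMOSOMES)
  commands = ['.separator "\t"'] # single quotes keep \t literal for the sqlite3 shell
  { "frequency" => "freq", "molecular_annotation" => "anno" }.each do |kind, table|
    chromosomes.each { |c| commands << ".import chr_#{c}_#{kind}.tsv #{table}" }
    # Assumption: the first column of each header row is the text "tgv_id".
    commands << "DELETE FROM #{table} WHERE tgv_id = 'tgv_id';"
  end
  commands
end

puts import_commands
```

Saved under a name of your choice (say gen_import.rb), the output can be piped into the shell with something like ruby gen_import.rb | sqlite3 togo_var.sqlite.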

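Now we can run searches, but without indexes they are very slow.
According to various sources online, there also seems to be a knack to choosing which indexes to create.

Here I created a minimal set of indexes, as follows.
Building an index takes a fair amount of time, so it may be fine to add one only once you find yourself querying that column often.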

create index freq_tgv_id_index on freq(tgv_id);
create index anno_tgv_id_index on anno(tgv_id);

create index freq_symbol_index on freq(symbol);
create index anno_symbol_index on anno(symbol);

create index freq_rs_index on freq(rs);
create index anno_rs_index on anno(rs);

create index freq_position_grch37_index on freq(position_grch37);
create index anno_position_grch37_index on anno(position_grch37);
create index freq_jga_ngs_alt_allele_freq_index on freq(jga_ngs_alt_allele_freq);
create index freq_jga_snp_alt_allele_freq_index on freq(jga_snp_alt_allele_freq);
create index "freq_3.5kjpn_alt_allele_freq_index" on freq("3.5kjpn_alt_allele_freq");
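
For adding an index later, when a column turns out to be queried often, the statement can be generated with the same naming pattern. This is just a sketch; index_sql is a hypothetical helper, and IF NOT EXISTS is added so that running it twice is harmless.

```ruby
# Build a CREATE INDEX statement following the <table>_<column>_index pattern.
# Identifiers are double-quoted so that names such as
# 3.5kjpn_alt_allele_freq (leading digit, contains a dot) are accepted.
def index_sql(table, column)
  name = "#{table}_#{column}_index"
  %Q[CREATE INDEX IF NOT EXISTS "#{name}" ON "#{table}"("#{column}");]
end

puts index_sql("freq", "hgvd_alt_allele_freq")
puts index_sql("freq", "3.5kjpn_alt_allele_freq")
```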

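So, can TogoVar's functionality now be reproduced on a local desktop? Not quite. The relationships between variants and diseases are provided by ClinVar, and the tsv files that TogoVar distributes contain only variant frequency information. Please keep this in mind.

Finally, let's search

Let's look up ALDH2, the gene well known in connection with drinking alcohol.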

select count(*) from freq where symbol='ALDH2';

635

That is a lot of hits, so let's narrow it down to high-frequency variants.

select rs from freq where symbol='ALDH2' and jga_snp_alt_allele_freq > 0.2 ;

rs10744777
rs4767035
rs671
rs11066028
rs11066029
rs7296651

rs671 was found, as hoped.

That's all for this article.

Beyond this, if you put the sqlite file on an SSD, you could probably run searches in multiple processes with Ruby's parallel.
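
One caveat: the frequency columns were declared as TEXT, so SQLite compares them with 0.2 as text rather than as numbers. That happens to work for plain decimals such as 0.35, but it would go wrong for any value written in scientific notation (whether the TogoVar files contain such values is something I have not checked). The sketch below shows the difference in plain Ruby and builds a query with an explicit cast; high_freq_sql is a hypothetical helper.

```ruby
# Text comparison goes character by character, so the two results differ.
puts "1e-05" > "0.2"      # compared as text
puts "1e-05".to_f > 0.2   # compared as numbers

# Hypothetical helper: build the query above with an explicit cast to REAL.
def high_freq_sql(symbol, column, threshold)
  quoted = symbol.gsub("'", "''") # escape single quotes for a SQL string literal
  format("SELECT rs FROM freq WHERE symbol = '%s' AND CAST(\"%s\" AS REAL) > %s;",
         quoted, column, Float(threshold))
end

puts high_freq_sql("ALDH2", "jga_snp_alt_allele_freq", 0.2)
```

A cast like this generally keeps SQLite from using a plain index on the frequency column, so declaring the column as REAL at import time may be the better fix in the long run.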

Reference

Original article (I loaded TogoVar into SQLite and tried searching it): https://qiita.com/kojix2/items/d6a8b7468aa911f703d5

The text may be freely shared or copied, but please keep the URL of this document as the reference URL.
