
Biopython GATK #


Updated 4 times.

  • Original author
  • Last updated by
    SeokmoonChoi

Structured data

Category
Software

GATK #

  • GenomeAnalysisToolKit
  • A genome analysis tool package for NGS data, developed at the Broad Institute

Requirements #

  • The mapping file must be a .bam file
  • The BAM file must be indexed
  • Mapped reads must be sorted in reference (coordinate) order
  • Reads must carry at least one read group
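The sorting and read group requirements can be checked from the BAM header before running GATK. The following is a minimal sketch, not part of the original page: `check_bam_header` is a hypothetical helper name, and it assumes the header text is piped in on stdin.

```shell
# check_bam_header (hypothetical helper): read SAM/BAM header text on stdin and
# report whether it is coordinate-sorted and has at least one @RG line.
# Exit status is 0 only when both requirements are met.
check_bam_header() {
    awk -F '\t' '
        $1 == "@HD" { for (i = 2; i <= NF; i++) if ($i == "SO:coordinate") sorted = 1 }
        $1 == "@RG" { rg++ }
        END {
            print (sorted ? "sorted: yes" : "sorted: no")
            print "read groups: " (rg + 0)
            exit !(sorted && rg > 0)
        }'
}
```

Typical use would be `samtools view -H [input.bam] | check_bam_header`. The index requirement can be checked separately by testing that the `.bai` file exists next to the BAM file.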

Install programs #

Manual #

  • Pre-Processing ( FASTX-Toolkit )

    # -v prints the summary report that is redirected to [report.txt]
    $ fastx_clipper -a [adapter_seq] -n -v -o [output.fastq] -i [input.fastq] > [report.txt]
    $ fastx_quality_filter -v [-i INFILE] [-o OUTFILE] > [report.txt]
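To see how many reads survive clipping and quality filtering, the record counts of the input and output FASTQ files can be compared. This is a small sketch under two assumptions: the files are uncompressed, and each record takes exactly four lines. The function names are hypothetical.

```shell
# count_fastq_reads FILE: number of records in an uncompressed FASTQ file,
# assuming exactly four lines per record.
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# report_retained BEFORE.fastq AFTER.fastq: one-line summary of reads kept.
report_retained() {
    before=$(count_fastq_reads "$1")
    after=$(count_fastq_reads "$2")
    echo "$after of $before reads kept"
}
```

For example, `report_retained [input.fastq] [output.fastq]` after each pre-processing step.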
    
  • Mapping ( BWA )

    $ bwa index [-p prefix] [-a bwtsw|is] <in.fasta>
    $ bwa mem [reference_genome] [read1] [read2] > [output.sam]
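Because GATK requires read group information, one option is to attach it at mapping time through the `-R` option of `bwa mem`, which takes an `@RG` header line with literal `\t` separators. The sketch below only builds that string; `make_rg_line` and the example values are hypothetical.

```shell
# make_rg_line ID SAMPLE LIBRARY PLATFORM (hypothetical helper)
# Prints an @RG header line with literal "\t" separators, the form
# that `bwa mem -R` accepts on the command line.
make_rg_line() {
    printf '@RG\\tID:%s\\tSM:%s\\tLB:%s\\tPL:%s\n' "$1" "$2" "$3" "$4"
}

# Example (not executed here):
#   bwa mem -R "$(make_rg_line run1 sampleA lib1 illumina)" \
#       [reference_genome] [read1] [read2] > [output.sam]
```

If read groups are set this way, the later AddOrReplaceReadGroups step may not be needed.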
    
  • Mark Duplicates ( Picard )

    # Sort by coordinate
    $ java -jar picard.jar SortSam I=[input.sam] O=[output.bam] SO=coordinate > [log]
    
    # Duplicate marking
    $ java -jar picard.jar MarkDuplicates I=[input.bam] O=[output.dup.bam] M=[metrics] > [log]
    
    # Read group (RG) tag insertion
    $ java -jar picard.jar \
    AddOrReplaceReadGroups \
    I=[input.dup.bam] \
    O=[output.dup.RG.bam] \
    RGID=[id] RGLB=[library] RGPL=[illumina] \
    RGPU=[barcode] RGSM=[sample] RGCN=[center] > [log]
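The three Picard steps can be chained so that each one runs only if the previous one succeeded. The following is a sketch with a dry-run switch for checking the command lines before anything is executed. The file naming scheme, the read group values, and the `run` / `picard_prepare` names are assumptions for illustration, and `picard.jar` is assumed to be in the current directory.

```shell
# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# picard_prepare SAMPLE: sort, mark duplicates, then add read groups.
# Expects SAMPLE.sam; the sample name is reused for every RG field (assumption).
picard_prepare() {
    s=$1
    run java -jar picard.jar SortSam I="$s.sam" O="$s.sorted.bam" SO=coordinate &&
    run java -jar picard.jar MarkDuplicates I="$s.sorted.bam" O="$s.dup.bam" M="$s.dup.metrics" &&
    run java -jar picard.jar AddOrReplaceReadGroups I="$s.dup.bam" O="$s.dup.RG.bam" \
        RGID="$s" RGLB="$s" RGPL=illumina RGPU="$s" RGSM="$s"
}
```

Setting `DRY_RUN=1` before calling `picard_prepare sampleA` prints the three commands without running them.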
    
  • Indel Realignment ( GATK )

    # index / dictionary file
    $ java -jar picard.jar \
    CreateSequenceDictionary \
    R=[input_ref_seq.fasta] \
    O=[output_ref_seq.dict]
    
    $ samtools faidx [genome.fasta]
    $ samtools index [input.bam]
    
    # RealignerTargetCreator
    $ java -jar GenomeAnalysisTK.jar \
    -T RealignerTargetCreator \
    -R [reference] \
    -I [original bam] \
    -known [vcf_file] \
    -o [output_candidate_region]
    
    # Realignment
    $ java -jar GenomeAnalysisTK.jar \
    -T IndelRealigner \
    -R [reference] \
    -I [original bam] \
    -known [vcf_file] \
    -targetIntervals [file with target region] \
    -o [realigned.bam] \
    -filterNoBases
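GATK looks for the sequence dictionary and the FASTA index next to the reference file, so it can help to confirm both exist before starting realignment. This is a minimal sketch: `missing_ref_files` is a hypothetical helper, and it assumes the usual naming, where `ref.fasta` is accompanied by `ref.dict` and `ref.fasta.fai`.

```shell
# missing_ref_files REF.fasta (hypothetical helper): print each companion file
# that is expected next to the reference but does not exist yet.
missing_ref_files() {
    ref=$1
    dict="${ref%.*}.dict"   # ref.fasta -> ref.dict (CreateSequenceDictionary output)
    fai="$ref.fai"          # ref.fasta -> ref.fasta.fai (samtools faidx output)
    [ -e "$dict" ] || echo "$dict"
    [ -e "$fai" ] || echo "$fai"
}
```

An empty result from `missing_ref_files [genome.fasta]` means both files are in place.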
    
  • Base Recalibration ( GATK )

    # Build the recalibration table
    $ java -jar GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R [reference] \
    -I [realigned.bam] \
    -knownSites [.vcf] \
    -knownSites [.vcf] \
    -o [recal_searching.table]
    
    # Apply the recalibration
    $ java -jar GenomeAnalysisTK.jar \
    -T PrintReads \
    -R [genome.fasta] \
    -I [realigned.bam] \
    -BQSR [recal_searching.table] \
    -o [output.bam]
    
  • Variant Calling

  • UnifiedGenotyper ( GATK )

    $ java -jar GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -R [reference] \
    -I [input.bam] \
    -o [output.vcf] \
    -stand_call_conf 30 \
    -stand_emit_conf 10
    
  • HaplotypeCaller ( GATK )

    $ java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R [reference] \
    -I [input.bam] \
    -o [output.vcf] \
    -stand_call_conf 30 \
    -stand_emit_conf 10 \
    -minPruning 3
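A quick sanity check on the caller output is to tally SNPs and indels in the resulting VCF. The sketch below uses a deliberately simple rule: a record counts as a SNP when REF and the first ALT allele are both one base long, and everything else (including MNPs and symbolic alleles) falls into the INDEL bin. The function name is hypothetical.

```shell
# count_variant_types: read a VCF on stdin and tally records as SNP or INDEL
# by the lengths of REF (column 4) and the first ALT allele (column 5).
count_variant_types() {
    awk -F '\t' '
        /^#/ { next }                       # skip header lines
        {
            split($5, alt, ",")             # first ALT allele only
            if (length($4) == 1 && length(alt[1]) == 1) snp++
            else indel++
        }
        END { printf "SNP=%d INDEL=%d\n", snp, indel }'
}
```

Usage: `count_variant_types < [output.vcf]`.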
    
  • Variant Recalibration ( GATK )

    # Recalibration training
    $ java -jar GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R [human_reference] \
    -input CEUTrio.HiSeq.WGS.b37.bestPractices.b37.chr20.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf \
    -an DP -an QD -an FS \
    -mode [SNP|INDEL] \
    -recalFile [recalibration_file] \
    -tranchesFile [recal.tranches] \
    -rscriptFile [R_recal.plots.R]
    
    # Apply the recalibration (VQSR)
    $ java -jar /DATA/1.src/bin/GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R [human_reference] \
    -input [input.vcf] \
    -mode SNP \
    -recalFile [recalibration_file] \
    -tranchesFile [recal.tranches] \
    -ts_filter_level [99.0] \
    -o [output.vcf]
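After ApplyRecalibration, the FILTER column of the output VCF shows which records passed the chosen sensitivity level. A small sketch for summarizing that column follows. The function name is hypothetical, and the exact filter labels depend on the tranches used.

```shell
# count_filters: tally the FILTER column (7th field) of a VCF read from stdin.
# Output is one "label<TAB>count" line per distinct FILTER value, sorted by label.
count_filters() {
    awk -F '\t' '!/^#/ { n[$7]++ } END { for (f in n) print f "\t" n[f] }' | sort
}
```

Usage: `count_filters < [output.vcf]`.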
    
  • Genotype annotation ( snpEff )

    # List available databases
    $ java -jar snpEff.jar databases
    
    # Download a database
    $ java -jar snpEff.jar download -v [GRCh37.71|rice5]
    
    # Annotate variants (GATK-compatible output)
    $ java -jar /DATA/1.src/snpEff/snpEff.jar \
    -v -onlyProtein \
    -i vcf \
    -o gatk [database] \
    [input.vcf] > [snpEff.output.vcf]
    
    # Annotation with GATK
    $ java -jar /DATA/1.src/bin/GenomeAnalysisTK.jar \
    -T VariantAnnotator \
    -R [human_reference.fasta] \
    -A SnpEff \
    --variant [input.SNP.vcf] \
    --snpEffFile [snpEff.vcf] \
    -o [output.SNP.anno.vcf]
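Once the VCF is annotated, the predicted impact of the variants can be summarized from the INFO column. The sketch below assumes that the SnpEff annotation written by VariantAnnotator appears as a `SNPEFF_IMPACT=` key in INFO. Verify the key name against the header of your own output; the function name is hypothetical.

```shell
# count_snpeff_impact: tally SNPEFF_IMPACT values found in the INFO column
# (8th field) of a VCF read from stdin. Records without the key are ignored.
count_snpeff_impact() {
    awk -F '\t' '
        /^#/ { next }
        {
            n = split($8, kv, ";")
            for (i = 1; i <= n; i++)
                if (kv[i] ~ /^SNPEFF_IMPACT=/) {
                    sub(/^SNPEFF_IMPACT=/, "", kv[i])
                    c[kv[i]]++
                }
        }
        END { for (k in c) print k "\t" c[k] }' | sort
}
```

Usage: `count_snpeff_impact < [output.SNP.anno.vcf]`.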
    
