An Experiment to Whole-genome Resequencing Data Analysis (II)

  • SV formation mechanism analysis, which identifies which of the following main mechanisms may be involved:

    (A) Non-allelic homologous recombination (NAHR), i.e. recombination between homologous but non-allelic sequence segments;

    (B) Non-homologous recombination (NHR) related to the repair of DNA double-strand breaks or of stalled replication forks;

    (C) Variable number of tandem repeats (VNTR), formed through expansion and contraction of the repeat units;

    (D) Transposable element insertion (TEI), mainly long / short interspersed nuclear elements (LINE / SINE) or a combination of both, together with TEI-related events.

    Structural variation detection and amplicon detection and identification analysis:

    1. Sequencing depth analysis

    Sequencing depth analysis compares the observed depth of coverage within genomic windows against the expected depth of coverage and uses the discrepancy to identify SVs. We will also use different algorithms to identify deletions and duplications in the raw sequencing data.
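
    As a minimal sketch of the read-depth idea, the snippet below labels each genomic window by the ratio of observed to expected depth. The function name and the two ratio thresholds (0.5 and 1.5) are illustrative assumptions, not values from the pipeline described here; real callers also correct for GC content and mappability.

```python
def classify_windows(depths, expected_depth, del_ratio=0.5, dup_ratio=1.5):
    """Label each window by its observed / expected depth ratio.

    del_ratio and dup_ratio are hypothetical cut-offs chosen for illustration.
    """
    labels = []
    for depth in depths:
        ratio = depth / expected_depth
        if ratio <= del_ratio:
            labels.append("deletion")
        elif ratio >= dup_ratio:
            labels.append("duplication")
        else:
            labels.append("normal")
    return labels


# Four windows sequenced at an expected depth of 30x.
print(classify_windows([30, 14, 46, 29], expected_depth=30))
```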

    1. Integration of SV detection results and FDR inference (optional step)

    (1). Verify SVs by PCR or microarray (chip)

    (2). Calculate the false discovery rate (FDR), in conjunction with the validation experiments specified by the customer

    (3) Screen the SV calls for merging and subsequent analysis: use several detection methods to detect as many SVs as possible while keeping the FDR low (<= 10%). The SV set used in the subsequent analysis is determined by the following screening method. Each SV detection algorithm is required to have an FDR below 10%, and the SVs from the algorithms that meet this condition are merged; for algorithms with an FDR above 10%, the SVs they identify can still be included in the subsequent SV analysis if they are supported by PCR or chip validation data. Finally, the SVs obtained by different algorithms are integrated according to whether the confidence intervals of their breakpoints overlap.
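
    The screening and merging rule above can be sketched roughly as follows. The data layout (dictionaries with `start_ci` / `end_ci` breakpoint confidence intervals and a `validated` flag) and all function names are assumptions made for this example; only the 10% threshold comes from the text.

```python
def fdr(validated_true, validated_false):
    """FDR = false positives / all calls that went through validation."""
    total = validated_true + validated_false
    return validated_false / total if total else 0.0


def intervals_overlap(a, b):
    """True if closed intervals a = (lo, hi) and b = (lo, hi) share a position."""
    return a[0] <= b[1] and b[0] <= a[1]


def same_sv(sv1, sv2):
    """Two calls are treated as one SV if both breakpoint intervals overlap."""
    return (sv1["chrom"] == sv2["chrom"]
            and sv1["type"] == sv2["type"]
            and intervals_overlap(sv1["start_ci"], sv2["start_ci"])
            and intervals_overlap(sv1["end_ci"], sv2["end_ci"]))


def merge_callsets(callsets, validation, max_fdr=0.10):
    """Keep calls from low-FDR algorithms, plus individually validated calls."""
    merged = []
    for name, calls in callsets.items():
        true_pos, false_pos = validation[name]
        algorithm_passes = fdr(true_pos, false_pos) <= max_fdr
        for sv in calls:
            if not (algorithm_passes or sv.get("validated", False)):
                continue
            if not any(same_sv(sv, kept) for kept in merged):
                merged.append(sv)
    return merged


x = {"chrom": "chr1", "type": "DEL", "start_ci": (100, 120), "end_ci": (500, 520)}
y = {"chrom": "chr1", "type": "DEL", "start_ci": (115, 130), "end_ci": (510, 530),
     "validated": True}
z = {"chrom": "chr2", "type": "DEL", "start_ci": (10, 20), "end_ci": (90, 99)}
w = {"chrom": "chr3", "type": "DUP", "start_ci": (10, 20), "end_ci": (90, 99),
     "validated": True}
calls = {"algoA": [x], "algoB": [y, z, w]}
validation_counts = {"algoA": (19, 1), "algoB": (8, 2)}  # (confirmed, refuted)
merged_calls = merge_callsets(calls, validation_counts)
print([sv["chrom"] for sv in merged_calls])
```

    In this toy input, the second algorithm has an FDR of 20%, so only its individually validated calls are considered, and the one that overlaps an already accepted call is not added a second time.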

    1. Variant attribute analysis

    (1) Neutral coalescent analysis

    Sequencing data can detect low-frequency variants (MAF <= 5%). The expected distribution of low-frequency variants can be calculated from population genetics theory (neutral coalescent theory). Researchers use the ratio of the observed number of variants per Mb at different allele frequencies to the number expected under the neutral coalescent, that is, with theta estimated in each Mb window of the genome, to characterize purifying selection and the population growth rate (a cancer cell line can be treated as a distinct population). This distribution is examined for SNPs, indels, genotyped large deletions, and exonic SNPs across allele frequency intervals.
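
    As an illustration of the comparison with the neutral expectation: under the standard neutral coalescent, the expected number of sites at which the derived allele appears i times in a sample is theta / i. The sketch below estimates theta with Watterson's estimator and returns the observed / expected ratio for each allele count. The function names and the toy counts are assumptions for this example, and a real analysis would run this per Mb window and per variant class.

```python
def watterson_theta(num_segregating_sites, n_chromosomes):
    """Watterson's estimator: S divided by the harmonic number a_n."""
    a_n = sum(1.0 / i for i in range(1, n_chromosomes))
    return num_segregating_sites / a_n


def observed_to_expected(sfs_counts):
    """sfs_counts[i - 1] is the number of sites with derived allele count i.

    The sample size is inferred as len(sfs_counts) + 1 chromosomes.
    """
    n = len(sfs_counts) + 1
    theta = watterson_theta(sum(sfs_counts), n)
    return [obs / (theta / i) for i, obs in enumerate(sfs_counts, start=1)]


# A spectrum that matches neutrality exactly, then one with excess singletons.
neutral_ratios = observed_to_expected([12, 6, 4, 3])
skewed_ratios = observed_to_expected([24, 6, 4, 3])
print(neutral_ratios)
print(skewed_ratios)
```

    A ratio above 1 at the lowest allele counts is the kind of excess of rare variants that purifying selection or recent population growth would produce.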

    (2). Allelic frequency and number distribution of novel variants

    The analysis covers the newly predicted SNPs, indels, large deletions, and exonic SNPs, reported as the fraction of novel variants in each allele frequency category. "Newly predicted" means that the predicted variants were compared with dbSNP (currently version 129), the deletion database dbVar (June 2010 version) and published genomic indel studies in order to identify novel SNPs, indels and deletions. dbSNP contains SNPs and indels; dbVar contains deletions, duplications, and mobile element insertions. Short indels and large deletions are also provided by dbRIP and other genomics studies (the J. C. Venter and Watson genomes, the Yanhuang Project Asian genome).
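
    A minimal sketch of the novelty calculation, assuming each variant is reduced to an identifier and an allele frequency, and that the reference databases have been flattened into a set of known identifiers. The bin edges, identifiers and function name are hypothetical.

```python
def novel_fraction_by_frequency(variants, known_ids, bin_edges):
    """Fraction of novel variants in each (low, high] allele frequency bin.

    Returns None for bins that contain no variants.
    """
    counts = [[0, 0] for _ in range(len(bin_edges) - 1)]  # [novel, total]
    for variant_id, freq in variants:
        for b in range(len(bin_edges) - 1):
            if bin_edges[b] < freq <= bin_edges[b + 1]:
                counts[b][1] += 1
                if variant_id not in known_ids:
                    counts[b][0] += 1
                break
    return [novel / total if total else None for novel, total in counts]


toy_variants = [("rs1", 0.01), ("v2", 0.03), ("v3", 0.04), ("rs4", 0.2), ("rs5", 0.4)]
toy_known = {"rs1", "rs4", "rs5"}
fractions = novel_fraction_by_frequency(toy_variants, toy_known, [0.0, 0.05, 0.5, 1.0])
print(fractions)
```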

    (3). Size distribution and novelty distribution of variants

    Calculate the size distribution of SNPs, deletions, and insertions; for each class, calculate the ratio of the number of newly predicted variants to the number already present in the reference databases. The reference databases are the same as above: dbSNP contains SNPs and indels; dbVar contains deletions, duplications, and mobile element insertions; dbRIP and other genomics studies (the J. C. Venter and Watson genomes, the Yanhuang Project Asian genome) provide short indels and large deletions. The size distribution can reveal the characteristic positions of LINE and Alu elements.

    (4). Breakpoint junction analysis of structural variants (SVs)

    Starting from the SV calls of the different detection methods, a series of screening steps is applied to build a database of breakpoint junctions for all SVs: SVs with a length of 50 bp or more are retained; SVs with homology or microhomology at the breakpoint junction are analyzed; and SVs that share the same start and end coordinates are de-duplicated.

    Analyze and classify the SV breakpoints: breakpoints can be assigned to the following categories according to their likely formation mechanism:
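
    The screening steps can be sketched as below for the simple case of deletions on a reference string. The tuple layout, function names and the one-directional microhomology check (deleted sequence against the right flank only) are simplifying assumptions; a fuller implementation would also compare against the left flank.

```python
def microhomology_length(ref, start, end):
    """For a deletion of ref[start:end], count how many bases at the start of
    the deleted segment are identical to the bases just after the deletion."""
    length = 0
    max_len = end - start
    while (length < max_len
           and end + length < len(ref)
           and ref[start + length] == ref[end + length]):
        length += 1
    return length


def filter_svs(svs, min_len=50):
    """Keep SVs of at least min_len bp and drop exact coordinate duplicates."""
    seen = set()
    kept = []
    for chrom, start, end in svs:
        key = (chrom, start, end)
        if end - start >= min_len and key not in seen:
            seen.add(key)
            kept.append(key)
    return kept


# Deleting "GTCATT" leaves a junction with 3 bp of microhomology ("GTC").
print(microhomology_length("AAACGTCATTGTCGGG", 4, 10))
print(filter_svs([("chr1", 100, 400), ("chr1", 100, 400), ("chr1", 500, 520)]))
```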

    (A) Non-allelic homologous recombination (NAHR);

    (B) Non-homologous recombination (NHR), including non-homologous end-joining (NHEJ) and fork stalling and template switching / microhomology-mediated break-induced replication (FoSTeS / MMBIR);

    (C) Variable number of tandem repeats (VNTR);

    (D) Transposable element insertion (TEI).

    SV formation preference analysis

    Analyze the relationship between the formation mechanism of each SV and the sequence context near its breakpoints, including chromosomal landmarks (telomeres, centromeres), recombination regions, repetitive sequences, GC content, short DNA motifs and regions of microhomology.
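
    One of the sequence features listed above, GC content near a breakpoint, could be computed along these lines. The flank size of 100 bp and the function names are assumptions for illustration.

```python
def gc_content(seq):
    """Fraction of G/C among called bases; None if no A/C/G/T is present."""
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        return None
    return sum(base in "GC" for base in called) / len(called)


def flank_gc(ref, breakpoint, flank=100):
    """GC content of the window [breakpoint - flank, breakpoint + flank)."""
    left = max(0, breakpoint - flank)
    right = min(len(ref), breakpoint + flank)
    return gc_content(ref[left:right])


print(flank_gc("AAAAGGGGCCCCTTTT", 8, flank=4))
```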

    1. Estimation of mutation rate

    For sequencing designs based on family members, researchers mainly detect de novo mutations (DNMs); using different methods / algorithms, we provide a report of the inferred DNMs for each family:

    (1) Based on the genotype inference results, estimate the number of de novo mutations per base position per person;

    (2) Use a Bayesian method to calculate the posterior probability of each DNM under the family-based design.
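
    To make the Bayesian step concrete, here is a deliberately simplified sketch that considers only the child's reads at a site where both parents are assumed to be homozygous for the reference allele. It compares two hypotheses: a true heterozygous de novo mutation (alternate reads drawn with probability 0.5) and a sequencing artifact (alternate reads drawn at the error rate). The prior, the error rate and the function names are assumptions, and a full model would also include the parental genotype likelihoods.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def dnm_posterior(alt_reads, total_reads, prior=1e-8, error_rate=0.01):
    """Posterior probability of a de novo heterozygous site in the child.

    prior and error_rate are illustrative values, not pipeline settings.
    """
    like_dnm = binom_pmf(alt_reads, total_reads, 0.5)
    like_artifact = binom_pmf(alt_reads, total_reads, error_rate)
    numerator = prior * like_dnm
    return numerator / (numerator + (1 - prior) * like_artifact)


print(dnm_posterior(15, 30))  # balanced alternate-allele support
print(dnm_posterior(1, 30))   # a single alternate read
```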

    1. Functional analysis and annotation of SNPs and SNVs

    (1). Annotation of ancestral alleles

    Align the four genomes of human (NCBI36), chimpanzee (chimpanzee 2.1), orangutan (PPYG2) and rhesus macaque (MMUL1) to find conserved sequence regions and infer ancestral alleles, and to carry out an evolutionary analysis of duplication / deletion events.

    (2). Analyze the diversity and divergence of different regions on the gene structure sequence

    Based on the genotype analysis results, calculate the degree of diversity along the gene structure, that is, the heterozygosity; the heterozygosity can indicate the presence of selection and reveal local variation in the distribution pattern along the gene structure. The regions examined are the 200 bp upstream of the 5'UTR, the 5'UTR, the first exon, the first intron, the middle exons, the middle introns, the last exon and last intron, and the 3'UTR and the 200 bp downstream of it. Analyze the diversity and evolutionary divergence of the regions near the start / end positions of the coding transcripts.
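
    A minimal sketch of a per-region heterozygosity calculation, assuming biallelic sites with known alternate allele frequencies: each variable site contributes its expected heterozygosity 2p(1 - p), and the sum is divided by the region length. The function name and input layout are assumptions; the sketch ignores the small-sample correction a real analysis might apply.

```python
def region_heterozygosity(alt_freqs, region_length):
    """Average per-base expected heterozygosity of one region.

    alt_freqs lists the alternate allele frequency of every variable site
    in the region; invariant sites contribute zero.
    """
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    return sum(2 * p * (1 - p) for p in alt_freqs) / region_length


# A 100 bp region (for example a first exon) with two variable sites.
print(region_heterozygosity([0.5, 0.1], 100))
```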

    (3). Detection of disease variants

    Compare the variants found in the sequenced sample with the HGMD disease variant data to obtain the missense and nonsense SNPs recorded in both; map the HGMD disease-related mutations to the CUI (disease concept classification identification database) to obtain the disease phenotypes of all HGMD variants, as well as the disease phenotypes of the variants shared between HGMD and the sequencing data; then test which disease phenotypes are enriched among the sample's variants using Fisher's exact test, with Bonferroni correction for multiple hypothesis testing.
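
    The enrichment test can be sketched with the standard library alone. The function below computes a one-sided Fisher's exact test from the hypergeometric distribution for a 2x2 table (sample variants with / without a phenotype versus the remaining HGMD variants with / without it), and the second function applies the Bonferroni correction. The table layout and function names are assumptions for this example.

```python
from math import comb


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher's exact test for enrichment.

    a: sample variants with the phenotype, b: sample variants without it,
    c: other variants with the phenotype,  d: other variants without it.
    Returns P(X >= a) under the hypergeometric null.
    """
    row1 = a + b
    col1 = a + c
    total = a + b + c + d
    upper = min(row1, col1)
    favourable = sum(comb(col1, x) * comb(total - col1, row1 - x)
                     for x in range(a, upper + 1))
    return favourable / comb(total, row1)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


p = fisher_enrichment_p(3, 1, 1, 3)
print(p)
print(bonferroni([0.01, 0.5, 0.04]))
```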

    (4). Functional annotation of the genes contained in copy number variants (CNVs)

    CNVs are classified into 2 categories according to whether or not they overlap segmental duplication (SD) regions. The functional enrichment of the genes contained in each category of CNV is calculated and plotted, with the significance on the horizontal axis and the significant functional terms on the vertical axis.

    To be continued in Part III…