Python代写:BIOL7200 Estimating Genomic Distances Using The Approximate Jaccard



For this assignment, the module list is “sys”, “re”, “os”, “random”, “argparse”, “multiprocessing”, and “threading”. The grading CI will not install missing modules. Do not use input() for any input.

Instructions for submission

  • Your code must be available on GitLab at the above time to be graded
  • Name your script as
  • Your code should run as ./ <options and inputs described below>
  • DO NOT HARDCODE any file name!
  • Please use the #!/usr/bin/env python as your shebang
  • Your script should finish within 10 minutes

Estimating genomic distances using the approximate Jaccard

In genomics, quantitatively comparing two or more genomes (or individuals or isolates) is a common task. We do this when we are performing any sort of population genomic analysis such as comparing genomic distances between human individuals (Hamming distance), genomic distances between microbial genomes (ANI, MLST, etc.) or genomic distances between microbial communities (Bray-Curtis, UniFrac, etc.). Next semester in BIOL 7210, you will almost certainly be doing some sort of whole-genome comparison. In the pre-genomics era, genomes were compared using a technique called DNA-DNA hybridization (DDH). DDH is a time consuming, costly, and dangerous (radio-labeled DNA) process that does not scale well to tens of samples. A computational approach that can be thought of as in silico DDH, Average Nucleotide Identity (ANI), was introduced in 2007 and uses BLASTn or other sequence alignment techniques to estimate a whole-genome alignment by aligning ~1000 bp chunks of two genomes. For those interested in learning more, this page: contains a good, simplified discussion on what ANI is.

ANI scales to tens and even a few hundred genomes but is computationally expensive. A relative recent technique, Mash, was introduced in 2016 and approximates genomic distance using the MinHash. Mash estimates genomic distances by comparing sets of kmers between genomes and computing an approximate Jaccard index.

For this bonus exercise, we want you to implement a simplified version of the core Mash algorithm in Python and return the Mash distance, D, for two or more input genomes. This whole problem is algorithmically and conceptually simple so don’t get afraid from the set theory formula below. We have tried to be explicit about the complete process in this document so you have a clear understanding of the problem statement. The solution, however, is simple.

  1. What are you provided with?
    You are provided with one or more genomes serving as your input. These should be nucleotide FASTA files. You are welcome to reuse code from the kmer couter or all2fasta scripts, try importing them as modules.
  2. What is Jaccard (j) and Mash distance (D)?
    The Jaccard index is a measure of similarity between sets and is defined by equation (1). Jaccard distance, a measure of dissimilarity between sets, is obtained by subtracting the Jaccard index from 1. In the case of genomic sequences, your set is comprised of words length k (i.e. kmers) derived from genome sequences. The below image depicts calculating an estimate of Jaccard index from two genomes (A, blue and B, red) using k = 5 and a set size, s = 5. A random subset (s) is used to simplify the computational requirements. Mash distance, D, is proportionally related to the Jaccard index calculated for a pair of genomes. Mash distance is more robust than a simple Jaccard and, with the appropriate assumptions, can closely approximate ANI-based distance measures with significantly less computation.
  3. How do you get your kmer?
    There is an additional wrinkle in how you should kmerize your genome sequences. For each window, length k, find the forward and reverse-complement and store the lexicographically smaller of the two. For example, for the 5-mer “ATGGCA” and its reverse complement “TGCCAT”, “ATGGCA” is lexicographically smaller. Finding the smaller kmer in Python is very easy, don’t over complicate things. The lexicographical order can be found using the sort function.
  4. What are you expected to do?
    We expect you to implement the below options / flags using argparse
    Below are listed the types of input your script should accept and what the expected outputs are:
    • If -a and -b are given, compute the pairwise distance. Output a single, tab-separated line: #a_file_name b_file_name distance. Please note that the space above are single tab spaces.
    • If -a and -d are given, compute the all-against-a pairwise distance. Output in two column, tab=separated format.
    • If only -d is given, compute the pairwise distance and output a square, tab-separated matrix
      Please note that the above values are only for example purposes.
  5. What errors are you expected to catch?
    You should not assume that your user has read the documentation, therefore you need to check for the common pitfalls listed below.
    • User supplies -a without -b or -d
    • User supplies -b without -a or with -d
    • User supplies -a, -b and -d
    • User supplies an option which does not exist
    • User supplies an empty directory for -d
    • User supplies files that are not nucleotide FASTA
    • Threads are positive integer values (thread 0 is obviously senseless)
    • You are welcome to catch other types of errors (like no write permission for output) but these are not required. Try-Except and If-Then statements will be your friends for this script.
  6. Where does the parallelization bit comes in?
    Each pair of genomic comparison is independent from each other. This is a great example of an embarrassingly parallel problem. Embarrassingly parallel problems are the easiest problems to implement as each task can be calculated without looking at other running/about to run tasks. Consider this example:
    If I need to calculate genomic distance between 4 genomes, I will have to do 4 x 4 pairwise calculations. Given that the distance is non-directional (i.e., distance of genome1 to genome2 is the same as distance of genome 2 to genome1) and distance to self is always 0, we can get away with calculating one half of the matrix (lower triangle or upper triangle). Thus, for 4 genomes, we really need to calculate 6 pairwise distances (generally speaking, for n genomes, we need to calculate n*(n-1)/2 distances). If I run my with 6 threads, I can launch 6 threads and give one pairwise calculation to each thread, wait for them to finish and generate my matrix. Theoretically, this will give me a 6X speed-up compared to when I run the script on a single thread.
    Deliverable: Python script ( which takes on three argument from the command line.
    Syntax: ./ -a <fasta> -b <fasta> [-d|--dirs <dir of fastas>] [-o|-output] [-t|--threads] [-v|--verbose] [-h|--help]

Example usage 1

./ -a genomeA.fasta -b genomeB.fasta -o genomeDist_output.txt -s 100

Compares genomeA and genomeB, using 100 random kmers, writes output to “genomeDist_output.txt”

Example usage 2

./ -a genomeA.fasta -d genomesFolder -o a_verse_dir.txt -t 10

Compares genomeA and the genomes in directory “genomesFolder”, using ten threads and writing the output to “a_verse_dir.txt”

Example usage 3

./ -a genomeA.fasta -d otherGenomes -o a_verse_dir.txt -t 10

When run immediately after example 2, your program should quit because the output file exists

Example usage 4

./ -a genomeA.fasta -d otherGenomes -o a_verse_dir.txt -t 10 -f

Compares genomeA and the genomes in directory “otherGenomes”, using ten threads and overwrites the output file, “a_verse_dir.txt”, with the new data

Again, keep in mind

  • Do not use any Python modules other than sys, re, os, random, argparse, multiprocessing, and threading. You may write and import your own modules, just make sure they’re on GitLab too.
  • Do not hardcode or use input()
  • Use shebang line