HUMAN GENOME ANALYSIS

Recently, the T2T Consortium published a major update of the human genome. For details see:
https://www.biorxiv.org/content/10.1101/2021.05.26.445798v1
Nurk et al. "The complete sequence of a human genome." bioRxiv (27.05.2021)
https://www.science.org/doi/10.1126/science.abj6987 (31.03.2022)

Task 0:
Download the FASTA file with all human chromosomes (v2.0).
Note: the chm13v2.0.fa.gz file is 936 MB.

Read at home also about:
https://en.wikipedia.org/wiki/Compression_of_genomic_sequencing_data

Task 1:
Write a Python program that calculates:
1) for the full human genome:
   - length
   - nucleotide counts and frequencies
   - GC content
2) for each chromosome:
   - length
   - nucleotide counts and frequencies
   - GC content

The program should read the file line by line*, gather the statistics, show them on the screen (prettytable) and additionally store them in a CSV file.

* Avoid loading the whole file into memory (3 GB).

Hint: it is advisable to inspect the content and/or create a shorter version of the file for testing (Unix commands: head, tail, cut, grep, wc, etc.).
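One possible starting point for Task 1 is sketched below. It streams the FASTA file line by line and keeps only one counter per chromosome in memory. The function names (count_bases, make_row, write_report) are made up for this illustration, and GC% is assumed here to be (G + C) divided by the full length including N; adjust that if you prefer to exclude N. Printing with prettytable is left out so that the sketch needs only the standard library.

```python
import csv
import gzip
from collections import Counter

BASES = "GCTAN"  # column order used in the requested CSV
HEADER = ["chr_id/total", "GC%", "G", "C", "T", "A", "N",
          "G%", "C%", "T%", "A%", "N%", "len"]


def count_bases(lines):
    """Stream FASTA lines and return {chr_id: Counter of upper-cased bases}."""
    stats = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The id is taken as the first whitespace-separated token of the header.
            current = stats.setdefault(line[1:].split()[0], Counter())
        elif current is not None:
            # upper() folds soft-masked (lower-case) bases into the same counts.
            current.update(line.upper())
    return stats


def make_row(name, counts):
    """Build one CSV row: name, GC%, counts, percentages, length."""
    total = sum(counts.values())
    n = [counts.get(b, 0) for b in BASES]
    pct = [100.0 * c / total if total else 0.0 for c in n]
    gc = pct[0] + pct[1]  # assumption: GC% is relative to the full length
    return [name, round(gc, 2), *n, *(round(p, 2) for p in pct), total]


def write_report(fasta_path, csv_path):
    """Read a (possibly gzipped) FASTA file and write per-chromosome and total rows."""
    opener = gzip.open if fasta_path.endswith(".gz") else open
    with opener(fasta_path, "rt") as handle:
        stats = count_bases(handle)
    genome = sum(stats.values(), Counter())  # whole-genome totals
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(HEADER)
        for name, counts in stats.items():
            writer.writerow(make_row(name, counts))
        writer.writerow(make_row("total", genome))
```

For example, write_report("chm13v2.0.fa.gz", "stats.csv") would process the compressed file directly. It is worth trying it first on a small test file cut out with head.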
Result: a Python script that generates a CSV file with the columns:
chr_id/total,GC%,G,C,T,A,N,G%,C%,T%,A%,N%,len

Task 2:
Write a 6-frame ORF program:
a) a version using Biopython (Bio.SeqUtils.six_frame_translations)
b) a version using the EMBOSS package (wrapper)
c) pure Python (no extra libraries)

Run each version on chr1 and chrM*:
1) compare the runtime of versions a-c
2) report the predicted CDS
3) write a simple mapper (check how many proteins you were able to find, and compare the numbers to the official data** and to the UniProt human proteome)

* note that chr1 and chrM require different genetic codes
** proteins without chromosome annotation can be found in the "Gene annotation" section (Protein coding translated transcripts): chm13.draft_v1.1.gene_annotation.protein.fasta

For individual files see:
https://www.mimuw.edu.pl/~lukaskoz/teaching/adp/labs/lab_human_genome/

Homework
Prepare a short report (PDF) containing the stats from Task 1 and the results on runtime and CDS mapping from Task 2. Also provide the scripts for points 1-3 of Task 2.

Send the homework to lukaskoz@mimuw.edu.pl by 06.04.2025 (email subject: 'ADP25_lab6_hw_Surname_Name', no text in the email body, with the attachment 'ADP25_lab6_hw_Surname_Name.7z' containing the project; do not use Polish letters).
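As a rough illustration for Task 2c, the sketch below translates a sequence in all six frames using only plain Python. The two codon tables are built from the amino-acid strings of NCBI translation tables 1 (standard, for chr1) and 2 (vertebrate mitochondrial, for chrM). All function names are made up for this example. The ORF rule (first M up to the next stop, with a hypothetical min_len cutoff) is a simplifying assumption, not the official definition of a CDS; a segment at the end of a frame is reported even when no stop codon follows it.

```python
from itertools import product

BASES = "TCAG"  # NCBI ordering of the codon positions
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
VERT_MITO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def codon_table(aas):
    """Map each of the 64 codons (TTT, TTC, ...) to its amino acid letter."""
    return {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), aas)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq, table):
    """Translate complete codons; codons containing N (or others) become X."""
    return "".join(table.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq, table):
    """Return {frame: protein} for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:], table)
        frames[-(offset + 1)] = translate(rc[offset:], table)
    return frames


def find_orfs(protein, min_len=30):
    """Assumed rule: take each stop-delimited segment from its first M onward."""
    orfs = []
    for segment in protein.split("*"):
        start = segment.find("M")
        if start != -1 and len(segment) - start >= min_len:
            orfs.append(segment[start:])
    return orfs
```

For instance, six_frames(seq, codon_table(VERT_MITO)) would be the call for chrM, and the same function with codon_table(STANDARD) for chr1. For a sequence as long as chr1 this naive version will be slow and memory-hungry, which is part of what the runtime comparison in point 1 is meant to show.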