HUMAN GENOME ANALYSIS

The T2T (Telomere-to-Telomere) Consortium recently published a major update of the human genome. For details, see:

https://www.biorxiv.org/content/10.1101/2021.05.26.445798v1 Nurk et al. "The complete sequence of a human genome." bioRxiv (27.05.2021)
https://www.science.org/doi/10.1126/science.abj6987 (31.03.2022)

Task 0:
Download the FASTA file with all human chromosomes (v2.0).
Note that the chm13v2.0.fa.gz file is 936 MB.

At home, also read about:
https://en.wikipedia.org/wiki/Compression_of_genomic_sequencing_data

Task 1:
Write a Python program that calculates:

1) for the full human genome:
- length
- nucleotide counts and frequencies
- GC content

2) for each chromosome:
- length
- nucleotide counts and frequencies
- GC content

The program should read the file line by line*, gather the statistics, print them on the screen (using prettytable), and additionally store them in a CSV file.

* Avoid loading the whole file into memory (3 GB).

Hint: inspect the file content and/or create a shorter version of the file for testing (Unix commands: head, tail, cut, grep, wc, etc.).
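
The shortened test file mentioned in the hint can also be produced in Python. The sketch below is only one possible approach: it keeps every header line plus the first few sequence lines of each record, so the test file still contains all chromosomes. The function names and the lines_per_record parameter are made up for this illustration.

```python
import gzip


def sample_records(lines, lines_per_record=100):
    """Yield each FASTA header and at most `lines_per_record` sequence lines after it."""
    kept = 0
    for line in lines:
        if line.startswith(">"):
            kept = 0          # a new record starts, reset the per-record counter
            yield line
        elif kept < lines_per_record:
            kept += 1
            yield line


def write_sample(src_path, dst_path, lines_per_record=100):
    """Stream a (possibly gzipped) FASTA file into a much smaller uncompressed copy."""
    opener = gzip.open if src_path.endswith(".gz") else open
    with opener(src_path, "rt") as src, open(dst_path, "w") as dst:
        dst.writelines(sample_records(src, lines_per_record))
```

For example, write_sample("chm13v2.0.fa.gz", "test.fa", 50) would stream the full file once and never hold more than one line in memory.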

Result: a Python script that generates a CSV file with the following columns:
chr_id/total,GC%,G,C,T,A,N,G%,C%,T%,A%,N%,len
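
A minimal sketch of the streaming approach from Task 1 is given below; treat it as a starting point rather than a reference solution. It rests on several assumptions: lowercase (soft-masked) letters are counted as their uppercase bases, GC% is computed over the full sequence length (including N), the sequence name is the first word of the header line, and the prettytable output is left out so that only the standard library is needed. All function names are hypothetical.

```python
import csv
import gzip
from collections import Counter

FIELDS = ["chr_id/total", "GC%", "G", "C", "T", "A", "N",
          "G%", "C%", "T%", "A%", "N%", "len"]


def open_text(path):
    """Open a plain or gzipped text file for line-by-line reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def count_per_sequence(lines):
    """Return {sequence name: Counter of letters}, reading one line at a time."""
    counts = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the name is the first word after '>'.
            current = counts.setdefault(line[1:].split()[0], Counter())
        elif current is not None:
            current.update(line.upper())  # soft-masked bases count as normal bases
    return counts


def make_row(name, counter):
    """Build one CSV row; percentages are relative to the full length."""
    length = sum(counter.values())

    def pct(n):
        return round(100 * n / length, 2) if length else 0.0

    row = {"chr_id/total": name, "len": length,
           "GC%": pct(counter["G"] + counter["C"])}
    for base in "GCTAN":
        row[base] = counter[base]
        row[base + "%"] = pct(counter[base])
    return row


def genome_stats(lines):
    """Return one row per sequence plus a final 'total' row."""
    counts = count_per_sequence(lines)
    total = Counter()
    for counter in counts.values():
        total.update(counter)  # adds the counts together
    rows = [make_row(name, counter) for name, counter in counts.items()]
    rows.append(make_row("total", total))
    return rows


def write_csv(rows, path):
    """Store the rows in a CSV file with the column order given in FIELDS."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

A possible call is write_csv(genome_stats(open_text("chm13v2.0.fa.gz")), "stats.csv"). Only the per-sequence counters are kept in memory, never the sequence itself.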

Task 2:
Write a six-frame ORF finder in three versions:
a) using Biopython (Bio.SeqUtils.six_frame_translations)
b) using the EMBOSS package (wrapper)
c) in pure Python (no extra libraries)
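
For version c), one possible sketch of six-frame translation in pure Python follows. The two amino-acid strings are NCBI translation tables 1 (standard) and 2 (vertebrate mitochondrial), listed in TCAG codon order. The ORF definition used here (from the first M in a stop-free stretch to the next stop codon) and the min_len parameter are assumptions made for this illustration; the task itself does not fix them. Codons containing N or any other non-ACGT letter are translated as X.

```python
from itertools import product

BASES = "TCAG"
# NCBI translation tables 1 and 2, amino acids in TCAG codon order.
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
VERT_MITO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def make_table(amino_acids):
    """Map each of the 64 codons to its amino acid (or '*' for stop)."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    return dict(zip(codons, amino_acids))


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq, table):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(table.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq, table):
    """Return the six translations keyed '+1'..'+3' and '-1'..'-3'."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:], table)
        frames[f"-{offset + 1}"] = translate(rc[offset:], table)
    return frames


def orfs_in_frame(protein, min_len=1):
    """Return peptides running from the first M of a stretch to a stop codon."""
    found = []
    # The last piece after split() is not terminated by a stop, so skip it.
    for segment in protein.split("*")[:-1]:
        start = segment.find("M")
        if start != -1 and len(segment) - start >= min_len:
            found.append(segment[start:])
    return found
```

For example, six_frames(chr_seq, make_table(VERT_MITO)) would be the call for chrM, where chr_seq is a placeholder for the chromosome sequence. Note that this sketch needs the whole chromosome sequence as one string; for chr1 a chunked variant may be preferable.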

Run each version on chr1 and chrM*:
1) compare the runtime of versions a-c
2) report the predicted CDS
3) write a simple mapper (check how many proteins you were able to find, and compare the numbers to the official data** and to the UniProt human proteome)

*  note that chr1 and chrM require different genetic codes (codon tables)
** proteins without chromosome annotation can be found in the "Gene annotation" section (Protein coding translated transcripts):
chm13.draft_v1.1.gene_annotation.protein.fasta
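
Points 1 and 3 above could be approached along the lines of the sketch below. It assumes the simplest possible mapping, an exact string match between a predicted peptide and a reference protein, with a trailing '*' stripped from reference sequences. This is a deliberate simplification: because most nuclear genes are spliced, exact matches on chr1 are expected to be rare, and a more tolerant matching scheme may be needed. All function names are hypothetical.

```python
import time


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def map_predictions(predicted, reference):
    """Return {predicted peptide: [matching reference headers]} for exact matches."""
    index = {}
    for header, seq in reference:
        # Assumption: a trailing '*' in the reference only marks the stop codon.
        index.setdefault(seq.rstrip("*"), []).append(header)
    return {pep: index[pep] for pep in set(predicted) if pep in index}


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start
```

A possible use is hits = map_predictions(orfs, read_fasta(open("chm13.draft_v1.1.gene_annotation.protein.fasta"))), where orfs is a placeholder for the list of predicted peptides; len(hits) then gives the number of recovered proteins. Wrapping each of the versions a-c in timed(...) gives comparable wall-clock runtimes.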


For individual files see:
https://www.mimuw.edu.pl/~lukaskoz/teaching/adp/labs/lab_human_genome/

Homework

Prepare a short report (PDF) containing the statistics from Task 1 and the runtime and CDS mapping results from Task 2.
Also provide the scripts for points 1-3 of Task 2.

Send the homework to lukaskoz@mimuw.edu.pl by 06.04.2025. The email subject should be 'ADP25_lab6_hw_Surname_Name', the message body should be left empty,
and the project should be attached as 'ADP25_lab6_hw_Surname_Name.7z' (without Polish letters).