================================================
====   The importance of the (big) data    =====
====                PART 1                 =====
================================================

Exercise 1

Make a plot depicting how much data comes from different techniques over time
in PDB, using the formula:

    sum(n * 1/r)

where
    n - number of residues
    r - resolution

The data needed for the plots are available at:
https://files.rcsb.org/pub/pdb/derived_data

Let's focus on protein-only entries.

Expected output:
1) animated bar plot with 3 bars (time span 2005-2024), in two versions:
   - absolute numbers
   - percentages

   Useful links:
   https://matplotlib.org/stable/api/animation_api.html
   https://holypython.com/python-visualization-tutorial/creating-bar-chart-animations/
   https://github.com/dexplo/bar_chart_race

2) static pie chart for all data (no time included)

Possible classes: X-RAY DIFFRACTION, ELECTRON MICROSCOPY, SOLUTION NMR
* Whenever possible, join similar classes (e.g. SOLUTION NMR and SOLID-STATE NMR).
* If more than one method is mentioned for an entry, count them as separate
  units (add the data to all classes).
* You can ignore other rare classes like FIBER DIFFRACTION, SOLUTION SCATTERING, etc.

NMR case:
https://www.sciencedirect.com/science/article/abs/pii/S0066410306590052
"It has been known that the highest quality NMR structures have accuracies
comparable to the medium-resolution X-ray structures (2.0-2.5 A) for protein
backbone atomic coordinates"
Thus, we set r for NMR structures to 2.5.


Exercise 2

Calculate the average protein length and amino acid content (percentage
composition for the 20 amino acids) for different data sets:
a) E. coli, Bacillus subtilis, human, yeast, A. thaliana, D. melanogaster,
   C. elegans, mouse, zebrafish (D. rerio)
b) PDB
c) UniProt
   - UniProt (Swiss-Prot only)
   - 100 randomly selected Bacteria
   - 100 randomly selected Viruses
   - 100 randomly selected Archaea
   - 100 randomly selected Eukaryota
   Help: https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/README

Note: Sources of data:
PDB - point (b):
https://files.rcsb.org/pub/pdb/derived_data/pdb_seqres.txt.gz
or
https://www.mimuw.edu.pl/~lukaskoz/teaching/adp/labs/lab3/pdb_seqres.txt.gz
All other data can be found at UniProt. For point (a) download the data
manually; for point (c) you will need to do some scripting.

Task:
1) PLOTS
   Make bar plots comparing the average protein length between:
   - selected organisms (a)
   - all kingdoms (c)
   - PDB vs UniProt (Swiss-Prot) (b & c)

2) TABLES
   Prepare tables with the average amino acid content (Table 1) and protein length (Table 2) for (a).
   Prepare tables with the average amino acid content (Table 3) and protein length (Table 4) for (c).
   You also need to estimate the error (e.g. the standard deviation via bootstrapping).

3) N-terminus case
   Check which amino acid is the most frequent at the N-terminus.
   Can you explain why it is this one?
   See also: https://en.wikipedia.org/wiki/N-end_rule

=============================================================
Additional material:
https://en.wikipedia.org/wiki/FASTA_format
https://en.wikipedia.org/wiki/List_of_model_organisms

Useful packages: matplotlib, prettytable

All statistics and plots must be calculated/done using PYTHON.
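
Hint for Exercise 1: the sum(n * 1/r) score can be accumulated per year and per class as in the minimal sketch below. The record layout (year, methods, n_residues, resolution), the short class labels and the function names are assumptions made for illustration only; how you extract these fields from the derived_data files is up to you.

```python
from collections import defaultdict

NMR_RESOLUTION = 2.5  # fixed r for NMR entries, as set in the exercise

# Map raw method names to the three plotted classes (labels are illustrative);
# any method not listed here is ignored.
METHOD_CLASS = {
    "X-RAY DIFFRACTION": "X-RAY",
    "ELECTRON MICROSCOPY": "EM",
    "SOLUTION NMR": "NMR",
    "SOLID-STATE NMR": "NMR",
}


def information_by_year(entries):
    """Return {year: {class: sum(n / r)}}.

    `entries` is an iterable of (year, methods, n_residues, resolution)
    tuples; this layout is hypothetical, not a PDB file format.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for year, methods, n_residues, resolution in entries:
        # A set, so two NMR flavours on one entry are counted once.
        classes = {METHOD_CLASS[m] for m in methods if m in METHOD_CLASS}
        for cls in classes:
            r = NMR_RESOLUTION if cls == "NMR" else resolution
            if r is None or r <= 0:
                continue  # no usable resolution, skip this contribution
            totals[year][cls] += n_residues / r
    return {year: dict(per_class) for year, per_class in totals.items()}


def to_percentages(per_class):
    """Convert absolute scores of one year into percentages."""
    total = sum(per_class.values())
    if total == 0:
        return {}
    return {cls: 100.0 * value / total for cls, value in per_class.items()}


# Made-up records, only to show the bookkeeping.
demo = [
    (2005, ["X-RAY DIFFRACTION"], 300, 2.0),
    (2005, ["SOLUTION NMR"], 100, None),
    (2006, ["ELECTRON MICROSCOPY"], 1000, 4.0),
    (2006, ["X-RAY DIFFRACTION", "SOLUTION NMR"], 200, 2.0),
    (2006, ["FIBER DIFFRACTION"], 50, 3.0),  # rare class, ignored
]
result = information_by_year(demo)
print(result)
print(to_percentages(result[2005]))
```

The sketch keeps each year separate; whether the animation should show per-year or cumulative values is your decision, and a running sum over the years can be added on top of this dictionary.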
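
Hint for Exercise 2: the average length, the amino acid composition, the N-terminal residue count and a bootstrap error estimate can be sketched with the standard library alone, as below. All function names are illustrative, the tiny in-line FASTA text is made up, and resampling whole sequences with replacement is only one possible way to bootstrap; treat it as a starting point, not a reference solution.

```python
import random
import statistics
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def read_fasta(lines):
    """Yield sequences (header lines dropped) from an iterable of FASTA lines."""
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if chunks:
                yield "".join(chunks)
            chunks = []
        elif line:
            chunks.append(line.upper())
    if chunks:
        yield "".join(chunks)


def mean_length(seqs):
    """Average sequence length."""
    return statistics.mean(len(s) for s in seqs)


def composition(seqs):
    """Percentage of each standard amino acid; other symbols (e.g. X) are
    left out of the denominator, which is an assumption of this sketch."""
    counts = Counter()
    for s in seqs:
        counts.update(s)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return {a: 0.0 for a in AMINO_ACIDS}
    return {a: 100.0 * counts[a] / total for a in AMINO_ACIDS}


def n_terminal_counts(seqs):
    """Count the first residue of every non-empty sequence."""
    return Counter(s[0] for s in seqs if s)


def bootstrap_sd(seqs, stat, n_boot=200, seed=0):
    """Standard deviation of `stat` over resamples drawn with replacement."""
    rng = random.Random(seed)
    values = [stat(rng.choices(seqs, k=len(seqs))) for _ in range(n_boot)]
    return statistics.stdev(values)


# Made-up two-protein example.
fasta_text = """>p1
MKV
LA
>p2
MAAG
"""
seqs = list(read_fasta(fasta_text.splitlines()))
print(mean_length(seqs), bootstrap_sd(seqs, mean_length))
print(composition(seqs)["A"])
print(n_terminal_counts(seqs).most_common(1))
```

For the compressed pdb_seqres.txt.gz file, the same read_fasta generator can be fed from gzip.open(path, "rt"). The per-residue error for the composition tables can be obtained by passing a small function such as lambda s: composition(s)["A"] as `stat`.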