================================================
====     The importance of (big) data      =====
====                 PART 1                =====
================================================
Exercise 1
Make a plot showing how much data in the PDB comes from different
experimental techniques over time, using the formula:

sum(n * 1/r)

where
n - number of residues
r - resolution
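
A minimal sketch of how the formula could be accumulated per year and per
technique. The record layout (year, class, n, r) and the function name are
assumptions made for illustration; the toy numbers are made up and are not
PDB data.

```python
from collections import defaultdict


def score_by_year_and_class(records):
    """Accumulate sum(n * 1/r) for every (year, method class) pair.

    records: iterable of (year, method_class, n_residues, resolution).
    This tuple layout is hypothetical; adapt it to the parsed files.
    """
    totals = defaultdict(float)
    for year, method_class, n_residues, resolution in records:
        totals[(year, method_class)] += n_residues / resolution
    return dict(totals)


# Made-up toy records, only to show the arithmetic.
toy_records = [
    (2005, "X-RAY DIFFRACTION", 300, 2.0),  # contributes 150
    (2005, "X-RAY DIFFRACTION", 100, 1.0),  # contributes 100
    (2005, "NMR", 100, 2.5),                # contributes 40
]
toy_totals = score_by_year_and_class(toy_records)
```

A larger n or a smaller r (better resolution) both increase the contribution
of an entry, which is the point of the n/r weighting.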

The data needed for the plots are available at:
https://files.rcsb.org/pub/pdb/derived_data

Focus on protein-only entries.

Expected output:
1) animated barplot with 3 bars (time span 2005-2024), two versions:
- absolute numbers
- percentages

Useful links:
https://matplotlib.org/stable/api/animation_api.html
https://holypython.com/python-visualization-tutorial/creating-bar-chart-animations/
https://github.com/dexplo/bar_chart_race

2) static pie chart for all data (no time included)

Possible classes: X-RAY DIFFRACTION, ELECTRON MICROSCOPY, SOLUTION NMR

*Whenever possible, join similar classes (e.g. SOLUTION NMR and SOLID-STATE NMR).
If an entry lists more than one method, count it once for each method (add
the data to all of those classes). You can ignore other rare classes such as
FIBER DIFFRACTION, SOLUTION SCATTERING, etc.
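
The class-joining rule above could be sketched as follows. The mapping, the
function name and the ";" separator between multiple methods are assumptions;
check how the separator looks in the file you actually parse.

```python
# Hypothetical mapping from raw method names to the plotted classes.
# Both NMR flavours are joined into one class; anything not listed is ignored.
TARGET_CLASSES = {
    "X-RAY DIFFRACTION": "X-RAY DIFFRACTION",
    "ELECTRON MICROSCOPY": "ELECTRON MICROSCOPY",
    "SOLUTION NMR": "NMR",
    "SOLID-STATE NMR": "NMR",
}


def method_classes(method_field, separator=";"):
    """Return the set of plotted classes named in one method field.

    An entry with several methods yields several classes, so its data
    can be added to each of them. Rare methods yield nothing.
    """
    classes = set()
    for name in method_field.split(separator):
        label = TARGET_CLASSES.get(name.strip().upper())
        if label is not None:
            classes.add(label)
    return classes
```

Returning a set means an entry that lists both NMR variants is counted only
once for the joined NMR class.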

NMR case: 
https://www.sciencedirect.com/science/article/abs/pii/S0066410306590052
"It has been known that the highest quality NMR structures have accuracies comparable 
to the medium-resolution X-ray structures (2.0-2.5 A) for protein backbone atomic 
coordinates"
Thus, we set r for NMR structures to 2.5 A.
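
One possible way to apply this convention when choosing r for an entry. The
function name and the "NMR" class label are assumptions; entries without a
usable numeric resolution are skipped here by returning None, which is a
design choice and not part of the task statement.

```python
NMR_RESOLUTION = 2.5  # fixed value for NMR structures, as stated above


def effective_resolution(method_class, reported):
    """Pick the r used in the formula for one entry.

    NMR entries get the fixed value; other entries use the reported
    resolution when it parses as a positive number, otherwise None.
    """
    if method_class == "NMR":
        return NMR_RESOLUTION
    try:
        value = float(reported)
    except (TypeError, ValueError):
        return None
    return value if value > 0 else None
```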

Exercise 2

Calculate the average protein length and amino acid
content (percentage composition of the 20 amino acids) for
different data sets:

a) E. coli, Bacillus subtilis, human, yeast,
A. thaliana, D. melanogaster, C. elegans, Mouse,
Zebrafish (D. rerio)

b) PDB

c) UniProt
- UniProt (Swiss-Prot only)
- 100 randomly selected Bacteria
- 100 randomly selected Viruses
- 100 randomly selected Archaea
- 100 randomly selected Eukaryota

Help: https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/README

Note:
Sources of data:
PDB - point (b): 
https://files.rcsb.org/pub/pdb/derived_data/pdb_seqres.txt.gz 
or
https://www.mimuw.edu.pl/~lukaskoz/teaching/adp/labs/lab3/pdb_seqres.txt.gz

All other data can be found at UniProt. For point (a) download it
manually; for point (c) you will need to do some scripting.
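
Once a FASTA file is downloaded, the two statistics of this exercise could be
computed roughly as below. The function names are illustrative. Note the
assumption made here: the mean length counts every character of a sequence,
while the composition percentages are taken over the 20 standard amino acids
only (letters such as X or U are left out of the denominator).

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def length_and_composition(sequences):
    """Return (mean length, {amino acid: percentage}) for the sequences.

    Assumes at least one sequence with at least one standard residue.
    """
    sequences = list(sequences)
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    mean_length = sum(len(seq) for seq in sequences) / len(sequences)
    composition = {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}
    return mean_length, composition


# Tiny made-up input, only to show the call pattern.
toy_lines = [">p1", "MKV", "A", ">p2", "MA"]
toy_sequences = [seq for _, seq in read_fasta(toy_lines)]
toy_mean, toy_composition = length_and_composition(toy_sequences)
```

For a gzipped file, the lines can come from gzip.open(path, "rt"). If the
headers carry a molecule-type tag, entries could be filtered on it before
the statistics are computed.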


Task:
1) PLOTS
Make bar plots comparing average protein length between:
- selected organisms (a)
- all kingdoms (c)
- PDB vs UniProt (Swiss-Prot) (b & c)

2) TABLES
Prepare tables with the average amino acid content (Table 1)
and protein length (Table 2) for (a)

Prepare tables with the average amino acid content (Table 3)
and protein length (Table 4) for (c)

You also need to estimate the error (e.g. standard deviation
via bootstrapping).
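
A minimal sketch of a bootstrap error estimate for the mean protein length,
using only the standard library. The function name, the default number of
resamples and the fixed seed are assumptions chosen for reproducibility.

```python
import random
import statistics


def bootstrap_std_of_mean(values, n_resamples=1000, seed=0):
    """Estimate the standard deviation of the mean by bootstrapping.

    Each resample draws len(values) items with replacement; the spread
    of the resampled means is returned. Needs n_resamples >= 2.
    """
    rng = random.Random(seed)
    values = list(values)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(values, k=len(values))
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)
```

The same idea applies to the amino acid percentages: resample the proteins,
recompute the composition for each resample, and take the spread per amino
acid.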


3) N-terminus case
Check which amino acid is the most frequent
at the N-terminus. Can you explain why it is this one?

See also: https://en.wikipedia.org/wiki/N-end_rule
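
Counting the first residue of every sequence could look like this. The
function name is illustrative and the toy sequences are arbitrary strings,
not real proteins, so their result says nothing about the real answer.

```python
from collections import Counter


def n_terminal_counts(sequences):
    """Count the first residue of every non-empty sequence."""
    return Counter(seq[0].upper() for seq in sequences if seq)


# Arbitrary toy strings, only to show the call pattern.
toy_sequences = ["GAV", "GL", "AK", "", "gt"]
toy_counts = n_terminal_counts(toy_sequences)
top_residue, top_count = toy_counts.most_common(1)[0]
```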

=============================================================

Additional material:
https://en.wikipedia.org/wiki/FASTA_format
https://en.wikipedia.org/wiki/List_of_model_organisms

Useful packages: matplotlib, prettytable

All statistics and plots must be calculated/done using PYTHON.