My class work for bimm143 at UC San Diego
Cecilia Wang (PID:A18625854)
The Protein Data Bank (PDB) is the main repository of biomolecular structures. Let’s see what it contains:
Q1: What percentage of structures in the PDB are solved by X-Ray and Electron Microscopy.
data <- read.csv("Data Export Summary.csv")
data1=data.frame(data,row.names = 1)
The comma in these numbers leads to the numbers here being read as character.
library(readr)
stats <- read_csv("Data Export Summary.csv")
Rows: 6 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Molecular Type
dbl (4): Integrative, Multiple methods, Neutron, Other
num (4): X-ray, EM, NMR, Total
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
stats
# A tibble: 6 × 9
`Molecular Type` `X-ray` EM NMR Integrative `Multiple methods` Neutron
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Protein (only) 178795 21825 12773 343 226 84
2 Protein/Oligosacch… 10363 3564 34 8 11 1
3 Protein/NA 9106 6335 287 24 7 0
4 Nucleic acid (only) 3132 221 1566 3 15 3
5 Other 175 25 33 4 0 0
6 Oligosaccharide (o… 11 0 6 0 1 0
# ℹ 2 more variables: Other <dbl>, Total <dbl>
n_xray <- sum(stats$`X-ray`)
n_EM <- sum(stats$EM)
(n_xray+n_EM)/sum(stats$Total)
[1] 0.937892
Q2: What proportion of structures in the PDB are protein?
stat1 = data.frame(stats,row.names = 1)
head(stat1)
X.ray EM NMR Integrative Multiple.methods Neutron
Protein (only) 178795 21825 12773 343 226 84
Protein/Oligosaccharide 10363 3564 34 8 11 1
Protein/NA 9106 6335 287 24 7 0
Nucleic acid (only) 3132 221 1566 3 15 3
Other 175 25 33 4 0 0
Oligosaccharide (only) 11 0 6 0 1 0
Other Total
Protein (only) 32 214078
Protein/Oligosaccharide 0 13981
Protein/NA 0 15759
Nucleic acid (only) 1 4941
Other 0 237
Oligosaccharide (only) 4 22
protein_count <- stat1["Protein (only)", 8]
protein_count
[1] 214078
protein_count/sum(stats$Total)
[1] 0.8596889
Q3: Type HIV in the PDB website search box on the home page and determine how many HIV-1 protease structures are in the current PDB?
1108
We can use the Molstar viewer online: https://molstar.org/viewer/

A new clean image showingthe catalystic ASP25 animo acids in both chains of the HIV-PR dimer along with the inhibitor and the all important active site water

Q4: Water molecules normally have 3 atoms. Why do we see just one atom per water molecule in this structure?
Because PDB structures only show the single oxygen molecule per water molecule. The hydrogen atoms are very hard to detect using X-ray and therefore not modeled in most structures.
Q5: There is a critical “conserved” water molecule in the binding site. Can you identify this water molecule? What residue number does this water molecule have
HOH 308
Q6: Generate and save a figure clearly showing the two distinct chains of HIV-protease along with the ligand. You might also consider showing the catalytic residues ASP 25 in each chain and the critical water (we recommend “Ball & Stick” for these side-chains). Add this figure to your Quarto document.

library(bio3d)
pdb<- read.pdb("1hsg")
Note: Accessing on-line PDB file
pdb
Call: read.pdb(file = "1hsg")
Total Models#: 1
Total Atoms#: 1686, XYZs#: 5058 Chains#: 2 (values: A B)
Protein Atoms#: 1514 (residues/Calpha atoms#: 198)
Nucleic acid Atoms#: 0 (residues/phosphate atoms#: 0)
Non-protein/nucleic Atoms#: 172 (residues: 128)
Non-protein/nucleic resid values: [ HOH (127), MK1 (1) ]
Protein sequence:
PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMSLPGRWKPKMIGGIGGFIKVRQYD
QILIEICGHKAIGTVLVGPTPVNIIGRNLLTQIGCTLNFPQITLWQRPLVTIKIGGQLKE
ALLDTGADDTVLEEMSLPGRWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVLVGPTP
VNIIGRNLLTQIGCTLNF
+ attr: atom, xyz, seqres, helix, sheet,
calpha, remark, call
head(pdb$atom)
type eleno elety alt resid chain resno insert x y z o b
1 ATOM 1 N <NA> PRO A 1 <NA> 29.361 39.686 5.862 1 38.10
2 ATOM 2 CA <NA> PRO A 1 <NA> 30.307 38.663 5.319 1 40.62
3 ATOM 3 C <NA> PRO A 1 <NA> 29.760 38.071 4.022 1 42.64
4 ATOM 4 O <NA> PRO A 1 <NA> 28.600 38.302 3.676 1 43.40
5 ATOM 5 CB <NA> PRO A 1 <NA> 30.508 37.541 6.342 1 37.87
6 ATOM 6 CG <NA> PRO A 1 <NA> 29.296 37.591 7.162 1 38.40
segid elesy charge
1 <NA> N <NA>
2 <NA> C <NA>
3 <NA> C <NA>
4 <NA> O <NA>
5 <NA> C <NA>
6 <NA> C <NA>
Q7: How many amino acid residues are there in this pdb object?
198 amino acid residues
Q8: Name one of the two non-protein residues?
HOH (127)
Q9: How many protein chains are in this structure?
2 chains
library(bio3dview)
library(NGLVieweR)
#view.pdb(pdb) |>
# setSpin()
Read an ADK structure from the PDB database.
adk <- read.pdb("6s36")
Note: Accessing on-line PDB file
PDB has ALT records, taking A only, rm.alt=TRUE
adk
Call: read.pdb(file = "6s36")
Total Models#: 1
Total Atoms#: 1898, XYZs#: 5694 Chains#: 1 (values: A)
Protein Atoms#: 1654 (residues/Calpha atoms#: 214)
Nucleic acid Atoms#: 0 (residues/phosphate atoms#: 0)
Non-protein/nucleic Atoms#: 244 (residues: 244)
Non-protein/nucleic resid values: [ CL (3), HOH (238), MG (2), NA (1) ]
Protein sequence:
MRIILLGAPGAGKGTQAQFIMEKYGIPQISTGDMLRAAVKSGSELGKQAKDIMDAGKLVT
DELVIALVKERIAQEDCRNGFLLDGFPRTIPQADAMKEAGINVDYVLEFDVPDELIVDKI
VGRRVHAPSGRVYHVKFNPPKVEGKDDVTGEELTTRKDDQEETVRKRLVEYHQMTAPLIG
YYSKEAEAGNTKYAKVDGTKPVAEVRADLEKILG
+ attr: atom, xyz, seqres, helix, sheet,
calpha, remark, call
# Perform flexiblity prediction
m <- nma(adk)
Building Hessian... Done in 0.012 seconds.
Diagonalizing Hessian... Done in 0.05 seconds.
plot(m)

write out our results as a wee trajectory/movie of predicted motions:
mktrj(m, file="adk_m7.pdb")
First step find an ADK sequence:
library(bio3d)
id <-"1ake_A" ##change this to run a different analysis
aa <- get.seq(id)
Warning in get.seq(id): Removing existing file: seqs.fasta
Fetching... Please wait. Done.
aa
1 . . . . . 60
pdb|1AKE|A MRIILLGAPGAGKGTQAQFIMEKYGIPQISTGDMLRAAVKSGSELGKQAKDIMDAGKLVT
1 . . . . . 60
61 . . . . . 120
pdb|1AKE|A DELVIALVKERIAQEDCRNGFLLDGFPRTIPQADAMKEAGINVDYVLEFDVPDELIVDRI
61 . . . . . 120
121 . . . . . 180
pdb|1AKE|A VGRRVHAPSGRVYHVKFNPPKVEGKDDVTGEELTTRKDDQEETVRKRLVEYHQMTAPLIG
121 . . . . . 180
181 . . . 214
pdb|1AKE|A YYSKEAEAGNTKYAKVDGTKPVAEVRADLEKILG
181 . . . 214
Call:
read.fasta(file = outfile)
Class:
fasta
Alignment dimensions:
1 sequence rows; 214 position columns (214 non-gap, 0 gap)
+ attr: id, ali, call
Next step, is search the PDB database for all related entries:
blast <- blast.pdb(aa)
Searching ... please wait (updates every 5 seconds) RID = V64DZ6FU014
.
Reporting 96 hits
hits <- plot(blast)
* Possible cutoff values: 260 3
Yielding Nhits: 20 96
* Chosen cutoff value of: 260
Yielding Nhits: 20

All the Blast results are here for use:
head(blast$hit.tbl)
queryid subjectids identity alignmentlength mismatches gapopens q.start
1 Query_4738079 1AKE_A 100.000 214 0 0 1
2 Query_4738079 8BQF_A 99.533 214 1 0 1
3 Query_4738079 4X8M_A 99.533 214 1 0 1
4 Query_4738079 6S36_A 99.533 214 1 0 1
5 Query_4738079 9R6U_A 99.533 214 1 0 1
6 Query_4738079 9R71_A 99.533 214 1 0 1
q.end s.start s.end evalue bitscore positives mlog.evalue pdb.id acc
1 214 1 214 1.79e-156 432 100.00 358.6211 1AKE_A 1AKE_A
2 214 21 234 2.93e-156 433 100.00 358.1283 8BQF_A 8BQF_A
3 214 1 214 3.20e-156 432 100.00 358.0401 4X8M_A 4X8M_A
4 214 1 214 4.71e-156 432 100.00 357.6536 6S36_A 6S36_A
5 214 1 214 1.05e-155 431 99.53 356.8519 9R6U_A 9R6U_A
6 214 1 214 1.24e-155 431 99.53 356.6856 9R71_A 9R71_A
The “top hits” are in the hits object. Now we can download these to
our computer. Put these in a wee-sub folder (directory) called “pdbs”
and use gzip to speed things up
# Download related PDB files
files <- get.pdb(hits$pdb.id, path="pdbs", split=TRUE, gzip=TRUE)
Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/1AKE.pdb.gz exists. Skipping download
Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/8BQF.pdb.gz exists. Skipping download
Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/4X8M.pdb.gz exists. Skipping download
Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/6S36.pdb.gz exists. Skipping download
Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/9R6U.pdb.gz exists. Skipping download
Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/9R71.pdb.gz exists. Skipping download
Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/8Q2B.pdb.gz exists. Skipping download
Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/8RJ9.pdb.gz exists. Skipping download
Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/6RZE.pdb.gz exists. Skipping download
Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/4X8H.pdb.gz exists. Skipping download
Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/3HPR.pdb.gz exists. Skipping download
Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/1E4V.pdb.gz exists. Skipping download
Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/5EJE.pdb.gz exists. Skipping download
Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/1E4Y.pdb.gz exists. Skipping download
Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/3X2S.pdb.gz exists. Skipping download
Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/6HAP.pdb.gz exists. Skipping download
Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/6HAM.pdb.gz exists. Skipping download
Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/8PVW.pdb.gz exists. Skipping download
Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/4K46.pdb.gz exists. Skipping download
Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/4NP6.pdb.gz exists. Skipping download
|
| | 0%
|
|==== | 5%
|
|======= | 10%
|
|========== | 15%
|
|============== | 20%
|
|================== | 25%
|
|===================== | 30%
|
|======================== | 35%
|
|============================ | 40%
|
|================================ | 45%
|
|=================================== | 50%
|
|====================================== | 55%
|
|========================================== | 60%
|
|============================================== | 65%
|
|================================================= | 70%
|
|==================================================== | 75%
|
|======================================================== | 80%
|
|============================================================ | 85%
|
|=============================================================== | 90%
|
|================================================================== | 95%
|
|======================================================================| 100%
These look like a hot mess