Cecilia's Site

Logo

My class work for bimm143 at UC San Diego

View the Project on GitHub cewang-bot/NEW_BIMM143

Class 10: Structural Bioinformatics (Pt. 1)

Cecilia Wang (PID:A18625854)

The PDB stattistic

The Protein Data Bank (PDB) is the main repository of biomolecular structures. Let’s see what it contains:

Q1: What percentage of structures in the PDB are solved by X-Ray and Electron Microscopy.

data <- read.csv("Data Export Summary.csv")
data1=data.frame(data,row.names = 1)

The comma in these numbers leads to the numbers here being read as character.

library(readr)
stats <- read_csv("Data Export Summary.csv")
Rows: 6 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Molecular Type
dbl (4): Integrative, Multiple methods, Neutron, Other
num (4): X-ray, EM, NMR, Total

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
stats
# A tibble: 6 × 9
  `Molecular Type`    `X-ray`    EM   NMR Integrative `Multiple methods` Neutron
  <chr>                 <dbl> <dbl> <dbl>       <dbl>              <dbl>   <dbl>
1 Protein (only)       178795 21825 12773         343                226      84
2 Protein/Oligosacch…   10363  3564    34           8                 11       1
3 Protein/NA             9106  6335   287          24                  7       0
4 Nucleic acid (only)    3132   221  1566           3                 15       3
5 Other                   175    25    33           4                  0       0
6 Oligosaccharide (o…      11     0     6           0                  1       0
# ℹ 2 more variables: Other <dbl>, Total <dbl>
n_xray <- sum(stats$`X-ray`)
n_EM <- sum(stats$EM)
(n_xray+n_EM)/sum(stats$Total)
[1] 0.937892

Q2: What proportion of structures in the PDB are protein?

stat1 = data.frame(stats,row.names = 1)
head(stat1)
                         X.ray    EM   NMR Integrative Multiple.methods Neutron
Protein (only)          178795 21825 12773         343              226      84
Protein/Oligosaccharide  10363  3564    34           8               11       1
Protein/NA                9106  6335   287          24                7       0
Nucleic acid (only)       3132   221  1566           3               15       3
Other                      175    25    33           4                0       0
Oligosaccharide (only)      11     0     6           0                1       0
                        Other  Total
Protein (only)             32 214078
Protein/Oligosaccharide     0  13981
Protein/NA                  0  15759
Nucleic acid (only)         1   4941
Other                       0    237
Oligosaccharide (only)      4     22
protein_count <- stat1["Protein (only)", 8]
protein_count
[1] 214078
protein_count/sum(stats$Total)
[1] 0.8596889

Q3: Type HIV in the PDB website search box on the home page and determine how many HIV-1 protease structures are in the current PDB?

1108

Visualizing the HIV-1 protease structure

We can use the Molstar viewer online: https://molstar.org/viewer/

My first image of HIV-Pr with surface display showing ligand
binding

A new clean image showingthe catalystic ASP25 animo acids in both chains of the HIV-PR dimer along with the inhibitor and the all important active site water

Q4: Water molecules normally have 3 atoms. Why do we see just one atom per water molecule in this structure?

Because PDB structures only show the single oxygen molecule per water molecule. The hydrogen atoms are very hard to detect using X-ray and therefore not modeled in most structures.

Q5: There is a critical “conserved” water molecule in the binding site. Can you identify this water molecule? What residue number does this water molecule have

HOH 308

Q6: Generate and save a figure clearly showing the two distinct chains of HIV-protease along with the ligand. You might also consider showing the catalytic residues ASP 25 in each chain and the critical water (we recommend “Ball & Stick” for these side-chains). Add this figure to your Quarto document.

Bio3D package for structural bioinformatics

library(bio3d)

pdb<- read.pdb("1hsg")
  Note: Accessing on-line PDB file
pdb
 Call:  read.pdb(file = "1hsg")

   Total Models#: 1
     Total Atoms#: 1686,  XYZs#: 5058  Chains#: 2  (values: A B)

     Protein Atoms#: 1514  (residues/Calpha atoms#: 198)
     Nucleic acid Atoms#: 0  (residues/phosphate atoms#: 0)

     Non-protein/nucleic Atoms#: 172  (residues: 128)
     Non-protein/nucleic resid values: [ HOH (127), MK1 (1) ]

   Protein sequence:
      PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMSLPGRWKPKMIGGIGGFIKVRQYD
      QILIEICGHKAIGTVLVGPTPVNIIGRNLLTQIGCTLNFPQITLWQRPLVTIKIGGQLKE
      ALLDTGADDTVLEEMSLPGRWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVLVGPTP
      VNIIGRNLLTQIGCTLNF

+ attr: atom, xyz, seqres, helix, sheet,
        calpha, remark, call
head(pdb$atom)
  type eleno elety  alt resid chain resno insert      x      y     z o     b
1 ATOM     1     N <NA>   PRO     A     1   <NA> 29.361 39.686 5.862 1 38.10
2 ATOM     2    CA <NA>   PRO     A     1   <NA> 30.307 38.663 5.319 1 40.62
3 ATOM     3     C <NA>   PRO     A     1   <NA> 29.760 38.071 4.022 1 42.64
4 ATOM     4     O <NA>   PRO     A     1   <NA> 28.600 38.302 3.676 1 43.40
5 ATOM     5    CB <NA>   PRO     A     1   <NA> 30.508 37.541 6.342 1 37.87
6 ATOM     6    CG <NA>   PRO     A     1   <NA> 29.296 37.591 7.162 1 38.40
  segid elesy charge
1  <NA>     N   <NA>
2  <NA>     C   <NA>
3  <NA>     C   <NA>
4  <NA>     O   <NA>
5  <NA>     C   <NA>
6  <NA>     C   <NA>

Q7: How many amino acid residues are there in this pdb object?

198 amino acid residues

Q8: Name one of the two non-protein residues?

HOH (127)

Q9: How many protein chains are in this structure?

2 chains

library(bio3dview)
library(NGLVieweR)

#view.pdb(pdb) |>
 # setSpin()

Predicting functional motions of a single structure

Read an ADK structure from the PDB database.

adk <- read.pdb("6s36")
  Note: Accessing on-line PDB file
   PDB has ALT records, taking A only, rm.alt=TRUE
adk
 Call:  read.pdb(file = "6s36")

   Total Models#: 1
     Total Atoms#: 1898,  XYZs#: 5694  Chains#: 1  (values: A)

     Protein Atoms#: 1654  (residues/Calpha atoms#: 214)
     Nucleic acid Atoms#: 0  (residues/phosphate atoms#: 0)

     Non-protein/nucleic Atoms#: 244  (residues: 244)
     Non-protein/nucleic resid values: [ CL (3), HOH (238), MG (2), NA (1) ]

   Protein sequence:
      MRIILLGAPGAGKGTQAQFIMEKYGIPQISTGDMLRAAVKSGSELGKQAKDIMDAGKLVT
      DELVIALVKERIAQEDCRNGFLLDGFPRTIPQADAMKEAGINVDYVLEFDVPDELIVDKI
      VGRRVHAPSGRVYHVKFNPPKVEGKDDVTGEELTTRKDDQEETVRKRLVEYHQMTAPLIG
      YYSKEAEAGNTKYAKVDGTKPVAEVRADLEKILG

+ attr: atom, xyz, seqres, helix, sheet,
        calpha, remark, call
# Perform flexiblity prediction
m <- nma(adk)
 Building Hessian...        Done in 0.012 seconds.
 Diagonalizing Hessian...   Done in 0.05 seconds.
plot(m)

write out our results as a wee trajectory/movie of predicted motions:

mktrj(m, file="adk_m7.pdb")

Comparative analysis with PCA

First step find an ADK sequence:

library(bio3d)
id <-"1ake_A" ##change this to run a different analysis
aa <- get.seq(id)
Warning in get.seq(id): Removing existing file: seqs.fasta

Fetching... Please wait. Done.
aa
             1        .         .         .         .         .         60 
pdb|1AKE|A   MRIILLGAPGAGKGTQAQFIMEKYGIPQISTGDMLRAAVKSGSELGKQAKDIMDAGKLVT
             1        .         .         .         .         .         60 

            61        .         .         .         .         .         120 
pdb|1AKE|A   DELVIALVKERIAQEDCRNGFLLDGFPRTIPQADAMKEAGINVDYVLEFDVPDELIVDRI
            61        .         .         .         .         .         120 

           121        .         .         .         .         .         180 
pdb|1AKE|A   VGRRVHAPSGRVYHVKFNPPKVEGKDDVTGEELTTRKDDQEETVRKRLVEYHQMTAPLIG
           121        .         .         .         .         .         180 

           181        .         .         .   214 
pdb|1AKE|A   YYSKEAEAGNTKYAKVDGTKPVAEVRADLEKILG
           181        .         .         .   214 

Call:
  read.fasta(file = outfile)

Class:
  fasta

Alignment dimensions:
  1 sequence rows; 214 position columns (214 non-gap, 0 gap) 

+ attr: id, ali, call

Next step, is search the PDB database for all related entries:

blast <- blast.pdb(aa)
 Searching ... please wait (updates every 5 seconds) RID = V64DZ6FU014 
 .
 Reporting 96 hits
hits <- plot(blast)
  * Possible cutoff values:    260 3 
            Yielding Nhits:    20 96 

  * Chosen cutoff value of:    260 
            Yielding Nhits:    20 

All the Blast results are here for use:

head(blast$hit.tbl)
        queryid subjectids identity alignmentlength mismatches gapopens q.start
1 Query_4738079     1AKE_A  100.000             214          0        0       1
2 Query_4738079     8BQF_A   99.533             214          1        0       1
3 Query_4738079     4X8M_A   99.533             214          1        0       1
4 Query_4738079     6S36_A   99.533             214          1        0       1
5 Query_4738079     9R6U_A   99.533             214          1        0       1
6 Query_4738079     9R71_A   99.533             214          1        0       1
  q.end s.start s.end    evalue bitscore positives mlog.evalue pdb.id    acc
1   214       1   214 1.79e-156      432    100.00    358.6211 1AKE_A 1AKE_A
2   214      21   234 2.93e-156      433    100.00    358.1283 8BQF_A 8BQF_A
3   214       1   214 3.20e-156      432    100.00    358.0401 4X8M_A 4X8M_A
4   214       1   214 4.71e-156      432    100.00    357.6536 6S36_A 6S36_A
5   214       1   214 1.05e-155      431     99.53    356.8519 9R6U_A 9R6U_A
6   214       1   214 1.24e-155      431     99.53    356.6856 9R71_A 9R71_A

The “top hits” are in the hits object. Now we can download these to our computer. Put these in a wee-sub folder (directory) called “pdbs” and use gzip to speed things up

# Download related PDB files
files <- get.pdb(hits$pdb.id, path="pdbs", split=TRUE, gzip=TRUE)
Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/1AKE.pdb.gz exists. Skipping download

Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/8BQF.pdb.gz exists. Skipping download

Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/4X8M.pdb.gz exists. Skipping download

Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/6S36.pdb.gz exists. Skipping download

Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/9R6U.pdb.gz exists. Skipping download

Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/9R71.pdb.gz exists. Skipping download

Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/8Q2B.pdb.gz exists. Skipping download

Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/8RJ9.pdb.gz exists. Skipping download

Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/6RZE.pdb.gz exists. Skipping download

Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/4X8H.pdb.gz exists. Skipping download

Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/3HPR.pdb.gz exists. Skipping download

Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/1E4V.pdb.gz exists. Skipping download

Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/5EJE.pdb.gz exists. Skipping download

Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/1E4Y.pdb.gz exists. Skipping download

Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/3X2S.pdb.gz exists. Skipping download

Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/6HAP.pdb.gz exists. Skipping download

Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/6HAM.pdb.gz exists. Skipping download

Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/8PVW.pdb.gz exists. Skipping download

Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/4K46.pdb.gz exists. Skipping download

Warning in get.pdb(hits$pdb.id, path = "pdbs", split = TRUE, gzip = TRUE):
pdbs/4NP6.pdb.gz exists. Skipping download


  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |====                                                                  |   5%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |==========                                                            |  15%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |==================                                                    |  25%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |========================                                              |  35%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |================================                                      |  45%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |======================================                                |  55%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |==============================================                        |  65%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |====================================================                  |  75%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |============================================================          |  85%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |==================================================================    |  95%
  |                                                                            
  |======================================================================| 100%

These look like a hot mess

I