Assignment 4 Protein Data Bank
| categories: Assignments
Due 24 October 2012
A protein is a molecule that is a sequence of amino acid residues. The Protein Data Bank records the 3D structures known for protein molecules (85,058 as of 02 October 2012). Two examples are HIV Protease (7hvp ), an important AIDS drug target, and green fluorescent protein (1gfl), which earned its discoverers the Nobel Prize. Many types of information are stored in a PDB file; we will be interested only in lines that start with ‘ATOM’, and only in certain columns of these lines. Their format is described later.

Task
- Download this template zip file and unpack it as usual. It will create a folder A4 for you. Inside the A4 folder you'll find the two .pdb files and a script template to get you started. You should edit this script to produce the results requested below. When the TA runs your script it should produce the graphs requested and it should report how many hydrogen bonding pairs there are for each protein.
- Write a function readPDBfile(filename) that will read the atoms for a protein stored in pdb file whose name is given as the argument. Your function should return a Python tuple containing 4 values: (anum, aname, resno, coords).
- anum
- should be an array with the serial number for each atom.
- aname
- should be an array of strings giving the 4-letter atom name for each atom. The strings in aname should be in upper case.
- resno
- array with a residue sequence number for each atom
- coords
- an nx3 array with xyz coordinates (in angstroms) for each atom.
- Write a function drawCA(aname, coords) that uses Axes3d.plot to draw the Calpha backbone of the protein: it should connect atoms with name ' CA ' in sequence, and ignore all the other atoms.
- Write a function Hbonds(anum, aname, resno, coords) that looks for pairs with a Nitrogen (second letter of the name is 'N') and Oxygen (second letter is 'O') atoms whose distance is between 2.6 and 3.2 angstroms, inclusive. Such a pair is deemed to form a hydrogen bond if the residue sequence numbers of these atoms differ by at least 2. Return a list containing the pairs of atom numbers for hydrogen bonding pairs.
Opening and Reading a File
Reading a file such as‘7hvp.pdb’ (for HIV protease) is easy in Python. You can simply say:
for line in file('7hvp.pdb', 'r'):
# do something for every line
The first argument is a string giving the name of the file. The second argument to the file function is a string indicating the file mode. We're using 'r' here because we are reading the file. Other possible values include 'w' for creating a new file to write and 'a' for appending to an existing file.
If the first six characters of the line are 'ATOM ' then that line has interesting information. The numbers and names occupy fixed positions on a line, so you can extract them with indexing and convert them from strings to numbers, if necessary, in your reader.
Other hints
- Don't modify data files! Your readPDBfile should work on any of the files.
- Your reader returns many variables. To capture them all, you'll have to call it with a line something like this: a, an, rn, c = readPDBfile('7hvp.pdb').
- To plot the 3D line drawings requested use the Axes3D object like this:
fig = pylab.figure() ax = Axes3D(fig) ax.plot(x, y, z) # x, y, and z are 1D arrays giving the coordinates of the points
- String methods .upper() and .lower() can change the case of strings.
- functions int() and float() can be used to change strings to numbers.
- Compare strings with the usual comparison operators we used for numbers.
- Break the problems down into small tasks. When making your loops, think of what has to be done once (like opening the file) and what has to be done repeatedly (like processing the line.)
- I get 124 as the number of hydrogen bonding pairs for 7HVP.
Format of ATOM records
This is from the documentation of the PDB format on the rcsb web site.
COLUMNS DATA TYPE CONTENTS -------------------------------------------------------------------------------- 1- 6 Record name "ATOM " 7 - 11 Integer Atom serial number. 13 - 16 Atom Atom name. 17 Character Alternate location indicator. 18 - 20 Residue name Residue name. 22 Character Chain identifier. 23 - 26 Integer Residue sequence number. 27 AChar Code for insertion of residues. 31 - 38 Real(8.3) Orthogonal coordinates for X in Angstroms. 39 - 46 Real(8.3) Orthogonal coordinates for Y in Angstroms. 47 - 54 Real(8.3) Orthogonal coordinates for Z in Angstroms. 55 - 60 Real(6.2) Occupancy. 61 - 66 Real(6.2) Temperature factor (Default = 0.0). 73 - 76 LString(4) Segment identifier, left-justified. 77 - 78 LString(2) Element symbol, right-justified. 79 - 80 LString(2) Charge on the atom.
Example:
1 2 3 4 5 6 7 8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
ATOM 1 N PRO A 1 -3.260 7.392 33.952 1.00 24.47 N
ATOM 2 CA PRO A 1 -2.317 6.655 34.795 1.00 22.40 C
ATOM 3 C PRO A 1 -0.919 6.658 34.208 1.00 20.82 C
ATOM 4 O PRO A 1 -0.802 7.111 33.058 1.00 21.07 O
ATOM 5 CB PRO A 1 -2.897 5.256 34.804 1.00 23.12 C
ATOM 6 CG PRO A 1 -4.336 5.353 34.377 1.00 24.12 C
ATOM 7 CD PRO A 1 -4.607 6.783 33.948 1.00 24.43 C
When you are finished, zip up your A4 folder and submit it on the Sakai site.