3D-footprint: help

DNA footprinting is a collection of experimental methods used to describe the sequence specificity of DNA-binding proteins. 3D-footprint is a database that takes a computational approach to analysing sequence recognition in protein-DNA complexes of known structure, reporting the molecular contacts that contribute most to recognition. The pipeline followed to produce each entry in the database can be summarized as a flowchart. The database contains a selection of 95% non-redundant monomeric complexes, updated weekly, plus a complementary collection of redundant and multimeric complexes. Entries are named by concatenating the Protein Data Bank (PDB) structure identifier and the protein chains that take part in the complex. For instance, 1cgp_AB is a dimeric complex made by protein chains A and B of PDB entry 1cgp. The report for each entry might include:

an interface graph which highlights contacts and nucleotides at the original interface, captured in a PDB file, responsible for specific DNA discrimination:
Atomic contacts are classified as hydrogen bonds (H), water-mediated hydrogen bonds (w) or hydrophobic interactions (V) using these geometric restraints, derived from our benchmarks and from HBPLUS:

type	parameters
H	max distance donor-acceptor=3.9Å | min distance H-acceptor=2.5Å | min angle=90°
w	min distance=5.0Å (see benchmark)
V	min distance=3.5Å | max distance=3.9Å (see benchmark)

Atomic contacts are scored by applying the interaction tables available in the download area, followed by a distance correction:
```
score = score_matrix * (1 + (1-distance/3.9Å))
```
Often it is not possible to classify some side-chain contacts automatically; these are still reported as 'generic' (and marked with 'x') but are not considered for readout calculations.
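The distance correction above can be sketched in a few lines of Python. This is only an illustration of the formula as written: the function name and the example table score of 1.0 are hypothetical, and real scores would come from the interaction tables in the download area.

```python
MAX_CONTACT_DISTANCE = 3.9  # Å, the cutoff that appears in the formula above


def distance_corrected_score(matrix_score: float, distance: float) -> float:
    """Apply score = score_matrix * (1 + (1 - distance / 3.9 Å))."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return matrix_score * (1.0 + (1.0 - distance / MAX_CONTACT_DISTANCE))


if __name__ == "__main__":
    # 1.0 is a placeholder table score (an assumption, not a real table value).
    for d in (2.8, 3.5, 3.9):
        print(f"{d:.1f} Å -> {distance_corrected_score(1.0, d):.3f}")
```

With this formula a contact at exactly 3.9Å keeps its table score unchanged, and shorter contacts are weighted up, approaching twice the table score as the distance approaches zero.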
a footprint logo diagram with nitrogenous bases depicted as circles of diameter proportional to the number of side-chain contacts observed at the original interface, with both DNA strands plotted separately. Filled circles represent bases recognized by indirect readout. Contact counts can be converted to the information content of the resulting contact position weight matrix (see below) in order to calculate the conservation (height) of the bound DNA sequence:
an interface matrix which summarizes the observed contacts between side-chain atoms (vertical axis) and the nitrogenous bases of nucleotides (horizontal axis). The most contacted base pairs are dark-coloured, while non-contacted nucleotides are shown in white, according to the scale bar. The secondary structure state of each interface residue is shown as a one-letter code ('H'=helix, 'E'=strand, 'T'=turn, 'C'=coil) next to its primary sequence number, which is given in parentheses:
the structure name, a link to the original entry in the PDB, with the name of the protein in parentheses.
links to multimeric complexes in which a given entry is involved. Multimeric complexes are usually more relevant in biological terms and have more selective binding motifs.
links to a list of redundant complexes, with protein sequence identity > 95%, which are considered to be represented by the entry in question, or otherwise a link to a non-redundant reference complex. Redundant complexes can be important because they often provide more accurate specificity estimates; for this reason the redundant entry with the highest information content is indicated by an arrow.
a reference complex that represents a cluster of redundant entries that are at least 95% identical in protein sequence.
the protein sequence of the complex in question, with the interface residues plotted in the interface graph shown in upper case:
> 9ant_B interface=1,42,43,46,47,50,
RqtytryqtlelekefhfnryltrrrrieiahalslterqiKIwfQNrrMkwkken
Interface residues are numbered as they appear in the protein sequence. Original PDB residue numbers, which are preferred in interface graphs, can be mapped using a list such as:
```
B0005	=>	1
B0046	=>	42
B0047	=>	43
B0050	=>	46
B0051	=>	47
B0054	=>	50
```
the interface signature is a string of concatenated interface residues: RKIQNM
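As a minimal sketch, the 9ant_B example above can be parsed to recover the interface signature and to map sequence positions back to PDB residue labels. The function names are hypothetical, and the parsing assumes the header and mapping layouts shown in this help page.

```python
HEADER = "> 9ant_B interface=1,42,43,46,47,50,"
SEQUENCE = "RqtytryqtlelekefhfnryltrrrrieiahalslterqiKIwfQNrrMkwkken"
MAPPING_TEXT = """\
B0005 => 1
B0046 => 42
B0047 => 43
B0050 => 46
B0051 => 47
B0054 => 50
"""


def parse_interface(header: str) -> list[int]:
    """Read the 1-based sequence positions listed after 'interface='."""
    field = header.split("interface=", 1)[1]
    return [int(token) for token in field.split(",") if token.strip()]


def interface_signature(sequence: str, positions: list[int]) -> str:
    """Concatenate the residues found at the given 1-based positions."""
    return "".join(sequence[pos - 1].upper() for pos in positions)


def parse_mapping(text: str) -> dict[int, str]:
    """Map each sequence position to its original PDB residue label."""
    mapping = {}
    for line in text.splitlines():
        if "=>" not in line:
            continue
        pdb_label, seq_pos = (part.strip() for part in line.split("=>"))
        mapping[int(seq_pos)] = pdb_label
    return mapping


if __name__ == "__main__":
    positions = parse_interface(HEADER)
    print(interface_signature(SEQUENCE, positions))
    print(parse_mapping(MAPPING_TEXT)[42])
```

Because interface residues are the only upper-case letters in the sequence line, the same signature can also be read directly from the capital letters.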

estimated binding specificities as position weight matrices (PWMs) of two types: contact and readout. Contact PWMs are calculated by adding up contacts between side chains and nitrogenous bases, assuming that the DNA molecule in the complex is the cognate sequence, after Morozov's approach. Readout PWMs are instead derived from both i) the array of scored atomic interactions at the interface (direct readout) and ii) the set of sequence-dependent deformations inferred from the DNA coordinates (indirect readout), which are blended by applying the DNAPROT algorithm. In general, both approaches provide similar PWMs (see the DNAPROT paper), and mean PWMs and sequence logos are provided for each entry. However, readout PWMs are considered unreliable when the number of interface atomic contacts is fewer than 5 or when the cognate DNA sequence has an associated PWM score below the top 80% (see benchmark); in those cases, which are marked in the report, the provided specificity estimates correspond exclusively to contact PWMs. In either case PWMs are evaluated in terms of their information content (IC), which can be read as a measure of binding specificity, and PWMs with fewer than 4 informative columns are considered unspecific and discarded.

readout + contact

# IC=13.288 IC/col=0.830 n_of_columns=16

specificity:

A |   0  80   0   0  24  96  13  24  24  61   0  13   0   0   4  96
C |   0   4  96  96  24   0  13  24  24  13   0  13   0   0   4   0
G |   0   8   0   0  24   0   8  24  24  14   0  62  96  96   4   0
T |  96   4   0   0  24   0  62  24  24   8  96   8   0   0  84   0
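To give a feel for how information content summarizes a matrix like the one above, here is a sketch that computes a plain Shannon IC in bits per column (2 bits minus the column entropy). This is an assumption on our part: the exact IC definition used by the database is not given in this help page, so the sketch is not expected to reproduce the IC=13.288 reported in the example. Counting a column as "informative" when its IC is above zero is likewise only an illustrative choice.

```python
import math

# Counts copied from the example PWM above (rows A, C, G, T; 16 columns).
COUNTS = {
    "A": [0, 80, 0, 0, 24, 96, 13, 24, 24, 61, 0, 13, 0, 0, 4, 96],
    "C": [0, 4, 96, 96, 24, 0, 13, 24, 24, 13, 0, 13, 0, 0, 4, 0],
    "G": [0, 8, 0, 0, 24, 0, 8, 24, 24, 14, 0, 62, 96, 96, 4, 0],
    "T": [96, 4, 0, 0, 24, 0, 62, 24, 24, 8, 96, 8, 0, 0, 84, 0],
}


def column_ic(column) -> float:
    """Shannon information content of one column, in bits (0 to 2)."""
    total = sum(column)
    if total == 0:
        return 0.0
    ic = 2.0  # log2(4): maximum for a four-letter alphabet
    for count in column:
        if count > 0:
            p = count / total
            ic += p * math.log2(p)
    return ic


def pwm_ic(counts) -> list[float]:
    """Per-column IC values for a PWM given as a base -> counts mapping."""
    columns = zip(*(counts[base] for base in "ACGT"))
    return [column_ic(column) for column in columns]


if __name__ == "__main__":
    per_column = pwm_ic(COUNTS)
    informative = sum(1 for ic in per_column if ic > 1e-9)
    print(f"columns={len(per_column)} informative={informative}")
    print(f"total IC (plain Shannon, bits)={sum(per_column):.3f}")
```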

scan!

These PWMs can be used to scan DNA sequences or even genomes by following the scan! link, which takes you to an RSA-Tools form where you can paste or upload the sequences to be scanned.
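For readers who want to see what scanning with a PWM involves, the sketch below slides the example matrix along a sequence and scores each window with log-odds weights. It is not the RSA-Tools implementation: the pseudocount of 1, the uniform background of 0.25 and the forward-strand-only search are simplifying assumptions, and the test sequence is made up for illustration.

```python
import math

# Counts copied from the example PWM above (rows A, C, G, T; 16 columns).
COUNTS = {
    "A": [0, 80, 0, 0, 24, 96, 13, 24, 24, 61, 0, 13, 0, 0, 4, 96],
    "C": [0, 4, 96, 96, 24, 0, 13, 24, 24, 13, 0, 13, 0, 0, 4, 0],
    "G": [0, 8, 0, 0, 24, 0, 8, 24, 24, 14, 0, 62, 96, 96, 4, 0],
    "T": [96, 4, 0, 0, 24, 0, 62, 24, 24, 8, 96, 8, 0, 0, 84, 0],
}


def log_odds(counts, pseudocount=1.0, background=0.25):
    """Turn counts into per-column log2(frequency / background) weights."""
    width = len(counts["A"])
    matrix = []
    for i in range(width):
        total = sum(counts[b][i] for b in "ACGT") + 4 * pseudocount
        matrix.append({
            b: math.log2(((counts[b][i] + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return matrix


def scan(sequence, matrix):
    """Score every forward-strand window; return hits sorted best first."""
    seq = sequence.upper()
    width = len(matrix)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows with ambiguous bases
        score = sum(matrix[i][base] for i, base in enumerate(window))
        hits.append((score, start, window))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    # Made-up sequence embedding one consensus-like site at offset 4.
    example = "GGGG" + "TACCAATAAATGGGTA" + "CCCC"
    score, start, window = scan(example, log_odds(COUNTS))[0]
    print(start, window, round(score, 2))
```

A fuller scan would also score the reverse strand and compare scores against a threshold or background model.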

a table with related DNA sequences reported in the literature, such as binding sites or consensus sequences found in abstracts of relevant research papers, available only for non-redundant entries. These data can be helpful to evaluate the binding preferences of related proteins or the range of variant sites bound by the same transcription factor. Sites are reported with associated STAMP E-values that measure local similarity to related multimeric and redundant complexes. Please note that it might be necessary to take the complementary sequence before comparing sites:

site	source	matches (E-value)
term: ENGRAILED
TAATCC	PubMed	2hdd_A(1.19e-06), 2hdd_AB(9.77e-06)
TAATNN	PubMed	2hdd_A(1.37e-03)
TAATTA	PubMed	2hdd_A(1.37e-03), 1hdd_CD(1.65e-04), 3hdd_AB(4.91e-08)
TTAATT	PubMed	2hdd_A(1.37e-03), 3hdd_AB(1.09e-04)
TTAATTGCAT	PubMed	3hdd_AB(1.50e-05)
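Taking the complementary sequence before comparing sites can be done with a small helper like the one sketched below. We assume here that the comparison is made against the reverse complement, which is the usual way to read the opposite strand 5' to 3'; the function names are hypothetical.

```python
# Translation table for DNA bases; N (any base) pairs with N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(site: str) -> str:
    """Return the opposite strand of a site, read 5' to 3'."""
    return site.translate(COMPLEMENT)[::-1]


def same_site_either_strand(site_a: str, site_b: str) -> bool:
    """True if two sites are identical on the same or opposite strand."""
    a, b = site_a.upper(), site_b.upper()
    return a == b or a == reverse_complement(b)


if __name__ == "__main__":
    for site in ("TAATCC", "TAATNN", "TAATTA"):
        print(site, reverse_complement(site))
```

Note that a palindromic site such as TAATTA reads the same on both strands, so no conversion is needed in that case.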

a dendrogram of similar interfaces, available only for non-redundant entries. This distance tree is based on the observed structural similarity between DNA-binding domains, and interfaces are aligned showing pairs of interacting amino acid side chains and nucleotide bases. Only contacts with N-ring (purine/pyrimidine) heavy atoms less than 4.5Å away are considered here. For instance, RA stands for arginine in contact with adenine (only bases are coloured). Note that this alignment format allows only one base (the closest one) per interface residue, overlooking the cases in which a single residue contacts several bases. SCOP superfamilies of reported complexes can be seen by hovering over the tree leaves, while structure-based sequence logos can be reached through the clickable L links:


  RCNTFTMA--RAHGNT--SA------YA--   L                 +3f27_D    
                                                   +-1 
  KCNT--MAMA--EGSAACNT----------   L             +-3 +2lef_A    
                                                 ! ! 
  RCNTFTMA--RA--NGSCST------YA--   L           +-4 +1gt0_D    
                                               ! ! 
  RTNGFTIA--RA--NASGST------YAPG   L         +-5 +-1j47_A    
                                             ! ! 
  RTSAYCMT------FT--------------   L   +-----6 +---1j5n_A    
                                       !     ! 
  ----YC--VT----FC--SA----------   L   !     !   +1ckt_A    
                                     --7     +---2 
  RC--YTMALTRA--VCTGAA--VCEC----   L   !         +1qrv_A    
                                       ! 
  RCNAFTIA--RG--NC--SALT----YAKT   L   +----------2gzk_A
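The aligned interfaces in the tree can be read as fixed-width columns of two characters (amino acid, then base), with -- marking a gap. The sketch below tallies the residue-base pairs found in each column for the first three rows of the example. The function names are hypothetical, and the two-character column layout is inferred from the example rather than from a format specification.

```python
from collections import Counter

# First three aligned interfaces copied from the example tree above.
ALIGNMENT = {
    "3f27_D": "RCNTFTMA--RAHGNT--SA------YA--",
    "2lef_A": "KCNT--MAMA--EGSAACNT----------",
    "1gt0_D": "RCNTFTMA--RA--NGSCST------YA--",
}


def split_pairs(row: str) -> list[str]:
    """Cut an aligned interface into two-character residue-base pairs."""
    if len(row) % 2:
        raise ValueError("aligned rows must have an even length")
    return [row[i:i + 2] for i in range(0, len(row), 2)]


def column_contacts(alignment: dict[str, str]) -> list[Counter]:
    """Count the residue-base pairs seen in each alignment column."""
    columns = zip(*(split_pairs(row) for row in alignment.values()))
    return [Counter(pair for pair in column if pair != "--")
            for column in columns]


if __name__ == "__main__":
    for index, tally in enumerate(column_contacts(ALIGNMENT), start=1):
        print(index, dict(tally))
```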

The same data can be displayed as a matrix of homologous interface contacts, in which -ln(E-value) scores from MAMMOTH structural alignments are shown. This is a vertical alignment with one column per complex, where each column shows the aligned equivalent protein residue and the contacted nucleotide. Residues marked with * are interface residues in the reference complex. For instance:

0208 E* ECRG------SCKG--HG----RG--EC 0.91

means that residue E (Glu) 208 is aligned to 7 equivalent residues, two of which are E residues that contact C nucleotides.
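The reading of the example row can be checked with a short parser, sketched here under the assumption that a row always has four whitespace-separated fields (residue number, residue with an optional *, aligned pairs, and a trailing value). The function and field names are hypothetical.

```python
def parse_matrix_row(row: str) -> dict:
    """Split one row of the interface contact matrix into its parts."""
    number, residue, pairs, value = row.split()
    aligned = [pairs[i:i + 2] for i in range(0, len(pairs), 2)]
    return {
        "position": int(number),
        "residue": residue.rstrip("*"),
        "is_interface": residue.endswith("*"),
        # Keep only columns where an equivalent residue is aligned.
        "aligned": [pair for pair in aligned if pair != "--"],
        "value": float(value),
    }


if __name__ == "__main__":
    row = parse_matrix_row("0208 E* ECRG------SCKG--HG----RG--EC 0.91")
    print(row["residue"], row["position"], len(row["aligned"]),
          row["aligned"].count("EC"))
```

For the example row this gives 7 aligned residues, of which 2 are E contacting C, in agreement with the description above.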
