Welcome to PCTraFF Server
What is pointwise mutual information?
Transcription factors (TFs) bind to DNA at specific binding sites (TFBSs) to regulate gene transcription. Knowing which TFBSs occur in a promoter indicates which TFs are involved in regulating which genes. In higher organisms, gene expression depends not on single TFs but on the interactions between them. Furthermore, interacting TFs are likely to bind next to each other on the DNA. Interacting TFs can therefore be identified by testing whether their respective TFBSs lie next to each other more often than expected by chance. This can be quantified using pointwise mutual information (PMI). The PMI of two TFBSs t_{a} and t_{b} is defined as
PMI(t_{a};t_{b}) = log( p(t_{a},t_{b}) / ( p(t_{a}) p(t_{b}) ) )
where p(t_{a},t_{b}) is the joint probability of t_{a} and t_{b}, and p(t_{a}) and p(t_{b}) are their marginal probabilities.
The PCTraFF algorithm
The detection of potentially interacting TFs is based on their TFBSs, which are determined using MATCH^{TM}. Afterwards, PCTraFF calculates significant TFBS pairs, and an interaction network is created based on these pairs. As an optional final step, the Markov clustering algorithm determines clusters of TFBSs that are very likely to form interaction modules.
MATCH^{TM}
MATCH^{TM} is a tool that takes DNA sequences as input and searches for potential TFBSs using a library of position weight matrices (PWMs). For each potential TFBS, MATCH^{TM} calculates two scores: the matrix similarity score (MSS) and the core similarity score (CSS). For each PWM, a threshold is given for each of the two scores. If both the MSS and the CSS of a TFBS exceed their respective thresholds, the binding site is considered significant. For each sequence, MATCH^{TM} outputs a list of potential TFBSs together with their MSS and CSS values and their positions in the sequence.
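The significance rule can be illustrated with a minimal sketch. The record layout (dictionaries with the keys `pwm`, `position`, `mss` and `css`) and the function name are hypothetical and chosen only for this illustration; they do not reflect the actual MATCH^{TM} output format. Whether a score equal to its threshold counts as significant is also an assumption.

```python
def significant_sites(hits, thresholds):
    """Keep hits whose MSS and CSS both reach the thresholds of their PWM.

    hits:       list of dicts with keys "pwm", "position", "mss", "css"
                (hypothetical layout, not the real MATCH output format)
    thresholds: dict mapping a PWM name to a (mss_min, css_min) tuple
    """
    kept = []
    for hit in hits:
        mss_min, css_min = thresholds[hit["pwm"]]
        # Assumption: a score equal to the threshold is accepted.
        if hit["mss"] >= mss_min and hit["css"] >= css_min:
            kept.append(hit)
    return kept
```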
To scan the sequences for TFBSs, we used a library of PWMs proposed by Deyneko et al., which includes PWMs from TRANSFAC (release 2014.1).
PCTraFF
For the detection of potentially interacting TFs we applied PCTraFF. The PCTraFF algorithm consists of six phases:
Phase 1: Construction and filtering of the TFBS-sequence matrix
The construction of the TFBS-sequence matrix M is based on the frequency of predicted TFBSs in each sequence. The rows of the matrix correspond to the sequences, while the columns correspond to the TFBSs. An entry in the matrix at position (i,j) corresponds to the frequency of TFBS t_{j} in sequence s_{i}.
Afterwards, M is filtered to reduce i) the bias caused by TFBSs that are highly represented in all sequences and ii) the noise from false signals arising from insufficient data.
For this purpose, the standard deviation σ of the column sums of M is calculated, and every column whose sum is greater than 3 × σ is eliminated. Furthermore, the proportion of zero entries is calculated for every column, and a column is removed if it contains more zero entries than the average column.
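The two filtering rules can be sketched as follows. This is an illustrative reading of the description, not the server's implementation: it assumes that σ is the population standard deviation of the column sums and that "average" refers to the mean number of zero entries per column. The function name `filter_matrix` is hypothetical.

```python
from statistics import mean, pstdev


def filter_matrix(matrix, tfbs_names):
    """Drop over-represented and sparse TFBS columns from a count matrix.

    matrix:     list of rows (one per sequence) of TFBS frequencies
    tfbs_names: column labels, one per TFBS
    Returns the filtered matrix and the names of the retained columns.
    """
    n_cols = len(tfbs_names)
    col_sums = [sum(row[j] for row in matrix) for j in range(n_cols)]
    zero_counts = [sum(1 for row in matrix if row[j] == 0) for j in range(n_cols)]

    # Assumption: population standard deviation of the column sums.
    sigma = pstdev(col_sums)
    # Assumption: "average" is the mean zero count over all columns.
    mean_zeros = mean(zero_counts)

    keep = [
        j
        for j in range(n_cols)
        if col_sums[j] <= 3 * sigma and zero_counts[j] <= mean_zeros
    ]
    filtered = [[row[j] for j in keep] for row in matrix]
    return filtered, [tfbs_names[j] for j in keep]
```

Note that with only a handful of columns the 3 × σ rule can hardly remove anything, because a single dominant column inflates σ itself; the rule becomes effective for the library sizes typical of a PWM collection.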
Phase 2: Identification of important TFBSs in each sequence
On the basis of the filtered matrix M, the importance of each TFBS in the context of the entire sequence set is determined using pointwise mutual information.
The pointwise mutual information between sequence s_{i} and TFBS t_{j} (PMI_{st}) is defined as
PMI(s_{i};t_{j}) = log( p(s_{i},t_{j}) / ( p(s_{i}) p(t_{j}) ) )
A positive PMI(s_{i};t_{j}) indicates that TFBS t_{j} occurs in sequence s_{i} more often than expected by chance and is therefore important for s_{i}. In the following steps, only those TFBSs that are important for a sequence are considered for that sequence.
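As a small worked example, the sequence-TFBS PMI can be computed directly from the count matrix. The sketch assumes that the probabilities are estimated as relative frequencies (entry, row sum and column sum divided by the grand total) and uses the base-2 logarithm; both are assumptions made for illustration, and the sign of the PMI does not depend on the base. The function names are hypothetical.

```python
import math


def pmi_matrix(counts):
    """PMI between every sequence (row) and TFBS (column) of a count matrix."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    result = []
    for i, row in enumerate(counts):
        pmi_row = []
        for j, count in enumerate(row):
            if count == 0:
                # The TFBS never occurs in this sequence: PMI is minus infinity.
                pmi_row.append(float("-inf"))
            else:
                # p(s,t) / (p(s) p(t)) simplifies to count * total / (row * col).
                pmi_row.append(math.log2(count * total / (row_sums[i] * col_sums[j])))
        result.append(pmi_row)
    return result


def important_tfbs(counts):
    """For each sequence, the column indices of TFBSs with positive PMI."""
    return [
        {j for j, value in enumerate(row) if value > 0}
        for row in pmi_matrix(counts)
    ]
```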
Phase 3: Filter to avoid overlaps
The MATCH^{TM} program predicts all potential TFBSs for the given PWM library. It is therefore possible that some TFBSs overlap each other. Overlapping TFBSs of the same type can lead to their overestimation in the analysis. To avoid this, overlapping TFBSs of the same type are filtered based on their distance to the transcription start site (TSS): of two overlapping TFBSs, only the one closer to the TSS is taken into account in the following analysis steps.
Filtering procedure of the overlap filter. Overlapping TFBSs of the same type (marked with red circles) are filtered so that only the TFBS closer to the TSS is retained.
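One possible reading of the overlap filter is sketched below. It assumes that each TFBS is given by a name and inclusive start and end coordinates, that the distance to the TSS is measured from the centre of the site, and that chains of overlapping sites are resolved greedily, starting with the site closest to the TSS. The `Site` type and the function names are hypothetical.

```python
from collections import namedtuple

# Hypothetical record: TFBS type plus inclusive start/end coordinates.
Site = namedtuple("Site", "name start end")


def overlaps(a, b):
    """True if the two sites share at least one position."""
    return a.start <= b.end and b.start <= a.end


def distance_to_tss(site, tss):
    """Assumption: distance is measured from the centre of the site."""
    return abs((site.start + site.end) / 2 - tss)


def filter_overlaps(sites, tss):
    """Among overlapping sites of the same type keep the one nearest the TSS."""
    kept = []
    # Visit sites from nearest to farthest, so that a site is dropped only
    # if a closer site of the same type overlapping it was already kept.
    for site in sorted(sites, key=lambda s: distance_to_tss(s, tss)):
        if not any(k.name == site.name and overlaps(k, site) for k in kept):
            kept.append(site)
    return sorted(kept, key=lambda s: s.start)
```

Sites of different types are never compared, so a TFBS that overlaps a site of another type is always retained.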
Phase 4: Construction of TFBS pairs
The distance d_{tAtB} between two TFBSs t_{A} and t_{B} is defined based on their midpoints C_{tA} and C_{tB}:
The midpoint C_{tA} of a TFBS t_{A} is defined as ⌊length_{A}/2⌋, where length_{A} is the length of t_{A}. Two TFBSs form a pair if d_{min} ≤ d_{tAtB} ≤ d_{max}, where d_{min} and d_{max} are the minimal and maximal distance constraints, respectively. These distance constraints are defined by the user. Repeated occurrences of TFBSs of the same type within a certain interval on the DNA can lead to false-positive counts of TFBS pairs. To avoid this overestimation of TFBS pairs, each TFBS is allowed to participate in only one pair of a given type within a certain interval (a predefined distance).
The problem of homotypic clusters. The TFBSs t_{blue} form a homotypic cluster within a certain interval on the sequence. The TFBS t_{red} also lies in this interval. According to our definition of TFBS pairs, and following the DNA strand in the 5'→3' direction: i) we count one t_{blue}-t_{red} pair in this interval, because an individual TFBS can participate in only one count of a given pair type (shown with a black line); ii) for t_{blue}-t_{blue} pairs, there are two pairs within this interval (shown with blue lines). The red (dashed) lines mark the remaining t_{blue}-t_{blue} and t_{blue}-t_{red} pairs, which are not taken into account when calculating the pointwise mutual information of these pairs.
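The distance criterion alone can be sketched as follows. Two assumptions are made for illustration: the midpoint of a site is taken as its start position plus ⌊length/2⌋, and the distance between two sites is the absolute difference of their midpoints. The additional restriction for homotypic clusters is deliberately left out, and the function names are hypothetical.

```python
def midpoint(start, length):
    """Assumption: midpoint = start position plus floor(length / 2)."""
    return start + length // 2


def build_pairs(sites, d_min, d_max):
    """List all TFBS pairs whose midpoint distance lies in [d_min, d_max].

    sites: list of (name, start, length) tuples on one sequence.
    Returns (name_a, name_b, distance) tuples in 5'->3' order.
    The homotypic-cluster restriction is NOT applied in this sketch.
    """
    ordered = sorted(sites, key=lambda s: midpoint(s[1], s[2]))
    pairs = []
    for i, (name_a, start_a, len_a) in enumerate(ordered):
        for name_b, start_b, len_b in ordered[i + 1:]:
            d = midpoint(start_b, len_b) - midpoint(start_a, len_a)
            if d > d_max:
                break  # sites are sorted, so all later ones are farther away
            if d >= d_min:
                pairs.append((name_a, name_b, d))
    return pairs
```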
Phase 5: Weighted cumulative pointwise mutual information
Potentially collaborating TFs are determined using the weighted cumulative pointwise mutual information (PMI_{pc}), which is based on the co-occurrence of the corresponding TFBSs.
The PMI(t_{a};t_{b}) between two TFBSs t_{a} and t_{b} is defined as
PMI(t_{a};t_{b}) = log( p(t_{a},t_{b}) / ( p(t_{a}) p(t_{b}) ) )
where p(t_{a},t_{b}) is the joint probability, and p(t_{a}) and p(t_{b}) are the marginal probabilities for t_{a} and t_{b}, respectively. In order to reduce the known susceptibility of PMI to low counts, the PMI is weighted and obtained as
where w_{s} is the weight of sequence s that is calculated using the number of TFBS pairs N_{s} of sequence s.
Finally, the weighted cumulative pointwise mutual information PMI_{pc} is calculated by summing the PMI^{s}_{p}(t_{a};t_{b}) values over all sequences:
PMI_{pc}(t_{a};t_{b}) = Σ_{s} PMI^{s}_{p}(t_{a};t_{b})
Phase 6: Background noise reduction of TFBSs using average product correction
The background noise is reduced using the average product correction (APC) procedure. In this procedure, the background noise is estimated by APC and afterwards subtracted from the original PMI_{pc} value, resulting in the final PMI^{APC}_{pc}(t_{a};t_{b}) value for the TFBS pair t_{a} and t_{b}.
The corrected PMI^{APC}_{pc}(t_{a};t_{b}) values are then transformed into z-scores, and a TFBS pair is considered significant if its z-score ≥ z, where z is defined by the user.
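The correction and the significance test can be sketched as follows. The APC term used here is the commonly used form, the product of the two per-TFBS mean values divided by the overall mean; that PCTraFF uses exactly this form, and that the z-scores are computed with the population standard deviation over all pairs, are assumptions of this sketch. All function names are hypothetical.

```python
from statistics import mean, pstdev


def apc_correct(pmi_pc):
    """Subtract the average product correction from every pair value.

    pmi_pc: dict mapping a (tfbs_a, tfbs_b) tuple to its PMI_pc value.
    Assumes the overall mean is non-zero (otherwise division by zero).
    """
    overall = mean(list(pmi_pc.values()))
    names = {name for pair in pmi_pc for name in pair}
    # Mean value of all pairs in which a given TFBS takes part.
    site_mean = {
        name: mean([v for pair, v in pmi_pc.items() if name in pair])
        for name in names
    }
    return {
        (a, b): value - site_mean[a] * site_mean[b] / overall
        for (a, b), value in pmi_pc.items()
    }


def z_scores(values):
    """Standardise the corrected values (population standard deviation)."""
    mu = mean(list(values.values()))
    sd = pstdev(list(values.values()))
    return {pair: (v - mu) / sd for pair, v in values.items()}


def significant_pairs(z, threshold):
    """Pairs whose z-score reaches the user-defined threshold."""
    return sorted(pair for pair, score in z.items() if score >= threshold)
```

For example, with the three pair values 2, 4 and 6 for (A,B), (A,C) and (B,C), the per-TFBS means are 3, 4 and 5, the overall mean is 4, and the corrected values are -1, 0.25 and 1.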
Markov clustering algorithm
The Markov clustering algorithm (MCL) separates high-flow regions from low-flow regions and thereby identifies densely connected TFBSs in a network. Let N := (Ν,Ε) be the transcription factor interaction network, represented by an adjacency matrix A. Two nodes v_{i},v_{j} ∈ Ν are connected by an edge e_{(vi,vj)} ∈ Ε if and only if the corresponding TFBS pair was identified by PCTraFF. Furthermore, w(v_{i},v_{j}) denotes the weight of the edge e_{(vi,vj)}, which is the z-score of the TFBS pair (v_{i},v_{j}) calculated by PCTraFF.
The adjacency matrix A is then converted into a row-stochastic "Markov" matrix M_{n×n}, where m_{ij} represents the transition probability between nodes v_{i} and v_{j} in the network under study. To detect densely connected TFBSs in the network, MCL is applied to M. The basic intuition of MCL is to simulate stochastic flows on the underlying interaction network in order to separate high-flow regions from low-flow regions. To this end, Expand and Inflate operations are applied to M until M reaches its steady state. The Expand operation corresponds to matrix multiplication (M = M × M), whereas the Inflate operation increases the contrast between higher- and lower-probability transitions by raising each entry m_{ij} of M to the power of the inflation parameter r > 1 (which can be set by the user). After each Inflate operation, M is renormalized into a row-stochastic matrix.
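The Expand and Inflate loop can be sketched in a few lines of plain Python. This is a minimal illustration, not the server's implementation. Three choices are assumptions of the sketch: a self-loop is added to every node before normalization (a common MCL convention), convergence is tested with a fixed tolerance, and each node is assigned to the column that carries the largest value in its row of the final matrix.

```python
def normalize_rows(m):
    """Scale every row so that it sums to 1 (row-stochastic matrix)."""
    out = []
    for row in m:
        s = sum(row)
        out.append([x / s if s else 0.0 for x in row])
    return out


def expand(m):
    """Expand step: M = M x M."""
    n = len(m)
    return [
        [sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def inflate(m, r):
    """Inflate step: raise every entry to the power r, then renormalize."""
    return normalize_rows([[x ** r for x in row] for row in m])


def mcl(weights, r=2.0, max_iter=100, tol=1e-9):
    """Cluster the nodes of a weighted, symmetric adjacency matrix."""
    n = len(weights)
    a = [list(row) for row in weights]
    for i in range(n):
        # Assumption: self-loop with the largest incident weight (or 1.0).
        a[i][i] = max(a[i]) or 1.0
    m = normalize_rows(a)
    for _ in range(max_iter):
        nxt = inflate(expand(m), r)
        delta = max(abs(x - y) for rx, ry in zip(nxt, m) for x, y in zip(rx, ry))
        m = nxt
        if delta < tol:
            break  # steady state reached
    clusters = {}
    for node in range(n):
        attractor = max(range(n), key=m[node].__getitem__)
        clusters.setdefault(attractor, []).append(node)
    return sorted(clusters.values())
```

A larger inflation parameter r generally produces more and smaller clusters, because weak transitions are suppressed more strongly in every Inflate step.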
TRANSFAC
TRANSFAC is a database of eukaryotic transcription factors, their DNA binding sites and further related information. The binding sites provided by TRANSFAC are represented as position weight matrices (PWMs) and can be displayed as logo plots. All logo plots shown on this web server are taken from TRANSFAC.

Logo plot of PWM V$CREB_Q2. Transcription factor binding sites (TFBSs) that are represented by PWMs in TRANSFAC can be displayed as logo plots. In a logo plot, the height of a nucleotide at a certain position corresponds to its likelihood at this position.
Some TFs have very similar TFBSs, and therefore a PWM can represent the binding site of more than one TF. On this web server, we display for each TFBS the TFs that it represents together with the gene description taken from TRANSFAC.
TFClass
TFClass is a classification of human transcription factors and their mouse homologs that is based on their DNA-binding domains. There are four general classification levels (superclass, class, family, subfamily) and two instantiation levels (genus and molecular species). In total, TFClass comprises nine superclasses, 40 classes and 111 families.
Server Inputs
The basis for the detection of TFBS pairs is a set of sequences. This set can either be uploaded directly in FASTA format or be specified as a list of HGNC gene symbols. When providing a list of HGNC symbols, the user can specify the promoter region (upstream and downstream of the transcription start site (TSS)) as well as the genome release (hg19 or hg38).

FASTA format of two sequences. The sequences under analysis have to be provided in FASTA format. All sequences of the set are listed in one file, and each sequence is described in a single line that starts with ">" and is written above the sequence.
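To check an input file before uploading it, a minimal FASTA reader such as the following can be used. It is a generic sketch and not part of the server; the function name `parse_fasta` is hypothetical, and the sketch assumes that the description lines are unique.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping each description to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            # A sequence may be wrapped over several lines.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}
```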

Server input. In the field "Select input data type", the user specifies whether a set of gene symbols or a set of sequences is provided. When providing a set of gene symbols, the user can choose the preferred genome release in the field "Select a database". In "Parameters", the user can set, for example, the maximal distance between the TFBSs forming a pair and the z-score used as the significance threshold.
Server Results

Results. The results page begins with a recapitulation of the input details.

Results summary and static network. The results summary gives the total number of significant pairs, together with the number of pairs that have experimental evidence and the number of pairs not yet validated. In addition, a static network is provided with TFBSs as nodes and edges representing the potential interactions between them.

Pair list with logo plot. The significant PCTraFF pairs are ranked according to their z-score and, if available, listed with experimental evidence. Clicking on the "info" button displays the logo plots of both TFBSs of the corresponding pair together with the TFs binding to each TFBS.

Interactive network: graph info. The information provided comprises the number of nodes and edges as well as the hubs of the network, each with its top three collaboration partners, the z-score and, if available, the experimental reference.

Node properties. Selecting a node provides information about that node, such as the corresponding logo plot, the TFClass classification of the related TFs and the TRANSFAC description of these TFs.

Edge properties. When an edge is selected, the related TFBS pair is shown together with its distance distribution in the sequence set. Additionally, the genes in which the selected pair occurs most frequently are listed. For each gene, the preferred distance of the selected pair in that gene is shown.

Markov clustering network. The results of the Markov clustering algorithm are also presented in an interactive network. 