ChIA-BERT: prediction of CTCF-mediated chromatin loops identified by Chromatin Interaction Analysis with Paired-End Tag (ChIA-PET) from DNA sequence

Mateusz Chiliński

supervisor: Dariusz Plewczynski



The spatial architecture of the human genome is thought to play a major role in controlling biological processes in the cell. DNA regions that are distant in linear sequence but close in space can interact with each other and thereby regulate gene expression. One experimental method for identifying statistically significant 3D interactions of the chromatin fiber is ChIA-PET, which can detect chromatin loops mediated by the CCCTC-binding factor (CTCF). However, experimentally identified chromatin loops are not always available.


For this reason, multiple statistical learning algorithms have been proposed to simulate 3D genomics experiments in silico. We have developed ChIA-BERT, a deep learning algorithm based on transformers, which predicts CTCF-mediated chromatin loops from DNA sequence with an accuracy of up to 78%. As input, the model takes pairs of DNA sequence segments: the positive set consists of pairs of interacting segments, and the negative set consists of random segments from the rest of the genome that do not interact in 3D space.
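The construction of the negative set can be illustrated with a minimal sketch. It is not the ChIA-BERT implementation: the function names, the fixed segment length, the restriction to same-chromosome pairs, and the 0-based half-open coordinates are all assumptions made for illustration. The sketch draws random segment pairs and rejects any pair that touches a known loop anchor.

```python
import random

def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def sample_negative_pairs(loops, chrom_sizes, seg_len, n, seed=0):
    """Hypothetical helper: draw up to n random same-chromosome segment
    pairs that avoid every anchor of the known (positive) loops.

    loops       -- list of (anchor_a, anchor_b), each anchor (chrom, start, end)
    chrom_sizes -- dict mapping chromosome name to its length in bp
    seg_len     -- assumed fixed length of each sampled segment
    """
    rng = random.Random(seed)
    anchors = [anchor for loop in loops for anchor in loop]
    # Only chromosomes long enough to hold one segment are usable.
    chroms = sorted(c for c, size in chrom_sizes.items() if size >= seg_len)
    if not chroms:
        raise ValueError("no chromosome is long enough for seg_len")

    negatives = []
    attempts = 0
    # Rejection sampling with an attempt cap, so the loop always terminates.
    while len(negatives) < n and attempts < 1000 * n:
        attempts += 1
        chrom = rng.choice(chroms)
        pair = []
        for _ in range(2):
            start = rng.randint(0, chrom_sizes[chrom] - seg_len)
            pair.append((chrom, start, start + seg_len))
        if overlaps(pair[0], pair[1]):
            continue  # the two segments must be distinct regions
        if any(overlaps(seg, anchor) for seg in pair for anchor in anchors):
            continue  # skip segments that touch a known interacting anchor
        negatives.append((pair[0], pair[1]))
    return negatives

# Toy example with one known loop on a 5 kb "chromosome".
loops = [(("chr1", 100, 200), ("chr1", 900, 1000))]
negatives = sample_negative_pairs(loops, {"chr1": 5000}, seg_len=100, n=5)
# Label positives 1 and negatives 0 for a binary classifier.
dataset = [(a, b, 1) for a, b in loops] + [(a, b, 0) for a, b in negatives]
```

Avoiding the anchors themselves is a stricter rule than "not interacting", chosen here only because it is simple to check; a real pipeline might instead reject only pairs that match a known loop.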


Our results clearly show that modern deep learning methods can predict chromatin looping from DNA sequence. The proposed approach can have a major impact on building in silico statistical models that extrapolate the knowledge gathered from molecular biology experiments. Improved in silico predictions from DNA sequence also have a major impact on functional studies, because they make it possible to predict the effect of mutations on gene expression.