Leading Edge
Review

The Human Transcription Factors

Samuel A. Lambert,1,9 Arttu Jolma,2,9 Laura F. Campitelli,1,9 Pratyush K. Das,3 Yimeng Yin,4 Mihai Albu,2 Xiaoting Chen,5 Jussi Taipale,3,4,6,* Timothy R. Hughes,1,2,* and Matthew T. Weirauch5,7,8,*

1Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
2Donnelly Centre, University of Toronto, Toronto, ON, Canada
3Genome-Scale Biology Program, University of Helsinki, Helsinki, Finland
4Division of Functional Genomics and Systems Biology, Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Solna, Sweden
5Center for Autoimmune Genomics and Etiology (CAGE), Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, USA
6Department of Biochemistry, Cambridge University, Cambridge CB2 1GA, United Kingdom
7Divisions of Biomedical Informatics and Developmental Biology, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, USA
8Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, Ohio, USA
9These authors contributed equally
*Correspondence: ajt208@cam.ac.uk (J.T.), t.hughes@utoronto.ca (T.R.H.), Matthew.Weirauch@cchmc.org (M.T.W.)
https://doi.org/10.1016/j.cell.2018.01.029

Transcription factors (TFs) recognize specific
DNA sequences to control chromatin and transcription, forming a complex system that guides expression of the genome. Despite keen interest in understanding how TFs control gene expression, it remains challenging to determine how the precise genomic binding sites of TFs are specified and how TF binding
ultimately relates to regulation of transcription. This review considers how TFs are identified and functionally characterized, principally through the lens of a catalog of over 1,600 likely human TFs and binding motifs for two-thirds of them. Major classes of human TFs differ markedly in their evolutionary trajectories and expression patterns, underscoring distinct functions. TFs likewise underlie many different aspects of human physiology, disease, and variation, highlighting the importance of continued effort to understand TF-mediated gene regulation.

Introduction

Transcription factors (TFs) directly interpret the genome, performing the first step in decoding the DNA sequence. Many function as master regulators and selector genes, exerting control over processes that specify cell types and developmental patterning (Lee and Young, 2013) and controlling specific pathways such as immune responses (Singh et al., 2014). In the laboratory, TFs can drive cell differentiation (Fong and Tapscott, 2013) and even de-differentiation and trans-differentiation (Takahashi and Yamanaka, 2016). Mutations in TFs and TF-binding sites underlie many human diseases. Their protein sequences, regulatory regions, and physiological roles are often deeply conserved among metazoans
(Bejerano et al., 2004; Carroll, 2008), suggesting that global gene regulatory networks may be similarly conserved. And yet, there is high turnover in individual regulatory sequences (Weirauch and Hughes, 2010), and over longer timescales, TFs duplicate and diverge. The same TF can regulate different genes in different cell types (e.g., ESR1 in breast and endometrial cell lines [Gertz et al., 2012]), indicating that regulatory networks are dynamic even within the same organism. Determining how TFs are assembled in different ways to recognize binding sites and control transcription is daunting yet paramount to understanding their physiological roles, decoding specific functional properties of genomes, and mapping how highly specific expression programs are orchestrated in complex organisms.

This review considers our current understanding of TFs and their global functions to provide context for thinking about how TFs work individually and as an ensemble. We also provide a catalog of the human TF complement and a comprehensive assessment of whether a DNA-binding motif is known for each TF. We use this catalog to survey human TF function, expression, and evolution, highlighting the roles played by TFs in human disease, including
the effect of variation within TF proteins and TF-binding sites. A comprehensive review of ~1,600 proteins is impossible; instead, we attempt to exemplify emerging trends and techniques, as well as shortcomings in existing data.

Historically, the term "transcription factor" has been applied to describe any protein involved in transcription and/or capable of altering gene-expression levels. In the current vernacular, however, the term is reserved for proteins capable of (1) binding DNA in a sequence-specific manner and (2) regulating transcription (Figure 1A) (Fulton et al., 2009; Vaquerizas et al., 2009). TFs can have 1,000-fold or greater preference for specific binding sequences relative to other sequences (Damante et al., 1994; Geertz et al., 2012). Because TFs can act by occluding the DNA-binding site of other proteins (e.g., the classic lambda, lac, and trp repressors [Ptashne, 2011]), the ability to bind to specific DNA sequences alone is often taken as an indicator of ability to regulate transcription.

These proteins cannot be understood functionally without accompanying detailed knowledge of the DNA sequences they bind. TF DNA-binding specificities are frequently summarized as "motifs," models representing the set of related, short sequences preferred by a given TF, which can be used to scan longer sequences (e.g., promoters) to identify potential binding sites. Determining a DNA-binding motif is often the first step toward detailed examination of the function of a TF because identification of potential binding sites provides a gateway to further analyses. Our ability to generate both motifs and genomic binding sites has improved dramatically over the last decade, leading to an unprecedented wealth of data on TF-DNA interactions. To develop the current TF catalog, we have drawn heavily upon motif collections such as TRANSFAC (Matys et al., 2006), JASPAR (Mathelier et al., 2016), HT-SELEX (Jolma et al., 2013; Jolma et al., 2015; Yin et al., 2017), UniPROBE (Hume et al., 2015), and CisBP (Weirauch et al., 2014), along with previous catalogs of human TFs (Fulton et al., 2009; Vaquerizas et al., 2009; Wingender et al., 2015).

Cell 172, February 8, 2018 © 2018 Elsevier Inc.

There is typically only a partial overlap between experimentally determined binding sites in the genome and sequences matching the motif; moreover, even experimentally determined binding sites are relatively poor predictors of genes that the TFs actually regulate (Cusanovich et al., 2014). At the same time, motif matches are often among the most enriched sequences in a ChIP-seq (chromatin immunoprecipitation sequencing) dataset, indicating that intrinsic DNA-binding specificity is important for TF binding in vivo. In retrospect, this outcome should have been expected: most TF-binding sites are small (usually 6–12 bases) and flexible, so a typical human gene (20 kb) will contain multiple potential binding sites for most TFs (Wunderlich and Mirny, 2009). Well-established concepts such as cooperativity and synergy between TFs provide a ready solution to this deficit in specificity: most human TFs have to
work together to get anything done, but the details of their interactions and relationships are generally lacking. The biochemical effects of TFs subsequent to binding DNA are also largely unmapped and known to be context dependent. As a result, decoding how gene regulation relates to TF-binding motifs and gene sequences remains a major practical challenge; the resulting frustration has been embodied in the term "futility theorem" (Wasserman and Sandelin, 2004).

How Transcription Factors Are Identified

The major TF families in eukaryotes, such as C2H2-zinc finger (ZF), Homeodomain, basic helix-loop-helix (bHLH), basic leucine zipper (bZIP), and nuclear hormone receptor (NHR), were initially described in the 1980s (reviewed in Johnson and McKnight, 1989). Knowledge of binding sites, often identified by methods such as DNase footprinting or mobility shift, led to identification of the particular binding proteins using N-terminal peptide sequencing, phage libraries, or one-hybrid screening. Similarities in amino acid composition and structure were then noted among different DNA-binding proteins. New DNA-binding proteins continue to be identified by experimental methods (e.g., one-hybrid assays [see Reece-Hoyes and Marian Walhout, 2012], DNA affinity purification-mass spectrometry [reviewed in Tacheny et al., 2013], and protein microarrays [Hu et al., 2009] can screen for new DNA-binding proteins), but today, most known and putative TFs have instead been identified by sequence homology to a previously characterized DNA-binding domain (DBD), which
is also used to classify the TF (see Weirauch and Hughes, 2011 for review). With the possible exception of the very simple AT-hook (Aravind and Landsman, 1998), all extant examples of DBDs are assumed to be derived from a small set of common ancestors representing the major DBD folds, with the families arising by duplication. There are ~100 known eukaryotic DBD types, which are cataloged in Pfam (Finn et al., 2016), SMART (Letunic et al., 2015), or Interpro (Finn et al., 2017) as hidden Markov models (HMMs), which are used to scan protein sequences for these domains. DBD structures in complex with DNA are currently available in the Protein Data Bank (PDB) (Berman et al., 2000) for most families of human TFs, with AP2, BED-ZF, CP2, SAND, and NRF being notable exceptions. To date, all but a handful of well-characterized mammalian TFs contain a known DBD (Fulton et al., 2009). It is likely that additional DBDs remain to be discovered; for example, extended homologous regions in polycomb-like proteins were recently found to bind motifs containing CG dinucleotides (Li et al., 2017).

Care must be taken when inferring function based only on a homology match to a DBD because not all instances of these domains will necessarily bind specific DNA sequences. The CERS/Lass-type Homeodomains, for example, are not likely to be DNA-binding proteins at all; they instead appear to have been co-opted to function in sphingolipid synthesis (Mesika et al., 2007). Likewise, only a subset of Myb/SANT, HMG, and ARID domain-containing proteins bind specific DNA sequences. In addition, domains with similar names should not be confused. For example, C2H2-ZFs and CCCH-ZFs are structurally and evolutionarily distinct, and while C2H2-ZFs generally bind double-stranded DNA, CCCH-ZFs typically bind single-stranded RNA (reviewed in Font and Mackay, 2010).

Determining TF DNA-Binding Motifs

Motifs are typically displayed as a sequence logo (Schneider and Stephens, 1990), which in turn represents an underlying table or position weight matrix (PWM) of relative preference of the TF for each base in the binding site (Stormo and Zhao, 2010). At each base position, each of the four bases has a score, and multiplying these scores for each base of a sequence yields a predicted relative affinity of the TF to that sequence. In many cases, these logos reflect strong preference for one or a small number of related sequences, although they can also represent weak base preferences that nonetheless contribute to binding.
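The multiplicative PWM scoring described above, together with the use of a motif to scan a longer sequence, can be illustrated with a minimal Python sketch. The four-position matrix, its values, and the function names (`relative_affinity`, `scan`, `best_site`) are hypothetical and chosen only for illustration; they do not correspond to any real TF or to any particular motif database. Only the forward strand is scanned, and scores are scaled so that the preferred base at each position scores 1.0.

```python
from math import prod

# Hypothetical 4-position PWM: relative preference for each base at each
# position (invented values; the preferred base at each position is 1.0).
PWM = [
    {"A": 1.0, "C": 0.1, "G": 0.1, "T": 0.2},
    {"A": 0.1, "C": 1.0, "G": 0.2, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 1.0, "T": 0.1},
    {"A": 0.5, "C": 0.1, "G": 0.1, "T": 1.0},
]


def relative_affinity(site, pwm=PWM):
    """Multiply the per-position scores of `site` to get a relative affinity."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal motif length")
    return prod(column[base] for column, base in zip(pwm, site.upper()))


def scan(sequence, pwm=PWM):
    """Score every forward-strand window; return (offset, site, score) tuples."""
    width = len(pwm)
    return [
        (i, sequence[i:i + width], relative_affinity(sequence[i:i + width], pwm))
        for i in range(len(sequence) - width + 1)
    ]


def best_site(sequence, pwm=PWM):
    """Return the highest-scoring window found by `scan`."""
    return max(scan(sequence, pwm), key=lambda hit: hit[2])
```

For instance, under these invented scores `relative_affinity("ACGA")` is half that of the preferred site `"ACGT"`, because only the last position differs. A fuller treatment would also score the reverse complement and would typically work with log-transformed scores, so that the product becomes a sum.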
Complications can also arise that are not captured by a PWM: there may be dependencies among base positions (Bulyk et al., 2002; Jolma et al., 2013), for example, due to DNA shape or deformability (Rohs et al., 2009); the TF may have multiple binding modes (e.g., different physical configurations of the protein leading to separate, distinct motifs) (Badis et al., 2009); cooperative interactions may influence the sites bound by a TF (Jolma et al., 2015); or DNA methylation can impact binding, positively or negatively (Yin et al., 2017). To account for these complexities, more complicated models have been developed, e.g., that incorporate preferences to dinucleotides and higher-order k-mers (reviewed in Slattery et al., 2014), with improvement in accuracy depending on the TF and its family. In many cases, however, the improvement is minor or even undetectable, especially when comparing across different datasets (Weirauch et al., 2013), and the PWM remains the most commonly used model for analysis of TF binding. Hereafter, we use the term "motif" to signify PWM.

Figure 1. The Human Transcription Factor Repertoire
(A) Schematic of a prototypical TF.
(B) Number of TFs and motif status for each DBD family. Inset displays the distribution of the number of C2H2-ZF domains for classes of effector domains (KRAB, SCAN, or BTB domains); "Classic" indicates the related and highly conserved SP, KLF, EGR, GLI GLIS, ZIC, and WT proteins.
(C) DBD configurations of human TFs. In the network diagram, edge width reflects the number of TFs with each combination of DBDs.
(D) Number of auxiliary (non-DNA-binding) domains (from Interpro) present in TFs, broken down by DBD family.

The sequence preferences and binding sites of TFs can be assessed by a wide variety of techniques both in vitro and in vivo (reviewed in Jolma and Taipale, 2011); Table 1 outlines the most prevalent methods and their attributes. As a predictor of relative binding affinity, motifs are most accurately obtained from quantitative affinity measurements for a large number of sequences, preferably using purified proteins and DNA (Stormo and Zhao, 2010). Nonetheless, motifs for many well-studied proteins were initially
obtained from very few sequences (e.g., dozens of Sanger reads) and used in thousands of subsequent studies (Mathelier et al., 2016; Matys et al., 2006), illustrating the utility of even approximate descriptions of binding ability.

ChIP-seq (Johnson et al., 2007) has revolutionized the study of TF-binding sites in vivo by enabling the genome-wide identification of regions occupied by a TF of interest. The semiquantitative measurements obtained have several limitations with regard to motif derivation, however. First, binding is influenced by chromatin state (many TFs bind almost exclusively in open chromatin) as well as biases in the sequence content of the genome. Second, ChIP-seq can clearly detect indirect binding, which can lead to identification of motifs for proteins other than the one ChIPped (Wang et al., 2013; Worsley Hunt and Wasserman, 2014). Third, due to the use of cross-linkers, ChIP does not measure equilibrium binding. Finally, ChIP data is highly dependent on antibody quality: many antibodies cross-react, and ChIP-grade antibodies are not available for many TFs. It is thus often helpful to use prior knowledge regarding the motif expected; for example, the C2H2-ZF recognition code (which relates DNA-contacting residues to preferred base positions in the binding site [Najafabadi et al., 2015]) can be used to restrict the analysis to those motifs that resemble computational-based specificity predictions. Some of these issues are in theory addressed by higher resolution approaches such as ChIP-exo (ChIP with exonuclease digestion) (Rhee and Pugh, 2011), but relatively few examples are currently available.

In summary, we now appear to possess the tools needed to identify TF motifs globally. Having these motifs, however, is only a first step in decoding the functions of these proteins in gene regulation; we outline additional complexities in the following sections.

TF Cooperativity and Interactions with Nucleosomes

Both theoretical arguments and practical observations indicate that metazoan TFs must, in general, work together to achieve needed specificity in both DNA binding and effector function, hence the futility theorem (Reiter et al., 2017;