Please use Linux today if possible!

Introduction to Molecular Biology Databases
Alinda Nagy & Hedi Hegyi, PhD
Institute of [email protected] Budapest
The BioSapiens Permanent School of Bioinformatics
Budapest, Sept 4-8, 2006

Databases

What is a database?
- A database is a structured collection of information (an organized array of information).
- A database consists of basic objects called records or entries.
- Each record consists of fields, which hold defined data related to that record.
- For example, a protein database would typically have proteins as records and protein properties as fields (e.g. name, length, sequence, taxonomic origin).
Noam Kaplan

What is a database?
- A database is searchable (index) -> table of contents
- A database is updated periodically (release) -> new edition
- A database is cross-referenced (hyperlinks) -> links with other databases

Why databases?
- The purpose of databases is not merely to collect and organize data, but mainly to allow advanced data retrieval.
- A query is a method of retrieving information from the database.
- The organization of each record into predetermined fields allows us to run queries on fields.
- Example: find all human proteins that are enzymes and have a length of 1000-1200 aa.
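The example query above (human proteins that are enzymes, 1000-1200 aa long) can be sketched as a filter over records with fixed fields. This is only an illustration in Python: the `ProteinRecord` fields and the four toy records are invented here and do not come from any real database.

```python
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    """One database record; each attribute plays the role of a field."""
    name: str
    organism: str
    is_enzyme: bool
    length: int


# Hypothetical toy records; names and values are made up for illustration.
RECORDS = [
    ProteinRecord("protein_A", "Homo sapiens", True, 1150),
    ProteinRecord("protein_B", "Homo sapiens", True, 320),
    ProteinRecord("protein_C", "Mus musculus", True, 1100),
    ProteinRecord("protein_D", "Homo sapiens", False, 1050),
]


def query(records, organism, enzyme, min_len, max_len):
    """Return the records whose fields satisfy every condition."""
    return [
        r for r in records
        if r.organism == organism
        and r.is_enzyme == enzyme
        and min_len <= r.length <= max_len
    ]


hits = query(RECORDS, "Homo sapiens", True, 1000, 1200)
print([r.name for r in hits])
```

Because every record has the same predetermined fields, each condition in the query is a simple test on one field, and conditions can be combined freely.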

Noam Kaplan

Databases on the Internet
- Biological databases often have a web interface, which allows the user to send queries to the database.
- Some databases can be accessed through different web servers, each offering a different interface.
[Diagram: the user sends a request to the web server, which sends a query to the database server; the result goes back to the web server, which returns a web page to the user]
Noam Kaplan

Databases on the Internet
A database on the Internet can be viewed as four layers, from top to bottom: Information system, Query system, Storage system, Data.
- Data: GenBank flat file, PDB file, interaction record, title of a book, book
- Storage system: boxes, Oracle, MySQL, PC binary files, Unix text files, bookshelves
- Query system: a list you look at, a catalogue, indexed files, SQL, grep
- Information system: the UBC library, Google, Entrez (NCBI), SRS (Sequence Retrieval System)
Francis Ouellette

Database download
- Nearly all biological databases are available for download as simple text files.
- A local copy of the database removes limitations on how you process the data.
- Processing data in files requires some minimal computer-programming skills.
- Perl is an easy programming language that can be used to extract and analyse data from files.
Noam Kaplan
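As a rough idea of what processing a downloaded text file involves, here is a minimal sketch. The slide recommends Perl; Python is used here purely for illustration. The record layout (two-letter line codes, indented sequence lines, `//` as terminator) is a simplified assumption loosely modelled on Swiss-Prot flat files, and both entries are invented.

```python
from collections import defaultdict

# A made-up file in a simplified two-letter line-code layout
# (loosely modelled on Swiss-Prot flat files; not real entries).
FLAT_FILE = """\
ID   TOY1_HUMAN
DE   Hypothetical example protein 1.
OS   Homo sapiens.
SQ   SEQUENCE
     MKTAYIAKQR QISFVKSHFS
//
ID   TOY2_MOUSE
DE   Hypothetical example protein 2.
OS   Mus musculus.
SQ   SEQUENCE
     MSDNGELEDK
//
"""


def parse_records(text):
    """Yield one dict per record, mapping line codes to joined field text."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):          # record terminator
            yield {code: " ".join(parts) for code, parts in fields.items()}
            fields = defaultdict(list)
        elif line.startswith("     "):     # indented sequence line
            fields["seq"].append(line.replace(" ", ""))
        elif line.strip():
            code, _, value = line.partition("   ")
            fields[code].append(value.strip())


records = list(parse_records(FLAT_FILE))
for rec in records:
    sequence = rec["seq"].replace(" ", "")
    print(rec["ID"], rec["OS"], len(sequence))
```

Once the records are in memory as dictionaries, any field can be used for filtering or counting, without the limits of a web interface.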

Tour of the major molecular biology databases
- There is a tremendous amount of information about biomolecules in publicly available databases.
- Today, we will just look at some of the main databases and what kind of information they contain.
- Exercises will give you a little practice at browsing databases.

List of molecular biology databases
- Nucleic Acids Research (NAR) publishes an annual database issue.
- The 2006 update of the online Molecular Biology Database Collection includes 858 databases.

Large Growth in the Number of Biological Databases
[Figure: number of databases in the NAR Database Issue by year, 1996-2006; y-axis from 0 to 1000]

Types
- Organisms
- Genome maps (figure: mouse chromosome X, from the Mouse Genome Informatics project)
- DNA sequences, RNA sequences: ...AATGGTACCGATGACCTGGAGCTTGGTTCGA...
- Protein sequences: ...TRLRPLLALLALWPPPPARAFVNQHLCGSHLVEA...
- RNA structures
- Protein structures (figure: PDB entry 1CIS; P.Osmark, P.Sorensen, F.M.Poulsen)
- DNA motifs
- RNA expression
- Protein motifs
Lei Liu

Types of molecular biology databases
14 main NAR categories:
- Nucleotide Sequence
- RNA sequence
- Protein sequence
- Structure
- Genomics (non-vertebrate)
- Metabolic and Signaling Pathways
- Human and other Vertebrate Genomes
- Human Genes and Diseases
- Microarray Data and other Gene Expression
- Proteomics Resources
- Other
- Organelle
- Plant
- Immunological

Resources are Becoming More Diverse
[Figure: NAR database categories, 2004 vs. 2006]

NAR 2006: A Closer Look
- Genome-scale databases have proliferated
- Traditional sequence databases are now a small part
- Databases around new specific data types are emerging
- Pathway and disease-oriented databases are emerging

Database searches

Using a database
How to get information out of a database:
- Summaries: how many entries, average or extreme values
- Browsing: no targeted information to retrieve
- Search: looking for particular information
Searching a database:
- Must have a key that identifies the element(s) of the database that are of interest:
  - name of gene
  - sequence of gene
  - other information
Larry Hunter

Searching sequence databases
- Start from a sequence, find information about it
- Many kinds of input sequences:
  - could be an amino acid or nucleotide sequence
  - genomic, mRNA/cDNA or protein sequence
  - complete or fragmentary sequences
- Exact matches are rare (even uninteresting in many cases), so the goal is often to retrieve a set of similar sequences.
- Both small (mutations) and large (required for function) differences within similar sequences can be interesting.
Larry Hunter

What might we want to know about a sequence?
- Is this sequence similar to any known genes? How close is the best match? Is it significant?
- What do we know about that gene?
  - Genomic (chromosomal location, allelic information, regulatory regions, etc.)
  - Structural (known structure? structural domains? etc.)
  - Functional (molecular, cellular & disease)
- Evolutionary information:
  - Is this gene found in other organisms?
  - What is its taxonomic tree?
Larry Hunter

What can be discovered about a gene by a database search?
A little or a lot, depending on the gene:
- Evolutionary information: homologous genes, taxonomic distributions, allele frequencies, synteny, etc.
- Genomic information: chromosomal location, introns, UTRs, regulatory regions, shared domains, etc.
- Structural information: associated protein structures, fold types, structural domains
- Expression information: expression specific to particular tissues, developmental stages, phenotypes, diseases, etc.
- Functional information: enzymatic/molecular function, pathway/cellular role, localization, role in diseases
Larry Hunter

NCBI and Entrez
- One of the most useful and comprehensive sources of databases is the NCBI (National Center for Biotechnology Information), part of the NIH (National Institutes of Health).
- NCBI provides interesting summaries, browsers for genome data, and search tools.
- Entrez is their database search interface.
- Can search on gene names, sequences, chromosomal location, diseases, keywords, ...
Larry Hunter

BLAST: Searching with a sequence
- The goal is to find other sequences that are more similar to the query than would be expected by chance (and therefore are likely to be homologous).
- Can start with a nucleotide or amino acid sequence, and search for either (or both).
- Many options, e.g. ignore low-information (repetitive) sequence, set the significance critical value.
- Defaults are not always appropriate: READ THE NCBI EDUCATION PAGES!
Larry Hunter
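To give a feel for "more similar than would be expected by chance", the sketch below uses the standard relation between a normalised (bit) score S' and the expected number of chance hits, E = m * n * 2^(-S'), where m is the query length and n the total database length. This is a simplification: BLAST itself applies length corrections, and the lengths used here are arbitrary assumptions.

```python
def expected_hits(bit_score, query_length, database_length):
    """Expected number of chance hits for a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


m = 250          # query length in residues (assumed for illustration)
n = 1_000_000    # total residues in the database (assumed for illustration)
for bits in (20, 40, 60):
    print(bits, expected_hits(bits, m, n))
```

Each additional bit of score halves the expected number of chance hits, while a larger database raises it, which is one reason a fixed score threshold does not transfer between databases.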

Major choices:
- Translation
- Database
- Filters
- Restrictions
- Matrix
Larry Hunter

[Screenshots of BLAST results: close hit (rat ADH alpha); distant hit (human sorbitol dehydrogenase); parameters (at bottom!)]
Larry Hunter

BLAST searches online
[Screenshots: BLAST output for ENSP00000002501; BLAST output for ENSP00000314902]

Take home messages
- There are a lot of molecular biology databases, containing a lot of valuable information.
- Not even the best databases have everything (or the best of everything).
- These databases are moderately well cross-linked, and there are linker databases.
- Sequence is a good identifier, maybe even better than gene name!
Larry Hunter

Protein sequence databases
- General sequence databases (e.g. UniProt)
- Protein properties (e.g. PFD, the Protein Folding Database)
- Protein localization and targeting (e.g. NPD, the Nuclear Protein Database)
- Protein sequence motifs and active sites (e.g. BLOCKS, InterPro, PROSITE, PRINTS)
- Protein domain databases; protein classification (e.g. InterPro, ProDom, SMART, Pfam)
- Databases of individual protein families (e.g. Histone Database)

UniProt (The Universal Protein Resource)
Wu CH et al. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 2006 Jan 1;34(Database issue):D187-91.

Margaret Dayhoff
- The first protein database was created by Margaret Dayhoff and was called the Atlas of Protein Sequence and Structure. It was a book.

The Atlas of Protein Sequences
- Dayhoff had the idea that compiling all protein sequences in the literature into one resource would give a useful research tool.
- She and her co-workers collected all known sequences and published them together.
- Then, when a new sequence was obtained, there was a single resource available for determining its relationship to other known sequences.

What is UniProt
- The world's most comprehensive catalog of information on proteins.
- Central repository of protein sequence and function.
- Created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.
- Collaboration between EBI (European Bioinformatics Institute), SIB (Swiss Institute of Bioinformatics) and PIR (DDBJ to join).
- Funded mainly by NIH.
- Three database components:
  - UniProt Knowledgebase (UniProtKB)
  - UniProt Reference Clusters (UniRef)
  - UniProt Archive (UniParc)

What is UniProt
1. UniProt Knowledgebase (UniProtKB):
   - central access point for extensive curated protein information, including function, classification, and cross-references
   - comprises the manually annotated UniProtKB/Swiss-Prot section and the automatically annotated UniProtKB/TrEMBL section
2. UniProt Reference Clusters (UniRef):
   - combines closely related sequences into a single record to speed up searches
   - speeds up similarity searches via sequence space compression, by merging sequences that are 100% (UniRef100), 90% (UniRef90) or 50% (UniRef50) identical
3. UniProt Archive (UniParc):
   - comprehensive repository, reflecting the history of all protein sequences
   - stores all publicly available protein sequences, containing the ...

What is UniProt
The UniProt databases collect both protein sequences obtained through experimental determination and protein sequences derived from the translation of nucleotide sequences (which were predicted or determined to code for a protein).
[Diagram labels: amino acid sequence determined through experimental analysis; nucleotide sequence databases (GenBank, DDBJ, EMBL); protein sequence databases (PIR, Swiss-Prot, TrEMBL); validated; enriched with specific information]

UniProt Goals
- High level of annotation
- Minimal redundancy
- High level of integration with other databases
- Complete and up-to-date

Annotation concepts
- UniParc: no annotation
- UniProtKB: annotated
- UniRef: no annotation, just the description line of the UniProtKB or UniParc master entry in the cluster, for use in FASTA files

Minimal redundancy
- UniParc: all sequences that are 100% identical over their entire length are merged into a single entry, regardless of species. UniParc represents each protein sequence once and only once, assigning it a unique identifier. UniParc cross-references the accession numbers of the source databases.
- UniProtKB: aims to describe in a single record all protein products derived from a certain gene (or genes, if the translation from different genes in a genome leads to indistinguishable proteins) from a certain species.
- UniRef: merges sequences automatically.

Integration with other databases
- UniParc: linked back to source records
- UniProtKB: linked to >60 other databases
- UniRef: UniRef clusters link back to the UniProtKB and UniParc records in the cluster

Complete and up-to-date
- UniParc: all publicly available protein sequences, updated every 2 weeks (05/06, Rel 8.0: 7,116,519 entries)
- UniProtKB: all suitable stable protein sequences, updated every 2 weeks (05/06, Rel 8.0: 3,170,612 entries)
- UniRef: all protein sequences in UniProtKB and UniParc that are useful for sequence similarity searches, updated every 2 weeks (05/06, Rel 8.0: 3,511,676 UniRef100, 2,254,474 UniRef90 and 1,148,123 UniRef50 entries)
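The UniParc rule described under "Minimal redundancy" (merge sequences that are 100% identical over their entire length, give each a unique identifier, and keep cross-references to the source accessions) can be sketched as follows. This is not how UniParc is implemented; the source names, accessions and the `TOY` identifier scheme are all invented for illustration.

```python
# Hypothetical source entries: (source database, accession, sequence).
SOURCE_ENTRIES = [
    ("db_one", "ACC001", "MKTAYIAKQR"),
    ("db_two", "XYZ_9", "MKTAYIAKQR"),   # identical sequence, other source
    ("db_one", "ACC002", "MSDNGELEDK"),
]


def build_archive(entries):
    """Merge entries with identical sequences into one archive record."""
    archive = {}  # full sequence -> record
    for source, accession, sequence in entries:
        if sequence not in archive:
            # Invented identifier scheme: a running number per new sequence.
            identifier = f"TOY{len(archive) + 1:06d}"
            archive[sequence] = {"id": identifier, "xrefs": []}
        # Every source accession is kept as a cross-reference.
        archive[sequence]["xrefs"].append((source, accession))
    return archive


archive = build_archive(SOURCE_ENTRIES)
for sequence, record in archive.items():
    print(record["id"], sequence, record["xrefs"])
```

Keying on the complete sequence means two entries are merged only if they match over their entire length, regardless of which source or species they came from.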

An example
[Screenshots of an example database entry]

Exercise 1: Text search
1. Go to ExPASy. Click "UniProt Knowledgebase (SwissProt and TrEMBL)" and then search for human cochlin. Notice that there is a wealth of information about this protein. Furthermore, there are many links to sequence analysis tools (some of which you will learn about later) and some other nice features. Note that this is merely a graphical display of the original UniProtKB/Swiss-Prot database entry (which is in text).
2. Try to answer all of the questions below.
   1. In which year was the NMR structure of the LCCL domain determined?
   2. Where is the protein expressed?
   3. Which diseases are associated with the protein?

Exercise 2: BLAST search
3. Paste the sequence into the query sequence window and adjust the options as necessary. You won't need to specify advanced options, but you should choose a program and database. For simplicity, use e.g. the UniProtKB database.
4. Run the search and identify the protein. Use the link provided to see the UniProtKB/Swiss-Prot report.
5. Now, try to answer all of the questions below.
   1. What is the Swiss-Prot primary accession number?
   2. What is the common name of the protein?
   3. What is the gene called?
   4. In which year was the crystal structure of the catalytic domain determined? Name the first author.
   5. Does the enzyme require a co-factor to function? If so, what?
   6. Name the most common disease that arises as a result of deficiency of this enzyme.
   7. How many amino acid residues are there in the protein?
   8. What is the molecular weight of the protein?

Patterns and Profiles, Protein Motifs and Domains
- InterPro: an integrated database of protein families, domains, motifs and functional sites.
- Blocks: multiply aligned ungapped segments for the most highly conserved regions of proteins.
- Motif: a server that scans databases to find motifs or patterns and that can generate sequence profiles.
- Pfam: multiple sequence alignments and HMMs of protein domains and families.
- PRINTS: database of groups of conserved motifs, or protein fingerprints.
- ProDom: protein domain families automatically generated from Swiss-Prot and TrEMBL.
- PROSITE: database of protein families and domains defined by functional sites, patterns and profiles.
- SMART: Simple Modular Architecture Research Tool for the identification of domains.
- COGs database: clusters of sequences determined by comparing sequences from whole genomes.

InterPro (Integrated resource of Protein Families, Domains and Sites)
Mulder NJ et al. (2005) InterPro, progress and status in 2005. Nucleic Acids Res. 33 (Database Issue): D201-5.

What is InterPro
- Secondary protein databases on functional sites and domains are vital resources for identifying distant relationships in novel sequences, and hence for predicting protein function and structure.
- Unfortunately, these signature databases do not share the same formats and nomenclature, and each database has its own strengths and weaknesses.
- Thus, for best results, search strategies should ideally combine all of them.

What is InterPro
- InterPro is a collaborative project aimed at providing an integrated layer on top of the most commonly used signature databases by creating a unique, non-redundant characterization of a given protein family, domain or functional site.
- Integrates the PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, PIR superfamily, SUPERFAMILY, Gene3D and PANTHER databases; the addition of others is scheduled.
- Has cross-references to the BLOCKS database as well as many specialized protein family and protein structure databases.

InterPro
- The latest release of InterPro (12.1) contains 12,953 entries, with 78% coverage of all proteins in UniProtKB.
- Each entry has annotation provided in the name, GO mapping and abstract fields, and all matches against the Swiss-Prot and TrEMBL components of UniProt are precomputed and available for viewing in different formats.
- Protein 3D structural information is integrated from MSD, CATH and SCOP, and these data are available in the match views to provide an at-a-glance comparison of sequence and structural domains.

InterPro
[Figures: dataflow scheme; InterProScan result]

PROSITE
Database of protein families and domains
- PROSITE consists of a large collection of biologically meaningful signatures, described as patterns or profiles, that help to reliably identify to which known protein family (if any) a new sequence belongs.
- The latest version (release 19.11) contains 1329 patterns and 552 profile entries.
- Each signature is linked to documentation providing information on the protein family or domain detected by the signature: origin of its name, taxonomic occurrence, domain architecture, function, 3D structure, main characteristics of the sequence, domain size and some references.

PRINTS
- The PRINTS database is a compendium of protein fingerprints.
- A fingerprint is a group of conserved sequence motifs that together provide diagnostic signatures for protein families.
- Fingerprints are diagnostically more powerful than single motifs because they make use of the biological context inherent in a multiple-motif method.
- The fingerprinting method is a reliable technique for detecting members of large, highly divergent protein super-families.

PFAM
- Database of multiple sequence alignments and HMMs of protein domains and families.
- Profile hidden Markov models are statistical models of the primary structure consensus of a sequence family.
- The construction and use of Pfam is tightly tied to the HMMER software package.
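PROSITE patterns, mentioned above, are written in a compact syntax that maps closely onto regular expressions. The sketch below converts the most common elements (`x` for any residue, `[ABC]` for allowed residues, `{ABC}` for forbidden residues, `(n)` or `(n,m)` for repeats, `<` and `>` for sequence ends) into a Python regex. It is a simplified illustration, not the ScanProsite implementation, and the pattern and test sequence are invented rather than taken from a PROSITE entry.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")          # patterns end with a period
    parts = []
    for element in pattern.split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # Split off an optional repeat count such as (3) or (2,4).
        repeat = ""
        if element.endswith(")"):
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":                 # any residue
            core = "."
        elif element.startswith("{"):      # forbidden residues
            core = "[^" + element[1:-1] + "]"
        else:                              # single residue or [ABC] class
            core = element
        parts.append(("^" if anchor_start else "") + core + repeat
                     + ("$" if anchor_end else ""))
    return "".join(parts)


# Hypothetical pattern and sequence, made up for this example.
regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVM]-{P}-H.")
match = re.search(regex, "AACAAACQQQLGHAA")
print(regex)
print(match.group(), match.start())
```

A pattern either matches or does not, which is why very short or permissive patterns occur by chance in many unrelated proteins; profiles score matches instead and are not covered by this sketch.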

PFAM
Composed of two sets of families:
- Pfam-A: curated part containing 8296 protein families
- Pfam-B: automatically generated supplement containing a large number of small families taken from the ProDom database that do not overlap with Pfam-A (lower quality)

PFAM
Each family has the following data:
- A seed alignment, which is a hand-edited multiple alignment representing the family.
- Hidden Markov Models (HMMs) derived from the seed alignment, which can be used to find new members of the domain and to realign a set of sequences to the model. One HMM is an ls mode (global) model, the other an fs mode (local) model.
- A full alignment, which is an automatic alignment of all the examples of the domain, using the two HMMs to find and then align the sequences.
- Annotation, which contains a brief description of the domain, links to other databases and some Pfam-specific data recording how the family was constructed.

[Screenshots: a PFAM entry; a PFAM entry, contd; PFAM searches; PFAM results]

PRODOM
- Database of protein domain families automatically generated from the Swiss-Prot and TrEMBL databases by sequence comparison.
- Useful for analysing the domain arrangements of complex protein families and the homology relationships in modular proteins.
- Contains (release 2003.1) 144,444 domain families containing two or more individual domains.

SMART
Simple Modular Architecture Research Tool
- Allows the identification and annotation of protein domains and the analysis of domain architectures.
- The current release has more than 600 domain families represented among nuclear, signalling and extracellular proteins.
- Extensive annotation for each domain family is available, providing information on function, subcellular localization, phyletic distribution and tertiary structure, plus links to OMIM in cases where a human disease is associated with one or more mutations in a particular domain.

Exercise 3: Domain search
1. Go to the PROSITE site.
2. Under "Tools for PROSITE" choose ScanProsite.
3. Paste the sequence below into the box and tick the option "Exclude patterns with a high probability of occurrence" (finding very common patterns will not tell you much about your protein).
MWAPRCRRFWSRWEQVAALLLLLLLLGVPPRSLALPPIRYSHAGICPNDMNPNLWVDAQSTCRRECETDQECETYEKCCPNVCGTKSCVAARYMDVKGKKGPVGMPKEAT



4. Search Pfam.
   1. Which domains are found?
   2. What may be the function of this protein?

Exercise 5: BLAST searches on your computer
1. Download the blast-2.2.14-ia32-linux.tar.gz file from
2. Make a subdirectory in your home directory:
   mkdir ~/blast
3. Move the BLAST file there:
   mv blast-2.2.14-ia32-linux.tar.gz ~/blast/
4. Go to the blast directory:
   cd ~/blast/
5. Unzip the file:
   gunzip blast-2.2.14-ia32-linux.tar.gz
6. Unpack it:
   tar xvf blast-2.2.14-ia32-linux.tar
7. Get the first 100 human proteins in Swiss-Prot:
   - go to
   - click on Start
   - unmark TrEMBL, to search only in Swiss-Prot
   - press Continue
   - select Organism in the first Info line and type in human
   - press Do Query; this will retrieve all human proteins in Swiss-Prot in batches of 100
   - press Save
   - change View to FastaSeqs
   - change Sequence Format to fasta
   - press SAVE
8. Save the file, e.g. as 100seq.fa
9. Format your database of 100 sequences to make it searchable by BLAST:
   ~/blast/blast-2.2.14/bin/formatdb -i 100seq.fa
10. Now that you have a searchable database, you can search it with an input sequence of your choice. E.g. make a file from the first sequence in 100seq.fa: select the first sequence with the mouse, type
    cat > seq1.fa
    paste the sequence into the terminal, then press Ctrl-D (end of input).
11. Now you have an input sequence and a database. Type:
    ~/blast/blast-2.2.14/bin/blastall -p blastp -i seq1.fa -d 100seq.fa -o seq1-vs-100seq.blastp
12. After it has finished running (it will be ready almost immediately) you will find your output in the file seq1-vs-100seq.blastp. If you invoke the blastall program without any switches, it will list all the options you can use.
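As an alternative to copying the first sequence with the mouse, the query file can also be produced by a few lines of code. The sketch below, in Python, pulls the first record out of a FASTA file; the helper names are hypothetical, and the file names `100seq.fa` and `seq1.fa` are the ones used in the exercise.

```python
def first_fasta_record(lines):
    """Return the lines of the first FASTA record (header plus sequence)."""
    record = []
    for line in lines:
        if line.startswith(">"):
            if record:               # a second header: the first record is done
                break
            record.append(line.rstrip("\n"))
        elif record and line.strip():
            record.append(line.strip())
    return record


def write_first_record(database_path, query_path):
    """Copy the first record of database_path into query_path."""
    with open(database_path) as handle:
        record = first_fasta_record(handle)
    with open(query_path, "w") as out:
        out.write("\n".join(record) + "\n")
    return record


# Small in-memory demonstration with made-up sequences.
example = [">toy1 hypothetical\n", "MKTAYIAK\n", "QRQISF\n", ">toy2\n", "MSDNG\n"]
first = first_fasta_record(example)
print(first)
```

With the downloaded file in the current directory, `write_first_record("100seq.fa", "seq1.fa")` would create the query file used in the blastall step.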
