Chado is a relational database schema that underlies many GMOD installations. It is capable of representing many of the general classes of data frequently encountered in modern biology such as sequence, sequence comparisons, phenotypes, genotypes, ontologies, publications, and phylogeny. It has been designed to handle complex representations of biological knowledge and should be considered one of the most sophisticated relational schemas currently available in molecular biology. The price of this capability is that the new user must spend some time becoming familiar with its fundamentals.
Chado is a modular schema for handling all kinds of biological data. It is intended to be used as both a primary datastore schema and a warehouse-style schema. The modules currently in Chado are:
|Companalysis||data from computational analysis|
|Contact||people and groups|
|Database (db)||references to external databases|
|Controlled Vocabulary (cv)||controlled vocabularies and ontologies|
|Expression||summarized RNA and protein expression|
|Genetic||genetic data and genotypes|
|Library||descriptions of molecular libraries|
|Map||maps without sequence|
|Publication (pub)||publications and references|
|Sequence||sequences and sequence features|
|Stock||specimens and biological collections|
|WWW||generic classes for web interfaces|
Chado is a community-designed database schema and, as such, the community has developed a wealth of documentation. If you would like more information, it likely exists within one of the pages below. If it doesn't, feel free to contact the community by filing an issue on GitHub.
|Table / View||Referenced||Foreign Keys||Columns||Type||Comments|
This represents the scanning of hybridized material. The output of this process is typically a digital image of an array.
Multiple monochrome images may be merged to form a multi-color image. Red-green images of 2-channel hybridizations are an example of this.
Parameters associated with image acquisition.
An analysis is a particular type of a computational analysis; it may be a blast of one sequence against another, or an all by all blast, or a different kind of analysis altogether. It is a single unit of computation.
Associate a term from a cv with an analysis.
Links an analysis to dbxrefs.
Provenance. Linking table between analyses and the publications that mention them.
Computational analyses generate features (e.g. Genscan generates transcripts and exons; sim4 alignments generate similarity/match features). analysisfeatures are stored using the feature table from the sequence module. The analysisfeature table is used to decorate these features with analysis-specific attributes. A feature is an analysisfeature if and only if there is a corresponding entry in the analysisfeature table. analysisfeatures will have two or more featureloc entries, with rank indicating query/subject.
General properties about an array. An array is a template used to generate physical slides, etc. It contains layout information, as well as global array properties, such as material (glass, nylon) and spot dimensions (in rows/columns).
Extra array design properties that are not accounted for in arraydesign.
An assay consists of a physical instance of an array, combined with the conditions used to create the array (protocols, technician information). The assay can be thought of as a hybridization.
A biomaterial can be hybridized many times (technical replicates), or combined with other biomaterials in a single hybridization (for two-channel arrays).
Link assays to projects.
Extra assay properties that are not accounted for in assay.
A biomaterial represents the MAGE concept of BioSource, BioSample, and LabeledExtract. It is essentially some biological material (tissue, cells, serum) that may have been processed. Processed biomaterials should be traceable back to raw biomaterials via the biomaterialrelationship table.
Relate biomaterials to one another. This is a way to track a series of treatments or material splits/merges, for instance.
Link biomaterials to treatments. Treatments have an order of operations (rank), and associated measurements (unittype_id, value).
Extra biomaterial properties that are not accounted for in biomaterial.
This table is different from other prop tables in the database, as it is for storing information about the database itself, such as the schema version.
Different array platforms can record signals from one or more channels (cDNA arrays typically use two CCD, but Affymetrix uses only one).
Model persons, institutes, groups, organizations, etc.
Model relationships between contacts.
A contact can have any number of slot-value property tags attached to it. This is an alternative to hardcoding a list of columns in the relational schema, and is completely extensible.
A controlled vocabulary or ontology. A cv is composed of cvterms (AKA terms, classes, types, universals - relations and properties are also stored in cvterm) and the relationships between them.
Additional extensible properties can be attached to a cv using this table. A notable example would be the cv version.
A term, class, universal or type within an ontology or controlled vocabulary. This table is also used for relations and properties. cvterms constitute nodes in the graph defined by the collection of cvterms and cvterm_relationships.
In addition to the primary identifier (cvterm.dbxref_id) a cvterm can have zero or more secondary identifiers/dbxrefs, which may refer to records in external databases. The exact semantics of cvterm_dbxref are not fixed. For example: the dbxref could be a pubmed ID that is pertinent to the cvterm, or it could be an equivalent or similar term in another ontology. For example, GO cvterms are typically linked to InterPro IDs, even though the nature of the relationship between them is largely one of statistical association. The dbxref may have data records attached in the same database instance, or it could be a "hanging" dbxref pointing to some external database. NOTE: If the desired objective is to link two cvterms together, and the nature of the relation is known and holds for all instances of the subject cvterm, then consider instead using cvterm_relationship together with a well-defined relation.
A relationship linking two cvterms. Each cvterm_relationship constitutes an edge in the graph defined by the collection of cvterms and cvterm_relationships. The meaning of the cvterm_relationship depends on the definition of the cvterm R referred to by type_id. However, in general the definitions are such that the statement "all SUBJs REL some OBJ" is true. The cvterm_relationship statement is about the subject, not the object. For example "insect wing part_of thorax".
The reflexive transitive closure of the cvterm_relationship relation.
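As a rough illustration of what a reflexive transitive closure contains, the sketch below computes one from a small list of subject/object edges. The function name `reflexive_transitive_closure`, the "insect body" term, and the idea of recording a path distance are assumptions made for this example only; relationship types are ignored for brevity, and nothing here describes how a Chado installation actually populates its closure.

```python
from collections import defaultdict


def reflexive_transitive_closure(edges):
    """Return {(subject, object): distance} for every reachable pair.

    `edges` is an iterable of (subject, object) pairs, one per
    cvterm_relationship-like row. Relationship types are ignored here.
    """
    parents = defaultdict(set)
    nodes = set()
    for subject, obj in edges:
        parents[subject].add(obj)
        nodes.update((subject, obj))

    closure = {}
    for start in nodes:
        # Reflexive part: every term reaches itself at distance 0.
        closure[(start, start)] = 0
        frontier = [start]
        distance = 0
        # Breadth-first walk, so the first distance recorded is the shortest.
        while frontier:
            distance += 1
            next_frontier = []
            for node in frontier:
                for parent in parents[node]:
                    if (start, parent) not in closure:
                        closure[(start, parent)] = distance
                        next_frontier.append(parent)
            frontier = next_frontier
    return closure


# Hypothetical toy graph: insect wing -> thorax -> insect body.
edges = [("insect wing", "thorax"), ("thorax", "insect body")]
closure = reflexive_transitive_closure(edges)
```

With these two edges the closure holds six pairs: three reflexive ones, the two direct edges, and the implied pair from "insect wing" to "insect body".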
Additional extensible properties can be attached to a cvterm using this table. Corresponds to -AnnotationProperty- in W3C OWL format.
A cvterm actually represents a distinct class or concept. A concept can be referred to by different phrases or names. In addition to the primary name (cvterm.name) there can be a number of alternative aliases or synonyms. For example, "T cell" as a synonym for "T lymphocyte".
A database authority. Typical databases in bioinformatics are FlyBase, GO, UniProt, NCBI, MGI, etc. The authority is generally known by this shortened form, which is unique within the bioinformatics and biomedical realm. To Do - add support for URIs, URNs (e.g. LSIDs). We can do this by treating the URL as a URI - however, some applications may expect this to be resolvable - to be decided.
An external database can have any number of slot-value property tags attached to it. This is an alternative to hardcoding a list of columns in the relational schema, and is completely extensible. There is a unique constraint, dbprop_c1, for the combination of db_id, rank, and type_id. Multivalued property-value pairs must be differentiated by rank.
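The role of rank in a property table can be sketched with an in-memory SQLite database. This is a simplified stand-in, not the real Chado DDL: the `dbprop_id` and `value` columns, their types, and the default rank of 0 are assumptions for illustration, and only the constraint name and its three columns come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dbprop (
        dbprop_id INTEGER PRIMARY KEY,   -- assumed surrogate key
        db_id     INTEGER NOT NULL,
        type_id   INTEGER NOT NULL,
        value     TEXT,                  -- assumed value column
        rank      INTEGER NOT NULL DEFAULT 0,
        CONSTRAINT dbprop_c1 UNIQUE (db_id, type_id, rank)
    )
    """
)

insert = "INSERT INTO dbprop (db_id, type_id, value, rank) VALUES (?, ?, ?, ?)"

# Two values for the same property type are fine when the ranks differ.
conn.execute(insert, (1, 10, "first value", 0))
conn.execute(insert, (1, 10, "second value", 1))

# Reusing an existing (db_id, type_id, rank) combination is rejected.
try:
    conn.execute(insert, (1, 10, "duplicate rank", 1))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

values = [
    row[0]
    for row in conn.execute(
        "SELECT value FROM dbprop WHERE db_id = 1 AND type_id = 10 ORDER BY rank"
    )
]
```

Ordering by rank gives multivalued properties a stable, explicit order when they are read back.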
A unique, global, public, stable identifier. Not necessarily an external reference - can reference data items inside the particular chado instance being used. Typically a row in a table can be uniquely identified with a primary identifier (called dbxref_id); a table may also have secondary identifiers (in a linking table
Metadata about a dbxref. Note that this is not defined in the dbxref module, as it depends on the cvterm table. This table has a structure analogous to cvtermprop.
Represents a feature of the array. This is typically a region of the array coated or bound to DNA.
Sometimes we want to combine measurements from multiple elements to get a composite value. Affymetrix combines many probes to form a probeset measurement, for instance.
An element on an array produces a measurement when hybridized to a biomaterial (traceable through quantification_id). This is the base data from which tables that actually contain data inherit.
Sometimes we want to combine measurements from multiple elements to get a composite value. Affymetrix combines many probes to form a probeset measurement, for instance.
The environmental component of a phenotype description.
The expression table is essentially a bridge table.
Extensible properties for expression to cvterm associations. Examples: qualifiers.
A feature is a biological sequence or a section of a biological sequence, or a collection of such sections. Examples include genes, exons, transcripts, regulatory regions, polypeptides, protein domains, chromosome sequences, sequence variations, cross-genome match regions such as hits and HSPs and so on; see the Sequence Ontology for more. The combination of organism_id, uniquename and type_id should be unique.
Links contact(s) with a feature. Used to indicate a particular person or organization responsible for discovery or that can provide more information on a particular feature.
Associate a term from a cv with a feature, for example, GO annotation.
Additional dbxrefs for an association. Rows in the feature_cvterm table may be backed up by dbxrefs. For example, a feature_cvterm association that was inferred via a protein-protein interaction may be backed up by referring to the dbxref for the alternate protein. Corresponds to the WITH column in a GO gene association file (but can also be used for other analogous associations). See http://www.geneontology.org/doc/GO.annotation.shtml#file for more details.
Secondary pubs for an association. Each feature_cvterm association is supported by a single primary publication. Additional secondary pubs can be added using this linking table (in a GO gene association file, these correspond to any IDs after the pipe symbol in the publications column).
Extensible properties for feature to cvterm associations. Examples: GO evidence codes; qualifiers; metadata such as the date on which the entry was curated and the source of the association. See the featureprop table for meanings of type_id, value and rank.
Links a feature to dbxrefs.
Extensible properties for feature_expression (comments, for example). Modeled on feature_cvtermprop.
Linking table between features and phenotypes.
Provenance. Linking table between features and publications that mention them.
Property or attribute of a feature_pub link.
Features can be arranged in graphs, e.g. "exon part_of transcript part_of gene". If type is thought of as a verb, then each arc or edge makes a statement of the form Subject Verb Object. The object can also be thought of as the parent (containing feature), and the subject as the child (contained feature or subfeature). We include the relationship rank/order because, even though most of the time we can order things implicitly by sequence coordinates, we cannot always do this, e.g. for trans-spliced genes. It is also useful for quickly getting implicit introns.
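The subject/verb/object reading and the use of rank can be sketched in a few lines of Python. The `FeatureRelationship` class, the helper functions, and the feature names below are hypothetical and only mirror the ideas in the paragraph above; they are not Chado column definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureRelationship:
    subject: str  # child feature, e.g. an exon
    type: str     # the "verb", e.g. part_of
    object: str   # parent feature, e.g. a transcript
    rank: int     # explicit order among siblings


def ordered_children(relationships, parent, rel_type="part_of"):
    """Return the subjects attached to `parent`, ordered by rank."""
    rows = [r for r in relationships if r.object == parent and r.type == rel_type]
    return [r.subject for r in sorted(rows, key=lambda r: r.rank)]


def implicit_introns(exons_in_order):
    """Pair consecutive exons; each pair brackets one implicit intron."""
    return list(zip(exons_in_order, exons_in_order[1:]))


# Rows are deliberately stored out of order; rank restores the order
# without consulting sequence coordinates.
relationships = [
    FeatureRelationship("exon-3", "part_of", "transcript-1", 2),
    FeatureRelationship("exon-1", "part_of", "transcript-1", 0),
    FeatureRelationship("exon-2", "part_of", "transcript-1", 1),
    FeatureRelationship("transcript-1", "part_of", "gene-1", 0),
]
exons = ordered_children(relationships, "transcript-1")
introns = implicit_introns(exons)
```

Because the order comes from rank rather than coordinates, the same approach still works when exons lie on different strands or molecules.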
Provenance. Attach optional evidence to a feature_relationship in the form of a publication.
Extensible properties for feature_relationships. Analogous structure to featureprop. This table is largely optional and not used with a high frequency. A typical scenario is attaching additional data to a feature_relationship, for example to say that the feature_relationship is only true in certain contexts.
Provenance for feature_relationshipprop.
Linking table between feature and synonym.
The location of a feature relative to another feature. Important: interbase coordinates are used. This is vital as it allows us to represent zero-length features e.g. splice sites, insertion points without an awkward fuzzy system. Features typically have exactly ONE location, but this need not be the case. Some features may not be localized (e.g. a gene that has been characterized genetically but no sequence or molecular information is available). Note on multiple locations: Each feature can have 0 or more locations. Multiple locations do NOT indicate non-contiguous locations (if a feature such as a transcript has a non-contiguous location, then the subfeatures such as exons should always be manifested). Instead, multiple featurelocs for a feature designate alternate locations or grouped locations; for instance, a feature designating a blast hit or hsp will have two locations, one on the query feature, one on the subject feature. Features representing sequence variation could have alternate locations instantiated on a feature on the mutant strain. The column:rank is used to differentiate these different locations. Reflexive locations should never be stored - this is for -proper- (i.e. non-self) locations only; nothing should be located relative to itself.
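Interbase coordinates count the positions between bases rather than the bases themselves. The sketch below converts an inclusive 1-based range to interbase form and shows why a zero-length feature needs no special handling. The names `fmin`, `fmax`, `to_interbase` and `feature_length` are chosen for this example and are assumptions, not something stated in the text above.

```python
def to_interbase(start_1based, end_1based):
    """Convert an inclusive 1-based range to interbase (fmin, fmax)."""
    return start_1based - 1, end_1based


def feature_length(fmin, fmax):
    """In interbase coordinates, length is a plain subtraction."""
    return fmax - fmin


sequence = "ACGTACGT"

# Bases 3 through 5 in 1-based numbering.
fmin, fmax = to_interbase(3, 5)

# Python slices use the same between-the-bases convention.
subsequence = sequence[fmin:fmax]

# A zero-length feature: an insertion point between bases 4 and 5.
insertion_point = (4, 4)
```

An insertion point is simply a location whose two boundaries coincide, so its length is 0 and no fuzzy encoding is required.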
Provenance of featureloc. Linking table between featurelocs and publications that mention them.
Links contact(s) with a featuremap. Used to indicate a particular person or organization responsible for construction of, or that can provide more information on, a particular featuremap.
Links a featuremap to the organism(s) with which it is associated.
A featuremap can have any number of slot-value property tags attached to it. This is an alternative to hardcoding a list of columns in the relational schema, and is completely extensible.
Property or attribute of a featurepos record.
A feature can have any number of slot-value property tags attached to it. This is an alternative to hardcoding a list of columns in the relational schema, and is completely extensible.
Provenance. Any featureprop assignment can optionally be supported by a publication.
In cases where the start and end of a mapped feature is a range, leftendf and rightstartf are populated. leftstartf_id, leftendf_id, rightstartf_id, rightendf_id are the ids of features with respect to which the feature is being mapped. These may be cytological bands.
Genetic context. A genotype is defined by a collection of features, mutations, balancers, deficiencies, haplotype blocks, or engineered constructs.
Links contact(s) with a library. Used to indicate a particular person or organization responsible for creation of or that can provide more information on a particular library.
The table library_cvterm links a library to controlled vocabularies which describe the library. For instance, there might be a link to the anatomy cv for "head" or "testes" for a head or testes library.
Links a library to dbxrefs.
Links a library to expression statements.
Attributes of a library_expression relationship.
library_feature links a library to the clones which are contained in the library. Examples of such linked features might be "cDNA_clone" or "genomic_clone".
Attributes of a library_feature relationship.
Attribution for a library.
Relationships between libraries.
Provenance of library_relationship.
Linking table between library and synonym.
Tag-value properties - follows standard chado model.
Attribution for libraryprop.
This table is for storing extra bits of MAGEml in a denormalized form. More normalization would require many more tables.
This is the core table for the natural diversity module, representing each individual assay that is undertaken (this is usually not an entire experiment). Each nd_experiment should give rise to a single genotype or phenotype and be described via 1 (or more) protocols. Collections of assays that relate to each other should be linked to the same record in the project table.
An analysis that is used in an experiment.
Cross-reference experiment to accessions, images, etc.
Linking table: experiments to the genotypes they produce. There is a one-to-one relationship between an experiment and a genotype since each genotype record should point to one experiment. Add a new experiment_id for each genotype record.
Linking table: experiments to the phenotypes they produce. There is a one-to-one relationship between an experiment and a phenotype since each phenotype record should point to one experiment. Add a new experiment_id for each phenotype record.
Used to group together related nd_experiment records. All nd_experiments should be linked to at least one project.
Linking table: experiments to the protocols they involve.
Linking nd_experiment(s) to publication(s).
Part of a stock or a clone of a stock that is used in an experiment.
Cross-reference experiment_stock to accessions, images, etc.
Property/value associations for experiment_stocks. This table can store properties such as treatment.
An nd_experiment can have any number of slot-value property tags attached to it. This is an alternative to hardcoding a list of columns in the relational schema, and is completely extensible. There is a unique constraint, stockprop_c1, for the combination of stock_id, rank, and type_id. Multivalued property-value pairs must be differentiated by rank.
The geo-referencable location of the stock. NOTE: This entity is subject to change as a more general and possibly more OpenGIS-compliant geolocation module may be introduced into Chado.
Property/value associations for geolocations. This table can store properties such as location and environment.
A protocol can be anything that is done as part of the experiment.
Property/value associations for protocol.
A reagent such as a primer, an enzyme, an adapter oligo, a linker oligo. Reagents are used in genotyping experiments, or in any other kind of experiment.
Relationships between reagents. Some reagents form a group. i.e., they are used all together or not at all. Examples are adapter/linker/enzyme experiment reagents.
The organismal taxonomic classification. Note that phylogenies are represented using the phylogeny module, and taxonomies can be represented using the cvterm module or the phylogeny module.
Organism to cvterm associations. Examples: taxonomic name.
Extensible properties for organism to cvterm associations. Examples: qualifiers.
Links an organism to a dbxref.
Attribution for organism.
Specifies relationships between organisms that are not taxonomic. For example, in breeding, relationships such as "sterile_with", "incompatible_with", or "fertile_with" would be appropriate. Taxonomic relationships should be housed in the phylogeny tables.
Tag-value properties - follows standard chado model.
Attribution for organismprop.
A summary of a set of phenotypic statements for any one gcontext made in any one publication.
A phenotypic statement, or a single atomic phenotypic observation, is a controlled sentence describing observable effects of non-wild type function. E.g. Obs=eye, attribute=color, cvalue=red.
Comparison of phenotypes e.g., genotype1/environment1/phenotype1 "non-suppressible" with respect to genotype2/environment2/phenotype2.
phenotype to cvterm associations.
A phenotype can have any number of slot-value property tags attached to it. This is an alternative to hardcoding a list of columns in the relational schema, and is completely extensible. There is a unique constraint, phenotypeprop_c1, for the combination of phenotype_id, rank, and type_id. Multivalued property-value pairs must be differentiated by rank.
Phenotypes are things like "larval lethal". Phenstatements are things like "dpp-1 is recessive larval lethal". So essentially phenstatement is a linking table expressing the relationship between genotype, environment, and phenotype.
This is the most pervasive element in the phylogeny module, cataloging the "phylonodes" of tree graphs. Edges are implied by the parent_phylonode_id reflexive closure. For all nodes in a nested set implementation, the left and right indexes will be between the parent's left and right indexes.
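The nested set property can be illustrated with a small depth-first numbering. This is a sketch under assumptions: the function names, the `{child: parent}` input shape, and the node labels are invented for the example and say nothing about how any particular loader assigns the indexes.

```python
def assign_nested_set_indexes(parent_of):
    """Return {node: (left, right)} for a tree given as {child: parent}.

    The root is the node whose parent is None.
    """
    children = {}
    root = None
    for node, parent in parent_of.items():
        children.setdefault(node, [])
        if parent is None:
            root = node
        else:
            children.setdefault(parent, []).append(node)

    indexes = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter          # index taken on the way down
        for child in children[node]:
            visit(child)
        counter += 1
        indexes[node] = (left, counter)  # right index taken on the way up

    visit(root)
    return indexes


def is_descendant(indexes, node, ancestor):
    """A node lies under an ancestor if its indexes nest inside the ancestor's."""
    left, right = indexes[node]
    a_left, a_right = indexes[ancestor]
    return a_left < left and right < a_right


# Hypothetical five-node tree.
parent_of = {"root": None, "A": "root", "B": "root", "A1": "A", "A2": "A"}
indexes = assign_nested_set_indexes(parent_of)
```

The payoff of this layout is that subtree membership becomes a pair of integer comparisons instead of a walk up the parent links.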
For example, for orthology, paralogy group identifiers; could also be used for NCBI taxonomy; for sequences, refer to phylonode_feature, feature associated dbxrefs.
This linking table should only be used for nodes in taxonomy trees; it provides a mapping between the node and an organism. One node can have zero or one organisms, one organism can have zero or more nodes (although typically it should only have one in the standard NCBI taxonomy tree).
This is for relationships that are not strictly hierarchical; for example, horizontal gene transfer. Most phylogenetic trees are strictly hierarchical, nevertheless it is here for completeness.
Global anchor for phylogenetic tree.
Tracks citations global to the tree e.g. multiple sequence alignment supporting tree construction.
A phylotree can have any number of slot-value property tags attached to it. This is an alternative to hardcoding a list of columns in the relational schema, and is completely extensible.
Standard Chado flexible property table for projects.
Links an analysis to a project that may contain multiple analyses. The rank column can be used to specify a simple ordering in which analyses were executed.
Linking table for associating projects and contacts.
project_dbxref links a project to dbxrefs.
This table is intended to associate records in the feature table with a project.
Linking table for associating projects and publications.
Linking table for relating projects to each other. For example, a given project could be composed of several smaller subprojects.
This table is intended to associate records in the stock table with a project.
Procedural notes on how data was prepared and processed.
Parameters related to a protocol. For example, if the protocol is a soak, this might include attributes of bath temperature and duration.
A documented provenance artefact - publications, documents, personal communication.
Handle links to repositories, e.g. Pubmed, Biosis, zoorec, OCLC, Medline, ISSN, coden...
Handle relationships between publications, e.g. when one publication makes others obsolete, when one publication contains errata with respect to other publication(s), or when one publication also appears in another pub.
An author for a publication. Note the denormalisation (hence lack of _ in table name) - this is deliberate as it is in general too hard to assign IDs to authors.
An author on a publication may have a corresponding entry in the contact table and this table can link the two.
Property-value pairs for a pub. Follows standard chado pattern.
Quantification is the transformation of an image acquisition to numeric data. This typically involves statistical procedures.
There may be multiple rounds of quantification, this allows us to keep an audit trail of what values went where.
Extra quantification properties that are not accounted for in quantification.
Any stock can be globally identified by the combination of organism, uniquename and stock type. A stock is a physical entity, either living or preserved, held by a collection. Stocks belong to a collection; they have IDs, type, organism, description and may have a genotype.
stock_cvterm links a stock to cvterms. This is for secondary cvterms; primary cvterms should use stock.type_id.
Extensible properties for stock to cvterm associations. Examples: GO evidence codes; qualifiers; metadata such as the date on which the entry was curated and the source of the association. See the stockprop table for meanings of type_id, value and rank.
stock_dbxref links a stock to dbxrefs. This is for secondary identifiers; primary identifiers should use stock.dbxref_id.
A stock_dbxref can have any number of slot-value property tags attached to it. This is useful for storing properties related to dbxref annotations of stocks, such as evidence codes, and references, and metadata, such as create/modify dates. This is an alternative to hardcoding a list of columns in the relational schema, and is completely extensible. There is a unique constraint, stock_dbxrefprop_c1, for the combination of stock_dbxref_id, rank, and type_id. Multivalued property-value pairs must be differentiated by rank.
Links a stock to a feature.
Links a featuremap to a stock.
Simple table linking a stock to a genotype. Features with genotypes can be linked to stocks through feature_genotype -> genotype -> stock_genotype -> stock.
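The feature_genotype -> genotype -> stock_genotype -> stock path can be followed with a chain of joins. The sketch below uses in-memory SQLite with heavily reduced tables; the column names (`feature_id`, `genotype_id`, `stock_id`, `uniquename`) and all row values are assumptions for illustration and omit most of what the real tables contain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Reduced stand-ins for the real tables (assumed columns only).
    CREATE TABLE feature  (feature_id  INTEGER PRIMARY KEY, uniquename TEXT);
    CREATE TABLE genotype (genotype_id INTEGER PRIMARY KEY, uniquename TEXT);
    CREATE TABLE stock    (stock_id    INTEGER PRIMARY KEY, uniquename TEXT);
    CREATE TABLE feature_genotype (
        feature_id  INTEGER REFERENCES feature (feature_id),
        genotype_id INTEGER REFERENCES genotype (genotype_id)
    );
    CREATE TABLE stock_genotype (
        stock_id    INTEGER REFERENCES stock (stock_id),
        genotype_id INTEGER REFERENCES genotype (genotype_id)
    );

    INSERT INTO feature  VALUES (1, 'allele-a'), (2, 'allele-b');
    INSERT INTO genotype VALUES (1, 'genotype-x');
    INSERT INTO stock    VALUES (1, 'stock-001'), (2, 'stock-002');
    INSERT INTO feature_genotype VALUES (1, 1);
    INSERT INTO stock_genotype   VALUES (1, 1), (2, 1);
    """
)

# Walk feature -> feature_genotype -> genotype -> stock_genotype -> stock.
query = """
    SELECT s.uniquename
    FROM feature f
    JOIN feature_genotype fg ON fg.feature_id = f.feature_id
    JOIN genotype g          ON g.genotype_id = fg.genotype_id
    JOIN stock_genotype sg   ON sg.genotype_id = g.genotype_id
    JOIN stock s             ON s.stock_id = sg.stock_id
    WHERE f.uniquename = ?
    ORDER BY s.uniquename
"""
stocks_for_a = [row[0] for row in conn.execute(query, ("allele-a",))]
stocks_for_b = [row[0] for row in conn.execute(query, ("allele-b",))]
```

A feature with no genotype link, such as the hypothetical 'allele-b' here, simply returns no stocks.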
Links a stock with a library.
Provenance. Linking table between stocks and, for example, a stocklist computer file.
For germplasm maintenance and pedigree data, stock_relationship.type_id will record cvterms such as "is a female parent of", "a parent for mutation", "is a group_id of", "is a source_id of", etc. The cvterms for higher categories such as "generative", "derivative" or "maintenance" can be stored in table stock_relationship_cvterm.
Provenance. Attach optional evidence to a stock_relationship in the form of a publication.
The lab or stock center distributing the stocks in their collection.
Stock collections may be represented by an external online database. This table associates a stock collection with a database where its member stocks can be found. Individual stocks that are part of this collection should have entries in the stock_dbxref table with the same db_id record.
stockcollection_stock links a stock collection to the stocks which are contained in the collection.
The stockcollectionprop table holds property values for a stock collection, such as website or email URLs and stock order URLs.
A stock can have any number of slot-value property tags attached to it. This is an alternative to hardcoding a list of columns in the relational schema, and is completely extensible. There is a unique constraint, stockprop_c1, for the combination of stock_id, rank, and type_id. Multivalued property-value pairs must be differentiated by rank.
Provenance. Any stockprop assignment can optionally be supported by a publication.
A synonym for a feature. One feature can have multiple synonyms, and the same synonym can apply to multiple features.
A biomaterial may undergo multiple treatments. Examples of treatments: apoxia, fluorophore and biotin labeling.