March Release
-------------

Finally, the March release is available. The April release should
follow in a few days. Scroll down for details of some schema changes
that may affect you.

Our apologies for the late arrival of the March release, and for the
slow turnover of releases in general. Building the full GO database
requires parsing and storing a large amount of data - the current
release has over 5 million associations and 1 million genes and gene
products, nearly one fifth of which have associated sequences.

The build process consists of a chain of automated scripts which load
the various files into the database - this can take a few days. At the
moment the build process is highly sensitive to the environment around
it - a database server becoming unavailable or an NFS problem can
short-circuit the whole process, requiring manual intervention and a
restart from the beginning. This is compounded by problems with the
external source data files, which are outside our control - data that
does not adhere to the GO formatting conventions causes further
failures.

This has caused a few problems in the past - the January release
contained no definitions in the 'assocdb' slice of the data, only in
the 'termdb' slice. Apologies for letting this slip through.

We are taking a number of steps to fix this:

* Construction of a tool that helps automate, manage and QC the build
  and release process

* Assigning more human resources to the release process (at the
  moment, most of us have commitments to various other projects)

* In the long term we will switch to using PostgreSQL instead of
  MySQL, which in our experience provides additional safety features
  (transactions, foreign key integrity)

As a number of our users require only the 'termdb' distribution
(i.e. no gene product associations), we are restarting the daily cycle
for this data; see below.

-- Release cycle changes:

The monthly releases should be more timely in future.
In addition, there will be an automated daily release of the 'termdb'
distribution. This contains just the controlled vocabularies
themselves, no gene products. There is no manual QC of this release -
this will hopefully not be a problem, as the source data for it is much
more tightly controlled, so the chance of a malformed file corrupting
the database is relatively low.

A number of people have had difficulties locating the database schema,
either in CVS or as part of the code release. We are now including an
extra file:

  go_YYYYMM-schema-mysql.sql.gz

which contains the (MySQL-ported) schema used in this release.

You can still get the Perl utilities and the full SQL directory, split
into modules and including Oracle tips, from this file:

  go_YYYYMM-utilities-src.tar.gz

-- Schema changes:

As of this release, there is a new module, 'go-audit' - this tracks
which files were loaded and when, records the time each GO term was
loaded, and stores metadata about the current release.

Comments have been added to the term_definition table.

Dbxrefs for the terms are now distinguished from definition dbxrefs
(see the GO docs on www.geneontology.org for an explanation).

-- Data changes:

Previously the species table was used just for storing the NCBI
taxonomy ID. As of now, the full species table will be populated,
allowing you to query by either common name (e.g. 'fruitfly') or
genus/species.
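
The two kinds of species lookup described above might look roughly like
the sketch below. It is an illustration under assumptions, not the
release's actual schema: it uses an in-memory SQLite database as a
stand-in for MySQL, the column names (ncbi_taxa_id, common_name, genus,
species) and both helper functions are hypothetical, and the two rows
are sample data. Check go_YYYYMM-schema-mysql.sql.gz for the real table
definition before adapting it.

```python
import sqlite3

# Stand-in database: in-memory SQLite instead of the MySQL release.
conn = sqlite3.connect(":memory:")

# ASSUMPTION: simplified species table; column names are hypothetical.
conn.execute(
    """
    CREATE TABLE species (
        id           INTEGER PRIMARY KEY,
        ncbi_taxa_id INTEGER,
        common_name  TEXT,
        genus        TEXT,
        species      TEXT
    )
    """
)

# Sample rows for illustration only.
conn.executemany(
    "INSERT INTO species (ncbi_taxa_id, common_name, genus, species) "
    "VALUES (?, ?, ?, ?)",
    [
        (7227, "fruitfly", "Drosophila", "melanogaster"),
        (10090, "mouse", "Mus", "musculus"),
    ],
)


def taxa_id_by_common_name(conn, common_name):
    """Return the NCBI taxonomy ID for a common name, or None if absent."""
    row = conn.execute(
        "SELECT ncbi_taxa_id FROM species WHERE common_name = ?",
        (common_name,),
    ).fetchone()
    return row[0] if row else None


def taxa_id_by_binomial(conn, genus, species):
    """Return the NCBI taxonomy ID for a genus/species pair, or None."""
    row = conn.execute(
        "SELECT ncbi_taxa_id FROM species WHERE genus = ? AND species = ?",
        (genus, species),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    print(taxa_id_by_common_name(conn, "fruitfly"))
    print(taxa_id_by_binomial(conn, "Mus", "musculus"))
```

Both helpers use parameterised queries, so the same statements should
carry over to a MySQL client library with only the placeholder style
changed.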