HIRIS: HIV-1 Reservoirs Integration Sites “high-rise”

Using the per-gene GFF exports

A gene-specific annotation file is produced for every gene that overlaps an integration site in the database. The files are in GFF3 (“generic feature format”, version 3), a plain-text, tab-delimited format for storing sequence annotations.

You can use these files to visually annotate a gene sequence with integration sites from the database, using software such as Geneious.

Positions in the files are relative to the gene start, but the files expect the sequence they annotate to be oriented in the direction of the chromosome (that is, not reverse-complemented for genes on the minus strand). This means the GFF files pair nicely with partial chromosome sequences downloaded directly from GenBank.

For example, the BACH2 gene is on chromosome 6 from base 89,926,528 to 90,296,912. Downloading a FASTA or GenBank file at the link will produce a file suitable for annotating with the matching GFF file.
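To make the coordinate convention concrete, here is a minimal Python sketch that converts between gene-relative and chromosome positions for the BACH2 region above. It assumes gene-relative positions are 1-based, so that position 1 is the first base of the downloaded partial chromosome sequence; check this against your own files before relying on it. The function names are hypothetical and are not part of HIRIS.

```python
# Chromosome 6 range for BACH2, taken from the example above.
BACH2_CHROM_START = 89_926_528
BACH2_CHROM_END = 90_296_912


def gene_to_chromosome(gene_pos: int, region_start: int) -> int:
    """Convert a gene-relative position to a chromosome position.

    ASSUMPTION: gene-relative positions are 1-based and chromosome-oriented,
    so position 1 maps to region_start.
    """
    if gene_pos < 1:
        raise ValueError("gene-relative positions are assumed to start at 1")
    return region_start + gene_pos - 1


def chromosome_to_gene(chrom_pos: int, region_start: int) -> int:
    """Inverse of gene_to_chromosome, under the same 1-based assumption."""
    if chrom_pos < region_start:
        raise ValueError("position lies before the start of the region")
    return chrom_pos - region_start + 1


# Length of the partial chromosome sequence (both endpoints included).
region_length = BACH2_CHROM_END - BACH2_CHROM_START + 1

if __name__ == "__main__":
    print(region_length)
    print(gene_to_chromosome(1, BACH2_CHROM_START))
    print(gene_to_chromosome(region_length, BACH2_CHROM_START))
```

Under this assumption, the last gene-relative position maps back to the chromosome end coordinate, which is a quick sanity check that a downloaded sequence and a GFF file cover the same range.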

The NCBI Gene data source within Geneious also downloads suitable sequence documents.

Fields

Several fields are provided for each annotation. Below is a brief description of the fields as used in our GFF3 files.

Field       Note
seq_id      Formatted as “gene name - gene id”
source      The name of the data source for the integration site
type        Always proviral_location, a Sequence Ontology term
start       The integration site in gene-relative but chromosome-oriented coordinates
stop        Always equal to start, as required by the GFF3 specification for zero-length features
score       The multiplicity of the integration site
strand      Orientation of the virus relative to the chromosome
phase       Unused
attributes  Name is the subject or, if there is no subject, the integration environment. ID is an arbitrary number, unique within the file.
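
If you want to read the files outside of Geneious, the fields above can be parsed with a few lines of Python. The following is a minimal sketch, not an official HIRIS tool. It relies only on general GFF3 conventions: nine tab-separated columns, percent-encoding of reserved characters, and semicolon-separated key=value attributes. The sample line, including its gene name, gene id, source and subject, is made up for illustration, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional
from urllib.parse import unquote


@dataclass
class IntegrationSite:
    seq_id: str                 # "gene name - gene id"
    source: str                 # data source for the integration site
    type: str                   # expected to be "proviral_location"
    start: int                  # gene-relative, chromosome-oriented
    stop: int                   # equal to start for these files
    score: Optional[float]      # multiplicity, or None if given as "."
    strand: str                 # orientation of the virus
    phase: str                  # unused
    attributes: Dict[str, str]  # e.g. Name and ID


def parse_gff3_line(line: str) -> IntegrationSite:
    """Parse one tab-separated GFF3 feature line into an IntegrationSite."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    attributes = {}
    for pair in cols[8].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[unquote(key)] = unquote(value)
    return IntegrationSite(
        seq_id=unquote(cols[0]),
        source=unquote(cols[1]),
        type=cols[2],
        start=int(cols[3]),
        stop=int(cols[4]),
        score=None if cols[5] == "." else float(cols[5]),
        strand=cols[6],
        phase=cols[7],
        attributes=attributes,
    )


def read_sites(lines: Iterable[str]) -> Iterator[IntegrationSite]:
    """Yield features, skipping blank lines and '#' directives or comments."""
    for line in lines:
        if line.strip() and not line.startswith("#"):
            yield parse_gff3_line(line)


# HYPOTHETICAL sample line; every value here is invented for illustration.
SAMPLE = (
    "GENEX%20-%2012345\tExampleSource\tproviral_location\t"
    "1200\t1200\t3\t+\t.\tID=1;Name=subject-A\n"
)

if __name__ == "__main__":
    for site in read_sites(["##gff-version 3\n", SAMPLE]):
        print(site)
```

Because `read_sites` accepts any iterable of lines, an open file handle can be passed to it directly.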