Documentation — HIRIS: HIV-1 Reservoirs Integration Sites

Using the per-gene GFF exports

A gene-specific annotation file is produced for every gene that overlaps an integration site in the database. The file format is GFF3 (“generic feature format”, version 3), a plain-text, tab-delimited format for storing sequence annotations.

You can use these files to visually annotate a gene sequence with integration sites from the database, using software such as Geneious.

The files use positions relative to the gene start, but expect the sequence they annotate to be oriented in the direction of the chromosome, regardless of the gene's own strand. This means the GFF files pair well with partial chromosome sequences as downloaded directly from GenBank.

For example, the BACH2 gene is on chromosome 6 from base 89,926,528 to 90,296,912. Downloading that region from GenBank as a FASTA or GenBank file produces a sequence suitable for annotating with the matching GFF file.
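To relate an annotation back to the chromosome, a gene-relative position can be offset by the gene start. The following is a minimal sketch, not part of HIRIS itself. It assumes positions are 1-based (as in GFF3) and that position 1 corresponds to the first base of the downloaded region; the function names are hypothetical.

```python
# Assumption: gene-relative positions are 1-based and position 1 is the
# first base of the chromosome region (the gene start).
BACH2_START = 89_926_528  # chromosome 6 start of BACH2, from the example above
BACH2_END = 90_296_912    # chromosome 6 end of BACH2, from the example above


def gene_to_chromosome(gene_pos: int, gene_start: int) -> int:
    """Convert a 1-based gene-relative position to a chromosome position."""
    if gene_pos < 1:
        raise ValueError("gene-relative positions are assumed to start at 1")
    return gene_start + gene_pos - 1


def chromosome_to_gene(chrom_pos: int, gene_start: int) -> int:
    """Convert a chromosome position to a 1-based gene-relative position."""
    if chrom_pos < gene_start:
        raise ValueError("position lies before the gene start")
    return chrom_pos - gene_start + 1


# Length of the downloaded region, counting both end points.
region_length = BACH2_END - BACH2_START + 1
print(region_length)  # 370385
```

Under these assumptions, the last gene-relative position (`region_length`) maps back to the gene end on chromosome 6.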

The NCBI Gene data source within Geneious also downloads suitable sequence documents.

Fields

Several fields are provided for each annotation. Below is a brief description of the fields as used by our GFF3 files.

Field	Note
`seq_id`	Formatted as “gene name - gene id”
`source`	The name of the data source for the integration site
`type`	Always `proviral_location`, a Sequence Ontology term
`start`	The integration site in gene-relative but chromosome-oriented coordinates
`stop`	Always equal to `start`, as required by the GFF3 specification for zero-length features
`score`	The multiplicity of the integration site
`strand`	Orientation of the virus relative to the chromosome
`phase`	Unused
`attributes`	`Name` is the subject or, if there is no subject, the integration environment.
	`ID` is an arbitrary number, unique within the file.
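
The fields above map onto the nine tab-separated columns of a GFF3 line. As a rough illustration, the sketch below splits one line into those fields using only the Python standard library. The `IntegrationSite` type, the `parse_gff3_line` function, and the example line are all hypothetical and made up for this illustration; they are not taken from an actual export.

```python
from typing import NamedTuple
from urllib.parse import unquote


class IntegrationSite(NamedTuple):
    """One annotation line, using the field names from the table above."""
    seq_id: str
    source: str
    type: str
    start: int
    stop: int
    score: str       # kept as text; GFF3 allows "." for a missing score
    strand: str
    phase: str
    attributes: dict


def parse_gff3_line(line: str) -> IntegrationSite:
    """Split a single tab-separated GFF3 feature line into its nine fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    attrs = {}
    for pair in cols[8].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            # GFF3 percent-encodes reserved characters in attribute values.
            attrs[unquote(key)] = unquote(value)
    return IntegrationSite(
        cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
        cols[5], cols[6], cols[7], attrs,
    )


# Hypothetical line, invented for illustration only.
example = "\t".join([
    "EXAMPLEGENE - 12345", "ExampleSource", "proviral_location",
    "1500", "1500", "2", "+", ".", "ID=1;Name=in vivo",
])
site = parse_gff3_line(example)
print(site.start, site.strand, site.attributes["Name"])
```

When reading a whole file, lines starting with `#` (such as the `##gff-version 3` header) would need to be skipped before calling the parser.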
