Table Reference — HIRIS: HIV-1 Reservoirs Integration Sites

Tables

The columns of the primary data tables are described below. These tables make up the backbone of the database.

integration

The primary fact table in the database. Each row records a single occurrence of a provirus detected at the given integration site in the given subject by the given source data. Sites which are detected multiple times in a given subject are represented as multiple rows. Technical replicates are excluded if they’re from single genome sequencing rather than bulk consensus sequencing.

Column name	Data type	Attributes
`source_name`	`varchar`	not null refers to another table
`environment`	`integration_environment`	not null
`sample`	`jsonb`
`ltr`	`ltr_end`
`landmark`	`landmark`
`location`	`integer`
`orientation_in_landmark`	`orientation`
`sequence`	`text`
`sequence_junction`	`integer`
`note`	`text`
`info`	`jsonb`

ncbi_gene

Column name	Data type	Attributes
`ncbi_gene_id`	`integer`	not null
`name`	`text`	not null
`type`	`gene_type`

ncbi_gene_location

Column name	Data type	Attributes
`ncbi_gene_id`	`integer`	not null refers to another table
`landmark`	`landmark`	not null
`gene_start`	`integer`	not null
`gene_end`	`integer`	not null
`gene_range`	`int4range`	not null
`gene_orientation`	`orientation`	not null

source

Column name	Data type	Attributes
`source_name`	`varchar`	not null
`document`	`jsonb`
`revision`	`jsonb`

Views

Views are different representations of the data contained within the tables above. They pull together data from multiple tables in order to summarize it or make it easier to analyze. Views are also used to produce the downloadable datasets.

Note that integration sites which are covered by multiple genes will be reported once per gene in the "Integration summary with annotated genes" and "Summary by gene" views.

integration_gene_summary

The basis for the "Integration summary with annotated genes" downloadable dataset.

Column name	Data type	Attributes
`environment`	`integration_environment`
`subject`	`text`
`ncbi_gene_id`	`integer`
`gene`	`text`
`gene_type`	`gene_type`
`landmark`	`landmark`
`location`	`integer`
`orientation_in_landmark`	`orientation`
`orientation_in_gene`	`text`
`multiplicity`	`bigint`
`source_names`	`varchar[]`
`pubmed_ids`	`integer[]`

integration_genes

Column name	Data type	Attributes
`source_name`	`varchar`
`environment`	`integration_environment`
`sample`	`jsonb`
`ltr`	`ltr_end`
`landmark`	`landmark`
`location`	`integer`
`orientation_in_landmark`	`orientation`
`sequence`	`text`
`sequence_junction`	`integer`
`note`	`text`
`info`	`jsonb`
`ncbi_gene_id`	`integer`
`gene`	`text`
`gene_type`	`gene_type`
`orientation_in_gene`	`text`

integration_summary

The basis for the "Integration summary" downloadable dataset.

Column name	Data type	Attributes
`environment`	`integration_environment`
`subject`	`text`
`landmark`	`landmark`
`location`	`integer`
`orientation_in_landmark`	`orientation`
`multiplicity`	`bigint`
`source_names`	`varchar[]`
`pubmed_ids`	`integer[]`

sample_fields_metadata

Column name	Data type	Attributes
`field`	`text`
`count`	`bigint`
`values`	`bigint`
`sources`	`varchar[]`
`environments`	`text[]`

summary_by_gene

The basis for the "Summary by Gene" downloadable dataset.

Column name	Data type	Attributes
`environment`	`integration_environment`
`ncbi_gene_id`	`integer`
`gene`	`text`
`gene_type`	`gene_type`
`subjects`	`bigint`
`unique_sites`	`bigint`
`proliferating_sites`	`bigint`
`total_in_gene`	`numeric`

Glossary

Data type and attribute terms are defined below. You should also refer to the Data usage guidelines in the manual for more information.

landmark: The chromosome or RefSeq accession in which the reported integration site is found. See the section on landmarks in the manual.
orientation: Either F or R (forward or reverse) to indicate the direction of the provirus with respect to the chromosome/landmark
ltr_end: Either 5p or 3p to indicate which viral LTR was sequenced out of to determine the IS
integration_environment: Either in vivo or in vitro indicating where the integration event took place. Refer to the Data usage guidelines in the manual for more information.
gene_type: A fixed set of names for various classes of genes as used by NCBI Gene.
text: An arbitrary blob of text, which may be very short or very long
varchar: A short bit of text, usually limited in length
integer: A whole number
bigint: A whole number which may be very large
int4range: A range between two whole numbers, useful for checking region overlap, containment, adjacency, and other similar operations
jsonb: A JSON document, which can store many optional fields and values. These are used to record information which varies from source to source. Subject and PubMed ID (if any) are stored in a jsonb column, for example.
text[]
varchar[]
integer[]: An array (or list) of text, varchar, or integer values, respectively. For tables exported as CSV files, values will be separated by a pipe character |.
not null: Indicates a column which cannot be empty. All rows will have a value for the column.
refers to another table: Indicates a column whose value refers to a row in another table. These form the relationships between rows and tables in a relational database.

Sample document fields

If you’re writing SQL to query the database, you’re likely going to want to extract data fields out of the sample JSON document associated with each integration site observation. The document contains optional fields such as subject, pcr, pubmed_id. Below is an example query which demonstrates how to extract those fields into a column in a custom SQL query:

SELECT *,
       sample->>'subject'   as subject,
       sample->>'pcr'       as pcr,
       sample->>'pubmed_id' as pubid
  FROM integration

The following table summarizes what sample fields are currently in use:

Field	Sources	Environments	# of times present	# of unique values
`pubmed_id`	NCI-RID SherrillMix-2013 Sunshine-2016-JVI Wagner-2014-Science Wang-2007	in vitro in vivo	60,329	11
`tissue`	NCI-RID SherrillMix-2013 Sunshine-2016-JVI Wang-2007	in vitro in vivo	57,103	12
`tissue_url`	NCI-RID SherrillMix-2013 Sunshine-2016-JVI	in vitro	1,540	4
`original_id`	NCI-RID SherrillMix-2013	in vitro in vivo	15,163	12,953
`subject`	NCI-RID Wagner-2014-Science	in vivo	3,260	35
`years_on_art`	NCI-RID Wagner-2014-Science	in vivo	2,691	20
`provirus_activity`	SherrillMix-2013	in vitro	12,431	3
`pcr`	Wagner-2014-Science	in vivo	531	531
`genbank`	NCI-RID	in vivo	461	461
`clone`	NCI-RID	in vitro in vivo	37	37
`disease`	NCI-RID	in vivo	34	3

HIRIS: HIV-1 Reservoirs Integration Sites “high-rise”