HIRIS: HIV-1 Reservoirs Integration Sites “high-rise”

Tables

The columns of the primary data tables are described below. These tables make up the backbone of the database.

integration

The primary fact table in the database. Each row records a single occurrence of a provirus detected at the given integration site in the given subject by the given source data. Sites which are detected multiple times in a given subject are represented as multiple rows. Technical replicates are excluded if they’re from single genome sequencing rather than bulk consensus sequencing.

Column name Data type Attributes
source_name varchar not null refers to another table
environment integration_environment not null
sample jsonb
ltr ltr_end
landmark landmark
location integer
orientation_in_landmark orientation
sequence text
sequence_junction integer
note text
info jsonb

ncbi_gene

Column name Data type Attributes
ncbi_gene_id integer not null
name text not null
type gene_type

ncbi_gene_location

Column name Data type Attributes
ncbi_gene_id integer not null refers to another table
landmark landmark not null
gene_start integer not null
gene_end integer not null
gene_range int4range not null
gene_orientation orientation not null

source

Column name Data type Attributes
source_name varchar not null
document jsonb
revision jsonb

Views

Views are different representations of the data contained within the tables above. They pull together data from multiple tables in order to summarize it or make it easier to analyze. Views are also used to produce the downloadable datasets.

Note that integration sites which are covered by multiple genes will be reported once per gene in the "Integration summary with annotated genes" and "Summary by gene" views.

integration_gene_summary

The basis for the "Integration summary with annotated genes" downloadable dataset.

Column name Data type Attributes
environment integration_environment
subject text
ncbi_gene_id integer
gene text
gene_type gene_type
landmark landmark
location integer
orientation_in_landmark orientation
orientation_in_gene text
multiplicity bigint
source_names varchar[]
pubmed_ids integer[]

integration_genes

Column name Data type Attributes
source_name varchar
environment integration_environment
sample jsonb
ltr ltr_end
landmark landmark
location integer
orientation_in_landmark orientation
sequence text
sequence_junction integer
note text
info jsonb
ncbi_gene_id integer
gene text
gene_type gene_type
orientation_in_gene text

integration_summary

The basis for the "Integration summary" downloadable dataset.

Column name Data type Attributes
environment integration_environment
subject text
landmark landmark
location integer
orientation_in_landmark orientation
multiplicity bigint
source_names varchar[]
pubmed_ids integer[]

sample_fields_metadata

Column name Data type Attributes
field text
count bigint
values bigint
sources varchar[]
environments text[]

summary_by_gene

The basis for the "Summary by Gene" downloadable dataset.

Column name Data type Attributes
environment integration_environment
ncbi_gene_id integer
gene text
gene_type gene_type
subjects bigint
unique_sites bigint
proliferating_sites bigint
total_in_gene numeric

Glossary

Data type and attribute terms are defined below. You should also refer to the Data usage guidelines in the manual for more information.

landmark
The chromosome or RefSeq accession in which the reported integration site is found. See the section on landmarks in the manual.
orientation
Either F or R (forward or reverse) to indicate the direction of the provirus with respect to the chromosome/landmark
ltr_end
Either 5p or 3p to indicate which viral LTR was sequenced out of to determine the IS
integration_environment
Either in vivo or in vitro indicating where the integration event took place. Refer to the Data usage guidelines in the manual for more information.
gene_type
A fixed set of names for various classes of genes as used by NCBI Gene.
text
An arbitrary blob of text, which may be very short or very long
varchar
A short bit of text, usually limited in length
integer
A whole number
bigint
A whole number which may be very large
int4range
A range between two whole numbers, useful for checking region overlap, containment, adjacency, and other similar operations
jsonb
A JSON document, which can store many optional fields and values. These are used to record information which varies from source to source. Subject and PubMed ID (if any) are stored in a jsonb column, for example.
text[]
varchar[]
integer[]
An array (or list) of text, varchar, or integer values, respectively. For tables exported as CSV files, values will be separated by a pipe character |.
not null
Indicates a column which cannot be empty. All rows will have a value for the column.
refers to another table
Indicates a column whose value refers to a row in another table. These form the relationships between rows and tables in a relational database.

Sample document fields

If you’re writing SQL to query the database, you’re likely going to want to extract data fields out of the sample JSON document associated with each integration site observation. The document contains optional fields such as subject, pcr, pubmed_id. Below is an example query which demonstrates how to extract those fields into a column in a custom SQL query:

SELECT *,
       sample->>'subject'   as subject,
       sample->>'pcr'       as pcr,
       sample->>'pubmed_id' as pubid
  FROM integration

The following table summarizes what sample fields are currently in use:

Field Sources Environments # of times present # of unique values
pubmed_id
  • NCI-RID
  • SherrillMix-2013
  • Sunshine-2016-JVI
  • Wagner-2014-Science
  • Wang-2007
  • in vitro
  • in vivo
60,329 11
tissue
  • NCI-RID
  • SherrillMix-2013
  • Sunshine-2016-JVI
  • Wang-2007
  • in vitro
  • in vivo
57,103 12
tissue_url
  • NCI-RID
  • SherrillMix-2013
  • Sunshine-2016-JVI
  • in vitro
1,540 4
original_id
  • NCI-RID
  • SherrillMix-2013
  • in vitro
  • in vivo
15,163 12,953
subject
  • NCI-RID
  • Wagner-2014-Science
  • in vivo
3,260 35
years_on_art
  • NCI-RID
  • Wagner-2014-Science
  • in vivo
2,691 20
provirus_activity
  • SherrillMix-2013
  • in vitro
12,431 3
pcr
  • Wagner-2014-Science
  • in vivo
531 531
genbank
  • NCI-RID
  • in vivo
461 461
clone
  • NCI-RID
  • in vitro
  • in vivo
37 37
disease
  • NCI-RID
  • in vivo
34 3