The columns of the primary data tables are described below. These tables make up the backbone of the database.
The primary fact table in the database. Each row records a single occurrence of a provirus detected at the given integration site in the given subject by the given source data. Sites which are detected multiple times in a given subject are represented as multiple rows. Technical replicates are excluded if they’re from single genome sequencing rather than bulk consensus sequencing.
Column name | Data type | Attributes |
---|---|---|
source_name | varchar |
not null refers to another table |
environment | integration_environment |
not null |
sample | jsonb |
|
ltr | ltr_end |
|
landmark | landmark |
|
location | integer |
|
orientation_in_landmark | orientation |
|
sequence | text |
|
sequence_junction | integer |
|
note | text |
|
info | jsonb |
Column name | Data type | Attributes |
---|---|---|
ncbi_gene_id | integer |
not null |
name | text |
not null |
type | gene_type |
Column name | Data type | Attributes |
---|---|---|
ncbi_gene_id | integer |
not null refers to another table |
landmark | landmark |
not null |
gene_start | integer |
not null |
gene_end | integer |
not null |
gene_range | int4range |
not null |
gene_orientation | orientation |
not null |
Column name | Data type | Attributes |
---|---|---|
source_name | varchar |
not null |
document | jsonb |
|
revision | jsonb |
Views are different representations of the data contained within the tables above. They pull together data from multiple tables in order to summarize it or make it easier to analyze. Views are also used to produce the downloadable datasets.
Note that integration sites which are covered by multiple genes will be reported once per gene in the "Integration summary with annotated genes" and "Summary by gene" views.
The basis for the "Integration summary with annotated genes" downloadable dataset.
Column name | Data type | Attributes |
---|---|---|
environment | integration_environment |
|
subject | text |
|
ncbi_gene_id | integer |
|
gene | text |
|
gene_type | gene_type |
|
landmark | landmark |
|
location | integer |
|
orientation_in_landmark | orientation |
|
orientation_in_gene | text |
|
multiplicity | bigint |
|
source_names | varchar[] |
|
pubmed_ids | integer[] |
Column name | Data type | Attributes |
---|---|---|
source_name | varchar |
|
environment | integration_environment |
|
sample | jsonb |
|
ltr | ltr_end |
|
landmark | landmark |
|
location | integer |
|
orientation_in_landmark | orientation |
|
sequence | text |
|
sequence_junction | integer |
|
note | text |
|
info | jsonb |
|
ncbi_gene_id | integer |
|
gene | text |
|
gene_type | gene_type |
|
orientation_in_gene | text |
The basis for the "Integration summary" downloadable dataset.
Column name | Data type | Attributes |
---|---|---|
environment | integration_environment |
|
subject | text |
|
landmark | landmark |
|
location | integer |
|
orientation_in_landmark | orientation |
|
multiplicity | bigint |
|
source_names | varchar[] |
|
pubmed_ids | integer[] |
Column name | Data type | Attributes |
---|---|---|
field | text |
|
count | bigint |
|
values | bigint |
|
sources | varchar[] |
|
environments | text[] |
The basis for the "Summary by Gene" downloadable dataset.
Column name | Data type | Attributes |
---|---|---|
environment | integration_environment |
|
ncbi_gene_id | integer |
|
gene | text |
|
gene_type | gene_type |
|
subjects | bigint |
|
unique_sites | bigint |
|
proliferating_sites | bigint |
|
total_in_gene | numeric |
Data type and attribute terms are defined below. You should also refer to the Data usage guidelines in the manual for more information.
landmark
orientation
F
or R
(forward or reverse) to indicate
the direction of the provirus with respect to the chromosome/landmark
ltr_end
5p
or 3p
to indicate which viral LTR was
sequenced out of to determine the IS
integration_environment
in vivo
or in vitro
indicating where
the integration event took place. Refer to the Data usage guidelines in
the manual for more information.
gene_type
text
varchar
integer
bigint
int4range
jsonb
jsonb
column, for example.
text[]
varchar[]
integer[]
text
,
varchar
, or
integer
values, respectively. For tables exported as
CSV files, values will be separated by a pipe character
|
.
If you’re writing SQL to query the database, you’re likely going to want to
extract data fields out of the sample
JSON document
associated with each integration site observation. The document
contains optional fields such as subject
,
pcr
, pubmed_id
. Below is an example query
which demonstrates how to extract those fields into a column in a
custom SQL query:
SELECT *, sample->>'subject' as subject, sample->>'pcr' as pcr, sample->>'pubmed_id' as pubid FROM integration
The following table summarizes what sample
fields are
currently in use:
Field | Sources | Environments | # of times present | # of unique values |
---|---|---|---|---|
pubmed_id |
|
|
60,329 | 11 |
tissue |
|
|
57,103 | 12 |
tissue_url |
|
|
1,540 | 4 |
original_id |
|
|
15,163 | 12,953 |
subject |
|
|
3,260 | 35 |
years_on_art |
|
|
2,691 | 20 |
provirus_activity |
|
|
12,431 | 3 |
pcr |
|
|
531 | 531 |
genbank |
|
|
461 | 461 |
clone |
|
|
37 | 37 |
disease |
|
|
34 | 3 |