Data Management and Sharing

As a NIAID-funded resource, BV-BRC complies with the NIH Policy for Data Management and Sharing (Notice Number: NOT-OD-21-013), which promotes the management and sharing of scientific data generated from NIH-funded or conducted research.

Data Types

BV-BRC is an integrated data and analysis resource designed to support genomic and related infectious disease research for viral and bacterial pathogens. As such, its primary data type is genomic sequence data, ingested mainly from public repositories such as NCBI GenBank. BV-BRC reannotates genomes with curated subsystem data for consistency, curates and standardizes metadata, and formats the data so that the integrated tools can perform comparative analysis computations. BV-BRC also annotates genomes assembled from reads collected from SRA for specialized phenotypic characteristics, such as antimicrobial resistance (AMR). Annotations may be derived from computations such as gene and other feature prediction algorithms, subsystem and other functional groupings, and phenotype prediction such as AMR. Appropriate references to the computations performed are available in the user documentation.

Data from expression studies have also been curated and structured for comparative analyses. Additional data are incorporated from various resources, such as protein structures from PDB as well as computed structures, e.g., from AlphaFold, to augment annotations and comparative analyses. Further, BV-BRC has integrated data sets from other NIAID programs, such as the Systems Biology Centers (SBCs). Additional metadata from serology and surveillance efforts may also be associated with genome sequences and other data types. Data from these resources have appropriate provenance information to enable traceability to the source.

The table below summarizes the BV-BRC data types and their primary sources.

Primary Data Types	Related and Derived Data Types	Data Resources
Taxonomy	Reference organisms	NCBI Taxonomy, ICTV
Genomes	Clinical and environmental metadata, AMR / AVR phenotypes, QC results	GenBank, SRA
Genomic Features	Genes, RNAs, misc. features, GO terms, EC numbers, Pathways, Protein families, Subsystems, Virulence factors, AMR genes, Drug targets, Essential genes	GenBank, PATRIC and IRD/ViPR (legacy BRC resources), GO, KEGG, Reactome, SEED, VFDB, Victors, CARD, NDARO, DrugBank, TTD
Comparative Genomics	Protein families, Orthologs, MSA, Gene trees, Metabolic pathways, Subsystems, Phylogenetic trees	PATRIC and IRD/ViPR (legacy BRC resources), KEGG, SEED
Protein Domains and Sequence Features	Variant types	UniProt, IRD/ViPR (legacy BRC resources)
Antimicrobial / Antiviral Agents	Drug targets	PubChem, DrugBank, ATC, PDB
Antimicrobial / Antiviral Resistance Data	Standardized AMR/AVR phenotypes	NIAID Genomics Centers for Infectious Diseases (GCIDs), BioSamples, publications
Clinical Studies	Clinical metadata, AMR/AVR phenotypes	NIAID GCIDs / Systems Biology Centers (SBCs), publications, others
Transcriptomics and Proteomics	BAM / WIG files, gene expression, differential expression, gene clusters, enrichment analysis	GEO, ArrayExpress, PRIDE, SBCs
Protein-Protein Interactions	Protein interaction networks, protein-drug interaction networks	IntAct, BIND, MINT, DIP, STRING, others
Protein 3D Structures	Sequence conservation location, immune epitope location	PDB, MMDB, NIAID Structural Genomics Centers (SGCs)
Phenomics	Organism attributes	Publications
Host Response and Host-Pathogen Interactions	BAM / WIG files, gene expression, differential expression, gene clusters, enrichment analysis	SRA, ArrayExpress, NIAID SBCs, IntAct, BIND, MINT, DIP, others
Epidemiology and Surveillance	MLST, SNP / SNV / in-del annotation, SNP / transmission trees	SRA, MLST, PubMLST, CDC, CEIRR DPCC, JCVI
Immunology and Serology	Organism attributes, phenotypes, epitopes	ImmPort, CEIRR

In addition to ingested and processed data, users may elect to assemble and annotate their own sequenced genomes and make them publicly available through the BV-BRC resource. In these cases, BV-BRC team members work with the user to assess genome quality, verify metadata, and include appropriate attribution.

Related Tools, Software and/or Code

The data in BV-BRC are available through the BV-BRC website user interface, with its integrated searches and tools, as well as through the command line interface (CLI), the data API, and the FTP site. These are all provided freely to all users. All software developed by the project team in the system is available as open source in the BV-BRC GitHub repository.

In the event of discontinuation of the resource, the databases, systems, and software code will be archived and transferred to NIAID or its designee.

Standards

Where available, BV-BRC uses established standards with wide adoption in the target research communities. These include common file formats for sequence data (fasta in .fa, .fasta, .faa, .fna, .afa, .xmfa, .embl), aligned sequence data (.bai, .bam), tabular data such as expression data (.csv, .txt, .xls, .xlsx), genome features (.gff, .gtf), phylogenetic trees (Newick, PhyloXML), variant calls (.vcf, .vcf.gz), structured data (.xml, .json), and others.
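
As a rough illustration of how the format list above might be used in a script, the following sketch maps file names to the data categories named in this section. The function name `classify_format` and the category labels are hypothetical and are not part of any BV-BRC tool. The extensions are transcribed from the paragraph above; Newick and PhyloXML are omitted because the text gives no extensions for them.

```python
from pathlib import PurePath

# Extension-to-category table transcribed from the paragraph above.
# The category labels are illustrative, not BV-BRC identifiers.
FORMAT_CATEGORIES = {
    "sequence": {".fa", ".fasta", ".faa", ".fna", ".afa", ".xmfa", ".embl"},
    "aligned_sequence": {".bai", ".bam"},
    "tabular": {".csv", ".txt", ".xls", ".xlsx"},
    "genome_features": {".gff", ".gtf"},
    "variant_calls": {".vcf", ".vcf.gz"},
    "structured": {".xml", ".json"},
}


def classify_format(filename: str) -> str:
    """Return the category for a file name, or "unknown" if it is not listed."""
    suffixes = [s.lower() for s in PurePath(filename).suffixes]
    # Try the compound suffix first (e.g. ".vcf.gz"), then the last suffix alone.
    candidates = []
    if len(suffixes) >= 2:
        candidates.append("".join(suffixes[-2:]))
    if suffixes:
        candidates.append(suffixes[-1])
    for candidate in candidates:
        for category, extensions in FORMAT_CATEGORIES.items():
            if candidate in extensions:
                return category
    return "unknown"
```

Checking the compound suffix before the final suffix lets compressed variant files such as `.vcf.gz` be recognized, while names with extra dots (for example `reads.sorted.bam`) still resolve through their last suffix.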

BV-BRC uses metadata standards and ontologies to characterize the corresponding data in enough detail to enable comparison, integration, and reuse in multiple analysis environments. BV-BRC provides services and tools that accept and produce data in standard formats, thus enabling data exchange and reuse among resources. We also provide open APIs that allow programmatic access to data and models within the resource, facilitate scripting of workflows that interact with BV-BRC services, and can be used by the community to develop new approaches for data analysis.
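
To make the idea of programmatic access more concrete, here is a minimal sketch that only builds a query URL and makes no network request. Everything specific in it is an assumption for illustration and should be checked against the BV-BRC API documentation before use: the base URL, the collection name `genome`, the field names, and the `eq()`/`select()`/`limit()` operator syntax. The helper name `build_query_url` is hypothetical.

```python
from urllib.parse import quote

# Assumed base URL, for illustration only; confirm against the BV-BRC API docs.
API_BASE = "https://www.bv-brc.org/api"


def build_query_url(collection, filters, fields, limit=25):
    """Build a query URL from equality filters and a list of fields to return.

    The eq()/select()/limit() operator syntax is an assumption in this sketch.
    """
    # One eq(name,value) clause per filter, with names and values URL-encoded.
    clauses = [
        f"eq({quote(str(name), safe='')},{quote(str(value), safe='')})"
        for name, value in filters.items()
    ]
    if fields:
        encoded = ",".join(quote(str(f), safe="") for f in fields)
        clauses.append(f"select({encoded})")
    clauses.append(f"limit({int(limit)})")
    return f"{API_BASE}/{quote(collection, safe='')}/?" + "&".join(clauses)
```

A call such as `build_query_url("genome", {"genome_id": "000000.0"}, ["genome_id", "genome_name"], limit=1)`, where the identifier is a placeholder, returns a string that could then be passed to any HTTP client. Keeping URL construction separate from the request makes the query easy to inspect and test offline.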

For example, the genome metadata schema used in BV-BRC is based on the NIAID Human Pathogen and Vector Sequencing Metadata Standards, which include Project and Sample Application Standards and Clinical Metadata Standards. These standards were derived from the Genomic Standards Consortium's Minimum Information about any (x) Sequence (MIxS) standards and NCBI's BioSample/BioProject checklists. We also use the Systems Biology Metadata Standard, which was developed to capture consistent metadata about experiments designed to assess host responses to pathogen infection. The BV-BRC team collaborated with the other BRCs and with the NIAID-sponsored Systems Biology for Infectious Diseases program through the Data Dissemination Working Group to ensure that the standard covered the key experimental metadata.

Data Preservation, Access, and Associated Timelines

BV-BRC is, in essence, a knowledgebase for bacterial and viral genomic and related data associated with infectious diseases. It pulls, integrates, and adds value to data from other public databases and repositories, as described in the Data Types section above.

Except for data that users have uploaded or generated in their private workspaces, all data in BV-BRC are publicly available. Data in private workspaces are entirely under the control of the user who owns the workspace. The user is responsible for complying with any NIAID policies for data release and deposition in public repositories for their private data.

All genomes and other primary data types have persistent unique identifiers. The data in BV-BRC are retained for the life of the resource. Exceptions include replacement of outdated genomes from GenBank.

Access, Distribution, or Reuse Considerations

All public data in BV-BRC are freely available without restriction on their use. The resource does not include any PHI, PII, or other data covered by HIPAA.
