The CAMDA Contest Challenges
For 2021, we present:
- The Disease Maps to Modelling COVID-19 Challenge – with expert-curated molecular mechanistic maps for COVID-19 and large-scale expression profiles
- The Literature AI for Drug Induced Liver Injury Challenge – with biomedical publications curated by FDA experts on DILI
- The Metagenomic Phage Forensics of Anti-Microbial Resistance Challenge – with metagenomic profiles for phage-based anti-microbial predictions
- The Hi-Res Cancer Data Integration Challenge – with anonymized sequencing reads and expression estimates from unique personal genomic regions
CAMDA encourages an open contest, where all analyses of the contest data sets are of interest, not limited to the questions suggested here. There is an online forum for the free discussion of the contest data sets and their analysis, in which you are encouraged to participate.
We look forward to a lively contest!
Disease Maps to Modelling COVID-19
Given the recent events of the COVID-19 pandemic, we expect a steadily increasing flood of data from cells or patients infected with SARS-CoV-2 in the upcoming years. Incidence is still rising worldwide despite current vaccination campaigns, and several initiatives are emerging to collect data systematically. These studies have great value by themselves. Moreover, a joint effort of all the research institutions participating in the Disease Maps initiative has allowed the compilation of a COVID-19 mechanistic map that captures our current knowledge of the disease. This permits first models of the cellular response to infection from a mechanistic perspective. Mechanistic pathway models can then provide a causal bridge from variations in gene activity or integrity to consequential changes in phenotype. This makes them a useful tool for identifying deregulated mechanisms and functions in the search for candidate targets or intervention points that might reverse the phenotype or slow down progression of the disease. We challenge participants to expand our mechanistic understanding of COVID-19, and we can also test the most promising ideas experimentally.
Possible approaches include (but are not restricted to):
- Improved functional annotation of the current COVID-19 map will support more accurate inference in the context of COVID-19. Improved definition of COVID-characteristic processes will help our understanding of the disease and its progression, the cellular mechanisms involved, and help identify new ways to counter or minimize the effects of these processes.
- Expanding the COVID-19 map could exploit text-mining, protein-protein or other interactions, interaction or similarity networks, etc.
- Application of modelling to identify new therapeutic targets and drug candidates or predict disease outcomes, such as response to treatments or risk of developing severe symptoms
Analysis suggestions:
- What known and new disease consequences can you identify?
- How can you efficiently filter which functions may be more relevant for the processes described?
- How can molecular footprints from patient data be used to extract disease patterns?
- What are the singular and common disease mechanisms of action in SARS-CoV infection and other comparable viruses?
Check out and download the Disease Maps COVID-19 network in GPML, SBML, SBGN-ML or SIF format.
To download a Disease Maps COVID-19 sub-map, zoom in and select the sub-map you are interested in (for example, the PAMP signalling associated sub-map), then right-click on the map and select the format that best suits you (GPML, SBML, or SBGN-ML). The Simple Interaction Format (SIF) file for each Disease Maps COVID-19 sub-map can be downloaded from here
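As a minimal sketch of how a downloaded SIF file might be loaded for network analysis: each SIF line has the form `source interaction target [target ...]`, delimited by tabs if any tab is present and by whitespace otherwise. The node and interaction names in the example are invented for illustration and are not taken from the COVID-19 map; `parse_sif` and `out_degree` are hypothetical helper names.

```python
from collections import defaultdict


def parse_sif(lines):
    """Parse SIF lines into a list of (source, interaction, target) edges."""
    edges = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # SIF uses tabs as the delimiter when present, otherwise whitespace.
        fields = line.split("\t") if "\t" in line else line.split()
        if len(fields) < 3:
            continue  # a lone node name declares an isolated node, no edge
        source, interaction, targets = fields[0], fields[1], fields[2:]
        for target in targets:
            edges.append((source, interaction, target))
    return edges


def out_degree(edges):
    """Count outgoing edges per source node."""
    degree = defaultdict(int)
    for source, _, _ in edges:
        degree[source] += 1
    return dict(degree)


# Invented toy network, not part of the contest data.
example = ["A\tactivates\tB\tC", "B\tinhibits\tC", "D", ""]
edges = parse_sif(example)
print(edges)
print(out_degree(edges))
```

The resulting edge list can be handed to any graph library, or used directly to look up the neighbours of a gene of interest.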
We provide a collection of recent relevant gene expression profile studies after registration and login. As always in CAMDA challenges, however, you can use any other datasets in addition or instead, including any other data modalities (transcriptomic, genomic, proteomic, GWAS, etc.), as long as these are also available to colleagues.
Please sign up to announcements from the CAMDA general forum for alerts.
Please read and accept the data download agreement for access.
Literature AI for Drug Induced Liver Injury
Unexpected Drug-Induced Liver Injury (DILI) is still one of the main killers of promising novel drug candidates. It is a clinically significant disease that can lead to severe outcomes such as acute liver failure and even death. It remains one of the primary liabilities in drug development and regulatory clearance because mandated preclinical models still perform poorly. The free text of scientific publications remains the main medium carrying DILI results from clinical practice or experimental studies, and this text has to be analysed manually. This process is tedious and prone to human mistakes or omissions, as results are very rarely available in a standardized or organized form. There is thus great hope that modern techniques from machine learning or natural language processing could provide powerful tools to better process free-form texts and derive the underlying knowledge. The pressing need to process potential drug candidates faster in the current COVID epidemic, combined with recent advances in Artificial Intelligence for text processing, makes this Challenge particularly topical.
We have compiled a large set of PubMed papers relevant to DILI (positives) to be contrasted with a challenging set of unrelated papers (negatives). Both titles and abstracts have been collected. Can you build a classifier using modern AI or NLP techniques to identify the relevant papers?
- The positive reference data set comprises ~14,000 DILI-related papers referenced in the NIH LiverTox database, which have been validated by a panel of DILI experts. This positive reference is split 50:50 into one part released for the challenge and one part withheld for final performance testing.
- This is complemented by a realistic, non-trivial negative reference set of ~14,000 papers that is highly enriched in manuscripts not relevant to DILI. Obvious negatives and any positives we could identify have been removed by filtering for keywords and through well-established language models, followed by a selective manual review by DILI experts at the FDA. This negative reference is also split 50:50 into one part released for the challenge and one part withheld for final performance testing.
Together, this recreates the problem faced by human experts: after the obvious, easy negatives and positives have been removed by basic algorithms, how can we identify true positives and negatives for the less obvious cases?
The released data should be used for both training and (nested) cross-validation to avoid over-fitting. Participants will then receive independent performance scores from the withheld additional test data.
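The idea of nested cross-validation can be sketched as follows: an inner loop selects a hyperparameter using only the outer training part, and the outer loop scores that choice on data that played no role in selecting it. This is only an illustration under strong assumptions. The one-dimensional score `xs` is a hypothetical stand-in for a text-derived feature, the "model" is a simple threshold between class means, and `alpha` is an invented hyperparameter; a real entry would substitute an actual text classifier.

```python
import random


def k_fold(indices, k, seed=0):
    """Shuffle indices and yield (train, test) index lists for k folds."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


def fit(xs, ys, idx, alpha):
    """Fit a threshold placed a fraction alpha of the way between class means."""
    pos = [xs[i] for i in idx if ys[i]]
    neg = [xs[i] for i in idx if not ys[i]]
    mu_pos, mu_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    return mu_neg + alpha * (mu_pos - mu_neg)


def accuracy(threshold, xs, ys, idx):
    """Fraction of items in idx where (x >= threshold) matches the label."""
    return sum((xs[i] >= threshold) == ys[i] for i in idx) / len(idx)


def nested_cv(xs, ys, alphas, outer_k=5, inner_k=3):
    """Return one held-out accuracy per outer fold."""
    scores = []
    for outer_train, outer_test in k_fold(range(len(xs)), outer_k, seed=1):
        # Inner loop: pick alpha using the outer training part only.
        def inner_score(alpha):
            accs = [accuracy(fit(xs, ys, tr, alpha), xs, ys, val)
                    for tr, val in k_fold(outer_train, inner_k, seed=2)]
            return sum(accs) / len(accs)

        best_alpha = max(alphas, key=inner_score)
        # Outer loop: refit on all outer training data, score on unseen data.
        model = fit(xs, ys, outer_train, best_alpha)
        scores.append(accuracy(model, xs, ys, outer_test))
    return scores


# Invented toy data: 30 positives scoring 2.0 to 4.0, 30 negatives 0.0 to 2.0.
xs = [2.0 + 0.5 * (i % 5) for i in range(30)] + [0.5 * (i % 5) for i in range(30)]
ys = [True] * 30 + [False] * 30
scores = nested_cv(xs, ys, alphas=[0.3, 0.5, 0.7])
print(scores, sum(scores) / len(scores))
```

The mean of the outer-fold scores is the estimate to report; the inner-loop scores are optimistically biased because they were used for selection.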
Because the overall prevalence of DILI-relevant papers across all manuscripts in PubMed is very low, we will also provide another independent performance score for which the negative reference set has been expanded considerably. This assesses how well the models can be applied to larger candidate collections that are naturally highly unbalanced.
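To see why an unbalanced evaluation matters, the short sketch below computes the precision a classifier would achieve at different prevalences while its sensitivity and specificity stay fixed. The numbers are purely illustrative and are not challenge results; the metric actually used for scoring is not specified here, and `precision_at_prevalence` is a hypothetical helper name.

```python
def precision_at_prevalence(sensitivity, specificity, prevalence):
    """Expected precision when positives make up `prevalence` of the collection."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Illustrative values only: the same classifier on balanced vs. unbalanced data.
for prevalence in (0.5, 0.1, 0.01):
    p = precision_at_prevalence(0.9, 0.9, prevalence)
    print(f"prevalence={prevalence:<5} precision={p:.3f}")
```

With sensitivity and specificity both at 0.9, precision is 0.9 on a balanced set but falls to about 0.083 when only 1% of the candidates are positive, so a model that looks strong on the balanced data may still return mostly false positives on a realistic collection.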
Data are provided in the form of text tables. Both files contain paper titles and abstracts (where available).
Please sign up to announcements from the CAMDA toxicity forum for alerts.
Please read and accept the data download agreement for access.
We thank the Institute of Advanced Research in Artificial Intelligence (IARAI) for its support in the preparation of this Challenge.
Metagenomic Phage Forensics of Anti-Microbial Resistance
Phages, or bacteriophages, are viruses that infect bacteria and are the most abundant viruses on the planet. A systematic characterization of their striking variety, however, has been challenging and only became viable in the age of metagenomics. Recently, the idea of phage therapy has been rediscovered by researchers in academia and the pharmaceutical industry as a potential alternative to classical antibiotics in modern medicine. Moreover, a better understanding of phages and of how they spread their genetic material can be of great value to public health. Rather than fight outbreaks of superbugs as they emerge, the aim would be to prevent outbreaks or nip them in the bud.
The co-evolution of microbes and their viruses, analyses of correspondences, gene transfer, and other mechanisms of spread are yet to be explored systematically, especially on a metagenomic scale. Such knowledge, however, would help in the prediction of anti-microbial resistance events from metagenomic samples collected in strategic monitoring areas. Understanding the relations between viruses and their hosts is likely to be critical here, as anti-microbial resistance can indeed spread through phages. In this CAMDA challenge we thus explore the systematic characterization of metagenomic samples with the aim of finding phages and pro-phages that may be associated with anti-microbial resistance and of compiling a 'resistome'.
We provide a dataset containing
- 62 samples with a high level of AMR genes
- 62 samples with a low level of AMR genes
Samples are placed in 124 tar-compressed folders with names corresponding to their IDs and AMR class (high or low). Within each sample you will find:
- FASTQ files compressed with DSRC
- binary alignment (BAM) files for AMR genes (some samples may not have BAM files)
- contextual tabular metadata regarding those genes
Depth of sequencing may vary, and we cannot wait to hear what you think about this property in terms of phage metagenomics!
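As a starting point for comparing sequencing depth across samples, the sketch below counts reads and bases in a FASTQ stream. It assumes the file has already been decompressed to plain text with the DSRC tool and that each record occupies exactly four lines; the example records are invented, and `fastq_stats` is a hypothetical helper name.

```python
import io


def fastq_stats(handle):
    """Return (read_count, total_bases) for a four-line-per-record FASTQ stream."""
    reads = 0
    bases = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between or after records
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near read {reads + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in read {reads + 1}")
        reads += 1
        bases += len(seq)
    return reads, bases


# Invented two-read example, not contest data.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
print(fastq_stats(example))  # (2, 10)
```

For a real file, pass an open text handle, for instance `open(path)`, where `path` points to a decompressed FASTQ file. Per-sample totals can then be used to normalise phage or AMR gene counts before comparing the high and low classes.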
Data is based on an initial large-scale analysis of anti-microbial resistance of the MetaSUB International Consortium.
Questions of interest in this exploratory study include (but are not limited to):
Analysis suggestions:
- Discovery of phages and pro-phages in metagenomic samples with relevance to anti-microbial resistance (AMR)
- Establishing species that may be infected by detected phages, horizontal AMR gene transfers etc.
- Advancement of algorithms for the identification and discovery of phages and pro-phages from bacterial genomes, especially from metagenomic samples; assessment of such novel algorithms (performance, validation, …)
- Applications and assessments of relation mining for the occurrence of phages and bacteria in the context of AMR
The FASTQ files containing the raw metagenomic reads of the aforementioned samples are made available for the first time, together with the corresponding metadata and the results of the MetaSUB AMR analysis.
Please sign up to announcements from the CAMDA metagenomics forum for alerts.
Please read and accept the data download agreement for access.
Hi-Res Cancer Data Integration Challenge
There is an amazingly comprehensive collection of matched genomic, transcriptomic, and epigenomic molecular patient profiles that characterizes the complex changes that occur in cancers. The most prominent data sets are provided by the Genomic Data Commons (GDC, formerly through the TCGA). The main goal of this challenge is to develop and demonstrate new methods for gaining novel biological insights or improving support for Precision Medicine, as showcased for data from cancer patients. Innovation can build on:
- Individual human genomic sequence not found in the standard reference genome - We provide reads matched to the standard reference genome plus over 300 Mb of novel human genomic sequence for hundreds of real patients.
- High resolution expression profiling - Anonymized read level data allow the exploration of aberrations in splicing and regulation of alternative gene transcripts. We will also provide extended expression level profiles covering these novel and unique gene transcripts, making the challenge also accessible for colleagues who do not want to work with the read level data.
- A more meaningful integrated analysis of the multiple matched molecular profiles and complementary clinical patient data.
This presents a unique opportunity to examine algorithm performance in a real-world clinical setting! We know that many approaches work well on some data sets yet not on others. We challenge you to demonstrate a single unified approach that matches or outperforms the current state of the art for
- Breast cancer
and for at least one of the less well studied
- Lung Adenocarcinoma
- Kidney Renal Clear Cell Carcinoma
Please visit and participate in the open CAMDA data integration forum for free discussion related to this contest.
Analysis suggestions:
Biological:
- What known and new disease mechanisms can you identify?
- How can the integration of matched molecular profiles and patient data yield a more meaningful readout, including likely causal changes?
- What can we learn about the role of aberrant splicing and regulation of alternative gene transcripts in cancer?
- How can individual human genomic sequence aid Precision Medicine and the development of personalized rational drug treatment plans?
Technical:
- Can we apply approaches and insights developed from one type of cancer (e.g., a common, well-studied cancer) to other diseases (e.g., less well-studied cancers)?
- How large a distortion is observed from restriction of gene expression readout to the standard human reference sequence (vs mapping to individual human genome sequences)?
Contest data comprises raw and pre-processed data from matched molecular profiles with complementary clinical information.
For convenience, we provide a local copy of the data. In addition, anonymized RNA-seq read level data are now available.
Please sign up to announcements from the CAMDA data integration forum for alerts.
Please read and accept the data download agreement for access.