INF: A data analysis ecosystem for the CRC: Omics data processing and management
The INF project will provide CRC1588 scientists with a collaborative ecosystem for omics data processing and analysis, as well as the data management infrastructure that a consortium of this size requires. Support will build on well-tested methods for data storage and management and on standardized data processing workflows. The project will also bundle omics data analysis expertise and cover more tailored analyses, such as integrating heterogeneous data from public databases and semantic domain knowledge and generating interpretable diagnostic models for in-depth analysis of experimental results. The advent of genome-scale omics technologies requires new measures and resources to store, retrieve, analyze and ultimately understand these data. In addition to individually adapted solutions for different types of omics data and their cross-platform and cross-species meta-analyses, automated and standardized workflows are needed that process the increasing amounts of data efficiently and in a well-defined manner. The Charité/BIH Core Unit Bioinformatics (CUBI), in cooperation with several CRC1588 PIs, has built extensive expertise in omics data processing, quality control and management, which is of utmost importance for every project in CRC1588. Reproducible processing, FAIR data management and solid data security to protect patients are essential prerequisites. In the INF project, the CUBI will address these challenges with (i) CRC-specific, efficient data stewardship and data management services (Task 1) and (ii) best-practice omics data processing and research collaborations on data processing, integration and meta-analyses (Task 2). To leverage the interdisciplinary nature of CRC1588, we will extend our infrastructure for shared access to omics data to also encompass the bioinformatic analysis of these data (Task 3).
We will also provide new proof-of-concept data access technologies that enable genotype-phenotype association discovery in single-cell and spatial omics datasets, working towards the creation of a Neuroblastoma Single-Cell Atlas (Task 4). The key core services provided by INF will be (i) a unified CRC1588 omics database comprising existing and newly created omics datasets, (ii) professional data stewardship and reproducible, best-practice omics data processing to create high-quality datasets as a basis for cross-omics data integration and phenotype association, (iii) an internally accessible, interactive and reproducible environment for high-level data analysis and collaboration, and (iv) meta-analyses across platforms and species. Services (i) and (ii) will form what we call the ‘CRC1588 Data Hub’, which provides data access to CRC1588 researchers.
A shared PhD student, co-supervised by J.P. Junker (C01) and D. Beule, will start with existing neuroblastoma single-cell data and then implement a tailored data integration method (Task 4). This method will connect zebrafish experimental data on highly resolved neuroblastoma cell states from project C01 (in cooperation with S. Haas) with data from human cells, patients and mouse models. The resulting classification of neuroblastoma states will enable re-evaluation of published data and will grow gradually as it incorporates other CRC1588 project datasets and modalities beyond gene expression, forming the Neuroblastoma Single-Cell Atlas. The PhD student will need a strong background in bioinformatics, with solid knowledge of algorithms and statistics as well as good programming skills.
PhD positions and place of work: 1 computational position, requiring experience in analyzing single-cell and bulk omics data and good knowledge of Python, Git, Slurm, multivariate statistics and machine learning (PI Dieter Beule, BIH@Charité, Berlin)