MLFPM Symposium 2022

The final event of the ITN will be our symposium, which will take place on October 18-19, 2022, in the Auditorium of the Max Planck Institute of Psychiatry in Munich.

Livestream

The livestream for the second day is available at https://youtu.be/vbzlxDK3XNg

Schedule

Time	Tuesday, October 18	Wednesday, October 19
09:00-09:15	Opening Remarks: Karsten Borgwardt
09:15-10:00	Student session opening: Joaquin Dopazo	Company session opening: Volker Tresp
10:00-10:30	MLFPM presentation: Diane Duroux	MLFPM presentation: Vesna Barros
10:30-11:00	MLFPM presentation: Pradeep Eranti	MLFPM presentation: Christopher Heje Grønbech
11:00-11:30	COFFEE BREAK	COFFEE BREAK
11:30-12:00	MLFPM presentation: Lucas Miranda	MLFPM presentation: Giulia Muzio
12:00-13:00	Session keynote: Gitta Kutyniok	Session keynote: Jean-Philippe Vert
13:00-14:00	LUNCH	LUNCH
14:00-14:45	Student workshop	MD session opening: Michal Rosen-Zvi
14:45-15:15	Student workshop	MLFPM presentation: Emese Sükei
15:15-15:45	MLFPM presentation: Bowen Fan	General public session keynote: Nikolaos Koutsouleris
15:45-16:15	MLFPM presentation: Giovanni Visonà	General public session keynote: Nikolaos Koutsouleris
16:15-16:45	COFFEE BREAK	COFFEE BREAK
16:45-17:15	MLFPM presentation: Ndèye Maguette Mbaye	General public session keynote: Bastian Rieck
17:15-17:45	MLFPM presentation: Anastassia Kolde	General public session keynote: Bastian Rieck
17:45-18:15	MLFPM presentation: Pelin Gündoğdu

Presentations

More information on the speakers and talks (including titles and abstracts of the presentations) will be available here soon.

Student session

Keynote talk: Causal modeling and machine learning applied to massive drug repurposing in rare diseases

Joaquin Dopazo (Fundación Pública Andaluza Progreso y Salud)

Tuesday, October 18, 09:15-10:00

Abstract: TBA

MLFPM Presentation: Heterogeneity between networks in systems biomedicine: detection and remediation

Diane Duroux (University of Liège)

Tuesday, October 18, 10:00-10:30

Abstract: Most elements around us are not isolated but work as systems with connected characteristics. Networks are an informative representation of such systems since they consider the interactions between features. In the meantime, accessible data are increasingly complex, producing highly heterogeneous networks. In life sciences, and in particular in the study of complex diseases, this variability complicates replication studies and the interpretation of findings. I will present how network-based approaches can be used to overcome the hurdles induced by the heterogeneity between biological systems. I will focus on two problems for which the unit of analysis can be presented as a network and where heterogeneity is observed: epistasis detection and disease subtyping.

MLFPM Presentation: Integration of multi-omics data and disease-related phenotypes to better understand the biological mechanisms underlying disease

Pradeep Eranti (Université de Paris)

Tuesday, October 18, 10:30-11:00

Abstract: Multifactorial diseases, including cancers, cardiovascular diseases, asthma and allergic diseases, stem from the effects and interactions of multiple genetic and environmental factors. With the technological advancements in detecting genetic variants across the genome at low cost, Genome-Wide Association Studies (GWAS) have successfully identified thousands of genetic variants (SNPs) associated with many diseases. However, these variants altogether explain only a part of the genetic component of the disease, which may be partly because disease susceptibility results from the joint and interactive effects of many genetic factors. The biological mechanisms involved in disease can be better deciphered by characterizing the inter-relationships among genetic factors. One strategy is to conduct network-based analysis by integrating the outcomes of genome-wide association studies and protein-protein interaction networks (or any other biological network) to identify interconnected genetic factors influencing disease.

We extended this network analysis-based strategy by applying it to multiple omics data (genomics and epigenomics) of related outcomes to better understand the mechanisms underlying allergy. We applied our strategy to summary-level data from a genome-wide association study (GWAS) of allergic diseases (AD) and from an epigenome-wide association study (EWAS) of Immunoglobulin E (IgE) levels, a key intermediate phenotype in allergy. We used the STRING protein-protein interaction network as background and the SigMod method to identify modules (sub-networks) enriched in trait-associated genes. We then investigated the relationships between AD-GWAS and IgE-EWAS gene modules. We observed shared biological mechanisms underlying AD-GWAS and IgE-EWAS gene modules. More than 70% of AD-GWAS module genes directly interacted with genes from the IgE-EWAS module (this proportion being significant at P<10^-5 using 100,000 permutations). Our proposed strategy helped us better understand the biological mechanisms underlying allergic diseases, and this strategy could be applied to other multifactorial diseases.
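The permutation-based significance statement above can be illustrated with a small sketch. It is not the authors' pipeline: the toy network, the gene names, and the null model (drawing random gene sets of the same size from the whole network) are assumptions made for illustration, and the helper names `fraction_interacting` and `permutation_p_value` are hypothetical.

```python
import random


def fraction_interacting(module_a, module_b, neighbors):
    """Fraction of genes in module_a with at least one direct neighbor in module_b."""
    module_b = set(module_b)
    hits = sum(1 for gene in module_a if neighbors.get(gene, set()) & module_b)
    return hits / len(module_a)


def permutation_p_value(module_a, module_b, neighbors, n_perm=1000, seed=0):
    """Empirical p-value under an assumed null: random gene sets of the same size."""
    rng = random.Random(seed)
    genes = sorted(neighbors)
    observed = fraction_interacting(module_a, module_b, neighbors)
    at_least = 0
    for _ in range(n_perm):
        random_a = rng.sample(genes, len(module_a))
        if fraction_interacting(random_a, module_b, neighbors) >= observed:
            at_least += 1
    # The +1 correction keeps the estimate strictly positive.
    return observed, (at_least + 1) / (n_perm + 1)


# Toy undirected network (hypothetical genes, not STRING data).
edges = [("a1", "b1"), ("a2", "b2"), ("a3", "a1")]
edges += [(f"g{i}", f"g{i + 1}") for i in range(19)]
neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)

module_a = ["a1", "a2", "a3"]  # stands in for the AD-GWAS module
module_b = ["b1", "b2"]        # stands in for the IgE-EWAS module
observed, p_value = permutation_p_value(module_a, module_b, neighbors)
print(observed, p_value)
```

With this estimator the smallest attainable p-value is 1/(n_perm + 1), so the number of permutations bounds the significance level that can be reported.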

MLFPM Presentation: Deep clustering of motion tracking data – An unsupervised profiling of Chronic Social Defeat Stress

Lucas Miranda (Max Planck Institute of Psychiatry)

Tuesday, October 18, 11:30-12:00

Abstract: Severe stress exposure increases the risk of stress-related disorders, such as major depressive disorder (MDD). An essential characteristic of MDD is the impairment of social functioning and lack of social motivation. Along these lines, chronic social defeat stress (CSDS) is an established animal model for MDD research, which induces a cascade of physiological and social behavioral changes.

In this talk, we take advantage of current markerless pose estimation tools and aim to profile CSDS using state-of-the-art unsupervised motion tracking analyses. In the context of behavioral data, we set out to explore the current state of time series clustering using deep neural networks, and apply these tools to describe how differences are stronger upon a novel social encounter and fade with time due to habituation. Moreover, we describe and apply current explainability methods to our unsupervised models, highlighting the importance of understanding, in a robust way, the patterns our models retrieve.

Keynote talk: On the Path to Reliable AI

Gitta Kutyniok (Ludwig-Maximilians-Universität, Munich)

Tuesday, October 18, 12:00-13:00

Abstract: Artificial intelligence is currently leading to one breakthrough after the other, both in public life with, for instance, autonomous driving and speech recognition, and in the sciences in areas such as medical diagnostics or molecular dynamics. However, one current major drawback is the lack of reliability of such methodologies.

In this lecture, we will first provide an introduction to this vibrant research area, focussing specifically on deep neural networks. We will then present some recent advances, in particular concerning explainability. Finally, we will discuss fundamental limitations of deep neural networks and related approaches in terms of computability, and how these can be circumvented in the future on the path to truly reliable AI.

WORKSHOP

All MLFPM fellows

Tuesday, October 18, 14:00-15:15

Abstract: Organized entirely by students, this session consists of a workshop with two interactive sessions.

In the first part, we will give an introduction to explainability in machine learning and highlight its importance in medical applications. As an example, we will focus on chest X-ray image classification and apply methods such as saliency maps and Grad-CAM to identify the image regions on which classifiers focus most when making decisions. Analyses will be performed in Python.

In the second part, we will focus on a genomic application. Starting from Genome-Wide Association Study (GWAS) and epistasis statistics, we will show how networks can be used to visualize and disentangle the genetic architecture of complex diseases. Analyses will be performed in R.

MLFPM Presentation: Prediction of recovery from multiple organ dysfunction syndrome in pediatric sepsis patients

Bowen Fan (ETH Zürich)

Tuesday, October 18, 15:15-15:45

Abstract: Sepsis is a leading cause of death and disability in children globally, accounting for 3 million childhood deaths per year. In pediatric sepsis patients, multiple organ dysfunction syndrome (MODS) is considered a significant risk factor for adverse clinical outcomes, characterized by high mortality and morbidity in the pediatric intensive care unit. The rapidly growing availability of electronic health records (EHRs) has allowed researchers to develop data-driven approaches such as machine learning in healthcare, with great success. In this study, we develop a machine learning-based approach to predict recovery from MODS to zero or single organ dysfunction one week in advance in the Swiss Pediatric Sepsis Study cohort of children with blood-culture-confirmed bacteremia. The promising results indicate that our model has the potential to be integrated into EHR systems and to contribute to patient assessment and triage in pediatric sepsis care.

MLFPM Presentation: The epigenetic landscape through the lens of machine learning

Giovanni Visonà (Max Planck Institute for Intelligent Systems)

Tuesday, October 18, 15:45-16:15

Abstract: Recent advances in sequencing techniques have shed new light on the epigenetic mechanisms that control and regulate the biological machinery of cells. The challenges involved in the robust analysis of these data offer exciting opportunities for the application of state-of-the-art machine learning models to bridge the gap between gathering the data and deriving biological knowledge.

Applications of this multidisciplinary area of research include early disease detection and drug resistance prediction. Here, we introduce two research projects based on the application of machine learning to epigenetics. The epigenomic imputation model eDICE offers an overview of the challenges of representation learning applied to large-scale epigenomic datasets and shows, as a proof of concept, possible developments for precision medicine applications. DecoDen shows how modeling the biological processes involved in sequencing can allow us to robustly denoise epigenomic measurements.

MLFPM Presentation: Learning from multimodal data to improve cancer treatment

Ndèye Maguette Mbaye (Armines)

Tuesday, October 18, 16:45-17:15

Abstract: Interest has increased in the use of prognostic factors as a guide for personalized breast cancer treatment. For clinicians, early detection of those factors can help in managing the disease well and in choosing an effective treatment. Moreover, a huge amount of meaningful information in the pathology reports, biological measurements and clinical records of a patient journey remains unexploited. In that context, I propose to develop and apply novel machine learning techniques to predict cancer outcomes such as recurrence or survival from multimodal breast cancer patient data (including medical notes in natural language and the results of various lab analyses). Faced with different modalities, I work on multiple ways to integrate them during model development: late integration, which treats each modality separately, and early integration, which joins all modalities’ embeddings into a single input. Since an important goal of my study is to find prognostic factors among the data’s features, all the developed models must be interpretable. Analysis of the most important features will give insights into the relationships between features and the outcome.

MLFPM Presentation: Martingale residual based approach for Cox modeling from high-dimensional data

Anastassia Kolde (University of Tartu)

Tuesday, October 18, 17:15-17:45

Abstract: Background: Survival modelling is a natural approach to study genetic associations to many phenotypes and it has been shown to be more sensitive than traditional regression-based methods. Still, it is not commonly used in Genome Wide Association Studies (GWAS) due to prohibitive computational cost.

Objective: The Cox proportional hazard model can be approximated with linear regression by converting the survival trait to martingale residuals, thus enabling the use of common GWAS tools. This approach has been taken in many papers; however, it is known that, theoretically, the approximation only works within certain bounds.

Method: The goal of this work is to explore the bounds experimentally through simulations in order to determine if and when the approximation is usable within GWAS and other omics data analysis settings.
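The idea of replacing a Cox model with linear regression on martingale residuals can be sketched in a few lines. This is a minimal illustration, not the speaker's method: it assumes a null model without covariates, so the residual of each subject is the event indicator minus the Nelson-Aalen cumulative hazard at that subject's time. The function names, the simulated genotypes, the effect size `beta`, and the censoring time are all made up for the example.

```python
import math
import random


def martingale_residuals(times, events):
    """Null-model martingale residuals: event_i - H(t_i), with H the Nelson-Aalen estimate."""
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i])
    residuals = [0.0] * n
    cum_hazard = 0.0
    at_risk = n
    start = 0
    while start < n:
        # Group subjects that share the same observed time.
        t = times[order[start]]
        end = start
        deaths = 0
        while end < n and times[order[end]] == t:
            deaths += events[order[end]]
            end += 1
        cum_hazard += deaths / at_risk
        for pos in range(start, end):
            i = order[pos]
            residuals[i] = events[i] - cum_hazard
        at_risk -= end - start
        start = end
    return residuals


def ols_slope(x, y):
    """Slope of the least-squares regression of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Simulated data (assumption): hazard grows with genotype, censoring at time 1.0.
rng = random.Random(0)
beta = 0.5
genotypes = [rng.choice([0, 1, 2]) for _ in range(2000)]
times, events = [], []
for g in genotypes:
    event_time = rng.expovariate(math.exp(beta * g))
    censor_time = 1.0
    times.append(min(event_time, censor_time))
    events.append(1 if event_time <= censor_time else 0)

residuals = martingale_residuals(times, events)
slope = ols_slope(genotypes, residuals)
print(slope)  # a positive slope indicates that higher genotype values carry higher risk
```

The residuals are computed once and can then be regressed on every variant with ordinary linear tools, which is where the computational saving comes from. How closely the slope tracks the Cox estimate is exactly the question the abstract raises.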

MLFPM Presentation: A signaling-informed neural network for scRNA-seq annotation of known and unknown cell types

Pelin Gündoğdu (Fundación Pública Andaluza Progreso y Salud)

Tuesday, October 18, 17:45-18:15

Abstract: Single-cell RNA sequencing is increasing our understanding of the behavior of complex tissues and organs by providing unprecedented detail on the cell type landscape at the level of individual cells. It contains information on cell types and cell states. Cell type definition and functional annotation raise key questions, such as how to determine similarity from the expression profiles of cells, or which cell types play an important role in diseased individuals. However, the exponential growth of scRNA-seq data poses two major computational challenges: how to group cells and how to identify new cell types. Many supervised and unsupervised methods have been proposed to annotate cells. Cell-type annotation performs well with supervised approaches, except when new (unknown) cell types are present. Moreover, the interpretability of the supervised approach is the other focus of the project.

Here, we introduce SigPrimedNet, an artificial neural network approach that leverages i) efficient training by means of a sparsity-inducing signaling circuits-informed layer, ii) feature representation learning through supervised training, and iii) unknown cell-type identification by fitting an anomaly detection method on the learned representation. We show that SigPrimedNet can efficiently annotate known cell types while keeping a low false-positive rate for unseen cells across a set of publicly available datasets. In addition, the learned representation acts as a proxy for signaling circuit activity measurements, which provide useful estimations of the cell functionalities.

Company session

Keynote talk: Knowledge Graph Embedding Models

Volker Tresp (Siemens & LMU München)

Wednesday, October 19, 09:15-10:00

Abstract: Knowledge graphs (KGs) describe structured symbolic data. A fact is described as a triple, like (Jack, hasDisease, Flu). A triple is related to a tuple in a database or a node in a relational Bayesian network or a Markov logic network. KGs are currently very popular in many industries for storing and querying data and for database integration. My team focuses on machine learning with KGs. We developed RESCAL, which was the first embedding model for KGs. At Siemens, we use KG embedding for recommending industrial components to our customers. My team co-developed PyKEEN, a Python package designed to train and evaluate knowledge graph embedding models. We are among the first teams to study temporal KGs and scene graphs. Current projects focus on rule extraction from temporal KGs, multimodal KGs, and KGs for foundation models. A bit more explorative are our projects in quantum KGs and KGs for understanding natural intelligence.
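To make the notion of scoring a triple concrete, here is a minimal sketch of a RESCAL-style bilinear score, in which each entity has a vector and each relation has a matrix. The two-dimensional embeddings below are hand-picked for illustration, not learned, and the code does not use PyKEEN or any other library.

```python
def rescal_score(e_subject, w_relation, e_object):
    """Bilinear score e_s^T W_r e_o for one (subject, relation, object) triple."""
    return sum(
        e_subject[i] * w_relation[i][j] * e_object[j]
        for i in range(len(e_subject))
        for j in range(len(e_object))
    )


# Hypothetical, hand-picked embeddings for a triple like the one in the abstract.
e_jack = [1.0, 0.0]
e_flu = [0.0, 1.0]
w_has_disease = [[0.0, 1.0], [0.0, 0.0]]  # relation matrix, deliberately asymmetric

forward = rescal_score(e_jack, w_has_disease, e_flu)   # person hasDisease disease
backward = rescal_score(e_flu, w_has_disease, e_jack)  # reversed triple
print(forward, backward)
```

Because the relation matrix need not be symmetric, a triple and its reversal can receive different scores, which is what allows directed relations to be represented.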

MLFPM Presentation: Virtual biopsy derived using AI-based multimodal modeling of binational breast mammography data

Vesna Barros (IBM Research)

Wednesday, October 19, 10:00-10:30

Abstract: Computational models based on artificial intelligence (AI) are being increasingly used to diagnose malignant breast lesions. However, assessment from radiological images of the specific pathology lesion subtypes, as detailed in the results of biopsy procedures, remains a challenge. In this retrospective study, we developed an AI-based model to identify breast lesion subtypes through mammography images and linked electronic health records labeled with histopathology information. We collected 26,569 images from 9,234 women imaged with digital mammography for pre-training the algorithms. The training data included individuals who had at least one year of clinical and imaging history followed by biopsy-based histopathologic diagnosis from March 2013 to November 2018. A model that combines convolutional neural networks with supervised learning algorithms was independently trained on data from 2,120/1,642 women in Israel/USA to make breast lesion predictions. For predicting malignancy in the Israel/USA held-out sets (containing 220/126 patients’ examinations with ductal carcinoma in situ or invasive cancer), the algorithms obtained an AUC of 0.88 (95% CI: 0.85, 0.91) and 0.80 (95% CI: 0.74, 0.85), respectively (P = .006). Our results offer supporting evidence that artificial intelligence applied to clinical and mammography images can identify breast lesion subtypes when the data is sufficiently large, which may help assess diagnostic workflow and reduce biopsy sampling errors.

MLFPM Presentation: Covariate removal, data set integration, and counterfactual reconstruction for scRNA-seq data using deep generative methods

Christopher Heje Grønbech (Qlucore)

Wednesday, October 19, 10:30-11:00

Abstract: Single-cell RNA-sequencing (scRNA-seq) data sets are large, high-dimensional, and complex. Various methods to reduce the dimensionality and clarify the interactions between the genes have been applied to these data sets to analyse and visualise them. Deep generative methods such as variational auto-encoders (VAEs) have had great success in doing so. VAEs are able to encode data into lower-dimensional (latent) representations. These latent representations are ordered according to the inherent structure in the data, and for scRNA-seq data this is usually according to cell type.

However, scRNA-seq data usually come from different experiment batches and different donors, and the quality of each cell differs. These are different sources of variation, and they muddle the latent representations. So we have explored different VAE models incorporating clinical information, technical variation, and quality metrics to remove them from the latent representations. Some of these models are simply conditioned on the sources of variation, while the others use two latent representations: one to model the scRNA-seq data and one to remove the sources of variation. The two latent representations were examined in both hierarchical and separate configurations. We have also taken a step further by applying the same techniques to integrate multiple data sets.

Since VAEs are generative, they can also generate reconstructions of the original data or even generate new data. So we have investigated whether changing a source of variation for certain cells alters the gene expression of the reconstructed cells. This counterfactual reconstruction could, for instance, be used to inspect differences in gene expression between a diseased cell and a healthy version of that same cell.

MLFPM Presentation: Discovering genetic associations via network-based approach

Giulia Muzio (ETH Zürich)

Wednesday, October 19, 11:30-12:00

Abstract: The search for associations between genetic markers and complex traits allowed for the identification of a multitude of trait-related genetic variants. However, most of them explain only a small fraction of observed phenotypic variation. To overcome this limitation, it is possible to aggregate the effects of several genetic markers and, hence, to test entire genes or other functional units for their association to a trait. Another strategy, namely the network-based genome-wide association studies, foresees testing (sub)networks of genes, but suffers from a vast search space and an inherent multiple testing problem. In fact, current methods are either based on greedy feature selection or neglect doing a multiple testing correction, resulting in the risk of missing relevant associations or leading to an abundance of false positive findings, respectively. To address these shortcomings, we propose networkGWAS, a computationally efficient and statistically sound approach to network-based genome-wide association studies using mixed models and neighborhood aggregation. By means of circular and degree-preserving network permutation schemes, networkGWAS allows for population structure correction and for well-calibrated p-values and is capable of detecting known associations on both semi-simulated common variants from A. thaliana and simulated rare variants from H. sapiens. Additionally, networkGWAS identifies neighborhoods of genes involved in stress-related biological processes on a stress-induced phenotype from S. cerevisiae, and a subnetwork of known and new type II diabetes-related genes on a type II diabetes phenotype from the Estonian Biobank.

Keynote talk: Deep learning for biological sequences

Jean-Philippe Vert (Owkin)

Wednesday, October 19, 12:00-13:00

Abstract: Deep neural networks are increasingly used to analyze biological sequences, including DNA, RNA and proteins, leading to promising applications in annotation, classification, structure prediction or generation. While the architectures of deep neural networks for biosequences have so far been largely borrowed from the field of natural language processing, I will discuss in this presentation some specificities of biosequences that deserve specific methodological developments, in particular 1) how to transform a biosequence into a sequence of tokens, 2) how to incorporate some known symmetries of biosequences in the architecture of the model, and 3) how to solve tasks which are specific to biosequences, such as learning to align.

MD Session

Keynote lecture: AI technologies to accelerate biomarker discovery: breast and renal cancer as a case study

Michal Rosen-Zvi

Wednesday, October 19, 14:00-14:45

Abstract: Tumors are complex ecosystems with heterogeneous cancer and immune cell type populations and diverse phenotypic . Cancer staging is a method to categorize the cancer type and the extent of its spread in the body; stage at diagnosis varies widely. Breast cancer was the most diagnosed cancer in 2020, with an estimated 2.3 million cases. Renal cancer is among the 10 most common cancers in both men and women, with close to an estimated 0.5 million new cases in 2020. Recently developed AI technologies have been shown to achieve high-quality results in interpreting medical images for the diagnosis of breast and renal cancer. This talk is about the potential use of such technologies in cancer staging and as a means to discover novel prognostic biomarkers.

MLFPM Presentation: Data Mining for Mental Health: a.k.a. Digital Phenotyping

Emese Sükei (Universidad Carlos III de Madrid)

Wednesday, October 19, 14:45-15:15

Abstract: Smartphones and wrist-wearable devices have penetrated our lives in recent years. According to published statistics, nearly 84% of the world’s population owns a smartphone, and almost 10% own a wearable device today (2022). These devices continuously generate various data sources from multiple sensors and apps, creating our digital phenotypes. This opens new research opportunities, particularly in mental health care, which has previously relied almost exclusively on self-reports of mental health symptoms. During this presentation, first, I will talk about the concept of digital phenotyping, its promises when applied to medical uses (providing a passive estimation of PROMs), and the challenges that such data sources present (missing data, heterogeneous temporal data). In the second part of the talk, I will present machine learning solutions for modelling patient outcomes in the mental health field – such as emotional state, anxiety severity and functioning disability – based on their digital phenotypes.

General public session

Keynote talk: Machine Learning as the engine of Precision Psychiatry: Where do we stand and where do we go from here?

Nikolaos Koutsouleris (Max Planck Institute of Psychiatry)

Wednesday, October 19, 15:15-16:15

Abstract: The embedding of machine learning methods in psychiatric research has stirred the hope that objective markers of mental disorders are within reach, facilitating a more precise and preventive treatment of these heterogeneous conditions. I will discuss the state-of-the-art of predictive data science in psychiatry by exemplifying supervised, comparative and subtyping approaches in psychiatric machine learning using recent findings from the Personalised Prognostic Tools for Early Psychosis Management study (PRONIA; www.pronia.eu).

Keynote talk: A Good Scale is Hard To Find: Shape Analysis Using Topology

Bastian Rieck (Helmholtz Center Munich)

Wednesday, October 19, 16:45-17:45

Abstract: Our world is full of phenomena that happen at different spatial and temporal scales. If we pick the wrong scale, we might miss the forest for the trees—and vice versa! In recent years, methods from computational topology have started to emerge as one way to address the challenging task of picking the ‘right’ scale: Instead of enforcing one specific scale when analysing data, such methods afford an analysis of *all* scales inherent to data sets.

In this talk, I will outline the general utility and expressivity of topological machine learning methods, i.e. methods combining a rigorous mathematical underpinning with the flexibility of modern deep learning architectures. I will provide both a theoretical as well as an applied view on this topic by showcasing how to employ topological methods to solve inverse problems, dealing with image reconstruction tasks in fluorescence microscopy.