Abstract
Comparing the genetic material, the genomes, of many diverse eukaryotes can help us to understand eukaryotic evolution. By comparing genomes, researchers found that the ancestor of all eukaryotes already had a complex repertoire of genes, even though this ancestor was in all likelihood a simple cell or community of cells.
Unexpectedly, it seems that the evolution of complex eukaryotes and their genomes is driven not by the gain of new genes, but by the duplication and loss of existing genes.
Computational comparative genomics allows us to do large-scale evolutionary analyses. A high rate of gene loss is an unexpected trend that has emerged from this type of research over the last decade. However, it has also become clear that genome data and annotations are not perfect. In Chapter 2, we explore the effect of gene prediction errors on gene loss estimates in eukaryotes. This chapter shows that, overall, gene prediction is reliable. However, certain gene absences are suspicious and have a higher chance of being prediction errors rather than true absences. Such false absences inflate gene loss estimates. We propose a way to spot them.
To do large-scale comparative genomics and to analyse the evolution of genomes and genes, we need to infer orthologs on a large scale, which requires automated orthology inference methods. In Chapter 3, we benchmark multiple automated orthology inference methods by their ability to recapitulate several observations and trends in eukaryotic genome evolution, including gene loss, the gene content of the last eukaryotic common ancestor, protein interaction prediction using phylogenetic profiles (co-occurrence of genes), and the overlap with manually curated orthologous groups. This chapter shows that most orthology methods are able to recapitulate trends in eukaryotic genome evolution. At the same time, the orthologous groups they infer differ between methods and overlap only imperfectly with manually curated orthologous groups.
Chapter 4 builds on the results of Chapter 3, in which we were able to predict protein interactions quite well using phylogenetic profiles. Phylogenetic profiling shows good promise for predicting and studying functional relationships in prokaryotes, but it has proven harder to obtain high performance in eukaryotic species. In Chapter 4, we analyse the effect of multiple metaparameter choices that can be made in phylogenetic profiling. The chapter sheds light on the differences in reported performance amongst phylogenetic profiling approaches and reveals, on a more fundamental level, for which types of protein interactions phylogenetic profiling holds the most promise when applied to eukaryotes.
In Chapter 5, we aim not only to understand why certain metaparameter choices influence phylogenetic profiling in eukaryotes, but also to further increase the performance of protein interaction prediction from phylogenetic profiles using machine learning. We show how the Random Forest algorithm can be used to improve protein interaction prediction in eukaryotes.
Chapter 6 ends the thesis with a discussion of the discoveries, challenges, and outlook of the studies presented here.