R.ROSETTA: an interpretable machine learning framework
- Speaker(s)
- Mateusz Garbulowski & Jan Komorowski
- Affiliation
- Department of Cell and Molecular Biology, Uppsala University, Sweden
- Date
- 26 March 2021, 14:15
- Event information
- meet.google.com/jbj-tdsr-aop
- Seminar
- Research Seminar of the Logic Section: Approximate Reasoning in Data Mining
R.ROSETTA: an interpretable machine learning framework
Mateusz Garbulowski¹ and Jan Komorowski¹,²,³,⁴
¹ Department of Cell and Molecular Biology, Uppsala University, Uppsala, Sweden;
² Swedish Collegium for Advanced Study, Uppsala, Sweden;
³ Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland;
⁴ Washington National Primate Research Center, Seattle, WA, USA
Machine learning provides methods that support bioinformatics analyses by predicting the class of samples of various origins. The main goal of such analyses is usually to achieve the highest possible accuracy. In several applications, however, it is equally important to understand the mechanisms behind the predictions. Interpretable machine learning (IML) was introduced as a concept that makes the classification procedure transparent. Interpretable classifiers have been used extensively in omics studies to discover patterns that reflect biological mechanisms. A further benefit of transparency is that the model components can be described by measurements, which can in turn be used to estimate statistics. It is therefore important to equip IML models with statistical measurements that can be used to evaluate and prune the models.
We developed the R.ROSETTA package, an R adaptation of the ROSETTA framework, which is based on rough set theory. We moved the ROSETTA framework to R in order to integrate the tool with a well-known statistical computing environment and to reach a wider scientific community. The package allows rule-based models to be constructed and analyzed in a user-friendly way. A major goal of our work was to improve the interpretability, accessibility and quality of rule-based modelling. Among others, we implemented functions for (1) balancing an uneven distribution of classes by undersampling, (2) estimating rule P values and other measures of significance, (3) retrieving support sets, that is, the samples that correspond to each rule, (4) enhancing the prediction of external datasets, (5) merging rule-based models and (6) assisting the visualization of a rule or a model. The R.ROSETTA package is publicly available at https://github.com/komorowskilab/R.ROSETTA.
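As a rough illustration of functions (1) to (3), the Python sketch below balances classes by random undersampling, retrieves the support set of a single rule, and estimates a rule P value. It is not the R.ROSETTA API: all function names, the dictionary-based sample layout and the toy data are hypothetical, and the use of a one-sided hypergeometric test for the P value is an assumption, since the abstract does not state which test the package applies.

```python
import random
from math import comb


def undersample(samples, labels, seed=0):
    """Randomly drop samples from larger classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    n_min = min(len(members) for members in by_class.values())
    balanced = []
    for label, members in by_class.items():
        for sample in rng.sample(members, n_min):
            balanced.append((sample, label))
    return balanced


def support_set(conditions, samples):
    """Return indices of samples that satisfy every condition of a rule."""
    return [
        i for i, sample in enumerate(samples)
        if all(sample.get(feature) == value for feature, value in conditions.items())
    ]


def hypergeom_p_value(k, n_support, n_class, n_total):
    """P(X >= k) when n_support samples are drawn from n_total, n_class of which
    belong to the rule's decision class (assumed test, see lead-in)."""
    upper = min(n_support, n_class)
    tail = sum(
        comb(n_class, i) * comb(n_total - n_class, n_support - i)
        for i in range(k, upper + 1)
    )
    return tail / comb(n_total, n_support)


# Hypothetical toy data: discretized expression levels of two genes.
toy_samples = [
    {"geneA": "high", "geneB": "low"},
    {"geneA": "high", "geneB": "low"},
    {"geneA": "high", "geneB": "high"},
    {"geneA": "low", "geneB": "low"},
    {"geneA": "low", "geneB": "high"},
    {"geneA": "high", "geneB": "low"},
]
toy_labels = ["autism", "autism", "autism", "control", "control", "control"]

# Hypothetical rule: IF geneA = high AND geneB = low THEN autism.
rule_conditions = {"geneA": "high", "geneB": "low"}
rule_support = support_set(rule_conditions, toy_samples)
rule_hits = sum(1 for i in rule_support if toy_labels[i] == "autism")
rule_p = hypergeom_p_value(
    rule_hits, len(rule_support), toy_labels.count("autism"), len(toy_labels)
)
print(rule_support, rule_hits, rule_p)
```

In this sketch the support set is simply the list of sample indices matched by the rule, so the same list can be reused both for the significance estimate and for inspecting which samples a rule covers.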
To illustrate the usage of the package, we applied it to transcriptome datasets from autism case–control studies. A comparison with state-of-the-art R packages for rule-based and decision-tree-based IML showed that R.ROSETTA produced models of comparable quality in comparable computation time. However, most of these packages lack the novel functions included in R.ROSETTA. We also applied R.ROSETTA in an exhaustive analysis of autism spectrum disorder (ASD) subtypes. We demonstrated that the package allows (1) creating balanced rule-based models, (2) pruning models using the statistical properties of rules, (3) merging multiple independent datasets and (4) revealing dissimilarities between ASD subtypes based on support sets. The final results and conclusions were supported by an analysis of rule-based networks constructed with the VisuNet tool.
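Pruning a rule-based model by the statistical properties of its rules can be pictured as a simple filter. The Python sketch below is only an assumption-laden illustration, not the procedure used in the study: the function name, the rule layout, the Bonferroni correction and the 0.05 threshold are all hypothetical choices, as the abstract does not specify the correction or cutoff.

```python
def prune_rules(rules, alpha=0.05):
    """Keep rules whose Bonferroni-adjusted P value is below alpha.

    Each rule is a dict with at least a "p_value" key (hypothetical layout).
    """
    n_rules = len(rules)
    kept = []
    for rule in rules:
        # Bonferroni: multiply by the number of tested rules, cap at 1.
        adjusted = min(1.0, rule["p_value"] * n_rules)
        if adjusted < alpha:
            kept.append({**rule, "p_adjusted": adjusted})
    return kept


# Hypothetical rules with made-up P values, for illustration only.
example_rules = [
    {"id": "r1", "p_value": 0.001},
    {"id": "r2", "p_value": 0.03},
    {"id": "r3", "p_value": 0.2},
]
pruned = prune_rules(example_rules)
print([rule["id"] for rule in pruned])
```

A less conservative correction could be substituted by changing how `adjusted` is computed; the filtering step itself stays the same.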
Link to meeting: https://meet.google.com/jbj-tdsr-aop