Combining the best of two worlds: Python & R

April 22, 2022

Outline

Motivation for combining R and Python
Levels of integration
- Break analysis into homogeneous chunks
- Use a “bridge”
- Truly integrated workflow

Why Python or R for data science?

Both
- open-source programming language with a large community
- new libraries or tools are added continuously
- interface to compiled language libraries
Python
- general purpose programming language
- clean, efficient, modular design (software deployment)
- scipy, scVelo, scikit-learn, Keras, PyTorch
R
- statistical modeling and data analysis language
- great for visualization
- Bioconductor (edgeR, DESeq2, scran, scater), tidyverse, ggplot2, shiny

Many “Python versus R” comparisons out there…

Which #superheroe are you?(#batman Vs. #Superman) == (#R Vs. #Python)? #datascience @roopamu https://t.co/B1gO8MT1Zr pic.twitter.com/GR3pUiZ6rS
— Antoine (@AntoineTrdc) November 1, 2015

Why combine the two?

any data scientist constantly combines tools from various sources
unlikely that all functionality is available from a single source
nice to have fewer scripts / steps / intermediate files
use case (why I started looking into combining R and Python):
- primarily use R/BioC for analysis of single cell data
- want to make use of scVelo python package for RNA-velocity analysis
- the velociraptor package did not exist yet

Levels of integration

Break into homogeneous chunks
- each chunk is pure R or pure python
- chunks are run separately from each other
- input and output are read from and stored in intermediate files
Use a “bridge”
- primarily work with one language
- use a specialized package (e.g. rpy2) that allows calling the 2nd language from the first
Truly integrated workflow
- use a single script or notebook
- run it through a pair of connected R and python processes
- objects are shared between these processes (no need for input/output files)
- this is possible both using RStudio/reticulate and Jupyter/rpy2

Approach 1: break into pure R/python chunks

can be organized using workflow tools
- (make, Snakemake, knime, custom pipelines, …)
Advantages
- flexible, can combine any tool (scripts, binaries, …)
Disadvantages
- no real integration
- need to store state in intermediate files
- need for cut-and-glue code
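A minimal sketch of the file-based hand-off, seen from the Python side and assuming CSV as the intermediate format. The file name counts.csv, the column names and both helper functions are hypothetical; an R chunk would pick the file up in a separate step, for example with read.csv().

```python
import csv
import tempfile
from pathlib import Path


def write_intermediate(rows, path):
    """Store the state of the Python chunk in a CSV file (hypothetical layout)."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["gene", "count"])
        writer.writeheader()
        writer.writerows(rows)


def read_intermediate(path):
    """Read the CSV back, as a later chunk would; all values arrive as strings."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


# Hypothetical result of a pure-Python analysis chunk.
rows = [{"gene": "A", "count": 10}, {"gene": "B", "count": 3}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "counts.csv"
    write_intermediate(rows, path)
    restored = read_intermediate(path)

print(restored)
```

Note that the counts come back as strings: restoring types after each hand-off is part of the cut-and-glue code listed under the disadvantages.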

Approach 2: Use a “bridge”

made possible by “bridge” packages
- Call python from R: reticulate
- Call R from python: rpy2
Advantages
- easy to use (primarily use one language)
Disadvantages
- indirect access to “other” language
- need to learn bridge package syntax

Display conventions (also for exercises)

I will use background colors to indicate code from different languages:

# R code
R.version.string

## [1] "R version 4.1.2 (2021-11-01)"

# python code
import sys
sys.version

## '3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:34:28) \n[GCC 10.3.0]'

# shell script (bash)
echo ${BASH_VERSION}

## 4.2.46(2)-release

Example: Calling python from R using reticulate

library(reticulate)
os <- import("os")
os$listdir(".")

##  [1] "02_exercise_rstudio_problem-based.html"
##  [2] "styles.css"                            
##  [3] "None-requirements.txt"                 
##  [4] "ideas_2022.txt"                        
##  [5] "04_exercise_jupyter.html"              
##  [6] ".ipynb_checkpoints"                    
##  [7] "01_introduction.html"                  
##  [8] "requirements.txt"                      
##  [9] "FMI_python_config.R"                   
## [10] "01_introduction.Rmd"                   
## [11] ".Rhistory"                             
## [12] "02_exercise_rstudio_problem-based.Rmd" 
## [13] "sinfo-requirements.txt"                
## [14] "04_exercise_jupyter.ipynb"             
## [15] "03_exercise_rstudio.html"              
## [16] "R_requirements.R"                      
## [17] "figures"                               
## [18] "03_exercise_rstudio.Rmd"               
## [19] "pythonenv"                             
## [20] "rstudio_reticulate_examples.Rmd"       
## [21] "nohup.out"

# the same call in native python
import os
os.listdir(".")

## ['02_exercise_rstudio_problem-based.html', 'styles.css', 'None-requirements.txt', 'ideas_2022.txt', '04_exercise_jupyter.html', '.ipynb_checkpoints', '01_introduction.html', 'requirements.txt', 'FMI_python_config.R', '01_introduction.Rmd', '.Rhistory', '02_exercise_rstudio_problem-based.Rmd', 'sinfo-requirements.txt', '04_exercise_jupyter.ipynb', '03_exercise_rstudio.html', 'R_requirements.R', 'figures', '03_exercise_rstudio.Rmd', 'pythonenv', 'rstudio_reticulate_examples.Rmd', 'nohup.out']

reticulate type conversions

R	Python	Examples
Single-element vector	Scalar	1, 1L, TRUE, "foo"
Multi-element vector	List	c(1.0, 2.0, 3.0), c(1L, 2L, 3L)
List of multiple types	Tuple	list(1L, TRUE, "foo")
Named list	Dict	list(a = 1L, b = 2.0), dict(x = x_data)
Matrix/Array	NumPy ndarray	matrix(c(1,2,3,4), nrow = 2, ncol = 2)
Data Frame	Pandas DataFrame	data.frame(x = c(1,2,3), y = c("a", "b", "c"))
Function	Python function	function(x) x + 1
Raw	Python bytearray	as.raw(c(1:10))
NULL, TRUE, FALSE	None, True, False	NULL, TRUE, FALSE

reticulate used in other packages

many packages use reticulate in the background to bridge to Python
prominent examples are the R packages tensorflow and keras, which implement an R API strongly resembling the original Python APIs (TensorFlow and Keras) and use reticulate for Python calls and object translation
for package developers, the basilisk Bioconductor package may be interesting: it provides a mechanism to distribute a controlled Python environment and is used, for example, in:
- velociraptor
- zellkonverter

Example: Calling R from python using rpy2

import rpy2.robjects as robjects
pi = robjects.r['pi']
pi[0]

## 3.141592653589793

pi

## [1] 3.141593

rpy2 type conversions

rpy2’s object conversion system is complex and powerful:
- a lower-level interface (implemented as protocols): rpy2.rinterface
- a higher-level interface using converter functions: rpy2.robjects.conversion.Converter()
- custom converters can be implemented for new objects
for details see: https://rpy2.github.io/doc/latest/html/robjects_convert.html
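The idea of converters that can be extended for new object types can be illustrated with a small standard-library sketch. This is not rpy2's API: the function to_r_literal and the class Interval are hypothetical, and the sketch only shows type-dispatched conversion (here via functools.singledispatch) turning Python values into R source text.

```python
from dataclasses import dataclass
from functools import singledispatch


@dataclass
class Interval:
    """Hypothetical user-defined type that has no default conversion."""
    start: float
    end: float


@singledispatch
def to_r_literal(obj):
    """Fallback: refuse objects without a registered converter."""
    raise NotImplementedError(f"no converter for {type(obj).__name__}")


@to_r_literal.register
def _(obj: bool):
    return "TRUE" if obj else "FALSE"


@to_r_literal.register
def _(obj: int):
    return f"{obj}L"


@to_r_literal.register
def _(obj: float):
    return repr(obj)


@to_r_literal.register
def _(obj: list):
    return "c(" + ", ".join(to_r_literal(x) for x in obj) + ")"


# Custom converter for the new type, added without touching the code above.
@to_r_literal.register
def _(obj: Interval):
    return f"c(start = {to_r_literal(obj.start)}, end = {to_r_literal(obj.end)})"


print(to_r_literal([1, 2, 3]))
print(to_r_literal(Interval(0.5, 2.0)))
```

The converter is chosen by the type of the argument, so supporting a new class only requires registering one more function.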

Approach 3: Integrated workflow

use a single script or notebook
- use a pair of connected R and python processes
- processes can share objects, similar to the “bridge” approach
- supported by RStudio: rmarkdown + reticulate, see also blog post
- supported by Jupyter: rpy2.ipython.rmagic
Advantages
- easy to use
- can mostly use native code
- limited need to learn specific syntax
Disadvantages
- increased complexity of environment (ok once it is set up)

Comment: “I don’t like notebooks”

presentation by Joel Grus at JupyterCon 2018:
- link to slides (Google docs)
- presentation refers to Jupyter notebooks, but applies equally to RStudio notebooks
in a nutshell:
- possible to run chunks out of order and to have an inconsistent state (shown output is not what you would get upon rerun)
- the hidden state makes it difficult to understand what’s going on
- better: use markdown (compiled in order, no bad surprises)
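The hidden-state problem can be reproduced in a few lines. The sketch below is a simplified stand-in for a notebook kernel, assuming each cell is a code string executed against one shared namespace; the cell contents and the run helper are hypothetical.

```python
# Each "cell" is a code string run against one shared namespace,
# mimicking how a notebook kernel keeps state between cells.
cells = {
    1: "x = 1",
    2: "y = x * 10",
    3: "x = 5",
}


def run(order):
    """Execute the cells in the given order and return the final state."""
    namespace = {}
    for number in order:
        exec(cells[number], namespace)
    return {name: namespace[name] for name in ("x", "y")}


in_order = run([1, 2, 3])        # what a top-to-bottom rerun produces
out_of_order = run([1, 3, 2])    # cell 3 was run before cell 2

print(in_order)
print(out_of_order)
```

Both runs end with the same cells on screen, yet y differs (10 versus 50): the out-of-order session shows a result that a clean rerun would not reproduce.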

Example: RStudio markdown

Code cells are declared in a header:

The special objects r and py can be used to access the “other side”:

Example: Jupyter notebook

Code cells are declared to contain R by starting with %R (single line) or %%R (multiple lines) (details):

%R [-i INPUT] [-o OUTPUT] [...] [code [code ...]]

Files and Exercises

day1_python_and_R/01_introduction.html
these slides
day1_python_and_R/02_exercise_rstudio_problem-based.html
our exercises (RStudio)
day1_python_and_R/03_exercise_rstudio.html
[optional] self-study RStudio
day1_python_and_R/04_exercise_jupyter.html
[optional] self-study Jupyter

Thank you!