April 22, 2022

Outline

  • Motivation for combining R and Python
  • Levels of integration
    • Break analysis into homogeneous chunks
    • Use a “bridge”
    • Truly integrated workflow

Why Python or R for data science?

  • Both
    • open-source programming language with a large community
    • new libraries or tools are added continuously
    • interface to compiled language libraries
  • Python
    • general purpose programming language
    • clean, efficient, modular design (software deployment)
    • scipy, scVelo, scikit-learn, Keras, PyTorch
  • R
    • statistical modeling and data analysis language
    • great for visualization
    • Bioconductor (edgeR, DESeq2, scran, scater), tidyverse, ggplot2, shiny

Many “Python versus R” comparisons out there…

Why combine the two?

  • any data scientist constantly combines tools from various sources
  • unlikely that all functionality is available from a single source
  • nice to have fewer scripts / steps / intermediate files
  • use case (why I started looking into combining R and Python):
    • primarily use R/BioC for analysis of single cell data
    • want to use the scVelo Python package for RNA-velocity analysis
    • the velociraptor package did not exist yet

Levels of integration

  1. Break into homogeneous chunks
    • each chunk is pure R or pure python
    • chunks are run separately from each other
    • inputs are read from, and outputs written to, intermediate files
  2. Use a “bridge”
    • primarily work with one language
    • use a specialized package (e.g. reticulate or rpy2) that allows calling the second language from the first
  3. Truly integrated workflow
    • use a single script or notebook
    • run it through a pair of connected R and python processes
    • objects are shared between these processes (no need for input/output files)
    • this is possible both using RStudio/reticulate and Jupyter/rpy2

Approach 1: break into pure R/python chunks

  • can be organized using workflow tools
    • (make, Snakemake, KNIME, custom pipelines, …)
  • Advantages
    • flexible, can combine any tool (scripts, binaries, …)
  • Disadvantages
    • no real integration
    • need to store state in intermediate files
    • need for cut-and-glue code
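
A minimal sketch of approach 1 from the Python side, assuming a toy table of gene counts and a hypothetical file name counts.csv (neither is from the original slides). The Python chunk writes its result to an intermediate file; a separate R chunk, here an inline Rscript one-liner, would read that file. The R step only runs if Rscript happens to be installed.

```python
import csv
import shutil
import subprocess
import tempfile
from pathlib import Path


def write_chunk_output(rows, path):
    """Python chunk: store results in an intermediate CSV file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["gene", "count"])
        writer.writerows(rows)


def build_r_command(csv_path):
    """Command line for a separate R chunk that reads the intermediate file."""
    r_code = "d <- read.csv(commandArgs(trailingOnly = TRUE)[1]); print(sum(d$count))"
    return ["Rscript", "-e", r_code, str(csv_path)]


# Hypothetical toy data and file name, for illustration only.
workdir = Path(tempfile.mkdtemp())
intermediate = workdir / "counts.csv"
write_chunk_output([("geneA", 10), ("geneB", 32)], intermediate)
command = build_r_command(intermediate)

# Run the R chunk only if Rscript is available on this machine.
if shutil.which("Rscript"):
    subprocess.run(command, check=False)
```

In a real pipeline the two chunks would typically live in separate scripts, and a workflow tool such as make or Snakemake would track the intermediate file.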

Approach 2: Use a “bridge”

  • made possible by “bridge” packages
  • Advantages
    • easy to use (primarily use one language)
  • Disadvantages
    • indirect access to “other” language
    • need to learn bridge package syntax

Display conventions (also for exercises)

I will use background colors to indicate code from different languages:

# R code
R.version.string
## [1] "R version 4.1.2 (2021-11-01)"

# python code
import sys
sys.version
## '3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:34:28) \n[GCC 10.3.0]'

# shell script (bash)
echo ${BASH_VERSION}
## 4.2.46(2)-release

Example: Calling python from R using reticulate

# R code
library(reticulate)
os <- import("os")
os$listdir(".")
##  [1] "02_exercise_rstudio_problem-based.html"
##  [2] "styles.css"                            
##  [3] "None-requirements.txt"                 
##  [4] "ideas_2022.txt"                        
##  [5] "04_exercise_jupyter.html"              
##  [6] ".ipynb_checkpoints"                    
##  [7] "01_introduction.html"                  
##  [8] "requirements.txt"                      
##  [9] "FMI_python_config.R"                   
## [10] "01_introduction.Rmd"                   
## [11] ".Rhistory"                             
## [12] "02_exercise_rstudio_problem-based.Rmd" 
## [13] "sinfo-requirements.txt"                
## [14] "04_exercise_jupyter.ipynb"             
## [15] "03_exercise_rstudio.html"              
## [16] "R_requirements.R"                      
## [17] "figures"                               
## [18] "03_exercise_rstudio.Rmd"               
## [19] "pythonenv"                             
## [20] "rstudio_reticulate_examples.Rmd"       
## [21] "nohup.out"

# python code (the same call, run natively)
import os
os.listdir(".")
## ['02_exercise_rstudio_problem-based.html', 'styles.css', 'None-requirements.txt', 'ideas_2022.txt', '04_exercise_jupyter.html', '.ipynb_checkpoints', '01_introduction.html', 'requirements.txt', 'FMI_python_config.R', '01_introduction.Rmd', '.Rhistory', '02_exercise_rstudio_problem-based.Rmd', 'sinfo-requirements.txt', '04_exercise_jupyter.ipynb', '03_exercise_rstudio.html', 'R_requirements.R', 'figures', '03_exercise_rstudio.Rmd', 'pythonenv', 'rstudio_reticulate_examples.Rmd', 'nohup.out']

reticulate type conversions

R                      | Python            | Examples
-----------------------|-------------------|-----------------------------------------------
Single-element vector  | Scalar            | 1, 1L, TRUE, "foo"
Multi-element vector   | List              | c(1.0, 2.0, 3.0), c(1L, 2L, 3L)
List of multiple types | Tuple             | list(1L, TRUE, "foo")
Named list             | Dict              | list(a = 1L, b = 2.0), dict(x = x_data)
Matrix/Array           | NumPy ndarray     | matrix(c(1,2,3,4), nrow = 2, ncol = 2)
Data Frame             | Pandas DataFrame  | data.frame(x = c(1,2,3), y = c("a", "b", "c"))
Function               | Python function   | function(x) x + 1
Raw                    | Python bytearray  | as.raw(c(1:10))
NULL, TRUE, FALSE      | None, True, False | NULL, TRUE, FALSE
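
Seen from the Python side, the rows of the table correspond to ordinary built-in types. A minimal sketch, assuming a hypothetical function conversion_examples() (not part of reticulate) that returns one value per row; the NumPy and pandas rows are left out so that only the standard library is needed. A file containing this function could be loaded from R, for example with reticulate's source_python().

```python
def conversion_examples():
    """Return one Python value per table row (NumPy/pandas rows omitted)."""
    return {
        "scalar": 1,                           # R: single-element vector, e.g. 1L
        "list": [1.0, 2.0, 3.0],               # R: multi-element vector c(1.0, 2.0, 3.0)
        "tuple": (1, True, "foo"),             # R: list of multiple types
        "dict": {"a": 1, "b": 2.0},            # R: named list list(a = 1L, b = 2.0)
        "function": lambda x: x + 1,           # R: function(x) x + 1
        "bytearray": bytearray(range(1, 11)),  # R: raw vector as.raw(c(1:10))
        "none": None,                          # R: NULL
    }


examples = conversion_examples()
for name, value in examples.items():
    # Show which Python type sits behind each table row.
    print(f"{name:10s} -> {type(value).__name__}")
```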

reticulate used in other packages

  • many packages use reticulate in the background to bridge to Python
  • prominent examples are the R packages tensorflow and keras, which implement R APIs that closely resemble the original Python APIs (TensorFlow and Keras) and use reticulate for Python calls and object translation
  • for package developers, the basilisk Bioconductor package may be interesting: It provides a mechanism to distribute a controlled Python environment and is used for example in:

Example: Calling R from python using rpy2

import rpy2.robjects as robjects
pi = robjects.r['pi']
pi[0]
## 3.141592653589793
pi
## [1] 3.141593

rpy2 type conversions

Approach 3: Integrated workflow

  • use a single script or notebook
    • use a pair of connected R and python processes
    • processes can share objects, similarly to the “bridge” approach
    • supported by RStudio: rmarkdown + reticulate, see also blog post
    • supported by Jupyter: rpy2.ipython.rmagic
  • Advantages
    • easy to use
    • can mostly use native code
    • limited need to learn specific syntax
  • Disadvantages
    • increased complexity of the environment (ok once it is set up)

Comment: “I don’t like notebooks”

  • presentation by Joel Grus at JupyterCon 2018:
  • in a nutshell:
    • possible to run chunks out of order and to have an inconsistent state (shown output is not what you would get upon rerun)
    • the hidden state makes it difficult to understand what’s going on
    • better: use markdown (compiled in order, no bad surprises)
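
The hidden-state problem can be sketched in a few lines of plain Python. This is a toy model, not how Jupyter is implemented: the three "cells" and the helper run_cells() are hypothetical, and each cell is simply executed in a shared namespace in the order given.

```python
# Three toy notebook cells that share one namespace.
CELLS = [
    "x = 1",           # cell 0: initialize
    "x = x + 1",       # cell 1: modify state in place
    "result = x * 10", # cell 2: compute the displayed result
]


def run_cells(order):
    """Execute the cells in the given order and return the final result."""
    namespace = {}
    for index in order:
        exec(CELLS[index], namespace)
    return namespace["result"]


in_order = run_cells([0, 1, 2])     # what a clean top-to-bottom rerun gives
rerun_one = run_cells([0, 1, 1, 2]) # cell 1 was accidentally run twice

print(in_order)   # 20
print(rerun_one)  # 30
```

The notebook on screen looks identical in both cases, but the displayed result differs; compiling a markdown document always corresponds to the first case.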

Example: RStudio markdown

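Each code cell declares its language in the chunk header, {r} or {python}.

The special objects r and py can be used to access the “other side”: in a python chunk, r.x refers to the R object x; in an R chunk, py$x refers to the Python object x.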

Example: Jupyter notebook

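Code is declared as R by starting a line with %R (single line) or a cell with %%R (multiple lines) (details). The magics become available after running %load_ext rpy2.ipython; -i lists Python variables to pass into R and -o lists R variables to return to Python: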
%R [-i INPUT] [-o OUTPUT] [...] [code [code ...]]

Files and Exercises

  • day1_python_and_R/01_introduction.html
    these slides
  • day1_python_and_R/02_exercise_rstudio_problem-based.html
    our exercises (RStudio)
  • day1_python_and_R/03_exercise_rstudio.html
    [optional] self-study RStudio
  • day1_python_and_R/04_exercise_jupyter.html
    [optional] self-study Jupyter

Thank you!