2024 Reproducible work in Data Science (X. de Pedro)
"Data Science. Applications to Biology and Medicine with Python and R", at IL3 - University of Barcelona. April 10th, 2024 (16-19:15h).
1. Introduction - the problems (i)
Obsolete Devices storing code & data | --> | Ease copying to new devices (legally also: copyleft, ...) + online repositories |
1.1. The problems (ii)
Software obsolescence and incompatible dependency versions | --> | Adapt to code evolution: controlling package versions (renv) and using a VCS (git, bazaar, svn...). VCS = Version Control Systems |
1.2. The problems (iii)
Centralization (such as the Subversion VCS, svn) may increase efficiency, but it also decreases resilience ("shit happens") | --> | From centralized VCS (such as svn) to decentralized VCS (such as git). VCS = Version Control Systems |
1.3. The problems (iv)
Package versions at system level conflicting with package versions at project level | --> | Package versions per project environment (renv) |
1.4. The problems (v)
Sometimes a project was developed with one major version of a programming language (R 3.x, Python 2.x), while another project on the same server requires a different major version (R 4.x, Python 3.x) | --> | R case: from RStudio Server to Posit Workbench (formerly RStudio Server Pro), where you can choose the R version per project. Python: several approaches (conda, PyCharm, ...): see this as an example. |
1.5. The problems (vi)
Changes in the operating system itself (32-bit systems no longer supported, discontinued Linux distros, ...) | --> | Virtual Machines or Containers (OVA, KVM, LXC, Docker, Pod...): you can choose the OS version per container within the same server |
2. Enemies of reproducibility & adaptability
Enemies of reproducibility and adaptability (in levels): Changes / Evolution / Versions!
- Operating system and its dependencies (and their versions)
- Programming language (and its version)
- Specific packages (and versions) as dependencies for your work project
- Versions of your own code (algorithm and parameter variations, etc.): lacking a versioning system
- Readability and tidiness of your own code / routines / scripts
- Lack of documentation/help resources + steep learning curve to use it or adapt it to your context or infrastructure
3. Reproducibility & Adaptability
How to avoid reproducibility & adaptability enemies (in R & Python for Data Science):
(Level 1) Versions in OS repos & critical dependencies: curl, ssl, GDAL, Java, cpp, V8... | Virtual Machines or Containers (VBox, KVM, LXC, Docker, Pod...) |
(Level 2) Versions in programming language: Python 2.x vs 3.x, R 3.x vs 4.x, ... | Python: Conda. R: RStudio/Posit Workbench. General (in Linux clusters): software modules. |
(Level 3) Versions in specific packages | Python: .env, poetry. R: renv |
(Level 4) Versions in your own scripts | Decentralized VCS: Git (Gitlab, Github, ...), Bazaar (Launchpad), ... Centralized VCS: CVS, SVN (Sourceforge, ...), ... VCS = Version Control System |
(Level 5) Tidy script content and organization | Literate coding (scripting & coding) / analysis. R: RStudio Notebooks with modern R (Tidyverse), VS Notebooks, G-Colab, ... Python: Jupyter Notebooks, RStudio Notebooks, VS Notebooks, G-Colab, ... (Quarto Markdown and rendering for both and more) |
(Level 6) Help to lower the learning curve | Documentation, code vignettes, examples, tutorials, learning material (learnr), books (bookdown)... |
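Levels 1 and 2 can at least be recorded next to your results, even before any container is involved. The following is a minimal Python sketch using only the standard library; the function name describe_environment is made up for this illustration, and R users get comparable information from sessionInfo().

```python
import platform


def describe_environment():
    """Return OS (level 1) and interpreter (level 2) details as a dict of strings."""
    return {
        "os": platform.system(),                 # e.g. "Linux", "Darwin", "Windows"
        "os_release": platform.release(),        # kernel / OS release string
        "machine": platform.machine(),           # e.g. "x86_64"
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
    }


if __name__ == "__main__":
    # Print one "key: value" line per recorded item.
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Saving this output together with an analysis does not make it reproducible by itself, but it tells a future reader which OS and interpreter the results came from.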
4. Reproducibility & Adaptability - Example in Posit Cloud
Example in https://posit.cloud (formerly RStudio Cloud):
- Level 1: A Container with a specific linux distro (e.g. Ubuntu Linux 20.04 Focal LTS) per project.
- Level 2: RStudio/Posit Workbench (which allows choosing R version per project)
- Level 3: renv for your R package collection (and specific versions) in your project
- Level 4: git or svn for your scripts in your project
- Level 5: YOU (Tidyverse is your friend)
- Level 6: YOU (+ helpers: roxygen2, blogdown, learnr, bookdown, ...)
4.1. Level 1: Virtual Machines or Containers
From: https://kubernetes.io/docs/concepts/overview/
4.2. Level 2: RStudio-Posit Workbench
4.3. Level 3: renv - for packages
Version control in work "environments"
4.3.1. Virtual environments in R with renv
4.3.2. From utils::sessionInfo() to renv::snapshot() + renv.lock
4.3.3. "Happy path"
For a reproducible environment

First time, when setting up the project:

cd project_folder
git init
R [open the RStudio project]
renv::init()      # to initialize renv in your code project
renv::snapshot()  # to take a snapshot "picture" of the list of R packages used within the whole R project and their respective package versions
q()
git commit ...
git push

Later, or on another machine:

cd project_folder
git clone/git pull ...
R [open the same RStudio project]
renv::status()    # for a report on which steps are suggested for you to follow
renv::restore()   # to restore the package library (with the required package versions) for this project
[continue working in/developing your code]
renv::snapshot()  # to take a new snapshot "picture" (in case packages were added, or package versions in use are now newer or older ;-) )
q()
git commit ...
git push
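For readers on the Python side of the course, the snapshot idea can be illustrated with the standard library alone. This is only a sketch of the concept: in practice tools such as pip freeze or poetry do this job, and the function name snapshot_packages is hypothetical.

```python
from importlib import metadata


def snapshot_packages():
    """Return a sorted list of 'name==version' pins for the installed distributions."""
    pins = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        name, version = meta.get("Name"), meta.get("Version")
        if name and version:  # skip distributions with missing metadata
            pins.add(f"{name}=={version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    # One pin per line, similar in spirit to a lockfile.
    print("\n".join(snapshot_packages()))
```

As with renv.lock, the resulting list is only useful if it is committed to version control together with the code it describes.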
4.3.4. Infrastructure
Projects with renv write and use these files in order to work:
File | Use |
.Rprofile | Used to activate renv for new R sessions launched in the project. |
renv.lock | The lockfile, describing the state of your project’s library at some point in time. |
renv/activate.R | The activation script run by the project .Rprofile. |
renv/library | The private project library. |
renv/settings.dcf | Project settings; see ?settings for more details. |
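The lockfile is plain JSON, so it can be inspected from any language. A minimal Python sketch, assuming the usual layout with a top-level "R" entry and a "Packages" entry; the sample content and its version numbers are made up for illustration, and summarize_lockfile is a hypothetical helper name.

```python
import json

# Illustrative lockfile content (not taken from a real project).
SAMPLE_LOCKFILE = """
{
  "R": {
    "Version": "4.2.3",
    "Repositories": [{"Name": "CRAN", "URL": "https://cloud.r-project.org"}]
  },
  "Packages": {
    "dplyr": {"Package": "dplyr", "Version": "1.1.4", "Source": "Repository", "Repository": "CRAN"},
    "renv": {"Package": "renv", "Version": "1.0.7", "Source": "Repository", "Repository": "CRAN"}
  }
}
"""


def summarize_lockfile(text):
    """Return (R version, {package name: version}) from lockfile JSON text."""
    lock = json.loads(text)
    r_version = lock.get("R", {}).get("Version")
    packages = {
        name: record.get("Version")
        for name, record in lock.get("Packages", {}).items()
    }
    return r_version, packages


if __name__ == "__main__":
    r_version, packages = summarize_lockfile(SAMPLE_LOCKFILE)
    print("R", r_version)
    for name, version in sorted(packages.items()):
        print(name, version)
```

A quick look like this is handy for checking which R version a lockfile was recorded with before trying to restore it.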
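By default, renv uses a global package cache (on disk, shared across projects) here:
Platform | Location |
Linux | ~/.local/share/renv |
macOS | ~/Library/Application Support/renv |
Windows | %LOCALAPPDATA%/renv |
The default location has changed between renv releases, so run renv::paths$cache() to see the path your installation actually uses.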
4.3.5. Advanced use
renv::install("packagename@0.1")  # to install an old version of a package (also useful for packages discontinued in CRAN!). See the available version numbers at https://cran.r-project.org/src/contrib/Archive/yourpackage/
renv::record("packagename@0.1")   # to save in renv.lock the specific version you need for this package
renv::deactivate()                # to temporarily deactivate renv in your project
renv::activate()                  # to reactivate renv in your project
renv::equip()                     # for special installations in MS Windows
vignette("docker", package = "renv")         # for combined use with Docker
vignette("collaborating", package = "renv")  # to improve collaborative use in work teams
And much more. See:
- https://rstudio.github.io/renv/articles/renv.html
- https://solutions.posit.co/envs-pkgs/environments/
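The CRAN archive follows a predictable URL pattern, which helps when hunting for an old version number. A small Python sketch that only builds the URLs (it does not check that the version exists); the function names are hypothetical, and note that the current release of a package lives under src/contrib rather than in the Archive folder.

```python
CRAN_ARCHIVE = "https://cran.r-project.org/src/contrib/Archive"


def cran_archive_index(package):
    """URL of the folder listing all archived versions of a package."""
    return f"{CRAN_ARCHIVE}/{package}/"


def cran_archive_url(package, version):
    """URL of the source tarball for one archived version of a package."""
    return f"{CRAN_ARCHIVE}/{package}/{package}_{version}.tar.gz"


if __name__ == "__main__":
    print(cran_archive_index("yourpackage"))
    print(cran_archive_url("yourpackage", "0.1"))
```

Browsing the index URL first shows which version numbers are actually available before pinning one in renv.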
4.4. Level 4: git - for code
See: https://gitlab.com/radup/curs-r-introduccio/ > Folder "codi" > 10.compartir.via.git.Rmd (or .pdf)
See also my own git recipes over some years, github cheatsheet, ...: https://seeds4c.org/git
5. More information
renv | workflowr | learnr | roxygen2 | Tidyverse |
6. Hands-on practical exercise
6.1. Register a free account at Posit Cloud
You can do so at https://posit.cloud
You will need to click on a link sent to your email inbox to validate your account.
Once done, you'll see something like:
6.2. Create a Project from git repository
Log in to Posit Cloud and click on New Project > New Project from Git Repository
6.2.1. Visit gitlab to get clone url
Visit this code project in gitlab to get the project clone url:
https://gitlab.com/xavidp/datascience2023
6.2.2. Create project from git repo
Paste it in the Posit Cloud popup window and click on OK:
6.3. Choose R 3.6.x & Run Rmd
6.3.1. Install dependencies also
6.3.2. Running Rmd will perform GNU/Linux system commands also
GNU/Linux system commands are usually much more efficient in memory and CPU use than loading the whole file into R.
This helps to prevent RAM bottlenecks with just 1 GB of RAM on the Posit Cloud Free plan
(while the csv file from the reduced meteorological dataset is already 0.5 GB).
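The underlying idea is to stream the file instead of holding it all in memory. The course Rmd relies on GNU/Linux commands for this; the following Python sketch is only an analogue of the same principle, with a hypothetical function name, reading one row at a time so memory use stays roughly constant regardless of file size.

```python
import csv


def count_rows_streaming(path):
    """Return (header, number of data rows) without loading the file into memory."""
    count = 0
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)  # first line holds the column names
        for _row in reader:          # rows are read and discarded one by one
            count += 1
    return header, count
```

The same loop could filter or aggregate rows on the fly, which is what keeps a 0.5 GB file workable on a machine with 1 GB of RAM.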
6.3.3. Display raw data
Variables are given as numeric codes (not easily readable by humans in a semantic way). We lack variable names (or at least acronyms) for readability.
6.3.4. Transform in tidy way (i)
6.3.5. Transform in tidy way (ii) - result
6.3.6. Last code chunks
6.4. Choose R 4.2.x & Run Rmd again
Repeat the previous steps but in an R 4.2.x environment: install the dependent R packages again (new environment, but still installing from CRAN repos). renv is still not needed in this case (lucky you!).
So far, so good.
6.5. Choose R 3.4.x & Run Rmd
Now let's touch on some issues with R package versions in an R 3.4.x environment.
In this case, the solution involves finding a valid previous version of each conflicting R package and installing those specific versions.
6.5.1. Use renv.lock recipe (i)
Let's get renv to the rescue. Once somebody has solved these issues and found a valid recipe of package versions for this environment, running renv::snapshot() produces a file ./renv.lock in the project root folder.
I did this already, and I uploaded the resulting renv.lock file to the manually created ./recipes/ folder in this project as a backup for you (as renv_R344.lock).
You can now copy the ./recipes/renv_R344.lock file provided in the project to ./renv.lock in the project root folder, so that renv is able to use it.
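The copy step can be done from the Files pane, with file.copy() in R, or from a script. A minimal Python sketch of the same step, assuming the folder layout described above (a recipes/ folder inside the project root); install_lockfile is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def install_lockfile(project_root, recipe_name="renv_R344.lock"):
    """Copy recipes/<recipe_name> to renv.lock in the project root and return the new path."""
    root = Path(project_root)
    source = root / "recipes" / recipe_name
    target = root / "renv.lock"
    if not source.is_file():
        raise FileNotFoundError(f"Recipe not found: {source}")
    shutil.copyfile(source, target)  # overwrites an existing renv.lock
    return target
```

Because an existing renv.lock is overwritten, it is worth committing the current state to git before swapping recipes.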
6.5.2. Use renv.lock recipe (ii)
Run renv::init() in the R console.
Choose to restore the package versions from renv.lock:
"1. Restore the project from the lockfile"
6.5.3. Use renv.lock recipe (iii)
You will be ready to go with minimum human intervention.
All R packages will be installed in the background at their required versions, following the recipe that someone already created for R 3.4.4.
The key file is the renv.lock file.
6.5.4. Use renv.lock recipe (iv) - finished
6.6. Additional info
Project (Container) goes to sleep on inactivity
Thanks
Xavier de Pedro Puente, Ph.D. - xavier.depedro@seeds4c.org
Slides available at:
https://seeds4c.org/reproduciblework2024
Unless elsewhere noted, contents of this web site are released under a Creative Commons license.