2024 Reproducible work in Data Science (X. de Pedro) | |
"Data Science. Applications to Biology and Medicine with Python and R", at IL3 - University of Barcelona. April 10th, 2024 (16-19:15h). Content at https://seeds4c.org/reproduciblework2024
|
1. Introduction - the problems (i) | ||||||
|
1.1. The problems (ii) | ||||||
|
1.2. The problems (iii) | ||||||
|
1.3. The problems (iv) | ||||||
|
1.4. The problems (v) | |||
|
1.5. The problems (vi) | ||||||
|
2. Enemies of reproducibility & adaptability | |
Enemies of reproducibility and adaptability (in levels): Changes / Evolution / Versions!
|
3. Reproducibility & Adaptability | ||||||||||||||
How to avoid reproducibility & adaptability enemies (in R & Python for Data Science):
|
4. Reproducibility & Adaptability - Example in Posit Cloud | |
Example in https://posit.cloud (formerly RStudio Cloud):
|
4.1. Level 1: Virtual Machines or Containers | |
4.2. Level 2: RStudio-Posit Workbench | |
|
4.3. Level 3: renv - for packages | |||
Version control in work "environments"
|
4.3.1. Virtual environments in R with renv | |
|
4.3.2. From utils::sessionInfo() to renv::snapshot() + renv.lock also fails | ||
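For the Python side of the course, a rough analogue of moving from a printed session summary to a machine-readable record could look like the sketch below. It is only an illustration under stated assumptions: the function name `snapshot_environment` and the output layout are hypothetical, and this is not what renv does internally; it merely records the interpreter version and the installed package versions as JSON.

```python
import json
import platform
from importlib import metadata


def snapshot_environment():
    """Collect the interpreter version and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that lack a name
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Print the record as JSON so it can be saved next to the code.
    print(json.dumps(snapshot_environment(), indent=2))
```

Unlike a summary that is only read by a person, a record like this can be stored under version control and compared between machines.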
|
4.3.3. "Happy path" | |
For a reproducible environment
|
4.3.4. Infrastructure | ||||||||||||||||||||||||||||||
Projects with
|
4.3.5. Advanced use | |
|
4.4. Level 4: git - for code | |
See: https://gitlab.com/radup/curs-r-introduccio/ > Folder "codi" > 10.compartir.via.git.Rmd (or .pdf)
|
5. More information | ||
|
6. Hands-on practical exercise | |
|
6.1. Register a free account at Posit Cloud | |
You can do so at:
Once done, you'll see something like:
|
6.2. Create a Project from git repository | |
|
6.2.1. Visit GitLab to get the clone URL | |
Visit this code project on GitLab to get the project's clone URL:
|
6.2.2. Create project from git repo | |
|
6.3. Choose R 3.6.x & Run Rmd | |
|
6.3.1. Also install dependencies | |
|
6.3.2. Running the Rmd also runs GNU/Linux system commands | |
GNU/Linux system commands are usually much more efficient in memory and CPU use
(note that the CSV file of the reduced meteorological dataset is already 0.5 GB).
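As an illustration of delegating work on a large file to a system command, here is a minimal Python sketch that counts the lines of a CSV file with `wc -l` instead of loading the file into memory. The function name `count_lines` is hypothetical, and this is not necessarily the command used in the course Rmd; a pure-Python fallback is included for systems where `wc` is not available.

```python
import shutil
import subprocess


def count_lines(path):
    """Count lines in a text file, preferring the system `wc` tool."""
    if shutil.which("wc"):
        # Feed the file on stdin so that `wc -l` prints only the number.
        with open(path, "rb") as handle:
            result = subprocess.run(
                ["wc", "-l"], stdin=handle, capture_output=True, check=True
            )
        return int(result.stdout.split()[0])
    # Fallback without `wc`: stream the file line by line in Python.
    # Note: `wc -l` counts newline characters, so the two results can
    # differ by one when the last line has no trailing newline.
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)
```

Either branch reads the file as a stream, so memory use stays small even for a file of several hundred megabytes.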
|
6.3.3. Display raw data | |
Variables are given as numeric codes, which humans cannot easily read in a semantic way. Some variable names (or at least acronyms) are missing, which hurts readability.
|
6.3.4. Transform in tidy way (i) | |
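One way to picture a tidy transformation of coded variables is the Python sketch below: numeric variable codes are mapped to readable names, and the records are pivoted so that each row holds one station and date, with one column per variable. Everything in it is an assumption made for illustration: the code book `VARIABLE_NAMES`, the record layout and the sample values are hypothetical and do not come from the course dataset.

```python
from collections import defaultdict

# Hypothetical code book: the real codes and names are not given in the slides.
VARIABLE_NAMES = {1: "temp_c", 2: "rel_humidity_pct"}


def to_tidy(records, names=VARIABLE_NAMES):
    """Pivot (station, date, code, value) records into one row per station and date."""
    rows = defaultdict(dict)
    for station, date, code, value in records:
        # Keep unknown codes visible instead of dropping them silently.
        column = names.get(code, f"var_{code}")
        rows[(station, date)][column] = value
    return [
        {"station": station, "date": date, **values}
        for (station, date), values in sorted(rows.items())
    ]


# Made-up sample records, including one code missing from the code book.
raw = [
    ("S1", "2024-01-01", 1, 7.5),
    ("S1", "2024-01-01", 2, 81.0),
    ("S2", "2024-01-01", 1, 5.0),
    ("S2", "2024-01-01", 9, 3.2),
]
for row in to_tidy(raw):
    print(row)
```

Codes that are absent from the code book end up in columns such as `var_9`, which makes gaps in the variable names easy to spot.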
|
6.3.5. Transform in tidy way (ii) - result | |
|
6.3.6. Last code chunks | |
|
6.4. Choose R 4.2.x & Run Rmd again | |
Repeat the previous steps, but in an R 4.2.x environment: install the required R packages again (new environment, but still installing from CRAN repos). renv is still not needed in this case (lucky you!). So far, so good.
|
6.5. Choose R 3.4.x & Run Rmd | ||
Now let's look at some issues with R package versions in an R 3.4.x environment
|
6.5.1. Use renv.lock recipe (i) | |
Let's get […] I did this already, and I uploaded the resulting renv.lock file to the manually created ./recipes/ folder in this project as a backup for you (as renv_R344.lock). You can now copy the provided ./recipes/renv_R344.lock file to ./renv.lock in the project root folder, for […]
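The copy step can be done by hand in the file browser or a terminal. As a scripted alternative, a minimal Python sketch is shown here; the helper name `use_recipe` and its parameters are hypothetical, while the paths (`recipes/renv_R344.lock` and `renv.lock` in the project root) are the ones named above.

```python
import shutil
from pathlib import Path


def use_recipe(project_root, recipe_name="renv_R344.lock"):
    """Copy recipes/<recipe_name> to renv.lock in the project root."""
    root = Path(project_root)
    source = root / "recipes" / recipe_name
    target = root / "renv.lock"
    if not source.is_file():
        raise FileNotFoundError(f"Recipe not found: {source}")
    # copyfile overwrites an existing renv.lock, so the recipe always wins.
    shutil.copyfile(source, target)
    return target
```

Calling `use_recipe(".")` from the project root would place the R 3.4.4 recipe where renv looks for its lockfile.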
|
6.5.2. Use renv.lock recipe (ii) | |
Run […] Choose to restore the package versions recorded in renv.lock:
|
6.5.3. Use renv.lock recipe (iii) | |
You will be ready to go with minimal human intervention. All R packages will be installed in the background at their required versions, following the recipe that someone already created for R 3.4.4. The key file is renv.lock.
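Since renv.lock is a JSON file, the recipe can be inspected before restoring it. The sketch below, in Python, lists the R version and the pinned package versions; the function name `pinned_versions` is hypothetical, and the sample lockfile is a tiny hand-written stand-in with a made-up package, not the course's renv_R344.lock.

```python
import json


def pinned_versions(lockfile_text):
    """Return the R version and a {package: version} map from renv.lock text."""
    lock = json.loads(lockfile_text)
    r_version = lock.get("R", {}).get("Version")
    packages = {
        name: record.get("Version")
        for name, record in lock.get("Packages", {}).items()
    }
    return r_version, packages


# Tiny hand-written lockfile for illustration; the package is made up.
example = """
{
  "R": {"Version": "3.4.4",
        "Repositories": [{"Name": "CRAN", "URL": "https://cloud.r-project.org"}]},
  "Packages": {
    "examplepkg": {"Package": "examplepkg", "Version": "1.0.0",
                   "Source": "Repository", "Repository": "CRAN"}
  }
}
"""
r_version, packages = pinned_versions(example)
print(r_version, packages)
```

To inspect a real recipe, the text of the lockfile can be read from disk and passed to the same function.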
|
6.5.4. Use renv.lock recipe (iv) - finished | |
|
6.6. Additional info | |
Project (Container) goes to sleep on inactivity
|
Thanks | |
Xavier de Pedro Puente, Ph.D. - xavier.depedro@seeds4c.org
|