2023 Reproducible work in Data Science (X. de Pedro) | |
|
"Data Science. Applications to Biology and Medicine with Python and R", at IL3 - University of Barcelona. Feb 27th, 2023 (16-19:15h). Content at https://seeds4c.org/reproduciblework2023
| |
1. Introduction - the problems (i) | ||||||
|
| ||||||
1.1. The problems (ii) | ||||||
|
| ||||||
1.2. The problems (iii) | ||||||
|
| ||||||
1.3. The problems (iv) | ||||||
|
| ||||||
1.4. The problems (v) | |||
|
| |||
1.5. The problems (vi) | ||||||
|
| ||||||
2. Enemies of reproducibility & adaptability | |
|
Enemies of reproducibility and adaptability (in levels): Changes / Evolution / Versions!
| |
3. Reproducibility & Adaptability | ||||||||||||||
|
How to avoid reproducibility & adaptability enemies (in R & Python for Data Science):
| ||||||||||||||
4. Reproducibility & Adaptability - Example in Posit Cloud | |
|
Example in https://posit.cloud (formerly RStudio Cloud):
| |
4.1. Level 1: Virtual Machines or Containers | |
4.2. Level 2: RStudio-Posit Workbench | |
|
| |
4.3. Level 3: renv - for packages | |||
|
Version control in work "environments"
| |||
4.3.1. Virtual environments in R with renv | |
|
| |
4.3.2. From utils::sessionInfo() to renv::snapshot() + renv.lock also fails | ||
|
| ||
4.3.3. "Happy path" | |
|
For a reproducible environment:
Commands in terminal - Computer 1
Commands in terminal - Computer 2
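The two-computer workflow can be sketched as shell functions that call renv through Rscript. This is a minimal sketch, assuming R and the renv package are installed on both computers and that the commands are run from the project root folder. The function names and the RSCRIPT variable are hypothetical, introduced only for illustration.

```shell
# Sketch only: assumes R and the renv package are available on both computers.
# RSCRIPT is a hypothetical override for the R launcher (defaults to Rscript).
RSCRIPT="${RSCRIPT:-Rscript}"

# Computer 1: set up renv in the project, then record the installed
# package versions in renv.lock.
record_environment() {
    "$RSCRIPT" -e 'renv::init()' &&
    "$RSCRIPT" -e 'renv::snapshot()'
}

# Computer 2: reinstall the package versions listed in renv.lock.
restore_environment() {
    "$RSCRIPT" -e 'renv::restore(prompt = FALSE)'
}
```

On Computer 1, call record_environment and share the project including renv.lock. On Computer 2, call restore_environment from the project folder.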
| |
4.3.4. Infrastructure | ||||||||||||||||||||||||||||||
|
Projects with
| ||||||||||||||||||||||||||||||
4.3.5. Advanced use | |
| |
4.4. Level 4: git - for code | |
|
See: https://gitlab.com/radup/curs-r-introduccio/ > Folder "codi" > 10.compartir.via.git.Rmd (or .pdf)
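For a first idea of what versioning code with git involves, here is a minimal sketch, assuming git is installed and your name and email are already configured. start_repo is a hypothetical helper and is not taken from the linked course material; the file names in the usage note are illustrative.

```shell
# Put a project folder under version control and record a first commit.
# Hypothetical helper: $1 is the project folder, the remaining arguments
# are the files to track (paths relative to that folder).
start_repo() (
    cd "$1" || exit 1
    shift
    git init --quiet &&
    git add -- "$@" &&
    git commit --quiet -m "First version of the analysis"
)
```

For example, `start_repo . analysis.Rmd renv.lock` would track both the code and the package recipe, so that levels 3 and 4 travel together.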
| |
5. More information | ||
| ||
6. Hands-on practical exercise | |
|
| |
6.1. Register a free account at Posit Cloud | |
|
You can do so at:
Once done, you'll see something like:
| |
6.2. Create a Project from git repository | |
|
| |
6.2.1. Visit gitlab to get clone url | |
|
Visit this code project in gitlab to get the project clone url:
| |
6.2.2. Create project from git repo | |
|
| |
6.3. Choose R 3.6.x & Run Rmd | |
|
| |
6.3.1. Install dependencies also | |
|
| |
6.3.2. Running the Rmd will also run GNU/Linux system commands | |
|
GNU/Linux system commands will usually need much less memory and CPU than the equivalent R code
(the CSV file from the reduced meteorological dataset is already 0.5 GB).
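As an illustration of the kind of system commands meant here, the following sketch inspects and filters a CSV file line by line, without loading it into memory. It is not the code from the Rmd. The function names are hypothetical, and it assumes a comma-separated file with a header row and a code in the first column.

```shell
# Peek at a large CSV without loading it into memory.
# $1 is the path to the CSV file.
preview_csv() {
    csv="$1"
    echo "Rows (including header): $(wc -l < "$csv" | tr -d ' ')"
    head -n 3 "$csv"    # header plus the first two records
}

# Keep the header and only the rows whose first field equals a given code.
# $1 is the CSV file, $2 is the code to keep (assumed layout, see above).
filter_rows() {
    awk -F, -v code="$2" 'NR == 1 || $1 == code' "$1"
}
```

Both helpers stream the file, so their memory use stays small however large the CSV is.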
| |
6.3.3. Display raw data | |
|
Variables are given as numeric codes, which humans cannot easily read in a semantic way. We lack variable names (or at least acronyms) for readability.
| |
6.3.4. Transform in tidy way (i) | |
|
| |
6.3.5. Transform in tidy way (ii) - result | |
|
| |
6.3.6. Last code chunks | |
|
| |
6.4. Choose R 4.2.x & Run Rmd again | |
|
Repeat the previous steps in an R 4.2.x environment: install the required R packages again (a new environment, but still installing from CRAN repos). renv is still not needed in this case (lucky you!). So far, so good.
| |
6.5. Choose R 3.4.x & Run Rmd | ||
|
Now let's run into some issues with R package versions in an R 3.4.x environment.
| ||
6.5.1. Use renv.lock recipe (i) | |
|
I did this already, and I uploaded the resulting renv.lock file to the manually created ./recipes/ folder in this project as a backup for you (as renv_R344.lock). You can now copy the ./recipes/renv_R344.lock file provided in the project to ./renv.lock in the project root folder.
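The copy step can be done from the terminal. A minimal sketch, assuming it is run against the project root folder and that ./recipes/renv_R344.lock exists as described; use_recipe is a hypothetical helper name.

```shell
# Copy the saved lock file into the project root as renv.lock.
# $1 is the project root folder (defaults to the current folder).
use_recipe() {
    project_root="${1:-.}"
    cp "$project_root/recipes/renv_R344.lock" "$project_root/renv.lock"
}
```

Calling `use_recipe` with no argument from the project root is equivalent to `cp ./recipes/renv_R344.lock ./renv.lock`.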
| |
6.5.2. Use renv.lock recipe (ii) | |
|
Run renv in the R console and choose the option to restore the package versions from renv.lock:
| |
6.5.3. Use renv.lock recipe (iii) | |
|
You will be ready to go with minimal human intervention. All R packages will be installed in the background at their required versions, following the recipe that someone already created for R 3.4.4. The key file is renv.lock.
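Since renv.lock is a plain JSON text file, you can check what a recipe pins before restoring it. The sketch below assumes the usual pretty-printed layout written by renv, with the "R" section first and one "Package" entry per package; lock_summary is a hypothetical helper.

```shell
# Print the R version and the number of packages pinned in a lock file.
# $1 is the lock file (defaults to renv.lock in the current folder).
lock_summary() {
    lock="${1:-renv.lock}"
    # Assumption: the first "Version" entry belongs to the "R" section.
    r_version=$(grep -m 1 '"Version"' "$lock" | sed 's/.*: *"\(.*\)".*/\1/')
    # Each package record carries exactly one "Package" field.
    n_packages=$(grep -c '"Package"' "$lock")
    echo "R $r_version, $n_packages packages pinned"
}
```

Running it on the recipe before copying it is a quick way to confirm that the lock file matches the R version chosen for the project.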
| |
6.5.4. Use renv.lock recipe (iv) - finished | |
|
| |
6.6. Additional info | |
|
Project (Container) goes to sleep on inactivity
| |
Thanks | |
|
Xavier de Pedro Puente, Ph.D. - xavier.depedro@seeds4c.org
| |