Packaging and containerizing of bioinformatics software: advances, challenges, and opportunities

Sharon Waymost; Serghei Mangul; Neha Rajkumar; Ram Ayyala; Heng Li; Nathan LaPierre; André Santos; Titus Brown; Casey S. Greene; Brendan Lawlor; Qiyang Hu; Jaqueline J. Brito; Thiago Mosqueiro; Benjamin J. Heil

The rapid advancement of omics and sequencing technologies has led to an accelerated growth of genomic data. Driven by this data, the number of available bioinformatics software tools has grown in step, and many of these tools contain novel and diverse algorithms of particular interest to biomedical researchers[1,2]. Biomedical researchers wanting to use this software often have access to high-performance clusters (“clusters”), but lack the operating system-level permissions and advanced skills required to install and run the software. None of the necessary skills are currently included in the traditional life science curriculum at major universities. Usability is further hindered by the absence of any standardization and the wide variety of software development tools employed. This is directly observable in the many academic software tools that lack a user-friendly interface[5].

As the dependence of biomedical researchers on computational software continues to increase, software developers need to consider more user-friendly distribution and installation methods. Modern software distribution, having already faced user-friendliness issues, is becoming more reliant on platforms such as package managers[6,7] and containers[8]. Both promise to simplify software development while increasing the usability and reproducibility of biomedical research[8,9].

Package managers first appeared nearly thirty years ago as software developers sought to streamline the entire installation process. To install via a package manager, the user need only specify the desired software, called a “package”; the download, installation, configuration, and dependency handling are all performed by the package manager. Many operating systems have built-in package managers (e.g. APT), which may not be available to cluster users, while other package managers must be downloaded and installed by the user. Some package managers (e.g. Conda[6]) are programming language agnostic, while others are designed for a particular language (e.g. pip[10]).

A more recent software distribution solution is containerization. The end-user downloads a container “image” that includes the software, dependencies, and anything else necessary to run the software. Though the imaged software is typically not itself installed, most images do require the installation of containerization runtime software. When an image is run, the runtime software creates a consistent, isolated sandbox environment, then runs the imaged software inside the sandbox[11]. This sandbox design makes the images both highly portable (compatible with different computers) and easily shareable (transferable between different computers), which has already led to wide adoption in bioinformatics[7,12].

In addition to the ease of installation and use offered by both package managers and containers, such platforms must also meet the biomedical community’s need for compatibility with high-performance clusters. However, the relative performance of these package managers and containers remains unknown. Our review summarizes developing practices across the most common package managers and containers, while discussing the challenges, advantages, and limitations of using them from both the developer and the user perspectives. By surveying the available package managers and container software, we compile a comprehensive list of the attributes of each one, so that members of the community can make informed decisions about which package manager or container best suits their project. We also propose principles that can make packaging and containerizing of bioinformatics software more sustainable and reproducible, ultimately increasing the usability of bioinformatics software.

Discussion

Existing problems with software distribution and installation

Installing bioinformatics research software is typically a multi-step process, starting with the end-user locating and downloading the software. Next is the actual installation, during which the end-user must determine which dependencies are missing and resolve them by installing the required software[5]. Even end-users who are familiar with this process are typically installing on high-performance clusters, where user-level permissions prevent them from following standard installation procedures.

In addition to being difficult, manual installation of bioinformatics software can lead to irreproducible findings. Software versions change over time, sometimes leading to different runtime behavior. If an analysis depends on recently introduced features of a tool, users trying to reproduce the work with older installations will fail. Likewise, even existing functionality can change over time. For example, the input function in Python 2 evaluates whatever the user types as Python code, while in Python 3 it returns the typed text as a string [???]. To prevent such issues without using a package manager, it is necessary to manually keep track of the versions of all dependencies.
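As an illustration of what manually tracking versions can involve, the following minimal Python sketch records the installed version of each dependency of an analysis. The function name snapshot_versions and the example package names are hypothetical, and the sketch assumes the dependencies are Python packages visible to the standard library's importlib.metadata module (Python 3.8 or later); dependencies written in other languages would need a different mechanism.

```python
from importlib import metadata


def snapshot_versions(package_names):
    """Return a mapping of package name to installed version (None if missing)."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly rather than silently skipping them.
            versions[name] = None
    return versions


# Hypothetical dependency list for an analysis.
for name, version in snapshot_versions(["numpy", "pandas"]).items():
    print(f"{name}=={version}" if version else f"{name} (not installed)")
```

Saving this output alongside the results of an analysis gives later users a record of the environment they would need to recreate.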

Definitions and explanations of distribution system types

Package Managers

Package managers appeared nearly thirty years ago as software developers sought to automate and standardize the software installation process. The earliest package managers (e.g. APT and RPM) were built into Linux operating systems and lacked now common features like dependency resolution. Modern package managers are available for all major operating systems, with both integrated and user-installed options available. Some package managers are limited to software written in a specific programming language (e.g. pip distributes software written in Python), while others are limited to software with a particular purpose (e.g. Bioconda distributes omics software).

Every package manager uses its own package format, which follows the general structure shown in Figure 1.

The metadata gives the package manager the necessary information to install the packaged software.

To use a package manager, a user instructs the package manager to install software, then the package manager performs all the necessary steps to install the software as shown in Figure 2. Assuming the software is not already installed, the package manager retrieves the appropriate package from the repository, a server that stores all of the available packages. The package manager then uses the package’s metadata to determine what it needs to do to complete installation. For most packages, this includes verifying the installation of all dependencies, which are other pieces of software that must be installed prior to the initial software being installed.

The package manager verifies each dependency by confirming that the dependency is already installed or installing the dependency if it is missing. For missing dependencies, the package manager retrieves the dependency’s package from the repository and starts the installation procedure for that package. Once all the dependencies are installed, the initially requested software is installed. However, the package manager often goes through several iterations of this process because every dependency can have its own list of dependencies, in which case each of the dependency’s dependencies must be verified through the same process.
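The recursive verification described above can be sketched in a few lines of Python. This is an illustrative simplification rather than the algorithm of any particular package manager: the repository is modeled as a plain dictionary mapping each package name to its direct dependencies, the package names are hypothetical, and the sketch assumes there are no circular dependencies.

```python
def install_order(package, repository, installed=None, order=None):
    """Return packages in an order that installs every dependency first.

    `repository` maps a package name to the list of its direct dependencies,
    standing in for the metadata a real package manager would read.
    """
    if installed is None:
        installed = set()
    if order is None:
        order = []
    if package in installed:
        return order  # dependency already satisfied, nothing to do
    for dependency in repository[package]:
        # Each dependency may have dependencies of its own, so recurse.
        install_order(dependency, repository, installed, order)
    installed.add(package)
    order.append(package)
    return order


# Hypothetical repository: "aligner" needs "htslib" and "zlib"; "htslib" needs "zlib".
repository = {
    "aligner": ["htslib", "zlib"],
    "htslib": ["zlib"],
    "zlib": [],
}
print(install_order("aligner", repository))
```

Passing a non-empty `installed` set models a system where some dependencies are already present, in which case those packages are skipped.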

In most cases, a package manager is able to install all of the missing dependencies, but there are two relatively common problems that a package manager cannot overcome. The first is circular dependencies, which occur when software A has a dependency on software B and software B has a dependency on software A. A package manager cannot install A without installing B first, but B cannot be installed without installing A first. The second is conflicting versions, which occur when software A and software B have the same dependency, but require different versions. Package managers typically allow only one version of a piece of software to be installed at a time, preventing both A and B from having their dependencies met simultaneously.
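Both failure modes can be detected before any installation is attempted. The Python sketch below is a simplified illustration under stated assumptions: the function names are hypothetical, dependencies are given as plain dictionaries, and each requirement names one exact version, whereas real package managers usually work with version ranges.

```python
def find_cycle(repository, package, path=()):
    """Return a list describing a dependency cycle reachable from `package`, or None."""
    if package in path:
        # We have come back to a package we are still trying to install.
        return list(path[path.index(package):]) + [package]
    for dependency in repository.get(package, []):
        cycle = find_cycle(repository, dependency, path + (package,))
        if cycle:
            return cycle
    return None


def find_version_conflicts(requirements):
    """Return dependencies that are required at more than one version.

    `requirements` maps a package to {dependency: required_version}.
    """
    wanted = {}
    for package, dependencies in requirements.items():
        for dependency, version in dependencies.items():
            wanted.setdefault(dependency, {}).setdefault(version, []).append(package)
    return {dep: versions for dep, versions in wanted.items() if len(versions) > 1}


# Software A and B depend on each other.
print(find_cycle({"A": ["B"], "B": ["A"]}, "A"))
# Software A and B need different versions of the same (hypothetical) library.
print(find_version_conflicts({"A": {"lib": "1.0"}, "B": {"lib": "2.0"}}))
```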

Administrators often block users from installing software to prevent unwanted changes to both the operating system and existing software. This includes blocking access to package managers built into the operating system, as well as preventing users from installing their own.

Containerization and Virtualization

One approach to sharing software involves virtualization. Virtualization is a technique that allows the creation of instances of a software environment, often called ‘images’. These images present many benefits for software sharing: only a single image file needs to be downloaded by users, and these image files can have all necessary dependencies for the software pre-installed by the developer, thereby avoiding installation difficulties for users.

Traditional virtual machines emulate complete operating systems and are managed by software that runs directly on the host’s hardware, sometimes known as a ‘type 1 hypervisor’ [1]. Later, ‘type 2’ virtualization techniques were developed, running on top of the host operating system (OS) by translating all OS instructions within the guest image [1]. While these approaches excel in portability, they incur a significant CPU and memory overhead, prompting the development of a more recent approach called Operating System (OS)-Level Virtualization. OS-Level Virtualization runs on top of the same kernel as the host OS, allowing multiple isolated user spaces to be run in parallel and incurring lower CPU, memory, and networking overhead [1]. These user spaces, i.e. OS-level images being executed by users, are often called “containers”, but also go by a variety of other names. Examples of this technique include Docker [3], LXC [4], OpenVZ [5], Linux-VServer [6], and rkt [7].

Of these, Docker is perhaps the most widely used platform. A key aspect of the Docker approach is its modularity: images are stored in a public repository on Docker’s website, and it is easy to use an existing image as a “base” on which to build a new image. Docker images are defined by a human-readable, plaintext “Dockerfile” in which the developer specifies which existing Docker image, if any, to use as a base and what additional software libraries and dependencies should be installed. The Dockerfile also serves as easy-to-read documentation of the dependencies the software is built on [8]. Docker then builds an image based on the developer’s specifications, and this image can be uploaded to the Docker website.

Several limitations exist with Docker and other OS-level virtualization platforms. OS-level images are, as the name implies, limited to a specific host operating system [1]. Security concerns have been raised about the vulnerability of containers to compromise, denial of service, and privilege escalation attacks [2]. Deployment on high-performance computing clusters is a promising possibility but presents substantial engineering challenges having to do with maximizing resource utilization and allocating resources effectively [9].

Methods

5 Google Trends [10] was used to assess the relative average popularity of each package manager. Package managers were omitted when Google Trends did not have enough data to return a result. The method detailed in “How do you compare large numbers of items in Google Trends?” [11] was used to overcome Google Trends’ limitation of five search terms.
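The details of the cited method are not reproduced here. One common way to work around a five-term limit, which may differ from the cited post, is to include the same anchor term in every query and then rescale each batch so that the anchor has the same value everywhere. The Python sketch below illustrates that idea only; the function names, the choice of anchor, and the scores are hypothetical, and no Google Trends API is called.

```python
def batches_with_anchor(terms, anchor, batch_size=5):
    """Split `terms` into batches that each include the shared `anchor` term."""
    others = [t for t in terms if t != anchor]
    step = batch_size - 1  # one slot in every batch is reserved for the anchor
    return [[anchor] + others[i:i + step] for i in range(0, len(others), step)]


def rescale_to_anchor(batch_scores, anchor):
    """Merge per-batch scores onto one scale where the anchor equals 1.0.

    `batch_scores` is a list of {term: average_score} dicts, one per batch.
    """
    merged = {}
    for scores in batch_scores:
        anchor_score = scores[anchor]
        if anchor_score == 0:
            raise ValueError("anchor term has no measurable interest in this batch")
        for term, score in scores.items():
            merged[term] = score / anchor_score
    return merged


# Hypothetical example: six terms need two queries that share the anchor "a".
print(batches_with_anchor(["a", "b", "c", "d", "e", "f"], anchor="a"))
```

The rescaling only works when the anchor term has measurable interest in every batch, which is why a moderately popular term is usually a better anchor than a very rare one.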

6 For package managers introduced with a major publication, the number of citations for their publication was found via Google Scholar [12] and recorded. Package managers without a major publication were excluded.

7 Each package manager’s name was used as a search term on Google Scholar [12] and the number of results was recorded. Package manager names that returned results for things other than the package manager were omitted.

8 The initial list of all available software for each package manager was compiled by parsing its online repository with WebScraper [13] (AppImage [14], Bioconda [15], Bioconductor [16], BioContainers [???], CRAN [17], Docker [18], EasyBuild [19], Flatpak [20], GNU Guix [21], omicX [22], pip [23], Singularity [24], Snap [25], Spack [26], Vagrant [27], and Zero Install [28]) or by querying the package manager directly (APT, BioContainers, Conda, Homebrew). Each list was lower-cased and converted to a set with pandas [29] and NumPy [30], which removed differences in capitalization and duplicate entries. The number of entries in a package manager’s set was taken to be the total number of packages available for that package manager.
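The normalization in this step can be illustrated with a short Python sketch. For brevity it uses Python's built-in set type rather than the pandas and NumPy calls used in the actual analysis, the function name and package names are hypothetical, and the whitespace stripping is an added assumption about how scraped names might need cleaning.

```python
def to_package_set(names):
    """Lower-case every name and drop duplicates by building a set."""
    return {name.strip().lower() for name in names if name.strip()}


# Hypothetical scraped list with inconsistent capitalization and a duplicate.
scraped = ["SAMtools", "samtools", "BWA", "bowtie2 "]
packages = to_package_set(scraped)
print(len(packages))  # total number of distinct package names
```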

9 The sets obtained in 8 were compared with the set intersection function to find the number of commonalities between all pairs of package managers.
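A pairwise comparison of this kind can be sketched as follows. The function name and the example data are hypothetical, and the sketch assumes each package manager's packages are already available as a Python set of normalized names.

```python
from itertools import combinations


def pairwise_overlap(package_sets):
    """Count shared package names for every pair of distribution systems."""
    return {
        (a, b): len(package_sets[a] & package_sets[b])
        for a, b in combinations(sorted(package_sets), 2)
    }


# Hypothetical package sets for three systems.
example = {
    "x": {"samtools", "bwa"},
    "y": {"bwa", "bowtie2"},
    "z": {"numpy"},
}
print(pairwise_overlap(example))
```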

?? The sets obtained for Bioconda, Bioconductor, and omicX in 8 were compared with the set intersection function to find the number of commonalities between all combinations of the three package managers.
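Comparing all combinations of three sets means counting the names shared by each pair as well as those shared by all three. A minimal Python sketch, with a hypothetical function name and made-up package names standing in for the real sets:

```python
from itertools import combinations


def overlap_for_all_combinations(package_sets):
    """Count names shared by every combination of two or more systems."""
    names = sorted(package_sets)
    counts = {}
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            shared = set.intersection(*(package_sets[name] for name in combo))
            counts[combo] = len(shared)
    return counts


# Made-up contents; only the structure of the comparison is meaningful.
example = {
    "bioconda": {"a", "b", "c"},
    "bioconductor": {"b", "c", "d"},
    "omicx": {"c", "d", "e"},
}
for combo, count in overlap_for_all_combinations(example).items():
    print(combo, count)
```

The resulting counts are the quantities needed to fill in the regions of a three-set Venn diagram.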

11 The sets obtained for Bioconda and the non-omics package managers in 8 were compared with the set intersection function to find the number of commonalities between Bioconda and the non-omics package managers.

12 The sets obtained for Bioinformatics Tools and the non-omics package managers in 8 were compared with the set intersection function to find the number of commonalities between Bioinformatics Tools and the non-omics package managers.

13 The sets obtained for omicX and the non-omics package managers in 8 were compared with the set intersection function to find the number of commonalities between omicX and the non-omics package managers.

Glossary

Acknowledgements

Author Contributions

References

[???] https://docs.python.org/3.0/whatsnew/3.0.html

1. Performance Evaluation of OS-Level Virtualization Solutions for HPC Purposes on SoC-Based Systems
David Beserra, Manuele Kirsch Pinheiro, Carine Souveyet, Luiz Angelo Steffenel, Edward David Moreno
2017 IEEE 31st International Conference on Advanced Information Networking and Applications (AINA) (2017-03)
DOI: 10.1109/aina.2017.73

2. Security of OS-level virtualization technologies: Technical report
Elena Reshetova, Janne Karhunen, Thomas Nyman, N. Asokan
arXiv:1407.4245 [cs] (2014-07-16) http://arxiv.org/abs/1407.4245

3. Empowering App Development for Developers | Docker
Docker
(2020-09-03) https://www.docker.com/

4. Linux Containers - LXC - Introduction https://linuxcontainers.org/lxc/

5. openvz https://openvz.livejournal.com/

6. Linux-VServer http://linux-vserver.org/Welcome_to_Linux-VServer.org

7. CoreOS https://coreos.com/rkt/

8. An introduction to Docker for reproducible research
Carl Boettiger
ACM SIGOPS Operating Systems Review (2015-01-20)
DOI: 10.1145/2723872.2723882

9. Container-based Cluster Orchestration Systems: A Taxonomy and Future Directions
Maria A. Rodriguez, Rajkumar Buyya
arXiv:1807.06193 [cs] (2018-11-02) http://arxiv.org/abs/1807.06193

10. Google Trends https://trends.google.com/trends

11. How do you compare large numbers of items in Google Trends?
jonathanbriggs
Digital Jobs To Be Done (2017-07-10) https://digitaljobstobedone.com/2017/07/10/how-do-you-compare-large-numbers-of-items-in-google-trends/

12. Google Scholar https://scholar.google.com/

13. Web Scraper - The #1 web scraping extension https://webscraper.io/

14. Apps in AppImage format
AppImageHub https://appimage.github.io/apps/

15. https://bioconda.github.io/conda-recipe_index.html

16. Bioconductor - BiocViews http://bioconductor.org/packages/release/BiocViews.html#___Software

17. CRAN Packages By Name https://cran.r-project.org/web/packages/available_packages_by_name.html

18. Docker Hub https://hub.docker.com/search?q=&type=image&page=1

19. List of supported software — EasyBuild v4.2.2 documentation (release 20200708.0) https://easybuild.readthedocs.io/en/latest/version-specific/Supported_software.html

20. Flathub—An app store and build service for Linux https://flathub.org/apps/category/All

21. Packages — GNU Guix https://guix.gnu.org/packages/

22. https://omictools.com/software?page=1

23. Simple index https://pypi.org/simple

24. singularity-hub
vsoch
Singularity Hub https://singularity-hub.org/apps

25. Install Linux apps using the Snap Store | Snapcraft
@snapcraftio
Snapcraft https://snapcraft.io/store

26. Package List — Spack 0.15.4 documentation (2020-08-28) https://spack.readthedocs.io/en/latest/package_list.html

27. Vagrant Cloud by HashiCorp https://app.vagrantup.com

28. http://roscidus.com/0mirror/feed-list

29. pandas - Python Data Analysis Library https://pandas.pydata.org/

30. NumPy https://numpy.org/

Distribution System Name	URL	Publication	Type	License
AppImage	https://appimage.org	-	containerization	MIT
APT	https://wiki.debian.org/Apt	-	package manager	GNU GPL 2+
Bioconda	https://bioconda.github.io	Grüning et al, 2018	package manager	MIT
Bioconductor	https://www.bioconductor.org	Gentleman et al, 2004	package manager	MIT
conda	https://docs.conda.io/en/latest/	-	package manager	3-Clause BSD
CRAN	https://cran.r-project.org/index.html	-	package manager	GNU GPL
Docker	https://www.docker.com	-	containerization	Apache 2.0
EasyBuild	https://easybuilders.github.io/easybuild/	Hoste et al, 2012	package manager	GNU GPL 2
Flatpak	https://flatpak.org	-	containerization	LGPL
GNU Guix	https://guix.gnu.org	Courtès, 2013	package manager	GNU AGPL
Homebrew	https://brew.sh	-	package manager	2-Clause BSD
pip	https://pypi.org/project/pip/	-	package manager	MIT
Singularity	https://sylabs.io	-	containerization	3-Clause BSD
Snap	https://snapcraft.io	-	containerization	proprietary
Spack	https://spack.io	Gamblin et al, 2015	package manager	MIT or Apache
Vagrant	https://www.vagrantup.com	-	virtual machine	MIT
yum	http://yum.baseurl.org	-	package manager
Zero Install	https://0install.net	-	package manager	GNU LGPL 2.1+

Distribution System Name	First Release	Latest Release	Age in Years	Number of Releases
AppImage	2014-01-24	2020-06-01	7	121
APT	1998-03-31	2020-05-08	22	362
Bioconda	2014-01-24	2016-09-06	7	39
Bioconductor	2002-05-01	2020-04-28	17	37
conda	2014-01-24	2020-04-13	6	261
CRAN	1997-04-23	2020-02-29	22	29
Docker*	2013-03-23	2020-06-01	7	121
EasyBuild	2012-11-09	2020-04-14	7	51
Flatpak	2015-03-23	2020-04-03	5	128
GNU Guix	2012-07-07	2020-04-15	7	23
Homebrew	2009-05-20	2020-05-04	10	155
pip	2009-01-20	2020-04-28	11	81
Singularity	2012-07-07	2020-04-15	7	23
snapd	2014-12-09	2020-07-15	5	232
Spack	2014-07-09	2020-04-15	5	27
Vagrant
yum**	2002-06-08	2011-06-28	18	221
Zero Install	2005-02-04	2020-05-04	15	145