Interview: vzpkg2 and pkg-cacher creator Robert Nelson


Robert Nelson seems to have come out of nowhere with an update to vzpkg. Before we get started let me briefly review what vzpkg is.

An OS Template is what OpenVZ uses as install media so you may install a Linux distribution into a container... since you cannot use a traditional CD-ROM / DVD or .iso disk image. An OS Template is a .tar.gz file that represents a somewhat stripped down version of an installed Linux distribution as you would find it installed on a disk filesystem. So, if you want to create a CentOS 5.2 i386 container, you need to find a CentOS 5.2 i386 OS Template.

There are a number of recipes on the OpenVZ wiki for building OS Templates for various Linux distributions but the general process takes several steps and is quite a bit of work. Any tool that can simplify the creation (and updating) of an OS Template is a welcome addition. OpenVZ comes with vzpkgcache (part of the vzpkg package) which is designed to facilitate OS Template creation for Red Hat based distributions.

In order to use vzpkg you also have to have a metadata package for the target distribution. A metadata package basically specifies a few pieces of required information: 1) the distribution, version, and architecture, 2) where to find package repositories, 3) what packages to install, and 4) any dummy packages.

What is a dummy package? There are a few stock distribution packages that are either problematic for a container or aren't used by a container at all. Replacing such packages with dummy packages does two things: 1) satisfies package dependency requirements, and 2) frees up disk space that the real package would have consumed. There are only a small handful of dummy packages and the vast majority of packages used to build an OS Template come from the package repositories provided by the distribution. The two most common examples of dummy packages are the kernel package and the dev / udev package. Since containers do not run a kernel of their own (in OS Virtualization there is a single kernel on the host node), installing a kernel package inside of a container would be a waste of disk space. Devices are handled a little differently inside of containers so the dev / udev package requires a bit of customization and a dummy package is used rather than the stock distro package.

One drawback of the existing vzpkg offered by the OpenVZ Project is that there are only a small handful of metadata packages so the number of OS Templates you can build is quite limited.

Who is Robert Nelson?

ML: Please tell me a little bit about yourself. Where are you from (originally and now)? What is your educational background? What are your hobbies? What is your family status (married, kids?)? What do you do for a living?

Robert: I was born in Burnaby, British Columbia, Canada. Burnaby is a suburb of Vancouver where Expo '86 was held. I spent most of my life living in various cities all across Canada. In 1992 I moved to Seattle, Washington in the United States to work at Microsoft. In 2004 I retired from Microsoft and currently live in Bellevue, Washington a few miles from the Microsoft Campus. My partner of 8 years and I don't have any children but we have two miniature Dachshunds that think they are our children.

Since retiring I've occupied my time managing my real estate investments and contributing to open-source projects; programming is probably the closest thing I have to a hobby :-). Most of the open-source projects I've been involved with have been directly or indirectly related to my business.

ML: How long have you been programming? What programming languages do you use / prefer? Are there any other software projects you are / have been involved with that you would like to mention?

Robert: I started programming in 1973 when I used to skip high school to sneak off to Simon Fraser University to play with the IBM 370/155 mainframe. I was around so much that they offered me a summer job developing courses using a CAI language they developed as an extension to APL. So my first languages were APL/CAI and CourseWriter III. From there I branched into PL/1 and System 360 Assembler.

Over the years I've learned and programmed in pretty much every programming language ever developed, including some oldies but goodies like Fortran and Cobol (somehow I missed out on Algol). Most of my professional life has been spent programming in C/C++ and various machine languages.

Lately I've been working a lot in Perl, PHP, Python and Shell scripts because they are the primary languages used in open-source projects.

I don't really have any preferences regarding specific programming languages, I believe that they are all just tools to get the job done. Some are better suited for certain jobs than others but they all have strengths and weaknesses.

Ever since I started playing with Actor (a defunct language, like Smalltalk but with a syntax similar to C++ rather than Pascal), I found I prefer object-oriented ones. I find that object-oriented programming is the best way to organize my thoughts and make large projects more manageable. Even when I'm using straight C code I still organize the code as if I was writing C++.

For most of my career I've worked on "system code", operating systems, compilers and device drivers. When I was with Motorola I worked primarily on proprietary and Unix SVR4 minicomputers. At Microsoft I worked on the Interactive TV project, Windows CE and in the Windows NT kernel group on the I/O subsystem and Plug and Play. Since leaving Microsoft I've worked mainly on Linux. I avoid a religious attachment to any platform, I feel that, like programming languages, each has its own strengths and weaknesses. Sometimes I think that the acolytes on all sides must be compensating for some physical shortcoming of their own. :-)

The main open-source projects I've contributed to include Bacula, mtx, FreePBX, and GForge. I've also contributed fixes to countless others. I've contributed a few of the tools I've written to the open-source community. Usually my involvement starts out with fixes for bugs that hinder my use of the software. If there is some area that could be improved to make the software much more useful for me then my contribution might be larger. It really depends on my interest and how easy it is to work with the other developers involved in the project. But my involvement is usually selfish.

ML: How long have you been using OpenVZ? What other virtualization products have you tried and do you use? What do you use OpenVZ for?

Robert: I've been using OpenVZ for over a year. My interest in virtualization products sprang from my desire to get more use out of a dedicated server I leased and my work on mtx (which I took over about a year or so ago) which in turn was an offshoot of my involvement in Bacula. Both Bacula and mtx required building and testing on a wide variety of operating systems and versions. I found the process of installing and booting all those operating systems tedious and was looking for a better solution than filling my house with dedicated machines.

I started out with VMware. While it mostly met my needs, I prefer an open-source solution where I can change the things I don't like and there is more likelihood of someone contributing other useful tools. I then switched my focus to Xen. It provided most of the functionality of VMware albeit with reduced performance for Windows guests due to a lack of PV drivers for video and disk.

That would have probably been the end of my virtualization quest were it not for another requirement coming from my business. Since a number of my investments are in Canada I've been using VoIP for a few years. I had been running an Asterisk server in my house but found Comcast's network somewhat unreliable in terms of latency. So after shopping around I found a good deal on a dedicated server with great connectivity and moved my Asterisk server there.

Based on the success of running the server there I decided it would be nice to move other servers like my email server out of my house and on to the server. The only drawback was that the system was somewhat resource limited and the cost of increasing memory was a significant monthly increase. So I went looking for a more resource efficient way of running multiple virtual servers.

I looked at VServer and OpenVZ. VServer seemed unreliable and without much of a community backing it up. OpenVZ fit the bill and I settled on it. It has worked well on my dedicated server. I run three virtual machines: a DNS server, an Asterisk server and a Zimbra server. I've since replaced the rented dedicated server with a co-located one of my own and added two additional virtual servers: Funambol for mobile sync and EJBCA for a certificate authority. [Editor's Note: I disagree with Robert's assessment of Linux-VServer. My experience has been that it is very stable and has a very active community backing it up. YMMV.]

As a result of my experience with OpenVZ I set up a virtual build machine that runs about 16 variations of operating systems and versions using a combination of Xen and OpenVZ. This allows me to release a new version of mtx for two versions of Debian, three of Fedora, two of CentOS/RHEL, two of FreeBSD, three of OpenSUSE, three of Ubuntu and Windows for both 32 and 64 bits in an hour or so. That's a total of 32 different builds all using one machine with no reinstalls, rebooting or other manual steps.

The stock vzpkg

ML: So, vzpkg used to work fairly well but over time, in certain situations, it started to fail. What is wrong with the current version?

Robert: The main limitation of the current version is it was developed to support Red Hat distributions and is dependent on Yum/RPM. Another limitation is that, due to the structure of the template meta data, there was a lot of duplication of information resulting in extra maintenance.

vzpkg2 and pkg-cacher

ML: You have added a number of features / capabilities to vzpkg. Could you give us an overview of what's new?

Robert: I think the most significant change over the stock version of vzpkg is the separation of the packager specific code from the higher level code. This allows scripts to be written to support other package managers like apt which is used on Debian and Ubuntu.

The other slightly less significant change is the introduction of the concept of a hierarchical structure to the template meta data. Information which is the same for all versions and platforms of a distribution need only be specified once. If there is a need for separate settings for a specific version it can be overridden by a file lower in the template meta data tree.
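[Editor's Note: The override rule Robert describes can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the tree layout, the setting names, the example URLs and the function name resolve_settings are hypothetical and are not taken from vzpkg2, which is implemented as shell scripts.]

```python
def resolve_settings(tree, leaf):
    """Merge settings from the top of the tree down to `leaf`.

    `tree` maps a path such as "centos/5" to the settings defined at that
    level. Settings found deeper in the tree override those found higher up.
    """
    merged = {}
    parts = leaf.split("/")
    for depth in range(1, len(parts) + 1):
        level = "/".join(parts[:depth])
        merged.update(tree.get(level, {}))  # deeper levels win
    return merged


# Hypothetical metadata tree: distribution / version / platform.
TREE = {
    "centos": {"packager": "yum", "mirror": "http://mirror.example/centos"},
    "centos/4": {"mirror": "http://vault.example/centos"},  # version override
    "centos/5/x86_64": {"arch": "x86_64"},
}

print(resolve_settings(TREE, "centos/5/x86_64"))
print(resolve_settings(TREE, "centos/4/i386"))
```

In this sketch the packager is stated once at the distribution level and inherited everywhere, while the hypothetical version 4 entry replaces only the mirror.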

Also new packager-independent commands have been added for managing packages in installed containers.

ML: You added a new package named pkg-cacher. Where did pkg-cacher come from and what does it do exactly? Can pkg-cacher be used independently of vzpkg2?

Robert: Most people managing multiple machines (physical or virtual) end up installing a local mirror of some sort. This ranges from a subset of a distribution like only the updates for a distribution on a single platform to multiple mirrors of multiple distributions, versions and platforms. These mirrors are generally maintained using rsync. They are used to reduce bandwidth usage and installation time.

However the amount of disk space and bandwidth used to maintain these mirrors can be quite significant. Particularly when most of the packages are never actually used in the target environment.

While looking for a solution to these drawbacks I came across apt-cacher available with Debian. It is a server written in Perl that acts as a transparent caching HTTP proxy. It processes requests from apt just like an HTTP server but forwards them to a distribution server and keeps the results in a cache, then uses it to respond to subsequent requests for the same file. It knows about the different types of files: packages versus packager meta data. Since a package is static once released there is no need to check the server for updated versions whereas meta data changes over time and the distribution server must be checked for updates. It also understands that the data may be present on multiple mirrors but the content will be the same regardless of which mirror is used.

General purpose caching proxies such as Squid may be used but they do not understand the unique attributes of distribution repositories and will duplicate files retrieved from different mirrors. They also rely on the HTTP headers to decide retention policy rather than using the packager meta data.
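[Editor's Note: A minimal Python sketch of the caching policy described above, under stated assumptions. The class name PackageCache, the suffix list and the injected fetch callable are hypothetical and do not come from apt-cacher or pkg-cacher. The sketch keys the cache on the path alone, which matches the Debian-style layout; a real proxy would also revalidate meta data with the upstream server rather than simply fetching it again.]

```python
# Assumption for illustration: packages are recognised by file suffix.
PACKAGE_SUFFIXES = (".deb", ".rpm")


class PackageCache:
    """Toy repository-aware cache (not the real apt-cacher logic)."""

    def __init__(self, fetch):
        self.fetch = fetch  # callable(url) -> bytes, injected by the caller
        self.store = {}     # path -> body; the mirror is NOT part of the key

    def get(self, mirror, path):
        is_package = path.endswith(PACKAGE_SUFFIXES)
        # A released package never changes, so a cached copy is always valid,
        # no matter which mirror the client asked for.
        if is_package and path in self.store:
            return self.store[path]
        # Meta data changes over time, so this sketch always goes upstream.
        body = self.fetch(mirror + path)
        self.store[path] = body
        return body
```

Because the mirror is left out of the cache key, the same package requested through two different mirrors is downloaded and stored only once, which is the duplication a general purpose proxy cannot avoid.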

I used apt-cacher to handle my Debian distributions but wanted the same functionality for my other distributions such as Red Hat derived ones. So I rewrote it as pkg-cacher.

apt-cacher takes advantage of a key property of the Debian distributions. The version and platform specific meta data is stored separately from the packages. The packages for all versions are stored in a single consolidated set of directories so there is no chance of two packages having the same file name but different content.

However the same is not true of Red Hat derived distributions. Each version and platform has its own copy of the packages built using that release and there are a number of packages with identical names but different content. There are also packages which are unchanged from version to version within a distribution as well as across distributions.

In order to deal with these differences pkg-cacher uses a different directory structure for its cache.

Other significant differences from Debian are that the Red Hat packager uses the Range HTTP header to retrieve partial information from the packages and that some distributions use the HTTP Redirect header to transfer to a mirror closest to the client. I have added support for these headers in pkg-cacher.
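[Editor's Note: To illustrate what honoring a Range request involves, here is a minimal Python sketch that serves part of an already cached file. It handles only a single byte range and is not pkg-cacher's code (pkg-cacher is written in Perl); the function name slice_for_range and its return shape are hypothetical. Redirect handling is not shown.]

```python
import re

# Matches a single range such as "bytes=0-99", "bytes=100-" or "bytes=-50".
_RANGE_RE = re.compile(r"^bytes=(\d*)-(\d*)$")


def slice_for_range(body, range_header):
    """Return (status, payload, content_range) for a single-range request."""
    total = len(body)
    match = _RANGE_RE.match(range_header.strip())
    if match is None:
        return 200, body, None            # unparsable header: send everything
    first, last = match.groups()
    if first == "" and last == "":
        return 200, body, None
    if first == "":                       # suffix form: the last N bytes
        start = max(total - int(last), 0)
        end = total - 1
    else:
        start = int(first)
        if last != "" and int(last) < start:
            return 200, body, None        # invalid range: ignore the header
        end = min(int(last), total - 1) if last != "" else total - 1
    if start >= total or end < start:
        return 416, b"", f"bytes */{total}"   # range not satisfiable
    return 206, body[start:end + 1], f"bytes {start}-{end}/{total}"
```

A caching proxy that ignored the header and always returned the whole file would still be correct HTTP, but it would give up the saving that a partial request is meant to provide.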

pkg-cacher is designed to be a standalone tool separate from the new vzpkg2. However its use complements vzpkg2 and the default installation of vzpkg2 depends on it.

ML: How does pkg-cacher enhance vzpkg2?

Robert: The original vzpkg reduces the downloads by pointing yum's cache at a directory within the template meta data tree. While this was a step in the right direction, it still meant duplication across platforms. It also provided no benefit to installed containers.

pkg-cacher provides the benefits described in the previous question for producing cached templates as well as installed containers.

ML: Does pkg-cacher come into play from the perspective of the containers?

Robert: The default templates included as part of vzpkg2 configure the template meta data so that it uses the pkg-cacher server configured in vzpkg.conf as VZPKG_CACHE_HOST. The operating system installed configuration files are disabled by renaming them with a .disabled suffix and a new configuration file is installed pointing to the pkg-cacher server.

ML: What container configuration changes have to be made in order for a container to use the services provided by pkg-cacher?

Robert: Generally all that needs to be done is change the name of the server in the packager configuration files. This is done automatically for containers installed from cached templates generated by vzpkg2.

ML: Does one have to use pkg-cacher in order to use vzpkg2?

Robert: No, all that is required is modifying the vzpkg.conf files located in the template meta data to use another proxy server, the original distribution servers or even a copy of the distribution server in the local filesystem.

ML: Are there any features in vzpkg2 that you can't use without pkg-cacher?

Robert: No, pkg-cacher supplements vzpkg2 providing more efficient use of resources.

A vision of the future

ML: What is the next step? After vzpkg2 has had a bit more community testing and you have gotten feedback and made any additional changes to it that are needed, is the plan for it to replace the official vzpkg or would you prefer it to stay an independent / separate app?

Robert: I would like to see the vzpkg2 changes incorporated into OpenVZ and replace the current outdated vzpkg. I anticipate that pkg-cacher will always remain a separate tool because of its general usefulness.

ML: Are there any features you haven't added to vzpkg2 (or pkg-cacher) yet that you hope to implement in the not too distant future?

Robert: There are still a number of features of apt-cacher which I haven't rewritten to work with pkg-cacher. These are primarily in the area of maintenance of the cache, such as removal of packages which are no longer referenced by the packager meta data. I also plan on eliminating the dependency on the lockfile utility included as part of procmail. This was a dependency that I didn't realize was there until it was brought to my attention recently. The final planned change is conversion from a multiple process to a multithreaded application for improved efficiency. A minor configuration change is the port that pkg-cacher uses. It currently uses port 3142 which apt-cacher used. However that port is actually registered to something else but near as I can tell isn't actually used for its intended purpose. In the interest of being a good netizen, I currently have a registration request pending with the IANA for a port specifically for pkg-cacher's use.

For vzpkg2 there are two significant remaining work items. First is the completion of the manual pages for all the commands. The second is support for a way of specifying the included packages incrementally.

Currently package lists further down in the template meta data tree replace those above. Ideally it would be useful to say for example that, in a specific version, package X shouldn't be included but package Y should. Also it would be nice to be able to include the processed list of packages from another list. For example you would be able to say that the "web server" configuration is all the packages included in the "small" configuration with packages X, Y and Z added.
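[Editor's Note: The incremental package lists Robert would like to add are a planned feature, not something vzpkg2 does today. The Python sketch below only illustrates the idea. The configuration format with "include", "add" and "remove" keys, the package names and the function name resolve_packages are all hypothetical.]

```python
def resolve_packages(configs, name, _seen=()):
    """Expand a named configuration into its final set of packages."""
    if name in _seen:
        chain = " -> ".join(_seen + (name,))
        raise ValueError(f"circular include: {chain}")
    spec = configs[name]
    packages = set()
    # Start from the processed lists of every included configuration.
    for parent in spec.get("include", []):
        packages |= resolve_packages(configs, parent, _seen + (name,))
    # Then apply this configuration's own additions and removals.
    packages |= set(spec.get("add", []))
    packages -= set(spec.get("remove", []))
    return packages


# Hypothetical configurations; the package names are placeholders.
CONFIGS = {
    "small": {"add": ["pkg-a", "pkg-b"]},
    "web server": {"include": ["small"], "add": ["pkg-x", "pkg-y", "pkg-z"]},
    "web server, newer version": {
        "include": ["web server"],
        "remove": ["pkg-x"],   # package X shouldn't be included...
        "add": ["pkg-w"],      # ...but another package should
    },
}

print(sorted(resolve_packages(CONFIGS, "web server, newer version")))
```

The design choice shown here is that a lower-level list describes only its differences, so a change to the "small" list automatically flows into every configuration that includes it.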

ML: What limitations, if any, do you see with vzpkg2 and what would your perfect OS Template manager be like?

Robert: I think the only significant limitation of vzpkg2 is its implementation using shell scripts. Some functions would be much easier to implement in Perl or Python. As far as functionality I believe that, with the additions described in the previous answer, it addresses all my needs for a package manager. I'm happy to entertain any suggestions others might have.

I suppose the obvious area that others might take exception to is the lack of support for either openSUSE or Gentoo. For openSUSE I've created templates for version 10.x; however, in 11.x the openSUSE folks modified rpm in a way completely incompatible with the upstream version as well as the version supplied on every other distribution. They did this for an increase in compression ratios whose benefit is far outweighed by problems caused by the incompatibility with the rest of the world and even previous versions of their own distribution. They created this incompatible version of rpm without renaming the tool. Because of this incredible lack of judgment on their part (IMHO) I haven't bothered to support it. That doesn't stop anyone else, with a need, from porting their version of rpm to other distributions, renaming it something like rpmSUSE and creating the appropriate packager specific scripts for vzpkg2. One of the main benefits of vzpkg2 over vzpkg is that it is extensible merely by adding additional scripts specific to the packager used on the new distribution.

There is no technical reason why Gentoo couldn't be supported by the current vzpkg2 other than I just haven't gotten around to writing the interface scripts for emerge. I concentrated on yum/rpm and apt/dpkg to ensure that I had the right level of abstraction to deal with two very different packaging solutions. I also figured that supporting Red Hat and Debian based distributions covered the vast majority of users.

ML: Given the vast amount of changes from vzpkg to vzpkg2, and the addition of pkg-cacher... I see a need for a bit of updated / additional documentation. How is that going? All done? Need some help?

Robert: Documentation is one area where there is always need for more, better, more concise, ... (fill in your favorite adjective here). I am working to create additional manual pages for all the commands. But if someone wants to volunteer, particularly on creating higher level, user friendlier docs I think that would be great.

In conclusion

Robert, thank you... for your work on vzpkg2 and pkg-cacher... and for the time you put into answering my questions. One last one for you though...

ML: Are there any topics I've overlooked that you'd like to mention or any additional comments you'd like to make?

Robert: None that I can think of.

===End of Interview===

One thing worth mentioning is that while the number of OS Template Metadata packages provided by the OpenVZ Project is quite limited, Robert has created new metadata packages for vzpkg2 that allow for easily building CentOS, Debian, Fedora, and Ubuntu OS Templates. If I counted correctly, Robert's new metadata packages make it easy to build 44 different OS Templates. Wow!

It might take a few more weeks before vzpkg2 and pkg-cacher are finalized and added to the OpenVZ Project repositories. If you don't want to wait and would like to help out with testing, Robert has posted some instructions to the OpenVZ Users mailing list and here is a link to the archive for the time period in question: