Tuesday, May 17, 2011

Implicit outputs : badness in a build

In a previous post, I talked about explicit versus implicit inputs and outputs.
The final issue to discuss is implicit outputs, and why they are bad for a build.

Why they're bad in general

Implicit outputs for a particular build translation are filenames that are not knowable without inspecting the inputs of that translation. (or similarly stated -- you can't know the filenames of implicit outputs until you run the build task)

For example, let's say we add a translation to unzip a "boost.zip" file. During build graph construction, in the worst case, we cannot know which files will be generated. It could be the case that boost.zip is a generated file, so there is no file to inspect during graph creation time!
(if boost.zip is not generated as part of the build, then a work around is of course to allow inspection of the file during build graph creation -- but that's cheating!)

The real problem comes in when some other translation depends on such an implicit output. Let's say boost_format.hpp is generated, and some file foo.cpp depends on it. Since the build does not know that boost_format.hpp is generated at the start, it could schedule foo.cpp first, which would of course fail to compile (due to missing header). Later the build discovers this, and must retry the compilation of foo.cpp.

If you assume there is a well-formed order to the build (with no cycles), then an automated algorithm that discovers such an order would require at least O(N^2) work, where N = # of translations. This is clearly much worse than the O(N) required for a build that does not produce implicit outputs. The fundamental difference is that implicit outputs can dirty translations that previously completed; no other type of input or output can do this.

The reasoning for why it is O(N^2) goes like this :
  1. In the worst case, each pass at building N translations results in the final translation dirtying all other translations by producing implicit outputs.
  2. The build now knows the final translation in the current pass is at the correct global order.
  3. Reduce the remaining set of translations to N-1, and start again from step #1.
  4. Since each pass (costing O(N)) found one translation's global order, and there are O(N) passes, in total the cost is O(N^2).
The algorithm is effectively the same as a bubble sort for a build.

The clear lesson to me is : don't code up a build that produces implicit outputs!

Some examples of implicit outputs in the wild

  1. Unzipping or un-archiving files.
  2. Generating one header file per namespace or data-type from a .NET assembly.
  3. The java compiler generating a class file for each inner class.
  4. A tool that splits an image into many individual tiles or sub-images.
  5. A tool that splits a CD image into one file per track.

Common work-arounds

  1. Define phony targets (a form of manually specifying dependency information)
  2. Multi-step build processes where implicit outputs are completely produced by an early phase, so that a later phase can treat those outputs as if they were not generated. This can allows the build graph creation of the later phase to inspect files and predict implicit outputs.
  3. Cache the previous build's output-list to determine if a translation was dirtied. This is useful for the java inner-class issue, for example.

No comments:

Post a Comment