Tuesday, May 17, 2011

Implicit outputs : badness in a build

In a previous post, I talked about explicit versus implicit inputs and outputs.
The final issue to discuss is implicit outputs, and why they are bad for a build.

Why they're bad in general

Implicit outputs for a particular build translation are filenames that are not knowable without inspecting the inputs of that translation. (or similarly stated -- you can't know the filenames of implicit outputs until you run the build task)

For example, let's say we add a translation to unzip a "boost.zip" file. During build graph construction, in the worst case, we cannot know which files will be generated. It could be the case that boost.zip is a generated file, so there is no file to inspect during graph creation time!
(if boost.zip is not generated as part of the build, then a work-around is of course to allow inspection of the file during build graph creation -- but that's cheating!)
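
To make this concrete, here is a minimal sketch (in Python, purely illustrative -- not QRBuild code) of why the unzip outputs cannot be listed up front:

  import zipfile

  def unzip_outputs(archive_path):
      # We must open the archive to learn the output filenames --
      # this is exactly what makes them implicit outputs.
      with zipfile.ZipFile(archive_path) as zf:
          return zf.namelist()

unzip_outputs("boost.zip") cannot be called at graph-construction time if boost.zip is itself a generated file that does not exist yet.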

The real problem comes in when some other translation depends on such an implicit output. Let's say boost_format.hpp is generated, and some file foo.cpp depends on it. Since the build does not know that boost_format.hpp is generated at the start, it could schedule foo.cpp first, which would of course fail to compile (due to missing header). Later the build discovers this, and must retry the compilation of foo.cpp.

If you assume there is a well-formed order to the build (with no cycles), then an automated algorithm that discovers such an order would require at least O(N^2) work, where N = # of translations. This is clearly much worse than the O(N) required for a build that does not produce implicit outputs. The fundamental difference is that implicit outputs can dirty translations that previously completed; no other type of input or output can do this.

The reasoning for why it is O(N^2) goes like this :
  1. In the worst case, each pass at building N translations results in the final translation dirtying all other translations by producing implicit outputs.
  2. The build now knows that the final translation in the current pass is in its correct position in the global order.
  3. Reduce the remaining set of translations to N-1, and start again from step #1.
  4. Since each pass (costing O(N)) found one translation's global order, and there are O(N) passes, in total the cost is O(N^2).
The algorithm is effectively the same as a bubble sort for a build.
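
Here is a rough sketch of that pass-based ordering (the Translation API here -- run(), inputs(), outputs() -- is hypothetical, not QRBuild's):

  def build_with_implicit_outputs(translations):
      remaining = list(translations)
      runs = 0
      while remaining:
          for t in remaining:
              t.run()                    # outputs only known after running
              runs += 1
          # Anything that consumed a file produced *later* in the pass ran too
          # early and must be retried; in the worst case that is everyone except
          # the last translation, so the set shrinks by only one per pass.
          remaining = [t for i, t in enumerate(remaining)
                       if any(t.inputs() & u.outputs() for u in remaining[i + 1:])]
      return runs                        # worst case: N + (N-1) + ... + 1 = O(N^2)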

The clear lesson to me is : don't code up a build that produces implicit outputs!

Some examples of implicit outputs in the wild

  1. Unzipping or un-archiving files.
  2. Generating one header file per namespace or data-type from a .NET assembly.
  3. The java compiler generating a class file for each inner class.
  4. A tool that splits an image into many individual tiles or sub-images.
  5. A tool that splits a CD image into one file per track.

Common work-arounds

  1. Define phony targets (a form of manually specifying dependency information)
  2. Multi-step build processes where implicit outputs are completely produced by an early phase, so that a later phase can treat those outputs as if they were not generated. This allows the later phase's build graph creation to inspect the files and predict the implicit outputs.
  3. Cache the previous build's output-list to determine if a translation was dirtied. This is useful for the java inner-class issue, for example (see the sketch below).
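
Work-around #3 might look something like this (a minimal sketch with hypothetical file and method names):

  import json, os

  def load_cached_outputs(cache_path):
      # Best guess at the implicit outputs, taken from the previous build.
      if os.path.exists(cache_path):
          with open(cache_path) as f:
              return json.load(f)
      return []                          # first build: no prediction available

  def save_outputs(cache_path, outputs):
      # e.g. ["Foo$Inner.class", "Foo.class"] for the java inner-class case
      with open(cache_path, "w") as f:
          json.dump(sorted(outputs), f)

If the saved list differs from what the translation actually produced, any translation that consumed those outputs is considered dirtied.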

Tuesday, May 3, 2011

Types of inputs and outputs : Explicit versus Implicit

In a previous post, I mentioned the terms explicit inputs and outputs, and implicit inputs.
Here I'll take the opportunity to explain what I mean by these terms.

Explicit Inputs and Outputs
Explicit I/Os are file names that can be fully determined by the parameters to a build command.

Implicit Inputs and Outputs
Implicit I/Os are file names that can only be determined by inspecting the contents of an input file. In the general case, they are only knowable by executing a Translation.

Let's do an example of C++ compilation. Assume foo.cpp includes stdio.h.
cl.exe foo.cpp /c /Fo:foo.obj
foo.cpp is an explicit input.
foo.obj is an explicit output.
stdio.h is an implicit input. We cannot figure this out by looking at the command-line; we must look inside foo.cpp to figure that out.
There are no implicit outputs.
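
As a minimal sketch (hypothetical Python interface, not QRBuild's actual C# API), that same example maps onto explicit and implicit I/O like this:

  import re

  def scan_includes(path):
      # Naive #include scan; a real tool would ask the preprocessor instead.
      with open(path) as f:
          return set(re.findall(r'#include\s+[<"]([^">]+)[">]', f.read()))

  class CompileCpp:
      def __init__(self, source, obj):
          self.source, self.obj = source, obj

      def explicit_inputs(self):
          return {self.source}           # foo.cpp: named on the command line

      def explicit_outputs(self):
          return {self.obj}              # foo.obj: named on the command line

      def implicit_inputs(self):
          # stdio.h and friends: only discoverable by reading foo.cpp
          return scan_includes(self.source)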

Why are these classes of inputs and outputs important?
They represent different frontiers of knowledge that the build engine can have when constructing a graph.

Let's take a simplified look at how QRBuild does things. QRBuild constructs a graph of Translations, where each Translation is a unit of execution that advertises its inputs and outputs.

At the start of a build, dependency information between Translations can only be determined algorithmically based on each Translation's explicit inputs and outputs. At this stage, we cannot attempt to open any output-files to figure out implicit inputs or outputs, because said files probably don't exist yet! In an arbitrary build graph, any file may be generated (including headers or cpp files).

Based on the user-selected targets (which outputs to produce), the build now begins. The initial Translations don't depend on any other Translations' outputs. Once those are complete, other Translations that depend on the produced outputs are allowed to run, until all required Translations are executed.
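
A rough sketch of how those initial edges can be computed from explicit I/O alone (hypothetical helpers, not QRBuild's actual code):

  def initial_edges(translations):
      producer_of = {}
      for t in translations:
          for f in t.explicit_outputs():
              producer_of[f] = t         # who promises to create each file
      edges = set()
      for t in translations:
          for f in t.explicit_inputs():
              if f in producer_of:
                  edges.add((producer_of[f], t))   # producer runs before consumer
      return edges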

If there were no such thing as implicit inputs, we'd be done. Of course, we're not even close to done here :-) We know implicit inputs exist (like in C++ code), and we want them to cause dependency edges in the build graph. Without this automatic behavior, we'd have to manually add the same dependencies between Translations.

QRBuild basically employs a strategy of retrying each build command until all of its implicit inputs have been created.

When all inputs of a Translation are available, the engine asks it to update its implicit inputs. The Translation implementation does this by either running a special command on the input files, or by simply attempting to execute the real build command. If additional implicit inputs are found, they are reported back to the build engine.

At this point, the build engine checks to see if all inputs exist. If they all exist, then the Translation is allowed to execute normally. If some are missing, then the build graph is updated so that the Translation instance in question now depends on some other stuff. The engine then makes sure the other stuff runs before attempting our Translation again.
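
In sketch form (hypothetical method names; the real engine is C# and considerably more careful), the retry loop looks something like this:

  import os

  def try_translation(engine, t):
      implicit = t.update_implicit_inputs()    # preprocess, or just attempt the real command
      missing = [f for f in implicit if not os.path.exists(f)]
      if not missing:
          t.execute()                          # everything exists: run for real
          return
      for f in missing:
          engine.add_edge(engine.producer_of(f), t)   # t now depends on the producer
      engine.put_to_sleep(t)                   # retried once those producers complete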

Case in point : the MsvcCompile translation. It invokes the pre-processor with /showIncludes to get the list of includes (the implicit inputs).
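
A sketch of harvesting those includes might look like the following (illustrative only, not MsvcCompile's exact command line; note that the "Note: including file:" prefix is localized on non-English installs):

  import subprocess

  def msvc_implicit_inputs(source):
      proc = subprocess.run(
          ["cl.exe", "/nologo", "/Zs", "/showIncludes", source],   # /Zs: syntax-check only
          capture_output=True, text=True)
      text = proc.stdout + proc.stderr     # the notes land on one of the two streams
      prefix = "Note: including file:"
      return sorted({line.split(prefix, 1)[1].strip()
                     for line in text.splitlines()
                     if line.startswith(prefix)})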

What is the performance implication here?
Given a graph G containing N nodes (each node is a Translation), let's figure out how many Translation.Execute() calls we need to make.
  1. With only explicit I/O, at most O(N) executions. Each node will be visited at most once, in dependency order, so it's exactly N.
  2. With implicit inputs, at most O(N) executions. Worst case is actually 2*N.
Intuitively, I was worried that the implicit inputs algorithm would be N^2. Fortunately this is not the case. The proof goes something like this:
  • Consider translation T1.
  • Base case : No implicit inputs found. T1 is executed.
  • Inductive case : implicit inputs found.
    • All inputs exist : T1 is executed.
    • One input missing : a dependency is established to translation T2.
      T1 is put back in the "sleep pool".
    • T2 must complete before T1 is retried.
  • → Each time a new implicit input is found on some translation T1, some other translation T2 is enqueued and eventually completed. Since there are N total translations, there are a maximum of N retries. Therefore, the overall runtime is O(N).

File up-to-date checks

What determines whether a file is up-to-date in a build?
Quite simply, it is up-to-date when its inputs have not changed since the previous build. If a file's inputs have changed, that file is said to be dirty.

For example, with a C++ object file foo.obj generated from foo.cpp, the following are its inputs:
  1. Contents of foo.cpp.
  2. Contents of header files that it last included.
  3. Command-line parameters (aka command options) to the compiler.
  4. The compiler.
  5. Environment variables that influence the compiler.
In order to really get this right, each object file must have some dependency-file representing all this info at the end of the last successful compilation.
Some build systems (like make and MSBuild) use heuristics like "if any-input-modified-time > any-output-modified-time then consider output-file dirty". This works in common cases, but fails if a source-control system puts a file-date back in time.
Ultimately, the information used to determine whether a file has changed is a policy decision, best made by the user. It's quite clearly a policy since there is a continuum of trade-offs between performance versus accuracy, with no clear "right" answer.

Scons generalizes this policy with the Decider function. There is pre-written support for:
  1. MD5 signature : very accurate but requires reading all bytes of every input
  2. Timestamp checking : less accurate since touching a file triggers rebuilding its dependent targets, but low overhead to do the check (fstat is fast)
QRBuild has taken a page from the Scons book, even going so far as to call the interface IFileDecider. By default, a date+size decider is used. An MD5 decider is also available.
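
As a minimal sketch of the two policies (illustrative only; QRBuild's IFileDecider is a C# interface, not this):

  import hashlib, os

  def date_size_signature(path):
      st = os.stat(path)
      return (st.st_mtime_ns, st.st_size)      # cheap: one stat call per file

  def md5_signature(path):
      h = hashlib.md5()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(1 << 20), b""):
              h.update(chunk)                  # accurate, but reads every byte
      return h.hexdigest()

  def is_dirty(path, previous_signature, decider=date_size_signature):
      return decider(path) != previous_signature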

Notice that I didn't mention a change to the list of header files. #1 and #2 are sufficient to catch that, because the only way to change the list of headers is by either modifying the source file, or modifying one of the headers that was included in the last compilation. Many build systems successfully exploit this property, not least of all make with gcc.

QRBuild handles these checks in the engine, by requiring all Translations to implement functions that return lists of explicit inputs and outputs, implicit inputs, and canonicalized "cacheable" translation parameters. Splitting responsibilities and providing this as a standard feature frees individual Translation writers from having to repeat this work in each Translation class.
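
One way to picture the cacheable parameters (a hypothetical record format, not QRBuild's actual serialization):

  import hashlib, json

  def translation_signature(explicit_inputs, implicit_inputs, options, compiler_path):
      record = {
          "inputs": sorted(explicit_inputs) + sorted(implicit_inputs),
          "options": sorted(options),          # e.g. ["/O2", "/W4", "/c"]
          "compiler": compiler_path,           # the tool itself is an input too
      }
      return hashlib.sha1(json.dumps(record, sort_keys=True).encode()).hexdigest()

If this signature differs from the one recorded after the last successful run, the outputs are considered dirty even though no source file changed.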

Frequently, you will find build systems that don't take compiler options into account. The result is frustrating -- for example, you might edit a simple Makefile and run make, only to find that nothing was recompiled! Custom build steps in Visual Studio projects often suffer the same issue.

Some authors work around this by adding the Makefile/build-scripts to the target files' dependency lists. While this is technically correct, it is a vastly sub-optimal solution. Since a single build script usually controls many independent targets, a modification to the command options of one target will cause all other independent targets to rebuild as well.

Lastly, you will find it very rare for the build tools themselves to be included in the dependency list. To some degree this is not as big a deal, since the choice of build tool rarely changes. But it's easy to add at least the required executable to each target's dependency list as basic insurance. The QRBuild MsvcCompile translation class handles this by adding the VcBinDir and toolchain to the cacheable translation parameters.

MSVC "enable minimal rebuild" : triumph and tragedy

The Enable Minimal Rebuild (/Gm) feature of the Microsoft C++ compiler (MSVC) is really great. When enabled, the compiler is actually able to determine whether a significant code change has occurred, and selectively re-compiles only the functions that have changed. For example, if you only change a comment, the compiler will happily print out "Skipping... (no relevant changes detected)". This is certainly a triumph!

So what's the tragedy? All the documentation states that minimal rebuild cannot be combined with parallel builds -- namely the /MP switch, which causes cl.exe to spawn itself multiple times. Apparently we must choose between single files being quick to compile and many files being quick to compile. However, you can have both, as I will demonstrate below.

To further explain, the bulk of the documentation is geared towards users of the Visual Studio project systems. Unfortunately, the Visual Studio project systems (*.vcproj and *.vcxproj) only support parallel C++ builds via /MP. (this is because neither vcproj nor MSBuild vcxproj support file-level dependencies)
The /MP switch is of course nonsense in the context of a real dependency-based build system like make, bjam, Scons, or QRBuild. A build system that provides a global solution for parallelism always works at the finest granularity, and utilizes all processors effectively. Local solutions like /MP will end up achieving much worse performance as the problem scale increases (like when building many MSBuild projects with project-level parallelism enabled). "Tuning C++ build parallelism in VS2010" demonstrates this very well in the sub-section "Too much of a good thing", where the author shows each MSBuild project launching NumCores cl.exe processes, which overwhelms the machine.

The root of the problem is that by default, cl.exe creates a single vcx0.pdb and vcx0.idb across all source files (where x is the msvc version -- for example, the VS2008 compiler produces vc90.pdb and vc90.idb). In order to arbitrate across multiple cl.exe processes in a parallel build, MSVC launches a server process called mspdbsrv.exe to synchronize access to a common pdb file.
Notice that there is no equivalent msidbsrv.exe for synchronizing access to a common idb file. I suspect this is the core reason why /Gm is unsupported for parallel builds.

What can we do to get parallelism and minimal rebuilds? The trick is to compile each source file independently, and force each file to generate a separate pdb and idb file. Then there is no issue of parallel compilations stepping on each others' idb files.
cl.exe accepts the /Fd switch to select the pdb name per compilation. The pdb name also implies the idb name, so this is sufficient to solve the problem.
QRBuild successfully employs this trick.
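For example, a per-file invocation might look something like this (illustrative only, not QRBuild's exact command line):
cl.exe foo.cpp /c /Gm /Fofoo.obj /Fdfoo.pdb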

caveats:
Apparently YMMV with regard to compilation speed with /Gm.
  1. Slowness when compiling boost.
  2. Possibility of slower link times, as this FAQ implies.

Sunday, May 1, 2011

File-level dependencies == parallelism

I mentioned in a previous post that file-level dependency ordering is what defines a build system.
Let's see why this is with a counter-example : MSBuild.

MSBuild works by running a list of Targets in dependency order. Each Target executes one or more tasks in sequence. So overall, all Task executions are completely serialized within a single project's execution. A fancy C++ project sequence will go in phases like this:
  1. Generate source file gen1.cpp from gen1.pl
  2. Generate source file gen2.cpp from gen2.py
  3. Compile all C++ sources. (gen1.cpp, gen2.cpp, a.cpp, b.cpp, c.cpp)
  4. Link object files.
  5. Copy linker output to final destination.
According to MSBuild's programming model, each step must fully complete before the next step may begin. In this case, the opportunity to parallelize the code-generation in #1 and #2 is lost. Additionally, #1 and #2 could be parallelized with the compilation of a.cpp, b.cpp, and c.cpp; this opportunity is lost as well.
With max parallelization of 3 set on the CL task, you might see this play out:

Time      0         1         2        3          4
Proc #2   .         .         c.cpp    .          .
Proc #1   .         .         b.cpp    gen2.cpp   .
Proc #0   gen1.pl   gen2.py   a.cpp    gen1.cpp   prog.exe

A build system that supports file-level dependencies overcomes this kind of wasteful serialization. The code-generation can occur while independent compilation occurs. For example, a Makefile can achieve the following:

Time      0         1         2          3          4
Proc #2   a.cpp     gen1.cpp  .          .          .
Proc #1   gen2.py   c.cpp     .          .          .
Proc #0   gen1.pl   b.cpp     gen2.cpp   prog.exe   .

Of course, both of these examples are over-simplified -- build steps are quantized into equal chunks of time. In a real build, variable times for each build step will cause things to overlap in the most efficient way possible, resulting in optimal behavior under varying conditions.

Why is Build automation important?

Build automation is the lifeblood of any software organization. When it works, everyone can practically ignore it and focus on the bottom-line. When it doesn't work, it impacts every decision made by each person in the organization.

The biggest observable problems a build can have are:
  1. Slow incremental build speed.
  2. Poorly designed build scripts.
Slow incremental builds can be caused by many factors. False dependencies cause too much stuff to be built. Bad or missing dependencies create mistrust for a build, and require a full rebuild (maximum cost!) to work around.

Poorly designed build scripts are the other big problem. Build scripts are not amenable to change if a large number of files must be edited in order to re-arrange code, branch a module within the overall repo, or add a custom build step. A high degree of coupling between modules through the build system is another source of maintenance issues.

Slow builds bring individual developer productivity to a halt.
Developers work on an iteration cycle of repeat { edit -> build -> test }. The faster a dev can iterate, the faster they can converge upon a solution to a given problem. When build and test time are zero, then you are purely limited by developer editing speed -- that is the ultimate goal.
Qualitatively, who do you think will achieve more success : a team with incremental build-time of 1 second, 30 seconds, or 2 hours?

Slow builds discourage active code-reviews and participation.
In my experience, requiring code-review before committing is a huge determiner of code quality. In this kind of successful environment, reviewers must apply patches, build the software, and potentially tweak it as they find bugs.
What is the single greatest deterrent to applying a patch? The impending doom of a 2 hour build, of course! A 2 hour build forces each developer to think, "Do I get to work on my own objectives today, or will I lose the day to this review?"

Slow builds discourage good source-control practices.
You know how everyone says to keep commits minimal and orthogonal? That's really hard to do when each one-line fix costs you 2 hours to verify! So let's say you make the decision to commit without building first (if you're awesome enough to try this, of course). There's a good chance that the continuous build server will do something dumb and spend 2 hours on each of your three one-liner back-to-back commits.
Long build times tend to encourage developers to glob larger changes into single commits. And that's a bad thing.

Slow builds discourage developers from updating from source-control as often.
If it costs 2 hours of build time after you update/sync/rebase to the latest version of the code-base, then at most you will do this once per day.
Good-bye to best practices like sync'ing and building one last time before committing your code. Which of course implies that the build will be broken a higher % of the time.

Slow incremental builds make it harder to pin down build failures.
When the build takes 2 hours, continuous build servers will always be processing large batches of commits at a time. So when the build breaks due to one of the commits, you have a slew of problems:
  1. You found out at best 2 hours after the commit occurred. More likely, you found out after 4 hours (2 hours for previous build to complete + 2 hours for the breaking change).
  2. Which commit is it? You probably have 10 commits to look through.
  3. Back-to-back breaking changes (especially by different submitters) are especially hard to work through.
If a developer leaves work at time T, he will not commit code after T - buildTime. Therefore, with D developers, you are losing O(D*buildTime) time on your team.
Developers need to be around in case their commits break the build. If a build takes 2 hours, the developer will be sitting on a set of changes for 2 hours at the end of each day! What a waste of time. Hopefully he has some orthogonal piece of work, or else he's literally blocked.

Slow builds discourage the creation of multiple branches.
You're telling me I have to integrate these N changes into the Developer and Release branches, and build both? Woe unto me if the build mucks with some system-global state -- then I can't even run the two branches' builds in parallel!

Inflexible build scripts lead to more bad practices.
Example : if you can't easily add a code-generation target to the build, you might just check in the generated files. Now you have another maintenance problem!

Build scripts that don't implement 'clean' correctly have a high cost.
When 'clean' doesn't work, then developers are always manually deleting stuff in the file system.
Couple this with bad or missing file-level dependency checking, and you find yourself manually deleting files and rebuilding on every edit. Ouch.

Hopefully all these examples have convinced you of the importance of a correct, fast, and maintainable build.

And finally, an analogy to make this all feel a bit more "real".
Consider each developer to be a thread; the build server and repository server are also threads. Each thread has a copy of the repo in its TLS (thread-local storage).
The repository has a commit FIFO (a write FIFO); developer threads produce commits and enqueue them in some order. There is a similar FIFO on the build server, which the repository server pushes into after each commit has gone through.
The build time determines the frequency at which developer threads produce commits. Faster build time implies greater throughput to the build server. The build time also determines the build server's frequency of producing a result.

I leave you with this mighty pbrush creation: