Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969


The insidious cost of allocations

time to read 5 min | 835 words

One of the things we have learned from building high performance systems is that algorithmic complexity is important for system performance, but it is (usually) easy to see.

Controlling allocations is something that you don’t see. But it can have an even greater impact on your system. In particular, allocations require some CPU cost immediately, but the .NET framework is pretty good about keeping that small. In most cases, that requires only bumping a counter value and giving you some memory. The true cost of allocations comes when you need to release that memory: GC, compactions, multiple generations, etc.

There are whole fields of study that deal with optimizing this issue. The downside of allocating a lot of memory is that you are going to pay the cost later, and figuring out exactly what allocated that memory can often be quite hard. Especially if you are trying to figure it out from a production trace.

And even with state of the art GC and memory management subsystems, your best bet for reducing memory management pressure is to reduce the number of allocations. We are far from the first to run into this. In fact, see the guidelines for contributing to Roslyn.

With RavenDB, here are a few commit messages dealing (only) with this issue:

[image: commit messages dealing with allocation reductions]

The results of those performance improvements cannot really be overstated. In some cases, we shave an allocation by just 8 or 16 bytes. But in a hot path, that means that we reduce gigabytes of memory being allocated.

The problem with allocations is that they can be pretty well hidden. Here is a code sample with hidden allocations:

var fileNames = new List<Tuple<string, bool>>();
foreach (var file in System.IO.Directory.GetFiles(dirToCheck, "*.dat"))
{
    using (var fs = new FileStream(file, FileMode.Open))
    {
        uint crc = CalculateCrc(fs);
        var crcStr = File.ReadAllText(file+".crc");
        var crcMatch = crc == int.Parse(crcStr);
        fileNames.Add(Tuple.Create<string, bool>(file, crcMatch));
    }
}

We will assume that CalculateCrc does absolutely no allocations, just calling ReadByte() until the end. We’ll further assume that there are 10,000 files in the directory.

How much memory is going to be allocated here? I’ll freely admit that I’ll probably miss some, but here is what I found:

    • The fileNames list, which will allocate (using power of two growth) a total of 256KB (32,764 array slots in total across the arrays below, times 8 bytes per reference)

new Tuple<string,bool>[4]
new Tuple<string,bool>[8]
new Tuple<string,bool>[16]
new Tuple<string,bool>[32]
new Tuple<string,bool>[64]
new Tuple<string,bool>[128]
new Tuple<string,bool>[256]
new Tuple<string,bool>[512]
new Tuple<string,bool>[1024]
new Tuple<string,bool>[2048]
new Tuple<string,bool>[4096]
new Tuple<string,bool>[8192]
new Tuple<string,bool>[16384] <— this goes in the large object heap; it is 128KB in size

  • The call to GetFiles will allocate about 0.5MB:
    • It also uses a list internally, then calls ToArray on it, so we pay the same costs as the list above, plus another 78KB for the array. And another 128KB buffer ends up in the large object heap.
    • Note that I didn’t go too deep into the code; there are other allocations there, but it uses an enumerator internally (to feed the list), so I’ll assume no further allocations.
  • Creating a FileStream isn’t all that expensive, but reading from it (which is done in the CalculateCrc method) will force it to allocate a 4KB buffer. So that is 39MB right there.
  • File.ReadAllText is a really nice convenience method, but it uses a StreamReader internally, which has a 1KB buffer, and that uses a FileStream, which has yet another 4KB buffer. So another 48MB.
  • 10,000 strings for the file names returned from GetFiles; assuming that each file name is in the 8.3 format, that gives us 107KB.
  • 10,000 strings for the CRC values we read using ReadAllText; assuming each is also 11 characters (the text value of 0x576B6624), we get another 107KB.
  • Oh, and I almost forgot, we have another 10,000 strings allocated here, the file + ".crc" file names, which is another 146KB.
  • Finally, we have 10,000 tuples, which come to about 90KB.

Overall, calling this method will allocate about 90MB(!), including several allocations that fall into the large object heap.

Note that most of those allocations are made on your behalf, some inside the List handling, but mostly inside the I/O routines.

It has been our experience that allocations inside various I/O routines are a really common problem, especially when you have no way to pass in your own buffer (which you could pool, reuse, etc.). For example, there is no way to share the buffer between different FileStream instances.

I created a Pull Request in CoreCLR for this feature. But in the meantime, profiling memory allocations (vs. just the memory used) is a great way to find such issues.
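To make that concrete, here is one way the sample above could be rewritten to allocate far less. This is a sketch only: it assumes a CalculateCrc overload that reads through a caller-supplied buffer and a ParseCrc helper that parses the checksum text without materializing a string, neither of which exists in the original sample.

var files = System.IO.Directory.GetFiles(dirToCheck, "*.dat");
var fileNames = new List<Tuple<string, bool>>(files.Length); // pre-sized, so no resize churn

var dataBuffer = new byte[64 * 1024]; // reused for every file
var crcTextBuffer = new byte[32];     // reused for every .crc file

foreach (var file in files)
{
    uint crc;
    // bufferSize: 1 keeps the FileStream's internal buffer as small as possible;
    // we do our own buffering with dataBuffer instead.
    using (var fs = new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize: 1))
    {
        crc = CalculateCrc(fs, dataBuffer); // hypothetical overload that reads via dataBuffer
    }

    uint expectedCrc;
    using (var crcFs = new FileStream(file + ".crc", FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize: 1))
    {
        int len = crcFs.Read(crcTextBuffer, 0, crcTextBuffer.Length);
        expectedCrc = ParseCrc(crcTextBuffer, len); // hypothetical: parse the text without allocating a string
    }

    fileNames.Add(Tuple.Create(file, crc == expectedCrc));
}

The file + ".crc" strings (and the file names themselves) still have to be allocated, but the per-file 4KB and 1KB buffers, the StreamReader and the list resizes are gone, and those are where most of the 90MB went.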

Production postmortem: The industry at large

time to read 1 min | 100 words

The following is a really good study on real world production crashes:

Simple Testing Can Prevent Most Critical Failures:
An Analysis of Production Failures in Distributed
Data-Intensive Systems

It makes for fascinating reading, especially since they include the details of the root cause of some of the errors. I wasn’t sure whether to cringe or sympathize.


RavenDB 3.0 New Stable Release

time to read 4 min | 782 words

We have just released build 3785 of RavenDB 3.0. This build has quite a few changes (for the full gory details, see the pull request).

This release includes about 3 months of bug fixes, performance improvements and the like. We have been testing this on our own systems for a few weeks now, as well as on multiple live production sites, and the results have been nothing but encouraging.

Major changes

  • Voron & map/reduce optimizations. We have done major work to optimize how RavenDB uses map/reduce on Voron. As a result, map/reduce performance on Voron has improved tremendously. However, this requires a migration step during the first startup. If you have a large RavenDB database using Voron, and you are making heavy use of map/reduce, take into account that on first start, RavenDB will need to perform an internal migration, which can take a while.
  • Lucene & memory allocation reduction on queries. We have drastically reduced the amount of memory that is allocated per query, and improved the performance of queries substantially.

Improvements:

  • Many small perf optimizations, memory allocations reductions, object pooling, etc. Drastic reduction in memory allocations on common code paths.
  • Better handling of buffer allocations in websockets, reduces memory fragmentation.
  • Better handling of Take() / Skip() inside an index.
  • Allow only a single index to use the fast precomputation optimization at a time (reduces memory usage if multiple medium sized indexes are changed concurrently).
  • Better handling of concurrent addition of multiple indexes to large databases, will now run in the same set of indexing batches, instead of each having their own.
  • Re-implemented memory statistics checks using native calls to avoid expensive allocations.
  • Provide more detailed information when an index is corrupted.
  • Added an endpoint for stopping / starting just the reduce work.
  • Less aggressive changes to the batch size at scale; being more cautious gives us a bit slower perf but a more stable system under load.
  • Optimized Voron recovery code heavily to support slow I/O systems on large databases.
  • Allow to mark individual databases as development / staging /production.
  • Better handling of Lucene file usage, using mmap to avoid all allocations when querying the indexes. Significant improvements to both memory usage and querying speeds.
  • Subscriptions can now start from a given etag.
  • Subscriptions now offer more robust handling for attempting to open an existing subscription.
  • Subscriptions will now send a “no results found” so we won’t time out for mostly idle subscriptions.
  • Allow to manage scripted index scripts from code using AbstractScriptedIndexCreationTask.
  • Fixing a replication issue with RavenFS when a large number of files are being modified all the time.
  • Track the query parse time as well, for certain queries the expensive part is parsing the query, rather than executing it.
  • Can lock transformers for modifications now as well as indexes.
  • Better heuristics for calculating how much memory (native & managed) we are actually using.
  • Add debug endpoint to track how much map/reduce work we still have to do.

Bug Fixes:

  • Don’t update a side by side index if it already exists.
  • Allow to update a side by side index while it is still running.
  • Fixing index compilation error on .NET 4.6 using “new string[0]”.
  • Fixed an NRE when the index definition was removed forcibly when using dynamic queries.
  • Fixed error handling during disposal causing an exception to escape thread boundary and crashing.
  • Fixed FIPS licensing issue on embedded dbs.
  • Fixed: admin logs were not capturing log statements protected by an IsDebugEnabled check.
  • Fixed a finalizer usage bug causing us to try to read from a closed handle.
  • Prevent corrupted index warning when creating a map-reduce index and indexing is disabled.
  • Preventing code from trying to use disposed internal transactions.
  • Installer fix - check and revoke URL reservation options when “Use existing website” is selected.
  • Properly dispose of timer instance when shutting down a database using expiration bundle.
  • Prevent an error loading ICSharpCode.NRefactory from killing RavenDB client startup.
  • When disk space is very low, stop indexing and warn about it, rather than indexing until we hit a disk full error (and probable index corruption).
  • Fixing stack trace generation in generate debug info when we have spaces in the temp path.
  • Moved default db locations outside of the IIS directory to avoid IIS bug causing restarts.
  • When out of memory, replication will back off and retry, rather than fail continuously.
  • Better handling of indexes and transformers that are deleted and then created, when replicating to sibling nodes.
  • Fixing case sensitivity issue when returning document ids.
  • Fixing an issue with file names in RavenFS containing #.

Production postmortem: The case of the lying configuration file

time to read 4 min | 609 words

While studying an issue using customer data, I noticed that indexing speed wasn’t up to what I expected it to be. In fact, the size of the indexing batch remained roughly constant (and small), and didn’t exhibit the usual increases as RavenDB notices that the server has a lot of work to do and the resources to do it. This was while investigating something else, but since I had to re-index that database quite a few times, I decided to investigate what was going on.

The underlying issue turned out to be a configuration setting. An index was specified with a MaxNumberOfOutputsPerDocument of 55. We use this value for a few things, among them to manage the memory resulting from indexing operations. In particular, we have had some issues with indexes that output a large number of index entries per document using more than the quota allocated to the index and generating (sometimes severe) memory pressure.

Unfortunately, in this case, we had the reverse situation. The index was configured properly, but the index didn’t actually output multiple entries. So we ended up assuming that the index would use a lot more memory than it actually would. That meant that we couldn’t feed it larger batches, because we feared it would use too much memory…

The fix was to effectively ignore this value. Instead of using the value and assuming that the user knows what is going on, we’ll treat it as the maximum value only, and use heuristics to figure out how much memory we should actually reserve for the index in question.

This is a smaller example of a wider issue. The values that the system gets are user input. And they should be treated as such. This means that you need to validate them in the same sense you would validate any other input from users.
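As a trivial illustration (this is not the actual RavenDB code, and the names and limits are made up), treating a configured limit as user input to be validated and clamped might look like this:

public static int GetEffectiveMaxOutputsPerDocument(int configuredValue, int observedMaxOutputs)
{
    const int hardLimit = 1024; // sanity cap, purely illustrative

    if (configuredValue <= 0 || configuredValue > hardLimit)
    {
        // Configuration is user input: clamp nonsense values instead of obeying them blindly.
        configuredValue = hardLimit;
    }

    // What the index actually does is the primary signal; the configured value
    // only serves as an upper bound when reserving memory for the indexing batch.
    return Math.Min(configuredValue, Math.Max(observedMaxOutputs, 1));
}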

In the case outlined above, the only implication was that we would index more slowly, and weren’t able to take full advantage of the machine resources we had available. There have been other such issues.

An administrator had set up a pair of range values so that the min value was larger than the max value. This turned out to cause us to have no effective limit, short of an integer overflow. The end result was that we would use unbounded memory for our needs, which would work, most of the time, except when you had a big set of changes to apply and we would try to do it all at once…

Another case was an ops guy who wanted to reduce RavenDB CPU usage, so he set the number of threads available for background work to 0. That meant that work that was queued on background threads would never complete, and the system would effectively hang.

You get the drift, I assume.

Your configuration file (or however you are actually configuring things) is yet another way for your user to communicate with your application. Sure, the tone is much stricter (commanding, rather than asking), but you still need to make sure that what the user is asking you to do actually makes sense. And if it doesn’t, you need to decide what to do about it.

  • You can exit the process. The “I can’t work with you people” mode of error handling.
  • You can emit a warning and proceed. The “I told you so, buster!” motto of blame shifting.
  • You can ignore the set value. The “I know better than you” school of thought.

There aren’t really good choices, especially for server applications that can’t just show an error to the user.

Production postmortem: The case of the memory eater and high load

time to read 9 min | 1628 words

This is a recent case. One of our customers complained that every now and then they started to see very high memory utilization, escalating quickly until the system would bog down and die. They were able to deploy a mitigation strategy of sorts: when they detected this behavioral pattern, they would force RavenDB to reject client requests for a short while, which would fix the issue.

This went on for a while, because the behavior was utterly random. It didn’t seem to relate to load, and peak usage times on the system didn’t correlate to it in any way. Eventually the customer raised another issue, that a certain spatial query was showing up as very slow in the logs.

We tested that, and found that the customer was correct. More precisely, the query executed just fine when run independently. But when we ran this query tens or hundreds of times concurrently, we would see very high response times (and getting worse), and the server memory would blow up very quickly. So we have a memory leak, we figured, let us see what is going on… We dumped the data, and tried to figure out what exactly it was that we were leaking.

But there wasn’t anything there!

In fact, looking at the problem, it became even curiouser.  Take a look at what we saw during one of the test runs:

[image: what we saw during one of the test runs]

Note that this is all running with no other work, just a lot of queries hitting the server.

Somehow, we had a lot of data going into the Gen2 heap. But when we checked Gen2, it was pretty much empty. In fact, we had 100% fragmentation. Something was very strange here. We enabled memory allocation tracking and started to look into what was going on. We found this very suspicious (note that this is from a different run than the one above):

[image: memory allocation tracking results]

So FileStream.Read is allocating GBs upon GBs of memory? What is going on?! It took a while to figure out. The underlying issue was within Lucene. Actually, an intersection of a few things inside Lucene.

Here is how Lucene reads from a file on disk:

[image: how Lucene reads from a file on disk]
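The screenshot is not reproduced here, but the pattern in question (approximated, not the exact Lucene.Net source) is roughly this:

// Approximation of the pattern, not the actual Lucene.Net code.
public override void ReadInternal(byte[] b, int offset, int len)
{
    lock (file) // a single FileStream shared by all readers, so everyone serializes here
    {
        file.Seek(position, SeekOrigin.Begin); // seek and read have to happen atomically together...
        int read = file.Read(b, offset, len);  // ...so the lock is held across the actual I/O
        // (short reads and error handling elided)
    }
}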

What you’ll notice is that Lucene is holding a lock on the file, and then issuing I/O. In other words, it is going to hold that lock for a while. This is a very strange thing to do. Why is Lucene doing it?

It does this because of a strange decision on how to do concurrent I/O.

image

Basically, whenever Lucene needs to do concurrent I/O, it will clone the relevant input object, and then use it concurrently. The idea, I gather, is that Lucene didn’t want to have a separate file handle for each multi-threaded operation; instead it created one file handle, and used it concurrently. Since concurrent I/O takes careful usage, they slapped a lock on it and called it a day. I’m being unfair, I know, this is explicitly called out in the docs:

[image: Lucene documentation excerpt]

And in the Java version, there is an NIOFSDirectory that is presumably much better. No such thing exists in the Lucene.Net version we are using. In fact, I was curious and checked the upcoming version; they do have an NIOFSDirectory implementation, which had the following code in it:

[image: NIOFSDirectory implementation]

This is a single global lock for everything. Thank you, I’ll take the lock per file. Now, to be fair again, work in progress, etc.

We noticed this issue a long while ago, and we solved it by using multiple FileStreams. It used more resources, but it meant that we were far more concurrent. Note that all of this actually happened years ago, and we had no problems in this area. Note also that Linux programs typically worry a lot more about the number of open file handles than Windows programs do.

The problem was definitely related to the use of multiple FileStream instances. But it didn’t have anything to do with holding multiple handles to the same file. Instead, the issue was in the usage pattern that the query exhibited.

In particular, and I’m going to get deep into Lucene here, the problem was inside the SegmentReader.Terms() method:

[image: SegmentReader.Terms()]

This seems innocuous, right? Here is how this is implemented:

[image: Terms() implementation]

And all the way down until we get to the input.Clone() method. Now, in a standard Lucene system, using concurrent queries, this would result in a fair amount of locking. In RavenDB, this just meant that we were creating new FileStream objects.

The problem was that this particular query had a list of terms that it needed to check, and it called the Terms() method many times. How much is many times? 12,000 times!

Still not a problem, except that it called FileStream.Read on each and every one of those. And FileStream.Read does the following:

[image: FileStream.Read implementation]

And the _bufferSize is set to the default of 4KB.

In other words, processing a single instance of this particular query will result in the system allocating about 48MB of memory!

And when we have concurrent queries of this type? Each of them is allocating 48 MB of memory, and because they allocate so much, we have GC runs, which cause the memory (which is still in use) to be sent to Gen 1, and eventually parked in Gen 2, where it languishes (because it is temporary memory, but we don’t clear Gen 2 very often).

We changed the implementation to use overlapped I/O and tested that, and the memory consumption dropped by a significant amount. But we still saw more allocations than we liked. We ended up tracking that down to this call (in BufferedIndexInput in the Lucene codebase):

[image: BufferedIndexInput buffer allocation]

The buffer size in this case is 1KB. So allocation for this query was actually 60 MB(!), and we only managed to drop it by 48MB.

After fighting with this for a long while, we ended up scrapping the whole buffered index input idea. It is just not sustainable in terms of allocations. Instead, we created a memory mapped input class that maps the input data once and doesn’t use a buffer (so no allocations). With that option, our cost to process this query was drastically lower.

We profiled the change, to see whether there are additional issues, and found that the newly optimized code was much better, but still had an issue. The memory mapped code used the UnmanagedMemoryStream class to expose the file to the rest of the application. Unfortunately, this class appears to be intended for concurrent usage, which is a rarity for Streams. Here is the ReadByte method from that class:

[image: UnmanagedMemoryStream.ReadByte implementation]

As you can see, this method is doing quite a lot. And it showed up as a hot spot in our profiling. The rest of the class is pretty complex as well, and does significantly more than what we actually need it to do.

We replaced this with a MmapStream class, and here is the comparable implementation.

[image: MmapStream.ReadByte implementation]
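The screenshot is not reproduced here; a minimal sketch of what such a ReadByte can look like (assuming the stream wraps a raw mapped pointer and a length) is:

// A sketch only - _ptr points at the start of the mapped region,
// _len and _pos are plain fields maintained by the stream.
public override unsafe int ReadByte()
{
    if (_pos == _len)
        return -1;        // end of the mapped region
    return _ptr[_pos++];  // single pointer dereference, no buffers, no Interlocked calls
}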

You can safely assume that this is much faster.

We have tested this using 5,000 concurrent requests, without caching. Memory consumption is steady, and doesn’t increase. We see marked improvement across the board, in memory utilization, CPU usage and I/O rates. Note that while this issue was caused by a particular query whose pattern of operation caused a tremendous number of allocations, this change has wider reaching implications.

We now allocate less memory for all queries, and previous experience has shown us that reducing a single 4KB allocation in query processing can improve overall performance by 30%. We haven’t run those tests yet, but I’ll be surprised if we see negative results.

Concurrent max value

time to read 2 min | 324 words

For a feature in RavenDB, I need to figure out the maximum number of outputs per document an index has. Now, indexing runs in multiple threads at the same time, so we ended up with the following code:

var actualIndexOutput = maxActualIndexOutput;
if (numberOfAlreadyProducedOutputs > actualIndexOutput)
{
    // okay, now let's verify that this is indeed the case, in a thread safe manner,
    // this way, most of the time we don't need volatile reads, and other sync operations
    // in the code ensure we don't have too stale a view on the data (besides, a stale view
    // has to mean a smaller number, which we then verify).
    actualIndexOutput = Thread.VolatileRead(ref maxActualIndexOutput);
    while (numberOfAlreadyProducedOutputs > actualIndexOutput)
    {
        // if it changed, we don't care, it is just another max, and another thread probably
        // set it for us, so we only retry if this is still smaller
        actualIndexOutput = Interlocked.CompareExchange(
            ref maxActualIndexOutput, 
            numberOfAlreadyProducedOutputs,
            actualIndexOutput);
    }
}

The basic idea is that this code path is hit a lot, once per document indexed, per index. And it also needs to be thread safe, so we first do an unsafe operation, then a thread safe one.

The idea is that we’ll quickly arrive at the actual maximum number of index outputs, but we don’t have to pay the price of volatile reads or thread synchronization on most calls.
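The same pattern can be wrapped in a small reusable helper. This is a sketch (not the RavenDB code) that makes the fast path / CAS loop split explicit, using System.Threading:

public static void InterlockedMax(ref int target, int candidate)
{
    int current = target; // cheap, possibly stale read as the fast path
    if (candidate <= current)
        return;

    current = Thread.VolatileRead(ref target);
    while (candidate > current)
    {
        int previous = Interlocked.CompareExchange(ref target, candidate, current);
        if (previous == current)
            return;         // we won the race
        current = previous; // someone else raced us, re-check against the new value
    }
}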

Optimizing I/O throughput

time to read 7 min | 1277 words

We got a customer request about performance issues they were seeing on startup on a particular set of machines.

Those machines run in a cloud environment, and they have… peculiar, one might say deviant, I/O characteristics. In particular, the I/O pipeline on those machines is wide, but very slow. What do I mean by that? I mean that any particular I/O operation on them is likely to be slow, but the idea is that you can get much better performance if you issue concurrent I/O. The system is supposed to be able to handle that much better, and overall you’ll see the same relative performance as elsewhere.

This is a pretty big issue for us, because for many things, we really do care about serial I/O performance. For example, if we are committing a transaction, we really have no other way to handle it except to wait until the I/O is fully completed.

That said, the particular scenario where we had the problem was startup. If the database was under heavy load at the time it shut down, the recovery logs would be full, and the database would need to replay the recent actions that happened. Note that shutdown performance is important, because in many cases we are running in an environment where shutdown comes with a ticking clock (in IIS or as a Windows Service).

At startup, we usually have more time, and it is expected that we’ll take a while to get up to speed. If nothing else, just bringing enough of the database to memory is going to take time, so on large databases, startup time is expected to be non trivial.

That said, the startup time on that set of machines was utterly atrocious. To figure out what was going on, I pulled out Process Monitor and looked at the file I/O. We got this:

[image: Process Monitor file I/O trace]

We are reading from a journal, and that is pretty serial I/O (in the image, I’m running off a remote network drive, to simulate slow responses). Note that we need to read the log in a serial fashion, and the way the OS reads things, we read 32KB at a time.

Remember, we are reading things in a serial fashion, and that means that we have a lot of page faults, and we have a slow I/O system, and we execute them serially.

Yes, that is a killer for perf. By the way, when I’m talking about a slow I/O system, I’m talking about > 0.5 ms per disk read for most requests (ideally, we would see latencies of 0.05 – 0.15 ms). And we have quite a few of those reads, as you can imagine.

Since I know that we are going to be reading the whole journal, I used the PrefetchVirtualMemory() method and passed it the entire file (it is a maximum of 64MB, and we are going to need to read it all anyway). This lets the OS have the maximum amount of freedom when reading the data, and it generates big, concurrent I/O. Here is how that looks:

[image: Process Monitor trace after prefetching]
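The call itself is a straightforward P/Invoke. Here is a sketch (using System and System.Runtime.InteropServices; not the actual Voron code) of prefetching an already-mapped journal file on Windows 8 / Server 2012 and later:

[StructLayout(LayoutKind.Sequential)]
struct WIN32_MEMORY_RANGE_ENTRY
{
    public IntPtr VirtualAddress;
    public UIntPtr NumberOfBytes;
}

[DllImport("kernel32.dll", SetLastError = true)]
static extern bool PrefetchVirtualMemory(
    IntPtr hProcess,
    UIntPtr numberOfEntries,
    WIN32_MEMORY_RANGE_ENTRY[] virtualAddresses,
    uint flags);

[DllImport("kernel32.dll")]
static extern IntPtr GetCurrentProcess();

// baseAddress and size describe the memory mapped view of the journal file.
static void PrefetchWholeFile(IntPtr baseAddress, long size)
{
    var ranges = new[]
    {
        new WIN32_MEMORY_RANGE_ENTRY
        {
            VirtualAddress = baseAddress,
            NumberOfBytes = (UIntPtr)(ulong)size,
        }
    };
    // Best effort: if this fails (older OS, etc.) we simply fall back to demand paging.
    PrefetchVirtualMemory(GetCurrentProcess(), (UIntPtr)ranges.Length, ranges, 0);
}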

This also gives the wide I/O bandwidth a chance to play. We load the I/O subsystem with a lot of stuff that it can try to do in an optimized fashion.

The next expensive part was that we needed to apply the data from the journal files to the data file, and sync it.

The performance of syncing a file is related to the size of the file, unfortunately. And the file in question was large, over 45GB. Especially on such a system, we saw a lot of latency here, as in multiple minutes. One obvious optimization was to not sync per journal file, but to sync once for the whole recovery process. That helped, but it was still too expensive.

Next, we tried pretty much everything we could think about.

  • Switching to WriteFile (from using mmap and then calling FlushViewOfFile)
  • Using async I/O (WriteFileEx)
  • Using scatter / gather I/O with no buffering (saves the need to do sync in the end)
  • Completion ports
  • Asking a 4 months old baby girl what she think about it (she threw up on the keyboard, which is what I wanted to do at the time, then she cried, and I joined her)

Nothing seemed to work. The major issue was that in this workload, we have a large file (45GB, as I said) and we are writing 4KB pages into it in effectively random places. In the workload we were trying to work with, there were roughly 256,000 individual 4KB writes (most of them weren’t consecutive, so we couldn’t get the benefit of that). That is about 1 GB of writing to do.

And nothing we could do would get us beyond 3MB/sec or so. Saturating the I/O subsystem with hundreds of thousands of small writes wouldn’t work, and we were at a loss. Note that a small test we made, just copying data around manually, resulted in roughly 10MB/sec peak performance on those machines. This is a very lame number, so there isn’t much that we can do.

Then I thought to ask, why are we seeing this only during startup? Surely this happens also on a regular basis. Why didn’t we notice?

The reason for that is pretty simple: we didn’t notice because we amortize the cost. Only on startup did we have to actually sit and wait for it to complete. So we dropped that requirement. We used to read all the journals, apply them to the data file, sync the data file and then delete the journals. Now we read the journals, apply them (via a memory map) to the data file, and only remember in memory which journal file we applied last.

There is a background process running that will take care of syncing the data file (and deleting the old journals). If we crash again, we’ll just have to replay the logs that we aren’t sure were synced before. This saves even more time.

But we still have another issue. Writing to a memory mapped file requires the OS to page the relevant pages into memory. And again, we are on slow I/O, and the OS will only page in the stuff that we touch, so this is again a serial process that this time requires us to load about 1GB of data into memory at 3MB/sec. That is… not a good place to be at. So the next step was to figure out all the addresses we’ll be writing to, and let the OS know that we’ll be fetching them. We do some work to make sure that we load those values (and neighboring pages) into memory, then we can write to them without paging for each page individually.
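Reusing the PrefetchVirtualMemory declaration sketched earlier, telling the OS up front about all the scattered pages we are about to modify might look roughly like this (again, a sketch, not the actual code):

// pagesToWrite holds the page-aligned addresses of the 4KB pages we are about to touch.
var ranges = new WIN32_MEMORY_RANGE_ENTRY[pagesToWrite.Count];
for (int i = 0; i < pagesToWrite.Count; i++)
{
    ranges[i].VirtualAddress = pagesToWrite[i];
    ranges[i].NumberOfBytes = (UIntPtr)(4 * 1024); // one page per entry
}
// A single call covers every range, so the OS is free to schedule the reads concurrently.
PrefetchVirtualMemory(GetCurrentProcess(), (UIntPtr)ranges.Length, ranges, 0);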

A nice side effect of this is that because this is running on the latest changes in the system, it has the effect of preloading into memory the pages that are likely to be in use after the database has started.

That is a lot of work, but to be perfectly frank, this is mostly optimizing in a bad environment. The customer can’t walk away from their current machines easily, but the I/O rates those machines have would make any database sit in a corner and cry.

Unsafe operations are required in the real world

time to read 2 min | 363 words

There is a pretty interesting discussion in the Raft mailing list, about clarifying some aspects of the Raft protocol. This led to some in depth discussion on the difference between algorithms in their raw state and the actual practice that you need in the real world.

In case you aren’t aware, Raft is a distributed consensus protocol. It allows a group of machines to reach a decision together (a gross over simplification, but good enough for this).

In a recent post, I spoke about dedicated operations bypasses. This discussion surfaced another one. One of the things that make Raft much simpler to work with is that it explicitly handles topology changes (adding / removing nodes). Since that is part of the same distributed consensus algorithm, it means that it is safe to add or remove a node at runtime.

Usually when you build a consensus algorithm, you are very much focused on safety. So you make sure that all your operations are actually safe (otherwise, why bother?). Except that you must, in your design, explicitly allow the administrator to make inherently unsafe operations. Why?

Consider the case of a three node cluster that has been running along for a while now. A disaster strikes, and two of the nodes die horribly. This puts our cluster in a situation where it cannot get a majority, and nothing happens until at least one machine is back online. But those machines aren’t coming back. So our plucky admin wants to remove the two dead servers from the cluster, so it will have one node only, and resume normal operations (then the admin can add additional nodes at leisure). However, the remaining node will refuse to remove any node from the cluster. It can’t; it doesn’t have a majority.

It sounds surprisingly silly, but you actually have to build into the system the ability to make those changes, with explicit admin consent, as unsafe operations. Otherwise, you might end up with a perfectly correct system that breaks down horribly in failure conditions. Not because it is buggy, but because it didn’t take into account that the real world sometimes requires you to break the rules.

The “you broke the build!” game

time to read 2 min | 344 words

Recently I pulled some code from a colleague, and tried to test it. It worked, which was fine, so I let it run the tests, and went out to lunch.

When I came back, I was surprised to discover that the build had failed, not because of some test failing, but because it couldn’t compile. To be rather more exact, we got the following error:

 [optimized-build] Using type script compiler: C:\Program Files (x86)\Microsoft SDKs\TypeScript\1.4\tsc.exe
 [optimized-build]
 [optimized-build] System.ComponentModel.Win32Exception thrown:
 [optimized-build] --------------------------
 [optimized-build] The filename or extension is too long
 [optimized-build] --------------------------

That was strange. I checked several times, and we had no such thing. No one had a veryVeryLongFileNameThatNeededToBeVeryExplicitAboutWhatItWasDoingAndDidNotCareAboutLength.ts.

And the tsc.exe location was in its normal place. This is from a part in our build process that gathers all the TypeScript files and merges them into a single optimized bundle. And it suddenly failed. Now, on my colleague’s machine, it worked. On the previous commit, before I merged it, it worked. The merge was a clean one, and it was very obvious that nothing was going on there.

It took me a while, but I finally figured out that the error occurred because my colleague had added a new TypeScript file.

How can adding a file break the build?

As it turns out, the code we were calling did something like this:

tsc.exe file1.ts file2.ts file3.ts

And at some point, the size of the command line we were passing to the process exceeded 32KB. And that is when everything broke loose.

Luckily, we were able to write those things to a file and ask the compiler to load them directly, instead of passing them via the command line:

tsc.exe @arguments.txt

And everything was okay again…
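A minimal sketch of that workaround, assuming the build step shells out to tsc.exe from C# (typeScriptFiles and tscExePath are placeholders, and it uses System.IO and System.Diagnostics):

// Write the file list to a response file and pass it with @,
// instead of putting every path on the command line.
var responseFile = Path.Combine(Path.GetTempPath(), "tsc-args.txt");
File.WriteAllLines(responseFile, typeScriptFiles); // one argument per line

var psi = new ProcessStartInfo
{
    FileName = tscExePath,
    Arguments = "@" + responseFile, // stays tiny no matter how many files there are
    UseShellExecute = false,
};
using (var process = Process.Start(psi))
{
    process.WaitForExit();
}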
