Ayende @ Rahien

Refunds available at head office

What is bad about good ideas?

I finalized thinking about Rhino.ETL late last night, and then went to bed. I couldn't sleep for over an hour, and kept rolling things around in my head. When I caught myself in the shower thinking about how I would design the test strategy, I knew that it was an emergency: this idea wants to get out, badly.

Help! I am attacked by my own muse!

Idea: The Boo ETL DSL

Note: This is just a syntax idea that I have in my head right now, it has no implementation.

I have decided to do something about it, and spent some time thinking about it. Here is the initial design: a textual DSL for ETL. What you can see directly below corresponds directly to a lot of the stuff that I need to do with SSIS.

The first one is basically a Data Flow task. The script goes like this:

  • Define two data sources. You can see that I am using a named connection string in one, and a hard coded connection string in the other.
  • Define the source, which is a SQL command against an Oracle database, with support for parameters (that is not something that SSIS can do, amazing as it sounds). Note that the parameters are retrieved by executing SQL against the different databases and storing the results in variables.
  • We have a simple transform, which includes some logic as well as string manipulation and date formatting, all things that are disgustingly hard in SSIS.
  • Then we have the destination. We define the command that we want to execute, and we get all the parameters from the context (note that we are using a new parameter that we created in the transform, "@Registered"). We are also defining a BatchSize, which should increase performance if it matters.
DataSource(
	"SouthSand",
	"System.Data.SqlConnection",
	"Data Source=localhost;Integrated Security=SSPI;Initial Catalog=SouthSand"
	)

DataSource(
	"Northwind",
	"Oracle.Client.OracleConnection",
	System.Configuration.ConfigurationManager.ConnectionStrings["Northwind"]
	)

source Northwind:
	Sql: """
		SELECT CustomerID,
		   CompanyName,
		   ContactName,
		   ContactTitle,
		   Address,
		   City,
		   Region,
		   PostalCode,
		   Country,
		   Phone,
		   Fax,
		   RegisteredYear,
		   RegisteredMonth
		FROM tblCustomers
		WHERE LastUpdateDate BETWEEN :lastUpdate AND :currentTime
	"""
	Parameters:
		@lastUpdate = ExecuteScalar("SouthSand", "GetLastRunTimeForETL")
		@currentTime = ExecuteScalar("Northwind", "SELECT sysdate from dual")

transform:
	@Phone = @Fax if @Phone is null
	@CustomerId = @CustomerId.ToLower()
	@Registered = date(@RegisteredYear,@RegisteredMonth, 1)

destination SouthSand:
	BatchSize: 500
	Command: """
		INSERT INTO [Customers]
		(
			[CustomerID],[CompanyName],[ContactName],[ContactTitle],
			[Address],[City],[Region],[PostalCode],[Country],[Phone],
			[Fax],[Registered]
		)
		VALUES
		(
			@CustomerID,@CompanyName,@ContactName,@ContactTitle,@Address,
			@City,@Region,@PostalCode,@Country,@Phone,@Fax,@Registered
		)
	"""
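
To make the transform and batching steps a bit more concrete, here is a minimal Python sketch of the same flow. It is not Rhino.ETL and it is not the Boo DSL (which has no implementation yet). The names transform, batches and run_flow, and the idea of passing rows around as dictionaries, are assumptions made up purely for illustration.

```python
from datetime import date
from itertools import islice


def transform(row):
    """Apply the three transform rules from the DSL to one row (a dict)."""
    out = dict(row)
    # @Phone = @Fax if @Phone is null
    if out.get("Phone") is None:
        out["Phone"] = out.get("Fax")
    # @CustomerId = @CustomerId.ToLower()
    out["CustomerID"] = out["CustomerID"].lower()
    # @Registered = date(@RegisteredYear, @RegisteredMonth, 1)
    out["Registered"] = date(out["RegisteredYear"], out["RegisteredMonth"], 1)
    return out


def batches(rows, size=500):
    """Yield lists of at most `size` rows, in the spirit of BatchSize."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def run_flow(source_rows, write_batch, batch_size=500):
    """Read rows, transform each one, and hand them to the destination in batches.

    `write_batch` is a hypothetical callable standing in for the INSERT command.
    """
    transformed = (transform(row) for row in source_rows)
    for chunk in batches(transformed, batch_size):
        write_batch(chunk)
```

In this sketch the destination is just a callable, so a test can pass `some_list.append` and inspect the batches without touching a database.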

If this is a data flow, what would the overall package look like? Let us take a package like this one, which is a format that I am using right now to do a lot of things.

DataSource(
	"SouthSand",
	"System.Data.SqlConnection",
	"Data Source=localhost;Integrated Security=SSPI;Initial Catalog=SouthSand"
	)

DataSource(
	"Northwind",
	"Oracle.Client.OracleConnection",
	System.Configuration.ConfigurationManager.ConnectionStrings["Northwind"]
	)

Execute("SouthSand", "etl.TruncateAllTables") # etl.TruncateAllTables is a stored proc

sequence: 
	DataFlow("Customers.flow")
	DataFlow("Orders.flow")
	DataFlow("Products.flow")
	DataFlow("ProductStatuses.flow")
	DataFlow("OrderStatuses.flow")
	
Execute("SouthSand", "etl.CleanData") 
Execute("SouthSand", "etl.UpdateInsertAllTables") 

This probably means that I will have a way of inheriting the Data Source definitions. It took me about an hour to think it through, but I have had this running through my head for a while now.

I estimate that it should take me a day or two to build this DSL. But you know what, once I have that, writing the ETL processes themselves is down to minutes! What is much more important is that I would estimate a similar amount of time to get an SSIS package of the same complexity ready to go, under the same circumstances, and I refuse to make estimates regarding SSIS deployments.

So, about the same time to build a new framework as it would take me to build a single task. Yes, it will not have a designer, but it will be maintainable RAD!

Nitpicker corner: No, it will not do XYZ feature of SSIS (lookups come to mind, or SSIS joins), but I rarely use them, and they are fairly simple to write once I do need them.

On Competition, NIH and Good Software

Jdn doesn't agree with me:

.NET OSS developers *can't* have it both ways.  They can't complain about Microsoft 'reinventing the wheel' and not make it about a competition.  It is the same thing, when it boils down to it.
What is the complaint otherwise?  That Microsoft shouldn't come out with something that mirrors OSS efforts unless it is 10 times better?  10 times better according to whom?  You?  The OSS police?

[...snip...]

Your own and Jeremy Miller's own blogs about 'building a better CAB in an hour' *reek* of 'it is a competition.'
I do not doubt that you do not intend it to come across that way, but it certainly does, in spades, and I don't see how you could think otherwise.

Let me start by saying that I believe that it is a poor mind that can't argue against itself (and win). I most certainly can have it both ways. I can complain about MS reinventing the wheel when they aren't providing as much value as existing stuff, because they are in a position where people will follow them blindly. This means that they have the responsibility to be just as good as the existing things out there.

I think that you missed an important distinction here: it is not about OSS, or using something other than MS, it is about not having to deal with a half-assed product. I would rather have nothing from Microsoft than something that doesn't do everything that I expect something in its category to do. The reasoning is very simple: if I have nothing from Microsoft, it is much easier to build / buy something else. If there is something from Microsoft, but it is not up to par with the established standards, that is bad. It is bad because of the "you don't get fired for buying Microsoft" way of thinking.

And yes, this is a big issue to me. I would much rather do actual work than have to argue politics about "But we are a Microsoft shop and they have a really cool presentation".

Pointing out things that I don't like, or consider overblown, is not something that I would consider a competition. I do much the same for a lot of other things, including my own.

Why I hate SSIS: Part N+1

Take a look at this (highly simplified) package. The flow is very simple, truncate shadow tables, copy to shadow tables, perform cleanup and copy to the real tables. The inside of each data flow is simple Source -> Convert to Unicode -> Destination. Not really complicated or ground breaking, right?

The shadow tables have the same structure as the real tables, but they have no FKs.

image

It failed on me, in production, but only when running as a job! The error? Primary Key violation in the Products Statuses and Order Statuses tables. I have verified that:

  • I am truncating those tables
  • That only one instance of the job is running at any single time
  • That the source data is neither changing nor invalid (those are lookup tables with 6 and 2 values, respectively)

The error was about 99% reproducible when running as a job, and about 50% reproducible when running from dtexec.

After hitting the wall with it for way too long, I "fixed" it by doing:

image

Why it would make a difference, I have no idea.

The original version is what I was using for the last four or five months, and it never gave me any problems, until three days ago, just when I wanted to load it to production.

I have begun thinking about how to build my own "ETL engine" for the next project. Code is reliable, safe, debuggable and easy to understand. Something in Boo is probably the way to go...

I had enough of fighting with tools!

Quid quid latine dictum sit, altum videtur *

I don't speak much about what goes on at work, but this just has to get out. When I am LOL-ing from a serious reply that I am getting from my boss, there is something good going on.

Several days ago I sent an email to my boss, with some technical details, to which he replied with: "Omnia mihi lingua graeca sunt**"

That released the flood, and right now we have a discussion that has involved:

  • Sum perdidi ***
  • Vis eccum erit,  semper. ****
  • Luke sum ipse patrem te *****
  • De integro ******

Those are just the ones I can recall off-hand.

Now I just need to find a reason to use "Facta, non verba".

Naturally, this means that I need to put "throw new FelixCulpaException()" in my code somewhere.

(last two phrases are left as an exercise for the reader.)

* Everything in Latin sounds profound
** It is all Greek to me
*** I am wasted
**** May the force be with you, my son
***** Luke, I am your father
****** Repeat again from the start

Fluent Interfaces & Method Chaining

Hammett calls the term "Fluent Interface" unnecessary:

And about Fluent interfaces, what about OOP? What about method chaining? Does it need, for god’s sake, a new name?

This is method chaining:

string userInfo = new StringBuilder()
	.Append("Name: ")
	.Append(user.Name)
	.AppendLine()
	.Append("Email: ")
	.Append(user.Email)
	.AppendLine()
	.ToString();	

And this is a fluent interface:

return new Finder<Order>(
		Where.Order.User == CurrentUser &&
		(
			Where.Order.TotalCost > Money.Dollars(150) ||
			Where.Order.OrderLines.Count > 15
		),
		OrderBy.Order.CreatedAt
	).List();

Anders' post about DSL building contains examples that are tied more closely to the domain.

Method chaining is something that you would certainly use in a fluent interface, but that is like saying that you need to use interfaces when you build a plugin framework. The fact that you are using something doesn't mean that what you do is only that something.

Fluent interfaces are different from mere method chaining because they allow you to express your intent in the domain's terms and to get more readable code. Method chaining, operator overloading and nasty generics tricks are all part of that, certainly, but the end result is much more than just a simple method chain.
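
As a rough illustration of how operator overloading feeds into this, here is a minimal Python sketch in the spirit of the Finder example above. It is not the actual query API from that example: the Field, Criterion and Order classes, and the text that they produce, are hypothetical and exist only to show the technique.

```python
class Criterion:
    """A piece of a query; & and | combine criteria instead of evaluating them."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        return Criterion(f"({self.text} AND {other.text})")

    def __or__(self, other):
        return Criterion(f"({self.text} OR {other.text})")


class Field:
    """A named field whose comparison operators build criteria, not booleans."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return Criterion(f"{self.name} = {other!r}")

    def __gt__(self, other):
        return Criterion(f"{self.name} > {other!r}")


class Order:
    # Hypothetical domain vocabulary, loosely mirroring Where.Order.* above.
    user = Field("User")
    total_cost = Field("TotalCost")
    line_count = Field("OrderLines.Count")


query = (Order.user == "oren") & (
    (Order.total_cost > 150) | (Order.line_count > 15)
)
print(query.text)
# prints: (User = 'oren' AND (TotalCost > 150 OR OrderLines.Count > 15))
```

Note that there is not a single chained method call in the usage, yet the result reads in the terms of the domain. The extra parentheses are needed because in Python & and | bind tighter than the comparison operators.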

And yes, I think that it certainly deserves its own name.

It is not a competition: OSS & Microsoft

This comment on Anders' blog has been brought to my attention:

I really enjoyed what Anders and Ayende did when they took the piss with Microsoft's "cool" new technologies when they wrote Mean Fiddler and Bumbler in record time. Its a pity that these master programmers don't want to continue developing these frameworks - it would be soooo cool to see open source alternatives to M$ stuff taking the lead before M$ are able to release anything themselves.

Anders' post is about new development with Fiddler, exposing NHibernate's entities over REST services (which deserves another post altogether). I wanted to respond to this comment because of several things:

  • M$ - I would imagine that it is at least somewhat offensive to Microsoft when this is used.
  • "taking a piss" - No, I use a bathroom for that.

The general tone of the comment is basically about us vs. them, which is completely opposite to the way I see things. I didn't write Bumbler to show off anything, I wrote it because I was annoyed that Jasper was presented as some great & heroic thing, when in practice it is a very simple wrapper around existing functionality. I imagine that Anders had much the same reasoning when writing Fiddler, the counterpart to Astoria. The initial versions, at least, exist to make a point, not to show who is better. Writing software based on the old "I'll show them" is not a good idea, in my opinion.

I am working on OSS because:

  • It is interesting
  • It makes my job easier
  • It clears my head
  • I get plenty of benefits (from code reviews to patches, from bug reports to experience)

Taking both those platforms and extending them to be usable (which is what is happening to Mean Fiddler right now) is dependent on need, not development for the sake of competition.

As an aside, I would generally rather use something that already exists than write my own, on the condition that I get the same simplicity, flexibility and quality that I would get if I wrote it on my own. A lot of the stuff from Microsoft fits that bill (most of the .Net framework), and a lot of the stuff doesn't (the entire Web Forms stack, to start with). That is the main motivation, to make my work easier.

SSIS' 15 Faults

I dislike SSIS intensely, and I say this as someone who has done two projects using it, and has spent much time recently struggling to work with it. Instead of harping on how much I dislike it, I decided that it would be better to list the things that make it so hateful:

  1. Bad Errors: The number one reason that I am using .Net instead of C++ is not the garbage collection, it is the errors. Clear, understandable errors, with the ability to trace them back to their source. Errors that give you the reason for what happened. SSIS' errors are anything but useful. Often you will get some sort of a cryptic COM error, and maybe you will get lucky and get the message from the database, instead of relying on SSIS's errors.
    The errors are not only hard to figure out, they are often hard to find, hiding in a mess of all the other information (mostly useless) that SSIS throws at you. No attempt is made to make it easier to locate and find the errors, those are left as an exercise for the user.
    The last time I felt this alone I was working with Pascal, trying to understand how the code moved the spindle of the HD using ASM commands.
  2. Random Errors: I like predictability. One of the worst things that can happen to you is that you get a "sometimes it happens, but I don't know why" error report. With SSIS, I literally had the entire project breaking up because I made such a significant change as changing the connection string. There are few things that I like less than having to deal with stuff that makes me work in a stupid way. SSIS' way is to force me to go through all the boxes in my package and approve that I want it to do the same thing that it did before.
  3. Keeping track of what it shouldn't: I wish I had a dollar for every time that SSIS kept track of something it shouldn't. Be it the old configuration, hard coding the connection string inside the package and completely ignoring the configuration... etc.
  4. Sorry excuse for deployment: As you can probably see from above, SSIS's previous faults don't make for a nice transfer to production. I had to deal with that today, and it was a pain. Things just refuse to work if you move them between machines, even when the database target remains the same. In most cases, those are "need new metadata" or plain "I am broke" errors, which leave you with the tried and true method of getting VS, opening the package, and starting to mess around with it manually, until it decides that it doesn't hate you so much again. Requiring VS to deploy successfully is a big mistake, but I have been unable to avoid this requirement in any package of any complexity so far.
    Would I need to do it again when I want to redeploy? Why, yes! It is not like I have better things to do...!
  5. Security? Who needs that: I should also mention that SSIS packages require sysadmin rights to run when scheduled as a job. Which of course it will not tell you until you have run the job. I am aware of the agent proxy solution, but it still remains a mystery to me why this is a requirement. After all, you can run an SSIS package just fine from inside SQL Server, just not as a job.
  6. UI formatting instructions alongside the executable code: Here is another huge mistake. The SSIS package contains both the executable blocks and the formatting for them, making it completely unreadable from a plain text perspective, and making it impossible to understand what has changed from a diff of the file.
  7. No thought about version control: As long as we touch that, SSIS packages and configuration might as well be binary objects, they are completely opaque for version control, and the decision to make the configuration files for SSIS un-indented was made by someone who really never had to modify a configuration file outside of the pretty UI, or wanted to actually see what is going to change.
  8. Bad configuration scheme: This brings me to the configuration scheme itself. Yes, you can put the configuration anywhere you want, from incomprehensible XML files with hard coded paths to SQL tables to which you will have to have a hard coded connection string, from environment variables (WTF?!) to stored inside the package.
  9. Random configuration scheme: The problem with all those options is that SSIS seems to choose between those at random. At times it would choose the connection that it had assimilated inside the package, at times it would go to the wrong file, get the old configuration, or just complain that it is not having fun and that I should baby sit it again.
  10. Bad UI: If we are talking about baby sitting, that is what the UI makes me feel like. Having to feed it very carefully, in smaaaalll bites, what I would like it to do. Going through six dialogs and three property grids just to get something to work is not my idea of fun, and SSIS has the lovely errors to point your way. Oh, and there is the advanced property dialog, if I was feeling stupid.
    Then there is the worst issue: building expressions is horribly broken from the UI perspective. You can't see the evaluated string if it is bigger than a few lines, and no one will help you if you dare break the source string into multiple lines.
  11. Lack of extensibility: My company has actually developed a series of components for SSIS, they cost a lot of time and frustration, so I can honestly say that trying to extend SSIS is nothing but pain.
  12. Bad interoperability: As it happened, I am dealing with Oracle. As an integration services platform, I fully expected SSIS to support this little known database, but it doesn't. I will spare you the pain of "CREATE VIEW Customers_SSIS_DOESNT_LIKE_ORA_TABLES as SELECT * FROM Customers;", but I assume you can guess the rest.
  13. Busy work: I mentioned it before, but SSIS is very click happy, requiring you to do a lot of work with the mouse, over and over again. Just try adding a field to a data source, and see how much pain you have to go through as you have to go through each and every one of the steps that it passed along the way, even if they never touched that field. Heaven forbid that you dare to remove a field. Did I mention the lovely huge dialog when SSIS basically tells you: "Nothing has changed, shall I map with the same column names?"
  14. Hard to debug: Today I had a dynamically generated SQL inside an SSIS package that was causing an error. I had no way of finding out how to get this dynamic SQL (which was stored in a variable). Eventually I created a secondary path that saved the variable to a table, where I could read it.
    Update: This is possible, see here for the details. The rest of this point still stands, though.
    Finding out why things are failing is a task for those who have little respect for their time, for SSIS insists that it is important enough to engulf all your time. Oh, and have fun getting a MESSAGE BOX from the SSIS process if a script task has thrown an exception! Obviously this message box will be hidden behind SSIS, so you will wait for the process to end for quite some time before realizing what has happened.
  15. The missing basics: I lost count of the many things that I consider a given that SSIS doesn't have. From date formatting to parameterized queries (to Oracle) to basic UPSERT support. There have been a few times when I literally could not believe that it didn't have this capability. Date formatting and parsing is a basic part of what you ought to get out of the box with an integration package, as just one simple example of a basic lack that is driving me crazy.

There are more, but I am going to walk the dog now.

Brail: Null propagation

Here is a small, but interesting tidbit. Yesterday I finally sat down and documented a lot of the changes that I made in Brail recently. I was surprised to see how much I had to document:

  • Auto Imports
  • Strongly typed variables
  • Sections
  • The "?variableName" syntax
  • Symbols
  • Null Propagation

The last part is what I want to talk about now. Brail is a .Net language, which means that something like this:

output user.Parent.Name

will raise NullReferenceException if the user's parent is null. That can be somewhat of a pain in many scenarios. Brail now has a better syntax for this:

output ?user.Parent.Name

will ignore any null values that it encounters along the way. It will either output the parent's name or nothing at all.

The small print:

This will work only for variables that you get from the controller via the PropertyBag or Flash. You can't use it on variables that you define in the view. However, that is rare enough that I don't think that it is going to be a problem.
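
For readers who want to see the idea outside of Brail, here is a minimal Python sketch of the same behavior. It says nothing about how Brail implements the "?" syntax. The null_propagate function is a hypothetical helper, and treating a missing attribute the same as a null value is an assumption of this sketch.

```python
from types import SimpleNamespace


def null_propagate(root, path):
    """Follow a dotted attribute path, returning '' as soon as a None is hit.

    Assumption of this sketch: a missing attribute is treated like None.
    """
    current = root
    for name in path.split("."):
        if current is None:
            return ""
        current = getattr(current, name, None)
    return "" if current is None else current


# Hypothetical view data, standing in for what a controller would pass along.
user = SimpleNamespace(Parent=SimpleNamespace(Name="Ayende"))
orphan = SimpleNamespace(Parent=None)

print(null_propagate(user, "Parent.Name"))    # prints: Ayende
print(null_propagate(orphan, "Parent.Name"))  # prints an empty line
```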

Note that I didn't document the DSL support in Brail yet, that is going to wait until it is stabilized a bit.

Map, Reduce, Filter

Dustin Campbell has a couple of posts about Map/Reduce/Filter:

He makes it easy to understand, but he forgot one important thing: using the Map/Reduce/Filter pattern makes it very easy to parallelize your code, since you have already separated everything into an action on a set, which can be performed in parallel safely (in most cases).
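
A minimal Python sketch of that point, using only the standard library. The function names and the thread pool are choices made for this example, not anything taken from Dustin's posts. The map step is the part that parallelizes, because each item is processed independently of the others.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def square(n):
    return n * n


def is_even(n):
    return n % 2 == 0


def parallel_sum_of_even_squares(numbers, workers=4):
    # Map: each item is independent, so the pool is free to run them concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        squares = list(pool.map(square, numbers))
    # Filter: keep only the even squares.
    evens = filter(is_even, squares)
    # Reduce: fold the survivors into a single value.
    return reduce(lambda acc, n: acc + n, evens, 0)


print(parallel_sum_of_even_squares(range(1, 7)))  # prints: 56
```

In CPython, threads mostly help with I/O-bound work. For CPU-bound work the same shape should carry over to ProcessPoolExecutor, which is the whole point: the structure of the code does not have to change.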

Blog Posts vs. Articles

Jakob Nielsen says that experts should not blog; Larry O'Brien disagrees and brings some real world data about leads generated from blog posts and articles.

I have a different approach for this, and it is about the time invested vs. the exposure earned. I have published several thousand blog posts, and I rarely work on a blog post for over a few hours. The single article that I published took several weeks to write and re-write, get peer reviewed, and get published. The number of contacts that I got from that vs. the ones that I get from this blog cannot be compared.

Blogging allows me to post quickly, which means that you get a lot of content that would never see the light of day otherwise. To make the cost of putting content out any higher than it should be means that you are limiting what you can do. It means that you will have less visibility and less traffic, in the end.

To my mind, publishing an article is good only for the additional exposure in channels that I don't already have access to (all readers of XYZ Magazine, for instance), but while that has an appeal, I don't see the need to invest the amount of time that would be required to do so.

Benefits Of Production Virtualization

Rod Paddock is talking about virtualization for production, and calls it snake oil. Now, my company has done a lot of projects to consolidate and virtualize servers (the largest in Israel in this market), although that is not the part that I am dealing with.

Server virtualization helps when you:

  • Want to conserve space / power
  • Have a lot of applications that are running on obsolete systems (be that NT 4.0 systems or Mystery Reports 1.2.3-rc2.1) and cannot be readily moved.
  • Want to reduce manageability costs by reducing the amount of work needed
  • Have low utilization (at least by today's standards) for the servers, so you get more bang for the buck by overloading a single physical server with multiple VMs.
  • Want the ability to build up and tear down environments in a snap

Rod mentions the reliability concerns of moving all servers to a few machines, but those are mitigated by clustering the virtual machines themselves (VMotion), rather than by distributing over more machines.

One thing that I would warn against is putting high performance / high throughput applications / services on a VM. The performance of a VM is always going to be slower, but the question is whether this is a true constraint for the application. In many cases, I would assume that it isn't. I have multiple projects in production now, working off VMs, and my company had several projects where the "deployment" consisted of copying the VM file to the ESX, booting, reconfiguring some URLs and going home.

So yes, I certainly think that Virtualization is a valuable tool for production.

But it is a PRODUCTION problem

Today I was called to fix a critical production problem. As I arrived, I had various scenarios going through my mind about what could have gone wrong. All of them meant that my day was basically ruined.

I was out of there in 45 minutes. I dedicated two of them to fixing the problem, and the rest to educating the users that closing the service means that the application will not work.

Urgh!

Don't start a demo with DDoS

A few days ago I had a launch of my current project to the internal customers. This meant that many more users had a chance to see the nearly-complete application. There is still a lot to do, but we are very close to feature-freeze (yeah!).

Anyway, I closed down everything the day before (around noon, so I had plenty of time to verify that it worked). On the day of the presentation, I took the liberty of arriving late (I often do, actually, it allows me to skip traffic). Around 20 minutes from work, I started to get urgent phone calls: we have an issue with this, and there is an error with that. The application is slow, there are timeout errors, etc.

I was absolutely bewildered: it worked yesterday, and no one had changed a thing. I think that I arrived at work about ten minutes before it was to go live to the users, and I quickly started the crisis mode maneuvers. Restart IIS, restart SQL Server. It helped for a few minutes, but then the application started to show timeout errors from the database. I increased the command timeout (single line change, in the configuration, very cool) and the application seemed to stabilize, albeit running very slowly.

I then started looking more deeply at the root cause of the matter. One of the first places that I looked at was the requests/sec perf counter, and it was very high for a system that no one should be using at the moment. We got a fairly high number of requests per second (~17), sustained over a long period of time. It was as if I suddenly had a few hundred very active users.

It was a good chance to brush up on my forensic skills, and a short while after I determined that the hits were coming from a small set of IPs, at a very high rate. Someone was DDoSing me. It took some exchange of blame and a whole round of denials before someone remembered that they were running the stress test kit against the demo servers.
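
The kind of check that surfaces this is easy to sketch. The Python below is only an illustration, not the tooling that I used. The top_clients function is hypothetical, and the log format, with the client IP as the first whitespace-separated field, is an assumption (real IIS logs are laid out differently).

```python
from collections import Counter


def top_clients(log_lines, limit=3):
    """Count hits per client IP and return the busiest ones.

    Assumes the client IP is the first whitespace-separated field on each line.
    """
    hits = Counter(line.split()[0] for line in log_lines if line.strip())
    return hits.most_common(limit)


# Made-up sample lines, purely for illustration.
sample = [
    "10.0.0.5 GET /orders",
    "10.0.0.5 GET /customers",
    "10.0.0.9 GET /orders",
    "10.0.0.5 GET /products",
]
print(top_clients(sample, limit=1))  # prints: [('10.0.0.5', 3)]
```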

The moment that stopped, everything went back to normal, but that guy is responsible for at least three separate breakdowns, a whole lot of swearing, and quite a bit of worrying.

Not a good way to start a demo, but once we passed that, we got very positive results. :-)

Things that scare me

I ran into something today that I have no idea how to respond to. I was working with someone and at one point I did a WAIT A MINUTE, What Just Happened?!

He then proudly showed me how he improved his daily work flow. He has modified the VS #region template so he was able to select a piece of code, hit a few keys, and get the following:

 #region old
/*

selected code...

*/
#endregion

You can guess what the code base looks like, I assume, with this technique in place.

Sending arrays to SQL Server: Xml vs. Comma Separated Values

I spoke before about using the XML capabilities of SQL Server in order to easily pass a list of values to SQL Server. I thought that this was a pretty good way to go, until I started to look at the performance numbers.

Let us take a look at this simple query:

DECLARE @ids xml
SET @ids = '<ids>
      <id>ALFKI</id>
...
      <id>SPLIR</id>
</ids>'

SELECT * FROM Customers
WHERE CustomerID IN (SELECT ParamValues.ID.value('.','NVARCHAR(20)')
FROM @ids .nodes('/ids/id') as ParamValues(ID) )

This simple query has a fairly involved execution plan:

image

This looks to me like way too much stuff for such a simple thing, especially when I see this:

image

So the XML stuff is taking up 98% of the query?

I then checked the second route, using fnSplit UDF from here. Using it, I got this result:

image

So it looks like it is significantly more efficient than the XML counterpart.

But what about the larger scheme? Running the fnSplit over 9,100 items got me a query that took nearly 45 seconds, while the XML approach over the same set of data showed no measurable increase in time compared to just 91 records.

I then tried a simple SqlCLR function, and got the same performance from it:

image

The code for the function is:

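 using System.Collections;
 using System.Data.SqlTypes;
 using Microsoft.SqlServer.Server;

 public partial class UserDefinedFunctions
 {
     // Table-valued function: returns one row per comma-separated item.
     [SqlFunction(FillRowMethodName = "FillRow", TableDefinition = "str NVARCHAR(MAX)")]
     public static IEnumerable Split(SqlString str)
     {
         if (str.IsNull)
             return null;
         return str.Value.Split(',');
     }

     // Called once per item to fill the output row; empty items become NULL.
     public static void FillRow(object obj, out SqlString str)
     {
         string val = (string) obj;

         if (string.IsNullOrEmpty(val))
             str = new SqlString();
         else
             str = new SqlString(val);
     }
 }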

As you can probably guess, there are all sorts of stuff that you can do to make it better if you really want, but this looks like a very good approach already.

Tripling the size of the data we are talking about to ~30,000 items had no measurable difference that I could see.

Obviously, when you are talking about those numbers, an IN is probably not something that you want to use.
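
One common alternative at that scale is to load the ids into a temporary table and join against it. That is an assumption on my part, and nothing in the measurements above covers it. The sketch below uses Python's built-in sqlite3 module as a stand-in for SQL Server, so the SQL dialect differs, and the customers_by_ids function and the wanted_ids table are hypothetical names.

```python
import sqlite3


def customers_by_ids(conn, csv_ids):
    """Split a comma-separated id list, load it into a temp table, and join on it."""
    ids = [(part.strip(),) for part in csv_ids.split(",") if part.strip()]
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS wanted_ids (id TEXT PRIMARY KEY)")
    conn.execute("DELETE FROM wanted_ids")
    # INSERT OR IGNORE silently drops duplicate ids in the input list.
    conn.executemany("INSERT OR IGNORE INTO wanted_ids (id) VALUES (?)", ids)
    return conn.execute(
        "SELECT c.CustomerID, c.CompanyName "
        "FROM Customers AS c JOIN wanted_ids AS w ON w.id = c.CustomerID "
        "ORDER BY c.CustomerID"
    ).fetchall()


# A tiny in-memory table with made-up rows, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (CustomerID TEXT PRIMARY KEY, CompanyName TEXT)")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?)",
    [("ALFKI", "Company A"), ("BONAP", "Company B"), ("SPLIR", "Company C")],
)
print(customers_by_ids(conn, "ALFKI, SPLIR,ALFKI"))
# prints: [('ALFKI', 'Company A'), ('SPLIR', 'Company C')]
```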

Total Frustration: Ambiguous Match Exception with WebForms

A colleague called me over to see why a page was failing with an Ambiguous Match Exception yellow screen of death. There was a simple change made to the page that broke it. I tried the usual troubleshooting methods* (restarting VS & IIS, cursing, etc.) but that failed to fix it. I moved the code to my machine and verified that it was completely reproducible across machines. Then we started to pick apart the changes, literally line by line.

The problem turned out to be this line:

IList<string> products = new List<string>();

Only then did we remember that the page also had a text box called "Products", which apparently caused the WebForms parser to choke and die. Really nice way to make sure that we would both lose over an hour of work, trying to find the root cause of an opaque problem.

* Take it for what it is worth, but I am sad that I know that in many cases, those methods actually work...

Rambled Thoughts

I am usually a big believer in separation of concerns for posts, but this is a collection of interesting things that have crossed my path recently, which I don't have the time/energy to do full posts about.

The truth about string concatenation performance...

Here is a riddle, what is faster?

  • string str = "Id: " + i;
  • string str = string.Format("Id: {0}", i);
  • string str = new StringBuilder().Append("Id: ").Append(i).ToString();

If you guessed StringBuilder or string.Format, you are mistaken. Over 10 million iterations, the simple "Id: " + i finished in 4.7 seconds, StringBuilder in 5.7 seconds and string.Format in 7.6 seconds.

The reason for that is that the compiler can optimize the + operator to a call to string.Concat, and it does it quite often when you have several parameters. The optimizations of StringBuilder only show up if you have several concatenations, or if you are using it on more than a single expression.

What is going on?

My schedule for the last week was:

  • Sunday - Work
  • Monday - Work, Teach
  • Tuesday - Giving a talk at a conference, Work
  • Wednesday - Work, 2 user group talks
  • Thursday - Work, Teach

Planned for Friday - rest!

My email has 48(!) unread items, and another ~40 starred items, which is a whole new record for me. Allow me to apologize in advance for not responding to email in a timely manner, right now I feel like I have been mowed down.

VB User Group Talks Summary

Yesterday I gave two talks to the VB User Group, about TDD and Rhino Mocks. I had a lot of fun, and got to embarrass myself in public, trying to code in VB. (When I need to ask the crowd for the array literal syntax in the middle of the lecture, that is trouble...)

There were a lot of interesting questions, and I got to see what mocking tests look like in VB.Net. A lot of the method names in Rhino Mocks are reserved words in VB.Net, which surprised me, because I got VB.Net code from people using Rhino Mocks, and I never noticed that...

Apparently you get code like this:

smsSender.SendSms(Nothing, Nothing)
LastCall.[Throw](new Exception()).Constraints([Is].Equals("oren"), [Is].Anything())

Code & Presentations (in PDF format) can be found here.

The value of a feature

How do you value a feature? My own estimation is based on the amount of effort that it took, but the client has a totally different metric. The value of a feature for a client is directly proportional to the amount of pain it removes, with complete and utter disregard for the effort it took to implement.

I was reminded of this when the customer completely failed to appreciate my JavaScript multi threading capabilities (that enabled a drop down list), and was supremely impressed with the user impersonation feature, which took about an hour to write. Amusingly enough, a story to the same effect has just appeared on the Daily WTF.

That is a completely reasonable approach from the client's point of view, naturally. But I wish I could always get a "Wow, that completely rocks!" for those features, and something better than "Oh, you finally got it to work" for the stuff that had me pulling hair.

As an aside, this leads to managing the customer's expectations, the classic example of which is the mockup screen that looks real, but I have a whole lot to say about that, in a different post.

Adding to list from multiple threads?

I just read an email from someone who assumes that it is safe to access a collection from multiple threads, as long as this is done only for adding items to the collection.

My gut feeling said that this is a bad idea, but I set out to verify it anyway. Here is the test case: 

static void Main(string[] args)
{
    List<int> ints = new List<int>();
    ManualResetEvent resetEvent = new ManualResetEvent(false);
    int additions = 0;
    int totalCount = 1000000;
    for (int i = 0; i < totalCount; i++)
    {
        int tmp = i;//check C# spec for reasons
        ThreadPool.QueueUserWorkItem(delegate
         {
             resetEvent.WaitOne();
             ints.Add(tmp);
             Interlocked.Increment(ref additions);
         });
    }
    Console.WriteLine("Starting to add...");
    resetEvent.Set();
    while (additions < totalCount)
        Thread.Sleep(200);
    Console.WriteLine(additions);
    Console.WriteLine(ints.Count);
}

It is a bit tricky, because all code that deals with multi threading is tricky, but basically I am trying to create as much contention as possible, and on my machine (dual core), I get the following results:

Starting to add...
1000000
999812

So the answer is pretty clear: all 1,000,000 additions ran, but the list holds only 999,812 items, so 188 of them were silently lost. It is not safe to access a collection from multiple threads without proper locks, no matter what you do to it. I had to increase the totalCount significantly before I could get a reproducible result, but this happened at much lower iteration counts as well, so don't think that you can get away with it if you have small collections.

Working with threads requires thread synchronization, always.
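For contrast, here is a minimal sketch of the same kind of test with a lock around every Add. It is not the test case above rewritten line by line: the class name LockedAdd, the Run method, the dedicated threads and the worker counts are all assumptions of mine for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical example; names and thread counts are illustrative only.
public static class LockedAdd
{
    // Adds perWorker items from each of workerCount threads to one shared list,
    // holding a lock around every Add, and returns the final list size.
    public static int Run(int workerCount, int perWorker)
    {
        List<int> ints = new List<int>();
        object sync = new object();
        ManualResetEvent start = new ManualResetEvent(false);
        Thread[] workers = new Thread[workerCount];

        for (int w = 0; w < workerCount; w++)
        {
            workers[w] = new Thread(delegate()
            {
                start.WaitOne(); // hold every worker until all are ready
                for (int i = 0; i < perWorker; i++)
                {
                    lock (sync) // only one thread may touch the list at a time
                    {
                        ints.Add(i);
                    }
                }
            });
            workers[w].Start();
        }

        start.Set(); // release all workers together to maximize contention
        foreach (Thread worker in workers)
        {
            worker.Join();
        }
        return ints.Count;
    }

    public static void Main()
    {
        // 8 workers x 125,000 items = 1,000,000 additions
        Console.WriteLine(Run(8, 125000)); // prints 1000000
    }
}
```

With the lock in place no additions are lost, at the cost of serializing every Add. Whether that cost matters depends on how much other work each thread does between additions.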