Tuesday, September 07, 2010
Being a product owner
A while ago I came to the realization that it is impossible for me to do everything alone, so I got other people to help me build my projects. In some cases, that was done using OSS, by soliciting contributions from the community. In others, it is a simple commercial transaction, where I give someone money for code.
I think that I have gotten too used to the OSS model, because I got the following reply for this:
That was somewhat of a rude shock: I was using the same loose language that I always hated when I got specs to implement.
Those are all really good questions.
Monday, September 06, 2010
Normalization is from the devil
The title of this post is a translation of an Arabic saying that my father quoted me throughout my childhood.
I have been teaching my NHibernate course these past few days, and I have come to realize that my approach to designing RDBMS based applications has undergone a drastic change recently. I think that the difference in my view was brought home when I started getting angry about this model:
I mean, it is pretty much a classic, isn’t it? But what really annoyed me was that all I had to do was look at this and know just how badly this is going to end up when someone tries to show an order with its details. We are going to have, at least initially, 3 + N(order lines) queries. And even though this is a classic model, loading it efficiently is actually not that trivial. I actually used this model to show several different ways of eager loading. And remember, this model is actually a highly simplified representation of what you’ll need in real projects.
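To make the 3 + N count concrete, here is a minimal sketch using Python's built-in sqlite3 module. The schema (customers, products, orders, order_lines), the sample rows and the helper names are all assumptions made up for illustration; they are not the model from the course. The naive version issues one query for the order, one for the customer, one for the lines, and then one more per line for its product. The joined version is one possible way of eager loading.

```python
import sqlite3

def build_db():
    # Hypothetical, highly simplified schema; every table and column name is an assumption.
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
        CREATE TABLE order_lines (id INTEGER PRIMARY KEY, order_id INTEGER,
                                  product_id INTEGER, quantity INTEGER);
        INSERT INTO customers VALUES (1, 'Ayende');
        INSERT INTO products VALUES (1, 'Thingamajig', 10.0);
        INSERT INTO products VALUES (2, 'Widget', 5.0);
        INSERT INTO products VALUES (3, 'Gadget', 2.5);
        INSERT INTO orders VALUES (1, 1);
        INSERT INTO order_lines VALUES (1, 1, 1, 2);
        INSERT INTO order_lines VALUES (2, 1, 2, 1);
        INSERT INTO order_lines VALUES (3, 1, 3, 4);
    """)
    return db

class CountingDb:
    """Wraps a connection and counts the queries sent through it."""
    def __init__(self, db):
        self.db = db
        self.queries = 0

    def query(self, sql, *args):
        self.queries += 1
        return self.db.execute(sql, args).fetchall()

def show_order_naive(cdb, order_id):
    # Three queries up front: the order, its customer, its lines...
    (_, customer_id), = cdb.query(
        "SELECT id, customer_id FROM orders WHERE id = ?", order_id)
    (customer,), = cdb.query(
        "SELECT name FROM customers WHERE id = ?", customer_id)
    lines = cdb.query(
        "SELECT product_id, quantity FROM order_lines WHERE order_id = ? ORDER BY id",
        order_id)
    # ...and then one more query per order line to fetch its product.
    details = []
    for product_id, quantity in lines:
        (name, price), = cdb.query(
            "SELECT name, price FROM products WHERE id = ?", product_id)
        details.append((name, price, quantity))
    return customer, details

def show_order_joined(cdb, order_id):
    # One possible form of eager loading: a single joined query.
    rows = cdb.query("""
        SELECT c.name, p.name, p.price, l.quantity
        FROM orders o
        JOIN customers c ON c.id = o.customer_id
        JOIN order_lines l ON l.order_id = o.id
        JOIN products p ON p.id = l.product_id
        WHERE o.id = ?
        ORDER BY l.id""", order_id)
    return rows[0][0], [(name, price, qty) for _, name, price, qty in rows]
```

With the three sample order lines, the naive path goes through `query` six times (3 + 3), while the joined path goes through it once and returns the same data.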
I then came up with a model that I felt was much more palatable to me:
And looking at it, I had an interesting thought. My problem with the model started because I got annoyed by how many tables were involved in dealing with “Show Order”, but the end result also reminded me of something: Root Aggregates in DDD. Now, since my newfound sensitivity about this has been based on my experiences with RavenDB, I found it amusing that I explicitly modeled documents in RavenDB after Root Aggregates in DDD, then went the other way (reducing queries –> Root Aggregates) with modeling in RDBMS.
The interesting part is that once you start thinking like this, you end up with a lot of additional reasons why you actually want that. (If the product price changed, it doesn’t affect the order, for example).
If you think about it, normalization in RDBMS had such a major role because storage was expensive. It made sense to try to optimize this with normalization. In essence, normalization is compressing the data, by taking the repeated patterns and substituting them with a marker. There is also another issue: when normalization came out, the applications being built were far different from the type of applications we build today, in terms of number of users, time that you had to process a single request, concurrent requests, amount of data that you had to deal with, etc.
Under those circumstances, it actually made sense to trade off read speed for storage. In today’s world? I don’t think that it holds as much.
The other major benefit of normalization, which took extra emphasis when the reduction in storage became less important as HD sizes grew, is that when you state a fact only once, you can modify it only once.
Except… there is a large set of scenarios where you don’t want to do that. Take invoices as a good example. In the case of the order model above, if you changed the product name from “Thingamajig” to “Foomiester”, that is going to be mighty confusing for me when I look at that order and have no idea what that thing was. What about the name of the customer? Think about the scenarios in which someone changes their name (marriage is the most common one, probably). If a woman orders a book under her maiden name, then changes her name after she marries, what is supposed to show on the order when it is displayed? If it is the new name, that person didn’t exist at the time of the order.
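A minimal sketch of that idea: the order line copies the product name and price at the time of the order, instead of pointing at the live product record. The class and field names are hypothetical, made up for illustration only.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Product:
    id: int
    name: str
    price: float

@dataclass(frozen=True)
class OrderLine:
    product_id: int      # kept for reference only
    product_name: str    # copied at the time of the order
    unit_price: float    # copied at the time of the order
    quantity: int

@dataclass
class Order:
    customer_name: str   # the name the customer used when ordering
    lines: List[OrderLine] = field(default_factory=list)

    def add_line(self, product: Product, quantity: int) -> None:
        # Snapshot the facts we care about; later product changes do not reach the order.
        self.lines.append(
            OrderLine(product.id, product.name, product.price, quantity))

    def total(self) -> float:
        return sum(line.unit_price * line.quantity for line in self.lines)
```

If the product is later renamed or repriced, the order still shows what was actually bought and for how much. The cost is that a fact is now stated more than once, which is exactly the tradeoff discussed above.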
Obviously, there are counter examples, which I am sure the comments will be quick to point out.
But it does bear thinking about, and my default instinct to apply 3rd normal form has been muted once I realized this. I now have a whole set of additional questions that I ask about every piece of information that I deal with.
Sunday, September 05, 2010
I need my own blog software, damn it
I am well aware that I am… outside the curve for bloggers. For a long while I handled that by simply dumping the posts as soon as I wrote them, but that turned out to be quite a burden for some readers, and pieces that I think deserve more attention were skipped, because they were simply drowning in the noise of so many blog posts.
I am much happier with the future posting concept. It makes things more predictable, both for me and for the readers. The problem happens when you push this to its logical conclusion. At the time of this writing, I have a month of scheduled posts ahead of me, and this is the third or fourth blog post that I wrote in the last 24 hours.
In essence, I created a convoy for my own blog. At some point, if this trend progresses, it will be a problem. But I kinda like the fact that I can relax for a month and the blog will function on auto pilot. There is also the nice benefit that by the time that the blog post is published, I forgot what it said (I use the write & forget method), so I need to read the post again, which helps, a lot.
But there are some disadvantages to this as well. My current system will simply schedule a post on the next day after the last day. This works great if I have posts that are not time sensitive. But what actually happens is that there are a lot of scenarios in which I want to set the date of the post to the near future. I still try to keep it to one post a day, though, so that means that I need to shuffle the rest of the items in the queue. This is especially troubling when you consider that I usually write a series of posts that interconnect to a full story.
So I can’t just take one of them and bump it to the end, I might have to do rearranging of the entire timeline. And there is no support for that, I have to go and manually update the timing for everything else.
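The kind of rearranging I have in mind can be sketched in a few lines. This is a hypothetical helper, not anything that exists in SubText: the queue is assumed to be a plain mapping from publish date to post title, with at most one post per day.

```python
from datetime import date, timedelta

def schedule_at(queue, title, when):
    """Insert `title` on `when`, pushing colliding posts back one day each.

    `queue` maps publish date -> post title, with at most one post per day.
    Returns a new dict; the relative order of existing posts is preserved.
    """
    result = dict(queue)
    carried = title
    day = when
    # Walk forward: each occupied day hands its post on to the next day,
    # and the cascade stops at the first free day.
    while carried is not None:
        carried, result[day] = result.get(day), carried
        day += timedelta(days=1)
    return result
```

Because the cascade stops at the first gap, only the posts between the insertion point and the next free day move, and a series of interconnected posts keeps its order.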
It is pretty clear why this feature is missing: it is an outlier. But it probably means that I am going to fork SubText and add those things. And the real problem is that I would really like to avoid doing any UI work there. So I need to think about a system that would let me do that without any UI work on my part.
Saturday, September 04, 2010
Security Models: On Behalf Of
In a security system, On Behalf Of is a vastly underutilized concept. I created that for the first time in 2007, in a project that spurred the creation of Rhino Security. On Behalf Of allows one user to assume the mantle of another user. From the point of view of the authorization system, once you activate On Behalf Of, you are that other user. That means that you have all the access rights (and limitations) of that user.
Why is this useful?
- On Behalf Of gives a help desk operator a very quick way to reproduce a bug that a user ran into. Usually those bugs are things like “why can’t I see product Foo”, or “I ran into a bug in Order #823838”. Those bugs can only be reproduced when the help desk operator is running within the security context of the user.
- On Behalf Of represents how the real world works. Imagine a Team Leader that takes a vacation. For the duration of the vacation, there is someone else that assumes that Team Leader role on a temporary basis. On Behalf Of allows that someone to do the required work, in the context of the Team Leader, which allows her to perform operations that she otherwise might not be able to perform.
Auditing
On Behalf Of has important implications on auditing. In most systems where auditing plays a role, the user who performs the action is just as important as the action that was taken. On Behalf Of integrates with the auditing system, to indicate not only who actually did the operation (the end user), but also what user that operation was On Behalf Of. This is important, because it avoids “impossible” audit entries later on, where either “I was on vacation at that time, I couldn’t have done it” or “I never had permissions to do this, there is a bug in the system” might crop up.
It is important to note that most systems where On Behalf Of is used have sophisticated security rules. At that point, system administrators are more akin to Windows’ Administrators than Unix’s Root. In Unix, if you are root, you can do whatever you like. In Windows, if you are Administrator, you can do almost everything that you like. A typical example is that as an administrator on Windows, you can’t read another user’s files without leaving a mark that you can’t remove (changing ownership), but there are others.
You can think about On Behalf Of as an extension to this, we want to act as another user, but we have to know that we did. That is why you want to be able to pull the operations made by users acting On Behalf Of other users from the Auditing System easily. In fact, when running On Behalf Of, your audit level is much increased, because we need to track what sort of operations you made, even operations that are normally below the audit level of the system (viewing an entity you otherwise had no way of accessing, for example).
Authorization
From an authorization perspective, the actual mechanics are pretty simple: instead of passing the actual user to the security system, you pass the user that we execute operations On Behalf Of.
However, the security system does need to be aware of On Behalf Of, because there are some operations that you cannot perform On Behalf Of someone else. For example, while I may be authorized to act On Behalf Of another Team Leader, it is not possible for me to fire a team member while operating On Behalf Of that Team Leader. Firing someone is a decision that can be made only by that person’s direct manager, not by someone acting On Behalf Of. This is a business decision, mind you, to define the set of operations that may be performed On Behalf Of (usually most of them) and the few critical ones that you mustn’t.
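Here is a minimal sketch of those mechanics, including the audit entry discussed earlier. It is not the Rhino Security API; the class names, the permission table and the operation names are all hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

@dataclass(frozen=True)
class SecurityContext:
    actual_user: str                    # who is really at the keyboard
    on_behalf_of: Optional[str] = None  # whose mantle they assumed, if any

    @property
    def effective_user(self) -> str:
        return self.on_behalf_of or self.actual_user

class AuthorizationService:
    def __init__(self, permissions: Dict[str, Set[str]],
                 never_on_behalf_of: Set[str]):
        self.permissions = permissions                # user -> allowed operations
        self.never_on_behalf_of = never_on_behalf_of  # business-defined blacklist
        self.audit_log: List[Tuple[str, Optional[str], str, bool]] = []

    def is_allowed(self, ctx: SecurityContext, operation: str) -> bool:
        # Rights (and limitations) come from the effective user...
        allowed = operation in self.permissions.get(ctx.effective_user, set())
        # ...but a few critical operations are refused when acting On Behalf Of.
        if ctx.on_behalf_of is not None and operation in self.never_on_behalf_of:
            allowed = False
        # The audit entry records both the actual user and who they acted for.
        self.audit_log.append(
            (ctx.actual_user, ctx.on_behalf_of, operation, allowed))
        return allowed
```

A stand-in acting On Behalf Of the Team Leader would be allowed to approve a vacation but not to fire a team member, and every check leaves an audit entry naming both users.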
Authentication
As you can imagine, there is a great deal of potential for abuse with this feature, so for the most part, it is strictly limited. Usually it is active for system administrators only, but very often, there are both temporary and permanent sets of conditions where On Behalf Of is enabled.
For example, an exec is always able to act On Behalf Of his captain. And we already discussed covering another Team Leader when they are sick / vacationing / etc.
Summary
On Behalf Of is a powerful feature, but it requires understanding from the users. It has the potential to be looked at as a security hole, even though it is just a reflection of how we work in real life. And the implications if the users expect some level of privacy in the system are huge. A large part of implementing On Behalf Of correctly isn’t in the technical details, it is in the way you build / document / sell it to the users.
Friday, September 03, 2010
The Law of Conservation of Tradeoffs
Every so often I see a group of people or a company come up with a new Thing. That new Thing is supposed to solve a set of problems. The common set of problems that people keep trying to solve are:
- Data access with relational databases
- Building applications without needing developers
And every single time that I see this, I know that there is going to be a catch involved. For the most part, I can usually even tell you what the catches are going to be.
It isn’t because I am smart, and it is certain that I am not omniscient. It is an issue of knowing the problem set that is being set out to solve.
If we take data access as a good example, there aren’t that many ways that you can approach it, when all is said and done. There is a set of competing tradeoffs that you have to make. Simplicity vs. usability would probably be the best way to describe it. For example, you can create a very simple data access layer, but you’ll give up on doing automatic change tracking. If you want change tracking, then you need to have Identity Map (even data sets had that, in the sense that every row represented a single row :-) )
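To illustrate why change tracking drags an Identity Map along with it, here is a toy sketch. It is not how any particular data access tool implements it; the `Session` class and the dictionary standing in for the database are assumptions made for illustration.

```python
import copy

class Session:
    """Toy unit of work: an identity map plus snapshot-based change tracking."""

    def __init__(self, store):
        self.store = store        # stand-in for the database: {id: dict of columns}
        self.identity_map = {}    # id -> the single in-memory instance for that row
        self.snapshots = {}       # id -> copy of the row as it was loaded

    def load(self, entity_id):
        # The same id always yields the same object, so there is exactly
        # one place to watch for changes to that row.
        if entity_id not in self.identity_map:
            row = copy.deepcopy(self.store[entity_id])
            self.identity_map[entity_id] = row
            self.snapshots[entity_id] = copy.deepcopy(row)
        return self.identity_map[entity_id]

    def dirty(self):
        # An entity is dirty when it no longer matches its loaded snapshot.
        return [i for i, e in self.identity_map.items()
                if e != self.snapshots[i]]

    def flush(self):
        # Write back only what changed, then refresh the snapshots.
        for entity_id in self.dirty():
            current = self.identity_map[entity_id]
            self.store[entity_id] = copy.deepcopy(current)
            self.snapshots[entity_id] = copy.deepcopy(current)
```

Without the identity map, two loads of the same row would give two independent copies, and the tracker would have no way to decide which one holds the changes to save. That extra machinery is the price paid for the convenience.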
When I need to evaluate a new data access tool, I don’t really need to go too deeply into how it does things, all I need to do is to look at the set of tradeoffs that this tool made. Because you have to make those tradeoffs, and because I know the playing field, it is very easy for me to tell what is actually going on.
It is pretty much the same thing when we start talking about the options for building applications without developers (a dream that the industry has chased for the last 30 – 40 years or so, unsuccessfully). The problem isn’t in lack of trying, the amount of resources that were invested in the matter is staggering. But again you come into the realm of tradeoffs.
The best that a system for non developers can give you is CRUD. Which is important, certainly, but for developers, CRUD is mostly a solved problem. If we want plain CRUD screens, we can utilize a whole host of tools and approaches to do them, but beyond the simplest departmental apps, the parts of the application that really matter aren’t really CRUD. For one application, the major point was being able to assign people to their proper slot, a task with significant algorithmic complexity. In another, it was fine tuning the user experience so they would have a seamless journey into the annals of the organization decision making processes.
And here we get to the same tradeoffs that you have to make. Developer friendly CRUD systems exist in abundance, ASP.Net MVC support for Editor.For(model) is one such example. And they are developer friendly because they give you the bare bones of functionality you need, allow you to define broad swaths of the application in general terms, but allow you to fine tune the system easily where you need it. They are also totally incomprehensible if you aren’t a developer.
A system that is aimed at paradevelopers focuses a lot more on visual tooling to help the paradeveloper achieve their goal. The problem is that in order to do that, we give up the ability to do things in broad strokes, and have to pretty much do everything from scratch, every time. That is acceptable for a paradeveloper, without the concepts of reuse and DRY, but those same features that make it so good for a paradeveloper would be a thorn in a developer’s side. Because they would mean having to do the same thing over & over & over again.
Tradeoffs, remember?
And you can’t really create a system that satisfies both. Oh, you can try, but you are going to fail. And you are going to fail because the requirement set of a developer and the requirement set of a paradeveloper are so different as to be totally opposed to one another. For example, one of the things that developers absolutely require is good version control support. And by good version control support I mean that you can diff between two versions of the application and get a meaningful result from the diff.
A system for paradevelopers, however, is going to be so chock full of metadata describing what is going on that a diff is not going to be meaningful, even if the metadata is in a format that is possible to diff (and all too often it is located in some database, in a format that makes it utterly impossible to work with using source control tools).
Paradeveloper systems encourage you to write what amounts to Button1_Click handlers, if they give you even that. Because the paradevelopers that they are meant for have no notion about things like architecture. The problem is that when developers take that approach, the result is obviously unmaintainable.
And so on, and so on.
Whenever I see a new system cropping up in a field that I am familiar with, I evaluate it based on the tradeoffs that it must have made. And that is why I tend to be suspicious of the claims made about the new tool around the block, whatever that tool is at any given week.
Wednesday, August 04, 2010
How far can you push commercialization?
I was recently at a private company event (not my company, I was invited, along with others, because we have a close association with that company). The event itself wasn’t notable, but there was one thing that really bothered me. Before the event actually started, there was the usual phase when everyone is munching on the snacks and mingling. The food was some sort of green cupcakes with inspirational messages on them: “think positive”, “fitting the world to you”, etc.
All in all, I found that somewhat strange, but I didn’t really care. Then, while I was talking with a few friends, a woman walked up to us and started handing out coupons for some free demo courses using a whole new technique, etc. I was quite taken aback. I am used to stuff like that on conference floors, where you have booth babes doing stuff like that, but that was a private meeting of less than fifty people, and I couldn’t understand what was going on.
It helped that the woman kept dropping the same phrases that appeared on the cupcakes. That was later confirmed at the beginning of the meeting, where the presenter stood up and started by thanking the sponsors for bringing the food, etc.
Looking back at this, I am appalled, amazed and utterly unsurprised (you can be all of those at the same time, it seems). That company actually sold sponsorship for an internal, private, meeting. I don’t really know what the point was, if they were trying to save money on the food or they were actually making money out of this, but that behavior really bothers me.
I am absolutely for commercialization, if only because the bank would otherwise object, but I was utterly stunned by how crass it was.
What is next? Hiring employees for the express purpose of watching commercials while the company is getting paid for that?
More to the point, there is some expectation about how such functions are going to be, and stunts like that leave a very bad impression.
Wednesday, September 01, 2010
NHibernate Quick-start Workshop - November 1st
Tikal is delighted to invite you to join our .NET open source workshop on November 1st, led by Oren Eini (Ayende Rahien) and other Tikal .NET experts.
Tikal offers a set of software development tools and methodologies that enable .NET developers to integrate open source software modules into their native .NET environment.
We invite you to join our .NET open source workshop, which will equip you with all the knowledge required for using open source tools to develop excellent .NET applications quickly and effectively.
The workshop will include diverse topics such as:
- NHibernate hands on training, covering assimilation of the ORM framework to your product.
- S#harp Architecture introduction.
- Assessment of your application development needs and where open source tools can fit in.
Read more about the workshop here.
Wednesday, August 11, 2010
Don’t TOUCH that debugger, you moron, READ the exception stack
There is a tendency to reach for the debugger for every error that you run into, but in 99% of the cases, the exception (and the exception stack) provides enough to solve the problem.
Case in point, I made some changes to Uber Prof and ran the tests. For various reasons, I had to reinstall SQLExpress, and all the Java related tests failed, throwing up copious amounts of error text in my lap. I cringe when they do that, because it means having to set up the Java environment and having to check how to do things like real debugging in Java (something I have very little knowledge of).
I did just that, spending over an hour getting things to a position where I could run everything properly. Then I ran a scenario and got an error, then I looked at the exception stack:
Moron, did I mention already?
Wednesday, September 01, 2010
It is an issue of traffic
I just had to respond to this post. Davy Brion talks about the Ruby community, and he had the following to say:
When i asked them about interesting resources to follow as a newbie Rubyist, they all gladly shared their suggestions. When i thanked them for it, they all replied stating that i should feel free to contact them if i had any more questions about whatever Ruby related. Seriously, can you imagine the few .NET heroes that we have responding to questions through email from people they don’t even know like that? I can’t. Hell, i know most of them don’t respond like that. The few that do are still trying to earn their MVP award or are too worried about renewing their MVP status.
Ignoring the MVP dig, allow me to explain exactly what is going on.
In the last 48 hours:

Those are all cold requests, from people I have never met, and all to my private email. Note that in most cases, there is a dedicated mailing list for the topic in question.
For that matter, the last two days have been decidedly quiet on the NHibernate front. This represents a more realistic sample of what is going on:
And those are in addition to the business, private, mailing list and other stuff that I do in email.
Putting it simply, there is too much traffic for me to welcome most cold questions with anything more than a direction to the appropriate mailing list. This isn’t about being rude, or uncaring, this is about actually being able to do any work at all.
Tuesday, August 31, 2010
It really happened, legacy programmers tales
Fairy tales always start with “Once upon a time”, and programmers’ tales start with “when I was at a client”…
Two days ago I was at a client, and the discussion turned to bad code bases, as it often does. One story that I had a hard time understanding was the Super If.
Basically, it looked like this:
I had a hard time accepting that someone could write an if condition that long. I kept assuming that they meant that the if statements were 50 lines long, but that wasn’t the case.
And then yesterday I had an even more horrifying story. A WCF service making a call to the database always timed out on the first request, but worked afterward. What would be your first suspicion? Mine was that it took time to establish the database connection, and that after the first call the connection resided in the connection pool.
They laughed at my naivety, for it wasn’t connecting to the database that caused the timeout, it was JITting the method that the WCF service ended up calling.
Yep, you got that right, JITting a single method (because the runtime only JITs a single method at a time). I had an even harder time believing that, until they explained to me how that method was built:
Some interesting stats:
- It had a Cyclomatic Complexity of either 4,000 or 8,000, the client couldn’t remember.
- The entire Rhino Mocks codebase fits in 13,000 LOC, so this single method could contain it several times over.
But you know what the really scary part is?
I upgraded from Super If to Black Hole Methods, and I am afraid to see what happens today, because if I get something that tops the Black Hole Method, I may have to hand back my keyboard and go raise olives.
Monday, August 30, 2010
Entity != Table
I recently had a chance to work on an interesting project, doing a POC of moving from a relational model to RavenDB. And one of the most interesting hurdles along the way wasn’t technical at all, it was trying to decide what an entity is. We are so used to making the assumption that Entity == Table that we started to associate the two together. With a document database, an entity is a document, and that maps much more closely to a root aggregate than to an RDBMS entity.
That gets very interesting when we start looking at tables and having to decide if they represent data that is stand alone (and therefore deserves to live in separate documents) or whether they should be embedded in the parent document. That led to a very interesting discussion on each table. What I found remarkable is that it was partly a discussion that seemed to come directly from the DDD book, about root aggregates, responsibilities and the abstract definition of an entity, and partly a discussion that focused on meeting the different modeling requirements for a document database.
I think that we did a good job, but I most valued the discussion and the insight. What was most interesting to me was how right RavenDB was for the problem set, because a whole range of issues just went away when we started to move the model over.
Sunday, August 29, 2010
I ain’t going against my professional judgment pro bono
I had an interesting conversation with a guy about some problem he was having. This was just one of those “out of the blue” contacts that happen, when someone contacts me to ask a question. He presented a problem that I see all too often, trying to create a system in which the entities are doing everything, and he ran into problems with that (to be fair, he ran into a unique set of problems with that). I gave him a list of blog posts and articles to read, suggesting the right path to go. After a few days, he replied with:
I went over your advised reading in depth, but let me describe in short the properties and functions of our system, which I think causes the system to be an exception to those methods.
He then proceeded to lay out his problem and a proposed solution, and then asked a very specific NHibernate question that was a stumbling block for the solution he wanted. My reply was that he took the wrong approach, along with a suggestion for how to resolve it in a different manner and a link to our NHibernate Commercial Support option.
Thursday, August 26, 2010
Database assisted denormalization – Oracle edition
I decided to take a chance (installing Oracle is a big leap :-) ) and see how things match in Oracle.
I decided to run the following query:
SELECT deptno,
       dname,
       loc,
       (SELECT COUNT(*)
        FROM   emp
        WHERE  emp.deptno = dept.deptno) AS empcount
FROM   dept
WHERE  deptno = 20
Please note that I ran it on a database that had (total) maybe 100 records, so the results may be skewed.
Like in the SQL Server case, we need to create an index on the FK column. I did so, after which I got:
Then I dropped that index and created a simple view:
CREATE VIEW depswithempcount
AS
  SELECT deptno,
         dname,
         loc,
         (SELECT COUNT(*)
          FROM   emp
          WHERE  emp.deptno = dept.deptno) AS empcount
  FROM   dept
Querying on top of that gives me the same query plan as before. Trying to create a materialized view out of this fails because of the subquery expression, so I’ll have to express the view in terms of joins instead. Like this:
SELECT dept.deptno,
       dname,
       loc,
       COUNT(*) empcount
FROM   dept
       LEFT JOIN emp
         ON dept.deptno = emp.deptno
WHERE  dept.deptno = 20
GROUP  BY dept.deptno,
          dname,
          loc
Interestingly enough, this is a different query plan than the subquery. With SQL Server, those two queries exhibit identical query plans.

Now, to turn that into a materialized view.
CREATE MATERIALIZED VIEW deptwithempcount
AS
  SELECT dept.deptno,
         dname,
         loc,
         COUNT(*) empcount
  FROM   dept
         LEFT JOIN emp
           ON dept.deptno = emp.deptno
  GROUP  BY dept.deptno,
            dname,
            loc
And querying on this gives us very interesting results:
select * from deptwithempcount
where deptno = 20
Unlike SQL Server, we can see that Oracle is reading everything from the view. But let us try one more thing, before we conclude this with a victory.
update emp
set deptno = 10
where deptno = 20;
select * from deptwithempcount
where deptno = 20
But now, when we re-run the materialized view query, we see the results as they were at the creation of the view.
There appears to be a set of options to control that, but the one that I want (REFRESH FAST), which updates the view as soon as data changes, will not work with this query, since Oracle considers it too complex. I didn’t investigate too deeply, but it seems that this is another dead end.
The Profiler New Features: Starring & Renaming
An interesting thing happened recently. When I started to build the profiler, a lot of the features were what I call Core Features. Those were the things without which we wouldn’t have a product. Things like detecting SQL, merging it into sessions, providing reports, etc. What I find myself doing recently with the profiler is not so much building Core Features, but building UX features. In other words, now that we have this in place, let us see how we can make better use of this.
Case in point, the new features that were just released in build 713. They aren’t big, but they are there to improve how people are commonly using the products.
Renaming a session:
This is primarily useful if you are in a long profiling session and you want to mark a specific session with some notation:
Small feature, and individually not very useful. But you might have noticed that the sessions are marked with stars around them. They weren’t there in previous builds, so what are they?
They are a way to tell the profiler that you really like those sessions :-)
More to the point, such sessions will not be removed when you clear the current state. That lets you keep around the previous state of the application as a base line while you work to improve it. Besides, it makes it much easier to locate them visually.
And finally, as a quicker way to do that, you can just ask the profiler to clear all but the selected sessions.
Not big features, but nice ones, I think.
Wednesday, August 25, 2010
LightSwitch on the wire
This is going to be my last LightSwitch post for a while.
I wanted to talk about something that I found which was at once both very surprising and Doh!
Take a look here:
What you don’t know is this was generated from a request similar to this one:
wget http://localhost:22940/Services/LSTest-Implementation-ApplicationDataDomainService.svc/binary/AnimalsSet_All?$orderby=it.Id&$take=45&$includeTotalCount=
What made me choke was that the size of the response for this was 2.3 MB.
Can you guess why?
The image took up most of the data, obviously. In fact, I just dropped an image from my camera, so it was a pretty big one.
And that leads to another problem. It is obviously a really bad idea to send that image on the wire all the time, but LightSwitch makes it so easy. Indeed, even after I noticed the size of the response, it took me a while to understand what exactly was causing the issue.
And there doesn’t seem to be any easy way to tell LightSwitch that we want to put the property here, but only load it in certain circumstances. For that matter, I would generally want to make the image accessible via HTTP, which means that I gain advantages such as parallel downloads, caching, etc.
But there doesn’t seem to be any (obvious) way to do something as simple as binding a property to an Image control’s Url property.
LightSwitch & Source Control
Something that I found many high level tools are really bad at is source control, so I thought that I would give LightSwitch a chance there.
I created a Git repository and shoved everything into it, then I decided that I would rename a property and see what is going on.
I changed the Animals.Species to Animals.AnimalType, which gives me:
This is precisely what I wanted to see.
Let us see what happens when I add a new table. That created a new set in the ApplicationDefinition.lsml file.
Overall, this is much better than I feared.
I am still concerned about having everything in a single file (which is a recipe for having a lot of merge conflicts), but at least you can diff & work with it, assuming that you know how the file format works, and it seems like it is at least a semi reasonable one.
Nevertheless, as promised:

True story, I used to have a lot of ravens in my backyard, but they seem to have gone away since my dog killed one of them, about a week after RavenDB’s launch.
Analyzing LightSwitch data access behavior
I thought it would be a good idea to see what sort of data access behavior LightSwitch applications have. So I hooked it up with the EntityFramework Profiler and took it for a spin.
It is interesting to note that it seems that every operation that is running is running in the context of a distributed transaction:
There is a time & place to use DTC, but in general, you should avoid it until you really need it. I assume that this is something that is actually being triggered by WCF behavior, not intentional.
Now, let us look at what a simple search looks like:
This search results in:
That sound? Yes, the one that you just heard. That is the sound of a DBA somewhere expiring. The presentation about LightSwitch touted how you can search every field. And you certainly can. You can also swim across the English Channel, but I found that taking the train seems to be an easier way to go about doing this.
Doing this sort of searching is going to:
- Be very expensive once you have any reasonable amount of data.
- Prevent usage of indexes to optimize performance.
In other words, this is an extremely brute force approach for this, and it is going to be pretty bad from a performance perspective.
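To make the index point concrete, here is a minimal sketch in Python, with SQLite standing in for SQL Server. The table contents and the index name IDX_Animals_Name are made up for illustration, and the search statement is only an approximation of a "search every field" query, not the SQL that LightSwitch generated. It asks the database for its plan for a wildcard search across several columns, and for a plain lookup on an indexed column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Animals (Id INTEGER PRIMARY KEY, Name TEXT, Species TEXT, Color TEXT)")
conn.execute("CREATE INDEX IDX_Animals_Name ON Animals (Name)")  # hypothetical index
conn.executemany(
    "INSERT INTO Animals (Name, Species, Color) VALUES (?, ?, ?)",
    [("Rex", "Dog", "Black"), ("Tweety", "Bird", "Yellow"), ("Mittens", "Cat", "Grey")],
)

def plan(sql, args):
    """Return the plan descriptions SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, args).fetchall()
    return [row[3] for row in rows]

# "Search every field": a leading wildcard on each column.
search_all = plan(
    "SELECT * FROM Animals WHERE Name LIKE ? OR Species LIKE ? OR Color LIKE ?",
    ("%ex%", "%ex%", "%ex%"),
)
# Targeted lookup on the indexed column.
by_name = plan("SELECT * FROM Animals WHERE Name = ?", ("Rex",))

print(search_all)  # a full scan of the table
print(by_name)     # a search that uses IDX_Animals_Name
```

The exact wording of the plan differs between SQLite versions, but the wildcard search is reported as a scan while the equality lookup is reported as an index search.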
Interestingly, it seems that LS is using optimistic concurrency by default.
I wonder why they use the slowest method possible for this, instead of using version numbers.
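For reference, the version number approach can be sketched like this, again in Python with SQLite as a stand-in. The Version column and the rename helper are assumptions made for the example, they are not part of the LightSwitch schema. The update only succeeds if the row still carries the version that was originally read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Animals (Id INTEGER PRIMARY KEY, Name TEXT, Version INTEGER NOT NULL)")
conn.execute("INSERT INTO Animals (Id, Name, Version) VALUES (1, 'Rex', 1)")

def rename(animal_id, new_name, expected_version):
    """Update only if the row still has the version we originally read."""
    cursor = conn.execute(
        "UPDATE Animals SET Name = ?, Version = Version + 1 WHERE Id = ? AND Version = ?",
        (new_name, animal_id, expected_version),
    )
    # Zero affected rows means someone else changed the row in the meantime.
    return cursor.rowcount == 1

first = rename(1, "Max", expected_version=1)   # matches the stored version
second = rename(1, "Bob", expected_version=1)  # stale version, nothing is updated
current_name = conn.execute("SELECT Name FROM Animals WHERE Id = 1").fetchone()[0]
print(first, second, current_name)
```

The concurrency check is a single integer comparison on top of the primary key lookup, regardless of how many columns the table has.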
Now, let us see how it handles references. I think that I ran into something which is a problem, consider:
Which generates:
This makes sense only if you can think of the underlying data model. It certainly seems backward to me.
I fixed that, and created four animals, each as the parent of the other:
Which is nice, except that here is the SQL required to generate this screen:
-- statement #1
SELECT [GroupBy1].[A1] AS [C1]
FROM (SELECT COUNT(1) AS [A1]
FROM [dbo].[AnimalsSet] AS [Extent1]) AS [GroupBy1]
-- statement #2
SELECT TOP ( 45 ) [Extent1].[Id] AS [Id],
[Extent1].[Name] AS [Name],
[Extent1].[DateOfBirth] AS [DateOfBirth],
[Extent1].[Species] AS [Species],
[Extent1].[Color] AS [Color],
[Extent1].[Pic] AS [Pic],
[Extent1].[Animals_Animals] AS [Animals_Animals]
FROM (SELECT [Extent1].[Id] AS [Id],
[Extent1].[Name] AS [Name],
[Extent1].[DateOfBirth] AS [DateOfBirth],
[Extent1].[Species] AS [Species],
[Extent1].[Color] AS [Color],
[Extent1].[Pic] AS [Pic],
[Extent1].[Animals_Animals] AS [Animals_Animals],
row_number()
OVER(ORDER BY [Extent1].[Id] ASC) AS [row_number]
FROM [dbo].[AnimalsSet] AS [Extent1]) AS [Extent1]
WHERE [Extent1].[row_number] > 0
ORDER BY [Extent1].[Id] ASC
-- statement #3
SELECT [Extent1].[Id] AS [Id],
[Extent1].[Name] AS [Name],
[Extent1].[DateOfBirth] AS [DateOfBirth],
[Extent1].[Species] AS [Species],
[Extent1].[Color] AS [Color],
[Extent1].[Pic] AS [Pic],
[Extent1].[Animals_Animals] AS [Animals_Animals]
FROM [dbo].[AnimalsSet] AS [Extent1]
WHERE 1 = [Extent1].[Id]
-- statement #4
SELECT [Extent1].[Id] AS [Id],
[Extent1].[Name] AS [Name],
[Extent1].[DateOfBirth] AS [DateOfBirth],
[Extent1].[Species] AS [Species],
[Extent1].[Color] AS [Color],
[Extent1].[Pic] AS [Pic],
[Extent1].[Animals_Animals] AS [Animals_Animals]
FROM [dbo].[AnimalsSet] AS [Extent1]
WHERE 2 = [Extent1].[Id]
-- statement #5
SELECT [Extent1].[Id] AS [Id],
[Extent1].[Name] AS [Name],
[Extent1].[DateOfBirth] AS [DateOfBirth],
[Extent1].[Species] AS [Species],
[Extent1].[Color] AS [Color],
[Extent1].[Pic] AS [Pic],
[Extent1].[Animals_Animals] AS [Animals_Animals]
FROM [dbo].[AnimalsSet] AS [Extent1]
WHERE 3 = [Extent1].[Id]
I told you that there is a Select N+1 built into the product, now didn’t I?
Now, to make things just that much worse, it isn’t actually a Select N+1 that you’ll easily recognize, because this doesn’t happen on a single request. Instead, we have a multi tier Select N+1.
What is actually happening is that in this case, we make the first request to get the data, then we make an additional web request per returned result to get the data about the parent.
And I think that you’ll have to admit that a Parent->>Children association isn’t something that is out of the ordinary. In a typical system, where you may have many associations, this “feature” alone is going to slow the system to a crawl.
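For contrast, here is a minimal sketch of loading the same screen without the N+1, in Python with SQLite as a stand-in. It reuses the AnimalsSet table and the Animals_Animals column from the SQL above, while the run helper and the sample rows are made up for illustration. All the parents are fetched in one additional query instead of one query per row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE AnimalsSet (Id INTEGER PRIMARY KEY, Name TEXT, Animals_Animals INTEGER)")
conn.executemany(
    "INSERT INTO AnimalsSet (Id, Name, Animals_Animals) VALUES (?, ?, ?)",
    [(1, "A", None), (2, "B", 1), (3, "C", 2), (4, "D", 3)],
)

statements = []

def run(sql, args=()):
    """Execute a query and remember it, so we can count the round trips."""
    statements.append(sql)
    return conn.execute(sql, args).fetchall()

animals = run("SELECT Id, Name, Animals_Animals FROM AnimalsSet ORDER BY Id")

# Collect the distinct parent ids and load them all in a single statement.
parent_ids = sorted({parent for _, _, parent in animals if parent is not None})
placeholders = ", ".join("?" for _ in parent_ids)
parents = dict(run("SELECT Id, Name FROM AnimalsSet WHERE Id IN (%s)" % placeholders, parent_ids))

rows = [(name, parents.get(parent)) for _, name, parent in animals]
print(rows)
print(len(statements))
```

The number of statements stays at two no matter how many animals are on the screen, which is the property that the per-row parent lookup loses.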
Profiling LightSwitch using Entity Framework Profiler
This post is to help everyone who wants to understand what LightSwitch is going to do under the covers. It allows you to see exactly what is going on with the database interaction using Entity Framework Profiler.
In your LightSwitch application, switch to file view:
In the server project, add a reference to HibernatingRhinos.Profiler.Appender.v4.0, which you can find in the EF Prof download.
Open the ApplicationDataService file inside the UserCode directory:
Add a static constructor with a call to initialize the entity framework profiler:
public partial class ApplicationDataService
{
static ApplicationDataService()
{
HibernatingRhinos.Profiler.Appender.EntityFramework.EntityFrameworkProfiler.Initialize();
}
}
This is it!
You’re now able to work with the Entity Framework Profiler and see what sort of queries are being generated on your behalf.
LightSwitch: Initial thoughts
As promised, I intend to spend some time today with LightSwitch, and see how it works. Expect a series of posts on the topic. In order to make this a real scenario, I decided that a simple app recording animals and their feed schedule is appropriately simple.
I created the following table:
Note that it has a calculated field, which is computed using:
There are several things to note here:
- ReSharper doesn’t work with LightSwitch, which is a big minus to me.
- The decision to use partial methods has resulted in really ugly code.
- Why is the class called Animals? I would expect to find an inflector at work here.
- Yes, the actual calculation is crap, I know.
This error kept appearing at random:
It appears to be a known issue, but it is incredibly annoying.
This is actually really interesting:
- You can’t really work with the app unless you are running in debug mode. That isn’t the way I usually work, so it is a bit annoying.
- More importantly, it confirms that this is indeed KittyHawk, which was a secret project in 2008 MVP Summit that had some hilarious aspects.
There is something that is really interesting, it takes roughly 5 – 10 seconds to start a LS application. That is a huge amount of time. I am guessing, but I would say that a lot of that is because the entire UI is built dynamically from the data source.
That would be problematic, but acceptable, except that it takes seconds to load data even after the app has been running for a while. For example, take a look here:
This is running on a quad core, 8 GB machine, in 2 tiers mode. It takes about 1 – 2 seconds to load each screen. I was actually able to capture a screen half way loaded. Yes, it is beta, I know. Yes, perf probably isn’t a priority yet, but that is still worrying.
Another issue is that Visual Studio is very slow, busy about 50% of the time. This is whether the LS app is running or not. As a side issue, it is hard to know if the problem is with LS or VS, because of all the problems that VS has normally.
As an example of that, this is me trying to open the UserCode, it took about 10 seconds to do so.
What I like about LS is that getting to a working CRUD sample is very quick. But the problems there are pretty big, even at a cursory examination. More detailed posts touching each topic are coming shortly.
Tuesday, August 24, 2010
#
Runtime code compilation & collectible assemblies are no go
The problem is quite simple, I want to be able to support certain operations on Raven. In order to support those operations, the user needs to be able to submit a linq query to the server. In order to allow this, we need to accept a string, compile it and run it.
So far, it is pretty simple. The problem begins when you consider that assemblies can’t be unloaded. I was very hopeful when I learned about collectible assemblies in .NET 4.0, but they focus exclusively on assemblies generated from System.Reflection.Emit, while my scenario is compiling code on the fly (so I invoke the C# compiler to generate an assembly, then use that).
Collectible assemblies don’t help in this case. Maybe, in C# 5.0, the compiler will use SRE, which will help, but I don’t hold much hope there. I also checked out the Mono.CSharp assembly, hoping that maybe it can do what I wanted it to do, but that suffers from the memory leak as well.
So I turned to the one solution that I knew would work, generating those assemblies in another app domain, and unloading that when it became too full. I kept thinking that I can’t do that because of the slowdown with cross app domain communication, but then I figured that I am violating one of the first rules of performance: You don’t know until you measure it. So I set out to test it.
I am only interested in testing the speed of cross app domain communication, not anything else, so here is my test case:
public class RemoteTransformer : MarshalByRefObject
{
private readonly Transformer transformer = new Transformer();
public JObject Transform(JObject o)
{
return transformer.Transform(o);
}
}
public class Transformer
{
public JObject Transform(JObject o)
{
o["Modified"] = new JValue(true);
return o;
}
}
Running things in the same app domain (base line):
static void Main(string[] args)
{
var t = new RemoteTransformer();
var startNew = Stopwatch.StartNew();
for (int i = 0; i < 100000; i++)
{
var jobj = new JObject(new JProperty("Hello", "There"));
t.Transform(jobj);
}
Console.WriteLine(startNew.ElapsedMilliseconds);
}
This consistently gives results under 200 ms (185ms, 196ms, etc). In other words, we are talking about over 500 operations per millisecond.
What happens when we do this over an AppDomain boundary? The first problem I ran into was that the Json objects weren’t serializable, but that was easy to fix. Here is the code:
static void Main(string[] args)
{
var appDomain = AppDomain.CreateDomain("remote");
var t = (RemoteTransformer)appDomain.CreateInstanceAndUnwrap(typeof(RemoteTransformer).Assembly.FullName, typeof(RemoteTransformer).FullName);
var startNew = Stopwatch.StartNew();
for (int i = 0; i < 100000; i++)
{
var jobj = new JObject(new JProperty("Hello", "There"));
t.Transform(jobj);
}
Console.WriteLine(startNew.ElapsedMilliseconds);
}
And that ran for close to 8 seconds (7871 ms). That is over 40 times slower, or just about 12 operations per millisecond.
To give you some indication about the timing, this means that an operation over 1 million documents would spend about 1.3 minutes just serializing data across app domains.
That is… long, but it might be acceptable, I need to think about this more.
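The arithmetic behind those figures can be checked with a few lines of Python. The only inputs are the numbers quoted above (100,000 iterations, one of the baseline runs at 196 ms, and the cross app domain run at 7871 ms), everything else is derived from them.

```python
iterations = 100_000
same_domain_ms = 196      # one of the baseline runs quoted above
cross_domain_ms = 7871    # the cross app domain run quoted above

same_rate = iterations / same_domain_ms        # operations per millisecond
cross_rate = iterations / cross_domain_ms      # operations per millisecond
slowdown = cross_domain_ms / same_domain_ms    # how many times slower

# Time to push 1 million documents across the boundary, in minutes.
minutes_per_million = 1_000_000 / cross_rate / 1000 / 60

print(round(same_rate), round(cross_rate, 1), round(slowdown), round(minutes_per_million, 1))
```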
Monday, August 23, 2010
#
Database assisted denormalization
Let us say that I have the homepage of the application, where we display Blogs with their Post count, using the following query:
select
dbo.Blogs.Id,
dbo.Blogs.Title,
dbo.Blogs.Subtitle,
(select COUNT(*) from Posts where Posts.BlogId = Blogs.Id) as PostCount
from dbo.Blogs
Given my thoughts on denormalization, and read vs. write costs, it seems a little wasteful to run the aggregate all the time.
I can always add a PostCount property to the Blogs table, but that would require me to manage that myself, and I thought that I might see whether the database can do it for me.
This isn’t a conclusive post, it details what I tried, and what I think is happening, but it isn’t the end all be all. Moreover, I ran my tests on SQL Server 2008 R2 only, not on anything else. I would like to hear what you think of this.
My first thought was to create this as a persisted computed column:
ALTER TABLE Blogs
ADD PostCount AS (select COUNT(*) from Posts where Posts.BlogId = Blogs.Id) PERSISTED
But you can’t create computed columns that use subqueries. I would find that easier to understand if it applied only to persisted computed columns, because that would give the database a hell of a time figuring out when that computed column needs to be updated, but I am actually surprised that normal computed columns don’t support subqueries.
Given that my first attempt failed, I decided to try to create a materialized view for the data that I needed. Materialized views in SQL Server are called indexed views. There are several things to note here. You can’t use subqueries here either (likely because the DB couldn’t figure out which row in the index to update if you were using subqueries), but have to use joins.
I created a data set of 1,048,576 rows in the blogs table and 20,971,520 posts, which I think should be enough to give me real data.
Then, I issued the following query:
select
dbo.Blogs.Id,
dbo.Blogs.Title,
dbo.Blogs.Subtitle,
count_big(*) as PostCount
from dbo.Blogs left join dbo.Posts
on dbo.Blogs.Id = dbo.Posts.BlogId
where dbo.Blogs.Id = 365819
group by dbo.Blogs.Id,
dbo.Blogs.Title,
dbo.Blogs.Subtitle
This is before I created anything, just to give me some idea about what kind of performance (and query plan) I can expect.
Query duration: 13 seconds.
And the execution plan:
The suggest indexes feature is one of the best reasons to move to SSMS 2008, in my opinion.
Following the suggestion, I created:
CREATE NONCLUSTERED INDEX [IDX_Posts_ByBlogID]
ON [dbo].[Posts] ([BlogId])
And then I reissued the query. It completed in 0 seconds with the following execution plan:
After building Raven, I have a much better understanding of how databases operate internally, and I can completely follow how the introduction of this index can completely change the game for this query.
Just to point out, the result of this query is:
Id Title Subtitle PostCount
----------- --------------------- ---------------------- --------------------
365819 The lazy blog hibernating in summer 1310720
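The effect of IDX_Posts_ByBlogID on the plan can be reproduced in miniature. This is a sketch in Python with SQLite standing in for SQL Server, so the plan text is not what SSMS shows, and the simplified Posts table is an assumption. It asks for the plan of the post count for one blog before and after the index exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Posts (Id INTEGER PRIMARY KEY, BlogId INTEGER, Title TEXT)")

def plan_for_post_count():
    """Return SQLite's plan for counting the posts of a single blog."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM Posts WHERE BlogId = ?", (365819,)
    ).fetchall()
    return " | ".join(row[3] for row in rows)

before = plan_for_post_count()
conn.execute("CREATE INDEX IDX_Posts_ByBlogID ON Posts (BlogId)")
after = plan_for_post_count()

print(before)  # a scan over all of Posts
print(after)   # a search using IDX_Posts_ByBlogID
```

Without the index, the only way to count the posts of one blog is to read every post. With it, the database can jump straight to the matching range of entries.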
I decided to see what using a view (and then indexed view) will give me. I dropped the IDX_Posts_ByBlogID index and created the following view:
CREATE VIEW BlogsWithPostCount
WITH SCHEMABINDING
AS
select
dbo.Blogs.Id,
dbo.Blogs.Title,
dbo.Blogs.Subtitle,
count_big(*) as PostCount
from dbo.Blogs join dbo.Posts
on dbo.Blogs.Id = dbo.Posts.BlogId
group by dbo.Blogs.Id,
dbo.Blogs.Title,
dbo.Blogs.Subtitle
After which I issued the following query:
select
Id,
Title,
Subtitle,
PostCount
from BlogsWithPostCount
where Id = 365819
This had the exact same behavior as the first query (13 seconds and the suggestion for adding the index).
I then added the following index to the view:
CREATE UNIQUE CLUSTERED INDEX IDX_BlogsWithPostCount
ON BlogsWithPostCount (Id)
And then reissued the same query on the view. It had absolutely no effect on the query (13 seconds and the suggestion for adding the index). This makes sense, if you understand how the database is actually treating this.
The database just created an index on the results of the view, but it only indexed the columns that we told it about, which means that it still needs to compute the PostCount. To make things more interesting, you can’t add the PostCount to the index (thus saving the need to recalculate it).
Some points that are worth talking about:
- Adding IDX_Posts_ByBlogID index resulted in a significant speed increase
- There doesn’t seem to be a good way to perform materialization of the query in the database (this applies to SQL Server only, mind you, maybe Oracle does better here, I am not sure).
In other words, the best solution that I have for this is to either accept the cost per read on the RDBMS and mitigate that with proper indexes, or create a PostCount column in the Blogs table and manage that yourself. I would like your critique on my attempt, and additional information about whether what I am trying to do is possible in other RDBMS.
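One possible shape for the "manage that yourself" option is a pair of triggers that keep the column up to date. This is a sketch only, in Python with SQLite rather than SQL Server. The trigger names are made up, and triggers are just one way to do the bookkeeping (application code is another). Moving a post between blogs is not handled here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Blogs (Id INTEGER PRIMARY KEY, Title TEXT, PostCount INTEGER NOT NULL DEFAULT 0);
CREATE TABLE Posts (Id INTEGER PRIMARY KEY, BlogId INTEGER NOT NULL, Title TEXT);

-- Keep Blogs.PostCount in step with inserts and deletes on Posts.
CREATE TRIGGER Posts_AfterInsert AFTER INSERT ON Posts
BEGIN
    UPDATE Blogs SET PostCount = PostCount + 1 WHERE Id = NEW.BlogId;
END;
CREATE TRIGGER Posts_AfterDelete AFTER DELETE ON Posts
BEGIN
    UPDATE Blogs SET PostCount = PostCount - 1 WHERE Id = OLD.BlogId;
END;
""")

conn.execute("INSERT INTO Blogs (Id, Title) VALUES (1, 'The lazy blog')")
conn.executemany("INSERT INTO Posts (BlogId, Title) VALUES (?, ?)", [(1, "a"), (1, "b"), (1, "c")])
conn.execute("DELETE FROM Posts WHERE Title = 'b'")

# Reading the count is now a plain column read, no aggregate involved.
post_count = conn.execute("SELECT PostCount FROM Blogs WHERE Id = 1").fetchone()[0]
print(post_count)
```

This moves the cost from every read to every write on Posts, which is the read vs. write trade-off discussed at the start of the post.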
Saturday, August 07, 2010
#
Finding chrome bugs
That one was annoying to figure out. Take a look at the following code:
static void Main(string[] args)
{
var listener = new HttpListener();
listener.Prefixes.Add("http://+:8080/");
listener.Start();
Console.WriteLine("Started");
while(true)
{
var context = listener.GetContext();
context.Response.Headers["Content-Encoding"] = "deflate";
context.Response.ContentType = "application/json";
using(var gzip = new DeflateStream(context.Response.OutputStream, CompressionMode.Compress))
using(var writer = new StreamWriter(gzip, Encoding.UTF8))
{
writer.Write("{\"CountOfIndexes\":1,\"ApproximateTaskCount\":0,\"CountOfDocuments\":0}");
writer.Flush();
gzip.Flush();
}
context.Response.Close();
}
}
Firefox and IE have no trouble using this. But here is how it looks on Chrome.
To make matters worse, pay attention to the conditions of the bug:
- If I use Gzip instead of deflate, it works.
- If I use “text/plain” instead of “application/json”, it works.
- If I tunnel this through Fiddler, it works.
I hate stupid bugs like that.
Friday, August 06, 2010
#
Hunt the bug
The following code will throw under certain circumstances, what are they?
public class Global : HttpApplication
{
public void Application_Start(object sender, EventArgs e)
{
HttpUtility.UrlEncode("Error inside!");
}
}
Hint, the exception will not be raised because of transient conditions such as low memory.
What are the conditions in which it would throw, and why?
Hint #2, I had to write my own (well, take the one from Mono and modify it) HttpUtility to avoid this bug.
ARGH!
Friday, August 20, 2010
#
Application databases and external integration points
Dave has an interesting requirement in his project:
We're not in control of where the data is located, how it's stored and in what configuration. In most cases employees need to be retrieved from a Active Directory (There's is no 'login', the Window Identity determines what a user can or can't do). Customer contacts are usually handled by the helpdesk department and each contact moment is logged in a helpdesk database. The customer (account information) itself often needs to be retrieved from an IBM DB2 database.
What you have is not one application that needs to access different data sources. That would be the wrong way to think about this, because it introduces a whole lot of complexity into the application.
It is much better to structure the application as an independent application with each integration point made explicit. Instead of touching the DB/2 database, you put a service on it and access that.
This isn’t just “oh, SOA nonsense again”, it is an important distinction. When you tie yourself directly to so many external integration points, you are also ensuring that whenever there is a change in one of them, you are going to be impacted. When you put a service boundary between you and the integration point (even if you have to build the service), the effect is much less noticeable.
Also, did you notice the blue lines going from the databases? Those are background ETL processes, replicating data to/from the databases. It allows us to handle situations where the integration points are not available.
In short, design your application so it doesn’t stick its nose into other people’s databases. If you need data from another database, put a service there, or replicate it. You’ll thank me when your app stays up.
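As a rough illustration of what an explicit integration point can look like from the application side, here is a small Python sketch. Everything in it is hypothetical (the CustomerGateway class, the fetch_from_service function, the customer shape), and it simplifies the replication by refreshing the local copy on each successful read instead of using a background ETL process. The application only ever talks to the gateway, and the gateway falls back to the replicated copy when the service is unavailable.

```python
from typing import Callable, Dict, Optional

class CustomerGateway:
    """Application-side view of the customer integration point (hypothetical)."""

    def __init__(self, fetch_remote: Callable[[str], dict]):
        self._fetch_remote = fetch_remote        # call to the service in front of the external DB
        self._replica: Dict[str, dict] = {}      # locally replicated copy

    def get(self, customer_id: str) -> Optional[dict]:
        try:
            customer = self._fetch_remote(customer_id)
            self._replica[customer_id] = customer  # refresh the local copy
            return customer
        except ConnectionError:
            # The integration point is down: serve the last replicated copy, if any.
            return self._replica.get(customer_id)

state = {"up": True}

def fetch_from_service(customer_id: str) -> dict:
    """Stand-in for the real service call; fails when the service is down."""
    if not state["up"]:
        raise ConnectionError("service unavailable")
    return {"id": customer_id, "name": "Example Customer"}

gateway = CustomerGateway(fetch_from_service)
online = gateway.get("42")    # served by the service
state["up"] = False
offline = gateway.get("42")   # served from the local replica
unknown = gateway.get("99")   # never replicated, so nothing to return
print(online == offline, unknown)
```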
Thursday, August 19, 2010
#
NH Prof & usage data
There seems to be some suspicion about the usage data from NH Prof that I published recently.
I would like to apologize for responding late to the comments, I know that there are some people who believe that I have installed a 3G chip directly to my head, but I actually was busy in the real world and didn’t look at my email until recently. The blog runs on auto pilot just so I’ll be able to do that, but sometimes it does give the wrong impression.
So, what does NH Prof “phone home” about?
Well, the data is actually divided into two distinct pieces. Most of the data (numbers, usages, geographic location, etc) actually comes from looking at the server logs for the update check.
Another piece of data that the profiler reports is feature usage. There are about 20 – 30 individual features that are being tracked for usage. What does it mean, tracking a feature?
Well, here are three examples that show what gets reported:
There is no way to correlate this data to an individual user, nor is there a way to track the behavior of a single user.
I use this data mainly in order to see what features are being used most often (therefore deserving the most attention, optimizations, etc).
Those are mentioned in the product documentation.
To summarize:
- I am not stealing your connection strings.
- I don’t gather any personally identifying data (and I am somewhat at a loss to understand what I would do with it even if I did).
- There is never any data about what you are profiling being sent anywhere.
I hope this clears things up.