SQL Gotchas
I think you can imagine the amount of paint involved in having a query behave in an unexpected manner. I have run into both of those recently. This one had me doubting my sanity (imagine this on a table with several hundred thousands records, with a fairly complex query around it:
select 1
where 1 not in (2,3,null)
select 1 where 'e' = 'e '
NHibernate for SQL
Here is a little known fact, you can ask NHibernate to take any arbitrary result set from the database and turn it into an object.
return session.CreateSqlQuery(@"
select customer.Name as CustomerName,
count( case order.Type when 1 then 1 else 0 end ) as FastOrder,
count( case order.Type when 1 then 2 else 0 end ) as SlowOrder,
count( case order.Type when 1 then 3 else 0 end ) as AirMailOrder,
from Orders order join Customers customer
on order.Customer = customer.Id
group by customer.Name
")
.AddScalar("CustomerName", NHibernateUtil.String)
.AddScalar("FastOrder", NHibernateUtil.Int64)
.AddScalar("SlowOrder", NHibernateUtil.Int64)
.AddScalar("AirMailOrder", NHibernateUtil.Int64)
.SetResultTransformer(new AliasToBeanResultTransformer(typeof (CustomerOrderTypesCount)))
.List();
As you can probably imagine, this will return a list of CustomerOrderTypesCount objects, with the respective properties filled with the values from the result set.
You can also do it in the configuration, and even map full blown entities along side other interesting values.
SQL Challenge, getting historical data: Performance characteristics
I just run the query over a table with 120,000~ rows, and it completed in less than second and change.
After playing with it a bit, I decided to skip the function and put the results in a table and map that, it is easier to handle it that way with NHibernate, and I can put an index on it that would save a table scan on the months result.
The major cost is a clustered index scan (95%) and nested loops join (5%). Fun stuff :-)
I am getting deep into NHibernate's queries recently, extremely powerful, especially if you combine it with your own customer dialect.
SQL Challenge, getting historical data: solution
As it turned out, this isn't that hard, all I needed to do was remember my trusty DateRange function, modify it to on months instead of days, and it was off to the races. This type of code does make my head hurt a tiny bit, it packs a lot into it.
DECLARE @start DATETIME,
@end DATETIME
SET @start = '2007-04-01'
set @end = '2007-10-01'
SELECT YEAR(Currentdate) Year,
MONTH(Currentdate) Month,
COUNT(Bug.Id) OpenedBugCount
FROM Dbo.MonthRange(@start,@end )
JOIN Bugs Bug
ON YEAR(Currentdate) >= YEAR(Bug.Openedat)
AND MONTH(Currentdate) >= MONTH(Bug.Openedat)
AND YEAR(Currentdate) <= YEAR(ISNULL(Bug.Closedat,Currentdate))
AND MONTH(Currentdate) < MONTH(ISNULL(Bug.Closedat,DateAdd(MONTH,1,Currentdate)))
GROUP BY YEAR(Currentdate),MONTH(Currentdate)
I would get it into NHibernate on Sunday, should be fun, it is about the ninth select is the super report... :-)
SQL Challenge, getting historical data
Here is an interesting problem that I had to face at work. I need to get historical data from this table. Basically, I need to count the number of opened bugs per months.
Id OpenedAt ClosedAt
--- ---------- -----------
1 2007-04-02 2007-09-01
2 2007-07-01 NULL
3 2007-07-01 2007-07-23
A solution to this with the data above should print (for 2007-04 to 2007-10):
| Month | Year | Opened Bugs |
| 4 | 2007 | 1 |
| 5 | 2007 | 1 |
| 6 | 2007 | 1 |
| 7 | 2007 | 2 |
| 8 | 2007 | 2 |
| 9 | 2007 | 1 |
| 10 | 2007 | 1 |
Have fun :-)
Sending arrays to SQL Server: Xml vs. Comma Separated Values
I spoke before about using the XML capabilities of SQL Server in order to easily pass list of values to SQL Server. I thought that this was a pretty good way to go, until I started to look at the performance numbers.
Let us take a look at this simple query:
DECLARE @ids xml
SET @ids = '<ids>
<id>ALFKI</id>
...
<id>SPLIR</id>
</ids>'SELECT * FROM Customers
WHERE CustomerID IN (SELECT ParamValues.ID.value('.','NVARCHAR(20)')
FROM @ids .nodes('/ids/id') as ParamValues(ID) )
This simple query has a fairly involved execution plan:
This looks to me like way too much stuff for such a simple thing, especially when I see this:
So the XML stuff is taking up 98% of the query?
I then checked the second route, using fnSplit UDF from here. Using it, I got this result:
So it looks like it is significantly more efficient than the XML counter part.
But what about the larger scheme? Running the fnSplit over 9,100 items got me a query that took nearly 45 seconds, while a XML approach over the same set of data had no measurable time over just 91 records.
I then tried a simple SqlCLR function, and got this the same performance from it:
The code for the function is:
[SqlFunction(FillRowMethodName = "FillRow", TableDefinition = "str NVARCHAR(MAX)")] public static IEnumerable Split(SqlString str) { if (str.IsNull) return null; return str.Value.Split(','); } public static void FillRow(object obj, out SqlString str) { string val = (string) obj; if (string.IsNullOrEmpty(val)) str = new SqlString(); else str = new SqlString(val); }
As you can probably guess, there are all sorts of stuff that you can do to make it better if you really want, but this looks like a very good approach already.
Tripling the size of the data we are talking about to ~30,000 items had no measurable difference that I could see.
Obviously, when you are talking about those numbers, an IN is probably not something that you want to use.
Why I have SSIS?!
ERROR: Unused variable declaration, line: "DateTime goingAwayFromOffice = DateTime.Now.AddMinutes(-45);"
SSIS Scheduling Conflicts?
I have two SSIS packages that cannot run concurrently, however, one must run once every 5 minutes, and the second every hour, so they are assured to conflict very soon. I can handle this situation manually, but I wanted to know if there was a builtin way to handle that.
SSIS Integration Woes
I am not sure if it is me or SSIS, but something is very wrong here. I just spend an ungodly amount of time trying to solve this simple scenario:
- Source database is Oracle, Destination database is SQL Server
- A table with large amount of rows
- The table has a LastModifed field
- Every few minutes, an SSIS process should wake and grab the latest updates from the table, then update a state table, with the last updated date, for the next run
It is something that should take minutes to handle with code, but even now, I just can't make it happen. The major issue here is very simple, I can't pass parameters to the query. The OLE DB Source won't accept parameters when using Oracle, and when using the DateReader source, it simply has no way to pass input parameters that I can find.
Parameters cannot be extracted from the SQL command. The provider might not help to parse parameter information from the command. In that case, use the "SQL command from variable" access mode, in which the entire SQL command is stored in a variable.
ADDITIONAL INFORMATION:
Provider cannot derive parameter information and SetParameterInfo has not been called. (Microsoft OLE DB Provider for Oracle)
I am not an SSIS expert, but I consider myself technically able, and I certainly knows how to search google. No answer on this one. I went and wrote the logic in an Oracle stored procedure, then I faced another interesting issue. There doesn't seem to be a way to call an Oracle SP from SSIS.
SSIS supposed to stand for Integration, and while I wouldn't assume that it would integrate with the DB engine that I wrote last month in 2AM, I don't think that Oracle is such a minor player that integration with it should be so hard.
Please, someone, make my day and show me that I am stupid, because I am very close to throwing the whole thing out and writing it in an envrionment that doesn't force me to jump through hops just to get the simplest things done. (And have much better error reporting, as an aside)
Remote Keys: Cross Database Foreign Keys
Here is an interesting product (via Larkware). RemoteKeys Adds cross-database constraints to SQL Server, which is something that is really nice in many scenarios.
Specifically, I am a firm believer that database == application, and that separate applications should use separate databases. (Versioning, deployment, WhoBrokeMyQuery, etc). This product seems to give a nice answer to Referential integrity concerns.
I was interested to know how they did it, and it looks like a good UI on top of Linked Server + INSTEAD OF Triggers. I really like the idea, but a couple of things needs to be considered when using it are:
- Availability concerns - Does both databases have the same availability? If not, what happens when either database goes down? The way it works, both databases needs to talk to each other for CUD.
- Performance concerns - If we are talking about remote databases, then it is something that you really should consider carefully. The example on their site has US Orders and Europe Orders with remote FK to Customers Database. The problem is that this means that the Customers database needs to issue two remote calls (probably per statement). There may be some smarts there about consolidating checks, but I didn't check.
Cool idea nonetheless, although if I ever see the need, I will create the trigger via code gen and be done with it.
Transactions and concurrency
I just had a head-scratching, hair-pulling, what-the-hell-is-going-on bug. Basically, a classic producer / consumer issue. Each work item is composed of several sub-items, which should always be processed as a single unit (exactly one time). The problem was that I suddenly started to get duplicate processing of the same work item, but with different sub-items each time.
I went over the code with a comb, and I couldn't see anything wrong, I investigated the database, and everything was fine there as well. I knew that I was in trouble when I considerred going down to the SQL Server protocol level and check if somehow network problems were causing this issue.
Here is a simplified version of the producer (and yes, I would never write this kind of code for production, test, or anything but a short and to the point demo). As you can see, it merely generate 500 records into a table.
private static void Producer()
{
SqlConnection connection = new SqlConnection(connectionString);
int count = 0;
while (true)
{
connection.Open();
SqlTransaction sqlTransaction = connection.BeginTransaction(isolationLevel);
for (int i = 0; i < 500; i++)
{
SqlCommand sqlCommand = connection.CreateCommand();
sqlCommand.Transaction = sqlTransaction;
sqlCommand.CommandText = "INSERT INTO t (Id) VALUES(@p1)";
sqlCommand.Parameters.AddWithValue("@p1", count);
sqlCommand.ExecuteNonQuery();
sqlCommand.Dispose();
}
sqlTransaction.Commit();
Console.WriteLine("Wrote 500 records with count " + count);
count += 1;
connection.Close();
}
}
And here is the consumer, which read from the table, and ensure that it reads in batches of 500:
private static void Consumer()
{
SqlConnection connection = new SqlConnection(connectionString);
while (true)
{
connection.Open();
SqlTransaction sqlTransaction = connection.BeginTransaction(isolationLevel);
SqlCommand sqlCommand = connection.CreateCommand();
sqlCommand.Transaction = sqlTransaction;
sqlCommand.CommandText = "SELECT COUNT(*) FROM t GROUP BY id";
SqlDataReader sqlDataReader = sqlCommand.ExecuteReader();
if (sqlDataReader.RecordsAffected != -1)
Console.WriteLine("Read: {0}", sqlDataReader.RecordsAffected);
while (sqlDataReader.Read())
{
int count = sqlDataReader.GetInt32(0);
Console.WriteLine("Count = {0}", count);
if (count != 500)
Environment.Exit(1);
}
sqlDataReader.Dispose();
sqlCommand.Dispose();
sqlTransaction.Commit();
connection.Close();
}
}
Note that I have painted the isolationLevel red in both. Here is the code that run both methods:
private static void Main()
{
Delete();
new Thread(Producer).Start();
new Thread(Consumer).Start();
}
If I set isolationLevel to ReadCommited or RepeatableRead, this consistently fails within a couple of seconds, it manage to do partial read of the records that were inserted by the consumer. I I set the isolationLevel to Serializable or Snapshot, it behaves as expected.
I may be missing something, but I would expect ReadCommited to only allow the second transaction to read... well, commited rows. Just to point out, the real scenario doesn't involve aggregation, and I am seeing the same issue. I suppose that the new records were commited (and thus made visible) when the query already scanned that part of the table, and thus it missed some one the rows, while snapshot and serializable force either a static image all the way through or waiting till the end.
Any comments?
Calculating most popular posts with SubText
How do you get the most popular posts? There are three criteria that you want to check:
- Aggregator Views
- Web Views
- Comments
I decided to give each the following ranking:
- Each aggregator view counts as 10 points - a lot of people are subscribe to the feed and so they are more likely to go through all the posts
- Each web view counts counts as 15 points - somebody bothered to go to the site, or arrive via search / link.
- Each comment gets 35 points - somebody took the time to comment on what I said
This means that a post doesn't get too wieghted toward too much comments (which can happen if a discussion is started) but not enough views.
SELECT TOP 15
EntryId,
Title,
WebCount,
AggCount,
CommentsPerPost,
( WebCount * 15 ) + ( AggCount * 10 ) + ( CommentsPerPost * 35 ) weightedScore
FROM subtext_EntryViewCount,
( SELECT COUNT(dbo.subtext_Feedback.Id) CommentsPerPost,
dbo.subtext_Content.Id
FROM dbo.subtext_Feedback
RIGHT JOIN dbo.subtext_Content ON dbo.subtext_Content.Id = dbo.subtext_Feedback.EntryId
WHERE FeedbackType = 1
GROUP BY dbo.subtext_Content.Id
) Comments,
Subtext_Content
WHERE Comments.Id = EntryId
AND Subtext_Content.Id = EntryId
ORDER BY ( WebCount * 15 ) + ( AggCount * 10 ) + ( CommentsPerPost * 35 ) DESC
Blog Stats: SQL
The best part about using a blog based on a database server is that you get to run all sorts of interesting queries against it.
Posts #:
SELECT COUNT(*)
FROM dbo.subtext_Content
Comments #
SELECT COUNT(*),
FeedbackType
FROM dbo.subtext_Feedback
GROUP BY FeedbackType
(FeedbackType: 1 = Comment, 2 = Trackback)
Avg. Comments Per Post:
SELECT AVG(CommentsPerPost)
FROM ( SELECT COUNT(dbo.subtext_Feedback.Id) CommentsPerPost,
dbo.subtext_Content.Id
FROM dbo.subtext_Feedback
RIGHT JOIN dbo.subtext_Content
ON dbo.subtext_Content.Id = dbo.subtext_Feedback.EntryId
WHERE FeedbackType = 1
GROUP BY dbo.subtext_Content.Id
) commentsPerPost
Post with most comments:
SELECT TOP 1
COUNT(dbo.subtext_Feedback.Id) CommentsPerPost,
dbo.subtext_Content.Id
FROM dbo.subtext_Feedback
RIGHT JOIN dbo.subtext_Content ON dbo.subtext_Content.Id = dbo.subtext_Feedback.EntryId
WHERE FeedbackType = 1
GROUP BY dbo.subtext_Content.Id
ORDER BY COUNT(dbo.subtext_Feedback.Id) DESC
Avg. Posts per Month:
SELECT AVG(PostsPerMonth)
FROM ( SELECT DATEADD(year, YEAR(DateAdded) - 1900,
DATEADD(month, MONTH(DateAdded)-1, 0)) Date,
COUNT(*) PostsPerMonth
FROM subtext_Content
GROUP BY MONTH(DateAdded),
YEAR(DateAdded)
) postsPerMonth
Avg. Posts per Week
SELECT AVG(postsPerWeek)
FROM ( SELECT DATEPART(week, dateadded) weekNum,
YEAR(dateadded) [year],
COUNT(*) postsPerWeek
FROM subtext_Content
GROUP BY DATEPART(week, dateadded),
YEAR(DateAdded)
) postsPerWeek
Avg. Comments per Month:
SELECT DATEADD(year, YEAR(DateCreated) - 1900,
DATEADD(month, MONTH(DateCreated) - 1, 0)) Date,
COUNT(*) CommentsPerMonth
FROM subtext_Feedback
WHERE feedbacktype = 1
GROUP BY MONTH(DateCreated),
YEAR(DateCreated)
Open Sourcing SQL Server
[Via Stefano] I reached this post about a thought experiment regarding open sourcing SQL Server under the GPL. The author is a former Microsoft employee, and he spends quite a bit of time trying to explain how GPLing SQL Server will not hurt Microsoft business model. The terms that he suggests are similar to the way MySQL works, a GPLed products for free and a commercial version + support avialable for $$$.
I just don't see the point here. SQL Server is one of those products that I feel that I can rely on with my eyes closed, and while having the source might be nice to DB geeks out there, I just don't see the big value there. (Having the source to the client UI would be nicer, but even this one has a good extention model for what I want.)
Frankly, as an OSS proponent, I just don't see the point. Microsoft doesn't stand to gain anything from this, except lose potential revenues. From my projects alone, I can count about ~100,000$ in licensing fees for SQL Server that wouldn't have made it if SQL Server was avialable under the GPL*.
SQL Server is one of the revenue streams of Microsoft, and I don't see this changing. Offhand, I can't recall any successfull, revenue generating, product that was open sourced. If I were to make that decision, I would have definately said NO.
The BCL, on the other hand ... :-)
* Choosing SQL Server is mostly influenced by the fast that Microsoft is behind it, that it has great tools and a lot of community behind it.
Why this is failing?
Consider the following query (heavily reduced and thus rendered meaningless)...
select *
from Users user0_,
Forum forum1_,
Messages message2_
left outer join Users user3_ on forum1_.Manager = user3_.Id
If you will try to run it, it will fail, complaining:
The multi-part identifier "forum1_.Manager" could not be bound.
This is even though this should work. I run the same query on MySQL, and it passed cleanly (although this is not saying anything, of course :-) ). I currently don't have a handy Oracle box to test this, so I have to rely on just those two data points. It works on MySQL, fails in MS SQL.
The issues seems to be the orderring in the from clause. If I replace Forum and Messages, it works, but I am not sure why.
Unit testing with NHibernate / Active Record
One of the more difficult regions to test in an application is the data access layer. It is difficult to test for several reasons:
- It is usually complicated - fetching data effectively is not something trivial in many cases.
- It can be highly dependant on the platform you are using, and moving between platforms can be a PITA.
- It is usually hard to mock effectively.
- Database by their natures keep state, tests should be isolated.
- It is slow - we are talking out of process calls at best, remote system calls at worst.
I am a big fan of NHibernate, and I consider myself fairly proficent in mocking, and I find it very hard to mock data access code effectively, and that is when NHibernate already provides a very easy set of interfaces to work with.
The issue is not with calls like this:
This is very easy to mock.
The issue is with calls like this one:
Post post = (Post)session.CreateCriteria(typeof(Post))
.Add(Expresion.Like("Title", title))
.AddOrder(Order.Asc("PublishedAt"))
.SetFirstResult(0)
.SetMaxResult(1)
.UniqueResult();
Trying to mock that is going to be... painful. And this is a relatively simple query. What is worse are a series of queries, which work together to return a common result. When my setup code crossed the 500 lines of highly recursive mocking just to give the test a reasonable place to work with, I knew that I had an issue.
I could break it up to more rigid interface, but that completely ignore the point of being flexible. The above query is hard coded, but pretty often I find myself building those dynamically, which is not possible using rigid (but more easily mockable) interfaces.Please note that I am not talking about the feasability of mocking those, I have done it, it is possible, if lenghty, I am talking about maintainability and the ability to read what was the intention after six months has passed. Bummer, isn't it?
Then I thought about SQLite. SQLite, despite their documnetations shortcoming, is a lightwieght database engine that supports an in memory database. What is more, NHibernate already supports it natively (which saved me the effort :-) ). SQLite is an in-process database, and in-memory databases are wiped when their connections are closed. So far we removed two major obstacles, the statefulness of the databasess, and the inherent slowdowns we are going across process/machine boundaries. In fact, since we are using entirely in-memory database, we don't even touch the file system :-).
But we have the issue of moving between platforms. We can't just port the database to SQLite just for testing. Or can we?
First, let us define what we are talking about. I am not going to performance testing (except maybe SELECT N+1 issues) on SQLite, this require the production (or staging) database with a set of tools to analyze and optimize what we are doing.
So, if we ruled perf testing from the set of scenarios we are looking for, we don't need large amounts of data. NHibernate will create a schema for us, free of charge, and it can handle several databases transperantly. We don't need to mock anything, it looks like we are golden.
I got the following code:
[TestFixture]
public class InMemoryTests : NHibernateInMemoryTestFixtureBase
{
private ISession session;
[TestFixtureSetUp]
public void OneTimeTestInitialize()
{
OneTimeInitalize(typeof(SMS).Assembly);
}
[SetUp]
public void TestInitialize()
{
session = this.CreateSession();
}
[TearDown]
public void TestCleanup()
{
session.Dispose();
}
[Test]
public void CanSaveAndLoadSMS()
{
SMS sms = new SMS();
sms.Message = "R U There?";
session.Save(sms);
session.Flush();
session.Evict(sms);//remove from session cache
SMS loaded = session.Load<SMS>(sms.Id);
Assert.AreEqual(sms.Message, loaded.Message);
}
}
Initialize the framework, create a session, and run. Notice that the test doesn't care what it is working against. It just test that we can test/load an entity. Let us look at the base class:
public class NHibernateInMemoryTestFixtureBase
{
protected static ISessionFactory sessionFactory;
protected static Configuration configuration;
/// <summary>
/// Initialize NHibernate and builds a session factory
/// Note, this is a costly call so it will be executed only one.
/// </summary>
public static void OneTimeInitalize(params Assembly [] assemblies)
{
if(sessionFactory!=null)
return;
Hashtable properties = new Hashtable();
properties.Add("hibernate.connection.driver_class", "NHibernate.Driver.SQLite20Driver");
properties.Add("hibernate.dialect", "NHibernate.Dialect.SQLiteDialect");
properties.Add("hibernate.connection.provider", "NHibernate.Connection.DriverConnectionProvider");
properties.Add("hibernate.connection.connection_string", "Data Source=:memory:;Version=3;New=True;");
configuration = new Configuration();
configuration.Properties = properties;
foreach (Assembly assembly in assemblies)
{
configuration = configuration.AddAssembly(assembly);
}
sessionFactory = configuration.BuildSessionFactory();
}
public ISession CreateSession()
{
ISession openSession = sessionFactory.OpenSession();
IDbConnection connection = openSession.Connection;
new SchemaExport(configuration).Execute(false,true,false,true,connection,null);
return openSession;
}
}
Here we just initialize NHibernate with an in memory connection string and a SQLite provider. Then, when we need to grab a session, we make sure to initialize the database with our schema. Disposing the session closes the connection, which frees the database.
So far we handled the following issues: Slow, Stateful, Hard to mock, platform dependant. We have seen that none of them apply to the issue at hand. Now, what about the last one, testing complicate data fetching strategies?
Well, that is what we do here, aren't we? In this case, true, we aren't doing any queries, but it is the prinicpal that matters. Looking at a database through NHibernate tinted glasses, they look pretty much the same. And a querying strategy that works on one should certainly work on another (with some obvious exceptions). I am much more concerend about getting the correct data than how I get it.
The beauty here is that we don't need to do anything special to make this happen. Just let the tools do their work. To use Active Record with this approach, you need replace the calls to the configuration with calls to ActiveRecordStarter, and that is about it.
Even though those tests execute code from the business logic to the database, they are still unit tests. To take Jeremy's Qualities of a Unit Test as an example, unit tests should be:
- Atomic - each test gets each own database instance, they can't affect each other.
- Order indepenent and isolated - same as above, once the test finished, its database is back to the Great Heap in the Sky.
- Intention Revealing - throw new OutOfScopeException("Issue with test, not the technique");
- Easy to setup - check above for the initial setup, afterward, it is merely an issue of shoving stuff into the database using NH's facilities, which is very easy, in my opinion.
- Fast - It is an in memory database, it is fast. For comparison, running this test on SQL Server (locahost) runs at about 4.8 seconds (reported from TestDrive.Net, and include all the initialization) running it on SQLite results in 3.4 seconds. Running an empty method on TestDriven.Net takes about 0.8 seconds. Most of the time is spent in the initial configuration of NHibernate, though.
Hope this helps....
Complexity *= 10, Beauty *= 100
Trying to provide even more capabilities to the NHibernate Query Generator is turning tricky. The issue is that I would like to make stuff as natrual as possible. Due to this, I decided to take a break from my usual habit (sitting in from of the IDE and waiting for a nervous tick to cause characters to appear on the screen), and actually think about what I want to do. Then I went crazy with a mess of overloaded operators and generic query objects.
The result, however, is this:
Post.FindAll(
Where.Post.Blog.Author == ayende &&
(Where.Post.Title == "Overloading" || Where.Post.Title == "Operator")
);
What is new about this? Well, the above statement compiles, and when executed, it creates the following query (variables names and formatting applied, nothing else changed):
SELECT
post.Id AS PostId, post.Title AS PostTitle,
post.Contnet AS PostContent, post.Blog AS PostBlog,
blog.Id AS BlogId, blog.Name AS BlogName,
blog.Author AS BlogAuthor, users.Id AS UsersId,
users.Name AS UsersName, users.Email AS UsersEmail
FROM
Posts post
INNER JOIN Blogs blog
ON post.Blog = blog.Id
LEFT OUTER JOIN Users users
ON blog.Author = users.Id
WHERE
blog.Author = @p0 AND
(
( post.Title = @p1 ) OR
( post.Title = @p2 )
) ;
-- @p0 = '1', @p1 = 'Overloading', @p2 = 'Operator'
Now that is a totally different ball game. Take a look at the OO query and the generated SQL. That it three tables join as simple as anything you can think of.
And you know what? The following query doesn't compiles (type safety, yeah!):
Post.FindAll(
Where.Post.Blog.Author == ayende && Where.Blog.Name == "Ayende"
);
The caveat here is that it is a fairly new (and novel direction) so I don't have the generator up to generate this yet (but I made sure that this will be easy to do).
I was really tempted to keep the code hidden and let you try guess how I did it, but I think I will get enough satisfaction from knowing what this is going to make someone's head spin.
Next on this line, I need to get the code gen working once again, better than ever, and then modify Active Record and Rhino Commons to use this, such fun!
Also, codus to Dan, who discovered how many (direct) method calls there are in this riddle (16).
SQL Refactor
So I got the beta of SQL Refactor, opened SQL Management Studio and liked what I saw:

The first thing that I noticed, more important than anything else, is that it has a SQL Beautifier. This is the first I have seen this for T-SQL (there seems to be a lot for Oracle, though), and the first where the beautifier is integrated directly into SQL Management studio.

Why is this important? Well, check this out (from Northwind):
ALTER procedure [dbo].[Employee Sales by Country]
@Beginning_Date DateTime, @Ending_Date DateTime AS
SELECT Employees.Country, Employees.LastName, Employees.FirstName, Orders.ShippedDate, Orders.OrderID, "Order Subtotals".Subtotal AS SaleAmount
FROM Employees INNER JOIN
(Orders INNER JOIN "Order Subtotals" ON Orders.OrderID = "Order Subtotals".OrderID)
ON Employees.EmployeeID = Orders.EmployeeID
WHERE Orders.ShippedDate Between @Beginning_Date And @Ending_Date
And after applying the formatting:
ALTER procedure [dbo].[Employee Sales by Country]
@Beginning_Date DateTime,
@Ending_Date DateTime
AS
SELECT
Employees.Country,
Employees.LastName,
Employees.FirstName,
Orders.ShippedDate,
Orders.OrderID,
"Order Subtotals".Subtotal AS SaleAmount
FROM
Employees
INNER JOIN
(
Orders INNER JOIN
"Order Subtotals"
ON Orders.OrderID = "Order Subtotals".OrderID
)
ON Employees.EmployeeID = Orders.EmployeeID
WHERE
Orders.ShippedDate Between @Beginning_Date And @Ending_Date
I can actually read that. And the formatting is configuration enough that in 30 seconds I got just the way I wanted. Okay, so I am gushing over a code formatter, not very exciting, until you realize that I have to read thousnads of SQL Statements that were just off. The other stuff on the menu looks very interesting as well.
Expand wildcards means moving from "select * from foo" to a proper statement. Qualify names seems to work on the schema level, although I expected it to work on the columns as well. (Meaning that it turned Employees to [dbo].[Employees], but didn't changed FirstName to Employees.FirstName).
Moving right along, I looked at the script summarizer:

It can give a overview of a script, and take you directly into parts of it. Then you can use the Encapsulate Stored Procedure to extract stuff out. I like the way the wizards for the actions are setup, so I can move back and forth between them, and the commentry is excellent (the screen shot below if from split table refactoring):

Finishing thoughts:
- It generate scripts that would modify the DB, and the whole eperiance is... respectful is not quite the word I am looking for here. Too often I see a tool that is dumbing down ideas, and I get annoyed. Check out this UI, I don't think that I can explain it better now (now being 3AM here):
- I wonder how this would deal with a big DB (4000+ tables, hundreds of SPs, views, etc) when working on it remotedly.
- Missing finishing touches:
- There is no integration with right mouse click on the text editor, and it is annoying.
- No accesselrator key for the menu, more annoying.
It looks very nice, and as you can see, what excites me the most is probably what the guys at Red Gate spent the least time on :-). When I hear the term Database Refactoring, I think about moving tables around, Add Column, Move Column, etc. I didn't consider that refactoring in this sense will be improvements to the scripts themselves. In retrospect, that seems very obvious.
I know that already SQL Prompt has changed the way I work with databases, and I think that this has a similar potential. I know of one thing that I would like to put it to use already. Rename refactor on all the tables in the database, From Taarich to Date. I work with databases quite a bit, although I am not a DBA, and from a ten minute run, I know of several places where I would want to know that. (I would have given a lot to have this nine months ago, when I was doing the Big SQL Project).
There Be Dragons: Rhino.Commons.SqlCommandSet
After last night's post about the performance benefits of SqlCommandSet, I decided to give the ADO.Net team some headache, and release the results in a reusable form.
The relevant code can be found here, as part of Rhino Commons. Beside exposing the batching functionality, it is very elegant (if I say so myself) way of exposing functionality that the original author decided to mark private / internal.
I really liked the declaration of this as well:
[
ThereBeDragons("Not supported by Microsoft, but has major performance boost")]public class SqlCommandSet : IDisposable
The usage is very simple:
SqlCommandSet commandSet = new SqlCommandSet();
commandSet.Connection = connection;
for (int i = 0; i < iterations; i++)
{
SqlCommand cmd = CreateCommand(connection);
commandSet.Append(cmd);
}
int totalRowCount = commandSet.ExecuteNonQuery();
As a note, I spiked a little test of adding this capability to NHibernate, and it seems to be mostly working, I got 4 (out of 694) test failing because of this. I didn't check performance yet.
Opening Up Query Batching
I have ranted before about the annoying trend from Microsoft, to weld the hood shut in most of the interesting places. One particulary painful piece is the command batching implementation in .Net 2.0 for SQL Server. The is extremely annoying mainly because the implementation benefits are going for those who are going to be using DataSets (ahem, not me), but are not avialable to anyone outside of Microsoft. (See topic: OR/M, NHibernate, etc).
Today, I have decided to actually check what the performance difference are all about. In order to do this, I opened the (wonderful, amazing) Reflector and started digging. To my surprise, I found that the Batching implementation seems to be centralized around a single class, System.Data.SqlClient.SqlCommandSet (which is internal, of course, to prevent it from being, you know, useful).
Since the class, and all its methods, are internal to System.Data, I had to use Reflection to pry them out into the open. I noticed that the cost of reflection was fairly high, so I converted the test to use delegates, which significantly imporved perfromance. The query I run was a very simple query:
INSERT
INTO [Test].[dbo].[Blogs] ([blog_name]) VALUES (@name)With the @name = 'foo' as the parameter value. The table is simple Id (identity), Blog_Name (nvarchar(50))
Note: Before each test, I truncated the table, to make sure it is not the additional data that is causing any slowdown.
The Results:

The X axis is the number of inserts made, the Y axis is the number of ticks that the operation took. As you can see, there is quite a performance difference, even for small batch sizes. There is a significant difference between batching and not batching, and that reflection / delegates calls are not a big cost in this scenario.
Here is the cost of a smaller batch:

This shows a significant improvement even for a more real-world loads, even when we use Reflection.
I just may take advantage of this to implement a BatchingBatcher for NHibernate, it looks like it can make a good benefit for perfromance. Although this will probably not affect SELECT performance, which is usually a bigger issue.
You can get the code here: BatchingPerfTest.txt
Spot the WTF
Fixing log4net 1.2.9 AdoNetAppender NULL bug
In log4net 1.2.9 there is a bug in the AdoNetAppender. It will not log null values appropriately. This bug is fixed in log4net 1.2.10, but it is not always possible to move to the next version. (In my case, both NHibernate and Castle uses log4net 1.2.9)
I tried recompiling everything to use 1.2.10, but it took too long, and eventually I simply wrote this little trigger:
CREATE TRIGGER [ReplaceNullLookAlikeWithNulls]
ON [dbo].[Logs]
INSTEAD OF INSERT
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
INSERT INTO .[dbo].[Logs]
([Date]
,[Thread]
,[Level]
,[Logger]
,[Message]
,[Exception]
,[Filename]
SELECT
[Date]
,[Thread]
,[Level]
,[Logger]
,[Message]
,[Exception]
,case [Filename] when '(null)' then NULL else [Filename] end
FROM Inserted
END
Not the best solution, but it will hold water.
Pessimistic Locking
I need to be able to ask the database for some rows to work on, and I need to be sure that I am the only one that works on those rows. After thinking about it, I came up with this SQL:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
BEGIN TRANSACTION
INSERT INTO Locks(LockedBy, LockedAt, LcokedItemId, LockedItemType)
OUTPUT inserted.Id
SELECT TOP 100 @lockedBy, getdate(), item.Id, @itemType
FROM Items item
WHERE NOT EXISTS (SELECT 1 FROM Locks lock
WHERE lock.LcokedItemId = item.id AND lock.LockedItemType = @itemType)
ORDER BY item.Id
COMMIT TRANSACTION
This is the first time I need to implement pessimistic locking, and I am not sure if this is the right way to go. The work I am doing is disconnected, so I can't just lock the rows and continue to work.
Any suggestions?
Update: This is what I have now:
CREATE PROCEDURE lockAndGetItems AS
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
BEGIN TRANSACTION
DECLARE @lockedBy uniqueIdentifier, @countOfItemsToLock int
SET @lockedBy = '68A6D54C-557D-428D-8B82-5D68C9C1B33E'
-- Get number of rows already locked by me
SELECT @countOfItemsToLock = 100 - COUNT(Id) From Locks where LockedBy = @lockedBy
-- Lock rows to complete to an even 100
INSERT INTO Locks(LockedBy, LockedAt, LockedItemId)
SELECT TOP (@countOfItemsToLock) @lockedBy, getdate(), item.Id
FROM Items item
WHERE NOT EXISTS (SELECT 1 FROM Locks lock
WHERE lock.LockedItemId = item.id and lock.LockedBy = @lockedBy)
ORDER BY item.Id
-- Return row data
SELECT Items.* FROM Items JOIN Locks ON Items.Id = Locks.LockedItemId AND LockedBy = @lockedBy
COMMIT TRANSACTION
Basically it fetch the first 100 items designated for this SP. Each SP has a single caller, so I don't worry about expiring the lock. I will always get the locked rows. In the case of a crash when the rows are locked, when the client is restarted, it will get the same rows that it already locked, and can start working on them. The reason that I implement is as a SP with hard coded Guid is that each SP locks a different table.
Better solution to the Tricky SQL problem
Moran has pointed me to the CROSS APPLY syntax in T-SQL 2005, which does allows joining against a table valued function. Using this, the query goes down to this:
SELECT
EmpName,
CONVERT(NVARCHAR, CurrentDate,103) Date,
HasWorked = CASE SUBSTRING(Roster,
dbo.IndexInRoster(StartDate,CurrentDate,LEN(Roster)), 1)
WHEN '_' THEN 0
WHEN '*' THEN 1
ELSE NULL
END
FROM Schedules CROSS APPLY DateRange(StartDate,EndDate)
This is much nicer way to deal with it. Considerring that I am using similar techniques all over the place, this is a really good thing to know.
Solving a Tricky SQL Problem
I'm starting to get quite a bit of mail from this blog. Some of those questions are about subjects I can answer immediately, some require a fair amount of work (which can be had, if you really want), and the more interesting ones are those that require some thinking, but does not require too much time. This question from Dave is the best one so far, and I got his permission to blog about it, so I'm doubly happy.
The issue is working against a legacy database to get the data for further processing. I'll let Dave explain the issue, since he does it much better:
I have to write a query to generate a report over some interesting data. It's basically scheduling which days people are working. The data looks like this:
| Id | EmpName | StartDate | EndDate | Roster |
| 1 | Bob | 12/06/2006 | 18/06/2006 | _*___** |
| 2 | Mary | 12/06/2006 | 18/06/2006 | *_*__*_ |
The trick is, the roster field contains a string with a _ or * depending on wether the person is scheduled to work that day or not, but the first character always starts on the sunday. The startdate and enddate can be any day of the week.
In the example above, the 12-jun is a monday, so monday corresponds to the second character in the roster string, so Bob's working and Mary's not.
The roster string wraps around, so the first character of the roster string actually corresponds with the enddate here! Now, this roster string could be 7, 10, 14 days long.
I could get the report out if I can write a query to get it to this:
Employee DateWorking Bob 12/06/2006 Bob 16/06/2006 Mary 13/06/2006 Mary 16/06/2006
By the way, I haven't asked, but I'll bet that this schema has originated from a MainFrame, if not currently, than in its recent past.
First I created the schema I needed:
CREATE TABLE Schedules
(
Id INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
EmpName NVARCHAR(255) NOT NULL,
StartDate DATETIME NOT NULL,
EndDate DATETIME NOT NULL,
Roster NVARCHAR(50) NOT NULL
);
GO
INSERT INTO Schedules
SELECT 'Bob','12-Jun-06','18-Jun-06','_*___**'
UNION ALL
SELECT 'Mary','12-Jun-06','18-Jun-06','*_*__*_'
GO
Then, I started playing with DatePart(), getting the day of the week of StartDate from each row. This gave me the index I needed into the Roster column. But, this only told me whatever the employee worked or didn't work on the start date, which isn't very helpful. What I needed was a way to check for all the values between StartDate and EndDate.
I posted about this issue a while ago, and I made use of this techqnique here:
CREATE FUNCTION DateRange ( @start datetime, @end datetime )
RETURNS @DateRange TABLE ( CurrentDate datetime )
AS
BEGIN
WHILE (@start <= @end)
BEGIN
INSERT INTO @DateRange(CurrentDate) VALUES(@start)
SELECT @start = DATEADD(day,1,@start)
END
RETURN
END
GO
Conceptually, what I wanted was this:
SELECT
IndexInRoster = DatePart(dw,StartDate) + DateDiff(day,StartDate, CurrentDate)
FROM Schedules, DateRange(StartDate,EndDate)
Unfortantely, DateRange() is a table valued function, and what this query ask from SQL Server is to join each row in the Schedules table to another table. This is not possible, of course.
I settled on faking it using this appraoch:
WITH AllDatesInTable(CurrentDate) AS
(
SELECT CurrentDate FROM dbo.DateRange(
(SELECT MIN(StartDate) FROM Schedules),
(SELECT Max(EndDate) FROM Schedules) )
)
SELECT
TestIndexInRoster = DatePart(dw,StartDate) + DateDiff(day,StartDate, CurrentDate)
FROM Schedules JOIN AllDatesInTable
ON CurrentDate BETWEEN StartDate AND EndDate
This query uses Common Table Expression to define a table that has all the dates in the Schedules table. Notice that I constrained it to all the dates in the current row. In essense, this give me a row per each date in the date range of each row. This is the basis of solving this problem.
The other issue is the wrapping of the day index in the roster. This is a bit complicated because we need to take into account three things. SQL Server string handling is 1 base, not 0 based (argh!), we are shifting based on the start date functionality, and we need to wrap around correctly. In order to handle this issue I created this function:
CREATE FUNCTION IndexInRoster(@StartDate DATETIME, @CurrentDate DATETIME, @RosterLen INT)
RETURNS INT AS
BEGIN
DECLARE @Result int
SET @Result = (DATEDIFF(day,@StartDate,@CurrentDate) + DATEPART(dw,@StartDate)) % (@RosterLen)
IF @Result = 0
RETURN @RosterLen
RETURN @Result
END
GO
The check for @Result equals 0 is there because SQL Server is using 1 based string handling.
Brining it all together, we get this:
WITH AllDatesInSchedulesTable(CurrentDate) AS
(
SELECT CurrentDate FROM dbo.DateRange(
(SELECT MIN(StartDate) FROM Schedules),
(SELECT Max(EndDate) FROM Schedules) )
)
SELECT
EmpName,
CONVERT(NVARCHAR, CurrentDate,103) Date,
HasWorked = CASE SUBSTRING(Roster,
dbo.IndexInRoster(StartDate,CurrentDate,LEN(Roster)), 1)
WHEN '_' THEN 0
WHEN '*' THEN 1
ELSE NULL
END
FROM Schedules JOIN AllDatesInSchedulesTable
ON CurrentDate BETWEEN StartDate AND EndDate
And the result of this query:
| EmpName | Date | HasWorked |
| Bob | 12/06/2006 | 1 |
| Mary | 12/06/2006 | 0 |
| Bob | 13/06/2006 | 0 |
| Mary | 13/06/2006 | 1 |
| Bob | 14/06/2006 | 0 |
| Mary | 14/06/2006 | 0 |
| Bob | 15/06/2006 | 0 |
| Mary | 15/06/2006 | 0 |
| Bob | 16/06/2006 | 1 |
| Mary | 16/06/2006 | 1 |
| Bob | 17/06/2006 | 1 |
| Mary | 17/06/2006 | 0 |
| Bob | 18/06/2006 | 0 |
| Mary | 18/06/2006 | 1 |
And from here it is trivial to get to whatever format you want.