Ayende @ Rahien

Unnatural acts on source code

Task Scheduling Improvements

I took some time to see how I can improve the sample implementation of the task scheduling that I posted yesterday.

You can checkout the code here. What I wanted to check is the ability to scale the solution to many tasks and to IO bound tasks.

In a stress test that I made, the library held its own while scheduling 9,864,100 tasks, at one point we had ~3,500,000 concurrently queued tasks. Memory usage hovered at the ~500Mb range.

The next stage was to see what happens when we have tasks that takes a long amount of time to execute, depends on IO, etc.

I wrote this piece of code:

public class GatherAllFiles : AbstractTask
{
    private readonly string root;

    public GatherAllFiles(string root)
    {
        this.root = root;
    }

    protected override IEnumerable<Condition> Execute()
    {
        List<string> results = new List<string>();
        results.Add(root);
        
        List<Future> futures = new List<Future>();
        foreach (string directory in Directory.GetDirectories(root))
        {
            Future spawn = Spawn(new GatherAllFiles(directory));
            futures.Add(spawn);
        }
        string[] files = Directory.GetFiles(root);

        yield return Done(futures);

        foreach (Future future in futures)
        {
            results.AddRange(future.GetValue<IEnumerable<string>>());
        }
        results.AddRange(files);
        SetResult(results);
    }
}

I then run it on my OSS directory. This directory contains 133,108 directories and 206,298 files (ask me how I know... ).

The library just ate it up without even noticing. Very nice, even if I say so myself :-)

Comments

Francois Tanguay
01/08/2008 03:54 AM by
Francois Tanguay

Using Lambda Expressions, I would simplify:

class AbstractTask:

public WaitFuture Done(IEnumerable items)

{

return new WaitForFuture(() => items.All(item => item.HasValue));

}

Frans Bouma
01/08/2008 08:32 AM by
Frans Bouma

Task scheduling is a topic that's been researched a lot. In general, round-robin with a priority system is on average a system which gets you the best results.

Your testcase is a bit weird. The disk is a resource which can be in 1 task at a time. This means that even if you create a fancy scheduling system, it doesn't matter, everything is done sequential anyway. What's worse: the slowest thing in your system is the head in the harddisk: every time it steps, you'll lose a big chunk of performance.

To get optimal performance on a disk with a set of tasks is to execute all of them in a sequential order and not having any of them interfere with one another. The thing is though that modern disks are able to 'accept' more than one command at a time, and switch between them during atomic block reads for example. This causes a lot of stepping, which results in a degration of performance.

So to illustrate/test task scheduling, I'd pick a topic which isn't limited by a sequential resource like a disk ;)

Ayende Rahien
01/08/2008 11:28 AM by
Ayende Rahien

Frans,

Actually, the FS is not limited to the HD, there is quite a bit of caching and forward loading going on.

In addition to that, we aren't executing everything at the same time, this is the beauty of this approach, we are only executing N tasks at a time, where N == count(CPU)

Roy Tate
01/08/2008 06:34 PM by
Roy Tate

On a decent server, I would expect multiple CPUs and a RAID configuration, so a file system (FS) example is not totally out of the question.

Marco
01/08/2008 08:50 PM by
Marco

How would you handle the scheduling part? for example run one task every minute and another task every day at 05:00..

Thanks

Marco

Ayende Rahien
01/08/2008 10:24 PM by
Ayende Rahien

Marco,

That is something that I usually won't do using this approach.

Nevertheless

public class PrintTimeTask : AbstractTask

{

public override IEnumerable Execute()

{

      while(true)

      {

                   Console.WriteLine(DateTime.Now);

                   DateTime nextSchedule = DateTime.Now.AddSeconds(5);

                  yield return delegate { return DateTime.Now >= nestSchedule; };


      }

}

}

Jeff Brown
01/08/2008 11:01 PM by
Jeff Brown

It works of course, but I can imagine it would get really tiresome having to write code in this stilted form all of the time.

It's not clear to me why you don't just compose all of the futures together more directly using lambdas to encapsulate dependent computations instead of all of this tedious "yield return" nonsense.

Ayende Rahien
01/09/2008 01:47 AM by
Ayende Rahien

Jeff,

Can you show me the code of doing this with Lambda?

Francois Tanguay
01/09/2008 03:21 AM by
Francois Tanguay

Probably something like

class AbstractTask

public Condition Where(Condition condition)

{

return condition;

}

And then

while(true)

{

Console.WriteLine(DateTime.Now);

return Where(() => DateTime.Now >= nextSchedule);

Ayende Rahien
01/09/2008 06:24 AM by
Ayende Rahien

Francois,

I am using .NET 2.0 to write the code.

This approach is the same thing, just with lambda syntax.

Jeff Brown
01/12/2008 09:03 AM by
Jeff Brown

The idea with lambda is instead of making up a scheduling loop with yield return which only works within the context of a single method, we make the tasks chain themselves around an asynchronous callback.

So basically, we have Task objects floating around. To start running some Task, we create a new Task object containing a delegate for the first block of code we want to run. We let the Task bubble up to the scheduler.

When the delegate eventually runs, it can begin a new Task which gets chained onto the previous one.

This idea works particularly well if we think of Tasks as spawning asynchronous operations. That is, we might run a little code in a delegate that starts the asynchronous operation. When the asynchronous operation finishes, we finally mark the Task finished and set its result (or an exception).

When the Task is marked done, it sends out a notification that enables other chained Tasks to start up using whatever result was previously computed.

I did something very much like this in ActionScript 3 because I had to manage a large number of asynchronous operations all sharing the only thread available. So everything ends up having to post back to the event loop. Basically the event loop is functioning as a kind of scheduler.

Here's the code for the ActionScript 3 stuff.

https://svn.castleproject.org/svn/castlecontrib/flexbridge/trunk/src/Castle.FlexBridge.Flash/src/castle/flexbridge/common/AsyncTask.as

Unfortunately most of the cool code using AsyncTask is in other proprietary projects that have yet to be OSS'd However typical usage looks like:

            public function signInWithSavedCredentials():AsyncTask

    {

        return AsyncTask.start(function(task:AsyncTask):void

        {

            if (hasSavedCredentials)

            {

                task.joinTask(signIn(savedUserName, savedPassword, true));

            }

            else

            {

                task.done(false);

            }

        });

    }


    public function signIn(userName:String, password:String, saveCredentials:Boolean):AsyncTask

    {

        return authenticationService.onResultStart(function(task:AsyncTask, service:AbstractService):void

        {

            task.joinToken(service.GetAuthenticationTicket(userName, password));

        }).onResultStart(function(task:AsyncTask, ticket:String):void

        {

            if (ticket != null)

            {

                _webServiceLocator.setAuthenticationTicket(ticket);

                _signedInUserName = userName;


                dispatchEvent(new Event("signedInUserNameChange"));


                if (saveCredentials)

                {

                    _savedCredentials.data['userName'] = userName;

                    _savedCredentials.data['password'] = password;

                    _savedCredentials.flush();

                }   


                task.done(true);

                return;

            }


            task.done(false);

        });

    }

Now in .Net we have a lot more power available so we will probably want to use a somewhat different abstraction. Moreover with the C# 3.0 syntax, we can cook up a more fluent interface for creating and chaining these asynchronous tasks together.

And we can rearrange the control-flow a bit to put some kind of TaskScheduler in change of the scheduling so that each new Task to run gets posted back to the scheduler instead of running immediately. If desired.

Lots of possibilities in this design space. So don't stop with "yield return" loops...

Ayende Rahien
01/13/2008 02:44 PM by
Ayende Rahien

Jeff,

what you are talking about is message passing architecture.

I have tried that with Retlang, and it works very nicely. For some things.

For others, this approach is far preferable, it is much easier to write code this way than explicitly handle the state transfer issues.

Jeff Brown
01/13/2008 05:25 PM by
Jeff Brown

No. It's not strictly message passing... Quite different from passing data around using message ports.

What we're really doing is chaining Futures that to create composite asynchronous operations.

Ayende Rahien
01/13/2008 06:26 PM by
Ayende Rahien

I don't see the difference, can you elaborate?

Comments have been closed on this topic.