Negative Space Scheduling
My first project as a professional software developer was to build a scheduling system for a dental clinics chain. That was a huge project (multiple years) and was really quite interesting. Looking back, I have done a lot of really cool technical things there. I also learned quite a lot from that project. The danger of complexity being one of the chief issues.
Consider a dental clinic, where we have the following schedule for a dentist:
- Monday – 10:00 – 15:00
- Wednesday – 09:00 – 13:00
- Thursday – 11:30 – 16:30
The task is to be able to schedule an appointment for the dentist given those rules.
In addition to the schedule of the dentist, we also have actual Appointments, those looks like this:
Assume that you have quite a few of those, and you want to schedule a new appointment for a patient. How would you do that? I’m a database guy, let’s see if I can specify the task as a query?
We need a dentist that has availability of a particular length (different tasks have different schedules) and particular qualifications. However, there is no such thing as availability in our model. We have just:
The complexity here is that we need to search for something that isn’t there.
I actually found some of my posts on this topic, from 2006. That isn’t a simple problem. And the solution is usually to generate the missing data and query on that. My old posts on the topic actually generate an in memory table and operate on that, which is great for small datasets, but will fail in interesting ways for real world datasets.
For what it’s worth, RavenDB allows you to generate the missing data during the indexing process, so at least the queries are fast, but the indexing process is now compute-intensive and a change in the dentist schedule can result in a lot of work.
All of that is because of two issues:
- We are trying to query for the data that isn’t there.
- The information is never used as queried.
These two points are strongly related to one another. Consider how you would schedule a dentist appointment. You first need to find the rough time frame that you need (“come back in six months”) and then you need to match it to your schedule (“I can’t on Monday, I got the kids”, etc).
There is a better way to handle that, by filling in the missing pieces. Instead of trying to compute the schedule of a dentist from the specification that we have, go the other way around. Generate the schedule based on the template you have. The result should be something like this:
In other words, based on the schedule provided, we’ll generate an entry per day for the dentist. That entry will contain the appointments for the day as well as the maximum duration for an available appointment. That means that on query time, we can do something as simple as:
where Dentist = $dentistId
and At between $start and $end
and MaximumDuration >= $reqDuration
And that gives us the relevant times that we can schedule the appointment. This is cheap to do, easy to work and it actually matches the common ways that users will use the system.
This has a bunch of other advantages, that are not immediately apparent but end up being quite important. Working with time sucks. The schedule above is a nice guideline, but it isn’t a really useful one when you need to run actual computations. Why is that? Well, it doesn’t account for vacations days. If there is a public holiday on Wednesday, the dentist isn’t working, but that is an implied assumption in the schedule.
For that matter, you now need to figure out which calendar to use. A Christian and a Jewish dentist are going to have different holiday calendars. Trying to push that into a query is going to be quite annoying, if not impossibly so. Putting that on the generator simplifies things, because you can “unroll” the schedule, apply the holiday calendar you want and then not think about it.
Other factors, such as vacation days, reserved time for emergencies and similar issues make it a lot easier to manage in a concrete form. Another important aspect is that the schedule changes, for any decent size clinic, the schedule changes all the time. You may have the dentist requesting to close a couple of hours early on Monday because of a dance recital and add more hours on Thursday. If the schedule is generated, this is a simple matter to do (manual adjusting). If we have just the schedule template, on the other hand… that becomes a lot more complex.
In short, the best way to handle this is to take the template schedule, generate it to a concrete schedule and operate from that point on.
not sure why this is not merely a "where does not exist ()" sql query?
ie self-join subquery. for example, in this table of sequences:
seq: 1,2,3,4,6,7,8, 10.
select seq +1 from seqtable s1 where not exists (select * from seqtable s2 where s2.seq = s1.seq + 1)
Because that isn't what you are searching for.In this context, assuming numbers, you want to search for: "give me a range of consecutive numbers between 10 and 25 that is at least 4 numbers apart".And then try to think about what the database needs to do in order to answer that.
And the actual problem you have is that you aren't just working with a sequence of numbers. Assume that you have the
Appointmentsas tables.Find me all the 1.5 hours holes for a dentist on Wednesday between Oct 1 and March 14?
That is what makes this tough. Also, note that if you try to schedule on Oct 5, that is Yum Kippur (holiday), so not applicable.Trying to do that as a query is insanely complex.
I still see this as straightforward, for example, starting with a minute-level table (8 hours * 60 min = 480 rows per day , one for each minute) and a booked flag, finding free ranges in minutes could be as below. (Beginning of day and end of day logic still req)
My first a semester project at university was a planning/scheduling application for some imaginary transport company, and my first desktop/sql application at all. Dont remember the details, but one weakness of SQL stuck out -that it's really hard to select data that isn't there.
I'm not sure why you couldn't do this in memory. An effecient data model would cope with a large number of dentists/patients/appointments. I'd doubt it's 1GB in memory done correctly.
Let's try an example. Say 1000 clinics with on average 10 dentists, so 10,000 dentists. You want to store in memory just the numeric data required to be able to book appointments.
Per dentists you'd have their availability encoded as a set of generic repeating calendar avail or unavail. Generating data would be a bad idea. 1K for each is generous but even if you were wasteful and it was 10K it would only be 10,000 * 10K = 100MB. So this is not a concern.
There will be a large number of appointments. Overestimate but let's say for 3 years we have 300 (days) * 20 (per day) * 3 years = 18,000 appointments per dentist. We only need time and duration of each appointment as we can lookup the patient and details based on (dentistid, start). The time and duration can be 5 min granularity so the time and duration can easily be stored in an int and int arrays.
So we have 10,000 dentist * 18,000 appointments * 4 bytes ~ 700MB
In memory you'd have some object and indexing overhead but it look like you could do this easily in 1GB - 2GB. It would be about 1000 times faster than a database solution.
Thinking about it there is probably a better optimised Availability data structure that represents the dentists available slots. Would be able to go out many years also.
You are materializing the values at this point, because you have
Booked = 'N', which is basically the same thing as I suggested.
What I'm talking about is if you have only the
Booked = 'Y'and no entries for the availability, that you'll need to compute.
Doing that in memory is pretty easy, sure. But you need to load / manage the data. Assume that you have:
Then you may be able to do some filtering, but then you have to do a run over all the matches, which can be _expensive_. And you are basically doing the generation as I suggested, but in memory, and then filtering on that.
Account for each dentist their Holiday Calendar, their vacation requests, etc. Then you have to keep around _all the actual appointments_, so you won't double book.
And now, do the same with concurrency and atomicity. So if I have two requests coming in, I have to avoid double booking, and if I marked an appointment, I have to make sure that I'll roll it back if I failed to write the appointment to the database. Now you have a distributed transaction problem to solve.
I did that, it is not fun, and a lot harder than it seems. Solving the problem as an ad hoc scenario is fairly simple, sure, but you need to integrate that into your overall solution.
And materializing the data to the database directly will simplify all of those concerns, won't take a lot of space and allow to index the data.
I'd argue this is a complexity not really needed for 'dentist appointment' system.
The important part here is that the patient rarely wants 'any timeslot, any dentist, any branch'. The reason to go to dentist is either because you have tooth ache, go for an (semi)annual check up, or you continue your treatment.
In the first case you want something as soon as possible, so the query is 'give me a timeslot within next couple of days, for any doctor'. Rinse-repeat for multiple branches. Short timeframe makes it not very compute intensive regardless of the approach you take, be it query the existing appointments and compute free slots in the app, pre-generate all slots up front, or generate all slots in memory on database side at query time and subtract booked slots.
In the case of annual check up, the timeframe is bigger, but personally I wouldn't look for anything later than 2-3 weeks from now. Still manageable in terms of compute time.
In the last case, the next visit is booked for specific doctor, but the timeframe can be much longer. Still doable by querying next 2 weeks, if that doesn't work then next 2 weeks and so on.
In fact, for small to medium clinics, the dataset would be small so pretty much anything will work and you don't need to do such querying anyway - just present the user with regular calendar, they will 'eyeball' a free slot quickly enough to warrant not spending money on developing more complex system.
For a big network of dentists with thousands of branches and tens of thousands of patients (e.g. private medical insurance company that lets you book appts with their partners via their website), I would go with:
I'm trying to generalize here. You typically have your dentist, for sure. But consider the case of wanting to go to a
dental hygienistinstead. In this case, usually you want any dental hygienist that fits my location and timeframe. Computing the result on a two weeks boundary means that you have to scan potentially a large number of them. And you have to do that as a computation, not a query.
Note that in order to generate the calendar, you need to figure out availability. You can usually just render it directly, sure, but then you run into an issue if you aren't trying to get a dentist, but any dental hygienist, for example. Which calendar are you showing?
Your first option is what I recommend, actually. You materialized the availability and then can query on that.
Your second option works for a single person, but not if I'm trying to schedule between effectively interchangeable appointments to different people.
seems to me that we are agreeing that once you grow to use cases like 'any dental hygienist in any clinic within reasonable commute time and larger timeframe', computing on the fly is not a sustainable option. The first part of my post was meant for 'design a system for a small clinic, maybe with a handful of branches within single city', which in fact doesn't really require to scale that much.
To render the calendar, you just need 'working time' and a list of appointments. Any holidays, short breaks etc, can be modeled as appointments too. UI akin to 'scheduling assistant' in Outlook. If the choice of dental hygienists is relatively small (say up to 10), then this could work, otherwise see second part of my post.
Regarding point 2 in the second part of my post, it wasn't really meant for scheduling purposes, but so that the dentist can have visibility. Granted you can do that by quering the pre-generated availability table too, it's just that I think having two data sources for two different use-cases might be more efficient.