Ayende @ Rahien

Refunds available at head office

Avoid Soft Deletes

One of the annoyances that we have to deal when building enterprise applications is the requirement that no data shall be lost. The usual response to that is to introduce a WasDeleted or an IsActive column in the database and implement deletes as an update that would set that flag.

Simple, easy to understand, quick to implement and explain.

It is also, quite often, wrong.

The problem is that deletion of a row or an entity is rarely a simple event. It effect not only the data in the model, but also the shape of the model. That is why we have foreign keys, to ensure that we don’t end up with Order Lines that don’t have a parent Order. And that is just the simplest of issues. Let us consider the following model:

public class Order
{
    public virtual ICollection<OrderLines> OrderLines { get;set; }
}

public class OrderLine
{
    public virtual Order Order { get;set; }
}

public class Customer
{
    public virtual Order LastOrder { get;set; }
}

Let us say that we want to delete an order. What should we do? That is a business decision, actually. But it is one that is enforced by the DB itself, keeping the data integrity.

When we are dealing with soft deletes, it is easy to get into situations where we have, for all intents and purposes, corrupt data, because Customer’s LastOrder (which is just a tiny optimization that no one thought about) now points to a soft deleted order.

Now, to be fair, if you are using NHibernate there are very easy ways to handle that, which actually handle cascades properly. It is still a more complex solution.

For myself, I much rather drop the requirement for soft deletes in the first place and head directly to why we care for that. Usually, this is an issue of auditing. You can’t remove data from the database once it is there, or the regulator will have your head on a silver platter and the price of the platter will be deducted from your salary.

The fun part is that it is so much easier to implement that with NHibernate, for that matter, I am not going to show you how to implement that feature, I am going to show how to implement a related one and leave building the auditable deletes as an exercise for the reader :-)

Here is how we implement audit trail for updates:

public class SeparateTableLogEventListener : IPreUpdateEventListener
{
    public bool OnPreUpdate(PreUpdateEvent updateEvent)
    {
        var sb = new StringBuilder("Updating ")
            .Append(updateEvent.Entity.GetType().FullName)
            .Append("#")
            .Append(updateEvent.Id)
            .AppendLine();

        for (int i = 0; i < updateEvent.OldState.Length; i++)
        {
            if (Equals(updateEvent.OldState[i], updateEvent.State[i])) continue;

            sb.Append(updateEvent.Persister.PropertyNames[i])
                .Append(": ")
                .Append(updateEvent.OldState[i])
                .Append(" => ")
                .Append(updateEvent.State[i])
                .AppendLine();
        }

        var session = updateEvent.Session.GetSession(EntityMode.Poco);
        session.Save(new LogEntry
        {
            WhatChanged = sb.ToString()
        });
        session.Flush();

        return false;
    }
}

It is an example, so it is pretty raw, but it should be create what is going on.

Your task, if you agree to accept it, is to build a similar audit listener for deletes. This message will self destruct whenever it feels like.

Comments

Rafal
08/30/2009 08:14 PM by
Rafal

I find it easier to live with soft deletes - in typical business applications the entities are interconnected in many different ways, sometimes even with circular dependencies and physically removing records would be complicated and could even corrupt the data. Some entities are just not deletable, like customers or users, also historic data shouldn't be touched by ongoing updates. The only data that are physically removed by default are the records in many-to-many auxiliary tables.

Richard Dingwall
08/30/2009 11:00 PM by
Richard Dingwall

Lately I've been starting to consider soft-delete an anti-pattern. For exactly the reason Rafal says - it makes your database very fragile with FKs to deleted items, and adds a mandatory additional layer of complexity to queries and joins (where deleted = 0) that is not ORM-friendly and can cause serious problems if you forget.

We've been slowly replacing soft deletes with AFTER DELETE/UPDATE triggers that copy it to an xyz_log table.

It's already accepted not to mix reporting concerns with your transactional database. I think you shouldn't mix audit logging in the live tables either.

mattmc3
08/30/2009 11:02 PM by
mattmc3

I find that the whole purpose of an ORM is to allow you to have a real RDBMS designed to protect the data and facilitate reporting, while being able to control data entry via business objects. Handling modification tracking and enforcing non-permanent-deletion of data is one of those grey areas of DBA concern versus business object concern.

I'm not really a fan of soft deletes as it always seems to have such a negative impact on reporting. Most enterprise reporting tools don't use the DAL, so all the reporting SQL needs to have "WHERE tb1.IsDeleted <> 1 AND tbl2.IsDeleted <> 1 etc. etc." which complicates the whole thing. And, if you're doing outer joins then those statements need to go in the ON clause instead of the WHERE and it gets even messier.

I've seen all sorts of different solutions to the deletion problem, all of which have flaws of their own. Auditing triggers on every table, secondary "delete" versions of all the tables, the select-star-view-with-the-soft-deletion-filter approach.

It's also really hard to prove to an auditor that all data is protected from a delete, if the logic is subject to developer whims and buried in the DAL code.

I see delete auditing as a persistence concern. And, that deleted or changed data rarely needs to be queried except during an audit. Often a simple AuditLog table will work - something with a very denormalized general purpose structure: ID, TableName, ColumnName, PrimaryKey, OldValue, NewValue, ModifiedDate, ModfiedUser. It's a pain to query, and the audit table will almost certainly be the largest in your database, but it's also the simplest solution. And, with a good ETL tool you could get at the data if you needed to answer one-off queries. And then for special circumstances where you really do want to query the soft-deletes as part of your legitimate business needs, you can make those the limited exceptions and not the rule.

Phil
08/31/2009 12:10 AM by
Phil

I generally agree that soft deletes are bad, and that a requirement that "no data shall be lost" can be handled by audit/history tracking 99% of the time.

However, there are cases where certain data can't be deleted, and having an IsActive column makes sense. The most common example is a user table. If you track “who did what” in a system (e.g. "CreatedBy" or "LastUpdatedBy"), you’ll want to keep a record that that user existed in the system at some point, even if that user should no longer be active or be able to login (like if they left the company). I can’t imagine naming a column “WasDeleted” under any circumstances. If it’s set to false, it’s redundant; if it’s set to true, it’s a lie. :)

Also, I've done auditing both with a single table (as both Matt and Oren suggested), and with an auditor table for each entity under audit (as Richard suggested). The latter is far easier to track and query, IMHO, and makes it much easier to reconstruct what happened later.

I generally create a single "master" audit table (id, date, user) and then one individual audit table for each entity under audit. The individual table is an exact mirror of the base table, except it has a FK to the master audit table and a column to represent the action that took place (create/update/delete). A row in the master table loosely represents the database transaction, so when a transaction occurs that involves one or more rows in a table(s) under audit, each corresponding audit table will get rows added representing the change(s) made, tied back to a single row in the master table.

I’m sure no one cares but I just finished doing this on a new system, so it’s very top of mind. :)

pete w
08/31/2009 01:18 AM by
pete w

I have been building data warehouses a cubes based on historical transaction information, and I can tell you that hard deletes are a BAD idea if you want to build an OLAP cube any time in the future.

If you want to build a historic fact table in your warehouse based on transactional information, you need:

  • DateCreated (not null) - this is important for reporting, and it prompts the ETL/warehouse process to load the record when it is new. This can be done with a getdate() default value in the table design.

*DateUpdated (nullable) - this is is set/updated every time the record is changed. it is used for the same reasons as above. LogEventListener might be handy here

*DateDeleted (nullable) - Yes this is a soft-delete field. If it is null the record is active. It will be important to add this to a ledgered fact table to keep warehouse reports up-to-date. Once properly logged in the warehouse, you can delete this record in a grooming process. LogEventListener might be handy here as well

Of course checking that deleted flag can be a pain, but if you start with this kind of convention in tables that represent facts and dimensions, an OLAP cube and ETL process will be easier to build out in the future.

In regard to your challenge, two thoughts:

  1. Can you make "LastOrder" a query, or a SQL derived/calculated column, not a stored value?

  2. Ayende, youre the architect wiz, if #1 is not desirable, is it ugly to update "LastOrder" somewhere in a domain class when an order is killed?

Dennis
08/31/2009 04:20 AM by
Dennis

Most of the problems in Soft Deletes can be avoided if it is seen as a "human" thing.

Only when you want to display lists of something, do you need to filter the deleted objects.

Otherwise, you simple accept that direct key references to a deleted object ignores the fact that the object is deleted.

And in reporting parts of the code, you implement features to allow/disallow the inclusion of deleted objects.

Atleast that is how we avoid many of the problems.

HBD
08/31/2009 05:37 AM by
HBD

Ayende must be some kind of a magician that he knows the topic we developers want to read about. :-) I saw "Avoid Soft Deletes" coming in 2 days, and I waited and sure enough here it is!

This one is very important and has bothered me so much that I have not yet implemented Delete in my app, instead waiting for the time I got the full story right.

I think Soft Deletes is a good option, but it adds to the complexity of the code. So... can Deletion be a 'Cross Cutting Concern' instead?

Also, conventional soft delete in combination with Auditing tables will take care of most of the issues while we work out the AOP way of handling it.

Can we have a blog post on auditing tables now :-) ? Or a pointer to a good one if it exists.

Anders
08/31/2009 07:22 AM by
Anders

+1 Dennis!

Deletes are very much just about filtering in the GUI. In our app deleting a customer simply means that this customer is inactive and cannot be used anymore. But we can still report on that customer and the orders they once placed. So it's not even about aditing at all, in fact we have separate auditing as well.

Ayende Rahien
08/31/2009 07:45 AM by
Ayende Rahien

Pete,

I don't actually see the difference. I can manage those updates to OLAP cube very easily using an audit table rather than using the same table.

It might make the OLAP load procedure slightly more complex, but I am willing to take that burden because the alternative is to make all the other stuff more complex

efdee
08/31/2009 10:40 AM by
efdee

Will this post self-destruct or self-soft-destruct? :-)

Jim Sally
08/31/2009 06:27 PM by
Jim Sally

I'm still confused as to why the concept of temporal systems ( http://en.wikipedia.org/wiki/Temporal_database) has yet to really take off in the IT world. Perhaps making temporal querying a first class citizen in a popular ORM such as NHibernate will once and for all bring this concept into the mainstream. Ultimately concepts such as "soft deletes" become a non-issue when properly utilizing temporal systems (specifically bi-temporal ones). Oren, while I've successfully built a number of these systems in the past (based on .NET DataSets at the time) am I crazy for hoping such an "advanced concept" such as temporal databases will become mainstream??? The work Snodgrass, et al did back in the late 80's and early 90's seems so long ago that it's crazy that this concept is still relegated to the edge cases, rather than simply being how all business systems get created.

Uriel Katz
08/31/2009 07:53 PM by
Uriel Katz

you could make some kind of base class that handles all the logic,like adding the soft-delete field,and returning query objects which are already filtered,i can see this done in django ORM pretty easily

Ayende Rahien
08/31/2009 08:01 PM by
Ayende Rahien

Jim,

The problem with that is that it is (a) complex, (b) hard to understand (c) hard to optimize.

Especially if you go to multiple temporal times for different properties or associations.

In other words, it is just too hard to do for real, and there are just as good solutions out there for it anyway

Ayende Rahien
08/31/2009 08:02 PM by
Ayende Rahien

Uriel,

You can certainly do this, I done it in a number of systems with NH.

It is just complex, and there are simpler solutions available

Uriel Katz
08/31/2009 08:05 PM by
Uriel Katz

apart from audit tables,what other options exist?

Pete is right that for creating OLAP columns keeping those 3 columns on each table makes making a cube very easy.

Jim Sally
08/31/2009 08:09 PM by
Jim Sally

Oren, in all seriousness, I thought that problems that were "(a) complex, (b) hard to understand (c) hard to optimize" were the kinds that folks like you and I get paid to solve...

EdCh
08/31/2009 08:13 PM by
EdCh

I can't say I disagree. But, auditing is not the only reason for soft deletes. Sometimes there's a business requirement to "undelete" (i.e. undo or reverse a previous delection). Given this requirement would you do anything differently with respect to soft deletes?

Ayende Rahien
08/31/2009 08:37 PM by
Ayende Rahien

Uriel,

A history table, with an Append Only mode, is the simplest, you can pour it into OLAP directly, and it doesn't impact anything else that you do

Ayende Rahien
08/31/2009 08:37 PM by
Ayende Rahien

EdCh,

I would "restore" the row, insert it back into the original table.

Erik L
08/31/2009 09:28 PM by
Erik L

This Seems like a case where like everything else it depends. Take a discussion forum where you want administrators to see Deleted entries with the note that they were deleted. Well then deleted is just a filter, that is applied to everyone except administrators/moderators and the rest of your code would become more complex to go back and throw those deleted entries in for the administrator to see.

It also seems like a lot of the arguments for soft deletes is coming from case that you aren't really deleting, but deactiviating. e.g. Customer.

Ayende, How does this same logic apply to something like

ayende.com/.../...ample-design-considerations.aspx

I guess the main difference is that above you are mostly adding things and getting the benefit of an add-only(mostly?) model. A soft delete is still a modification?

Ayende Rahien
08/31/2009 09:36 PM by
Ayende Rahien

Jim,

I'll respond to that tomorrow, too late now to really think.

@Erik,

Soft delete & insert only are two very different models.

And I am talking specifically about deletes here, deactivating a customer is something else.

Jan Willem B
08/31/2009 11:16 PM by
Jan Willem B

Deletion is an overloaded term:

1) technical deletions ("hard delete")

2) domain deletions ("soft delete")

Soft deletes should only be used if the deletion itself is relevant data. For example visualised by a line through the text to indicate the item was "deleted".

To avoid confusion it should always be clear that delete is a technical, hard delete. And to avoid confusion I agree there should not be a column "Deleted". Better call the column "Died" or "Dismissed" or "Cancelled" or whatever your domain concept is.

David Williams
08/31/2009 11:51 PM by
David Williams

When you live and work in a regulated environment, you do not have a choice - the data must live on. If it is added to the database, it shall remain in the database - no exceptions. Plus, a secondary rule often comes into play - thou shall not read from the audit trail for anything other than audit reports. Therefore, you live with soft deletes and status tables.

The effect of these requirement is that the working of cascading updates and deletions MUST reside in code - whether in the database or in the data layer. The only way that I have seen to get them to work cleanly is to accomplish them in both layers - update triggers in the database and update events/messages in the data layer.

Ayende Rahien
09/02/2009 10:21 AM by
Ayende Rahien

I did, I have no further comments, since I mostly agree with him

Tad
09/03/2009 07:39 AM by
Tad

Hi,

When i implemented audit on preUpdateEvent I got error : collection was not processed by flush()

I'm comparing complex types and don't know how to do this that it would work

Any solutions ?

Ayende Rahien
09/03/2009 08:20 AM by
Ayende Rahien

Tad,

This really should go to the nh users mailing list

You need to use a child session in the pre update event

Mulder
09/10/2009 08:26 AM by
Mulder

I think there's perfectly reasonable scenarios for both soft and hard deletes, with and without audit trails.

  1. Soft delete:

A user signs off of your service, but you need to keep (anonymized) stats data consistent for your customers - here you cannot remove the entire user, you just blank his personal data (name, email, phone, ...) and keep the anonymized statistical data (country, birth year, profession).

  1. Soft (or super-soft) delete with audit trail:

Any financial transaction data, even if entered by error, may only be corrected by adding a correction entry, not by deleting the erroneous entry, or the IRS will be all over you. So either you flag it as deleted (soft delete) or you correct it by adding another entry ("super-soft delete").

  1. Hard delete with audit trail:

A user unsubscribes from your newsletter. No need to keep the info "he once was subscribed to it" in the live database, but need to keep the info somewhere accessible in case he sues you for spamming and you can prove "but back in August when you got the mail, you were still subscribed".

  1. Hard delete with no audit trail:

Personal data as in #1 if your local data protection laws require. (This means no more storing, anywhere, technically speaking not even in last month's backup, but that's another issue altogether.)

Revin
09/28/2009 11:53 PM by
Revin

I also have problem with: collection was not processed by flush()

Any solution so far?

Comments have been closed on this topic.