Grey Hill

On Accidentally Quadratic Memory Usage

As alluded to in "On Nested Loops", while at Helm Operations I encountered a problem that I solved by applying data-oriented design: converting accidentally quadratic memory usage into linear memory usage. What follows is not an exact recreation of the issue, but a simplified version.

The Context

Our company makes checklist software for regulated industries. Checklists contain multiple tasks which can be completed in any order and by many people. Once all the tasks in a checklist are complete, the checklist itself is marked as complete. Certain users are able to delete tasks and checklists. If all tasks in a checklist are deleted, the checklist is also deleted.

Since our software is used in regulated industries, we have to maintain an audit log. We will also use aggregate roots to maintain the integrity of a checklist.

Audit Log

An audit log is a list of all changes made in a system; it can be written to and read, but never updated. Every time a user changes something, such as completing or archiving a task, the change must be recorded. These changes are saved as a Domain Event: a snapshot of the data after a change.

Since all changes have to be saved, we cannot delete tasks or checklists: instead, we archive them. This means any queries for data must filter out archived items.
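For example, a query for a checklist's active tasks might look like this (a sketch, assuming a LINQ-queryable collection of tasks):

// Hypothetical query: archived items are filtered out, never deleted.
var activeTasks = tasks
    .Where(t => t.ArchivedOn == null)
    .ToList();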

Aggregate Root

To maintain the integrity of checklists, the software uses the concept of an aggregate:

... a cluster of domain objects that can be treated as a single unit.

For our software, the checklist and its tasks would be an aggregate.

Any change to an aggregate, at the root or any of its children, causes the whole aggregate to be saved. This allows the root to keep derived data, such as a checklist's completion and archival timestamps, consistent with its children.

The Code

Given the context, our data structures for checklists and tasks would look like this:

public class Checklist
{
    public Guid Id { get; set; }
    public string Title { get; set; }

    public List<ChecklistTask> Tasks { get; set; }

    public DateTimeOffset? CompletedOn { get; set; }
    public DateTimeOffset? ArchivedOn { get; set; }

    public void Update(ChecklistTask task)
    {
        var index = Tasks.FindIndex(t => t.Id == task.Id);
        Tasks[index] = task;

        // Update derived/computed/cached data from the tasks
        CompletedOn = Tasks.Any(t => t.CompletedOn == null)
            ? null
            : Tasks.Max(t => t.CompletedOn);
        ArchivedOn = Tasks.Any(t => t.ArchivedOn == null)
            ? null
            : Tasks.Max(t => t.ArchivedOn);
    }
}

public class ChecklistTask
{
    public Guid Id { get; set; }
    public Guid ChecklistId { get; set; }

    public string Description { get; set; }

    public Guid? CompletedBy { get; set; }
    public DateTimeOffset? CompletedOn { get; set; }

    public DateTimeOffset? ArchivedOn { get; set; }
}

For aggregates, a domain event would look something like this:

public class DomainEvent
{
    public Guid Id { get; set; }
    public DateTimeOffset CreatedOn { get; set; }

    public Guid ChecklistId { get; set; }
    public string JSON { get; set; }
}

We'll also have a class for creating domain events from checklists and tasks. This class would perform serialization and saving to the database.

public class ChangeQueue
{
    private readonly IDatabaseService _database;

    private readonly List<DomainEvent> _changes = new List<DomainEvent>();

    public ChangeQueue(IDatabaseService database)
    {
        _database = database;
    }

    public void Queue(Checklist checklist)
    {
        var domainEvent = new DomainEvent
        {
            Id = Guid.NewGuid(),
            CreatedOn = DateTimeOffset.UtcNow,

            ChecklistId = checklist.Id,
            JSON = JsonSerializer.Serialize(checklist),
        };

        _changes.Add(domainEvent);
    }

    public void Queue(ChecklistTask task)
    {
        var checklist = GetChecklist(task.ChecklistId);

        checklist.Update(task);

        Queue(checklist);
    }

    public void Save()
    {
        foreach (var domainEvent in _changes)
        {
            _database.Save(domainEvent);
        }
    }

    // Largely unimportant for our problem.
    // This is written for "completeness".
    private Checklist GetChecklist(Guid id)
    {
        // We check our changes before the database to prevent
        // overwriting previous changes.

        var orderedChanges =
            from change in _changes
            where change.ChecklistId == id
            orderby change.CreatedOn descending
            select change;

        var latestChange = orderedChanges.FirstOrDefault();
        if (latestChange != null)
        {
            return JsonSerializer.Deserialize<Checklist>(latestChange.JSON);
        }

        return _database.GetChecklist(id);
    }
}

Finally, a service for access to the database.

public interface IDatabaseService
{
    Checklist GetChecklist(Guid id);

    // Not only saves the domain event, but also
    // updates the tables for checklists and tasks.
    void Save(DomainEvent domainEvent);
}

The Problem

Our team received a report from a user that their system was crashing when they deleted a bunch of checklists. Even when they tried deleting a single checklist, the system would crash. Investigation revealed the user was deleting checklists with over 100 tasks and the system was crashing due to an out-of-memory exception.

In order to delete/archive a checklist, the system would archive each task in the checklist. By archiving all the tasks in a checklist, the checklist itself would also be archived.

The code responsible for the crash was the following:

public class ChecklistHelper
{
    private readonly ChangeQueue _changeQueue;

    private readonly List<ChecklistTask> _changedTasks = new List<ChecklistTask>();

    public void Archive(ChecklistTask task)
    {
        task.ArchivedOn = DateTimeOffset.UtcNow;

        _changedTasks.Add(task);
    }

    public void SaveChanges()
    {
        foreach (var task in _changedTasks)
        {
            // Creates a domain event to update the entire checklist
            _changeQueue.Queue(task);
        }

        // Save all the domain events
        _changeQueue.Save();
    }
}
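A caller would archive a checklist by archiving each of its tasks, something like this (a sketch, assuming a loaded checklist and a ChecklistHelper instance named helper):

// Archiving every task also archives the checklist itself,
// since Checklist.Update recomputes ArchivedOn from its tasks.
foreach (var task in checklist.Tasks)
{
    helper.Archive(task);
}

helper.SaveChanges();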

A domain event would be created for each `Archive` on a task, which requires grabbing the task's checklist and all the checklist's tasks. Thus, the entire checklist was being saved for each task in the checklist. For example, for a checklist with 3 tasks, there would be 3 domain events:

Domain Event 1:
  Checklist
    1. Task (Archived)
    2. Task
    3. Task

Domain Event 2:
  Checklist
    1. Task (Archived)
    2. Task (Archived)
    3. Task

Domain Event 3:
  Checklist (Archived)
    1. Task (Archived)
    2. Task (Archived)
    3. Task (Archived)

Suppose a task takes up 10 KB and a checklist without its children takes up 1 KB. When deleting n checklists with m tasks each, there would be n × m domain events. Each domain event would be (1 + 10m) KB in size, which results in a total memory usage of n × m × (1 + 10m) KB. For 10 checklists with 100 tasks each, this works out to 1,001,000 KB, roughly 1 GB.
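To sanity-check that estimate, here is the arithmetic as a small snippet (the 1 KB and 10 KB sizes are the assumed figures from above):

// Total memory for n checklists of m tasks each, in KB:
// n * m domain events, each holding the whole checklist at (1 + 10m) KB.
const long checklistKb = 1;
const long taskKb = 10;

long QuadraticKb(long n, long m) => n * m * (checklistKb + taskKb * m);

Console.WriteLine(QuadraticKb(10, 100)); // 1,001,000 KB -- roughly 1 GB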

Solution

The initial solution was to simply batch the changes in ChecklistHelper.SaveChanges. This was an easy fix and put a ceiling on memory usage.
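A rough sketch of that stop-gap (the batch size is hypothetical, and it assumes ChangeQueue.Save releases its queued events after persisting them):

private const int BatchSize = 10;

public void SaveChanges()
{
    for (var i = 0; i < _changedTasks.Count; i++)
    {
        _changeQueue.Queue(_changedTasks[i]);

        // Flush periodically so only one batch of domain
        // events is held in memory at a time.
        if ((i + 1) % BatchSize == 0)
        {
            _changeQueue.Save();
        }
    }

    _changeQueue.Save();
}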

I proposed consolidating the task changes into the checklists they belong to. We would:

  1. find the checklists affected by the task changes
  2. load the affected checklists
  3. apply the task changes to their respective checklist
  4. save the checklists.

We directly apply operations to the checklists instead of indirectly through a checklist's tasks. In short, we follow how the data is laid out: the essence of data-oriented design.

public class ChecklistHelper
{
    private readonly IDatabaseService _database;
    private readonly ChangeQueue _changeQueue;

    private readonly List<ChecklistTask> _changedTasks = new List<ChecklistTask>();

    public void SaveChanges()
    {
        // 1. Find the checklists affected by the task changes
        var checklistsToSave = new HashSet<Guid>();
        foreach (var task in _changedTasks)
        {
            checklistsToSave.Add(task.ChecklistId);
        }

        // 2. Load the affected checklists
        var checklists = new Dictionary<Guid, Checklist>();
        foreach (var checklistId in checklistsToSave)
        {
            checklists[checklistId] = _database.GetChecklist(checklistId);
        }

        // 3. Apply the task changes to their respective checklist
        foreach (var task in _changedTasks)
        {
            var checklist = checklists[task.ChecklistId];

            // Replace the task within the checklist
            checklist.Update(task);
        }

        // 4. Save the checklists
        foreach (var checklist in checklists.Values)
        {
            _changeQueue.Queue(checklist);
        }

        _changeQueue.Save();
        _changedTasks.Clear();
    }
}

Analysis

By following data-oriented design, an operation to archive m tasks in each of n checklists would take up:

mem usage = mem usage of checklists lookup + mem usage of domain events
          = (number of checklists × size of checklist) + (number of domain events × size of domain event)
          = n × (1 + 10m) KB + n × (1 + 10m) KB
          = 2n(1 + 10m) KB

Using our previous numbers of 10 checklists with 100 tasks each, our memory usage is 20MB.
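As a quick sanity check of the 2n(1 + 10m) KB figure (same assumed sizes of 1 KB per checklist and 10 KB per task):

// Memory for the consolidated approach, in KB:
// n checklists loaded once, plus n domain events, each (1 + 10m) KB.
const long checklistKb = 1;
const long taskKb = 10;

long LinearKb(long n, long m) => 2 * n * (checklistKb + taskKb * m);

Console.WriteLine(LinearKb(10, 100)); // 20,020 KB -- roughly 20 MB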

Not only did we shrink memory usage, we reduced database usage as well: we went from saving 1,000 domain events to 10. By consolidating changes to a checklist, we also made it easier for auditors to track a trail of changes to a checklist. Upon further testing, we even saw improvements to overall system performance. It was common in other areas of the system to update multiple tasks within a single checklist. By changing ChecklistHelper.SaveChanges, which is used throughout the system, we inadvertently improved those areas as well.

Conclusion

We saw how following data-oriented design turned an operation that took up 1GB of memory into 20MB, a 50x improvement. This also reduced the number of domain events from 1,000 to 10, which resulted in reduced database storage. As an unintended consequence, overall system performance improved.