You’re All Doing It All Wrong

That title is clearly clickbait, and it probably will not work, but I am trying to make a point: the way most of us implement headless CMS is probably not optimal (and by you, I mean me, which was supposed to be a joke). I do not want a static site, but I also do not want my application servers (such as ASP.NET) or clients (such as JavaScript in browsers) constantly requesting the same data from the CMS, even if the CMS caches the data at the HTTP endpoint for that call. I want to put my application servers as close as possible to my clients, I want those clients to place as few HTTP calls as possible, and I want every HTTP call to return the smallest payload possible, whether HTML, JSON, script, CSS, image, or otherwise. Since I cannot control a SaaS headless CMS vendor’s Content Delivery Network (CDN) infrastructure, and I may have to pay based on bandwidth exchanged with the CMS or even its CDN, I want to cache as much JSON as possible in my own content delivery tiers, in the geographies that matter to me, in order to scale, perform, and keep costs down.

A repository layer under the application should integrate the CMS and other systems and manage caching of their data. If a browser needs the JSON representation of an Entry, the application server can serialize that Entry as JSON in a <script> block rendered toward the end of the HTML of the page, which avoids the client placing an additional HTTP request for that data after receiving the main page HTML. Instead of every visitor requesting that same data after loading the first page, the data is part of the original (and potentially cached) HTTP response. By the time the client becomes aware of the need for the data, it has already arrived with the response to the initial HTTP request. Headless with ASP.NET Core fits this pattern perfectly.
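Something like this Razor sketch is what I have in mind; ArticlePage, Entry, and the element id are placeholders rather than any vendor’s SDK. Note that System.Text.Json’s default encoder escapes <, so the raw JSON cannot break out of the script element:

```cshtml
@* Hypothetical Razor sketch: embed the Entry's JSON in the page so the
   client never places a separate HTTP request for it. *@
@model ArticlePage

<article>
    <h1>@Model.Entry.Title</h1>
</article>

@* Rendered toward the end of the HTML body. Html.Raw prevents Razor from
   HTML-encoding the JSON; System.Text.Json's default encoder already
   escapes '<', keeping the payload safe inside the script element. *@
<script id="entry-data" type="application/json">
    @Html.Raw(System.Text.Json.JsonSerializer.Serialize(Model.Entry))
</script>
<script>
    // The client parses data that arrived with the initial response.
    const entry = JSON.parse(document.getElementById("entry-data").textContent);
</script>
```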

My objection is as much about how a repository that abstracts a CMS should get its data as it is about how visitors should get that data. A typical implementation uses the CMS vendor’s content delivery APIs to load Entries into application server memory on demand, or to preload some or all Entries (possibly in a background thread) at application initialization, and caches that data somewhere. Publishing should evict entries from, update entries in, or clear entire caches. Because targeted eviction is hard, most implementations simply clear everything (perhaps at a limited frequency, regardless of how often publishing occurs). Each clear forces the caches to reload, which typically results in a large number of redundant RESTful API calls, because the data for most Entries did not change during the publish.
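That eviction and clearing logic usually lives in a caching decorator over the repository abstraction. A hypothetical sketch, where Entry is a placeholder for your model type:

```csharp
using System.Collections.Concurrent;

// Hypothetical repository abstraction: the application asks for Entries and
// never knows whether they came from the CMS delivery API, a search index,
// or a cache.
public interface IEntryRepository
{
    Task<Entry?> GetEntryAsync(string id, CancellationToken ct = default);
    void Evict(string id);   // targeted invalidation on publish
    void Clear();            // the blunt instrument most implementations reach for
}

// A caching decorator over any concrete source (CMS- or search-backed).
public sealed class CachedEntryRepository : IEntryRepository
{
    private readonly IEntryRepository _source;
    private readonly ConcurrentDictionary<string, Entry> _cache = new();

    public CachedEntryRepository(IEntryRepository source) => _source = source;

    public async Task<Entry?> GetEntryAsync(string id, CancellationToken ct = default)
    {
        if (_cache.TryGetValue(id, out var hit)) return hit;  // no HTTP call
        var entry = await _source.GetEntryAsync(id, ct);      // one HTTP call, then cached
        if (entry is not null) _cache[id] = entry;
        return entry;
    }

    public void Evict(string id) => _cache.TryRemove(id, out _);
    public void Clear() => _cache.Clear();
}
```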

The best place for a repository to get its data is probably a search index rather than the CMS itself. If possible, solutions should populate Entry models directly from the JSON representation of the Entry stored in the search index. This eliminates the second round trip for solutions that would otherwise retrieve identifiers from search and then fetch the corresponding Entries from the CMS.
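Under that assumption, a search-backed loader might look like this; ISearchClient and its RawJson field are placeholders for your search engine client, not a real API:

```csharp
using System.Linq;
using System.Text.Json;

// Hypothetical search-backed loader: the index stores each Entry's full JSON
// at indexing time, so one search call returns fully populated models with
// no follow-up requests to the CMS delivery API.
public sealed class SearchEntryRepository
{
    private static readonly JsonSerializerOptions Options = new(JsonSerializerDefaults.Web);
    private readonly ISearchClient _search; // assumed abstraction over your search engine

    public SearchEntryRepository(ISearchClient search) => _search = search;

    public async Task<IReadOnlyList<Entry>> QueryAsync(string query, CancellationToken ct = default)
    {
        // Each hit carries the Entry's JSON exactly as stored at indexing time.
        var hits = await _search.SearchAsync(query, ct);
        return hits.Select(hit => JsonSerializer.Deserialize<Entry>(hit.RawJson, Options)!)
                   .ToList();
    }
}
```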

But what should populate the search index (or any repository), and how does publishing update that index or repository? Typically, publishing triggers webhooks that you intercept, both to pass the JSON representation of the Entry to the search engine for indexing and to clear caches, as in the sketch below.
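A hypothetical minimal ASP.NET Core endpoint for such a webhook; the payload shape (an id plus the Entry JSON) and the IIndexWriter abstraction are assumptions, since every vendor’s webhook format differs:

```csharp
using System.Text.Json;

var builder = WebApplication.CreateBuilder(args);
// Assume IIndexWriter and IEntryRepository registrations live here.
var app = builder.Build();

// Hypothetical publish webhook: index the Entry's JSON, then invalidate
// only the affected cache entry instead of clearing everything.
app.MapPost("/webhooks/publish", async (HttpRequest request, IIndexWriter index, IEntryRepository repository) =>
{
    using var doc = await JsonDocument.ParseAsync(request.Body);
    var id = doc.RootElement.GetProperty("id").GetString()!;

    await index.IndexAsync(id, doc.RootElement.GetRawText()); // assumed search abstraction
    repository.Evict(id);
    return Results.Ok();
});

app.Run();
```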

For performance, avoid clearing caches too often if you can. Publishing is often sequential, one Entry at a time. Would you want to clear the caches after each published Entry? If you need to preload data into the caches after evicting from or clearing them, how often do you want to do that? And what happens if you clear some Entries from the cache while publishing is still in progress?
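One way to soften this is to debounce: coalesce a burst of publish notifications into a single clear once publishing goes quiet. A hypothetical sketch, where the quiet period is an assumption you would tune per site:

```csharp
using System.Threading;

// Hypothetical debounce: coalesce a burst of publish webhooks into one
// cache clear once publishing has been quiet for a configurable window.
public sealed class DebouncedCacheClearer : IDisposable
{
    private readonly IEntryRepository _repository;
    private readonly TimeSpan _quietPeriod;
    private readonly object _gate = new();
    private Timer? _timer;

    public DebouncedCacheClearer(IEntryRepository repository, TimeSpan? quietPeriod = null)
    {
        _repository = repository;
        _quietPeriod = quietPeriod ?? TimeSpan.FromSeconds(30); // assumption: tune per site
    }

    // Call from the publish webhook; only the last notification in a burst fires.
    public void NotifyPublished()
    {
        lock (_gate)
        {
            _timer?.Dispose();
            _timer = new Timer(_ => _repository.Clear(), null, _quietPeriod, Timeout.InfiniteTimeSpan);
        }
    }

    public void Dispose() { lock (_gate) _timer?.Dispose(); }
}
```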

A better approach may be to poll the CMS periodically for changes. Every CMS should expose something like a synchronization API that you can use for this purpose. The repository periodically calls the synchronization API to check for changes and evicts, updates, or clears caches accordingly, whether it gets the new data from the search index, the CMS, or elsewhere (you could also use a publishing webhook to trigger that logic, which could limit its own frequency). The repository can populate a new cache in the background and then replace the existing cache with the new one.
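As a hypothetical sketch of that loop, here is a hosted service that polls a sync API and swaps in a freshly built cache. The SyncDelta shape (a token plus changed and deleted Entries) is an assumption modeled on typical CMS synchronization endpoints, not any specific vendor’s:

```csharp
using Microsoft.Extensions.Hosting;

// Assumed shapes: a continuation token plus the Entries changed or deleted
// since that token was issued.
public sealed record SyncDelta(IReadOnlyList<Entry> Changed, IReadOnlyList<string> Deleted, string NextToken);

public interface ICmsSyncClient
{
    Task<SyncDelta> GetChangesAsync(string? sinceToken, CancellationToken ct);
}

public sealed class SyncPollingService : BackgroundService
{
    private readonly ICmsSyncClient _sync;
    private volatile Dictionary<string, Entry> _cache = new();
    private string? _token;

    public SyncPollingService(ICmsSyncClient sync) => _sync = sync;

    public Entry? Get(string id) => _cache.TryGetValue(id, out var e) ? e : null;

    protected override async Task ExecuteAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            var delta = await _sync.GetChangesAsync(_token, ct);
            if (delta.Changed.Count > 0 || delta.Deleted.Count > 0)
            {
                // Build the replacement cache off to the side, then swap the
                // reference so readers never see a half-updated cache.
                var next = new Dictionary<string, Entry>(_cache);
                foreach (var entry in delta.Changed) next[entry.Id] = entry;
                foreach (var id in delta.Deleted) next.Remove(id);
                _cache = next;
            }
            _token = delta.NextToken;
            await Task.Delay(TimeSpan.FromMinutes(1), ct); // assumption: poll interval
        }
    }
}
```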

You could use webhooks to implement systems that use an event model, and you might want a repository to store and replay those events if the CMS vendor does not offer sync APIs that expose the same data. One challenge with webhooks is that my development workstation does not have a public DNS entry and sits behind firewalls; I need to look into opening the inbound HTTPS port in the firewalls and using an HTTP tunnel to work around the DNS issue.

In addition to supporting multiple techniques for loading data, a generic CMS repository abstraction should probably support multiple techniques for managing caches.
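To make that concrete, cache management could be a policy the repository composes, so that eviction, rebuild-and-swap, or clear-all are interchangeable. A hypothetical interface building on the earlier sketches:

```csharp
// Hypothetical split: loading and cache management are separate, swappable
// policies rather than one hard-coded path.
public interface ICachePolicy
{
    Task OnPublishedAsync(IReadOnlyCollection<string> entryIds, CancellationToken ct = default);
}

// One policy: targeted eviction via the repository from the earlier sketch.
public sealed class EvictChangedPolicy : ICachePolicy
{
    private readonly IEntryRepository _repository;

    public EvictChangedPolicy(IEntryRepository repository) => _repository = repository;

    public Task OnPublishedAsync(IReadOnlyCollection<string> entryIds, CancellationToken ct = default)
    {
        foreach (var id in entryIds) _repository.Evict(id);
        return Task.CompletedTask;
    }
}
```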
