This blog post offers some perspectives on which data from headless content management systems (CMS), and potentially other systems, to index for search, and on options for implementing that indexing. Indexing is the process that makes a search engine aware of data.
All headless CMS provide APIs that let developers select zero or more records that match specific criteria. For keyword search, stemming, and other advanced search features, many CMS require integration with a third-party search engine.
Search indexes can work like tables from which developers select rows called documents. All CMS records can feed a single search index, for example to map URL paths to CMS records, in which case records in the CMS and documents in the search index are largely equivalent.
Search indexes can flatten relational and hierarchical data into a single source. For example, a CMS may represent pages as a hierarchy of different types of records, where the index represents all documents in a common schema and has no facilities for hierarchy or relations. It is not uncommon for data that appears once in a source system to appear duplicated many times in multiple search indexes.
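As a minimal sketch of that flattening, assume a hypothetical page record that nests component records under a `children` key. The record shape and every field name here are illustrative assumptions, not taken from any particular CMS or search engine.

```python
from typing import Any


def flatten_page(page: dict[str, Any]) -> dict[str, Any]:
    """Collapse a page and its nested component records into one flat document."""
    texts: list[str] = []

    def collect(record: dict[str, Any]) -> None:
        # Gather the text of this record, then descend into its children.
        if record.get("text"):
            texts.append(record["text"])
        for child in record.get("children", []):
            collect(child)

    collect(page)
    return {
        "id": page["id"],
        "url_path": page["url_path"],
        "content_type": page["type"],
        # The hierarchy is gone; only the searchable text remains.
        "body": " ".join(texts),
    }


# Hypothetical page made of nested component records.
page = {
    "id": "p1",
    "type": "article",
    "url_path": "/news/launch",
    "text": "Launch day",
    "children": [
        {"type": "hero", "text": "Big news", "children": []},
        {"type": "section", "text": "", "children": [{"type": "quote", "text": "It works"}]},
    ],
}
doc = flatten_page(page)
```

The resulting document has no nested structure, which is why the same source text can end up copied into several documents or indexes.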
Documents can contain data of different types merged from separate systems. For example, documents that contain product marketing content from CMS records may also contain data from Product Information Management (PIM) and other systems.
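One way to picture such a merged document, assuming hypothetical CMS and PIM records that share a product identifier (all field names below are invented for illustration):

```python
def build_product_document(cms_record: dict, pim_record: dict) -> dict:
    """Merge marketing content (CMS) and product data (PIM) into one search document."""
    return {
        "id": cms_record["id"],
        "sku": pim_record["sku"],
        # Marketing content owned by the CMS.
        "title": cms_record["title"],
        "description": cms_record["description"],
        # Product data owned by the PIM.
        "price": pim_record["price"],
        "in_stock": pim_record["stock"] > 0,
    }


# Hypothetical source records from the two systems.
cms_record = {"id": "prod-1", "title": "Trail Shoe", "description": "Grippy and light."}
pim_record = {"sku": "TS-001", "price": 89.0, "stock": 12}
product_doc = build_product_document(cms_record, pim_record)
```

A change in either system then requires reindexing the merged document, which is worth considering when choosing an indexing approach.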
What to Index
Rather than retrieving some data from search and some from CMS, developers may prefer to access the search engine whenever possible. In other words, index everything. To select data managed in the CMS from a search index, the search engine must not only index all data used for query and keyword search operations, but must also store all data required for content delivery operations.
Some solutions may choose to index metadata for media such as images and documents managed in the CMS. Excluding specific use cases such as PDF documents, most solutions are unlikely to index binary data or store that binary data in a search index. Solutions may implement dedicated indexes for binary data and media metadata.
All search hits should have URLs to support linking from search results. Web solutions can use search indexes to map URL paths to CMS records. CMS records that represent pages have URLs. Some records, such as content fragments reused by multiple pages, may not have default URLs, requiring alternate solutions to map documents from the search index to URLs.
The following fields may be appropriate for a generic search index that supports simple queries and maps URLs to CMS records:
- URL Path: The path part of the URL of the page that corresponds to the CMS record associated with the search document, for mapping URL paths to CMS records.
- Link Path: For rendering links in search results to documents that do not have default URL Paths (not for mapping URL paths to CMS records); alternatively, specify the record that has the URL to use when this document appears in search results.
- URL Domain: The domain of the URL, in some cases stored separately from the URL path for flexibility and processing efficiency.
- Content Type: CMS content type identifier (for filtering by content type).
- Record Identifier: The CMS identifier of the record/document.
- Navigation Title: Text to use in links to documents.
- Prerendered HTML: HTML fragments stored in the index to improve content delivery performance.
- JSON: A JSON representation of the CMS record, or a translation thereof, made available to calling code.
- Parent: ID or URL path of parent.
- Children: IDs, URL paths, or other child document/record identifiers.
- Facets: Tags and other values used in faceted navigation and other filtering scenarios.
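The fields above could be sketched as a single document type. This is only an illustration: the attribute names, the optional defaults, and the `link_for` helper are assumptions, and a real search engine would express the schema in its own configuration format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SearchDocument:
    record_id: str                          # Record Identifier
    content_type: str                       # Content Type
    navigation_title: str                   # Navigation Title
    url_path: Optional[str] = None          # URL Path; None if the record has no page URL
    link_path: Optional[str] = None         # Link Path; used when url_path is missing
    url_domain: Optional[str] = None        # URL Domain
    prerendered_html: Optional[str] = None  # Prerendered HTML
    json_data: Optional[str] = None         # JSON
    parent: Optional[str] = None            # Parent
    children: list[str] = field(default_factory=list)  # Children
    facets: list[str] = field(default_factory=list)    # Facets


def link_for(doc: SearchDocument) -> Optional[str]:
    """Return the link to render in search results, preferring the page URL."""
    path = doc.url_path or doc.link_path
    if path is None:
        return None
    return f"https://{doc.url_domain}{path}" if doc.url_domain else path
```

For example, a page document with `url_path="/"` and `url_domain="example.com"` links to `https://example.com/`, while a reusable fragment with only `link_path="/offers#promo"` links to that path instead.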
For each index, solutions that support multiple languages must determine whether to implement multiple indexes or include a field in the index to identify the language and then always filter search results by language.
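The two options can be contrasted in a few lines. Both function names and the `language` field are hypothetical; in a real search engine the second option would normally be a filter clause in the query itself rather than a pass over the results.

```python
def index_for_language(base_name: str, language: str) -> str:
    """Option 1: one index per language, such as 'content-en' and 'content-fr'."""
    return f"{base_name}-{language}"


def filter_hits_by_language(hits: list[dict], language: str) -> list[dict]:
    """Option 2: one shared index; every query filters on a language field."""
    return [hit for hit in hits if hit["language"] == language]


# Hypothetical hits from a shared, multilingual index.
hits = [
    {"id": "a", "language": "en"},
    {"id": "a", "language": "fr"},
    {"id": "b", "language": "en"},
]
english_hits = filter_hits_by_language(hits, "en")
```

With the second option, forgetting the filter in a single query leaks other languages into the results, which is the risk behind "always filter."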
How to Index
All solutions require processes to build and rebuild search indexes. Some solutions may implement processes to update existing search indexes when data changes rather than rebuilding entire indexes.
Building or rebuilding a search index means indexing or reindexing all records. Building an index can be an expensive operation, so solutions should minimize how often they do it. It can be important to consider building or rebuilding search indexes as part of the solution deployment process, when troubleshooting, and possibly as part of significant, or even all, publishing operations. Solutions that depend on search must have a facility for rebuilding search indexes without affecting the production solution, such as by building a new index, moving that index to production, and then removing the previous incarnation of that index.
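The build, move, and remove sequence can be sketched with an alias that production queries always go through. The `SearchEngine` class below is a toy in-memory stand-in written for this post, not the API of any real product, although several search engines offer a comparable alias feature.

```python
class SearchEngine:
    """Toy in-memory stand-in for a search engine that supports index aliases."""

    def __init__(self) -> None:
        self.indexes: dict[str, list[dict]] = {}
        self.aliases: dict[str, str] = {}

    def search_all(self, alias: str) -> list[dict]:
        # Production code only ever refers to the alias, never to a concrete index.
        return self.indexes[self.aliases[alias]]


def rebuild(engine: SearchEngine, alias: str, records: list[dict], version: str) -> None:
    """Build a new index beside the live one, then switch the alias to it."""
    new_name = f"{alias}-{version}"
    # 1. Build the new index while the old one keeps serving queries.
    engine.indexes[new_name] = [dict(record) for record in records]
    # 2. Move the new index to production by repointing the alias.
    old_name = engine.aliases.get(alias)
    engine.aliases[alias] = new_name
    # 3. Remove the previous incarnation of the index.
    if old_name and old_name != new_name:
        del engine.indexes[old_name]
```

Calling `rebuild(engine, "content", records, "v2")` after an earlier `"v1"` build leaves only `content-v2` in place. The sketch assumes each rebuild uses a new version label; reusing a label would overwrite the live index in place.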
Updating a search index means reindexing any data that has changed. As well as new records and updates to existing records, index update processes must account for data removal. Some solutions may use webhooks to update indexes. Specifically, when the CMS publishes a record, it can invoke a webhook that calls an external service that implements search indexing.
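A webhook receiver along these lines might look like the following sketch. The payload shape (`action`, `record_id`, `record`) is an assumption made for illustration, since each CMS defines its own webhook format, and a plain dictionary stands in for the search index.

```python
def handle_webhook(index: dict[str, dict], event: dict) -> None:
    """Apply one hypothetical CMS webhook event to a search index."""
    record_id = event["record_id"]
    action = event["action"]
    if action == "publish":
        # New records and updates to existing records are handled the same way.
        index[record_id] = event["record"]
    elif action in ("unpublish", "delete"):
        # Data removal: drop the document if it is present.
        index.pop(record_id, None)
    else:
        raise ValueError(f"Unsupported webhook action: {action}")


# Hypothetical sequence of events applied to an empty index.
index: dict[str, dict] = {}
handle_webhook(index, {"action": "publish", "record_id": "r1", "record": {"title": "Hello"}})
handle_webhook(index, {"action": "publish", "record_id": "r1", "record": {"title": "Hello again"}})
handle_webhook(index, {"action": "publish", "record_id": "r2", "record": {"title": "Bye"}})
handle_webhook(index, {"action": "unpublish", "record_id": "r2"})
```

After these four events the index holds only `r1`, with its updated title.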
A CMS typically raises a separate webhook for each record that changes, but indexing records in bulk is more efficient than indexing each record separately. Some CMS support batch publication, which can raise a webhook after publishing completes to signal when to rebuild or update an index. Other solutions can store a timestamp that indicates when they last received a publishing webhook, and then rebuild or update the index once that value has not changed for some short period of time.
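The timestamp approach amounts to debouncing. In this sketch the class name, the `tick` method, and the injected `clock` parameter are all assumptions for illustration; a scheduler or background task would call `tick` periodically.

```python
import time
from typing import Callable, Optional


class DebouncedIndexer:
    """Collect webhook notifications and index them as one batch after a quiet period."""

    def __init__(
        self,
        quiet_seconds: float,
        update_index: Callable[[list[str]], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.quiet_seconds = quiet_seconds
        self.update_index = update_index
        self.clock = clock
        self.last_webhook_at: Optional[float] = None
        self.pending: set[str] = set()

    def on_webhook(self, record_id: str) -> None:
        # Remember the record and when the most recent webhook arrived.
        self.pending.add(record_id)
        self.last_webhook_at = self.clock()

    def tick(self) -> bool:
        """Flush the batch if no webhook has arrived for quiet_seconds."""
        if self.last_webhook_at is None:
            return False
        if self.clock() - self.last_webhook_at < self.quiet_seconds:
            return False
        batch = sorted(self.pending)
        self.pending.clear()
        self.last_webhook_at = None
        self.update_index(batch)
        return True
```

Each new webhook resets the quiet period, so a burst of publishing produces a single index update rather than one per record.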
Some headless CMS products provide content delivery synchronization APIs that return either all records or all records that have changed since a given time. Rather than using content delivery APIs or GraphQL, some solutions can use these synchronization APIs to build search indexes, update them, or both, which can be more efficient. Specifically, the solution would invoke the synchronization API to build or rebuild the index, store a synchronization token from the CMS API or a timestamp, and then periodically invoke the synchronization API to process updates. Alternatively or in addition, solutions can employ event or service bus mechanisms to control the flow of data between systems.
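The token-based flow might be sketched as follows. The `fetch_changes` callable, the shape of the changes it returns, and the `fake_cms` function are hypothetical stand-ins; real synchronization APIs differ in how they page results and how they report deletions.

```python
from typing import Callable, Optional

# (changed records, token to pass on the next call)
SyncResult = tuple[list[dict], str]


def sync_index(
    index: dict[str, dict],
    fetch_changes: Callable[[Optional[str]], SyncResult],
    token: Optional[str],
) -> str:
    """Apply changes since `token` to the index; a token of None means a full build."""
    changes, next_token = fetch_changes(token)
    for change in changes:
        if change.get("deleted"):
            index.pop(change["id"], None)
        else:
            index[change["id"]] = change
    # The caller stores this token and passes it to the next run.
    return next_token


def fake_cms(token: Optional[str]) -> SyncResult:
    """Hypothetical CMS: a full export first, then one round of changes."""
    if token is None:
        return [{"id": "a", "title": "A"}, {"id": "b", "title": "B"}], "t1"
    if token == "t1":
        return [{"id": "a", "title": "A2"}, {"id": "b", "deleted": True}], "t2"
    return [], token


index: dict[str, dict] = {}
token = sync_index(index, fake_cms, None)   # initial build
token = sync_index(index, fake_cms, token)  # periodic update
```

After the second call the index holds only record `a` with its updated title, and further calls with the stored token return no changes.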
If you have any suggestions for what to index for search and how to perform indexing of data from headless content management systems, please comment on this blog post.