Generate sitemap.xml and robots.txt from Content Management Metadata

This blog post suggests that solutions use metadata from the content management system (CMS) to generate sitemap.xml and robots.txt. You can use sitemap XML and the robots exclusion standard to influence SEO for your website and to affect how search engines, including Google, treat it.

Each website can have a URL that exposes XML containing metadata to help search engines evaluate the priority of search hits. Sitemap XML is important to SEO. Details are not within the scope of this blog post.

The robots exclusion standard informs search engines about URL paths to exclude from search results. You cannot rely on robots following the instructions in robots.txt. Details are not within the scope of this blog post.

Only robots that follow exclusion directions honor robots.txt. Do not attempt to secure content with robots.txt, but you can use it where it helps.

The default implementation of the sitemap and robots standards is a pair of files at /sitemap.xml and /robots.txt. You can hard-code the content of these files, you can generate these files during the solution build process, or you can generate these files in response to the first requests for them after each publishing operation that might have affected their contents.

Rather than using static files, you can use an application server to generate responses to requests for the sitemap and robot exclusion URLs at runtime, in which case you may want to cache those responses and evict those cache entries after every publishing operation that might have affected their contents.
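As a rough illustration of the cache-and-evict approach described above, here is a minimal Python sketch. The class name GeneratedFileCache, the on_publish hook, and the build_robots function are all hypothetical; a real solution would wire the eviction to whatever publish event its CMS exposes.

```python
from typing import Callable, Dict


class GeneratedFileCache:
    """Caches generated responses until a publishing operation evicts them."""

    def __init__(self, generators: Dict[str, Callable[[], str]]) -> None:
        self._generators = generators  # URL path -> function that builds the body
        self._cache: Dict[str, str] = {}

    def get(self, path: str) -> str:
        # Build the body on the first request after an eviction, then reuse it.
        if path not in self._cache:
            self._cache[path] = self._generators[path]()
        return self._cache[path]

    def on_publish(self) -> None:
        # Hypothetical hook: call this after any publish that might affect the files.
        self._cache.clear()


calls = {"count": 0}


def build_robots() -> str:
    # Stand-in generator; a real one would read metadata from the CMS.
    calls["count"] += 1
    return "User-agent: *\nDisallow: /drafts/\n"


cache = GeneratedFileCache({"/robots.txt": build_robots})
first = cache.get("/robots.txt")
second = cache.get("/robots.txt")  # served from the cache, generator not called again
cache.on_publish()
third = cache.get("/robots.txt")   # regenerated after the publish
```

Clearing the whole cache on every publish is the simplest policy; evicting only when the published record could affect these files would reduce regeneration at the cost of more logic.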

If you generate sitemap and robots files during the build process or use an application server to construct their contents at runtime, then you can embed metadata from the CMS in their content. In the CMS, define fields for the various values used by both sitemap and robots. If possible, place the fields in a container so that all your content types that represent web pages can reuse those field definitions. If possible, separate these fields visually into something like a container named Search or Metadata.

CMS users enter values into these fields, and the sitemap and robots generation processes retrieve those values. In addition to URL, here are potential fields to include in the content types:

  • Exclude from Search (Checkbox): Used in robots exclusion for records to exclude from search results.
  • Include in Sitemap (Checkbox): Whether to include the record in sitemap XML.
  • Search Change Frequency (Dropdown): Used in sitemap XML.
  • Search Priority (Number or Dropdown): Used in sitemap XML.
  • Search Title (Text): To link in search results (or use an existing field, such as that used for the HTML <title> tag).
  • Search Description (Text): To display in search results (used in HTML <meta> description tag).
  • Search Keywords: Keywords to influence search engines (used in HTML <meta> keywords tag).
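
To make the sitemap-related fields above concrete, the following Python sketch builds sitemap XML from a list of records. PageRecord and build_sitemap are hypothetical names, and the sketch assumes each record's url is already an absolute URL; how you actually retrieve records depends on your CMS.

```python
from dataclasses import dataclass
from typing import Iterable, Optional
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


@dataclass
class PageRecord:
    """Hypothetical shape of a CMS record that represents a web page."""
    url: Optional[str]
    include_in_sitemap: bool = True
    change_frequency: Optional[str] = None  # e.g. "daily", "weekly", "monthly"
    priority: Optional[float] = None         # 0.0 to 1.0


def build_sitemap(records: Iterable[PageRecord]) -> str:
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for record in records:
        # Skip records without URLs and records the CMS user opted out.
        if not record.url or not record.include_in_sitemap:
            continue
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = record.url
        if record.change_frequency:
            ET.SubElement(url, "changefreq").text = record.change_frequency
        if record.priority is not None:
            ET.SubElement(url, "priority").text = f"{record.priority:.1f}"
    return ET.tostring(urlset, encoding="unicode")


records = [
    PageRecord(url="https://example.com/", change_frequency="weekly", priority=1.0),
    PageRecord(url="https://example.com/hidden", include_in_sitemap=False),
    PageRecord(url=None),
]
sitemap_xml = build_sitemap(records)
```

Only the first record ends up in the output here; the second is opted out and the third has no URL.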

You may want to include additional fields for OpenGraph and other purposes, though these appear in HTML pages rather than sitemap XML or robots.txt.

If a record should not appear in the sitemap or robots output, then the corresponding generation logic should not process the descendants of that record either.
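
One possible reading of this pruning rule, sketched in Python for robots.txt: once a record is marked as excluded, emit a single rule for it and stop descending. Node, excluded_paths, and build_robots are hypothetical names, and the sketch assumes the CMS exposes records as a tree with URL paths that end in a slash.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Hypothetical CMS record with a URL path and child records."""
    path: str
    exclude_from_search: bool = False
    children: List["Node"] = field(default_factory=list)


def excluded_paths(node: Node) -> Iterator[str]:
    if node.exclude_from_search:
        # Disallow rules match by path prefix, so one rule covers the
        # descendants too; there is no need to visit them.
        yield node.path
        return
    for child in node.children:
        yield from excluded_paths(child)


def build_robots(root: Node) -> str:
    lines = ["User-agent: *"]
    lines += [f"Disallow: {path}" for path in excluded_paths(root)]
    return "\n".join(lines) + "\n"


site = Node("/", children=[
    Node("/about/"),
    Node("/private/", exclude_from_search=True, children=[
        Node("/private/notes/", exclude_from_search=True),
    ]),
])
robots_txt = build_robots(site)
```

In this example /private/notes/ never gets its own rule, because the traversal stops at /private/.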

When retrieving data from the CMS, exclude records that do not have URLs.

If the CMS does not represent records in a hierarchy, then consider generating output in an order matching the natural hierarchy defined by the URLs of those records.
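
A small Python sketch of one way to order flat records by URL hierarchy: sort on the host and the tuple of path segments, so each parent sorts immediately before its children. The function name hierarchy_key and the sample URLs are illustrative only.

```python
from typing import Tuple
from urllib.parse import urlsplit


def hierarchy_key(url: str) -> Tuple[str, Tuple[str, ...]]:
    parts = urlsplit(url)
    # Compare by path segments rather than raw strings, so that
    # /about/team stays next to /about instead of after /about-us.
    segments = tuple(s for s in parts.path.split("/") if s)
    return (parts.netloc, segments)


urls = [
    "https://example.com/blog/post-1",
    "https://example.com/about-us",
    "https://example.com/about/team",
    "https://example.com/",
    "https://example.com/about",
    "https://example.com/blog",
]
ordered = sorted(urls, key=hierarchy_key)
```

A plain string sort would place /about-us between /about and /about/team, because the hyphen sorts before the slash; sorting on segments avoids that.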

If you have suggestions for managing sitemap.xml, robots.txt, additional metadata to capture, awareness of other resources that could use similar techniques with a content management system, alternative techniques to manage this data and service these URLs, links to implementations for specific CMS systems, or any other relevant feedback, please comment on this blog post.
