This blog post discusses some issues mapping URLs to file system paths and entries in headless content management systems. At least in the context of web solutions driven by content management systems, solutions that involve static files must implement file and directory naming rules. If you have experience or perspectives on this issue, please comment on this blog post.
Update 9.November.2021: See:
I have been working with content management systems for more than 20 years, and headless content management systems for more than two years now. Many content management systems have the concept of a URL hierarchy, but do not have the concept of a file system. Certain entries, where entries are records in the content management system, represent pages, meaning that they contain the data required to render a page. Such entries are associated with URLs, such as /hr. Entries can be associated with URLs that appear nested with the URL of another entry, such as /hr/jobs. There is no concept of a directory, a file name extension, or a default file naming convention.
I have generally implemented architectures that generate HTML dynamically, at runtime, after deployment, when the browser or other client requests the page. Jamstack solutions tend to use static HTML generation, at build time, before deployment.
I have been experimenting with Vercel as a content delivery network (CDN), which is a sort of global edge caching layer that functions as a web server for static files. Vercel builds and deploys the static files directly from source code management systems, but also supports edge functions, React Server Components, and other runtime server-side HTML generation technologies using Next.js. Technologies such as Next.js come with their own HTTP request routing mechanisms, but using them complicates moving to alternate CDNs such as Netlify.
I have not used static HTML files in decades, but from my memory:
- The web server maps hostnames such as mysite.tld to directories called document roots. For example, https://mysite.tld (and hence https://mysite.tld/, with the trailing slash) may map to a directory such as /var/www/html/ or C:\html\.
- If the requested URL includes a file path, then the web server serves that file from the corresponding subdirectory within the document root. For example, a request for https://mysite.tld/directory/path/some.file may serve /var/www/html/directory/path/some.file.
- If the requested URL does not correspond to a file, then the web server looks for files that match naming conventions. For example, a request for https://mysite.tld/directory/path/some may serve /var/www/html/directory/path/some.html.
- If no matching file exists, then the web server serves the file named index.html from within the directory corresponding to the URL path. For example, a request for https://mysite.tld/directory/path/some may serve /var/www/html/directory/path/some/index.html.
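The lookup sequence in the list above can be sketched as a small function. This is only an illustration of my recollection, not the behavior of any particular web server: the names candidate_files and resolve are hypothetical, the order in which the .html file and the index.html file are tried is an assumption, and the sketch does not guard against path segments such as "..".

```python
from __future__ import annotations

from typing import Iterable, Optional


def candidate_files(url_path: str) -> list[str]:
    """List document-root-relative files to try, in an ASSUMED order."""
    relative = url_path.strip("/")
    if not relative:
        # The root URL cannot name a file, so only the default file applies.
        return ["index.html"]
    return [
        relative,                  # exact file: directory/path/some.file
        relative + ".html",        # naming convention: directory/path/some.html
        relative + "/index.html",  # default file: directory/path/some/index.html
    ]


def resolve(url_path: str, existing_files: Iterable[str]) -> Optional[str]:
    """Return the first candidate that exists, or None (an HTTP 404)."""
    existing = set(existing_files)
    for candidate in candidate_files(url_path):
        if candidate in existing:
            return candidate
    return None
```

If both directory/path/some.html and directory/path/some/index.html exist, the result depends entirely on the order of the candidates in the list.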
My first issue with this logic is that I do not know whether the web server checks first for a matching file with the .html extension or for an index.html file, and I generally prefer to avoid architecting solutions that depend on such details. I consider the ability to map /directory/path/some to the /var/www/html/directory/path/some/index.html file, the /var/www/html/directory/path/some.html file, or even the /var/www/html/directory/path/some file to be an undefined condition to avoid. I seem to remember that web servers would not even serve files without extensions, such as /var/www/html/directory/path/some. I think this was because of security concerns and/or because there would be no way to determine the MIME type of the file so that the browser would render it rather than downloading it.
The home page (https://mysite.tld or https://mysite.tld/) is the only URL that cannot explicitly specify a file. A request for the root (/) cannot specify a file name, and therefore must map to index.html or equivalent. To improve explicit mapping of URL paths to file paths, some solutions may prohibit files named index.html except for the home page.
The .html extension is an exception to the goal of explicitly mapping URL paths to files. URLs should not expose technologies, changing technologies should not affect URLs and hence search indexes, and URLs are friendlier without the .html extension.
Maybe my most significant concern with the ability to map a single URL path either explicitly to a file or implicitly to an index.html file within a directory is that what may be a file today could need to be a directory tomorrow. For example, a site might start with a single page within a section, such as /hr/jobs, which could be implemented as /hr/jobs.html. Over time, the need for nested pages such as /hr/jobs/sales and /hr/jobs/marketing may arise, requiring a move of /hr/jobs.html to /hr/jobs/index.html. This would not change the URL of the page at /hr/jobs but requires logic in, or a change to, the build process.
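As a rough sketch of the build logic this implies, the following computes which flat files would need to become index.html files when a nested page appears. The function names are hypothetical, and the sketch assumes the convention that a page with children lives in a directory. It only plans the moves and does not touch the file system.

```python
from __future__ import annotations


def promote_to_directory(file_path: str) -> str:
    """Map a flat page file such as hr/jobs.html to hr/jobs/index.html."""
    if not file_path.endswith(".html"):
        raise ValueError("expected an .html file: " + file_path)
    if file_path == "index.html" or file_path.endswith("/index.html"):
        return file_path  # already a directory index
    return file_path[: -len(".html")] + "/index.html"


def moves_for_new_page(existing_files: set[str], new_url_path: str) -> dict[str, str]:
    """Return {old file: new file} for ancestors of the new page stored as flat files."""
    segments = new_url_path.strip("/").split("/")
    moves: dict[str, str] = {}
    # Check every ancestor URL of the new page, but not the new page itself.
    for depth in range(1, len(segments)):
        flat_file = "/".join(segments[:depth]) + ".html"
        if flat_file in existing_files:
            moves[flat_file] = promote_to_directory(flat_file)
    return moves
```

For example, moves_for_new_page({"hr/jobs.html"}, "/hr/jobs/sales") reports that hr/jobs.html should move to hr/jobs/index.html.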
I noticed that Vercel, at least by default, does not append the .html extension to requested paths. A request for /file returns HTTP 404 even if /file.html exists. If /file exists, even if it contains HTML, the browser will try to download that file rather than rendering it.
Some pages, or specific entries, or entries based on specific content types, or URL routes, and so forth, may never have descendants, and therefore may be implemented as static files (/category/product.html). Pages that may have descendants should be implemented as directories containing an index.html. Alternatively, data and logic can determine file naming logic. For example, CMS users could select a checkbox if a page cannot have children, or logic could use the list of all URLs to determine whether to create a file or directory containing an index.html file for each.
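The last option, using the list of all URLs to decide between a file and a directory, might look something like the following sketch. The function name plan_output_files is hypothetical, and the sketch assumes that every ancestor of a nested URL is itself a page in the list.

```python
from __future__ import annotations

from typing import Iterable


def plan_output_files(url_paths: Iterable[str]) -> dict[str, str]:
    """Map each URL path to an output file relative to the document root."""
    normalized = {path.strip("/") for path in url_paths}
    plan: dict[str, str] = {}
    for path in sorted(normalized):
        # A page has descendants if any other URL starts with its path plus "/".
        has_children = any(other.startswith(path + "/") for other in normalized)
        if path == "" or has_children:
            # The home page and pages with descendants become directories.
            plan["/" + path] = (path + "/index.html").lstrip("/")
        else:
            # Leaf pages become flat files.
            plan["/" + path] = path + ".html"
    return plan
```

With this approach the file for a page changes from a flat file to an index.html as soon as the first descendant URL appears in the list, without anyone selecting a checkbox.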
I don’t know what the rules for file naming should be, but I think it is important to define them. The easiest rule is that every page should be a directory containing an index.html file.
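A minimal sketch of that easiest rule, assuming a hypothetical helper named output_file that a build process could call for each page URL:

```python
def output_file(url_path: str) -> str:
    """Every page is a directory containing index.html; the root is index.html."""
    relative = url_path.strip("/")
    return relative + "/index.html" if relative else "index.html"
```

Under this rule /hr/jobs always maps to hr/jobs/index.html, whether or not /hr/jobs/sales exists, so adding descendants never moves an existing file.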