Make sure google knows about all pages of your site

1 Intro

Noticed this when trying to implement "programmable search" into my site.

You can not search the files of a website if google does not know about all the possible routes(pages) in your site.

2 Verify Your Site is Indexed by Google

Go to Google and type site:your-website.com in the search bar.
If no results appear, it means Google hasn't indexed your site.
Pages that aren’t indexed can’t be served on Google

3 Indexing your site

Go to Google Search Console.
Add your website to the Search Console.
Verify the domain name by taking the provided link and going to iv.lt in my case, adding a new @ TXT record with the value. Like here
Note: DNS changes may take some time to apply. If Search Console doesn’t find the record immediately, wait a day and then try to verify again
After 15min it was verified
Once verified, you can submit a sitemap to help Google index your site more efficiently.

4 Create a Sitemap

If you don't already have a sitemap, you can create one. For static websites, a simple way is to use an online sitemap generator:

Use a tool like https://www.xml-sitemaps.com/ to generate a sitemap.
Upload the generated sitemap.xml to the root directory of your website.
Submit the sitemap url in Google Search Console
Google will periodically process it and look for changes. You will be notified if anything goes wrong with it in the future.

4.1 TODO Automatically genrerate sitemap.xml with each run

Now sitemap has to be updated each time the site changes.

Use emacs lisp to understand it better

Find a way to autogenerate sitemap, maybe during github actions step? Or rather better during build.sh, not necessarily in elisp, but during the build.

Seeing that the most common values are, like in pnvz site - https://panevezys.lt/sitemap.xml:

url
loc
lastmob
changefreq
priority (0.5 most of the times)

<url>
<loc>https://panevezys.lt/lt/veiklos-sritys/architekturos-ir-urbanistikos-skyrius/teritoriju-planavimas-1985/parengti-detalieji-planai.html</loc>
<lastmod>2014-06-19</lastmod>
<changefreq>weekly</changefreq>
<priority>0.50</priority>
</url>

Here lastmod is in a completely different format - https://www.lexus.lt/sitemap.xml

Other sites use only loc and lastmod, like here - https://www.kaunas.lt/sitemap.xml

Vilnius has a clean one, seems like priority rating is different for some pages - https://vilnius.lt/sitemap.xml

Not going to lie, used chatgpt help for this, learned cool things, like:

How for loops work
Build in functions to do anything you want with files/directories in your system
Lots of C code in emacs source code (dired.c for example)

Here is the full code (current version of it):

(defun ag-generate-xml-head ()
  "Generate the head part of the XML."
  (concat "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
          "<urlset\n"
          "    xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\"\n"
          "    xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"\n"
          "    xsi:schemaLocation=\"http://www.sitemaps.org/schemas/sitemap/0.9\n"
          "          http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd\">\n"
          "\n\n"))

(defun ag-generate-first-sitemap-entry (timestamp)
  "Generate the first entry of the sitemap XML with the given TIMESTAMP."
  (concat "<url>\n"
          "  <loc>http://arvydas.dev/</loc>\n"
          "  <lastmod>" timestamp "</lastmod>\n"
          "  <priority>1.00</priority>\n"
          "</url>\n"))

(defun ag-generate-sitemap-entry (filename timestamp)
  "Generate a sitemap entry for a given FILENAME with the given TIMESTAMP."
  (concat "<url>\n"
          "  <loc>http://arvydas.dev/" filename "</loc>\n"
          "  <lastmod>" timestamp "</lastmod>\n"
          "  <priority>0.80</priority>\n"
          "</url>\n"))

(defun ag-generate-sitemap-dot-xml (directory)
  "Generate an XML file with the names of HTML files in the specified DIRECTORY."
  (message "Generation of sitemap.xml START")
  (let ((files (reverse (directory-files directory t "\\.html$")))
        (xml-file (expand-file-name "sitemap.xml" directory))
        (timestamp (format-time-string "%Y-%m-%dT%H:%M:%S%z")))
    (with-temp-file xml-file
      (insert (ag-generate-xml-head))
      (insert (ag-generate-first-sitemap-entry timestamp))
      (dolist (file files)
        (let ((filename (file-name-nondirectory file)))
          (insert (ag-generate-sitemap-entry filename timestamp))))
      (insert "</urlset>\n\n")) ;; Add the two newlines here
    (message "Generated %s" xml-file)
    (message "Generation of sitemap.xml END")))

;; call out function
(ag-generate-sitemap-dot-xml "../../arvydasg.github.io")

And here is a cute little script to list all the files in a directory, just for reference:

(defun generate-xml (directory)
  "Print the names of HTML files in the specified DIRECTORY."
  (let ((files (directory-files directory t "\\.html$")))
    (dolist (file files)
      (let ((filename (file-name-nondirectory file)))
        (message "%s" filename)))))

(generate-xml "/home/nixos/GIT/arvydasg.github.io")

5 Ensure Your Pages are Crawlable

Make sure your pages are not blocked by robots.txt:

Check the robots.txt file at your-website.com/robots.txt.
Ensure there are no disallow rules that prevent Google from crawling your pages.

Example of a permissive robots.txt:

User-agent: *
Disallow:

6 Implement Meta Tags

Ensure each page has relevant meta tags, particularly the meta description tag, to help Google understand the content of your pages.

7 Wait for Indexing

After completing the steps above, it might take some time for Google to index your site. You can check the indexing status in Google Search Console. Also can check by going to site:your-website.com.