Make sure Google knows about all pages of your site
1 Intro
I noticed this while trying to implement "programmable search" on my site.
You cannot search the files of a website if Google does not know about all the possible routes (pages) in your site.
2 Verify Your Site is Indexed by Google
- Go to Google and type site:your-website.com in the search bar.
- If no results appear, it means Google hasn't indexed your site.
- Pages that aren't indexed can't be served on Google.
3 Indexing your site
- Go to Google Search Console.
- Add your website to the Search Console.
- Verify the domain name: copy the provided TXT value from Search Console, go to your DNS provider (iv.lt in my case), and add a new TXT record on the @ host with that value.
- Note: DNS changes may take some time to apply. If Search Console doesn’t find the record immediately, wait a day and then try to verify again
- In my case it was verified after about 15 minutes.
- Once verified, you can submit a sitemap to help Google index your site more efficiently.
4 Create a Sitemap
If you don't already have a sitemap, you can create one. For static websites, a simple way is to use an online sitemap generator:
- Use a tool like https://www.xml-sitemaps.com/ to generate a sitemap.
- Upload the generated sitemap.xml to the root directory of your website.
- Submit the sitemap URL in Google Search Console.
- Google will periodically process it and look for changes. You will be notified if anything goes wrong with it in the future.
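For reference, a minimal sitemap.xml can look roughly like this (the domain and date are placeholders, and only loc is strictly required by the sitemap protocol):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://your-website.com/</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
  <url>
    <loc>https://your-website.com/about.html</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
</urlset>
```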
4.1 TODO Automatically generate sitemap.xml with each run
Right now the sitemap has to be updated by hand each time the site changes.
I used Emacs Lisp here to understand the problem better.
Find a way to autogenerate the sitemap - maybe during a GitHub Actions step? Or better yet during build.sh; it doesn't have to be in elisp, as long as it happens during the build.
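Since the build step doesn't have to be in elisp, here is a rough sketch of the same idea in Python (BASE_URL and the 1.00/0.80 priorities are placeholder assumptions, mirroring the elisp version later in this section):

```python
# Hypothetical build-step sketch: generate sitemap.xml from the .html
# files in a directory. BASE_URL is an assumption - use your real domain.
import os
from datetime import datetime, timezone

BASE_URL = "https://your-website.com/"

def sitemap_entry(path, lastmod, priority):
    """Return one <url> entry for the given relative PATH."""
    return (
        "  <url>\n"
        f"    <loc>{BASE_URL}{path}</loc>\n"
        f"    <lastmod>{lastmod}</lastmod>\n"
        f"    <priority>{priority}</priority>\n"
        "  </url>\n"
    )

def generate_sitemap(directory):
    """Build the full sitemap XML string for all .html files in DIRECTORY."""
    lastmod = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S%z")
    entries = [sitemap_entry("", lastmod, "1.00")]  # front page first
    for name in sorted(os.listdir(directory)):
        if name.endswith(".html"):
            entries.append(sitemap_entry(name, lastmod, "0.80"))
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + "".join(entries)
        + "</urlset>\n"
    )

if __name__ == "__main__":
    with open("sitemap.xml", "w") as f:
        f.write(generate_sitemap("."))
```

Called from build.sh this would regenerate sitemap.xml on every build, which is exactly the TODO above.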
The most common values, as seen e.g. in the pnvz site - https://panevezys.lt/sitemap.xml - are:
- url
- loc
- lastmod
- changefreq
- priority (0.5 most of the times)
<url>
  <loc>https://panevezys.lt/lt/veiklos-sritys/architekturos-ir-urbanistikos-skyrius/teritoriju-planavimas-1985/parengti-detalieji-planai.html</loc>
  <lastmod>2014-06-19</lastmod>
  <changefreq>weekly</changefreq>
  <priority>0.50</priority>
</url>
Here lastmod is in a completely different format - https://www.lexus.lt/sitemap.xml
Other sites use only loc and lastmod, like here - https://www.kaunas.lt/sitemap.xml
Vilnius has a clean one; it seems the priority rating differs for some pages - https://vilnius.lt/sitemap.xml
Not going to lie, I used ChatGPT's help for this, and learned cool things like:
- How for loops work
- Built-in functions to do anything you want with files/directories on your system
- There is lots of C code in the Emacs source (dired.c, for example)
Here is the full code (current version of it):
(defun ag-generate-xml-head ()
  "Generate the head part of the XML."
  (concat "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
          "<urlset\n"
          "    xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\"\n"
          "    xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"\n"
          "    xsi:schemaLocation=\"http://www.sitemaps.org/schemas/sitemap/0.9\n"
          "    http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd\">\n"
          "\n\n"))

(defun ag-generate-first-sitemap-entry (timestamp)
  "Generate the first entry of the sitemap XML with the given TIMESTAMP."
  (concat "<url>\n"
          "  <loc>http://arvydas.dev/</loc>\n"
          "  <lastmod>" timestamp "</lastmod>\n"
          "  <priority>1.00</priority>\n"
          "</url>\n"))

(defun ag-generate-sitemap-entry (filename timestamp)
  "Generate a sitemap entry for a given FILENAME with the given TIMESTAMP."
  (concat "<url>\n"
          "  <loc>http://arvydas.dev/" filename "</loc>\n"
          "  <lastmod>" timestamp "</lastmod>\n"
          "  <priority>0.80</priority>\n"
          "</url>\n"))

(defun ag-generate-sitemap-dot-xml (directory)
  "Generate an XML file with the names of HTML files in the specified DIRECTORY."
  (message "Generation of sitemap.xml START")
  (let ((files (reverse (directory-files directory t "\\.html$")))
        (xml-file (expand-file-name "sitemap.xml" directory))
        (timestamp (format-time-string "%Y-%m-%dT%H:%M:%S%z")))
    (with-temp-file xml-file
      (insert (ag-generate-xml-head))
      (insert (ag-generate-first-sitemap-entry timestamp))
      (dolist (file files)
        (let ((filename (file-name-nondirectory file)))
          (insert (ag-generate-sitemap-entry filename timestamp))))
      (insert "</urlset>\n\n"))  ;; add the two trailing newlines here
    (message "Generated %s" xml-file)
    (message "Generation of sitemap.xml END")))

;; Call our function.
(ag-generate-sitemap-dot-xml "../../arvydasg.github.io")
And here is a cute little script to list all the files in a directory, just for reference:
(defun generate-xml (directory)
  "Print the names of HTML files in the specified DIRECTORY."
  (let ((files (directory-files directory t "\\.html$")))
    (dolist (file files)
      (let ((filename (file-name-nondirectory file)))
        (message "%s" filename)))))

(generate-xml "/home/nixos/GIT/arvydasg.github.io")
5 Ensure Your Pages are Crawlable
Make sure your pages are not blocked by robots.txt:
- Check the robots.txt file at your-website.com/robots.txt.
- Ensure there are no disallow rules that prevent Google from crawling your pages.
Example of a permissive robots.txt:

User-agent: *
Disallow:
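You can sanity-check robots.txt rules with Python's standard-library urllib.robotparser before deploying them; the helper below is a hypothetical sketch (the URL is a placeholder), parsing rules from a string the way a crawler would:

```python
# Sanity-check robots.txt rules with the standard library.
from urllib.robotparser import RobotFileParser

# The permissive example from above: empty Disallow blocks nothing.
permissive = """\
User-agent: *
Disallow:
"""

# A blocking example: "Disallow: /" blocks the whole site.
blocking = """\
User-agent: *
Disallow: /
"""

def can_crawl(robots_txt, url, agent="Googlebot"):
    """Return True if AGENT may fetch URL under the given robots.txt rules."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

print(can_crawl(permissive, "https://your-website.com/page.html"))  # True
print(can_crawl(blocking, "https://your-website.com/page.html"))    # False
```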
6 Implement Meta Tags
Ensure each page has relevant meta tags, particularly the meta description tag, to help Google understand the content of your pages.
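To check that the description tag actually made it into a page, a small stdlib-only sketch can pull it out roughly the way a crawler would (the HTML sample and class name are hypothetical):

```python
# Extract the meta description from an HTML page using only the stdlib.
from html.parser import HTMLParser

class MetaDescriptionParser(HTMLParser):
    """Collect the content of <meta name="description" content="..."> if present."""
    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name") == "description":
            self.description = a.get("content")

html = """
<html><head>
  <title>My page</title>
  <meta name="description" content="Short summary of this page for search results.">
</head><body>...</body></html>
"""

parser = MetaDescriptionParser()
parser.feed(html)
print(parser.description)  # Short summary of this page for search results.
```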
7 Wait for Indexing
After completing the steps above, it might take some time for Google to index
your site. You can check the indexing status in Google Search Console, or by
searching for site:your-website.com again.