SEO findings

This document summarizes the findings of the first round of SEO experiments that we ran in GUDMAP. You can find the full results, report, and plans for the next experiment here.

What we learned

We should set up the Google search console for all deployments.
We couldn't find any evidence of Google finding Chaise pages by crawling static sites. Therefore, we need to define a sitemap that lists the records we want Google to crawl. After we added the sitemap, Google could properly crawl Chaise pages. It's worth mentioning that Google usually prefers pages that load faster and crawls them more frequently than others. That's why most of the crawled pages in our experiment were in the server-side rendered (SSR) list.
Google allocates a limited crawl budget to each project. Therefore:
- Don't overload your sitemap; only include the records of interest.
- Keep Google away from the pages you don't want it to see. A disallow rule in robots.txt stops crawling, and a noindex meta tag stops indexing. Note that Google has to be able to crawl a page to see its noindex tag, so don't combine the two on the same page.
JSON-LD structured data helps with the discovery of static pages. It also allows us to define the page's structure and other important data related to the project. For example, you can look at the one described on the https://gudmap.org landing page.
We didn't see any improvement in Google search results for the records with the google-dataset annotation. The annotation only affected Google's Dataset Search, where only records with the annotation showed up. However, because of the limited crawl budget, most of the records with the dataset annotation were ignored, since Google preferred the server-side rendered records.
Google marked a lot of Gene and Specimen records as duplicates. We couldn't find more information about why these pages are marked as duplicates (some of them were from different tables and had completely different page structures).
- We suspect that setting the includeCanonicalTag chaise-config property would help. Unfortunately, GUDMAP is not using this property, so we couldn't determine its effectiveness. When this property is defined, Chaise adds a rel="canonical" link tag to all record pages.
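
The page-level signals mentioned above (the noindex meta tag, the rel="canonical" link tag, and JSON-LD structured data) can be checked on any rendered page. The following is a minimal sketch using only the Python standard library. The names SeoSignalParser, audit, and SAMPLE_PAGE, as well as the example.org URL, are hypothetical and only serve the illustration; the sample markup is contrived and does not show what Chaise actually emits.

```python
import json
from html.parser import HTMLParser


class SeoSignalParser(HTMLParser):
    """Collects the canonical URL, robots meta directives, and JSON-LD blocks."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.robots = []
        self.json_ld = []
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.robots = [d.strip().lower() for d in content.split(",") if d.strip()]
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            # Start buffering the script body; it is parsed when the tag closes.
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            self.json_ld.append(json.loads("".join(self._buffer)))


# Contrived sample page (assumption): real pages will differ.
SAMPLE_PAGE = """
<html><head>
<link rel="canonical" href="https://example.org/id/ABC">
<meta name="robots" content="noindex, nofollow">
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset", "name": "Example record"}
</script>
</head><body></body></html>
"""


def audit(html):
    """Summarize the SEO-related signals found in an HTML string."""
    parser = SeoSignalParser()
    parser.feed(html)
    parser.close()
    return {
        "canonical": parser.canonical,
        "noindex": "noindex" in parser.robots,
        "json_ld_types": [
            block.get("@type") for block in parser.json_ld if isinstance(block, dict)
        ],
    }


if __name__ == "__main__":
    print(audit(SAMPLE_PAGE))
```

A check like this only looks at the HTML it is given. For client-side rendered pages, the HTML would have to be captured after rendering, which this sketch does not attempt.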

Details

Google search console

Search Console tools and reports help you measure your site's Search traffic and performance, fix issues, and make your site shine in Google Search results. Please refer to this help page for more information about setting up Search Console.

In our experiment, we mainly used the following Search Console tools:

Index coverage report: See which pages Google has found on your site, which pages have been indexed, and any indexing problems encountered. This is the main tool for getting information about the indexing process. Unfortunately, we couldn't find any API or automated process that would give us better or more complete information, so we have to stick to this UI tool for now. The main issue with this tool is that you cannot list every page or see details about each one. Google only shows the first 1000 pages in each category, and in most cases you cannot see any detailed information other than whether the URL has been indexed.
URL inspection tool: Allows us to inspect an individual URL and get information about how Google has indexed it.

Google crawl budget

Google allocates a limited budget to each project. Because of this, it does not crawl all the known pages of a website in one pass and instead splits them into multiple batches. We couldn't find a definite answer about how the budget is determined; the official documentation mentions websites with roughly 10k pages. However, in our first experiment GUDMAP had only around 9k pages, and we still observed Google crawling them in smaller batches rather than all at once.

robots.txt

A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. You can find more information about generating this file here.

Using robots.txt, you can disallow crawling on specific directories or pages.
- We recommend defining a disallow rule on dev servers or on pages that you don't want Google to crawl.
- We recommend disallowing crawling on locations that you don't want Google to look at. For example, if all the files under a hatrac location are not human-readable and you don't want them to appear in Google's results, you should disallow crawling of that location. Remember that Google has a limited budget, so be as conservative as possible about which URLs you let Google crawl.
You can list your sitemap files in this file.
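
To illustrate the points above, the following sketch builds two robots.txt bodies and checks them with Python's standard urllib.robotparser module. The host name, the /hatrac/ prefix, and the sitemap file names are placeholders chosen for the example, not values from an actual deployment.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical production rules (assumption): block a non-human-readable
# file location and advertise the sitemap files.
ROBOTS_TXT = """\
User-agent: *
Disallow: /hatrac/

Sitemap: https://example.org/sitemap-static.xml
Sitemap: https://example.org/sitemap-chaise.xml
"""

# Hypothetical dev-server rules (assumption): block all crawling.
DEV_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def load_rules(robots_txt):
    """Parse a robots.txt body without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    rules = load_rules(ROBOTS_TXT)
    for url in (
        "https://example.org/hatrac/files/data.bin",
        "https://example.org/chaise/record/",
    ):
        print(url, "crawlable:", rules.can_fetch("Googlebot", url))
    print("sitemaps:", rules.site_maps())

    dev_rules = load_rules(DEV_ROBOTS_TXT)
    print("dev crawlable:", dev_rules.can_fetch("Googlebot", "https://example.org/"))
```

Checking the rules this way before deploying helps catch a disallow line that accidentally covers pages listed in the sitemap. Note that the standard-library parser may not match Google's rule evaluation in every edge case, so treat it as a sanity check.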

Sitemap

A sitemap tells Google which pages and files you think are important to your site and provides valuable information about these files. To help Google find pages of interest, you should create your sitemap files. You can find some information about creating a sitemap here. You can also use this script to generate sitemap files for your deployment.

You can have as many sitemap files as you want. We recommend having multiple sitemap files containing URLs from a specific predefined category (static vs. Chaise, different schemas, etc). This will help with the Index coverage report as you can filter based on sitemap files.
Make sure to submit your sitemap files using the Sitemaps report in Google Search Console, and also list them in your robots.txt.
If your goal is to experiment and compare how different tables/categories perform in Google:
- Don't pollute the sitemap with a lot of datasets. The more datasets there are, the longer it takes for Google to crawl and index them.
- Limit the number of rows, and make sure the count is equal across types if your goal is to compare different types of datasets. If one type has a lot of datasets, the indexing budget will most probably be spent on that type and nothing else.
After submitting (or updating) your sitemap, you should have a good idea of how Google handled it in about 2-3 months; after that, it will just reindex. However, this timing varies and depends on the number of URLs in the sitemap.
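
The recommendations above (one sitemap file per category, with an equal cap on the number of rows in each) can be sketched as follows. This is not the generation script referenced earlier; it is a minimal illustration using the standard library, and the function name build_sitemaps, the category names, and the example.org URLs are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemaps(records, max_per_category):
    """Return {category: sitemap XML bytes}, keeping at most
    max_per_category URLs in each category so categories stay comparable."""
    by_category = defaultdict(list)
    for category, url in records:
        if len(by_category[category]) < max_per_category:
            by_category[category].append(url)

    sitemaps = {}
    for category, urls in by_category.items():
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for url in urls:
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
        sitemaps[category] = ET.tostring(
            urlset, encoding="utf-8", xml_declaration=True
        )
    return sitemaps


# Hypothetical (category, URL) pairs; a real deployment would query its catalog.
RECORDS = [
    ("static", "https://example.org/about.html"),
    ("gene", "https://example.org/id/GENE-1"),
    ("gene", "https://example.org/id/GENE-2"),
    ("gene", "https://example.org/id/GENE-3"),
    ("specimen", "https://example.org/id/SPECIMEN-1"),
]

if __name__ == "__main__":
    for category, xml_bytes in build_sitemaps(RECORDS, max_per_category=2).items():
        # Each entry would be written to its own file, e.g. sitemap-<category>.xml.
        print(f"sitemap-{category}.xml: {len(xml_bytes)} bytes")
```

Keeping one file per category makes it possible to filter the Index coverage report by sitemap, and the cap keeps a large category from consuming the whole crawl budget during a comparison.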
