Third Parties 2025 #4200

Merged
tunetheweb merged 52 commits into HTTPArchive:main from jazlan01:thirdparties-sql-2025
Jan 11, 2026
Conversation

@jazlan01
Contributor

@jazlan01 jazlan01 commented Aug 28, 2025

Closes #4089

Staged chapter: https://third-parties-2025-dot-webalmanac.uk.r.appspot.com/en/2025/third-parties
Staged home page quotes: https://third-parties-2025-dot-webalmanac.uk.r.appspot.com/en/2025/?feat=third-parties#featured-chapter

Prevalence

  • Pages with third parties, grouped by rank
  • Median number of third party domains, grouped by rank
  • Median number of third party requests per page, grouped by rank
  • Third party domain categories per page, grouped by rank
  • Third party requests by content type
  • Top third parties by number of pages

Consent Sharing

  • Prevalence of specific consent signals (USP, TCF, GPP) in third-party requests
  • Top third-party domains ranked by volume of received consent signals
  • Consent signal prevalence broken down by third-party category
  • Analysis of standard vs. non-standard consent parameter usage
  • Consent signal survival rate through redirect and inclusion chains
  • Prevalence of consent signals in top third-party requests

Inclusion chains

  • Depth distribution of inclusion chains
  • Length of chains, grouped by initiators

@jazlan01 jazlan01 marked this pull request as draft August 28, 2025 00:42
@tunetheweb tunetheweb added the analysis Querying the dataset label Aug 31, 2025
@tunetheweb
Member

@jazlan01 what's the latest with this? It's still marked as draft. Is it ready for review?

@jazlan01
Contributor Author

This is ready for review. There are a few chores left, but I need some specific help with them.

  1. Fixing the linter errors and cleaning up the queries.
  2. Adding dynamic links to graphs.

Other than that, the text is complete.

@jazlan01 jazlan01 marked this pull request as ready for review January 1, 2026 19:21
@tunetheweb tunetheweb changed the title from "Third Parties SQL 2025" to "Third Parties 2025" Jan 3, 2026
@jazlan01
Contributor Author

jazlan01 commented Jan 9, 2026

Hey folks, apologies for the delay. I was traveling.

Please give me 24-36 hours to resolve all the issues with this PR.

@jazlan01
Contributor Author

@tunetheweb I have completed the featured stats and have added the missing names. I think I messed up the linting in 3 files, so I will run them as soon as the workflow reports the errors.

Do I also need to generate the DOI? If yes, please let me know where I can find the relevant guide to generate it.

Comment on lines +132 to +138
chart_url="https://docs.google.com/spreadsheets/d/e/2PACX-1vTrElluFB6gvlkt65HjzZMJ4PtgJ53tVnez46cBrhQNtNxUjDxvNPuS_xmlQBUmhSHZkOMAjd0bTJyr/pubchart?oid=1133634663&format=interactive",
sheets_gid="445864775",
sql_file="number_of_third_parties_by_rank_and_category.sql"
)
}}

The top categories include `ad`, `analytics` and `cdn` categories.
Member

This is quite different from 2024 so I dug in. I'm not sure what query you use, but I get different results.

Here are my results:
https://docs.google.com/spreadsheets/d/1FPssodcLgX8iFWFXDrthWVkBCUTl5_IJon2cyaZVudU/edit?gid=1468293248#gid=1468293248

That used this query:

#standardSQL
# Number of third-parties per websites by rank and category

WITH requests AS (
  SELECT
    client,
    page,
    url
  FROM
    `httparchive.crawl.requests`
  WHERE
    date = '2025-07-01' AND
    is_root_page
),

pages AS (
  SELECT
    client,
    page,
    rank
  FROM
    `httparchive.crawl.pages`
  WHERE
    date = '2025-07-01' AND
    is_root_page
),

third_party AS (
  SELECT
    domain,
    canonicalDomain,
    category,
    COUNT(DISTINCT page) AS page_usage
  FROM
    `httparchive.almanac.third_parties` tp
  JOIN
    requests r
  ON NET.HOST(r.url) = NET.HOST(tp.domain)
  WHERE
    date = '2025-07-01' AND
    category NOT IN ('hosting')
  GROUP BY
    domain,
    canonicalDomain,
    category
  HAVING
    page_usage >= 50
),

base AS (
  SELECT
    client,
    category,
    page,
    rank,
    COUNT(domain) AS third_parties_per_page
  FROM
    requests
  LEFT JOIN
    third_party
  ON
    NET.HOST(requests.url) = NET.HOST(third_party.domain)
  INNER JOIN
    pages
  USING (client, page)
  GROUP BY
    client,
    category,
    page,
    rank
)

SELECT
  client,
  category,
  rank_grouping,
  CASE
    WHEN rank_grouping = 100000000 THEN 'all'
    ELSE FORMAT("%'d", rank_grouping)
  END AS ranking,
  APPROX_QUANTILES(third_parties_per_page, 1000)[OFFSET(500)] AS p50_third_parties_per_page
FROM
  base,
  UNNEST([1000, 10000, 100000, 1000000, 10000000, 100000000]) AS rank_grouping
WHERE
  rank <= rank_grouping
GROUP BY
  client,
  category,
  rank_grouping
ORDER BY
  client,
  category,
  rank_grouping
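
The rank grouping at the end of this query is cumulative: a page is counted in every bucket whose threshold is at least its rank, so the 'all' bucket contains every page. A minimal Python sketch of that bucketing follows. The sample pages and the helper name `median_by_rank_group` are hypothetical, and `statistics.median` is only a stand-in for the approximate quantile that BigQuery computes, so values can differ slightly from `APPROX_QUANTILES`.

```python
from statistics import median

# Same thresholds as the UNNEST(...) list in the query.
RANK_GROUPS = [1000, 10000, 100000, 1000000, 10000000, 100000000]


def median_by_rank_group(pages):
    """pages: list of (rank, third_parties_per_page) tuples (hypothetical data)."""
    result = {}
    for group in RANK_GROUPS:
        # Mirrors `WHERE rank <= rank_grouping`: buckets are cumulative.
        counts = [count for rank, count in pages if rank <= group]
        if counts:
            label = "all" if group == 100000000 else f"{group:,}"
            result[label] = median(counts)
    return result


# Made-up pages: (rank bucket reported by the crawl, third parties on the page).
sample_pages = [(1000, 12), (1000, 8), (10000, 5), (100000, 3), (1000000, 1)]
print(median_by_rank_group(sample_pages))
```

Because the buckets overlap, a change in the top 1,000 also shifts every larger bucket, which is worth keeping in mind when comparing rank groups.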

Contributor

@tunetheweb just to chime in on the 2024 results regarding this figure. In the published Third Parties chapter of the 2024 Web Almanac (https://almanac.httparchive.org/en/2024/third-parties), Figure 6.4 lists the top categories as consent provider, video, and customer success. When I looked at the data linked to the figure (https://docs.google.com/spreadsheets/d/18uTDBygSqgT_PNFldOz4guLSuXyMzDthRGnAG5if4sU/edit?gid=1662995860#gid=1662995860), it seemed to suggest that ad, analytics, and cdn should have been the top categories instead. I might be missing something, but there seems to be a discrepancy between the published figure in the 2024 Third Parties chapter and the associated raw data.

I did not work on the queries directly, so @jazlan01 can speak to the query details.

Contributor Author

@tunetheweb the query you are using is reflecting the number of third parties (number_of_third_parties_by_rank_and_category.sql), not the number of third party providers.

The correct query is number_of_third_party_providers_by_rank_and_category.sql.

I will update the .md file to reference the correct SQL file.
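
To illustrate the distinction, here is a small Python sketch with made-up data. It assumes that a provider is identified by the `canonicalDomain` column of `httparchive.almanac.third_parties`, which is an assumption about the providers query, not something confirmed in this thread. The hostnames, the mapping, and the helper name `per_category_counts` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping: third-party host -> (canonical provider domain, category).
THIRD_PARTIES = {
    "a.ads.example": ("ads.example", "ad"),
    "b.ads.example": ("ads.example", "ad"),
    "c.ads.example": ("ads.example", "ad"),
    "stats.metrics.example": ("metrics.example", "analytics"),
}


def per_category_counts(request_hosts):
    """Return (distinct domains per category, distinct providers per category)."""
    domains = defaultdict(set)
    providers = defaultdict(set)
    for host in request_hosts:
        if host not in THIRD_PARTIES:
            continue  # first-party or unknown host
        provider, category = THIRD_PARTIES[host]
        domains[category].add(host)
        providers[category].add(provider)
    return (
        {category: len(hosts) for category, hosts in domains.items()},
        {category: len(names) for category, names in providers.items()},
    )


# Request hosts seen on one made-up page.
page_requests = [
    "a.ads.example",
    "b.ads.example",
    "c.ads.example",
    "a.ads.example",
    "stats.metrics.example",
    "www.site.example",
]
domain_counts, provider_counts = per_category_counts(page_requests)
print(domain_counts)    # {'ad': 3, 'analytics': 1}
print(provider_counts)  # {'ad': 1, 'analytics': 1}
```

One provider serving from three hosts counts as three third-party domains but only one provider, which is why the provider-based figures come out lower.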

Member

I think they looked at the `all` category for the text, which is a bit confusing, I agree!

Member

> @tunetheweb the query you are using is reflecting the number of third parties (number_of_third_parties_by_rank_and_category.sql), not the number of third party providers.
> The correct query is number_of_third_party_providers_by_rank_and_category.sql.

I like that change! It seems a better measure of true third-party usage.

Though it does seem a little on the low side to me (basically one per category for most categories). But maybe that makes sense?

Either way, it is a change from 2024, and a subtle one that might not easily be understood (I didn't initially!), so I think it's worth making that explicit in the text. And maybe also include the other stats for a more direct comparison?

Contributor

The numbers may appear a bit low because we are now reporting the median number of third-party providers per page. Since the analysis spans thousands of web pages across different websites, not every page/website may include providers from every category. In addition, a small number of dominant providers within certain categories (for example, Google in analytics and advertising) can account for much of the observed usage.
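
A small worked example of this effect, using made-up per-page counts (the numbers are illustrative and not taken from the crawl): when many pages have no provider in a category, or rely on a single dominant one, the median stays at 0 or 1 even though some pages use several.

```python
from statistics import median

# Hypothetical number of distinct providers per page, by category.
providers_per_page = {
    "analytics": [1, 1, 1, 0, 1, 2, 1, 0, 1, 4],
    "video": [0, 0, 1, 0, 0, 0, 3, 0, 0, 1],
}

# The median reflects the typical page, not the heaviest users.
medians = {category: median(counts) for category, counts in providers_per_page.items()}
maxima = {category: max(counts) for category, counts in providers_per_page.items()}

print(medians)
print(maxima)
```

In this sketch the analytics median is 1 and the video median is 0, while the maxima are 4 and 3, so a low median does not mean heavy users are absent.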

I’ve also updated the text to make this change explicit:

“The bar chart shows the median number of third-party providers per page by rank and category. In the previous edition, this analysis focused on the number of third-party domains per page by rank and category, whereas this year we measure the number of unique third-party providers, which results in lower counts overall. This year, the top categories are ad, analytics, and cdn.”

@jazlan01 could you take a look at the updated description for this plot and let me know if there are any issues?

Member

@tunetheweb tunetheweb left a comment

I think this is almost there once we address the outstanding comments.

Member

@tunetheweb tunetheweb left a comment

Almost there. Just a few more questions and comments.

Member

@tunetheweb tunetheweb left a comment

OK let's merge this. Thanks for all your quick responses to the review.

@tunetheweb tunetheweb merged commit a845fb0 into HTTPArchive:main Jan 11, 2026
8 checks passed

Labels

analysis: Querying the dataset
writing: Related to wording and content

Development

Successfully merging this pull request may close these issues.

Third Parties 2025

3 participants