@jazlan01 what's the latest with this? It's still marked as draft. Is it ready for review?

This is ready for review. There are a few chores left, but I need some specific help with them.
Other than that, the text is complete.

Hey folks, apologies for the delay. I was traveling. Please give me 24-36 hours to resolve all the issues with this PR.

@tunetheweb I have completed the featured stats and have added the missing names. I think I messed up the linting in 3 files, so I will run them as soon as the workflow reports the errors. Do I also need to generate the DOI? If yes, please let me know where I can find the relevant guide to generate it.
src/content/en/2025/third-parties.md
  chart_url="https://docs.google.com/spreadsheets/d/e/2PACX-1vTrElluFB6gvlkt65HjzZMJ4PtgJ53tVnez46cBrhQNtNxUjDxvNPuS_xmlQBUmhSHZkOMAjd0bTJyr/pubchart?oid=1133634663&format=interactive",
  sheets_gid="445864775",
  sql_file="number_of_third_parties_by_rank_and_category.sql"
  )
}}

The top categories include `ad`, `analytics` and `cdn` categories.
This is quite different from 2024 so I dug in. I'm not sure what query you use, but I get different results.
Here are my results:
https://docs.google.com/spreadsheets/d/1FPssodcLgX8iFWFXDrthWVkBCUTl5_IJon2cyaZVudU/edit?gid=1468293248#gid=1468293248
That used this query:
```sql
#standardSQL
# Number of third-parties per websites by rank and category
WITH requests AS (
  SELECT
    client,
    page,
    url
  FROM
    `httparchive.crawl.requests`
  WHERE
    date = '2025-07-01' AND
    is_root_page
),

pages AS (
  SELECT
    client,
    page,
    rank
  FROM
    `httparchive.crawl.pages`
  WHERE
    date = '2025-07-01' AND
    is_root_page
),

third_party AS (
  SELECT
    domain,
    canonicalDomain,
    category,
    COUNT(DISTINCT page) AS page_usage
  FROM
    `httparchive.almanac.third_parties` tp
  JOIN
    requests r
  ON NET.HOST(r.url) = NET.HOST(tp.domain)
  WHERE
    date = '2025-07-01' AND
    category NOT IN ('hosting')
  GROUP BY
    domain,
    canonicalDomain,
    category
  HAVING
    page_usage >= 50
),

base AS (
  SELECT
    client,
    category,
    page,
    rank,
    COUNT(domain) AS third_parties_per_page
  FROM
    requests
  LEFT JOIN
    third_party
  ON
    NET.HOST(requests.url) = NET.HOST(third_party.domain)
  INNER JOIN
    pages
  USING (client, page)
  GROUP BY
    client,
    category,
    page,
    rank
)

SELECT
  client,
  category,
  rank_grouping,
  CASE
    WHEN rank_grouping = 100000000 THEN 'all'
    ELSE FORMAT("%'d", rank_grouping)
  END AS ranking,
  APPROX_QUANTILES(third_parties_per_page, 1000)[OFFSET(500)] AS p50_third_parties_per_page
FROM
  base,
  UNNEST([1000, 10000, 100000, 1000000, 10000000, 100000000]) AS rank_grouping
WHERE
  rank <= rank_grouping
GROUP BY
  client,
  category,
  rank_grouping
ORDER BY
  client,
  category,
  rank_grouping
```
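For anyone following along, the final SELECT in the query above puts each page into every rank bucket whose threshold is at or above the page's rank, then takes a median per bucket. Here is a minimal Python sketch of that cumulative bucketing. The sample ranks and counts are made up (not HTTP Archive data), the function name is hypothetical, and it uses an exact median, whereas `APPROX_QUANTILES` in BigQuery is approximate.

```python
from statistics import median

# Same thresholds as the UNNEST([...]) list in the query.
RANK_GROUPINGS = [1000, 10000, 100000, 1000000, 10000000, 100000000]


def median_by_rank_grouping(pages):
    """pages: list of (rank, third_parties_per_page) tuples.

    Returns {rank_grouping: median count} for non-empty buckets.
    A page is counted in every bucket where rank <= rank_grouping,
    so the buckets are cumulative rather than disjoint.
    """
    result = {}
    for grouping in RANK_GROUPINGS:
        counts = [count for rank, count in pages if rank <= grouping]
        if counts:
            result[grouping] = median(counts)
    return result


# Hypothetical sample: (rank, number of third parties on the page).
sample_pages = [(1000, 8), (1000, 4), (10000, 2), (1000000, 1), (1000000, 0)]
medians = median_by_rank_grouping(sample_pages)
```

Because the buckets are cumulative, the largest bucket (labelled 'all' in the query) always covers every page in the sample.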
@tunetheweb just to chime in on the 2024 results regarding this figure. In the published Third Parties chapter of the 2024 Web Almanac (https://almanac.httparchive.org/en/2024/third-parties), Figure 6.4 lists the top categories as consent provider, video, and customer success. When I looked at the data linked to the figure (https://docs.google.com/spreadsheets/d/18uTDBygSqgT_PNFldOz4guLSuXyMzDthRGnAG5if4sU/edit?gid=1662995860#gid=1662995860), it seemed to suggest that ad, analytics, and cdn should've been the top categories instead. I might be missing something, but there seems to be a discrepancy between the published figure in the 2024 Third Parties chapter and the associated raw data.
I did not work on the queries directly, so @jazlan01 can speak to the query details.
@tunetheweb the query you are using is reflecting the number of third parties (number_of_third_parties_by_rank_and_category.sql), not the number of third party providers.
The correct query is number_of_third_party_providers_by_rank_and_category.sql.
I will update the .md file to reference the correct SQL file.
I think they looked at the `all` category for the text, which is a bit confusing, I agree!
> @tunetheweb the query you are using is reflecting the number of third parties (number_of_third_parties_by_rank_and_category.sql), not the number of third party providers.
> The correct query is number_of_third_party_providers_by_rank_and_category.sql.

I like that change! It seems a better measure of true third-party usage.
Though it does seem a little on the low side to me (basically one per category for most categories). But maybe that makes sense?
Either way it is a change from 2024, and a subtle one that might not easily be understood (I didn't initially!), so I think it's worth making that explicit in the text. And maybe also including the other stats for a more direct comparison?
The numbers may appear a bit low because we are now reporting the median number of third-party providers per page. Since the analysis spans thousands of web pages across different websites, not every page or website includes providers from every category. In addition, a small number of dominant providers within certain categories (for example, Google in analytics and advertising) can account for much of the observed usage.
I’ve also updated the text to make this change explicit:
“The bar chart shows the median number of third-party providers per page by rank and category. In the previous edition, this analysis focused on the number of third-party domains per page by rank and category, whereas this year we measure the number of unique third-party providers, which results in lower counts overall. This year, the top categories are ad, analytics, and cdn.”
@jazlan01 could you take a look at the updated description for this plot and let me know if there are any issues?
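To illustrate why the provider-based counts come out lower than the domain-based ones, here is a minimal Python sketch. It assumes a provider is identified by grouping several request domains under one canonical name (similar in spirit to `canonicalDomain` in the query earlier in this thread). All domain names, the mapping, and the helper function are hypothetical and not taken from the actual `third_parties` table or the providers query.

```python
from statistics import median

# Hypothetical mapping: third-party domain -> (provider, category).
DOMAIN_INFO = {
    "www.metrics-a.example": ("metrics-a.example", "analytics"),
    "ssl.metrics-a.example": ("metrics-a.example", "analytics"),
    "stats.metrics-b.example": ("metrics-b.example", "analytics"),
}


def per_page_counts(page_domains, category):
    """Return (domain_count, provider_count) for one page and one category."""
    domains = {
        d for d in page_domains
        if d in DOMAIN_INFO and DOMAIN_INFO[d][1] == category
    }
    # Several domains can collapse into a single provider.
    providers = {DOMAIN_INFO[d][0] for d in domains}
    return len(domains), len(providers)


# Hypothetical pages, each listed as the third-party domains it requests.
pages = [
    ["www.metrics-a.example", "ssl.metrics-a.example"],
    ["www.metrics-a.example", "ssl.metrics-a.example", "stats.metrics-b.example"],
    ["www.metrics-a.example"],
]
counts = [per_page_counts(p, "analytics") for p in pages]
median_domains = median(c[0] for c in counts)
median_providers = median(c[1] for c in counts)
```

In this made-up sample the median number of analytics domains per page is 2, while the median number of analytics providers is 1, because two of the domains belong to the same provider.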
tunetheweb left a comment
I think this is almost there once we address the outstanding comments.
sql/2025/third-parties/number_of_third_parties_by_rank_and_category.sql
tunetheweb left a comment
Almost there. Just a few more questions and comments.
sql/2025/third-parties/consent_signal_survival_rate_through_redirects.sql
Co-authored-by: Barry Pollard <barrypollard@google.com>
tunetheweb left a comment
OK let's merge this. Thanks for all your quick responses to the review.
Closes #4089
Staged chapter: https://third-parties-2025-dot-webalmanac.uk.r.appspot.com/en/2025/third-parties
Staged home page quotes: https://third-parties-2025-dot-webalmanac.uk.r.appspot.com/en/2025/?feat=third-parties#featured-chapter
Prevalence
Consent Sharing
Inclusion chains