Sign disagreement between published QTL feathers and Supp. Table 4 for caqtl_microglia and dsqtl_yoruba

## Sign disagreement between published QTL feathers and Supp. Table 4 for caqtl_microglia and dsqtl_yoruba

Hi! While reproducing the QTL coefficient evaluation against your paper's Suppl. Table 4 numbers, I ran into an inconsistency that I think is most likely a feather export bug, though there may be a documented sign-correction step I'm missing. I wanted to flag it either way.

### Summary

For 3 of the 5 caQTL/dsQTL coefficient datasets you publish at `gs://alphagenome/evals/`, the signed `pearsonr(prediction, target)` computed directly on the feather matches the paper's Suppl. Table 4 value. For 2 datasets (`caqtl_microglia` and `dsqtl_yoruba`), the magnitude matches but the **sign is opposite**.

| Dataset | Direct from feather | Suppl. Table 4 |
|---|---|---|
| caqtl_african | **+0.7367** | +0.7368 ✅ |
| caqtl_european | **+0.5914** | +0.5916 ✅ |
| caqtl_smc | **+0.6870** | +0.6870 ✅ |
| **caqtl_microglia** | **−0.6354** | **+0.6357** ❌ |
| **dsqtl_yoruba** | **−0.8323** | **+0.8323** ❌ |

### Reproducer (needs only pandas, pyarrow, and scipy)

```python
import os
import tempfile
from urllib.request import urlretrieve

import pandas as pd  # pd.read_feather also needs pyarrow installed
from scipy.stats import pearsonr

BASE = "https://storage.googleapis.com/alphagenome/evals/"
# (short label, feather file stem)
cases = [
    ("caqtl_african",   "caqtl_african_variant_coefficient_human_predictions"),
    ("caqtl_european",  "caqtl_european_variant_coefficient_human_predictions"),
    ("caqtl_smc",       "caqtl_smc_variant_coefficient_human_predictions"),
    ("caqtl_microglia", "caqtl_microglia_variant_coefficient_human_predictions"),
    ("dsqtl_yoruba",    "dsqtl_yoruba_variant_coefficient_human_predictions"),
]
cache_dir = tempfile.gettempdir()  # portable equivalent of /tmp
for label, name in cases:
    local = os.path.join(cache_dir, f"{name}.feather")
    # Download once; later runs reuse the cached copy.
    if not os.path.exists(local):
        urlretrieve(BASE + name + ".feather", local)
    df = pd.read_feather(local)
    # Signed Pearson r between model score and measured effect size.
    r, _ = pearsonr(df["prediction"], df["target"])
    print(f"{label:<20} pearsonr(prediction, target) = {r:+.4f}")
```

Output:
```
caqtl_african        pearsonr(prediction, target) = +0.7367
caqtl_european       pearsonr(prediction, target) = +0.5914
caqtl_smc            pearsonr(prediction, target) = +0.6870
caqtl_microglia      pearsonr(prediction, target) = -0.6354
dsqtl_yoruba         pearsonr(prediction, target) = -0.8323
```

### What I checked

- The paper distinguishes signed vs unsigned Pearson explicitly (e.g. caQTL Fig. 5d: *"Signed Pearson r = 0.74; unsigned Pearson r = 0.45"*), so Suppl. Table 4's `pearsonr` column is clearly the signed version. The magnitudes agree to within 0.0003 across all 5 datasets; only the sign disagrees on 2.
- I separately verified this by running `model.predict_variant` from this SDK on a few `caqtl_microglia` variants: the values reproduce the feather's `prediction` column to within bf16 numerics (`r ≈ 0.999`). So the **`prediction` column is faithful to model output**, and the disagreement is between the `target` column and the paper's number.
- The other 3 datasets go through the same loading code I'm using (`target` → effect_size, `prediction` → score) and they match the paper to within 0.0002. So this isn't something on my end being applied inconsistently.
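
The comparison behind these checks can be written down directly from the Summary table. This is only a sketch: it hard-codes the ten numbers reported above rather than recomputing them, and the `1e-3` tolerance and the `classify` helper are my own choices, not anything from the paper or the SDK.

```python
import math

# (direct from feather, Suppl. Table 4), copied from the Summary table above.
RESULTS = {
    "caqtl_african":   (+0.7367, +0.7368),
    "caqtl_european":  (+0.5914, +0.5916),
    "caqtl_smc":       (+0.6870, +0.6870),
    "caqtl_microglia": (-0.6354, +0.6357),
    "dsqtl_yoruba":    (-0.8323, +0.8323),
}
TOL = 1e-3  # assumed tolerance for "same magnitude"


def classify(feather_r, paper_r, tol=TOL):
    """Separate 'magnitude agrees' from 'sign agrees' for one dataset."""
    same_magnitude = math.isclose(abs(feather_r), abs(paper_r), abs_tol=tol)
    same_sign = math.copysign(1.0, feather_r) == math.copysign(1.0, paper_r)
    if same_magnitude and same_sign:
        return "match"
    if same_magnitude:
        return "sign flipped"
    return "different value"


STATUS = {name: classify(f, p) for name, (f, p) in RESULTS.items()}
for name, status in STATUS.items():
    print(f"{name:<16} {status}")
```

With the table's numbers, only `caqtl_microglia` and `dsqtl_yoruba` land in the "sign flipped" bucket, and nothing lands in "different value", which is why I think this is a pure sign issue rather than a different target column.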

### Question

Is there a per-dataset sign-correction step in your table-generation pipeline that the published feathers don't reflect (e.g. a polarity flip based on which allele was assigned `REF` in the upstream QTL study for those two)? Or is the `target` column for `caqtl_microglia` and `dsqtl_yoruba` simply exported with the wrong sign?

Either answer is fine: the feathers could be re-exported, or the convention could be documented somewhere users can find it. Right now anyone naively running `pearsonr(prediction, target)` on the published artifacts gets the opposite sign of the paper for those two datasets.

Happy to send a PR with whatever fix you prefer (re-sign the feather column, or add a documentation note + a small `apply_sign_convention()` utility).
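
To make the second option concrete, here is a rough sketch of what such a utility could look like. Everything in it is hypothetical: `apply_sign_convention` does not exist in the SDK today, the `SIGN_FLIPPED` set just encodes the two datasets from this issue, and the demo frame is synthetic data standing in for a feather, not real QTL values.

```python
import numpy as np
import pandas as pd

# Hypothetical list of datasets whose `target` sign differs from the paper,
# taken from the observations in this issue.
SIGN_FLIPPED = frozenset({"caqtl_microglia", "dsqtl_yoruba"})


def apply_sign_convention(df, dataset, flipped=SIGN_FLIPPED):
    """Return a copy of `df` with `target` negated for flagged datasets."""
    out = df.copy()
    if dataset in flipped:
        out["target"] = -out["target"]
    return out


# Synthetic stand-in for a feather: predictions anti-correlated with target.
rng = np.random.default_rng(0)
target = rng.normal(size=200)
demo = pd.DataFrame({
    "prediction": -target + 0.1 * rng.normal(size=200),
    "target": target,
})

r_before = demo["prediction"].corr(demo["target"])  # Pearson by default
fixed = apply_sign_convention(demo, "caqtl_microglia")
r_after = fixed["prediction"].corr(fixed["target"])
print(f"before: {r_before:+.4f}  after: {r_after:+.4f}")
```

Negating `target` flips the sign of Pearson r and leaves its magnitude unchanged, so a helper like this would bring the two datasets in line with Suppl. Table 4 without touching `prediction`. If the flip really comes from upstream allele polarity, a per-variant fix in the export would be cleaner than a per-dataset set like this.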

### Why this matters downstream

Anyone benchmarking against your numbers using `gs://alphagenome/evals/` directly hits this. For instance, our radical-eval pipeline cross-published baselines for all 5 datasets, and the `caqtl_microglia` and `dsqtl_yoruba` ones came out with the wrong sign, traceable entirely to this.

Thanks!

