TNBC/TNBC_v2.Rmd

---
title: "Triple Negative Breast Cancer Data Analysis"
author: "Anni Liu"
date: '`r format(Sys.Date(), "%B %d, %Y")`'
knit: knitautomator::knit_filename
output: 
  word_document:
    fig_caption: no
    highlight: null
    toc: yes
    reference_docx: manuscript_style_V0.docx
params:
  date.analysis: !r format(Sys.Date(), "%Y%b%d")
  plot.fig: TRUE
  results.folder: FALSE
editor_options: 
  chunk_output_type: console
---

```{r shorcut, include=FALSE}
#################################################################
##                  RStudio keyboard shortcut                  ##
#################################################################
# Cursor at the beginning of a command line: Ctrl + A
# Cursor at the end of a command line: Ctrl + E
# Clear all the code from your console: Ctrl + L
# Create an assignment operator <-: Alt + - (Windows) or Option + - (Mac).
# Create a pipe operator %>%: Ctrl + Shift + M (Windows) or Cmd + Shift + M (Mac)
# Knit a document (knitr): Ctrl + Shift + K (Windows) or Cmd + Shift + K (Mac)
# Comment or uncomment current selection: Ctrl + Shift + C (Windows) or Cmd + Shift + C (Mac)
```


```{r attach_lib_func, include=F}
##------Attach libraries and functions------
easypackages::libraries("multcomp", "BTKR", "readxl", "tidyverse", 
                        "bannerCommenter", "parallel", "formatR",
                        "tidycmprsk", "ggsurvfit", "ragg", "magrittr",
                        "foreach", "future.apply", "fst", "data.table") |> suppressPackageStartupMessages()
"%_%" <- function(m, n) paste0(m, "_", n)
"%0%" <- function(m, n) paste0(m, n)

walk(c("uni.coxph.R", "fcphuni.stat2.R", "fcphuni.tbl2.R"), source)
```


```{r global_options, include=F}
#################################################################
##                          Automator                          ##
#################################################################
if (params$plot.fig) {
  dir.fig <- "../report/figs" %_% params$date.analysis %0% "/"
  # Need "/", otherwise, the images are saved directly under the report folder
  
  if (!dir.exists(dir.fig)) { 
    # If the figure directory does not exist, we create a new directory under the folder report using the name figs + current date passed from the params$date.analysis in YAML
    dir.create(dir.fig) 
  }
  
  knitr::opts_chunk$set( # Setting parameters when figures are plotted
    fig.width = 4, fig.height = 4, 
    fig.path = dir.fig, dev = "png", dpi = 300,
    echo = FALSE, warning = FALSE, message = FALSE,
    cache = FALSE,
    comment = ""
  )
} else { # Setting parameters when figures are not plotted
  knitr::opts_chunk$set(
    echo = FALSE, warning = FALSE, message = FALSE,
    cache = FALSE,
    comment = ""
  )
}

if (params$results.folder) { # Suitable when the results need to be stored outside the microsoft word report
  dir.result <- "../report/results" %_% params$date.analysis
  
  if (!dir.exists(dir.result)) {
    # If the directory does not exist, we create a new directory under the folder report using the name results + current date passed from the params$date.analysis in YAML 
    
    dir.create(dir.result)
  }
}
```


```{r check_data, include=F}
##------Load preprocessed data------
dat.work <- read_fst(path = "../data/derived/2023Mar15_dat_TNBC.RData")
```


# Data preparation

Among `r count <- sum(dat.work$Overall.Subsequent.BC == "Yes", na.rm = T); count` (`r n <- nrow(dat.work); round(count/n * 100, 2)`%, n = `r n`) patients with the subsequent breast cancer events, `r sum(dat.work$Overall.Subsequent.BC == "Yes" & (is.na(dat.work$Date.Subsequent.BC.Event)), na.rm = T)` patients do not have the date of the subsequent breast cancer events. Among `r count <- sum(dat.work$Overall.Subsequent.BC == "No", na.rm = T); count` (`r round(count/n * 100, 2)`%, n = `r n`) patients without the subsequent breast cancer events, `r sum(dat.work$Overall.Subsequent.BC == "No" & (is.na(dat.work$Date.of.Diagnosis)), na.rm = T)` patients do not have the date of diagnosis of triple negative breast cancer (TNBC).

In the current analysis, we classify `r sum(dat.work$Race.Ethnicity == "Arabic/Mideastern", na.rm = T)` patient with Arabic/Mideastern as White American (WA). 

The overall subsequent breast cancer events free survival composed of variables `SBE.event` and `SBE.time` is calculated from the date of TNBC diagnosis to the date of event of interest and censored at the date of death or the date of last follow-up whichever is earlier for patients not experiencing the event of interest.

The overall subsequent breast cancer events at 3, 5, or 6 years represented by the variables `SBE.3year`, `SBE.5year`, and `SBE.6year` are evaluated among patients who develop the event of interest within 3, 5, or 6 years from the diagnosis of TNBC and patients who are at risk at the 3rd, 5th, or 6th year from the diagnosis of triple negative breast cancer, respectively.

We use the cumulative incidence function, which considers the competing risk between the positive marker and the negative marker (e.g., ER positive vs ER negative), to estimate the cumulative incidence rates of ER-specific, PR-specific, or HER2-specific subsequent breast cancer events in 1, 2, 3, 5, 6, and 10 years, respectively. We use the Kaplan-Meier estimator to estimate the incidence rates of the overall subsequent breast cancer events in 1, 2, 3, 5, 6, and 10 years, respectively. 

The association between a categorical variable and a grouping variable (e.g., `Race.Ethnicity2`) is examined using the Fisher's exact test. The difference in the value of a continuous variable among patients of different groups is examined using the Wilcoxon rank sum test (for two groups comparison) or Kruskal Wallis rank sum test (for more than two groups comparison). Notice that the p-value for the Wilcoxon rank sum test or Kruskal Wallis rank sum test is aligned with the median (IQR) summmaries in a summary table. For variable with missing values, the difference in the proportion of missingness across different groups is also examined using the Fisher's exact test. 

All p-values are two-sided with statistical significance evaluated at the 0.05 alpha level. All analyses are performed in R Version 4.2.2 (R Foundation for Statistical Computing, Vienna, Austria).


```{r clean_data_start, eval=F}
##------Load the original data------
dat0 <- read_xlsx(
  path = "../data/raw/TNBC WCM 1998-2018_Deidentified_3_3_23.xlsx",
  sheet = "All patients", 
  range = "A1:BQ605",
  na = c("Unknown", "", "NA", "Not applicable ", "Not applicable", "Not aplicable", "n/a", "Unknown grade (not reported or unavailable)", "Unknown (not reported or unavailable)")) %>%
  select(-c("ER expression (%)", "PR expression (%)", "HER2/neu (IHC)", "HER2/neu (FISH)", "%...62", "%...64")) |>
  data.frame()


##------Fix the column names------
names(dat0) <- sapply(strsplit(names(dat0), split="\\."), function(x)
  paste(x[x!=""], collapse="."))


##------Rename the variables------
dat0 <- as.data.table(dat0)
dat0[,Tumor.Size.by.Path:=Tumor.Size.best.estimate.By.Path.Primary.surg.case]
dat0[,Tumor.Size.by.Imaging:=Tumor.Size.best.estimate.By.Breast.Imaging.NACT.case]
dat0[,Histology.Primary:=Histology.14]
dat0[,Histology.Subsequent:=Histology.60]
dat0[,Laterality.Subsequent:=Laterality.Subsequent.BC.Event]
dat0[,Genetic.Testing.Results:=Genetic.Testing.Result.s]
dat0[,Overall.Subsequent.BC:=Any.Subsequent.BC.event]


##------Create T2SBE/T2Death/T2LFU/T2ER/T2PR/T2HER2 (in years) ------
dat0[,T2SBE:=as.numeric(as.Date(Date.Subsequent.BC.Event)-as.Date(Date.of.Diagnosis))/365.25]
dat0[,T2Death:=as.numeric(as.Date(Date.Death)-as.Date(Date.of.Diagnosis))/365.25]
dat0[,T2LFU:=as.numeric(as.Date(Date.Last.Follow.up)-as.Date(Date.of.Diagnosis))/365.25]
dat0[ER=="Positive",T2ER:=as.numeric(as.Date(Date.Subsequent.BC.Event)-as.Date(Date.of.Diagnosis))/365.25]
dat0[PR=="Positive",T2PR:=as.numeric(as.Date(Date.Subsequent.BC.Event)-as.Date(Date.of.Diagnosis))/365.25]
dat0[HER2=="Positive",T2HER2:=as.numeric(as.Date(Date.Subsequent.BC.Event)-as.Date(Date.of.Diagnosis))/365.25]
# sapply(c("T2SBE", "T2Death", "T2LFU", "T2ER", "T2PR", "T2HER2"), function(x) with(dat0, summary(get(x))))
#               T2SBE     T2Death     T2LFU       T2ER       T2PR      T2HER2
# Min.      0.1943874   0.2902122  0.000000   1.023956   1.859001   0.9144422
# 1st Qu.   1.3018480   1.4688569  2.224504   2.648871   3.347707   1.5865845
# Median    2.7488022   2.5598905  5.919233   3.644079   4.481862   2.2587269
# Mean      4.5239847   3.9947029  6.871188   5.719370   6.950947   2.2587269
# 3rd Qu.   5.3408624   5.9739904 10.603012   9.036277  12.196441   2.9308693
# Max.     34.7542779  11.9069131 37.160849  13.412731  13.412731   3.6030116
# NA's    508.0000000 558.0000000  8.000000 580.000000 586.000000 602.0000000


##------Create SBE.3year, SBE.5year, SBE.6year (binary variables) ------
dat0[,SBE.3year:=rep(0,dim(dat0)[1])]
dat0[Overall.Subsequent.BC=="Yes"&(!is.na(T2SBE))&T2SBE<3,SBE.3year:=1]
dat0[Overall.Subsequent.BC=="No"&T2LFU<3,SBE.3year:=NA]
# dat0[Overall.Subsequent.BC=="Yes",T2SBE] |> summary()
# dat0[Overall.Subsequent.BC=="No",T2SBE] |> summary()
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   NA      NA      NA     NaN      NA      NA 
# NA's 
   #  501 

dat0[,SBE.5year:=rep(0,dim(dat0)[1])]
dat0[Overall.Subsequent.BC=="Yes"&(!is.na(T2SBE))&T2SBE<5,SBE.5year:=1]
dat0[Overall.Subsequent.BC=="No"&T2LFU<5,SBE.5year:=NA]

dat0[,SBE.6year:=rep(0,dim(dat0)[1])]
dat0[Overall.Subsequent.BC=="Yes"&(!is.na(T2SBE))&T2SBE<6,SBE.6year:=1]
dat0[Overall.Subsequent.BC=="No"&T2LFU<6,SBE.6year:=NA]
# sapply(c("SBE.3year", "SBE.5year", "SBE.6year"), function(x) with(dat0, table(get(x), useNA = "ifany")))
#      SBE.3year SBE.5year SBE.6year
# 0          400       328       281
# 1           51        71        73
# <NA>       153       205       250


##------Create SBE.event/SBE.time (for Kaplan-Meier estimator and cox model)------

# with(dat0, table(paste(Overall.Subsequent.BC, 
#                        ifelse(is.na(T2SBE), "noT2SBE", "T2SBE"), 
#                        ifelse(is.na(T2Death), "noT2Death", "T2Death"), 
#                        ifelse(is.na(T2LFU), "noT2LFU", "T2LFU"), 
#                        sep = ":")))

#   NA:noT2SBE:noT2Death:T2LFU     NA:noT2SBE:T2Death:T2LFU 
#                            4*                            1* 
# No:noT2SBE:noT2Death:noT2LFU   No:noT2SBE:noT2Death:T2LFU 
#                            8*                          460 
#     No:noT2SBE:T2Death:T2LFU  Yes:noT2SBE:noT2Death:T2LFU 
#                           33                            2* 
#    Yes:T2SBE:noT2Death:T2LFU      Yes:T2SBE:T2Death:T2LFU 
#                           84                           12 

# dat0[SBE.event==0, T2Death <= T2LFU]
# dat0[SBE.event==1, T2SBE <= T2LFU]

dat0$SBE.event <- rep(NA, dim(dat0)[1])
dat0[Overall.Subsequent.BC=="Yes"&(!is.na(T2SBE))]$SBE.event <- 1
dat0[Overall.Subsequent.BC=="No"&((!is.na(T2LFU)) | (!is.na(T2Death)))]$SBE.event <- 0
dat0[,SBE.event:=gsub(TRUE,1,
                      gsub(FALSE,0,SBE.event)) |> as.numeric()]
# table(dat0$SBE.event, useNA = "ifany")

dat0$SBE.time <- rep(NA, dim(dat0)[1])
dat0[,SBE.time:=case_when(SBE.event==0&is.na(T2Death)&(!is.na(T2LFU)) ~ T2LFU,
                          SBE.event==0&(!is.na(T2Death))&(!is.na(T2LFU))&(T2Death<T2LFU) ~ T2Death,
                          SBE.event==0&(!is.na(T2Death))&(!is.na(T2LFU))&(T2Death>=T2LFU) ~ T2LFU,
                          SBE.event==1 ~ T2SBE)]
# summary(dat0$SBE.time)


##------Create SBE.ER/SBE.PR/SBE.HER2 (for cumulative incidence function)------
# dat0[,Death:=ifelse(!is.na(Date.Death),"Dead","Alive")]
dat0$SBE.ER = dat0$SBE.PR = dat0$SBE.HER2 = rep(NA, dim(dat0)[1])

# lapply(c("ER", "PR", "HER2"), function(x) with(dat0, table(paste(Overall.Subsequent.BC, get(x), ifelse(is.na(T2SBE), "noT2SBE", "T2SBE"), ifelse(is.na(T2LFU), "noT2LFU", "T2LFU"), sep = ":"))))
# [[1]]
# 
#        NA:NA:noT2SBE:T2LFU      No:NA:noT2SBE:noT2LFU 
#                          5*                          8* 
#        No:NA:noT2SBE:T2LFU       Yes:NA:noT2SBE:T2LFU 
#                        493                          1* 
#         Yes:NA:T2SBE:T2LFU   Yes:Negative:T2SBE:T2LFU 
#                          3*                         69 
# Yes:Positive:noT2SBE:T2LFU   Yes:Positive:T2SBE:T2LFU 
#                          1*                         24 
# 
# [[2]]
# 
#        NA:NA:noT2SBE:T2LFU      No:NA:noT2SBE:noT2LFU 
#                          5*                          8* 
#        No:NA:noT2SBE:T2LFU       Yes:NA:noT2SBE:T2LFU 
#                        493                          1* 
#         Yes:NA:T2SBE:T2LFU Yes:Negative:noT2SBE:T2LFU 
#                          3*                          1* 
#   Yes:Negative:T2SBE:T2LFU   Yes:Positive:T2SBE:T2LFU 
#                         75                         18 
# 
# [[3]]
# 
#        NA:NA:noT2SBE:T2LFU      No:NA:noT2SBE:noT2LFU 
#                          5*                          8* 
#        No:NA:noT2SBE:T2LFU       Yes:NA:noT2SBE:T2LFU 
#                        493                          1* 
#         Yes:NA:T2SBE:T2LFU Yes:Negative:noT2SBE:T2LFU 
#                         12*                          1* 
#   Yes:Negative:T2SBE:T2LFU   Yes:Positive:T2SBE:T2LFU 
#                         82                          2 

dat0[,SBE.ER:=case_when(SBE.event==1&ER=="Positive" ~ "SBEYesERPos",
                        SBE.event==1&ER=="Negative" ~ "SBEYesERNeg",
                        SBE.event==0&(!is.na(T2LFU)) ~ "censor") |> 
       factor(levels=c("censor","SBEYesERPos","SBEYesERNeg"))]

dat0[,SBE.PR:=case_when(SBE.event==1&PR=="Positive" ~ "SBEYesPRPos",
                        SBE.event==1&PR=="Negative" ~ "SBEYesPRNeg",
                        SBE.event==0&(!is.na(T2LFU)) ~ "censor") |> 
       factor(levels=c("censor","SBEYesPRPos","SBEYesPRNeg"))]

dat0[,SBE.HER2:=case_when(SBE.event==1&HER2=="Positive" ~ "SBEYesHER2Pos",
                          SBE.event==1&HER2=="Negative" ~ "SBEYesHER2Neg",
                          SBE.event==0&(!is.na(T2LFU)) ~ "censor") |> 
       factor(levels=c("censor","SBEYesHER2Pos","SBEYesHER2Neg"))]

# sapply(c("SBE.ER", "SBE.PR", "SBE.HER2"), function(x) with(dat0, table(get(x), useNA = "ifany")))
#             SBE.ER SBE.PR SBE.HER2
# censor         493    493      493
# SBEYesERPos     24     18        2
# SBEYesERNeg     69     75       82
# <NA>            18     18       27


##------Create T2SBE.ER/T2SBE.PR/T2SBE.HER2 (for cumulative incidence function)------
dat0[,T2SBE.ER:=case_when(SBE.ER%in%c("SBEYesERPos","SBEYesERNeg") ~ T2SBE,
                          SBE.ER=="censor"&(T2Death<T2LFU) ~ T2Death,
                          SBE.ER=="censor"&(T2Death>=T2LFU) ~ T2LFU)]

dat0[,T2SBE.PR:=case_when(SBE.PR%in%c("SBEYesPRPos","SBEYesPRNeg") ~ T2SBE,
                          SBE.PR=="censor"&(T2Death<T2LFU) ~ T2Death,
                          SBE.PR=="censor"&(T2Death>=T2LFU) ~ T2LFU)]

dat0[,T2SBE.HER2:=case_when(SBE.HER2%in%c("SBEYesHER2Pos","SBEYesHER2Neg") ~ T2SBE,
                            SBE.HER2=="censor"&(T2Death<T2LFU) ~ T2Death,
                            SBE.HER2=="censor"&(T2Death>=T2LFU) ~ T2LFU)]

# sapply(c("T2SBE.ER", "T2SBE.PR", "T2SBE.HER2"), function(x) with(dat0, summary(get(x))))
#            T2SBE.ER    T2SBE.PR  T2SBE.HER2
# Min.      0.1368925   0.1368925   0.1368925
# 1st Qu.   1.4216290   1.4216290   1.3935661
# Median    2.7775496   2.7775496   2.6173854
# Mean      4.3946851   4.3946851   4.2382398
# 3rd Qu.   5.5386721   5.5386721   4.8377823
# Max.     34.7542779  34.7542779  34.7542779
# NA's    478.0000000 478.0000000 487.0000000


##------Re-categorize clinical T/N stages------
dat0[,Clinical.T.Stage2:=gsub("T1a|T1b|T1c|T2","T1|T2",
                             gsub("T3|T4","T3|T4",Clinical.T.Stage)) |>
  factor(levels=c("T1|T2","T3|T4"))]
dat0[,Clinical.N.Stage2:=gsub("N1|N2|N3","N1|N2|N3",Clinical.N.Stage) |>
  factor(levels=c("N0","N1|N2|N3"))]


##------Re-categorize race/ethnicity------
##------5 categories------
dat0[,Race.Ethnicity2:=gsub(".*Asian.*","Asian",
                            gsub("^Hisp.*","Hisp/Latina",
                                 gsub("NHW|Arabic.*","WA",
                                      gsub("^American.*|^Pacific.*|^Other$","Other",Race.Ethnicity)))) |> factor(levels=c("WA","AA","Asian","Hisp/Latina","Other"))]
##------4 categories------
dat0.race1 <- dat0[Race.Ethnicity2!="Other"]
dat0.race1$Race.Ethnicity2 <- droplevels(dat0.race1$Race.Ethnicity2)
# table(dat0.race1$Race.Ethnicity2, useNA = "ifany")
##------2 categories------
dat0[,Race.Ethnicity3:=ifelse(Race.Ethnicity2=="WA","WA","All.Others") |> factor(levels=c("WA","All.Others"))]
##------2 categories------
dat0.race2 <- dat0[Race.Ethnicity2%in%c("WA","AA")]
dat0.race2$Race.Ethnicity2 <- droplevels(dat0.race2$Race.Ethnicity2)
# table(dat0.race2$Race.Ethnicity2, useNA = "ifany")


##------Check the categorical variables------
all.char <- sapply(names(dat0)[sapply(dat0, is.character)], function(x) with(dat0, table(get(x), useNA = "ifany"))) 
all.fac <- sapply(names(dat0)[sapply(dat0, is.factor)], function(x) with(dat0, table(get(x), useNA = "ifany"))) 
all.cat <- list(all.char, all.fac)
View(all.cat)


##------Save the cleaned data------
date.analysis <- format(Sys.Date(), "%Y%b%d")
write_fst(dat0, path = paste0("../data/derived/", date.analysis, "_dat_TNBC.RData"), compress = 50)
fwrite(dat0, file = paste0("../data/derived/", date.analysis, "_dat_TNBC.csv"))

for (i in 1:2) {
  race.i.file <- sprintf("dat0.race%s", i)
  write_fst(get(race.i.file), path = paste0("../data/derived/", date.analysis, "_dat_", race.i.file, "_TNBC.RData"), compress = 50)
}
```


```{r}
dat.work <- read_fst(path = "../data/derived/2023Mar15_dat_TNBC.RData")
```


## Peek inside the overall distributions of patient variables
```{r set_variable, results="hide"}
vars.all <- c(
  ###  Demographical variables   
  "Age.at.Diagnosis", 
  "Race.Ethnicity2", "Race.Ethnicity3", 
  
  ###  Disease history variables  
  "Clinical.T.Stage", "Clinical.T.Stage2", "Clinical.N.Stage", "Clinical.N.Stage2", 
  "Index.Tumor.Status",
  "Past.Ipsilateral.Br.CA", "Past.Contralateral.Br.CA",
  "Tumor.Size.by.Path", "Tumor.Size.by.Imaging",
  "Needle.biopsy.proven.nodal.metastases.at.Dx",
  "Histology.Primary", "Laterality.Primary", "Any.high.grade.Disease", "Any.LVI", 
  "Ki67",

  ###. Screening/diagnosis/treatment variables
  "Mammo.Screen.Detected", "Mammo.Occult", "MRI.Screen.Detected", # Screen
  "First.Treatment.Modality", "Any.Attempt.at.Lumpectomy", "Mastectomy.Surgery",
  "CPM", "Any.SLN.Biopsy", "ALND",
  "If.Primary.Surgery.pT.Stage", "If.Primary.Surgery.pN.Stage",
  "Neoadjuvant.Chemotherapy",
  "In.NACT.did.pt.complete.all.planned.preop.CTX",
  "If.NACT.CTX.components.check.all.delivered.choice.A",
  "If.NACT.CTX.components.check.all.delivered.choice.T",
  "If.NACT.CTX.components.check.all.delivered.choice.C",
  "If.NACT.preop.Herceptin", "If.NACT.preop.Perjeta",
  "If.NACT.ypT.Stage", "If.NACT.ypN.Stage", "if.NACT.pCR", # pCR
  "If.NACT.postop.Capecitabine", "If.NACT.postop.Kadcyla.TDM.1",
  "Postop.Adjuvant.CTX",
  "If.Postop.Adjuvant.CTX.did.pt.complete.all.planned.postop.CTX",
  "Postop.Adjuvant.XRT",
  "Genetic.Testing.Done", "Genetic.Testing.Results",
  
  ### Outcome-related variables
  "Histology.Subsequent", "Laterality.Subsequent",
  "ER", "PR", "HER2",
  "SBE.event", "SBE.3year", "SBE.5year", "SBE.6year",
  "SBE.ER", "SBE.PR", "SBE.HER2",
  "T2SBE", "T2ER", "T2PR", "T2HER2", "T2LFU", "T2Death", 
  "SBE.time", "T2SBE.ER", "T2SBE.PR", "T2SBE.HER2" # Time variables for survival analysis
)

# Create an indicator [0, 1] for all categorical variables 
# vars.cat <- rep(1, length(vars.all))
# vars.cat[vars.all %in% c("Age.at.Diagnosis", "Tumor.Size.by.Imaging",
#                          "T2SBE", "T2ER", "T2PR", "T2HER2", "T2LFU", "T2Death", 
#                          "SBE.time", "T2SBE.ER", "T2SBE.PR", "T2SBE.HER2")] <- 0
vars.cat <- rep(x = 1, times = length(vars.all))
vars.cat[vars.all %in% names(dat.work)[sapply(dat.work, is.numeric)]] <- 0
```


```{r, results="hide"}
out <- fsmry.dmgrph(dat = dat.work,
                    vars = vars.all,
                    vars.cat = vars.cat,
                    by = NULL)
```


```{r, results="asis"}
knitr::kable(out[[1]], row.names = F)
```


# Distributions of patients manifesting the ER-positive/PR-positive/HER2-positive subsequent breast cancer events
Among `r sum(dat.work$Overall.Subsequent.BC == "Yes" & (!is.na(dat.work$Date.Subsequent.BC.Event)) & (!is.na(dat.work$Date.of.Diagnosis)), na.rm = T)` patients who are diagnosed with subsequent breast cancer events and who have both dates of initial TNBC diagnosis and subsequent breast cancer event diagnosis during the entire follow-up time: 

* there are `r round(sum(dat.work$SBE.event == 1 & dat.work$ER == "Positive", na.rm = T)/sum(dat.work$SBE.event == 1, na.rm = T) * 100, 2)`% (n = `r sum(dat.work$SBE.event == 1 & dat.work$ER == "Positive", na.rm = T)`) patients manifest the ER-positive subsequent breast cancer events; 

* there are `r round(sum(dat.work$SBE.event == 1 & dat.work$PR == "Positive", na.rm = T)/sum(dat.work$SBE.event == 1, na.rm = T) * 100, 2)`% (n = `r sum(dat.work$SBE.event == 1 & dat.work$PR == "Positive", na.rm = T)`) patients manifest the PR-positive subsequent breast cancer events; 

* there are `r round(sum(dat.work$SBE.event == 1 & dat.work$HER2 == "Positive", na.rm = T)/sum(dat.work$SBE.event == 1, na.rm = T) * 100, 2)`% (n = `r sum(dat.work$SBE.event == 1 & dat.work$HER2 == "Positive", na.rm = T)`) patients manifest the HER2-positive subsequent breast cancer events.


```{r eval=F, include=F}
round(sum(dat.work$SBE.event == 1 & dat.work$ER == "Positive", na.rm = T)/sum(dat.work$SBE.event == 1, na.rm = T) * 100, 2)

round(sum(dat.work$SBE.event == 1 & dat.work$PR == "Positive", na.rm = T)/sum(dat.work$SBE.event == 1, na.rm = T) * 100, 2)

round(sum(dat.work$SBE.event == 1 & dat.work$HER2 == "Positive", na.rm = T)/sum(dat.work$SBE.event == 1, na.rm = T) * 100, 2)
```


# Distributions of ER, PR, or HER2-specific subsequent breast cancer events in 1, 2, 3, 5, 6, and 10 years 
```{r competing_risk_analysis_marker, results='hide'}
##------Estimate the cumulative incidence rates of ER, PR, or HER2-specific SBE------
var.time.SBE <- c("T2SBE.ER", "T2SBE.PR", "T2SBE.HER2")
var.event.SBE <- c("SBE.ER", "SBE.PR", "SBE.HER2")


##------Check the distributions of each survival time------
sapply(c("T2SBE.ER", "T2SBE.PR", "T2SBE.HER2"), function(x) with(dat.work, summary(get(x))))


##------Set the customized theme for all plots------
theme_set(theme_classic())
theme_update(
  legend.position = "bottom",
  strip.background = element_blank(),
  axis.text = element_text(size = 10),
  axis.title = element_text(size = 12),
  legend.text = element_text(size = 10))
theme.list <- theme_get()

out.i.SBE <- mclapply(1:3,
  function(i){ 
    est.i.SBE <- cuminc(
      formula(paste0("Surv(", var.time.SBE[i], ", ", var.event.SBE[i], ") ~ 1")), 
      data = dat.work, conf.level = 0.95)
    est.i.SBE.timepoint <- est.i.SBE %>% 
      tidy(time = c(1, 2, 3, 5, 6, 10)) %>% 
      mutate(across(.cols = c(estimate, std.error, conf.low, conf.high), num, digits = 4))
    plot.i.SBE <- ggcuminc(x = est.i.SBE, 
                           outcome = names(table(dat.work[[var.event.SBE[i]]]))[-1],
                           color = c("#A73030FF"),
                           theme = list(theme.list)) +
                   add_risktable() +
                   ggplot2::scale_x_continuous(limits = c(0, 36),
                                               breaks = c(seq(0, 36, 3), 36), 
                                               labels = c(seq(0, 36, 3), 36)) +
                   ggplot2::scale_y_continuous(labels = scales::label_number(accuracy = 0.01),
                                               breaks = seq(0, 1, 0.2),
                                               limits = c(0, 1)) + 
                   ggplot2::labs(x = "Year")
    return(list(est = est.i.SBE, est.timepoint = est.i.SBE.timepoint, curve = plot.i.SBE)) 
    },
  mc.cores = 4)

dir.fig <- "../report/figs_2023Mar13/"
fig.name <- c("SBE_ER", "SBE_PR", "SBE_HER2")
plan(multisession)
future_lapply(1:3, function(i) {
  R.devices::suppressGraphics({
    ggsave(filename = dir.fig %0% fig.name[i] %0% ".png", 
    plot = out.i.SBE[[i]][["curve"]],
    device = agg_png, 
    width = 5, height = 4, units = "in", res = 300)})
})
```


## Cumulative incidence rates of ER-specific subsequent breast cancer events
```{r}
knitr::kable(out.i.SBE[[1]][["est.timepoint"]])
cat("\n")
knitr::include_graphics(dir.fig %0% "SBE_ER" %0% ".png")
```


## Cumulative incidence rates of PR-specific subsequent breast cancer events
```{r}
knitr::kable(out.i.SBE[[2]][["est.timepoint"]])
cat("\n")
knitr::include_graphics(dir.fig %0% "SBE_PR" %0% ".png")
```


## Cumulative incidence rates of HER2-specific subsequent breast cancer events
```{r}
knitr::kable(out.i.SBE[[3]][["est.timepoint"]])
cat("\n")
knitr::include_graphics(dir.fig %0% "SBE_HER2" %0% ".png")
```


# Distributions of follow-up time, time to overall subsequent breast cancer events, and other time
The variables `T2LFU` (follow-up time), `T2SBE` (time to the overall subsequent breast cancer events), and `T2Death` (time to the death) are estimated by subtracting the date of diagnosis from the date of last follow-up, from the date of overall subsequent breast cancer events, or from the date of death, respectively. The variables `T2ER` (time to the ER-positive subsequent breast cancer events), `T2PR` (time to the PR-positive subsequent breast cancer events), and `T2HER2` (time to the HER2-positive subsequent breast cancer events) are estimated by subtracting the date of diagnosis from the date of overall subsequent breast cancer events in the patients with ER-positive, PR-positive, or HER2-positive subsequent breast cancer events, respectively.

As the following table shows, the median of follow-up time in the entire cohort is `r median(dat.work$T2LFU, na.rm = T) |> round(2)` (in years), and the median time to the overall subsequent breast cancer events from the initial diagnosis of TNBC is `r median(dat.work$T2SBE, na.rm = T) |> round(2)` (in years).
```{r}
sapply(c("T2LFU", "T2SBE", "T2Death", "T2ER", "T2PR", "T2HER2"), function(x) with(dat.work, c(summary(get(x)), `SD` = sd(get(x), na.rm = T)) |> round(2))) |> knitr::kable()
```


# Distributions of overall subsequent breast cancer events
Among `r nrow(dat.work)` patients, a total of `r sum(dat.work$Overall.Subsequent.BC=="Yes"&(!is.na(dat.work$T2SBE)))` patients manifest the overall subsequent breast cancer events. These patients have both dates of initial TNBC diagnosis and subsequent breast cancer event diagnosis.


## Incidence rates of the overall subsequent breast cancer events in 1, 2, 3, 5, 6, and 10 years
```{r km_analysis_nomarker, results='hide'}
##------Estimate the incidence rates of SBE------
var.time <- c("SBE.time")
var.event <- c("SBE.event")
out.i <- mclapply(1:1,
  function(i){ 
    fit.i <- survfit2(formula(paste0("Surv(", var.time[i], ", ", var.event[i], ") ~ 1")), data = dat.work) 
    res.i <- summary(fit.i, times = c(1, 2, 3, 5, 6, 10)) 
    out.i <- data.frame(time = res.i$time, 
                        n.risk = res.i$n.risk, 
                        estimate = 1 - res.i$surv, 
                        std.error = res.i$std.err, 
                        conf.low = 1 - res.i$upper, 
                        conf.high = 1 - res.i$lower) %>% 
      mutate(across(.cols = c(estimate, std.error, conf.low, conf.high), num, digits = 4))
    plot.i <- ggsurvfit(x = fit.i,
                        type = "risk",
                        color = c("#A73030FF"),
                        theme = list(theme.list)) +
                add_risktable() +
                ggplot2::scale_x_continuous(limits = c(0, 36),
                                            breaks = c(seq(0, 36, 3), 36), 
                                            labels = c(seq(0, 36, 3), 36)) +
                ggplot2::scale_y_continuous(labels = scales::label_number(accuracy = 0.01),
                                            breaks = seq(0, 1, 0.2),
                                            limits = c(0, 1)) + 
                ggplot2::labs(x = "Year", y = "Risk = 1 - survival probability")
    return(list(fit = fit.i, res = res.i, out = out.i, plot = plot.i))},
  mc.cores = 4)

fig.name <- c("SBE")
plan(multisession) 
future_lapply(1:1, function(i) {
  R.devices::suppressGraphics({
    ggsave(filename = dir.fig %0% fig.name[i] %0% ".png", 
    plot = out.i[[i]][["plot"]],
    device = agg_png, 
    width = 5, height = 4, units = "in", res = 300)})
})
```


```{r}
knitr::kable(out.i[[1]][["out"]])
cat("\n")
knitr::include_graphics(dir.fig %0% "SBE" %0% ".png")
```


# Univariate analysis between patient factors and the overall subsequent breast cancer events
## Factors associated with 3-year overall subsequent breast cancer events (column proportion)
```{r set_variable2, results="hide"}
vars.all <- c(
  ###  Demographical variables   
  "Age.at.Diagnosis", 
  "Race.Ethnicity2", "Race.Ethnicity3", 
  
  ###  Disease history variables  
  "Clinical.T.Stage", "Clinical.T.Stage2", "Clinical.N.Stage", "Clinical.N.Stage2", 
  "Index.Tumor.Status",
  "Past.Ipsilateral.Br.CA", "Past.Contralateral.Br.CA",
  "Tumor.Size.by.Imaging",
  "Needle.biopsy.proven.nodal.metastases.at.Dx",
  "Histology.Primary", "Laterality.Primary", "Any.high.grade.Disease", "Any.LVI",

  ###. Screening/diagnosis/treatment variables
  "Mammo.Screen.Detected", "Mammo.Occult", "MRI.Screen.Detected", 
  "First.Treatment.Modality", "Any.Attempt.at.Lumpectomy", "Mastectomy.Surgery",
  "CPM", "Any.SLN.Biopsy", "ALND",
  "If.Primary.Surgery.pT.Stage", "If.Primary.Surgery.pN.Stage",
  "Neoadjuvant.Chemotherapy",
  "In.NACT.did.pt.complete.all.planned.preop.CTX",
  "If.NACT.CTX.components.check.all.delivered.choice.A",
  "If.NACT.CTX.components.check.all.delivered.choice.T",
  "If.NACT.CTX.components.check.all.delivered.choice.C",
  "If.NACT.preop.Herceptin", "If.NACT.preop.Perjeta",
  "If.NACT.ypT.Stage", "If.NACT.ypN.Stage", "if.NACT.pCR", # pCR
  "If.NACT.postop.Capecitabine", "If.NACT.postop.Kadcyla.TDM.1",
  "Postop.Adjuvant.CTX",
  "If.Postop.Adjuvant.CTX.did.pt.complete.all.planned.postop.CTX",
  "Postop.Adjuvant.XRT",
  "Genetic.Testing.Done",
  
  ### Outcome-related variables
  "Histology.Subsequent", "Laterality.Subsequent",
  "SBE.event", "SBE.3year", "SBE.5year", "SBE.6year",
  "ER", "PR", "HER2", "SBE.ER", "SBE.PR", "SBE.HER2",
  "T2SBE", "T2ER", "T2PR", "T2HER2", "T2LFU", "T2Death", 
  "SBE.time", "T2SBE.ER", "T2SBE.PR", "T2SBE.HER2" # Time variables for survival analysis
)

# Create an indicator [0, 1] for all categorical variables 
vars.cat <- rep(1, length(vars.all))
vars.cat[vars.all %in% c("Age.at.Diagnosis", "Tumor.Size.by.Imaging",
                         "T2SBE", "T2ER", "T2PR", "T2HER2", "T2LFU", "T2Death", 
                         "SBE.time", "T2SBE.ER", "T2SBE.PR", "T2SBE.HER2")] <- 0
```


```{r, results="hide"}
id.keep <- which(!is.na(dat.work$SBE.3year))
vars.cat.rm <- which(vars.all %in% c(
  "Histology.Subsequent", "Laterality.Subsequent",
  "SBE.event", "SBE.3year", "SBE.5year", "SBE.6year",
  "ER", "PR", "HER2", "SBE.ER", "SBE.PR", "SBE.HER2",
  "T2SBE", "T2ER", "T2PR", "T2HER2", "T2LFU", "T2Death", 
  "SBE.time", "T2SBE.ER", "T2SBE.PR", "T2SBE.HER2" 
))

out <- fsmry.dmgrph(dat = dat.work[id.keep, ],
                    vars = vars.all[-vars.cat.rm],
                    vars.cat = vars.cat[-vars.cat.rm],
                    by = "SBE.3year")
```


```{r, results="asis"}
knitr::kable(out[[1]], row.names = F)
```


## Factors associated with 3-year overall subsequent breast cancer events (row proportion)
```{r, results="hide"}
out <- fsmry.dmgrph(dat = dat.work[id.keep, ],
                    vars = vars.all[-vars.cat.rm],
                    vars.cat = vars.cat[-vars.cat.rm],
                    by = "SBE.3year",
                    prop.by.row = T)
```


```{r, results="asis"}
knitr::kable(out[[1]], row.names = F)
```


## Factors associated with 5-year overall subsequent breast cancer events (column proportion)
```{r, results="hide"}
id.keep <- which(!is.na(dat.work$SBE.5year))
vars.cat.rm <- which(vars.all %in% c(
  "Histology.Subsequent", "Laterality.Subsequent",
  "SBE.event", "SBE.3year", "SBE.5year", "SBE.6year",
  "ER", "PR", "HER2", "SBE.ER", "SBE.PR", "SBE.HER2",
  "T2SBE", "T2ER", "T2PR", "T2HER2", "T2LFU", "T2Death", 
  "SBE.time", "T2SBE.ER", "T2SBE.PR", "T2SBE.HER2" 
))

out <- fsmry.dmgrph(dat = dat.work[id.keep, ],
                    vars = vars.all[-vars.cat.rm],
                    vars.cat = vars.cat[-vars.cat.rm],
                    by = "SBE.5year")
```


```{r, results="asis"}
knitr::kable(out[[1]], row.names = F)
```


## Factors associated with 5-year overall subsequent breast cancer events (row proportion)
```{r, results="hide"}
out <- fsmry.dmgrph(dat = dat.work[id.keep, ],
                    vars = vars.all[-vars.cat.rm],
                    vars.cat = vars.cat[-vars.cat.rm],
                    by = "SBE.5year",
                    prop.by.row = T)
```


```{r, results="asis"}
knitr::kable(out[[1]], row.names = F)
```


## Factors associated with 6-year overall subsequent breast cancer events (column proportion)
```{r, results="hide"}
id.keep <- which(!is.na(dat.work$SBE.6year))
vars.cat.rm <- which(vars.all %in% c(  
  "Histology.Subsequent", "Laterality.Subsequent",
  "SBE.event", "SBE.3year", "SBE.5year", "SBE.6year",
  "ER", "PR", "HER2", "SBE.ER", "SBE.PR", "SBE.HER2",
  "T2SBE", "T2ER", "T2PR", "T2HER2", "T2LFU", "T2Death", 
  "SBE.time", "T2SBE.ER", "T2SBE.PR", "T2SBE.HER2" 
))

out <- fsmry.dmgrph(dat = dat.work[id.keep, ],
                    vars = vars.all[-vars.cat.rm],
                    vars.cat = vars.cat[-vars.cat.rm],
                    by = "SBE.6year")
```


```{r, results="asis"}
knitr::kable(out[[1]], row.names = F)
```


## Factors associated with 6-year overall subsequent breast cancer events (row proportion)
```{r, results="hide"}
out <- fsmry.dmgrph(dat = dat.work[id.keep, ],
                    vars = vars.all[-vars.cat.rm],
                    vars.cat = vars.cat[-vars.cat.rm],
                    by = "SBE.6year",
                    prop.by.row = T)
```


```{r, results="asis"}
knitr::kable(out[[1]], row.names = F)
```


## Factors associated with the survival time till the overall subsequent breast cancer events
```{r, results="hide"}
vars.ana <- vars.all[!vars.all %in% c(
  "Histology.Subsequent", "Laterality.Subsequent",
  "SBE.event", "SBE.3year", "SBE.5year", "SBE.6year",
  "ER", "PR", "HER2", "SBE.ER", "SBE.PR", "SBE.HER2",
  "T2SBE", "T2ER", "T2PR", "T2HER2", "T2LFU", "T2Death", 
  "SBE.time", "T2SBE.ER", "T2SBE.PR", "T2SBE.HER2")]

out <- uni.coxph(surv.time = "SBE.time", surv.event = "SBE.event")

table(unlist(lapply(out, length)))

out.stat <- lapply(out, function(x) lapply(x, fcphuni.stat))
out.tbl <- lapply(out.stat, fcphuni.tbl)
out.tbl <- do.call(rbind, out.tbl)
row.names(out.tbl) <- NULL
```


```{r}
knitr::kable(out.tbl)
```


## Factors associated with the survival time till the overall subsequent breast cancer events (with Cox PH assumption examined)
```{r, results="hide"}
out.stat <- lapply(out, function(x) lapply(x, fcphuni.stat2))
out.tbl <- lapply(out.stat, fcphuni.tbl2)
out.tbl <- do.call(rbind, out.tbl)
row.names(out.tbl) <- NULL
```


```{r}
knitr::kable(out.tbl)
```

# Distributions of patient characteristics across the race/ethnicity groups 
## Summary of patient characteristics by 5 categories of race/ethnicity (WA, AA, Asian, Hisp/Latina, Other)
```{r, results="hide"}
vars.cat.rm <- which(vars.all %in% c("Race.Ethnicity2", "Race.Ethnicity3", "T2HER2"))
# tapply(dat.work$T2HER2, dat.work$Race.Ethnicity2, mean, na.rm = T)
#          WA          AA       Asian Hisp/Latina       Other 
#    2.258727         NaN         NaN         NaN         NaN 
# tapply(dat.work$T2HER2, dat.work$Race.Ethnicity2, summary)

out <- fsmry.dmgrph(dat = dat.work,
                    vars = vars.all[-vars.cat.rm],
                    vars.cat = vars.cat[-vars.cat.rm],
                    by = "Race.Ethnicity2")
```


```{r, results="asis"}
knitr::kable(out[[1]], row.names = F)
```


## Summary of patient characteristics by 4 categories of race/ethnicity excluding the "Other" (WA, AA, Asian, Hisp/Latina)
```{r, results="hide"}
dat0.race1 <- read_fst(path = "../data/derived/2023Mar13_dat_dat0.race1_TNBC.RData")

vars.cat.rm <- which(vars.all %in% c("Race.Ethnicity2", "Race.Ethnicity3", "T2HER2"))
out <- fsmry.dmgrph(dat = dat0.race1,
                    vars = vars.all[-vars.cat.rm],
                    vars.cat = vars.cat[-vars.cat.rm],
                    by = "Race.Ethnicity2")
```


```{r, results="asis"}
knitr::kable(out[[1]], row.names = F)
```


## Summary of patient characteristics by 2 categories of race/ethnicity (WA, All other races/ethnicities)
```{r, results="hide"}
vars.cat.rm <- which(vars.all %in% c("Race.Ethnicity2", "Race.Ethnicity3"))
out <- fsmry.dmgrph(dat = dat.work,
                    vars = vars.all[-vars.cat.rm],
                    vars.cat = vars.cat[-vars.cat.rm],
                    by = "Race.Ethnicity3")
```


```{r, results="asis"}
knitr::kable(out[[1]], row.names = F)
```


## Summary of patient characteristics by 2 categories of race/ethnicity (WA, AA)
```{r, results="hide"}
dat0.race2 <- read_fst(path = "../data/derived/2023Mar13_dat_dat0.race2_TNBC.RData")

vars.cat.rm <- which(vars.all %in% c("Race.Ethnicity2", "Race.Ethnicity3", "T2HER2"))
out <- fsmry.dmgrph(dat = dat0.race2,
                    vars = vars.all[-vars.cat.rm],
                    vars.cat = vars.cat[-vars.cat.rm],
                    by = "Race.Ethnicity2")
```


```{r, results="asis"}
knitr::kable(out[[1]], row.names = F)
```


# Session info
```{r}
sessionInfo()
```