Description
Hello, thanks for your amazing work :)
I tried to reproduce the ICDAR 2015 results from the paper, but I can't match them using the pre-trained weights.
I didn't change any code: I downloaded the dataset and the pre-trained weights, then trained from the pre-trained weights. The loss stays at almost 30.0, so training doesn't seem to converge.
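For context on the solver settings in the log below: in detectron2, `SOLVER.IMS_PER_BATCH` is the *total* batch size summed over all GPUs, so `IMS_PER_BATCH: 8` with `num_gpus=8` means one image per GPU. A small hypothetical helper (not part of the repo) to sanity-check that split, plus the usual linear LR-scaling rule in case the batch size is changed from the reference config:

```python
def effective_setup(ims_per_batch: int, num_gpus: int,
                    base_lr: float, reference_batch: int = 8) -> dict:
    """Compute per-GPU batch size and a linearly scaled learning rate.

    In detectron2, SOLVER.IMS_PER_BATCH is the total batch size across
    all GPUs, not per GPU. The linear scaling rule (an assumption here,
    not something this repo enforces) multiplies the base LR by
    (actual batch / reference batch) when the batch size changes.
    """
    if ims_per_batch % num_gpus != 0:
        raise ValueError("IMS_PER_BATCH must be divisible by the GPU count")
    return {
        "per_gpu_batch": ims_per_batch // num_gpus,
        "scaled_lr": base_lr * ims_per_batch / reference_batch,
    }

# The config below: IMS_PER_BATCH=8 on 8 GPUs with BASE_LR=1e-5
print(effective_setup(8, 8, 1e-5))  # {'per_gpu_batch': 1, 'scaled_lr': 1e-05}
```

So with this config each GPU sees a single image per iteration, which matches the repo's released settings rather than explaining the divergence.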
My log is below.
[07/25 14:21:07] detectron2 INFO: Rank of current process: 0. World size: 8
[07/25 14:21:11] detectron2 INFO: Environment info:
sys.platform linux
Python 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
numpy 1.23.4
detectron2 0.6 @/usr/local/lib/python3.8/dist-packages/detectron2
Compiler GCC 9.4
CUDA compiler CUDA 11.3
detectron2 arch flags 8.6
DETECTRON2_ENV_MODULE
PyTorch 1.12.1+cu113 @/usr/local/lib/python3.8/dist-packages/torch
PyTorch debug build False
torch._C._GLIBCXX_USE_CXX11_ABI False
GPU available Yes
GPU 0,1,2,3,4,5,6,7 Tesla T4 (arch=7.5)
Driver version 450.80.02
CUDA_HOME /usr/local/cuda
Pillow 9.2.0
torchvision 0.13.1+cu113 @/usr/local/lib/python3.8/dist-packages/torchvision
torchvision arch flags 3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore 0.1.5.post20221221
iopath 0.1.9
cv2 4.1.2
PyTorch built with:
- GCC 9.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.3
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.3.2 (built against CUDA 11.5)
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
[07/25 14:21:11] detectron2 INFO: Command line arguments: Namespace(config_file='configs/TESTR/ICDAR15/TESTR_R_50_Polygon.yaml', dist_url='tcp://127.0.0.1:59588', eval_only=False, machine_rank=0, num_gpus=8, num_machines=1, opts=[], resume=False)
[07/25 14:21:11] detectron2 INFO: Contents of args.config_file=configs/TESTR/ICDAR15/TESTR_R_50_Polygon.yaml:
_BASE_: "Base-ICDAR15-Polygon.yaml"
MODEL:
WEIGHTS: "weights/TESTR/pretrain_testr_R_50_polygon.pth"
RESNETS:
DEPTH: 50
TRANSFORMER:
NUM_FEATURE_LEVELS: 4
INFERENCE_TH_TEST: 0.3
ENC_LAYERS: 6
DEC_LAYERS: 6
DIM_FEEDFORWARD: 1024
HIDDEN_DIM: 256
DROPOUT: 0.1
NHEADS: 8
NUM_QUERIES: 100
ENC_N_POINTS: 4
DEC_N_POINTS: 4
SOLVER:
IMS_PER_BATCH: 8
BASE_LR: 1e-5
LR_BACKBONE: 1e-6
WARMUP_ITERS: 0
STEPS: (200000,)
MAX_ITER: 200000
CHECKPOINT_PERIOD: 10000
TEST:
EVAL_PERIOD: 10000
OUTPUT_DIR: "output/TESTR/icdar15/TESTR_R_50_Polygon"
[07/25 14:21:11] detectron2 INFO: Running with full config:
CUDNN_BENCHMARK: false
DATALOADER:
ASPECT_RATIO_GROUPING: true
FILTER_EMPTY_ANNOTATIONS: true
NUM_WORKERS: 4
REPEAT_THRESHOLD: 0.0
SAMPLER_TRAIN: TrainingSampler
DATASETS:
PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
PROPOSAL_FILES_TEST: []
PROPOSAL_FILES_TRAIN: []
TEST:
- icdar2015_test
TRAIN:
- icdar2015_train
GLOBAL:
HACK: 1.0
INPUT:
CROP:
CROP_INSTANCE: false
ENABLED: true
SIZE:
- 0.1
- 0.1
TYPE: relative_range
FORMAT: RGB
HFLIP_TRAIN: false
MASK_FORMAT: polygon
MAX_SIZE_TEST: 4000
MAX_SIZE_TRAIN: 2333
MIN_SIZE_TEST: 1440
MIN_SIZE_TRAIN:
- 800
- 832
- 864
- 896
- 1000
- 1200
- 1400
MIN_SIZE_TRAIN_SAMPLING: choice
RANDOM_FLIP: horizontal
MODEL:
ANCHOR_GENERATOR:
ANGLES:
-
- -90
- 0
- 90
ASPECT_RATIOS:
-
- 0.5
- 1.0
- 2.0
NAME: DefaultAnchorGenerator
OFFSET: 0.0
SIZES:
-
- 32
- 64
- 128
- 256
- 512
BACKBONE:
ANTI_ALIAS: false
FREEZE_AT: 2
NAME: build_resnet_backbone
BASIS_MODULE:
ANN_SET: coco
COMMON_STRIDE: 8
CONVS_DIM: 128
IN_FEATURES:
- p3
- p4
- p5
LOSS_ON: false
LOSS_WEIGHT: 0.3
NAME: ProtoNet
NORM: SyncBN
NUM_BASES: 4
NUM_CLASSES: 80
NUM_CONVS: 3
BATEXT:
CANONICAL_SIZE: 96
CONV_DIM: 256
CUSTOM_DICT: ''
IN_FEATURES:
- p2
- p3
- p4
NUM_CHARS: 25
NUM_CONV: 2
POOLER_RESOLUTION:
- 8
- 32
POOLER_SCALES:
- 0.25
- 0.125
- 0.0625
RECOGNITION_LOSS: ctc
RECOGNIZER: attn
SAMPLING_RATIO: 1
USE_AET: false
USE_COORDCONV: false
VOC_SIZE: 96
BLENDMASK:
ATTN_SIZE: 14
BOTTOM_RESOLUTION: 56
INSTANCE_LOSS_WEIGHT: 1.0
POOLER_SAMPLING_RATIO: 1
POOLER_SCALES:
- 0.25
POOLER_TYPE: ROIAlignV2
TOP_INTERP: bilinear
VISUALIZE: false
BOXINST:
BOTTOM_PIXELS_REMOVED: 10
ENABLED: false
PAIRWISE:
COLOR_THRESH: 0.3
DILATION: 2
SIZE: 3
WARMUP_ITERS: 10000
BiFPN:
IN_FEATURES:
- res2
- res3
- res4
- res5
NORM: ''
NUM_REPEATS: 6
OUT_CHANNELS: 160
CONDINST:
BOTTOM_PIXELS_REMOVED: -1
MASK_BRANCH:
CHANNELS: 128
IN_FEATURES:
- p3
- p4
- p5
NORM: BN
NUM_CONVS: 4
OUT_CHANNELS: 8
SEMANTIC_LOSS_ON: false
MASK_HEAD:
CHANNELS: 8
DISABLE_REL_COORDS: false
NUM_LAYERS: 3
USE_FP16: false
MASK_OUT_STRIDE: 4
MAX_PROPOSALS: -1
TOPK_PROPOSALS_PER_IM: -1
DEVICE: cuda
DLA:
CONV_BODY: DLA34
NORM: FrozenBN
OUT_FEATURES:
- stage2
- stage3
- stage4
- stage5
FCOS:
BOX_QUALITY: ctrness
CENTER_SAMPLE: true
FPN_STRIDES:
- 8
- 16
- 32
- 64
- 128
INFERENCE_TH_TEST: 0.05
INFERENCE_TH_TRAIN: 0.05
IN_FEATURES:
- p3
- p4
- p5
- p6
- p7
LOC_LOSS_TYPE: giou
LOSS_ALPHA: 0.25
LOSS_GAMMA: 2.0
LOSS_NORMALIZER_CLS: fg
LOSS_WEIGHT_CLS: 1.0
NMS_TH: 0.6
NORM: GN
NUM_BOX_CONVS: 4
NUM_CLASSES: 80
NUM_CLS_CONVS: 4
NUM_SHARE_CONVS: 0
POST_NMS_TOPK_TEST: 100
POST_NMS_TOPK_TRAIN: 100
POS_RADIUS: 1.5
PRE_NMS_TOPK_TEST: 1000
PRE_NMS_TOPK_TRAIN: 1000
PRIOR_PROB: 0.01
SIZES_OF_INTEREST:
- 64
- 128
- 256
- 512
THRESH_WITH_CTR: false
TOP_LEVELS: 2
USE_DEFORMABLE: false
USE_RELU: true
USE_SCALE: true
YIELD_BOX_FEATURES: false
YIELD_PROPOSAL: false
FPN:
FUSE_TYPE: sum
IN_FEATURES: []
NORM: ''
OUT_CHANNELS: 256
KEYPOINT_ON: false
LOAD_PROPOSALS: false
MASK_ON: false
MEInst:
AGNOSTIC: true
CENTER_SAMPLE: true
DIM_MASK: 60
FLAG_PARAMETERS: false
FPN_STRIDES:
- 8
- 16
- 32
- 64
- 128
GCN_KERNEL_SIZE: 9
INFERENCE_TH_TEST: 0.05
INFERENCE_TH_TRAIN: 0.05
IN_FEATURES:
- p3
- p4
- p5
- p6
- p7
IOU_LABELS:
- 0
- 1
IOU_THRESHOLDS:
- 0.5
LAST_DEFORMABLE: false
LOC_LOSS_TYPE: giou
LOSS_ALPHA: 0.25
LOSS_GAMMA: 2.0
LOSS_ON_MASK: false
MASK_LOSS_TYPE: mse
MASK_ON: true
MASK_SIZE: 28
NMS_TH: 0.6
NORM: GN
NUM_BOX_CONVS: 4
NUM_CLASSES: 80
NUM_CLS_CONVS: 4
NUM_MASK_CONVS: 4
NUM_SHARE_CONVS: 0
PATH_COMPONENTS: datasets/coco/components/coco_2017_train_class_agnosticTrue_whitenTrue_sigmoidTrue_60.npz
POST_NMS_TOPK_TEST: 100
POST_NMS_TOPK_TRAIN: 100
POS_RADIUS: 1.5
PRE_NMS_TOPK_TEST: 1000
PRE_NMS_TOPK_TRAIN: 1000
PRIOR_PROB: 0.01
SIGMOID: true
SIZES_OF_INTEREST:
- 64
- 128
- 256
- 512
THRESH_WITH_CTR: false
TOP_LEVELS: 2
TYPE_DEFORMABLE: DCNv1
USE_DEFORMABLE: false
USE_GCN_IN_MASK: false
USE_RELU: true
USE_SCALE: true
WHITEN: true
META_ARCHITECTURE: TransformerDetector
MOBILENET: false
PANOPTIC_FPN:
COMBINE:
ENABLED: true
INSTANCES_CONFIDENCE_THRESH: 0.5
OVERLAP_THRESH: 0.5
STUFF_AREA_LIMIT: 4096
INSTANCE_LOSS_WEIGHT: 1.0
PIXEL_MEAN:
- 123.675
- 116.28
- 103.53
PIXEL_STD:
- 58.395
- 57.12
- 57.375
PROPOSAL_GENERATOR:
MIN_SIZE: 0
NAME: RPN
RESNETS:
DEFORM_INTERVAL: 1
DEFORM_MODULATED: false
DEFORM_NUM_GROUPS: 1
DEFORM_ON_PER_STAGE:
- false
- false
- false
- false
DEPTH: 50
NORM: FrozenBN
NUM_GROUPS: 1
OUT_FEATURES:
- res3
- res4
- res5
RES2_OUT_CHANNELS: 256
RES5_DILATION: 1
STEM_OUT_CHANNELS: 64
STRIDE_IN_1X1: false
WIDTH_PER_GROUP: 64
RETINANET:
BBOX_REG_LOSS_TYPE: smooth_l1
BBOX_REG_WEIGHTS: &id002
- 1.0
- 1.0
- 1.0
- 1.0
FOCAL_LOSS_ALPHA: 0.25
FOCAL_LOSS_GAMMA: 2.0
IN_FEATURES:
- p3
- p4
- p5
- p6
- p7
IOU_LABELS:
- 0
- -1
- 1
IOU_THRESHOLDS:
- 0.4
- 0.5
NMS_THRESH_TEST: 0.5
NORM: ''
NUM_CLASSES: 80
NUM_CONVS: 4
PRIOR_PROB: 0.01
SCORE_THRESH_TEST: 0.05
SMOOTH_L1_LOSS_BETA: 0.1
TOPK_CANDIDATES_TEST: 1000
ROI_BOX_CASCADE_HEAD:
BBOX_REG_WEIGHTS:
- &id001
- 10.0
- 10.0
- 5.0
- 5.0
-
- 20.0
- 20.0
- 10.0
- 10.0
-
- 30.0
- 30.0
- 15.0
- 15.0
IOUS:
- 0.5
- 0.6
- 0.7
ROI_BOX_HEAD:
BBOX_REG_LOSS_TYPE: smooth_l1
BBOX_REG_LOSS_WEIGHT: 1.0
BBOX_REG_WEIGHTS: *id001
CLS_AGNOSTIC_BBOX_REG: false
CONV_DIM: 256
FC_DIM: 1024
FED_LOSS_FREQ_WEIGHT_POWER: 0.5
FED_LOSS_NUM_CLASSES: 50
NAME: ''
NORM: ''
NUM_CONV: 0
NUM_FC: 0
POOLER_RESOLUTION: 14
POOLER_SAMPLING_RATIO: 0
POOLER_TYPE: ROIAlignV2
SMOOTH_L1_BETA: 0.0
TRAIN_ON_PRED_BOXES: false
USE_FED_LOSS: false
USE_SIGMOID_CE: false
ROI_HEADS:
BATCH_SIZE_PER_IMAGE: 512
IN_FEATURES:
- res4
IOU_LABELS:
- 0
- 1
IOU_THRESHOLDS:
- 0.5
NAME: Res5ROIHeads
NMS_THRESH_TEST: 0.5
NUM_CLASSES: 80
POSITIVE_FRACTION: 0.25
PROPOSAL_APPEND_GT: true
SCORE_THRESH_TEST: 0.05
ROI_KEYPOINT_HEAD:
CONV_DIMS:
- 512
- 512
- 512
- 512
- 512
- 512
- 512
- 512
LOSS_WEIGHT: 1.0
MIN_KEYPOINTS_PER_IMAGE: 1
NAME: KRCNNConvDeconvUpsampleHead
NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: true
NUM_KEYPOINTS: 17
POOLER_RESOLUTION: 14
POOLER_SAMPLING_RATIO: 0
POOLER_TYPE: ROIAlignV2
ROI_MASK_HEAD:
CLS_AGNOSTIC_MASK: false
CONV_DIM: 256
NAME: MaskRCNNConvUpsampleHead
NORM: ''
NUM_CONV: 0
POOLER_RESOLUTION: 14
POOLER_SAMPLING_RATIO: 0
POOLER_TYPE: ROIAlignV2
RPN:
BATCH_SIZE_PER_IMAGE: 256
BBOX_REG_LOSS_TYPE: smooth_l1
BBOX_REG_LOSS_WEIGHT: 1.0
BBOX_REG_WEIGHTS: *id002
BOUNDARY_THRESH: -1
CONV_DIMS:
- -1
HEAD_NAME: StandardRPNHead
IN_FEATURES:
- res4
IOU_LABELS:
- 0
- -1
- 1
IOU_THRESHOLDS:
- 0.3
- 0.7
LOSS_WEIGHT: 1.0
NMS_THRESH: 0.7
POSITIVE_FRACTION: 0.5
POST_NMS_TOPK_TEST: 1000
POST_NMS_TOPK_TRAIN: 2000
PRE_NMS_TOPK_TEST: 6000
PRE_NMS_TOPK_TRAIN: 12000
SMOOTH_L1_BETA: 0.0
SEM_SEG_HEAD:
COMMON_STRIDE: 4
CONVS_DIM: 128
IGNORE_VALUE: 255
IN_FEATURES:
- p2
- p3
- p4
- p5
LOSS_WEIGHT: 1.0
NAME: SemSegFPNHead
NORM: GN
NUM_CLASSES: 54
SOLOV2:
FPN_INSTANCE_STRIDES:
- 8
- 8
- 16
- 32
- 32
FPN_SCALE_RANGES:
-
- 1
- 96
-
- 48
- 192
-
- 96
- 384
-
- 192
- 768
-
- 384
- 2048
INSTANCE_CHANNELS: 512
INSTANCE_IN_CHANNELS: 256
INSTANCE_IN_FEATURES:
- p2
- p3
- p4
- p5
- p6
LOSS:
DICE_WEIGHT: 3.0
FOCAL_ALPHA: 0.25
FOCAL_GAMMA: 2.0
FOCAL_USE_SIGMOID: true
FOCAL_WEIGHT: 1.0
MASK_CHANNELS: 128
MASK_IN_CHANNELS: 256
MASK_IN_FEATURES:
- p2
- p3
- p4
- p5
MASK_THR: 0.5
MAX_PER_IMG: 100
NMS_KERNEL: gaussian
NMS_PRE: 500
NMS_SIGMA: 2
NMS_TYPE: matrix
NORM: GN
NUM_CLASSES: 80
NUM_GRIDS:
- 40
- 36
- 24
- 16
- 12
NUM_INSTANCE_CONVS: 4
NUM_KERNELS: 256
NUM_MASKS: 256
PRIOR_PROB: 0.01
SCORE_THR: 0.1
SIGMA: 0.2
TYPE_DCN: DCN
UPDATE_THR: 0.05
USE_COORD_CONV: true
USE_DCN_IN_INSTANCE: false
TOP_MODULE:
DIM: 16
NAME: conv
TRANSFORMER:
AUX_LOSS: true
DEC_LAYERS: 6
DEC_N_POINTS: 4
DIM_FEEDFORWARD: 1024
DROPOUT: 0.1
ENABLED: true
ENC_LAYERS: 6
ENC_N_POINTS: 4
HIDDEN_DIM: 256
INFERENCE_TH_TEST: 0.3
LOSS:
AUX_LOSS: true
BOX_CLASS_WEIGHT: 2.0
BOX_COORD_WEIGHT: 5.0
BOX_GIOU_WEIGHT: 2.0
FOCAL_ALPHA: 0.25
FOCAL_GAMMA: 2.0
POINT_CLASS_WEIGHT: 2.0
POINT_COORD_WEIGHT: 5.0
POINT_TEXT_WEIGHT: 4.0
NHEADS: 8
NUM_CHARS: 25
NUM_CTRL_POINTS: 16
NUM_FEATURE_LEVELS: 4
NUM_QUERIES: 100
POSITION_EMBEDDING_SCALE: 6.283185307179586
USE_POLYGON: true
VOC_SIZE: 96
VOVNET:
BACKBONE_OUT_CHANNELS: 256
CONV_BODY: V-39-eSE
NORM: FrozenBN
OUT_CHANNELS: 256
OUT_FEATURES:
- stage2
- stage3
- stage4
- stage5
WEIGHTS: weights/TESTR/pretrain_testr_R_50_polygon.pth
OUTPUT_DIR: output/TESTR/icdar15/TESTR_R_50_Polygon
SEED: -1
SOLVER:
AMP:
ENABLED: false
BASE_LR: 1.0e-05
BASE_LR_END: 0.0
BIAS_LR_FACTOR: 1.0
CHECKPOINT_PERIOD: 10000
CLIP_GRADIENTS:
CLIP_TYPE: full_model
CLIP_VALUE: 0.1
ENABLED: true
NORM_TYPE: 2.0
GAMMA: 0.1
IMS_PER_BATCH: 8
LR_BACKBONE: 1.0e-06
LR_BACKBONE_NAMES:
- backbone.0
LR_LINEAR_PROJ_MULT: 0.1
LR_LINEAR_PROJ_NAMES:
- reference_points
- sampling_offsets
LR_SCHEDULER_NAME: WarmupMultiStepLR
MAX_ITER: 200000
MOMENTUM: 0.9
NESTEROV: false
NUM_DECAYS: 3
OPTIMIZER: ADAMW
REFERENCE_WORLD_SIZE: 0
RESCALE_INTERVAL: false
STEPS:
- 30000
WARMUP_FACTOR: 0.001
WARMUP_ITERS: 0
WARMUP_METHOD: linear
WEIGHT_DECAY: 0.0001
WEIGHT_DECAY_BIAS: null
WEIGHT_DECAY_NORM: 0.0
TEST:
AUG:
ENABLED: false
FLIP: true
MAX_SIZE: 4000
MIN_SIZES:
- 400
- 500
- 600
- 700
- 800
- 900
- 1000
- 1100
- 1200
DETECTIONS_PER_IMAGE: 100
EVAL_PERIOD: 10000
EXPECTED_RESULTS: []
KEYPOINT_OKS_SIGMAS: []
LEXICON_TYPE: 3
PRECISE_BN:
ENABLED: false
NUM_ITER: 200
USE_LEXICON: true
WEIGHTED_EDIT_DIST: true
VERSION: 2
VIS_PERIOD: 0
[07/25 14:21:11] detectron2 INFO: Full config saved to output/TESTR/icdar15/TESTR_R_50_Polygon/config.yaml
[07/25 14:21:11] d2.utils.env INFO: Using a generated random seed 11819301
[07/25 14:21:13] d2.engine.defaults INFO: Model:
TransformerDetector(
(testr): TESTR(
(backbone): Joiner(
(0): MaskedBackbone(
(backbone): ResNet(
(stem): BasicStem(
(conv1): Conv2d(
3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
)
(res2): Sequential(
(0): BottleneckBlock(
(shortcut): Conv2d(
64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv1): Conv2d(
64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv2): Conv2d(
64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv3): Conv2d(
64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
)
(1): BottleneckBlock(
(conv1): Conv2d(
256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv2): Conv2d(
64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv3): Conv2d(
64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
)
(2): BottleneckBlock(
(conv1): Conv2d(
256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv2): Conv2d(
64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv3): Conv2d(
64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
)
)
(res3): Sequential(
(0): BottleneckBlock(
(shortcut): Conv2d(
256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv1): Conv2d(
256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv2): Conv2d(
128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv3): Conv2d(
128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
)
(1): BottleneckBlock(
(conv1): Conv2d(
512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv2): Conv2d(
128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv3): Conv2d(
128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
)
(2): BottleneckBlock(
(conv1): Conv2d(
512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv2): Conv2d(
128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv3): Conv2d(
128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
)
(3): BottleneckBlock(
(conv1): Conv2d(
512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv2): Conv2d(
128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv3): Conv2d(
128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
)
)
(res4): Sequential(
(0): BottleneckBlock(
(shortcut): Conv2d(
512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
(conv1): Conv2d(
512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(1): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(2): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(3): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(4): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(5): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
)
(res5): Sequential(
(0): BottleneckBlock(
(shortcut): Conv2d(
1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False
(norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
)
(conv1): Conv2d(
1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv2): Conv2d(
512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv3): Conv2d(
512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
)
)
(1): BottleneckBlock(
(conv1): Conv2d(
2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv2): Conv2d(
512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv3): Conv2d(
512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
)
)
(2): BottleneckBlock(
(conv1): Conv2d(
2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv2): Conv2d(
512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv3): Conv2d(
512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
)
)
)
)
)
(1): PositionalEncoding2D()
)
(text_pos_embed): PositionalEncoding1D()
(transformer): DeformableTransformer(
(encoder): DeformableTransformerEncoder(
(layers): ModuleList(
(0): DeformableTransformerEncoderLayer(
(self_attn): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout2): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout3): Dropout(p=0.1, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(1): DeformableTransformerEncoderLayer(
(self_attn): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout2): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout3): Dropout(p=0.1, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(2): DeformableTransformerEncoderLayer(
(self_attn): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout2): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout3): Dropout(p=0.1, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(3): DeformableTransformerEncoderLayer(
(self_attn): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout2): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout3): Dropout(p=0.1, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(4): DeformableTransformerEncoderLayer(
(self_attn): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout2): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout3): Dropout(p=0.1, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(5): DeformableTransformerEncoderLayer(
(self_attn): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout2): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout3): Dropout(p=0.1, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
)
)
(decoder): DeformableCompositeTransformerDecoder(
(layers): ModuleList(
(0): DeformableCompositeTransformerDecoderLayer(
(attn_cross): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout_cross): Dropout(p=0.1, inplace=False)
(norm_cross): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_intra): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_intra): Dropout(p=0.1, inplace=False)
(norm_intra): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_inter): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_inter): Dropout(p=0.1, inplace=False)
(norm_inter): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout3): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout4): Dropout(p=0.1, inplace=False)
(norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_intra_text): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_intra_text): Dropout(p=0.1, inplace=False)
(norm_intra_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_inter_text): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_inter_text): Dropout(p=0.1, inplace=False)
(norm_inter_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_cross_text): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout_cross_text): Dropout(p=0.1, inplace=False)
(norm_cross_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1_text): Linear(in_features=256, out_features=1024, bias=True)
(dropout3_text): Dropout(p=0.1, inplace=False)
(linear2_text): Linear(in_features=1024, out_features=256, bias=True)
(dropout4_text): Dropout(p=0.1, inplace=False)
(norm3_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(1): DeformableCompositeTransformerDecoderLayer(
(attn_cross): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout_cross): Dropout(p=0.1, inplace=False)
(norm_cross): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_intra): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_intra): Dropout(p=0.1, inplace=False)
(norm_intra): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_inter): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_inter): Dropout(p=0.1, inplace=False)
(norm_inter): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout3): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout4): Dropout(p=0.1, inplace=False)
(norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_intra_text): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_intra_text): Dropout(p=0.1, inplace=False)
(norm_intra_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_inter_text): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_inter_text): Dropout(p=0.1, inplace=False)
(norm_inter_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_cross_text): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout_cross_text): Dropout(p=0.1, inplace=False)
(norm_cross_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1_text): Linear(in_features=256, out_features=1024, bias=True)
(dropout3_text): Dropout(p=0.1, inplace=False)
(linear2_text): Linear(in_features=1024, out_features=256, bias=True)
(dropout4_text): Dropout(p=0.1, inplace=False)
(norm3_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(2)-(5): 4 x DeformableCompositeTransformerDecoderLayer (identical structure to the layer above; repeated module output trimmed)
)
)
(enc_output): Linear(in_features=256, out_features=256, bias=True)
(enc_output_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(pos_trans): Linear(in_features=256, out_features=256, bias=True)
(pos_trans_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(bbox_class_embed): Linear(in_features=256, out_features=1, bias=True)
(bbox_embed): MLP(
(layers): ModuleList(
(0): Linear(in_features=256, out_features=256, bias=True)
(1): Linear(in_features=256, out_features=256, bias=True)
(2): Linear(in_features=256, out_features=4, bias=True)
)
)
)
(ctrl_point_class): ModuleList(
(0): Linear(in_features=256, out_features=1, bias=True)
(1): Linear(in_features=256, out_features=1, bias=True)
(2): Linear(in_features=256, out_features=1, bias=True)
(3): Linear(in_features=256, out_features=1, bias=True)
(4): Linear(in_features=256, out_features=1, bias=True)
(5): Linear(in_features=256, out_features=1, bias=True)
)
(ctrl_point_coord): ModuleList(
(0): MLP(
(layers): ModuleList(
(0): Linear(in_features=256, out_features=256, bias=True)
(1): Linear(in_features=256, out_features=256, bias=True)
(2): Linear(in_features=256, out_features=2, bias=True)
)
)
(1)-(5): 5 x MLP (identical structure to (0) above; repeated module output trimmed)
)
(bbox_coord): MLP(
(layers): ModuleList(
(0): Linear(in_features=256, out_features=256, bias=True)
(1): Linear(in_features=256, out_features=256, bias=True)
(2): Linear(in_features=256, out_features=4, bias=True)
)
)
(bbox_class): Linear(in_features=256, out_features=1, bias=True)
(text_class): Linear(in_features=256, out_features=97, bias=True)
(ctrl_point_embed): Embedding(16, 256)
(text_embed): Embedding(25, 256)
(input_proj): ModuleList(
(0): Sequential(
(0): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
(1): GroupNorm(32, 256, eps=1e-05, affine=True)
)
(1): Sequential(
(0): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1))
(1): GroupNorm(32, 256, eps=1e-05, affine=True)
)
(2): Sequential(
(0): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1))
(1): GroupNorm(32, 256, eps=1e-05, affine=True)
)
(3): Sequential(
(0): Conv2d(2048, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(1): GroupNorm(32, 256, eps=1e-05, affine=True)
)
)
)
(criterion): SetCriterion(
(enc_matcher): BoxHungarianMatcher()
(dec_matcher): CtrlPointHungarianMatcher()
)
)
[07/25 14:21:13] d2.data.dataset_mapper INFO: [DatasetMapper] Augmentations used in training: [RandomCrop(crop_type='relative_range', crop_size=[0.1, 0.1]), ResizeShortestEdge(short_edge_length=(800, 832, 864, 896, 1000, 1200, 1400), max_size=2333, sample_style='choice'), RandomFlip()]
[07/25 14:21:13] adet.data.dataset_mapper INFO: Rebuilding the augmentations. The previous augmentations will be overridden.
[07/25 14:21:13] adet.data.detection_utils INFO: Augmentations used in training: [ResizeShortestEdge(short_edge_length=(800, 832, 864, 896, 1000, 1200, 1400), max_size=2333, sample_style='choice')]
[07/25 14:21:13] adet.data.dataset_mapper INFO: Cropping used in training: RandomCropWithInstance(crop_type='relative_range', crop_size=[0.1, 0.1], crop_instance=False)
[07/25 14:21:13] adet.data.datasets.text INFO: Loaded 1000 images in COCO format from datasets/icdar2015/train_poly.json
[07/25 14:21:13] d2.data.build INFO: Removed 21 images with no usable annotations. 979 images left.
[07/25 14:21:13] d2.data.build INFO: Distribution of instances among all 1 categories:
| category | #instances |
|:----------:|:-------------|
| text | 4468 |
[07/25 14:21:13] d2.data.build INFO: Using training sampler TrainingSampler
[07/25 14:21:13] d2.data.common INFO: Serializing the dataset using: <class 'detectron2.data.common.TorchSerializedList'>
[07/25 14:21:13] d2.data.common INFO: Serializing 979 elements to byte tensors and concatenating them all ...
[07/25 14:21:13] d2.data.common INFO: Serialized dataset takes 1.64 MiB
[07/25 14:21:13] d2.checkpoint.detection_checkpoint INFO: [DetectionCheckpointer] Loading from weights/TESTR/pretrain_testr_R_50_polygon.pth ...
[07/25 14:21:13] fvcore.common.checkpoint INFO: [Checkpointer] Loading from weights/TESTR/pretrain_testr_R_50_polygon.pth ...
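Since the loss behaves as if the pre-trained weights had no effect, one quick sanity check is to diff the checkpoint's state-dict keys against the model's, to rule out silently skipped tensors. A minimal, torch-free sketch of that comparison (with the real checkpoint you would obtain the two key lists from `torch.load("weights/TESTR/pretrain_testr_R_50_polygon.pth", map_location="cpu")["model"].keys()` and `model.state_dict().keys()`; the names below are made-up stand-ins):

```python
def check_checkpoint_keys(model_keys, ckpt_keys):
    """Compare the parameter names a model expects against the names stored
    in a checkpoint. Anything in `missing` stays randomly initialized after
    a non-strict load; anything in `unexpected` is silently ignored."""
    missing = sorted(set(model_keys) - set(ckpt_keys))
    unexpected = sorted(set(ckpt_keys) - set(model_keys))
    return missing, unexpected

# Toy example: a checkpoint that lacks one bias tensor.
missing, unexpected = check_checkpoint_keys(
    ["backbone.conv.weight", "backbone.conv.bias"],
    ["backbone.conv.weight"],
)
print(missing)      # parameters that would stay at random init
print(unexpected)   # checkpoint tensors the model never uses
```

If `missing` is large (e.g. every decoder parameter), the checkpoint and the config do not describe the same architecture, which would explain training from scratch-level loss.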
[07/25 14:21:14] adet.trainer INFO: Starting training from iteration 0
[07/25 17:20:06] d2.utils.events INFO: eta: 2 days, 13:01:22 iter: 9359 total_loss: 44.08 loss_ce: 0.783 loss_ctrl_points: 2.31 loss_texts: 3.764 loss_ce_0: 0.8143 loss_ctrl_points_0: 2.423 loss_texts_0: 3.801 loss_ce_1: 0.8142 loss_ctrl_points_1: 2.4 loss_texts_1: 3.759 loss_ce_2: 0.8032 loss_ctrl_points_2: 2.351 loss_texts_2: 3.756 loss_ce_3: 0.7866 loss_ctrl_points_3: 2.334 loss_texts_3: 3.758 loss_ce_4: 0.7786 loss_ctrl_points_4: 2.311 loss_texts_4: 3.77 loss_ce_enc: 0.8066 loss_bbox_enc: 0.3008 loss_giou_enc: 0.7569 time: 1.1431 last_time: 0.8115 data_time: 0.0088 last_data_time: 0.0066 lr: 1e-05 max_mem: 12183M
[07/25 17:20:28] d2.utils.events INFO: eta: 2 days, 13:02:11 iter: 9379 total_loss: 42.63 loss_ce: 0.7653 loss_ctrl_points: 2.407 loss_texts: 3.758 loss_ce_0: 0.8062 loss_ctrl_points_0: 2.635 loss_texts_0: 3.792 loss_ce_1: 0.7863 loss_ctrl_points_1: 2.568 loss_texts_1: 3.736 loss_ce_2: 0.7788 loss_ctrl_points_2: 2.537 loss_texts_2: 3.737 loss_ce_3: 0.77 loss_ctrl_points_3: 2.508 loss_texts_3: 3.748 loss_ce_4: 0.7641 loss_ctrl_points_4: 2.456 loss_texts_4: 3.748 loss_ce_enc: 0.7962 loss_bbox_enc: 0.2918 loss_giou_enc: 0.73 time: 1.1431 last_time: 0.9134 data_time: 0.0084 last_data_time: 0.0075 lr: 1e-05 max_mem: 12183M
[07/25 17:20:51] d2.utils.events INFO: eta: 2 days, 13:05:45 iter: 9399 total_loss: 44.09 loss_ce: 0.7944 loss_ctrl_points: 2.32 loss_texts: 3.633 loss_ce_0: 0.8154 loss_ctrl_points_0: 2.634 loss_texts_0: 3.668 loss_ce_1: 0.802 loss_ctrl_points_1: 2.506 loss_texts_1: 3.633 loss_ce_2: 0.8023 loss_ctrl_points_2: 2.369 loss_texts_2: 3.626 loss_ce_3: 0.7987 loss_ctrl_points_3: 2.281 loss_texts_3: 3.624 loss_ce_4: 0.7966 loss_ctrl_points_4: 2.309 loss_texts_4: 3.62 loss_ce_enc: 0.8003 loss_bbox_enc: 0.2937 loss_giou_enc: 0.7454 time: 1.1431 last_time: 1.1894 data_time: 0.0081 last_data_time: 0.0227 lr: 1e-05 max
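The loss lines above are hard to eyeball. detectron2 also writes the same scalars to `metrics.json` (one JSON object per line) in the output directory, which can be parsed to check whether `total_loss` is actually trending down. A minimal sketch (the demo records are made up from the log above; in practice pass the lines of your own `output/metrics.json`):

```python
import json

def loss_curve(lines):
    """Parse detectron2-style metrics.json lines (one JSON object per line)
    into (iteration, total_loss) pairs, skipping records without a loss."""
    points = []
    for line in lines:
        rec = json.loads(line)
        if "total_loss" in rec:
            points.append((rec["iteration"], rec["total_loss"]))
    return points

# Toy demo using the three iterations logged above.
demo = [
    '{"iteration": 9359, "total_loss": 44.08}',
    '{"iteration": 9379, "total_loss": 42.63}',
    '{"iteration": 9399, "total_loss": 44.09}',
]
print(loss_curve(demo))
```

A flat curve after ~9k iterations at lr 1e-05, as shown here, is consistent with the pre-trained weights not being applied rather than with slow convergence.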