AptaNet discrepancy between implementation and paper #34

@satvshr

Description

TL;DR: My PR (#29) currently faces two issues: (1) when computing denominator_val, which is used to calculate the feature vectors, the official implementation divides the amino acid frequencies by 20 (the number of possible amino acids, A...Y) instead of normalizing by the target sequence length, which seems to contradict the paper; and (2) even when I divide by 20, the output vectors at the end of pseaac do not match the original implementation: there are small but consistent discrepancies.

There are 2 main problems I would like to address with my PR:

  1. The official repository divides all amino acid frequencies by a fixed number, the total number of possible amino acids (20). According to the paper (page 14, start of the second-to-last paragraph), they should instead be divided by the length of the target sequence.
  2. Even if I divide the amino acid frequencies by 20, my final vectors do not match the ones produced by the official implementation once the entire process has completed. There are small discrepancies throughout (we are comparing two vectors of length 350) at a tolerance of 1e-3. Each entry has the format (<vector index>, <my value>, <official value>):
    AssertionError: Vector values mismatch at indices: [(4, 0.021, 0.025), (7, 0.064, 0.074), (8, 0.043, 0.05), (9, 0.021, 0.025), (10, 0.021, 0.025), (11, 0.021, 0.025), (12, 0.021, 0.025), (13, 0.032, 0.037), (14, 0.075, 0.087), (24, 0.012, 0.014), (25, 0.01, 0.012), (28, 0.014, 0.016), (29, 0.011, 0.013), (38, 0.008, 0.01), (39, 0.011, 0.013), (40, 0.024, 0.028), (41, 0.042, 0.048), (42, 0.046, 0.053), (43, 0.06, 0.069), (44, 0.048, 0.055), (45, 0.025, 0.029), (46, 0.03, 0.034), (47, 0.02, 0.023), (48, 0.064, 0.074), (49, 0.153, 0.177), (50, 0.013, 0.016), (51, 0.013, 0.016), (52, 0.013, 0.016), (54, 0.026, 0.031), (57, 0.078, 0.094), (58, 0.052, 0.063), (59, 0.026, 0.031), (60, 0.026, 0.031), (61, 0.026, 0.031), (62, 0.026, 0.031), (63, 0.039, 0.047), (64, 0.091, 0.11), (71, 0.01, 0.012), (72, 0.014, 0.017), (73, 0.01, 0.012), (76, 0.014, 0.017), (77, 0.011, 0.013), (79, 0.011, 0.013), (80, 0.008, 0.01), (82, 0.018, 0.022), (83, 0.014, 0.017), (84, 0.007, 0.009), (85, 0.011, 0.013), (86, 0.011, 0.013), (87, 0.01, 0.012), (88, 0.01, 0.012), (89, 0.016, 0.019), (90, 0.015, 0.018), (91, 0.017, 0.02), (92, 0.019, 0.022), (93, 0.018, 0.022), (94, 0.031, 0.038), (95, 0.041, 0.049), (96, 0.047, 0.057), (97, 0.061, 0.073), (98, 0.055, 0.066), (99, 0.062, 0.074), (104, 0.021, 0.025), (107, 0.064, 0.074), (108, 0.043, 0.05), (109, 0.021, 0.025), (110, 0.021, 0.025), (111, 0.021, 0.025), (112, 0.021, 0.025), (113, 0.032, 0.037), (114, 0.075, 0.087), (121, 0.008, 0.01), (122, 0.014, 0.016), (123, 0.008, 0.01), (125, 0.008, 0.01), (126, 0.016, 0.018), (132, 0.017, 0.02), (136, 0.012, 0.014), (137, 0.013, 0.015), (138, 0.015, 0.017), (139, 0.019, 0.022), (140, 0.022, 0.025), (141, 0.025, 0.03), (142, 0.028, 0.033), (143, 0.013, 0.015), (144, 0.034, 0.039), (145, 0.035, 0.04), (146, 0.041, 0.047), (147, 0.081, 0.094), (148, 0.064, 0.074), (149, 0.093, 0.108), (150, 0.016, 0.019), (151, 0.016, 0.019), (152, 0.016, 0.019), (154, 0.031, 0.039), (157, 0.093, 0.117), (158, 0.062, 
0.078), (159, 0.031, 0.039), (160, 0.031, 0.039), (161, 0.031, 0.039), (162, 0.031, 0.039), (163, 0.047, 0.058), (164, 0.109, 0.136), (172, 0.008, 0.01), (173, 0.006, 0.008), (174, 0.004, 0.006), (175, 0.008, 0.01), (176, 0.01, 0.013), (177, 0.007, 0.009), (179, 0.007, 0.009), (180, 0.009, 0.011), (181, 0.009, 0.011), (182, 0.014, 0.017), (183, 0.01, 0.012), (184, 0.029, 0.036), (185, 0.048, 0.06), (186, 0.041, 0.051), (187, 0.025, 0.031), (192, 0.004, 0.006), (193, 0.008, 0.01), (194, 0.025, 0.031), (195, 0.04, 0.05), (196, 0.047, 0.058), (197, 0.051, 0.063), (198, 0.031, 0.039), (199, 0.015, 0.019), (200, 0.013, 0.015), (201, 0.013, 0.015), (202, 0.013, 0.015), (204, 0.026, 0.031), (207, 0.077, 0.092), (208, 0.051, 0.061), (209, 0.026, 0.031), (210, 0.026, 0.031), (211, 0.026, 0.031), (212, 0.026, 0.031), (213, 0.038, 0.046), (214, 0.089, 0.107), (222, 0.007, 0.009), (225, 0.014, 0.016), (226, 0.019, 0.023), (227, 0.008, 0.01), (230, 0.007, 0.009), (231, 0.006, 0.008), (232, 0.01, 0.012), (235, 0.009, 0.011), (237, 0.008, 0.01), (238, 0.006, 0.008), (239, 0.01, 0.012), (241, 0.009, 0.011), (242, 0.01, 0.012), (243, 0.014, 0.017), (244, 0.047, 0.056), (245, 0.072, 0.086), (246, 0.084, 0.101), (247, 0.096, 0.115), (248, 0.059, 0.07), (249, 0.034, 0.041), (254, 0.016, 0.018), (257, 0.048, 0.053), (258, 0.032, 0.035), (259, 0.016, 0.018), (260, 0.016, 0.018), (261, 0.016, 0.018), (262, 0.016, 0.018), (263, 0.024, 0.027), (264, 0.056, 0.062), (273, 0.01, 0.012), (277, 0.01, 0.012), (278, 0.01, 0.012), (285, 0.018, 0.02), (289, 0.022, 0.024), (290, 0.033, 0.037), (291, 0.04, 0.044), (292, 0.044, 0.049), (293, 0.033, 0.037), (294, 0.028, 0.031), (295, 0.027, 0.03), (296, 0.032, 0.036), (297, 0.067, 0.074), (298, 0.102, 0.113), (299, 0.146, 0.163), (300, 0.009, 0.011), (301, 0.009, 0.011), (302, 0.009, 0.011), (304, 0.019, 0.021), (307, 0.056, 0.064), (308, 0.038, 0.043), (309, 0.019, 0.021), (310, 0.019, 0.021), (311, 0.019, 0.021), (312, 0.019, 0.021), (313, 0.028, 
0.032), (314, 0.066, 0.075), (320, 0.012, 0.014), (323, 0.013, 0.015), (333, 0.013, 0.015), (334, 0.011, 0.013), (337, 0.013, 0.015), (338, 0.014, 0.016), (339, 0.02, 0.023), (340, 0.035, 0.04), (341, 0.043, 0.05), (342, 0.048, 0.054), (343, 0.045, 0.051), (344, 0.02, 0.023), (345, 0.014, 0.016), (346, 0.016, 0.018), (347, 0.034, 0.039), (348, 0.103, 0.118), (349, 0.159, 0.182)]
    With a tolerance of 1e-2, fewer entries mismatch:
    [(14, 0.075, 0.087), (49, 0.153, 0.177), (57, 0.078, 0.094), (58, 0.052, 0.063), (64, 0.091, 0.11), (97, 0.061, 0.073), (98, 0.055, 0.066), (99, 0.062, 0.074), (114, 0.075, 0.087), (147, 0.081, 0.094), (149, 0.093, 0.108), (157, 0.093, 0.117), (158, 0.062, 0.078), (163, 0.047, 0.058), (164, 0.109, 0.136), (185, 0.048, 0.06), (196, 0.047, 0.058), (197, 0.051, 0.063), (207, 0.077, 0.092), (214, 0.089, 0.107), (245, 0.072, 0.086), (246, 0.084, 0.101), (247, 0.096, 0.115), (248, 0.059, 0.07), (298, 0.102, 0.113), (299, 0.146, 0.163), (348, 0.103, 0.118), (349, 0.159, 0.182)]
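To make point 1 concrete, here is a minimal sketch of the two divisors. It is not the official code or the code in the PR: the helper `aa_frequencies` and the toy sequence are hypothetical, and the only assumption taken from the discussion above is that the alphabet is the 20 standard amino acids (A...Y).

```python
from collections import Counter

# The 20 standard amino acids in alphabetical order (A...Y).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(sequence, divisor):
    """Count each amino acid in `sequence` and divide the counts by `divisor`.

    Hypothetical helper for illustration only.
    """
    counts = Counter(sequence)
    return [counts[aa] / divisor for aa in AMINO_ACIDS]


# Toy sequence, made up for this example.
seq = "ACDAAK"

# Fixed divisor: the alphabet size (20), as described for the official repository.
by_alphabet = aa_frequencies(seq, len(AMINO_ACIDS))

# Length normalization: divide by the sequence length, as described in the paper.
by_length = aa_frequencies(seq, len(seq))

print("A / 20      :", by_alphabet[0])
print("A / len(seq):", by_length[0])
print("sum / 20      :", sum(by_alphabet))
print("sum / len(seq):", sum(by_length))
```

Only the length-normalized frequencies sum to 1 regardless of sequence length; with a fixed divisor of 20, the sum grows with the length of the sequence.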

Questions corresponding to each point above:

  1. Should I go ahead with normalizing by the sequence length, as the paper describes, given that other implementations of pseaac also seem to normalize?
  2. Are the discrepancies I am seeing due to floating-point precision in Python, or is there something wrong with my code?
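One way to sanity-check question 2 is to look at the relative size of the gaps. The sketch below takes a handful of entries copied from the mismatch list above and compares their relative gap with the machine epsilon of Python floats. The helper `relative_gaps` is hypothetical, and the listed values are rounded to three decimals, so the ratios are only approximate.

```python
import sys

# A few (index, my value, official value) entries copied from the mismatch list above.
sample = [
    (4, 0.021, 0.025),
    (7, 0.064, 0.074),
    (14, 0.075, 0.087),
    (49, 0.153, 0.177),
    (157, 0.093, 0.117),
    (164, 0.109, 0.136),
    (299, 0.146, 0.163),
    (349, 0.159, 0.182),
]


def relative_gaps(entries):
    """Return (index, my/official ratio, relative gap) for each entry."""
    return [
        (i, mine / official, abs(mine - official) / official)
        for i, mine, official in entries
    ]


gaps = relative_gaps(sample)
for index, ratio, gap in gaps:
    print(f"{index:>3}  ratio={ratio:.3f}  relative gap={gap:.3f}")

# Machine epsilon for Python floats (IEEE 754 double precision), about 2.2e-16.
print("float epsilon:", sys.float_info.epsilon)
```

For these entries the relative gap is on the order of 10 to 20 percent, many orders of magnitude above double-precision rounding error. That would suggest a systematic difference, such as a different scaling or denominator somewhere in the pipeline, rather than floating-point noise, though the sample alone cannot say where the difference comes from.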

Labels: question (Further information is requested)