TL;DR: My PR (#29) currently faces two issues: (1) the official implementation divides the amino acid frequencies by 20 (the number of possible amino acids, A...Y) when calculating denominator_val, which is then used to calculate the feature vectors, whereas the paper seems to call for dividing by the target sequence length (normalization); and (2) even when I use 20, the output vectors at the end of pseaac don't match the original implementation: there are small but consistent discrepancies.
There are 2 main problems I would like to address with my PR:
- The official repository divides all amino acid frequencies by a fixed number, the total number of possible amino acids (20). According to the paper (page 14, start of the second-to-last paragraph), they should instead be divided by the length of the target sequence.
- Even if I divide the amino acid frequencies by 20, my final vectors do not match the ones produced by the official implementation once the entire process has completed. There are small discrepancies throughout (we are comparing 2 vectors of length 350). With an error margin of 1e-3, the mismatches are listed below. The format for each entry is (<vector index>, <my value>, <official value>):
AssertionError: Vector values mismatch at indices: [(4, 0.021, 0.025), (7, 0.064, 0.074), (8, 0.043, 0.05), (9, 0.021, 0.025), (10, 0.021, 0.025), (11, 0.021, 0.025), (12, 0.021, 0.025), (13, 0.032, 0.037), (14, 0.075, 0.087), (24, 0.012, 0.014), (25, 0.01, 0.012), (28, 0.014, 0.016), (29, 0.011, 0.013), (38, 0.008, 0.01), (39, 0.011, 0.013), (40, 0.024, 0.028), (41, 0.042, 0.048), (42, 0.046, 0.053), (43, 0.06, 0.069), (44, 0.048, 0.055), (45, 0.025, 0.029), (46, 0.03, 0.034), (47, 0.02, 0.023), (48, 0.064, 0.074), (49, 0.153, 0.177), (50, 0.013, 0.016), (51, 0.013, 0.016), (52, 0.013, 0.016), (54, 0.026, 0.031), (57, 0.078, 0.094), (58, 0.052, 0.063), (59, 0.026, 0.031), (60, 0.026, 0.031), (61, 0.026, 0.031), (62, 0.026, 0.031), (63, 0.039, 0.047), (64, 0.091, 0.11), (71, 0.01, 0.012), (72, 0.014, 0.017), (73, 0.01, 0.012), (76, 0.014, 0.017), (77, 0.011, 0.013), (79, 0.011, 0.013), (80, 0.008, 0.01), (82, 0.018, 0.022), (83, 0.014, 0.017), (84, 0.007, 0.009), (85, 0.011, 0.013), (86, 0.011, 0.013), (87, 0.01, 0.012), (88, 0.01, 0.012), (89, 0.016, 0.019), (90, 0.015, 0.018), (91, 0.017, 0.02), (92, 0.019, 0.022), (93, 0.018, 0.022), (94, 0.031, 0.038), (95, 0.041, 0.049), (96, 0.047, 0.057), (97, 0.061, 0.073), (98, 0.055, 0.066), (99, 0.062, 0.074), (104, 0.021, 0.025), (107, 0.064, 0.074), (108, 0.043, 0.05), (109, 0.021, 0.025), (110, 0.021, 0.025), (111, 0.021, 0.025), (112, 0.021, 0.025), (113, 0.032, 0.037), (114, 0.075, 0.087), (121, 0.008, 0.01), (122, 0.014, 0.016), (123, 0.008, 0.01), (125, 0.008, 0.01), (126, 0.016, 0.018), (132, 0.017, 0.02), (136, 0.012, 0.014), (137, 0.013, 0.015), (138, 0.015, 0.017), (139, 0.019, 0.022), (140, 0.022, 0.025), (141, 0.025, 0.03), (142, 0.028, 0.033), (143, 0.013, 0.015), (144, 0.034, 0.039), (145, 0.035, 0.04), (146, 0.041, 0.047), (147, 0.081, 0.094), (148, 0.064, 0.074), (149, 0.093, 0.108), (150, 0.016, 0.019), (151, 0.016, 0.019), (152, 0.016, 0.019), (154, 0.031, 0.039), (157, 0.093, 0.117), (158, 0.062, 
0.078), (159, 0.031, 0.039), (160, 0.031, 0.039), (161, 0.031, 0.039), (162, 0.031, 0.039), (163, 0.047, 0.058), (164, 0.109, 0.136), (172, 0.008, 0.01), (173, 0.006, 0.008), (174, 0.004, 0.006), (175, 0.008, 0.01), (176, 0.01, 0.013), (177, 0.007, 0.009), (179, 0.007, 0.009), (180, 0.009, 0.011), (181, 0.009, 0.011), (182, 0.014, 0.017), (183, 0.01, 0.012), (184, 0.029, 0.036), (185, 0.048, 0.06), (186, 0.041, 0.051), (187, 0.025, 0.031), (192, 0.004, 0.006), (193, 0.008, 0.01), (194, 0.025, 0.031), (195, 0.04, 0.05), (196, 0.047, 0.058), (197, 0.051, 0.063), (198, 0.031, 0.039), (199, 0.015, 0.019), (200, 0.013, 0.015), (201, 0.013, 0.015), (202, 0.013, 0.015), (204, 0.026, 0.031), (207, 0.077, 0.092), (208, 0.051, 0.061), (209, 0.026, 0.031), (210, 0.026, 0.031), (211, 0.026, 0.031), (212, 0.026, 0.031), (213, 0.038, 0.046), (214, 0.089, 0.107), (222, 0.007, 0.009), (225, 0.014, 0.016), (226, 0.019, 0.023), (227, 0.008, 0.01), (230, 0.007, 0.009), (231, 0.006, 0.008), (232, 0.01, 0.012), (235, 0.009, 0.011), (237, 0.008, 0.01), (238, 0.006, 0.008), (239, 0.01, 0.012), (241, 0.009, 0.011), (242, 0.01, 0.012), (243, 0.014, 0.017), (244, 0.047, 0.056), (245, 0.072, 0.086), (246, 0.084, 0.101), (247, 0.096, 0.115), (248, 0.059, 0.07), (249, 0.034, 0.041), (254, 0.016, 0.018), (257, 0.048, 0.053), (258, 0.032, 0.035), (259, 0.016, 0.018), (260, 0.016, 0.018), (261, 0.016, 0.018), (262, 0.016, 0.018), (263, 0.024, 0.027), (264, 0.056, 0.062), (273, 0.01, 0.012), (277, 0.01, 0.012), (278, 0.01, 0.012), (285, 0.018, 0.02), (289, 0.022, 0.024), (290, 0.033, 0.037), (291, 0.04, 0.044), (292, 0.044, 0.049), (293, 0.033, 0.037), (294, 0.028, 0.031), (295, 0.027, 0.03), (296, 0.032, 0.036), (297, 0.067, 0.074), (298, 0.102, 0.113), (299, 0.146, 0.163), (300, 0.009, 0.011), (301, 0.009, 0.011), (302, 0.009, 0.011), (304, 0.019, 0.021), (307, 0.056, 0.064), (308, 0.038, 0.043), (309, 0.019, 0.021), (310, 0.019, 0.021), (311, 0.019, 0.021), (312, 0.019, 0.021), (313, 0.028, 
0.032), (314, 0.066, 0.075), (320, 0.012, 0.014), (323, 0.013, 0.015), (333, 0.013, 0.015), (334, 0.011, 0.013), (337, 0.013, 0.015), (338, 0.014, 0.016), (339, 0.02, 0.023), (340, 0.035, 0.04), (341, 0.043, 0.05), (342, 0.048, 0.054), (343, 0.045, 0.051), (344, 0.02, 0.023), (345, 0.014, 0.016), (346, 0.016, 0.018), (347, 0.034, 0.039), (348, 0.103, 0.118), (349, 0.159, 0.182)]
If we take the error margin to be 1e-2, we get a smaller set of mismatches:
[(14, 0.075, 0.087), (49, 0.153, 0.177), (57, 0.078, 0.094), (58, 0.052, 0.063), (64, 0.091, 0.11), (97, 0.061, 0.073), (98, 0.055, 0.066), (99, 0.062, 0.074), (114, 0.075, 0.087), (147, 0.081, 0.094), (149, 0.093, 0.108), (157, 0.093, 0.117), (158, 0.062, 0.078), (163, 0.047, 0.058), (164, 0.109, 0.136), (185, 0.048, 0.06), (196, 0.047, 0.058), (197, 0.051, 0.063), (207, 0.077, 0.092), (214, 0.089, 0.107), (245, 0.072, 0.086), (246, 0.084, 0.101), (247, 0.096, 0.115), (248, 0.059, 0.07), (298, 0.102, 0.113), (299, 0.146, 0.163), (348, 0.103, 0.118), (349, 0.159, 0.182)]
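For reference, a mismatch list in the format above can be produced with a check along the following lines. This is a minimal sketch, not the PR's actual test code: `find_mismatches` and `mismatch_ratios` are hypothetical names, the tolerance is assumed to be an absolute difference, and the three-decimal rounding is only inferred from how the values are printed above. The ratio helper is included because a my/official ratio that stays roughly constant across entries would be more consistent with a scaling difference (for example, a different denominator) than with floating-point noise.

```python
def find_mismatches(mine, official, tol=1e-3):
    """Return (index, my value, official value) for entries differing by more than tol."""
    if len(mine) != len(official):
        raise ValueError("vectors must have the same length")
    return [
        (i, round(a, 3), round(b, 3))
        for i, (a, b) in enumerate(zip(mine, official))
        if abs(a - b) > tol  # assumption: absolute, not relative, tolerance
    ]


def mismatch_ratios(mismatches):
    """Ratio my/official for each mismatching entry (skips zero official values)."""
    return [(i, a / b) for i, a, b in mismatches if b != 0]


# Three entries copied from the 1e-2 list above.
sample = [(14, 0.075, 0.087), (49, 0.153, 0.177), (114, 0.075, 0.087)]
for index, ratio in mismatch_ratios(sample):
    print(index, round(ratio, 3))
```

Running the ratio check over the full list, grouped by vector segment, would show whether the factor is the same everywhere or changes between segments.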
Questions, one for each bullet point above:
- Should I go ahead with normalization by sequence length and follow the paper, given that other implementations of pseaac also seem to normalize?
- Are the mismatches I am seeing due to floating-point precision in Python, or is there something wrong with my code?
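To make the first question concrete, here is a minimal sketch of the two conventions for the amino acid frequencies: dividing the counts by the fixed alphabet size (20), as described above for the official repository, versus dividing by the sequence length. `composition` and the example sequence are hypothetical, and the sketch covers only this frequency step, not denominator_val or the rest of the pseaac computation.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (A...Y)


def composition(sequence, divisor):
    """Count each standard amino acid in `sequence` and divide by `divisor`."""
    counts = Counter(sequence)
    return [counts[aa] / divisor for aa in AMINO_ACIDS]


seq = "MKTAYIAKQR"  # hypothetical example sequence, length 10

by_alphabet = composition(seq, len(AMINO_ACIDS))  # fixed divisor of 20
by_length = composition(seq, len(seq))            # divisor = sequence length

# For a sequence of standard residues, only the length-normalized
# frequencies sum to 1; the other sum depends on the sequence length.
print(sum(by_alphabet), sum(by_length))
```

The two results differ by the constant factor len(seq) / 20, so the choice of divisor rescales every frequency in the same way for a given sequence.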