-
Notifications
You must be signed in to change notification settings - Fork 1
Feature/orthoinference 71 updates #104
base: develop
Are you sure you want to change the base?
Conversation
…t just ones created on current run
private static StableIdentifierGenerator stableIdentifierGenerator; | ||
private static OrthologousPathwayDiagramGenerator orthologousPathwayDiagramGenerator; | ||
private static final String summationText = "This event has been computationally inferred from an event that has been demonstrated in another species.<p>The inference is based on the homology mapping from PANTHER. Briefly, reactions for which all involved PhysicalEntities (in input, output and catalyst) have a mapped orthologue/paralogue (for complexes at least 75% of components must have a mapping) are inferred to the other species. High level events are also inferred for these events to allow for easier navigation.<p><a href='/electronic_inference_compara.html' target = 'NEW'>More details and caveats of the event inference in Reactome.</a> For details on PANTHER see also: <a href='http://www.pantherdb.org/about.jsp' target='NEW'>http://www.pantherdb.org/about.jsp</a>"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the "target = 'NEW'" attribute that is used two times is allowed in modern HTML. For a new window or tab, 'target=_blank' is the attribute value to use:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this string be split on to several lines? This is a bit wide, even for me.
@@ -113,6 +115,7 @@ public static void inferEvents(Properties props, String species) throws Exceptio | |||
Map<String,String[]> homologueMappings = readHomologueMappingFile(species, "hsap", pathToOrthopairs); | |||
ProteinCountUtility.setHomologueMappingFile(homologueMappings); | |||
EWASInferrer.setHomologueMappingFile(homologueMappings); | |||
SkipInstanceChecker.setHomologueMappingFile(homologueMappings); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name setHomologueMappingFile
is a little misleading since it is taking the homologue map data structure itself rather than the file name. Could it be setHomologueMap
instead?
if (!seenPrecedingEvent.contains(inferrableEventInst)) { | ||
if (inferrableEventInst.getAttributeValue(precedingEvent) != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could these conditions be merged into one if-statement as !seenPrecedingEvent.contains(inferrableEventInst) && inferrableEventInst.getAttributeValue(precedingEvent) != null
?
I don't see an else-statement pairing with the inner if-statement so it seems like it should be OK, and would make the rest of this a little more readable.
infReactionInst.addAttributeValue(stableIdentifier, orthoStableIdentifierInst); | ||
dba.storeInstance(infReactionInst); | ||
logger.info("Inferred RlE instance: " + infReactionInst); | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an if-statement that only contains a return
? Could this be modified to be the negation of the same condition which allows the rest of the code to flow? This is a void function, so it's not even expected to return anything. I find that using return
as a form of flow-control to be confusing. In short functions, this is less of an issue, but here, the exit point is hidden, easy to miss. There are other empty returns doing the same thing.
previouslyInferredInstances = checkIfPreviouslyInferred(reactionInst, inferredFrom, previouslyInferredInstances); | ||
if (previouslyInferredInstances.size() > 0) | ||
{ | ||
previouslyInferredInstances.addAll(checkIfPreviouslyInferred(reactionInst, inferredFrom, previouslyInferredInstances)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since previouslyInferredInstances
is only added to and not queried in the method checkIfPreviouslyInferred
, passing it as a parameter isn't needed. It's best practice not to modify parameters, but instead to create and return a different variable in the method body.
You could do this part of the code as follows:
List<GKInstance> previouslyInferredInstances = checkIfPreviouslyInferred(
reactionInst, Arrays.asList(inferredFrom, orthologousEvent)
);
Then the checkIfPreviouslyInferred
method could be:
private static List<GKInstance> checkIfPreviouslyInferred(GKInstance reactionInst, List<String> attributes) throws Exception
{
List previouslyInferredInstances = new ArrayList<>();
for (String attribute : attributes) {
for (GKInstance attributeInst : (Collection<GKInstance>) reactionInst.getAttributeValuesList(attribute)) {
GKInstance reactionSpeciesInst = (GKInstance) attributeInst.getAttributeValue(species);
if (reactionSpeciesInst.getDBID().equals(speciesInst.getDBID()) && attributeInst.getAttributeValue(isChimeric) == null) {
previouslyInferredInstances.add(attributeInst);
}
}
}
return previouslyInferredInstances;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it may be better to rename checkIfPreviouslyInferred
to getInstancesIfPreviouslyInferred
to make it clear that the list of previously inferred instances is what the method returns. A method name starting with "check" may be more suitable to a void or possibly a boolean return value.
if (previouslyInferredInstances.size() > 0) | ||
{ | ||
previouslyInferredInstances.addAll(checkIfPreviouslyInferred(reactionInst, inferredFrom, previouslyInferredInstances)); | ||
if (previouslyInferredInstances.size() > 0) { | ||
GKInstance prevInfInst = previouslyInferredInstances.get(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is a check needed to see if the list contains more than one element? If the list has more than one previously inferred instances would that be an error?
manualEventToNonHumanSource.put(reactionInst, prevInfInst); | ||
manualHumanEvents.add(reactionInst); | ||
eventsAlreadyInferredMap.put(reactionInst, prevInfInst); | ||
eventsAlreadyInferred.add(reactionInst); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the eventsAlreadyInferredMap
and eventsAlreadyInferred
data structures both needed? Could the list populated in eventsAlreadyInferred
be obtained from the key set of the eventsAlreadyInferredMap
?
{ | ||
for (GKInstance inferrableEventInst : updatedInferrableHumanEvents) { | ||
if (!seenPrecedingEvent.contains(inferrableEventInst)) { | ||
if (inferrableEventInst.getAttributeValue(precedingEvent) != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two conditions could be combined with &&
if seenPrecedingEvent.add(inferrableEventInst);
(line 198) is moved outside the inner if condition. Since seenPrecedingEvent
is a set, it won't cause a problem if you add an instance that is already present in the set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may also want to consider refactoring the body of the if statement to its own method so that you have the operation occurring for each inferrableEventInst
isolated.
if (!inferredPrecedingEvents.contains(precedingEventInst.getDBID().toString())) | ||
{ | ||
for (GKInstance precedingEventInst : precedingEventInstances) { | ||
if (!inferredPrecedingEvents.contains(precedingEventInst.getDBID().toString())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could inferredPrecedingEvents
be typed as Set<Long>
instead of Set<String>
so a call to toString()
isn't needed?
updatedPrecedingEventInstances.add(precedingEventInst); | ||
} | ||
} | ||
// Add preceding event to inferred instance | ||
if (updatedPrecedingEventInstances != null && updatedPrecedingEventInstances.size() > 0) | ||
{ | ||
if (updatedPrecedingEventInstances != null && updatedPrecedingEventInstances.size() > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The null check shouldn't be necessary since updatingPrecedingEventInstances
is initialized as an empty ArrayList.
logger.info("Inferring inputs..."); | ||
if (inferReactionInputsOrOutputs(reactionInst, infReactionInst, input)) | ||
logger.info("Inferring outputs..."); | ||
if (inferReactionInputsOrOutputs(reactionInst, infReactionInst, output)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be clearer to use guard clauses rather than nest.
https://stackoverflow.com/a/4887280/2295778
https://refactoring.guru/replace-nested-conditional-with-guard-clauses
// Attempt to infer all PhysicalEntities associated with this reaction's Input, Output, CatalystActivity and RegulatedBy attributes.
// Failure to successfully infer any of these attributes will end inference for this reaction.
logger.info("Inferring inputs...");
if (!(inferReactionInputsOrOutputs(reactionInst, infReactionInst, input)))
{
logger.info("Input inference unsuccessful -- terminating inference for " + reactionInst);
return;
}
logger.info("Inferring outputs...");
if (!(inferReactionInputsOrOutputs(reactionInst, infReactionInst, output)))
{
logger.info("Output inference unsuccessful -- terminating inference for " + reactionInst);
return;
}
logger.info("Inferring catalysts...");
if (!(inferReactionCatalysts(reactionInst, infReactionInst)))
{
logger.info("Catalyst inference unsuccessful -- terminating inference for " + reactionInst);
return;
}
inferredCount++; | ||
inferrableHumanEvents.add(reactionInst); | ||
String inferredEvent = infReactionInst.getAttributeValue(DB_ID).toString() + "\t" + infReactionInst.getDisplayName() + "\n"; | ||
Files.write(Paths.get(inferredFilehandle), inferredEvent.getBytes(), StandardOpenOption.APPEND); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inferredFilehandle
might be better named inferredFilePath
. It also may be better to store it as a Path object rather than a String:
private static Path inferredFilePath;
// Replacing the setInferredFilename method
public static void setInferredFilePath(String inferredFilename)
{
inferredFilePath = Paths.get(inferredFilename);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this could confuse future developers, as Java does not really use file handles.
String inferredEvent = infReactionInst.getAttributeValue(DB_ID).toString() + "\t" + infReactionInst.getDisplayName() + "\n"; | ||
Files.write(Paths.get(inferredFilehandle), inferredEvent.getBytes(), StandardOpenOption.APPEND); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the reaction db_id and display name are needed, the getExtendedDisplayName()
method could be used in place of the inferredEvent
temp variable:
Files.write(
Paths.get(inferredFilehandle),
infReactionInst.getExtendedDisplayName().getBytes(),
StandardOpenOption.APPEND
);
public static Map<GKInstance, GKInstance> getInferredEvent(Map<GKInstance, GKInstance> eventsAlreadyInferredMap) | ||
{ | ||
inferredEvent.putAll(eventsAlreadyInferredMap); | ||
return inferredEvent; | ||
} | ||
|
||
public static List<GKInstance> getInferrableHumanEvents() | ||
public static List<GKInstance> getInferrableHumanEvents(List<GKInstance> eventsAlreadyInferred) | ||
{ | ||
inferrableHumanEvents.addAll(eventsAlreadyInferred); | ||
return inferrableHumanEvents; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public static void setEligibleFilename(String eligibleFilename) | ||
{ | ||
eligibleFilehandle = eligibleFilename; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*/ | ||
private static boolean reactionComponentsAreInferrable(GKInstance reactionInst) throws Exception { | ||
// First gather all inputs, outputs and the PEs in catalyst activities | ||
// Inputs/Outputs/CatalystPEs need to be stored in seperate collections. At time of writing, having it all stored in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo - separate?
String eligibleEventName = reactionInst.getAttributeValue(DB_ID).toString() + "\t" + reactionInst.getDisplayName() + "\n"; | ||
Files.write(Paths.get(eligibleFilehandle), eligibleEventName.getBytes(), StandardOpenOption.APPEND); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (GKInstance reactionInput : reactionInputs) { | ||
if (!componentIsInferrable(reactionInput)) { | ||
return false; | ||
} | ||
} | ||
// Screen outputs | ||
for (GKInstance reactionOutput : reactionOutputs) { | ||
if (!componentIsInferrable(reactionOutput)) { | ||
return false; | ||
} | ||
} | ||
// Screen catalyst PhysicalEntities | ||
for (GKInstance reactionCatalystPE : reactionCatalystPEs) { | ||
if (!componentIsInferrable(reactionCatalystPE)) { | ||
return false; | ||
} | ||
} | ||
return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could do this:
return Stream.of(reactionInputs, reactionOutputs, reactionCatalystPEs)
.allMatch(reactionInput -> componentIsInferrable(reactionInput));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing is the componentIsInferrable
exception would need to be handled in that method rather than thrown. If it's practical to handle the exception in the method it gets thrown rather than throwing it to the method caller it can help reduce the methods which throw exceptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jweiser I'm not entirely sure what would be equivalent. In the code here, the function will return as soon as something is found to not be inferrable. Does Stream.of(...).allMatch(...)
short-circuit at the first failed match, or will it evaluate everything? It's not clear to me. Is there an alternate to allMatch
that will fail on the first failure?
Forcing an unnecessary (I assume) evaluation of everything in all of those collections seems rather inefficient, just for the sake of slightly niftier-looking code (ignoring the fact that lambdas already introduce performance overheads).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, allMatch
is a short-circuiting operation: https://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#allMatch-java.util.function.Predicate-
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One way to simplify this might be to add all of the inputs, outputs, PEs into a big list and then just use one loop:
List<GKInstance> bigList = new ArrayList<>();
bigList.addAll(reactionInputs);
bigList.addAll(reactionCatalystPEs);
bigList.addAll(reactionOutputs);
for (GKInstance instance : bigList) {
if (!componentIsInferrable(instance)) {
return false;
}
}
return true;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still has return
in the middle of the function, but only in two places, and the code is smaller so it's not as bad ;)
if (!SpeciesCheckUtility.checkForSpeciesAttribute(reactionComponent)) { | ||
// return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be a guard clause and the return true;
not commented out. Empty blocks are best avoided.
// return true; | ||
} else if (reactionComponent.getSchemClass().isa(GenomeEncodedEntity)) | ||
{ | ||
if (reactionComponent.getSchemClass().toString().contains(EntityWithAccessionedSequence)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why contains
here rather than equals
or isa
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isa
probably wouldn't make sense since this is comparing Strings. It's possible that the String-representation of the SchemaClass is something like [EntityWithAccessionedSequence]
so equals
would fail as well. Not sure...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It think it could be
reactionComponent.getSchemClass().isa(EntityWithAccessionedSequence)
or
reactionComponent.getSchemClass().getName().equals(EntityWithAccessionedSequence)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall. Mostly style/organization related comments.
if (reactionComponent.getSchemClass().isa(Complex) || reactionComponent.getSchemClass().isa(Polymer)) { | ||
int percent = 0; | ||
if (totalProteinCounts > 0) { | ||
percent = (inferrableProteinCounts * 100) / totalProteinCounts; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using an int
for percent
doesn't cause any loss-of-precision problems?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
double percent = 0
at line 199 would be good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please have a look at the logic in ReactioneInferrer.inferReaction
. There are two empty return statements. I feel that using empty returns in a large void function as a method of flow-control is not a good practice. They can easily get lost and make it more difficult to comprehend the logic flow. If possible, adjust this function so that it does not need any empty return statements.
This PR addresses a couple of things that had been on the backburner for Orthoinference:
Most PhysicalEntity instances are now screened before any inference is attempted on a ReactionlikeEvent. The only cases that aren't screened ahead of time are some edge cases for EntitySets that were very difficult to isolate out ahead of time. Safeguards still exist for preventing inference, so the ones that do squeak through this new screening system will still be rejected. This will prevent a significant majority of orphan PEs.
Orthoinference is now interruptable. If the process is stopped in the middle of a species' inference, it can be restarted and it will still create Pathways and PathwayDiagrams for any automatically inferred RlEs. A question: Should manually inferred ones also be receiving Pathway and PathwayDiagram instances? Its a very simple adjustment if so. This addresses Orthoinference needs to be interruption proof #31
The report file produced at the end of a species' inference now has a header that denotes the release. This addresses Orthoinference report needs a header to identify the Reactome version #83
Thanks!