Postprocessdeduplication #129
1. Triple links of old URI added to new URI. 2. sendTriples() added to interface AdvancedTripleBasedSink. 3. New Test class created for DeduplicatorComponent. 4. Refactored some code.
…connection and URI filter.
…integrated. 2. Test Cases written to test deduplication functionality.
…adata and the uris with same hashes are fetched. Also, tests are written for the same
…ostprocessdeduplication
…a, removing the duplicate graphs based on hash value check and updation of graph id of the new uris to the graph id of old uri.
MichaelRoeder left a comment
The work seems to be progressing in the right direction. I added some comments regarding possible improvements.
| # - JVM_ARGS=-Xmx2g | ||
| mongodb: | ||
| db: |
Please be careful when renaming containers in the compose file. Renaming this container, for example, will cause a failure in the frontier. Please undo this change.
| public static final String URI_HASH_KEY = "HashValue"; | ||
| public static final String UUID_KEY = "UUID"; | ||
| public static final String URI_GRAPH = "graphID"; | ||
| public static final String URI_DUPLICATE = "duplicate-uri"; |
The URI_DUPLICATE shouldn't be necessary, right? 🤔
| /** | ||
| Set<CrawleableUri> getUrisWithSameHashValues(String hashValue); | ||
| /**< |
| Set<CrawleableUri> getUrisWithSameHashValues(Set<HashValue> hashValuesForComparison); | ||
| /** | ||
| Set<CrawleableUri> getUrisWithSameHashValues(String hashValue); |
Please add a Javadoc comment describing the method.
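A Javadoc comment for this method could look roughly like the following. This is only a sketch: the interface name `HashLookup`, the stand-in `CrawleableUri` class and the described behaviour (an empty set when nothing matches) are assumptions for illustration, not the project's actual contract.

```java
import java.util.Set;

// Stand-in for the project's CrawleableUri so that the sketch compiles on its own.
class CrawleableUri {
    final String uri;

    CrawleableUri(String uri) {
        this.uri = uri;
    }
}

// Hypothetical interface name; only the method signature is taken from the diff.
interface HashLookup {
    /**
     * Returns all URIs whose stored hash value equals the given hash value.
     *
     * @param hashValue
     *            the hash value that should be looked up
     * @return the URIs sharing this hash value, or an empty set if no such
     *         URI is known (assumed behaviour)
     */
    Set<CrawleableUri> getUrisWithSameHashValues(String hashValue);
}
```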
| private String graphID; | ||
| private String hashValue; | ||
Please remove these two new attributes and use the data map instead.
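As a rough illustration of what using the data map could look like: the sketch below stores the hash value and graph ID under the key names from the constants shown earlier in this review. `UriWithData` is a simplified stand-in for `CrawleableUri`, and `DataMapSketch`, `annotate` and `hashOf` are hypothetical names used only here.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for CrawleableUri: only the generic data map is modelled.
class UriWithData {
    private final Map<String, Object> data = new HashMap<>();

    Map<String, Object> getData() {
        return data;
    }

    void addData(String key, Object value) {
        data.put(key, value);
    }
}

class DataMapSketch {
    // Key names copied from the constants in the diff above.
    static final String URI_HASH_KEY = "HashValue";
    static final String URI_GRAPH = "graphID";

    // Attaches hash value and graph ID to the URI without adding new attributes.
    static void annotate(UriWithData uri, String hashValue, String graphId) {
        uri.addData(URI_HASH_KEY, hashValue);
        uri.addData(URI_GRAPH, graphId);
    }

    // Reads the hash value back; returns null if none has been stored.
    static String hashOf(UriWithData uri) {
        Object value = uri.getData().get(URI_HASH_KEY);
        return value == null ? null : value.toString();
    }
}
```

With this approach the URI class itself stays unchanged, and components that do not care about deduplication simply ignore the two keys.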
| queryString.append(uri + "> } } WHERE { GRAPH <"); | ||
| queryString.append(Constants.DEFAULT_META_DATA_GRAPH_URI + "> { ?subject <"); | ||
| queryString.append(Squirrel.containsDataOf + "> <"); | ||
| queryString.append(uri + "> } }"); |
| queryString.append("SELECT ?subject WHERE { GRAPH <"); | ||
| queryString.append(Constants.DEFAULT_META_DATA_GRAPH_URI + "> { ?subject <"); | ||
| queryString.append(Squirrel.containsDataOf + "> <"); | ||
| queryString.append(uri.getUri().toString() + ">} }"); |
| protected static QueryExecutionFactory queryExecFactory = null; | ||
| protected UpdateExecutionFactory updateExecFactory = null; | ||
| protected static UpdateExecutionFactory updateExecFactory = null; |
Why should the two factories be static? This shouldn't be necessary.
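To illustrate the point, a minimal sketch with instance-level factories handed in through the constructor. The two interfaces are empty stand-ins for the real factory types, and `SinkSketch` is a hypothetical class, not the project's sink.

```java
// Empty stand-ins for the two factory types referenced in the diff.
interface QueryExecutionFactory {
}

interface UpdateExecutionFactory {
}

// Hypothetical sink-like class: every instance owns its own factories.
class SinkSketch {
    protected final QueryExecutionFactory queryExecFactory;
    protected final UpdateExecutionFactory updateExecFactory;

    SinkSketch(QueryExecutionFactory queryExecFactory, UpdateExecutionFactory updateExecFactory) {
        this.queryExecFactory = queryExecFactory;
        this.updateExecFactory = updateExecFactory;
    }
}
```

With instance fields, two sinks pointing at different endpoints can coexist in the same JVM, which static fields would not allow.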
| LOGGER.error("Exception during updating", ex); | ||
| } | ||
| } | ||
| } |
As stated above, these methods shouldn't be part of the sink.
SparqlBasedSink restored to earlier version.
| if(numberOfTriples == 0) { | ||
| //in case of adding a triple at a later stage, numberOfTriples will be 0. So fetching it from the CrawlingActivity | ||
| numberOfTriples = ((CrawlingActivity) uri.getData().get(Constants.URI_CRAWLING_ACTIVITY)).getNumberOfTriples(); | ||
| } |
"in case of adding a triple at a later stage, numberOfTriples will be 0"
That is not true. Please have a look at how the TripleBuffer class is used in the AbstractBufferingTripleBasedSink. Setting the internal counter numberOfTriples to the value of the activity does not really make sense to me.
DeduplicationSink renamed to SparqlBasedGraphHandler. getTriplesForGraph() moved to the deduplication module since it is used only there. AdvancedTripleBasedSink has been deleted since it only contained getTriplesForGraph().
Implementation for post-process deduplication added. A SPARQL-based solution is provided for storing hash values and graph IDs in the metadata and for fetching URIs based on hash values and graph IDs.
Tests have been written for this as well.
Duplicate graphs are now dropped based on their graph ID, and the graph ID of the new URIs is updated to the graph ID of the old URI.
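The deduplication step described above can be sketched in memory as follows. This is not the SPARQL-based implementation from the PR; it only illustrates the logic under the assumption that the input list is ordered oldest first. `CrawledGraph`, `DeduplicationSketch` and `deduplicate` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record of one crawled URI, its content hash and its graph ID.
class CrawledGraph {
    final String uri;
    final String hashValue;
    String graphId;

    CrawledGraph(String uri, String hashValue, String graphId) {
        this.uri = uri;
        this.hashValue = hashValue;
        this.graphId = graphId;
    }
}

class DeduplicationSketch {
    /**
     * Re-points every URI whose hash was already seen to the graph of the
     * first URI with that hash and returns the graph IDs that can be dropped.
     * Assumes the list is ordered oldest first.
     */
    static List<String> deduplicate(List<CrawledGraph> graphs) {
        Map<String, String> firstGraphForHash = new HashMap<>();
        List<String> droppedGraphIds = new ArrayList<>();
        for (CrawledGraph graph : graphs) {
            // putIfAbsent returns the graph ID already stored for this hash, or null.
            String existing = firstGraphForHash.putIfAbsent(graph.hashValue, graph.graphId);
            if (existing != null && !existing.equals(graph.graphId)) {
                droppedGraphIds.add(graph.graphId);
                graph.graphId = existing;
            }
        }
        return droppedGraphIds;
    }
}
```

In the PR itself, the equivalent lookups and updates are expressed as SPARQL queries against the metadata graph rather than as in-memory maps.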