Postprocessdeduplication #129

Open
abhinavhegde wants to merge 23 commits into dice-group:master from abhinavhegde:postprocessdeduplication

@abhinavhegde

This PR adds the implementation of post-process deduplication. A SPARQL-based solution stores hash values and graph IDs in the metadata and fetches URIs based on hash values and graph IDs.
Tests have been written for this functionality.
Duplicate graphs are dropped based on their graph ID, and the graph ID of the new URI is updated to the graph ID of the old URI.
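
As a rough illustration of the lookup described above, the following sketch builds a SPARQL SELECT query that fetches all URIs sharing a given hash value from a metadata graph. The graph URI, the property URI, and the class and method names are assumptions made for this example and are not taken from the PR; the actual code would use the project's own constants and run the query through its query execution factory.

```java
public class HashLookupQuerySketch {

    // Hypothetical URIs; the real implementation would use the project's constants.
    static final String META_DATA_GRAPH = "http://example.org/metadata";
    static final String HASH_PROPERTY = "http://example.org/vocab#hashValue";

    /** Builds a query selecting every URI whose stored hash equals the given value. */
    public static String buildSelectUrisByHash(String hashValue) {
        // Escape backslashes and quotes so the value stays a valid SPARQL string literal.
        String escaped = hashValue.replace("\\", "\\\\").replace("\"", "\\\"");
        StringBuilder query = new StringBuilder();
        query.append("SELECT ?uri WHERE { GRAPH <").append(META_DATA_GRAPH).append("> { ");
        query.append("?uri <").append(HASH_PROPERTY).append("> \"").append(escaped).append("\" ");
        query.append("} }");
        return query.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildSelectUrisByHash("abc123"));
    }
}
```

The same pattern could be extended to match a set of hash values, for example with a VALUES clause, if several hashes need to be compared in one request.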

abhinavhegde and others added 18 commits November 18, 2019 15:34
1. Triple links of old URI added to new URI.
2. sendTriples() added to interface AdvancedTripleBasedSink.
3. New Test class created for DeduplicatorComponent.
4. Refactored some code.
…integrated.

2. Test Cases written to test deduplication functionality.
…adata and the uris with same hashes are fetched. Also, tests are written for the same
…a, removing the duplicate graphs based on hash value check and updation of graph id of the new uris to the graph id of old uri.

@MichaelRoeder MichaelRoeder left a comment

Member

The work seems to be progressing in the right direction. I added some comments regarding possible improvements.

Comment thread docker-compose-sparql.yml Outdated
# - JVM_ARGS=-Xmx2g

mongodb:
db:

Member

Please be careful with renaming the containers in the compose-file. Renaming this container for example will lead to a failure in the frontier. Please undo this change.

Author

reverted.

public static final String URI_HASH_KEY = "HashValue";
public static final String UUID_KEY = "UUID";
public static final String URI_GRAPH = "graphID";
public static final String URI_DUPLICATE = "duplicate-uri";

Member

The URI_DUPLICATE shouldn't be necessary, right? 🤔

Author

deleted.

/**
Set<CrawleableUri> getUrisWithSameHashValues(String hashValue);

/**<

Member

Why is there a < symbol?

Author

was a mistake, removed.

Set<CrawleableUri> getUrisWithSameHashValues(Set<HashValue> hashValuesForComparison);

/**
Set<CrawleableUri> getUrisWithSameHashValues(String hashValue);

Member

Please add a Javadoc comment describing the method.

Author

Javadoc added.

private String graphID;

private String hashValue;

Member

Please remove these two new attributes and use the data map instead.
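
A minimal sketch of what this could look like, using a stand-in class rather than the real CrawleableUri: the graph ID and hash value live in the generic data map under constant keys instead of in dedicated fields. The key values mirror the constants shown earlier in this thread; the class UriStandIn and its addData() method are hypothetical and only imitate the getData() access pattern that appears elsewhere in the diff.

```java
import java.util.HashMap;
import java.util.Map;

public class DataMapSketch {

    // Keys mirroring the constants quoted earlier in this review.
    public static final String URI_HASH_KEY = "HashValue";
    public static final String URI_GRAPH = "graphID";

    /** Hypothetical stand-in for a URI object that carries a generic data map. */
    public static class UriStandIn {
        private final Map<String, Object> data = new HashMap<>();

        public void addData(String key, Object value) {
            data.put(key, value);
        }

        public Map<String, Object> getData() {
            return data;
        }
    }

    public static void main(String[] args) {
        UriStandIn uri = new UriStandIn();
        // Store the values in the map instead of adding new attributes to the class.
        uri.addData(URI_GRAPH, "graph-1");
        uri.addData(URI_HASH_KEY, "abc123");

        String graphId = (String) uri.getData().get(URI_GRAPH);
        String hash = (String) uri.getData().get(URI_HASH_KEY);
        System.out.println(graphId + " " + hash);
    }
}
```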

Author

removed.

queryString.append(uri + "> } } WHERE { GRAPH <");
queryString.append(Constants.DEFAULT_META_DATA_GRAPH_URI + "> { ?subject <");
queryString.append(Squirrel.containsDataOf + "> <");
queryString.append(uri + "> } }");

Member

Please don't use + inside append(); chain the append() calls instead.
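
The concern is presumably that concatenating with + inside append() builds an extra intermediate string on each call, which works against the reason for using a StringBuilder in the first place. A sketch of the chained form follows, with placeholder URIs standing in for Constants.DEFAULT_META_DATA_GRAPH_URI and Squirrel.containsDataOf, whose real values are not shown in this thread; the class and method names are also made up for the example.

```java
public class AppendChainSketch {

    // Placeholder values; the real code would use the project's constants.
    static final String META_DATA_GRAPH = "http://example.org/metadata";
    static final String CONTAINS_DATA_OF = "http://example.org/vocab#containsDataOf";

    public static String buildSelectQuery(String uri) {
        StringBuilder queryString = new StringBuilder();
        queryString.append("SELECT ?subject WHERE { GRAPH <");
        // Each part gets its own append() call instead of being joined with +.
        queryString.append(META_DATA_GRAPH).append("> { ?subject <");
        queryString.append(CONTAINS_DATA_OF).append("> <");
        queryString.append(uri).append("> } }");
        return queryString.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildSelectQuery("http://example.org/resource"));
    }
}
```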

Author

corrected.

queryString.append("SELECT ?subject WHERE { GRAPH <");
queryString.append(Constants.DEFAULT_META_DATA_GRAPH_URI + "> { ?subject <");
queryString.append(Squirrel.containsDataOf + "> <");
queryString.append(uri.getUri().toString() + ">} }");

Member

same as above

Author

corrected.

protected static QueryExecutionFactory queryExecFactory = null;

protected UpdateExecutionFactory updateExecFactory = null;
protected static UpdateExecutionFactory updateExecFactory = null;

Member

Why should the two factories be static? This shouldn't be necessary.

Author

changed.

LOGGER.error("Exception during updating", ex);
}
}
}

Member

As stated above, these methods shouldn't be part of the sink.

Author

SparqlBasedSink restored to earlier version.

if(numberOfTriples == 0) {
//in case of adding a triple at a later stage, numberOfTriples will be 0. So fetching it from the CrawlingActivity
numberOfTriples = ((CrawlingActivity) uri.getData().get(Constants.URI_CRAWLING_ACTIVITY)).getNumberOfTriples();
}

Member

"in case of adding a triple at a later stage, numberOfTriples will be 0"

That is not true. Please have a look at how the TripleBuffer class is used in AbstractBufferingTripleBasedSink. Setting the internal counter numberOfTriples to the value from the activity does not really make sense to me.

Author

reverted.

DeduplicationSink has been renamed to SparqlBasedGraphHandler.
getTriplesForGraph() has been moved to the deduplication module since it is used only there. AdvancedTripleBasedSink has been deleted since it only contained getTriplesForGraph().