
Conversation

derlin commented Jun 10, 2017

About the path argument: I found it easier to pass the path to the wikidump file as a command-line argument instead of recompiling every time I want to use another dump.
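A minimal sketch of what that could look like in the driver's main (the object name, default path, and variable names here are placeholder assumptions, not the actual project code):

    object RunLSAExample {
      def main(args: Array[String]): Unit = {
        // Take the wikidump path from the command line; fall back to a placeholder
        // default so the job can still run without an argument.
        val wikidumpPath = args.headOption.getOrElse("hdfs:///user/ds/wikidump.xml")
        // ... pass wikidumpPath to the code that parses the dump, instead of a
        // hard-coded constant that requires recompiling to change.
      }
    }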

About docIds: In chapter 6, it says

"creating a mapping of row IDs to document titles is a little more difficult. To achieve it, we can use the zipWithUniqueId function ..."

Another way to keep track of docIds is to use IndexedRowMatrix instead of RowMatrix. This way, document ids are embedded in the SVD model and no longer depend on the partitioning. This technique has many advantages, one of which is that it is now possible to save the SVD model for later use.
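As a rough sketch of the idea (the RDD name and the value of k are assumptions, not this PR's exact code): build the matrix from (docId, vector) pairs, and computeSVD then returns a U that is itself an IndexedRowMatrix, so each row of U still carries its document id.

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
    import org.apache.spark.rdd.RDD

    // docVecs: RDD of (docId, term vector) pairs produced earlier in the pipeline
    def svdWithDocIds(docVecs: RDD[(Long, Vector)], k: Int) = {
      val mat = new IndexedRowMatrix(docVecs.map { case (id, vec) => IndexedRow(id, vec) })
      // U comes back as an IndexedRowMatrix, so the document ids survive the SVD
      mat.computeSVD(k, computeU = true)
    }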

To generate doc ids, I still use zipWithUniqueId, which is available only on RDDs. A better way would be to use the SQL function monotonically_increasing_id:

    import org.apache.spark.sql.functions._
    // appends a unique 64-bit id column; the values are not consecutive across partitions
    docTermMatrix.withColumn("id", monotonically_increasing_id())

but this generates huge ids (about 10 digits long), which are harder to read. Hence the addNiceRowId method.
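The addNiceRowId implementation itself isn't shown here; a hypothetical sketch of a helper in that spirit, assuming it assigns small consecutive ids by zipping the DataFrame's rows with an index:

    import org.apache.spark.sql.{DataFrame, Row, SparkSession}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // Hypothetical helper: append a compact, consecutive id column instead of
    // the sparse 64-bit values produced by monotonically_increasing_id.
    def addNiceRowId(df: DataFrame, spark: SparkSession, idCol: String = "id"): DataFrame = {
      val rowsWithId = df.rdd.zipWithIndex().map { case (row, idx) =>
        Row.fromSeq(row.toSeq :+ idx)
      }
      val schema = StructType(df.schema.fields :+ StructField(idCol, LongType, nullable = false))
      spark.createDataFrame(rowsWithId, schema)
    }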

(By the way, loved your book, nice work!)

srowen (Collaborator) commented Jun 10, 2017

This looks like a good suggestion. The book has just gone to press though, so I'm not sure we can add this for the 2nd edition. But it can stay here as a note and suggestion.
