Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Summary output short / wrong html text #27

Open
chrisribe opened this issue Sep 1, 2015 · 3 comments
Open

Summary output short / wrong html text #27

chrisribe opened this issue Sep 1, 2015 · 3 comments

Comments

@chrisribe
Copy link

Hi,
I am trying to get a summary text for a given webpage but the returned summary is short / bad and I cant figure out why....

Here are my steps:

  1. Use needle to get the url html data
  2. Run the data through readability so only the core of the html is extracted.
  3. Sanitize the html to only return the text.
  4. Run summarize...

Why am I getting one or two lines of summary ? Can I control how long the summary output should be? I want the output to be something like http://smmry.com

Here is the code:

var needle = require('needle');
var read = require('node-readability');
var sanitizeHtml = require('sanitize-html');
var SummaryTool = require('node-summary');

var url = " http://www.inc.com/gene-marks/the-one-way-to-tell-if-you-re-a-successful-entrepreneur.html?cid=sf01001";

needle.get(url, function(error, response) {

  if (!error && response.statusCode == 200){
    read(response.body, function(err, article, meta) {

        var cleanCont = sanitizeHtml(article.content, {
          allowedTags: [ ],
          allowedAttributes: {}
        });
        SummaryTool.summarize(article.title, cleanCont, function(err, summary) {
            if(err) console.log("Something went wrong man!");

            console.log(summary);

            console.log("Original Length " + (article.title.length + cleanCont.length));
            console.log("Summary Length " + summary.length);
            console.log("Summary Ratio: " + (100 - (100 * (summary.length / (article.title.length + cleanCont.length)))));
        });
    });
  }

});

Thanks
Chris

@chrisribe
Copy link
Author

Well some progress added carriage returns on html </p> and <br> so the algorithm knows the text separations... Not perfect but better. Any ideas?

var needle = require('needle');
var read = require('node-readability');
var sanitizeHtml = require('sanitize-html');
var SummaryTool = require('node-summary');

var url = " http://www.inc.com/gene-marks/the-one-way-to-tell-if-you-re-a-successful-entrepreneur.html?cid=sf01001";

needle.get(url, function(error, response) {

  if (!error && response.statusCode == 200){
    read(response.body, function(err, article, meta) {

        var str = article.content.replace(/<\/p>/g, '<\/p>\n\n');
        str = str.replace(/<br>/g, '<br>\n');
        str = str.replace(/<br\/>/g, '<br\/>\n');
        str = str.replace(/<br \/>/g, '<br \/>\n');

        var cleanCont = sanitizeHtml(str, {
          allowedTags: [ ],
          allowedAttributes: {}
        });

        SummaryTool.summarize(article.title, cleanCont, function(err, summary) {
            if(err) console.log("Something went wrong man!");

            console.log(summary);

            console.log("Original Length " + (article.title.length + cleanCont.length));
            console.log("Summary Length " + summary.length);
            console.log("Summary Ratio: " + (100 - (100 * (summary.length / (article.title.length + cleanCont.length)))));
        });

    });
  }

});

@SmiffyKMc
Copy link

@chrisribe I'm getting the same. I will write a good couple of lines for content and try to summarize it but I only get like a line back? The example returned shows loads of lines but summarizes fine then? How can we get it to summarize better?

@chrisribe
Copy link
Author

Never got it to work exactly as I wanted it.
Moved to another project and put this aside so...

Sorry.
Chris

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants