This works for me in PHP #2

@juanmf

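<?php
/**
 * CLI script: takes the hOCR output of tesseract as its first argument and
 * writes the text of its <p> nodes to the file given as the second argument.
 */
if ($argc < 3) {
    fwrite(STDERR, "Usage: php {$argv[0]} <hocr-file> <output-file>\n");
    exit(1);
}
$inFile = $argv[1];
$outFile = $argv[2];
$stream = file_get_contents($inFile);
if ($stream === false) {
    fwrite(STDERR, "Cannot read $inFile\n");
    exit(1);
}
// loadHTML() is an instance method; calling it statically fails on PHP 8.
$dom = new DOMDocument();
// Collect parser warnings internally instead of printing them for the hOCR markup.
libxml_use_internal_errors(true);
$dom->loadHTML($stream);
$out = array();
foreach ($dom->getElementsByTagName('p') as $tag) {
    $out[] = $tag->nodeValue;
}

file_put_contents($outFile, implode("\n", $out));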
