|
| 1 | +## Brick\StructuredData |
| 2 | + |
| 3 | +<img src="https://raw.githubusercontent.com/brick/brick/master/logo.png" alt="" align="left" height="64"> |
| 4 | + |
| 5 | +A PHP library to read Microdata, RDFa Lite & JSON-LD structured data in HTML pages. |
| 6 | + |
| 7 | +This library is a foundation to read schema.org structured data in [brick/schema](https://github.com/brick/schema), |
| 8 | +but may be used with other vocabularies. |
| 9 | + |
| 10 | +[Build status](http://travis-ci.org/brick/structured-data) |
| 11 | +[Coverage status](https://coveralls.io/r/brick/structured-data?branch=master) |
| 12 | +[Latest version](https://packagist.org/packages/brick/structured-data) |
| 13 | +[MIT license](http://opensource.org/licenses/MIT) |
| 14 | + |
| 15 | +### Installation |
| 16 | + |
| 17 | +This library is installable via [Composer](https://getcomposer.org/): |
| 18 | + |
| 19 | +```bash |
| 20 | +composer require brick/structured-data |
| 21 | +``` |
| 22 | + |
| 23 | +### Requirements |
| 24 | + |
| 25 | +This library requires PHP 7.2 or later. It makes use of the following extensions: |
| 26 | + |
| 27 | +- [dom](https://www.php.net/manual/en/book.dom.php) |
| 28 | +- [json](https://www.php.net/manual/en/book.json.php) |
| 29 | +- [libxml](https://www.php.net/manual/en/book.libxml.php) |
| 30 | + |
| 31 | +These extensions are enabled by default, and should be available in most PHP installations. |
| 32 | + |
| 33 | +### Project status & release process |
| 34 | + |
| 35 | +This library is under development and is likely to change quickly during the early `0.x` releases. However, it follows a strict convention for backward compatibility (BC) breaks: |
| 36 | + |
| 37 | +The current releases are numbered `0.x.y`. When a non-breaking change is introduced (adding new methods, fixing bugs, |
| 38 | +optimizing existing code, etc.), `y` is incremented. |
| 39 | + |
| 40 | +**When a breaking change is introduced, a new `0.x` version cycle is always started.** |
| 41 | + |
| 42 | +It is therefore safe to lock your project to a given release cycle, such as `0.1.*`. |
| 43 | + |
| 44 | +If you need to upgrade to a newer release cycle, check the [release history](https://github.com/brick/structured-data/releases) |
| 45 | +for a list of changes introduced by each further `0.x.0` version. |
| 46 | + |
| 47 | +### Introduction |
| 48 | + |
| 49 | +The library unifies reading the 3 supported formats (Microdata, RDFa Lite & JSON-LD) under a common interface: |
| 50 | + |
| 51 | +```php |
| 52 | +interface Brick\StructuredData\Reader |
| 53 | +{ |
| 54 | + /** |
| 55 | + * Reads the items contained in the given document. |
| 56 | + * |
| 57 | + * @param DOMDocument $document The DOM document to read. |
| 58 | + * @param string $url The URL the document was retrieved from. This will be used only to resolve relative |
| 59 | + * URLs in property values. No attempt will be performed to connect to this URL. |
| 60 | + * |
| 61 | + * @return Item[] The top-level items. |
| 62 | + */ |
| 63 | + public function read(DOMDocument $document, string $url) : array; |
| 64 | +} |
| 65 | +``` |
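Because `read()` expects a `DOMDocument` rather than a raw string, the HTML has to be parsed first. Below is a minimal sketch of that step using only the `dom` and `libxml` extensions listed under Requirements. The helper name `loadHtmlDocument` is hypothetical and not part of the library, and this is not necessarily how the library itself parses HTML.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): parses an HTML string into
 * a DOMDocument that can then be passed to a Reader's read() method.
 */
function loadHtmlDocument(string $html): DOMDocument
{
    // Real-world HTML is rarely valid; collect libxml warnings internally
    // instead of emitting them as PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $document = new DOMDocument();
    $loaded = $document->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($loaded === false) {
        throw new InvalidArgumentException('Could not parse the given HTML.');
    }

    return $document;
}
```

The resulting document can then be handed to any of the readers, for example `$reader->read(loadHtmlDocument($html), $url)`, where `$reader` is an instance of one of the `Reader` implementations.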
| 66 | + |
| 67 | +There are 3 implementations of this interface, one for each format: |
| 68 | + |
| 69 | +- `MicrodataReader` |
| 70 | +- `RdfaLiteReader` |
| 71 | +- `JsonLdReader` |
| 72 | + |
| 73 | +The `read()` method returns the top-level items found in the document. Every `Item` consists of: |
| 74 | + |
| 75 | +- An optional id (`itemid` in Microdata, `resource` in RDFa Lite, `@id` in JSON-LD) |
| 76 | +- An array of zero or more types; each type is a URL, for example `http://schema.org/Product` |
| 77 | +- An associative array of zero or more properties; each property has a URL as a key, for example `http://schema.org/price`, |
| 78 | + and maps to an array of one or more values; values can be plain strings, or nested `Item` objects |
| 79 | + |
| 80 | +### Quickstart |
| 81 | + |
| 82 | +Here is a working example that reads Microdata from a web page. Just change the URL and give it a try: |
| 83 | + |
| 84 | +```php |
| 85 | +use Brick\StructuredData\Reader\MicrodataReader; |
| 86 | +use Brick\StructuredData\HTMLReader; |
| 87 | +use Brick\StructuredData\Item; |
| 88 | + |
| 89 | +// Let's read Microdata here; |
| 90 | +// You could also use RdfaLiteReader, JsonLdReader, |
| 91 | +// or even use all of them by chaining them in a ReaderChain |
| 92 | +$microdataReader = new MicrodataReader(); |
| 93 | + |
| 94 | +// Wrap into HTMLReader to be able to read HTML strings or files directly, |
| 95 | +// i.e. without manually converting them to DOMDocument instances first |
| 96 | +$htmlReader = new HTMLReader($microdataReader); |
| 97 | + |
| 98 | +// Replace this URL with that of a website you know is using Microdata |
| 99 | +$url = 'http://www.example.com/'; |
| 100 | +$html = file_get_contents($url); |
| 101 | + |
| 102 | +// Read the document and return the top-level items found |
| 103 | +// Note: the URL is only required to resolve relative URLs; no attempt will be made to connect to it |
| 104 | +$items = $htmlReader->read($html, $url); |
| 105 | + |
| 106 | +// Loop through the top-level items |
| 107 | +foreach ($items as $item) { |
| 108 | + echo implode(',', $item->getTypes()), PHP_EOL; |
| 109 | + |
| 110 | + foreach ($item->getProperties() as $name => $values) { |
| 111 | + foreach ($values as $value) { |
| 112 | + if ($value instanceof Item) { |
| 113 | + // We're only displaying the types of the nested item in this example; you would typically |
| 114 | + // recurse through nested Items to get the information you need |
| 115 | + $value = '(' . implode(', ', $value->getTypes()) . ')'; |
| 116 | + } |
| 117 | + |
| 118 | + // If $value is not an Item, then it's a plain string |
| 119 | + |
| 120 | + echo " - $name: $value", PHP_EOL; |
| 121 | + } |
| 122 | + } |
| 123 | +} |
| 124 | +``` |
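The Quickstart only prints the types of nested items. As a rough sketch of the recursion its comment suggests, the function below renders an item and everything nested under it as an indented outline. The name `describeItem` is hypothetical and not part of the library. It relies only on the `getTypes()` and `getProperties()` methods used in the Quickstart, and assumes that any value that is not a string is a nested `Item`.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): renders an item and all of
 * its nested items as an indented text outline.
 *
 * $item is expected to expose getTypes() and getProperties(), as Item does
 * in the Quickstart.
 */
function describeItem(object $item, int $depth = 0): string
{
    $indent = str_repeat('  ', $depth);
    $output = $indent . implode(', ', $item->getTypes()) . PHP_EOL;

    foreach ($item->getProperties() as $name => $values) {
        foreach ($values as $value) {
            if (is_string($value)) {
                // Plain string value: print it next to the property name.
                $output .= "$indent  - $name: $value" . PHP_EOL;
            } else {
                // Nested item: print the property name, then recurse deeper.
                $output .= "$indent  - $name:" . PHP_EOL;
                $output .= describeItem($value, $depth + 2);
            }
        }
    }

    return $output;
}
```

With the `$items` array from the Quickstart, this could be used as `foreach ($items as $item) { echo describeItem($item); }`.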
| 125 | + |
| 126 | +### Known issues |
| 127 | + |
| 128 | +- No support for the `itemref` attribute in `MicrodataReader` |
| 129 | +- No support for the `prefix` attribute in `RdfaLiteReader`; only [predefined prefixes](https://www.w3.org/2011/rdfa-context/rdfa-1.1) are supported right now |
| 130 | +- No proper support for `@context` in `JsonLdReader`; right now, only string values are accepted in `@context`, and each is treated as a vocabulary identifier; this works fine with simple markup like that used in the examples on [schema.org](https://schema.org/), but may fail with more complex documents |
| 131 | + |
| 132 | +#### Note about JSON-LD's `@context` |
| 133 | + |
| 134 | +While `JsonLdReader` should be able to handle a proper context object in the future, its goal will never be to be a |
| 135 | +fully compliant JSON-LD parser; in particular, it will *never* attempt to fetch a JSON-LD context referenced by a URL. |
| 136 | + |
| 137 | +This is consistent with how indexing robots typically crawl the web: they do not fetch remote contexts, so they can |
| 138 | +extract structured data from a web page without requesting additional documents. |
| 139 | + |
| 140 | +The aim of `JsonLdReader`, and the other `Reader` implementations for that matter, is to be able to parse a document with the same capabilities as the [Google Structured Data Testing Tool](https://search.google.com/structured-data/testing-tool/) or the [Yandex Structured data validator](https://webmaster.yandex.com/tools/microtest/), no more, no less. These tools [do not load external context files](https://webmasters.stackexchange.com/q/123425/18342). |