|
| 1 | +# Import |
| 2 | + |
| 3 | +## Default import options |
| 4 | + |
| 5 | +Every index has a `default_import_options` setting that specifies the default options for its imports: |
| 6 | + |
| 7 | +```ruby |
| 8 | +class ProductsIndex < Chewy::Index |
| 9 | + index_scope Product.includes(:tags) |
| 10 | + default_import_options batch_size: 100, bulk_size: 10.megabytes, refresh: false |
| 11 | + |
| 12 | + field :name |
| 13 | + field :tags, value: -> { tags.map(&:name) } |
| 14 | +end |
| 15 | +``` |
| 16 | + |
| 17 | +See [import.rb](../lib/chewy/index/import.rb) for available options. |
| 18 | + |
| 19 | +## Raw import |
| 20 | + |
| 21 | +Another way to speed up imports is raw import, which is only available in the ActiveRecord adapter. ActiveRecord model instantiation often consumes most of the CPU and RAM during an import: time is wasted converting, say, timestamps from strings and then serializing them back to strings. Chewy can instead operate on raw hashes of data obtained directly from the database. All you need to provide is a way to convert such a hash into a lightweight object that mimics the behaviour of the normal ActiveRecord object. |
| 22 | + |
| 23 | +```ruby |
| 24 | +class LightweightProduct |
| 25 | + def initialize(attributes) |
| 26 | + @attributes = attributes |
| 27 | + end |
| 28 | + |
| 29 | + # Depending on the database, `created_at` might |
| 30 | + # be in different formats. In PostgreSQL, for example, |
| 31 | + # you might see the following format: |
| 32 | + # "2016-03-22 16:23:22" |
| 33 | + # |
| 34 | + # Taking into account that Elastic expects something different, |
| 35 | + # one might do something like the following, just to avoid |
| 36 | + # unnecessary String -> DateTime -> String conversion. |
| 37 | + # |
| 38 | + # "2016-03-22 16:23:22" -> "2016-03-22T16:23:22Z" |
| 39 | + def created_at |
| 40 | + @attributes['created_at'].tr(' ', 'T') << 'Z' |
| 41 | + end |
| 42 | +end |
| 43 | + |
| 44 | +index_scope Product |
| 45 | +default_import_options raw_import: ->(hash) { |
| 46 | + LightweightProduct.new(hash) |
| 47 | +} |
| 48 | + |
| 49 | +field :created_at, 'datetime' |
| 50 | +``` |
| 51 | + |
| 52 | +You can also pass the `:raw_import` option to the `import` method explicitly. |
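The lightweight-object idea can be tried without a database or Chewy. The following is only a sketch: `RawRow` is a hypothetical class name, and it assumes the raw hash uses string keys, as in the `LightweightProduct` example. It overrides the one attribute that needs reformatting and lets every other column fall through to the raw hash.

```ruby
# Hypothetical wrapper (not part of Chewy): exposes the keys of a raw
# database row as reader methods, so field definitions can call them
# the same way they would on an ActiveRecord object.
class RawRow
  def initialize(attributes)
    @attributes = attributes
  end

  # Override only the attributes that need reformatting.
  # "2016-03-22 16:23:22" -> "2016-03-22T16:23:22Z"
  def created_at
    "#{@attributes.fetch('created_at').tr(' ', 'T')}Z"
  end

  # Every other column is returned from the raw hash unchanged.
  def method_missing(name, *args)
    key = name.to_s
    return @attributes[key] if args.empty? && @attributes.key?(key)

    super
  end

  def respond_to_missing?(name, include_private = false)
    @attributes.key?(name.to_s) || super
  end
end

# Sample row with made-up values, shaped like a raw database result.
row = RawRow.new('id' => '7', 'name' => 'Lamp', 'created_at' => '2016-03-22 16:23:22')
puts row.name
puts row.created_at
```

Unknown columns still raise `NoMethodError`, so a typo in a field definition fails loudly instead of silently indexing `nil`.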
| 53 | + |
| 54 | +## Index creation during import |
| 55 | + |
| 56 | +By default, when you perform an import, Chewy checks whether the index exists and creates it if it's absent. |
| 57 | +You can turn this feature off to reduce the number of requests to Elasticsearch. |
| 58 | +To do so, set the `skip_index_creation_on_import` parameter to `true` in your `config/chewy.yml`. |
| 59 | + |
| 60 | +## Skip record fields during import |
| 61 | + |
| 62 | +You can use `ignore_blank: true` to skip fields whose values return `true` for `.blank?`: |
| 63 | + |
| 64 | +```ruby |
| 65 | +index_scope Country |
| 66 | +field :id |
| 67 | +field :cities, ignore_blank: true do |
| 68 | + field :id |
| 69 | + field :name |
| 70 | + field :surname, ignore_blank: true |
| 71 | + field :description |
| 72 | +end |
| 73 | +``` |
| 74 | + |
| 75 | +### Default values for different types |
| 76 | + |
| 77 | +By default, `ignore_blank` is `false` for every type except `geo_point`. |
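To make the effect of `ignore_blank` concrete, here is a small sketch that runs without Rails or Chewy. Both `blank_value?` and `compose_document` are hypothetical helpers written for this illustration; they approximate the described behaviour and are not Chewy's implementation.

```ruby
# Minimal stand-in for ActiveSupport's `blank?`, so the sketch runs
# without Rails: nil, false, empty collections and whitespace-only strings.
def blank_value?(value)
  case value
  when nil, false then true
  when String then value.strip.empty?
  else value.respond_to?(:empty?) ? value.empty? : false
  end
end

# Hypothetical helper (not Chewy's implementation): builds a document
# from `values`, leaving out blank values for the keys in `ignore_blank`.
def compose_document(values, ignore_blank: [])
  values.reject { |key, value| ignore_blank.include?(key) && blank_value?(value) }
end

# Made-up sample data: `surname` is blank and flagged, `description`
# is blank but not flagged, so only `surname` is dropped.
city = { id: 1, name: 'Oslo', surname: '', description: nil }
document = compose_document(city, ignore_blank: [:surname])
p document
```

Only fields that opt in are dropped; a blank value in any other field is still sent to the index.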
| 78 | + |
| 79 | +## Journaling |
| 80 | + |
| 81 | +You can record all actions in a separate journal index in Elasticsearch. |
| 82 | +When you create, update, or destroy documents, the action is saved in this special index. |
| 83 | +If you act on a batch of documents (e.g. during an index reset), the batch is saved as one record that includes the primary key of each affected document. |
| 84 | +A typical journal record looks like this: |
| 85 | + |
| 86 | +```json |
| 87 | +{ |
| 88 | + "action": "index", |
| 89 | + "object_id": [1, 2, 3], |
| 90 | + "index_name": "...", |
| 91 | + "created_at": "<timestamp>" |
| 92 | +} |
| 93 | +``` |
| 94 | + |
| 95 | +This feature is turned off by default. |
| 96 | +You can turn it on by setting the `journal` option to `true` in `config/chewy.yml`. |
| 97 | + |
| 98 | +You can also pass this option when importing an index: |
| 99 | + |
| 100 | +```ruby |
| 101 | +CityIndex.import journal: true |
| 102 | +``` |
| 103 | + |
| 104 | +Or as a default import option for an index: |
| 105 | + |
| 106 | +```ruby |
| 107 | +class CityIndex < Chewy::Index |
| 108 | + index_scope City |
| 109 | + default_import_options journal: true |
| 110 | +end |
| 111 | +``` |
| 112 | + |
| 113 | +You may be wondering why you need it. The answer is simple: so you don't lose data. |
| 114 | + |
| 115 | +Imagine that you reset your index in a zero-downtime manner (into a separate index) |
| 116 | +while somebody keeps updating the data frequently (in the old |
| 117 | +index). All these actions are written to the journal index, and you can |
| 118 | +apply them after the index reset using the `Chewy::Journal` interface. |
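The replay idea can be illustrated with plain Ruby. This is a conceptual sketch only: `pending_changes` is a hypothetical helper, the entries are made-up sample data shaped like the journal record shown earlier, and in a real application the replay itself is handled through `Chewy::Journal`.

```ruby
require 'time'

# Hypothetical helper (not Chewy's implementation): collects the ids
# touched in `index_name` since `since`, keeping only the latest action
# for each id so the result can be re-imported or deleted.
def pending_changes(entries, index_name:, since:)
  latest = {}
  entries
    .select { |e| e['index_name'] == index_name && Time.iso8601(e['created_at']) >= since }
    .sort_by { |e| Time.iso8601(e['created_at']) }
    .each { |e| e['object_id'].each { |id| latest[id] = e['action'] } }

  latest.group_by { |_id, action| action }
        .transform_values { |pairs| pairs.map(&:first).sort }
end

# Made-up journal entries written while the index was being reset.
journal = [
  { 'action' => 'index',  'object_id' => [1, 2, 3], 'index_name' => 'cities',
    'created_at' => '2024-01-01T10:00:00Z' },
  { 'action' => 'delete', 'object_id' => [2], 'index_name' => 'cities',
    'created_at' => '2024-01-01T10:05:00Z' },
  { 'action' => 'index',  'object_id' => [9], 'index_name' => 'countries',
    'created_at' => '2024-01-01T10:06:00Z' }
]

changes = pending_changes(journal, index_name: 'cities',
                                   since: Time.iso8601('2024-01-01T09:00:00Z'))
p changes
```

Keeping only the latest action per id means a document that was indexed and then deleted during the reset is deleted rather than re-imported.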
| 119 | + |
| 120 | +When enabled, the journal can grow to an enormous size, so consider setting up a cron job |
| 121 | +that cleans it occasionally using the [`chewy:journal:clean` rake |
| 122 | +task](rake_tasks.md#chewyjournal). |