There is a large index, over 800 GB in size.
It contains billions of records, and most of them (over 95%) are duplicates generated by log resends.
ES caps the search batch size at 10,000, and this index is huge. I tried es-dedupe on it: deduplicating the records from just a 1-minute window took 1 hour of search processing (I checked the ES server and saw 4G of I/O per second).
Maybe there is another way to deal with it.
Read the original index: if a record is unique, write it to a new index; if it is a duplicate, skip it. If the new index is still huge, cap its size, and once the cap is exceeded, write to another new index named like xxx-001.
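
A minimal Python sketch of that idea, using only the standard library so the logic can be checked without a cluster. Everything named here is an assumption for illustration: `fingerprint`, `dedupe_with_rollover`, the `fields` that define a duplicate, and a cap counted in documents rather than bytes. In a real run, `docs` could come from a scroll iterator such as `elasticsearch.helpers.scan`, and the yielded tuples could be turned into actions for `elasticsearch.helpers.bulk`.

```python
import hashlib
import json
from typing import Any, Dict, Iterable, Iterator, Sequence, Tuple


def fingerprint(doc: Dict[str, Any], fields: Sequence[str]) -> str:
    """Hash the fields that define a duplicate (hypothetical choice of fields)."""
    payload = json.dumps([doc.get(f) for f in fields], sort_keys=True, default=str)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def dedupe_with_rollover(
    docs: Iterable[Dict[str, Any]],
    fields: Sequence[str],
    base_name: str = "xxx",
    max_docs_per_index: int = 1_000_000,
) -> Iterator[Tuple[str, str, Dict[str, Any]]]:
    """Yield (target_index, doc_id, doc) for the first occurrence of each record.

    The first target is `base_name`; later ones are `base_name-001`,
    `base_name-002`, ... once the per-index document cap is reached.
    """
    seen = set()   # fingerprints already written; memory grows with unique records
    written = 0    # number of unique records emitted so far
    for doc in docs:
        key = fingerprint(doc, fields)
        if key in seen:
            continue  # duplicate: skip it
        seen.add(key)
        part = written // max_docs_per_index
        target = base_name if part == 0 else f"{base_name}-{part:03d}"
        yield target, key, doc
        written += 1


if __name__ == "__main__":
    sample = [
        {"msg": "a", "ts": 1},
        {"msg": "a", "ts": 1},  # resent log line
        {"msg": "b", "ts": 2},
    ]
    for target, doc_id, doc in dedupe_with_rollover(sample, ["msg", "ts"]):
        print(target, doc_id[:8], doc)
```

If the fingerprint is also used as the document `_id` in the new index, a resent record overwrites its earlier copy instead of creating a second one, so the in-memory `seen` set becomes an optimization rather than a requirement. That may matter here, since the set of unique fingerprints for billions of records might not fit in memory.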