Not working with an index over 800 GB #16

@LQBing

Description

I have a large index, over 800 GB in size.
It holds billions of records, and most of them (over 95%) are duplicates generated by log resends.

ES limits the search batch size to fewer than 10,000 records, and this index is huge. I tried using es-dedupe to dedupe the records from just a 1-minute window, and the search processing took 1 hour (I checked the ES server: 4 GB of I/O per second).

Maybe there is another way to deal with it:
Read the original index. If a record is unique, write it to a new index; if it is a duplicate, skip it. If the new index is still huge, cap its size, and once it exceeds the cap, write to another new index named like xxx-001.
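
A minimal sketch of that idea, in plain Python so it runs without a cluster. Everything here is an assumption for illustration and not part of es-dedupe: the names `fingerprint` and `dedupe_into_indices`, the choice of fields that define a duplicate, and using a document count as the size cap (the proposal above does not say whether the cap is in bytes or documents).

```python
import hashlib
import json
from typing import Dict, Iterable, Iterator, Sequence, Tuple


def fingerprint(record: Dict, fields: Sequence[str]) -> str:
    """Hash the fields that define a duplicate into a fixed-size key (hypothetical helper)."""
    payload = json.dumps([record.get(f) for f in fields], default=str)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def dedupe_into_indices(
    records: Iterable[Dict],
    fields: Sequence[str],
    base_name: str,
    max_docs_per_index: int,
) -> Iterator[Tuple[str, str, Dict]]:
    """Yield (target_index, doc_id, record) for the first occurrence of each record.

    Duplicates are skipped. Once max_docs_per_index unique records have been
    routed to one index, the next ones go to base_name-001, base_name-002, ...
    """
    seen = set()   # fingerprints already written (assumption: fits in memory)
    written = 0    # number of unique records emitted so far
    for record in records:
        key = fingerprint(record, fields)
        if key in seen:
            continue  # duplicate: skip it
        seen.add(key)
        suffix = written // max_docs_per_index
        target = base_name if suffix == 0 else f"{base_name}-{suffix:03d}"
        written += 1
        yield target, key, record


if __name__ == "__main__":
    sample = [
        {"msg": "a", "ts": 1},
        {"msg": "a", "ts": 1},  # resent log line
        {"msg": "b", "ts": 2},
        {"msg": "c", "ts": 3},
    ]
    for target, doc_id, rec in dedupe_into_indices(sample, ["msg", "ts"], "xxx", 2):
        print(target, doc_id[:8], rec)
```

Against a real cluster, the input could probably come from `elasticsearch.helpers.scan` (which pages through the index with the scroll API instead of one large search) and the output could be turned into actions for `elasticsearch.helpers.bulk`. The in-memory `seen` set is the weak point: with billions of records, even 5% unique keys may need several GB, so a disk-backed store or per-time-window processing may be needed.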
