Bulk indexing giving intermittent 400s (due to 100ms timeout?) #788

@aguynamedben

Describe the bug
I'm trying to use the _bulk endpoint. I read the tests in the code and understand that it wants line-by-line JSON as the request body. I got that working, and I'm trying to index Wikipedia articles that look like this:

{"url":"https://en.wikipedia.org/wiki?curid=48688133","title":"Zhenjiang dialect","body":"\nZhenjiang dialect\n\nZhenjiang dialect is a form of Eastern Mandarin spoken in the town of Zhenjiang in Jiangsu Province. The town is situated on the south bank of the Yangtze river between Nanjing and Changzhou. It is thus at the intersection of China's Mandarin and Wu speaking regions. About 2.7 million Chinese live in the area where the Zhenjiang dialect is predominant.\nIn ancient times, Zhenjiang spoke Wu. Today, Wu is the language of nearby Changzhou, as well as Shanghai and Zhejiang Province. Mandarin speakers from the North have been immigrating to Zhenjiang since the fourth century, gradually changing the character of the local dialect. In modern times, the city speaks a dialect that is transitional between the Eastern Mandarin of Nanjing, located just west of the city, and the Taihu dialect of Wu spoken in Changzhou, which is just east of the city. Zhenjiang dialect is comprehensible to Nanjing residents, but not to Changzhou residents.\nThe issue of tones in the Zhenjiang dialect has been a topic scholarly study. Nanjing residents use the four tones of Mandarin, while Changzhou residents use seven or eight tones. According to a study by Qiu Chunan, Zhenjiang dialect has five citation tones: Tone1 (42) (a sharp fall from pitch 4 to pitch 2, or \"yinping\"), Tone2 (35) (a rising tone or \"yangping\"), Tone3 (32) (slight falling tone or \"shang\"), Tone4 (55) (high even or \"qu\"), and Tone5 (5) (checked tone or \"ru\"). Qiu's study used residents who had grown up in the Daxi Road area, where the standard form of the dialect is said to be spoken. The checked tone was a feature of Chinese spoken in the Middle Ages, but it is not part of Mandarin. Applying the theory of government phonology to the issue, Bao Zhiming noted that non-even tones become even when they appear before the high even, or 55, tone.\n\n","id":"65be5132-c479-474a-b8d6-3ed9bc1a19d9"}

I regularly (but not always) get 400s with no helpful response body. I don't see a panic or any logs in Toshi's stdout. From looking at the bulk_insert handler, the 400 is probably due to the index_documents call at the bottom. It seems that within index_document there is some sort of 100ms timeout. I've seen the error happen right after a slight hiccup in my script's output, so I'm wondering if a delay within Toshi is causing the 400s.

It seems that if I use a small batch size, e.g. 5 or 10 records, the timeout is less likely, but I'm trying to insert 5 million documents, so I'd like to use a batch size of 100 or 1,000 and flush at the end.
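For reference, here is a minimal sketch of the kind of client-side batching described above, not the actual script. It assumes the request body is one JSON document per line, as described earlier. The helper names (`batches`, `to_ndjson`, `send_with_retry`) and the retry-on-400 policy are illustrative assumptions, and `send` stands in for whatever function POSTs a body to the _bulk endpoint and returns the HTTP status code.

```python
import json
import time
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batches(docs: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def to_ndjson(chunk: List[dict]) -> bytes:
    """Serialize one batch as line-delimited JSON, one document per line.

    json.dumps escapes newlines inside string values, so each document
    stays on a single line even when the article body contains "\n".
    """
    lines = (json.dumps(doc, ensure_ascii=False) for doc in chunk)
    return ("\n".join(lines) + "\n").encode("utf-8")


def send_with_retry(send: Callable[[bytes], int], body: bytes,
                    attempts: int = 5, base_delay: float = 0.2) -> int:
    """Call `send(body)` until it returns a non-400 status or attempts run out.

    `send` is a hypothetical callable supplied by the caller; retrying on
    400 is an assumption made for this sketch, not documented behavior.
    """
    status = send(body)
    for attempt in range(1, attempts):
        if status != 400:
            break
        # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
        time.sleep(base_delay * 2 ** (attempt - 1))
        status = send(body)
    return status
```

Retrying only hides the symptom, and if a failed batch was partially indexed, resending it could create duplicate documents, so this is a workaround rather than a fix.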

Any ideas?

Thanks for sharing this project, it's really, really cool!

To Reproduce
Steps to reproduce the behavior:

  1. Bulk insert records quickly.
  2. Watch for error.

Expected behavior
201 Created

Desktop:

  • OS: macOS 11.3.1
  • Rust Version: 1.52.1
  • Toshi Version: master @ bbfa8e
