
[Tracking] 1M TPS benchmarks #13130

Open

Description

@Trisfald

Tracking issue for all benchmark-related tasks in the 1 million TPS initiative.

Current Max TPS

State size          | With RPC nodes | Number of shards | Action          | TPS
4 accounts / shard  | no             | 70               | native transfer | 108.5k

Reference benchmark

A 20-shard benchmark that we can use as a baseline to measure performance improvements in neard.

Unlimited config

Baseline max TPS: 56k

Benchmark run

  • Accounts per shard: 2
  • Block time: 2.5s
  • Max block time: 6s
  • produce_chunk_add_transactions_time_limit: 800ms
  • Validator mandates: 1
  • Unlimited setup (no state witness limits, no bandwidth scheduler, no gas limit, no congestion control)
  • All VMs are deployed in the same zone
  • Transactions are injected into chunk producers

Realistic config

Baseline max TPS: 14k

Benchmark run

  • Accounts per shard: 50k
  • Block time: 1.3s
  • Max block time: 4s
  • produce_chunk_add_transactions_time_limit: 400ms
  • Validator mandates: 14 (2/3 of stake)
  • Unlimited setup (no state witness limits, no bandwidth scheduler, no gas limit, no congestion control)
  • All VMs are deployed in the same zone
  • Transactions are injected into chunk producers
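
For intuition, the two baselines above translate into per-chunk load as follows. This is a back-of-the-envelope sketch using only the numbers listed (chain TPS, block time, shard count); it ignores receipts and cross-shard overhead.

```python
# Back-of-the-envelope: convert chain-level TPS into the average number of
# transactions each chunk has to apply per block. Inputs are the baseline
# numbers above; receipts and cross-shard traffic are ignored.

def per_chunk_load(chain_tps: float, block_time_s: float, num_shards: int) -> float:
    txs_per_block = chain_tps * block_time_s
    return txs_per_block / num_shards

# Unlimited config: 56k TPS, 2.5s blocks, 20 shards.
print(per_chunk_load(56_000, 2.5, 20))  # -> 7000.0 transactions per chunk

# Realistic config: 14k TPS, 1.3s blocks, 20 shards.
print(per_chunk_load(14_000, 1.3, 20))  # -> 910.0 transactions per chunk
```

So the unlimited baseline asks each chunk producer to apply roughly 7-8x more transactions per chunk than the realistic one, despite the slower block time.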

Tasks

High priority

  • Use different hardware
    We currently use N2D-class machines. We can try something with a faster CPU.
    Using machines with better single-threaded performance (C2D) yielded a TPS improvement of ~15% (dashboard)
  • Investigate chunk misses bottleneck
    The chain breaks when there's too much load and chunks can't be endorsed in time.
  • Run a more realistic network @Trisfald
    • 20 shards
    • large state
    • mainnet-like config
  • Investigate how chunk validators impact performance

Low priority

  • Attempt to create more shards
    I've stopped at 70 shards for now. With 80 shards I had trouble even creating accounts. It might be possible to push the number of shards to 75 or, with some tweaks, higher.
  • Try even higher block time
    Especially on larger networks, increasing block time has been an effective way to improve TPS. We can try more aggressive configurations.
    Not very successful: increasing block time to very large values didn't help TPS

Benchmarks

State generation

Create genesis, adjust node configuration, and build a suitable initial database state. A sketch of what uniform account generation might look like follows the list below.

  • Tool to generate uniform state locally, for 50 shards
    • Minimal state (a few accounts): synth-bm from CRT
    • Large state
  • Tool to generate uniform state in forknet, for 50 shards
    • Minimal state (a few accounts)
    • Large state
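
As a rough illustration of the kind of uniform state these tools need to produce, the sketch below spreads account IDs evenly across shards. It assumes range-based shard boundaries over lexicographically ordered account IDs; the naming scheme and helper are hypothetical, not synth-bm's actual logic.

```python
# Illustrative only: generate a uniform set of benchmark accounts whose IDs
# stay lexicographically grouped per shard, so range-based shard boundaries
# can split them evenly. The naming scheme is hypothetical, not synth-bm's.

def gen_accounts(num_shards: int, accounts_per_shard: int) -> list[str]:
    return [
        f"shard{s:03d}_user{i:06d}.near"
        for s in range(num_shards)
        for i in range(accounts_per_shard)
    ]

# e.g. the "realistic" reference benchmark: 20 shards x 50k accounts each.
accounts = gen_accounts(num_shards=20, accounts_per_shard=50_000)
print(len(accounts))  # 1000000
```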

Traffic generation

Generate transactions to stress the network. A sketch of the sender/receiver selection that distinguishes intra-shard from cross-shard traffic follows the list below.

  • Native token transfers
    • Evaluate existing tools made by CRT
    • Tool to generate intra-shard traffic locally
    • Tool to generate cross-shard traffic locally
    • Tool to generate intra-shard traffic in forknet
    • Tool to generate cross-shard traffic in forknet
  • Fungible token transfers:
    TODO
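
Here is a minimal sketch of the piece that differs between the intra-shard and cross-shard tools: selecting sender/receiver pairs. The accounts_by_shard map and the helper name are hypothetical, and building, signing, and submitting the actual transfer transaction is out of scope.

```python
import random

# Minimal sketch: choose sender/receiver pairs for intra- vs cross-shard
# traffic. `accounts_by_shard` maps a shard index to the accounts it holds.
# For intra-shard pairs, sender == receiver is allowed for simplicity.

def pick_pair(accounts_by_shard: dict[int, list[str]],
              cross_shard: bool) -> tuple[str, str]:
    shards = list(accounts_by_shard)
    src = random.choice(shards)
    dst = random.choice([s for s in shards if s != src]) if cross_shard else src
    sender = random.choice(accounts_by_shard[src])
    receiver = random.choice(accounts_by_shard[dst])
    return sender, receiver

# Intra-shard pairs touch a single shard's state; cross-shard pairs also
# exercise receipt routing between shards.
```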

Benchmark setup

  • Automation to run local benchmarks
    • Support for native token transfers
    • Support for fungible token transfers
  • Automation to run multi-node benchmarks (forknet)
    • Support for native token transfers
    • Support for fungible token transfers

Benchmark runs

  • Native token transfers
    • Intra-shard traffic
      • Single node benchmark
      • Multi node local benchmark
      • Forknet benchmark
    • Cross-shard traffic
      • Single node benchmark
      • Multi node local benchmark
      • Forknet benchmark

Issues found

  • High priority

    • [Low number of shards] Client actor bottleneck: TPS is suboptimal due to the single-threaded design of the client actor. See also ClientActor ingress saturates at ~5K TPS #12963.
    • [High number of shards] Chunk production / endorsement bottleneck: chain TPS is limited by the appearance of chunk misses, which create a snowball effect and keep increasing in number. See Decouple chunk endorsement processing from client actor thread #13190 (a toy model of this snowball dynamic follows the list below)
    • RPC nodes are unable to keep up with the network, because they track all shards. In a real network they also need to sustain the extra load of accepting user transactions. Even without sending any transactions to the RPC node, and with memtries enabled, I have observed TX application speeds of 8k-15k TPS while the chain can go above 40k TPS
  • Medium priority

    • The number of peer connections must be equal to or higher than the number of chunk producers in the network. If not, chain TPS degrades significantly.
    • The bigger the state, the lower the TPS. This affects single-shard performance, which is expected, but perhaps not to this extent:
      1k accounts -> 4k TPS
      100k accounts -> 2.7k TPS
    • [x] The chain breaks at relatively low TPS due to exponentially growing block times, caused by chunk misses from a lack of endorsements. This is a side effect of the client actor bottleneck: endorsements are not processed in time. It does not happen if proper gas limits are in place.
  • Low priority

    • Sending transactions directly to chunk validators doesn't work if they don't track all shards, due to timeouts. Workaround: inject transactions with the transaction generator tool
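
To make the snowball dynamic above concrete, here is a toy model. All constants are hypothetical and the doubling backoff is an illustrative stand-in for the real block-timer behavior; the point is only that once the endorsement work queued per block outgrows the block-time budget, every backoff queues even more work, so block times grow geometrically instead of recovering.

```python
# Toy model of the chunk-miss snowball (all constants hypothetical; the
# doubling backoff is an illustrative stand-in for the real timer policy).
# Endorsement processing time scales with the transactions queued per block,
# which itself grows with the block time, so past a threshold the chain
# cannot recover.

def simulate(injected_tps: float, endorse_cost_s_per_tx: float,
             base_block_time: float = 1.3, steps: int = 6) -> None:
    block_time = base_block_time
    for step in range(steps):
        # More wall-clock time per block -> more queued transactions ->
        # longer endorsement processing on the single client-actor thread.
        endorsement_latency = injected_tps * block_time * endorse_cost_s_per_tx
        if endorsement_latency > block_time:
            block_time *= 2  # chunk missed: back off
        else:
            block_time = max(base_block_time, block_time / 2)  # recover
        print(f"step {step}: block_time = {block_time:.1f}s")

simulate(injected_tps=8_000, endorse_cost_s_per_tx=1e-4)   # stable: 0.8 < 1
simulate(injected_tps=14_000, endorse_cost_s_per_tx=1e-4)  # snowball: 1.4 > 1
```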
