Hi MosaicML team,
Many thanks for releasing the code and models for MosaicBERT! I really appreciate the effort you have put into modernizing the BERT architecture.
I am interested in pretraining MosaicBERT myself, so I have a few questions :)
- Could you share the pretraining configuration for the model with a 512 sequence length? Additionally, do you have hardware recommendations and an approximate pretraining time for MosaicBERT at 512 sequence length? Did you use the phase 1 + phase 2 "trick" of pretraining at sequence length 128 and then running fewer steps at 512? In that case the 128-sequence-length MosaicBERT checkpoint could be "recycled" (see the sketch after this list).
- I'm also interested in which implementation you recommend using, e.g. a specific tagged commit or the upcoming Modernize MosaicBERT #440 PR.
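
To make the first question a bit more concrete, here is a rough sketch of what I mean by the phase 1 + phase 2 setup, written with plain Hugging Face Transformers rather than your Composer-based code; the dataset, batch size, and step counts are placeholders only:

```python
# Rough sketch of the phase 1 + phase 2 idea with plain Hugging Face Transformers.
# This is NOT the MosaicML/Composer recipe; the dataset, batch size, and step
# counts below are placeholders for illustration only.
from datasets import load_dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
config = BertConfig(max_position_embeddings=512)  # allocate 512 positions up front
model = BertForMaskedLM(config)
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

def make_dataset(seq_len):
    # Placeholder corpus; a real run would use the actual pretraining data.
    raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    return raw.map(
        lambda ex: tokenizer(ex["text"], truncation=True,
                             max_length=seq_len, padding="max_length"),
        batched=True, remove_columns=["text"],
    )

def pretrain(seq_len, max_steps, out_dir):
    args = TrainingArguments(output_dir=out_dir, max_steps=max_steps,
                             per_device_train_batch_size=32, report_to="none")
    Trainer(model=model, args=args, train_dataset=make_dataset(seq_len),
            data_collator=collator).train()

# Phase 1: most of the training at sequence length 128.
pretrain(seq_len=128, max_steps=9_000, out_dir="phase1-seq128")
# Phase 2: "recycle" the phase 1 weights for a short run at sequence length 512.
pretrain(seq_len=512, max_steps=1_000, out_dir="phase2-seq512")
```

The point of the sketch is only that phase 2 continues from the phase 1 weights in the same model, which is what I mean by "recycling" the 128-sequence-length checkpoint.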
Many thanks in advance!
Stefan