
1) sequence recovery (sequence design) code request 2) generating .mdb file 3) PEFT 4) best model selection after training  #73

Description

@sj584

Hi,

I successfully ran the finetuning code using config/pretrain/saprot.py and config/Thermostability/saprot.py, and a few new questions came up.

I would really appreciate it if you could answer them.




1. Could you share the sequence recovery (sequence design) code?
I implemented it my own way, but I'm not sure whether it is correct.

The pseudocode would be:

Given:

initial_tokens = ['M#', 'Ev', 'Vp', 'Qp', 'L#', 'Vy', 'Qd', 'Ya', 'Kv']  (initial sequence)
input_tokens   = ['##', 'Ev', '#p', 'Qp', 'L#', '#y', '#d', 'Ya', 'Kv']  (sequence subtokens masked: ##, #p, #y, #d)

the model predicts a full token (sequence + structure subtoken) at each masked position, so the predicted structure subtoken can be wrong
(e.g. ## -> Gr, #p -> Gp, #y -> Gp, #d -> Sd).

Then I extract only the sequence subtoken from each prediction and reconstruct the token, keeping the original structure subtoken:

input_tokens     = ['##', 'Ev', '#p', 'Qp', 'L#', '#y', '#d', 'Ya', 'Kv']
recovered_tokens = ['G#', 'Ev', 'Gp', 'Qp', 'L#', 'Gy', 'Sd', 'Ya', 'Kv']
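
For reference, here is a minimal runnable sketch of that recovery loop, assuming the checkpoint loads through HuggingFace's EsmTokenizer/EsmForMaskedLM interface as in the SaProt README; the model path is a placeholder, and taking the argmax over the full vocabulary (rather than restricting it to tokens that share the input structure subtoken) is my own simplification:

```python
import torch
from transformers import EsmTokenizer, EsmForMaskedLM

model_path = "/your/path/to/SaProt_650M_AF2"  # placeholder
tokenizer = EsmTokenizer.from_pretrained(model_path)
model = EsmForMaskedLM.from_pretrained(model_path).eval()

input_tokens = ['##', 'Ev', '#p', 'Qp', 'L#', '#y', '#d', 'Ya', 'Kv']
inputs = tokenizer("".join(input_tokens), return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0]  # (seq_len, vocab_size)

recovered_tokens = []
for i, tok in enumerate(input_tokens, start=1):  # offset 1 for <cls>
    if tok[0] == "#":  # sequence subtoken is masked
        pred = tokenizer.convert_ids_to_tokens(int(logits[i].argmax()))
        # keep the predicted sequence letter, restore the input structure letter
        recovered_tokens.append(pred[0] + tok[1])
    else:
        recovered_tokens.append(tok)
print(recovered_tokens)
```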





2. I also wrote code to generate the .mdb dataset file.
I checked that it runs OK, but I'm not sure whether the IDs can be arbitrary (e.g. 550, 5500) or not.
I would appreciate it if you could verify this code against yours.

Generating the .mdb file:

```python
import lmdb
import json

# Example data (the keys here are arbitrary IDs, e.g. "550" and "5500")
data = {
    "550": {"description": "A0A0J6SSW7", "seq": "M#R#A#A#A#T#L#L#V#T#L#C#V#V#G#A#N#E#A#R#A#GfIwLe..."},
    "5500": {"description": "A0A535NFD5", "seq": "AdAvRvEvAvLvRvAvSvGvHdPdFdVdEdAdPpGpEpAaAdFp..."},
    # Add more entries here
}

# Open (or create) an LMDB environment; map_size is the maximum DB size in bytes
env = lmdb.open("my_lmdb_file", map_size=int(1e9))
with env.begin(write=True) as txn:
    # Store the dataset length, used by int(self._get("length")) in SaprotFoldseekDataset
    txn.put("length".encode("utf-8"), str(len(data)).encode("utf-8"))
    for key, value in data.items():
        # Serialize each value to JSON; LMDB keys and values must be bytes
        txn.put(key.encode("utf-8"), json.dumps(value).encode("utf-8"))

# Close the LMDB environment
env.close()
```

Reading the .mdb file:

```python
import lmdb

env = lmdb.open("my_lmdb_file", readonly=True)

with env.begin() as txn:
    cursor = txn.cursor()
    for key, value in cursor:
        print(key.decode("utf-8"), value.decode("utf-8"))
```
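
One way to probe the arbitrary-ID question directly: if SaprotFoldseekDataset fetches entries by stringified integer index (which the `length` key suggests, though I have not verified this against the repo), then keys like "550" and "5500" would leave gaps. A small sketch of that check, under that assumption:

```python
import json
import lmdb

# Assumption (not verified): the dataset reads entries by integer index,
# so keys "0" .. str(length - 1) must all exist in the DB.
env = lmdb.open("my_lmdb_file", readonly=True)
with env.begin() as txn:
    length = int(txn.get("length".encode("utf-8")))
    for idx in range(length):
        raw = txn.get(str(idx).encode("utf-8"))
        if raw is None:
            print(f"key '{idx}' is missing -- IDs are not sequential indices")
        else:
            print(idx, json.loads(raw)["description"])
env.close()
```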





3. I once asked whether PEFT is possible, and you kindly answered that it is available in SaprotBaseModel.py.
In that code, I could see that LoRA can be used for downstream tasks.

In my case, I was hoping to first use LoRA for MLM finetuning on a certain protein domain,
and then do further finetuning on a downstream task.

I managed to write the code, but I don't think this approach has been tried here before,
so I was asking your opinion on whether it is a viable approach.

So the steps would be (a rough sketch in code follows below):

  1. Load the SaProt model weights
  2. Use LoRA for MLM finetuning
  3. Load (SaProt weights + LoRA MLM finetuning weights)
  4. Finetune on the downstream task
  5. Load (SaProt weights + LoRA MLM weights + LoRA downstream finetuning weights)
  6. Predict on the downstream task

Alternatively, could the downstream task simply be done by extracting embeddings from
(SaProt weights + LoRA MLM finetuning weights),
since the steps above are rather complicated?
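
For concreteness, here is a rough sketch of steps 2-5 using the HuggingFace peft library directly, independent of the repo's own SaprotBaseModel wrapper; the model path, target_modules, adapter directories, and the single-label head are all illustrative assumptions:

```python
from transformers import EsmForMaskedLM, EsmForSequenceClassification
from peft import LoraConfig, get_peft_model, PeftModel

model_path = "/your/path/to/SaProt_650M_AF2"  # placeholder

# Stage 1: attach a LoRA adapter and finetune with the MLM objective
mlm_model = get_peft_model(
    EsmForMaskedLM.from_pretrained(model_path),
    LoraConfig(r=8, lora_alpha=16, target_modules=["query", "value"]),
)
# ... run domain-specific MLM training here ...
mlm_model.save_pretrained("lora_mlm")  # saves only the adapter weights

# Stage 2: merge the MLM adapter into a fresh backbone, save the merged
# weights, then attach a new adapter (plus a task head) for the downstream task
backbone = EsmForMaskedLM.from_pretrained(model_path)
merged = PeftModel.from_pretrained(backbone, "lora_mlm").merge_and_unload()
merged.save_pretrained("saprot_domain_adapted")

# The classification head is freshly initialized here (num_labels=1 is just
# an example, e.g. for a regression task like Thermostability)
task_model = get_peft_model(
    EsmForSequenceClassification.from_pretrained("saprot_domain_adapted", num_labels=1),
    LoraConfig(r=8, lora_alpha=16, target_modules=["query", "value"]),
)
# ... run downstream training, then save the second adapter as well ...
```

Note that only the adapters are trained in each stage, so the simpler embedding-extraction alternative is essentially this sketch with the second LoRA adapter dropped.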





4. When I ran the code using config/pretrain/saprot.py or config/Thermostability/saprot.py,
it seems that only one model is saved after training.
If so, how can I know whether the saved model is the optimal one?

I could see that the Trainer config sets enable_checkpointing: false.
Should I change it to true and track the results with wandb to find the best model?
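
In case it helps, this is the standard way to keep the best checkpoint in PyTorch Lightning, which I believe the repo's Trainer config maps onto; the monitored metric name is an illustrative assumption and should match whatever the model actually logs during validation:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Keep only the checkpoint with the best monitored validation metric;
# "valid_loss" is a placeholder for the metric name your model logs.
checkpoint_cb = ModelCheckpoint(
    monitor="valid_loss",
    mode="min",       # lower is better for a loss
    save_top_k=1,     # keep only the single best checkpoint
    filename="best-{epoch}-{valid_loss:.4f}",
)

trainer = pl.Trainer(
    enable_checkpointing=True,
    callbacks=[checkpoint_cb],
    # ... other Trainer arguments from the config ...
)
```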





Thank you for reading these long inquiries. Your answers will be very helpful to me :)
