'seq' value is '[]', empty for the LBA dataset #62

@lzhangUT

Hi,
I am working on the LBA dataset and trying to reproduce your results.
I downloaded your LBA dataset in the LMDB format. Downloading and loading the dataset works fine, but the 'seq' value in the dataset is '[]' (empty) for every protein.

  1. Why is that?
  2. I tried to generate the sequences myself using your get_chain_sequences function from sequence.py in the protein folder:

    def get_chain_sequences(df):
        """Return list of tuples of (id, sequence) for different chains of monomers in a given dataframe."""
        # Keep only CA of standard residues
        df = df[df['name'] == 'CA'].drop_duplicates()
        df = df[df['resname'].apply(lambda x: Poly.is_aa(x, standard=True))]
        df['resname'] = df['resname'].apply(Poly.three_to_one)
        chain_sequences = []
        for c, chain in df.groupby(['ensemble', 'subunit', 'structure', 'model', 'chain']):
            seq = ''.join(chain['resname'])
            chain_sequences.append((tuple([str(x) for x in c]), seq))
        return chain_sequences

It also returns an empty list of sequences, so I think there is a bug here.

  3. I modified the function slightly so that I can get the protein sequences. However, some proteins have multiple chains. How should the multiple chains be processed for training, or which chain should be paired with the ligand SMILES?
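
For anyone following along, here is a minimal sketch of what such a modification could look like. It is not necessarily the exact change described above, and it does not confirm the cause of the empty result. It assumes the dataframe has the columns used in get_chain_sequences ('ensemble', 'subunit', 'structure', 'model', 'chain', 'resname', 'name'), and, as a guess only, that atom or residue names might carry whitespace padding or mixed case. The name get_chain_sequences_sketch and the THREE_TO_ONE table are hypothetical; the table stands in for the Poly helpers so the example has no Biopython dependency.

```python
import pandas as pd

# Three-letter to one-letter codes for the 20 standard amino acids.
# Hypothetical stand-in for the Poly.is_aa / Poly.three_to_one helpers.
THREE_TO_ONE = {
    'ALA': 'A', 'ARG': 'R', 'ASN': 'N', 'ASP': 'D', 'CYS': 'C',
    'GLN': 'Q', 'GLU': 'E', 'GLY': 'G', 'HIS': 'H', 'ILE': 'I',
    'LEU': 'L', 'LYS': 'K', 'MET': 'M', 'PHE': 'F', 'PRO': 'P',
    'SER': 'S', 'THR': 'T', 'TRP': 'W', 'TYR': 'Y', 'VAL': 'V',
}

# Grouping columns taken from the original function.
CHAIN_KEYS = ['ensemble', 'subunit', 'structure', 'model', 'chain']


def get_chain_sequences_sketch(df):
    """Return a list of (chain id tuple, one-letter sequence) pairs."""
    df = df.copy()  # avoid modifying the caller's dataframe
    # Assumption: names may be padded or mixed case, so normalize them first.
    df['name'] = df['name'].astype(str).str.strip().str.upper()
    df['resname'] = df['resname'].astype(str).str.strip().str.upper()
    # Keep only alpha-carbon atoms of standard residues.
    ca = df[df['name'] == 'CA'].drop_duplicates()
    ca = ca[ca['resname'].isin(THREE_TO_ONE)]
    chain_sequences = []
    # sort=False keeps chains in order of first appearance.
    for key, chain in ca.groupby(CHAIN_KEYS, sort=False):
        seq = ''.join(chain['resname'].map(THREE_TO_ONE))
        chain_sequences.append((tuple(str(x) for x in key), seq))
    return chain_sequences


# Toy data (made up for illustration, not taken from the LBA dataset).
toy = pd.DataFrame({
    'ensemble': ['e'] * 5,
    'subunit': ['s'] * 5,
    'structure': ['x.pdb'] * 5,
    'model': [0] * 5,
    'chain': ['A', 'A', 'A', 'B', 'B'],
    'resname': ['ALA', 'ALA', 'GLY', 'TRP', 'HOH'],
    'name': [' CA ', 'N', 'CA', 'CA', 'O'],
})
result = get_chain_sequences_sketch(toy)
print(result)
```

On the toy data this yields one entry per chain, so a protein with several chains still comes back as several sequences; how to combine or select them for training is the open question above.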

Thanks for your help.
