'seq' value is '[]' (empty) for the LBA dataset

Hi,
I am working with the LBA dataset and trying to reproduce your results.
I downloaded your LBA dataset in LMDB format. Downloading and loading the dataset work fine, but the 'seq' value is '[]' (empty) for every protein.
1) Why is that?
2) I tried to generate the sequences myself using your get_chain_sequences function in sequence.py in the protein folder:

def get_chain_sequences(df):
    """Return list of tuples of (id, sequence) for different chains of monomers in a given dataframe."""
    # Keep only CA of standard residues
    df = df[df['name'] == 'CA'].drop_duplicates()
    df = df[df['resname'].apply(lambda x: Poly.is_aa(x, standard=True))]
    df['resname'] = df['resname'].apply(Poly.three_to_one)
    chain_sequences = []
    for c, chain in df.groupby(['ensemble', 'subunit', 'structure', 'model', 'chain']):
        seq = ''.join(chain['resname'])
        chain_sequences.append((tuple([str(x) for x in c]), seq))
    return chain_sequences

It also returns an empty list of sequences, so I think there is a bug here.

3) I modified the function slightly so that I can get the protein sequences. However, some proteins have multiple chains. How should multiple chains be processed for training, or which chain should be paired with the ligand SMILES?
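
To make point 3) concrete, here is a minimal sketch of the kind of per-chain extraction I mean. It is only an illustration under assumptions, not the exact change I made and not part of the library: the names THREE_TO_ONE, KEY_COLUMNS and chain_sequences_from_records are hypothetical, the column names are taken from get_chain_sequences above, and the whitespace stripping and upper-casing are a defensive guess rather than a confirmed cause of the empty result. It uses a local residue table instead of the Poly helpers and works on plain row dictionaries.

```python
# Hypothetical sketch: per-chain sequences from atom rows, without Biopython.
# Standard three-letter to one-letter codes for the 20 canonical amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

# Same grouping columns as in get_chain_sequences above.
KEY_COLUMNS = ("ensemble", "subunit", "structure", "model", "chain")


def chain_sequences_from_records(records):
    """Return a list of (chain_key, sequence) tuples.

    `records` is an iterable of dicts with at least 'name' and 'resname'
    keys plus the KEY_COLUMNS (missing key columns are treated as '').
    """
    sequences = {}  # chain key -> list of one-letter codes
    for row in records:
        # Keep only alpha-carbon atoms, one per residue.
        if str(row["name"]).strip() != "CA":
            continue
        # Skip anything that is not a standard amino acid (water, ligands, ...).
        code = THREE_TO_ONE.get(str(row["resname"]).strip().upper())
        if code is None:
            continue
        key = tuple(str(row.get(col, "")) for col in KEY_COLUMNS)
        sequences.setdefault(key, []).append(code)
    # Chains come out in order of first appearance (dicts keep insertion order).
    return [(key, "".join(codes)) for key, codes in sequences.items()]
```

With a pandas dataframe, the rows could be passed as df.to_dict("records"). Unlike the original function, this sketch does not drop duplicate rows and does not sort the chains, so alternate locations would need separate handling.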

Thanks for your help.   
    


