Description
import pandas as pd
import dask.dataframe as dd
import tensorflow_hub as hub

# Load the multilingual Universal Sentence Encoder from TF Hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

df2 = dd.from_pandas(pd.DataFrame({'ngram': ngrams_list}), npartitions=4)
csv_dask = dd.read_csv(csv_file_path)
csv_dask = csv_dask.repartition(npartitions=1).reset_index()

def emb_skill(skill):
    return embed([skill])[0].numpy()

df2 = df2.assign(embeddings=df2['ngram'].map_partitions(lambda series: series.apply(emb_skill), meta='object'))
csv_dask = csv_dask.assign(embeddings=csv_dask['hcms_skills'].map_partitions(lambda series: series.apply(emb_skill), meta='object'))
csv_dask = csv_dask.reset_index()  # reset_index returns a new frame, so the result must be assigned
df2 = df2.compute()  # compute() also returns the result rather than mutating in place
csv_dask = csv_dask.compute()
This is a snippet of my code. I want to use dask.distributed to make it faster, but it gives me this error whenever I try. Can you please help and tell me what I am doing wrong?
INFO:distributed.protocol.pickle:Failed to serialize (<function map_chunk at 0x7e8804b995a0>, Delayed('emb_skill-4a6ac665-a176-4bea-9f6c-c278abbf6062'), ["('from_sequence-956324e59717b7e4a6b95e21dce2c68b', 0)"], None, {}). Exception: can't pickle repeated message fields, convert to list first
2023-08-09 16:30:40,955 - distributed.protocol.core - CRITICAL - Failed to Serialize