[Bug?] Please restore scipdf_parser! It is not working anymore on Colab :( #23

Open
@MatteoRiva95


Hello everyone,

I used to run "scipdf_parser" on Google Colab and it worked very well!
Today I tried to run it again with the same commands, but it no longer works!
Please, can someone help me? :(

Here is the code I was using:

from google.colab import drive
drive.mount('/content/drive')

!pip install git+https://github.com/titipata/scipdf_parser

!python -m spacy download en_core_web_sm

import subprocess
subprocess.Popen("bash serve_grobid.sh", shell=True)

(It now returns: <Popen: returncode: None args: 'bash serve_grobid.sh'>)

!bash serve_grobid.sh

(It now returns: Error: Docker is not installed. Please install Docker before running Grobid.)
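
Since the script complains about Docker, a quick check from Python can confirm whether a `docker` executable is on the PATH of the Colab runtime at all. This is only a minimal sketch: `docker_available` is a hypothetical helper name, and it only tests for the executable, not whether a Docker daemon could actually run.

```python
import shutil


def docker_available(executable: str = "docker") -> bool:
    """Return True if the given executable can be found on PATH.

    Hypothetical helper for illustration; it does not check that a
    Docker daemon is running, only that the command exists.
    """
    return shutil.which(executable) is not None


if not docker_available():
    # Matches the situation reported by serve_grobid.sh above.
    print("Docker not found on PATH; serve_grobid.sh cannot start Grobid here.")
```

If this prints the message, the Grobid step fails before any PDF is parsed, independently of the rest of the notebook.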

import scipdf
import os
import pandas as pd
import warnings
from bs4.builder import XMLParsedAsHTMLWarning
warnings.filterwarnings('ignore', category=XMLParsedAsHTMLWarning)

files_to_process = os.listdir("/content/drive/My Drive/PDF/")[:1500]
for idx, filename in enumerate(files_to_process, start=1):
    if filename.lower().endswith(".pdf"):
        try:
            pmid = filename.split(".")[0]
            percorso_file_csv = f"/content/drive/My Drive/CSV/{pmid}.csv"
            dizionario = scipdf.parse_pdf_to_dict(f'/content/drive/My Drive/PDF/{filename}')
            sections_content = [f"{key}: {value}" for section in dizionario.get('sections', []) for key, value in section.items()]
            #references_content = [f"{key}: {value}" for reference in dizionario.get('references', []) for key, value in reference.items()]
            #figures_content = [f"{key}: {value}" for figure in dizionario.get('figures', []) for key, value in figure.items()]

            content = {
                "Title": f"{dizionario.get('title', '')}\n",
                "Authors": f"{dizionario.get('authors', '')}\n",
                "Publication date": f"{dizionario.get('pub_date', '')}\n",
                "Abstract": f"{dizionario.get('abstract', '')}\n",
                "Sections": "\n".join(sections_content),
                #"References": "\n".join(references_content),
                #"Figures": "\n".join(figures_content),
                "Doi": f"{dizionario.get('doi', '')}"
            }

            # content_str was missing from the pasted snippet; joining the fields like this is an assumption
            content_str = "\n".join(f"{key}: {value}" for key, value in content.items())
            df = pd.DataFrame([[pmid, content_str]])

            df.to_csv(percorso_file_csv, index=False, header=["pmid", "content"])
            print(f"CSV file created successfully for file {filename}")
            print(f"File number {idx}")

        except Exception as e:
            print(e)
            continue

(It now returns:

OSError Traceback (most recent call last)
in <cell line: 9>()
7
8 # Process only the first 10 files
----> 9 files_to_process = os.listdir("/content/drive/My Drive/PDF/")[:1500]
10 for idx, filename in enumerate(files_to_process, start=1):
11 if filename.lower().endswith(".pdf"):

OSError: [Errno 5] Input/output error: '/content/drive/My Drive/PDF/')
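
To find out whether the listing error is transient, the directory listing could be wrapped in a small retry loop. This is a sketch under assumptions: `list_pdfs`, `retries` and `delay` are hypothetical names, and I do not know whether retrying actually resolves Errno 5 on a mounted Drive folder. Note that, unlike the loop above, it filters for `.pdf` first and only then applies the limit.

```python
import os
import time


def list_pdfs(directory, limit=1500, retries=3, delay=2.0):
    """List up to `limit` PDF filenames, retrying if the listing raises OSError.

    Hypothetical helper: the retry count and delay are arbitrary choices.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            names = sorted(os.listdir(directory))
            # Keep only PDFs (case-insensitive), then apply the limit.
            return [n for n in names if n.lower().endswith(".pdf")][:limit]
        except OSError as error:
            last_error = error
            print(f"Attempt {attempt} failed: {error}")
            if attempt < retries:
                time.sleep(delay)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

It would be called as `files_to_process = list_pdfs("/content/drive/My Drive/PDF/")`. If all attempts fail with the same error, the problem is more likely with the Drive mount itself than with this loop.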

What is going on?

Thank you so much in advance!!
