Description
Hello everyone,
I used to run "scipdf_parser" on Google Colab and it worked so well!
Today I tried to launch it again with the same commands, but it does not work anymore!
Please, please can someone help me? :(
Here is the code I was using:
from google.colab import drive
drive.mount('/content/drive')
!pip install git+https://github.com/titipata/scipdf_parser
!python -m spacy download en_core_web_sm
import subprocess
subprocess.Popen("bash serve_grobid.sh", shell=True)
(It now returns: <Popen: returncode: None args: 'bash serve_grobid.sh'>)
!bash serve_grobid.sh
(It now returns: Error: Docker is not installed. Please install Docker before running Grobid.)
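To double-check that message from Python, here is a minimal sketch that only looks for the `docker` executable on the PATH of the Colab runtime. The helper name `tool_available` is hypothetical and not part of scipdf_parser or Grobid. It does not install or start anything.

```python
import shutil


def tool_available(name: str = "docker") -> bool:
    """Return True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None


# Report whether the runtime can see a docker binary at all.
print("docker found on PATH:", tool_available())
```

If this prints `False`, the message from `serve_grobid.sh` at least matches what the runtime reports.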
import scipdf
import os
import pandas as pd
import warnings
from bs4.builder import XMLParsedAsHTMLWarning
warnings.filterwarnings('ignore', category=XMLParsedAsHTMLWarning)
files_to_process = os.listdir("/content/drive/My Drive/PDF/")[:1500]
for idx, filename in enumerate(files_to_process, start=1):
    if filename.lower().endswith(".pdf"):
        try:
            # The file name without its extension is used as the PMID
            pmid = filename.split(".")[0]
            percorso_file_csv = f"/content/drive/My Drive/CSV/{pmid}.csv"
            dizionario = scipdf.parse_pdf_to_dict(f'/content/drive/My Drive/PDF/{filename}')
            sections_content = [f"{key}: {value}" for section in dizionario.get('sections', []) for key, value in section.items()]
            #references_content = [f"{key}: {value}" for reference in dizionario.get('references', []) for key, value in reference.items()]
            #figures_content = [f"{key}: {value}" for figure in dizionario.get('figures', []) for key, value in figure.items()]
            content = {
                "Title": f"{dizionario.get('title', '')}\n",
                "Authors": f"{dizionario.get('authors', '')}\n",
                "Publication date": f"{dizionario.get('pub_date', '')}\n",
                "Abstract": f"{dizionario.get('abstract', '')}\n",
                "Sections": "\n".join(sections_content),
                #"References": "\n".join(references_content),
                #"Figures": "\n".join(figures_content),
                "Doi": f"{dizionario.get('doi', '')}"
            }
            # content_str was not defined in the code as pasted; it is assumed
            # here to be the "key: value" entries of content joined together
            content_str = "\n".join(f"{key}: {value}" for key, value in content.items())
            df = pd.DataFrame([[pmid, content_str]])
            df.to_csv(percorso_file_csv, index=False, header=["pmid", "content"])
            print(f"CSV file created successfully for {filename}")
            print(f"File number {idx}")
        except Exception as e:
            print(e)
            continue
(It now returns:
OSError Traceback (most recent call last)
in <cell line: 9>()
7
8 # Elabora solo i primi 10 file
----> 9 files_to_process = os.listdir("/content/drive/My Drive/PDF/")[:1500]
10 for idx, filename in enumerate(files_to_process, start=1):
11 if filename.lower().endswith(".pdf"):
OSError: [Errno 5] Input/output error: '/content/drive/My Drive/PDF/')
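For reference, the listing step can be wrapped so the `OSError` is reported instead of stopping the cell. This is only a sketch and does not fix the Drive mount itself. The helper name `list_pdfs` is hypothetical, and unlike the line above it filters for `.pdf` first and then applies the limit.

```python
import os


def list_pdfs(folder, limit=1500):
    """Return up to `limit` PDF file names in `folder`, or [] if it cannot be read."""
    try:
        names = sorted(os.listdir(folder))
    except OSError as err:
        # Covers Errno 5 as well as a missing or unmounted folder
        print(f"Cannot read {folder}: {err}")
        return []
    return [name for name in names if name.lower().endswith(".pdf")][:limit]
```

With this, `files_to_process = list_pdfs("/content/drive/My Drive/PDF/")` would yield an empty list while the folder is unreadable, so the loop simply does nothing.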
What is going on?
Thank you so much in advance!!