[Bug?] Please restore scipdf_parser! It is not working anymore on Colab :( #23

Hello everyone,

I used to run "scipdf_parser" on Google Colab and it worked so well!
Today I tried to run it again with the same commands, but it does not work anymore!
Please, can someone help me? :(

Here is the code I was using:

from google.colab import drive
drive.mount('/content/drive')

!pip install git+https://github.com/titipata/scipdf_parser

!python -m spacy download en_core_web_sm

import subprocess
subprocess.Popen("bash serve_grobid.sh", shell=True)

(It now returns: <Popen: returncode: None args: 'bash serve_grobid.sh'>)

!bash serve_grobid.sh

(It now returns: Error: Docker is not installed. Please install Docker before running Grobid.)
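
Two notes on the outputs above. `returncode: None` from `subprocess.Popen` only means the process had not finished when the object was displayed, since `Popen` returns immediately; it does not say whether the script succeeded. The second message says that no Docker installation was found. A minimal diagnostic sketch, assuming the only goal is to confirm both points from Python; the helper names `docker_available` and `run_and_capture` are hypothetical and not part of scipdf_parser:

```python
import shutil
import subprocess
import sys


def docker_available() -> bool:
    """Return True if a `docker` executable is found on the PATH."""
    return shutil.which("docker") is not None


def run_and_capture(command: list[str]) -> tuple[int, str]:
    """Run a command to completion and return (exit code, combined output).

    Unlike subprocess.Popen, subprocess.run waits for the process to end,
    so the exit code is always set.
    """
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout
        text=True,
        check=False,  # do not raise on a non-zero exit code
    )
    return result.returncode, result.stdout


if __name__ == "__main__":
    print("docker on PATH:", docker_available())
    # Harmless demo command; replace with ["bash", "serve_grobid.sh"] to see
    # the script's real exit code and messages.
    code, output = run_and_capture([sys.executable, "-c", "print('hello')"])
    print(code, output)
```

If `docker_available()` returns False, the error from `serve_grobid.sh` is consistent with the runtime simply not providing Docker, rather than with a problem in the PDF-processing code itself.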

import scipdf
import os
import pandas as pd
import warnings
from bs4.builder import XMLParsedAsHTMLWarning
warnings.filterwarnings('ignore', category=XMLParsedAsHTMLWarning)

files_to_process = os.listdir("/content/drive/My Drive/PDF/")[:1500]
for idx, filename in enumerate(files_to_process, start=1):
    if filename.lower().endswith(".pdf"):
        try:
            pmid = filename.split(".")[0]
            percorso_file_csv = f"/content/drive/My Drive/CSV/{pmid}.csv"
            dizionario = scipdf.parse_pdf_to_dict(f'/content/drive/My Drive/PDF/{filename}')
            sections_content = [f"{key}: {value}" for section in dizionario.get('sections', []) for key, value in section.items()]
            #references_content = [f"{key}: {value}" for reference in dizionario.get('references', []) for key, value in reference.items()]
            #figures_content = [f"{key}: {value}" for figure in dizionario.get('figures', []) for key, value in figure.items()]

            content = {
                "Title": f"{dizionario.get('title', '')}\n",
                "Authors": f"{dizionario.get('authors', '')}\n",
                "Publication date": f"{dizionario.get('pub_date', '')}\n",
                "Abstract": f"{dizionario.get('abstract', '')}\n",
                "Sections": "\n".join(sections_content),
                #"References": "\n".join(references_content),
                #"Figures": "\n".join(figures_content),
                "Doi": f"{dizionario.get('doi', '')}"
            }

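            # "content_str" was used but never defined in the code as posted;
            # joining the fields into one string is an assumed reconstruction.
            content_str = "\n".join(f"{key}: {value}" for key, value in content.items())
            df = pd.DataFrame([[pmid, content_str]])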

            df.to_csv(percorso_file_csv, index=False, header=["pmid", "content"])
            print(f"The CSV file was created successfully for the file {filename}")
            print(f"File number {idx}")

        except Exception as e:
            print(e)
            continue

(It now returns: 
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-13-6ed5ea737114> in <cell line: 9>()
      7 
      8 # Process only the first 10 files
----> 9 files_to_process = os.listdir("/content/drive/My Drive/PDF/")[:1500]
     10 for idx, filename in enumerate(files_to_process, start=1):
     11     if filename.lower().endswith(".pdf"):

OSError: [Errno 5] Input/output error: '/content/drive/My Drive/PDF/')
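
Note that this last error is raised by `os.listdir` on the mounted Drive folder, before any scipdf call runs, so it looks separate from the Docker problem. A minimal sketch for isolating it, assuming the failure might be transient; whether retrying actually helps here is not known, and the helper name `list_pdfs` and its `retries`/`delay` parameters are hypothetical:

```python
import os
import time


def list_pdfs(folder: str, retries: int = 3, delay: float = 2.0) -> list[str]:
    """List the .pdf files in `folder`, retrying when os.listdir raises OSError."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            names = os.listdir(folder)
            # Same case-insensitive filter as the loop in the original code.
            return sorted(n for n in names if n.lower().endswith(".pdf"))
        except OSError as error:
            last_error = error
            print(f"Attempt {attempt}/{retries} failed: {error}")
            if attempt < retries:
                time.sleep(delay)
    # All attempts failed: re-raise the last error so it is not hidden.
    raise last_error


if __name__ == "__main__":
    files_to_process = list_pdfs("/content/drive/My Drive/PDF/")[:1500]
    print(f"{len(files_to_process)} PDF files found")
```

If the listing still fails after the retries, the problem is with the Drive mount itself and can be reported independently of scipdf_parser.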

What is going on? 

Thank you so much in advance!!
