Skip to content

Conversation

@Adibvafa
Copy link
Collaborator

@Adibvafa Adibvafa commented Oct 7, 2024

PR Type

Feature

Short Description

Add pipeline to create finegrained dataset

Tests Added

None


This change is Reviewable

@Adibvafa Adibvafa self-assigned this Oct 7, 2024
@Adibvafa Adibvafa added the enhancement New feature or request label Oct 7, 2024
@Adibvafa
Copy link
Collaborator Author

Adibvafa commented Oct 7, 2024

@yasamanparhizkar This should be ready for merge when I clean up the scripts we don't need.

@Adibvafa Adibvafa marked this pull request as draft October 7, 2024 20:16
Copy link
Collaborator

@afkanpour afkanpour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @Adibvafa and @yasamanparhizkar)

@yasamanparhizkar
Copy link
Collaborator

@afkanpour the code under openpmcvl/utils is actually not mine. I think these are @yasamanparhizkar 's codes (perhpas i copied them over from previous repo?).

@yasamanparhizkar Do we need them anymore?

We don't need the scripts in openpmcvl/utils. The whole folder can be deleted. When I created the working directory, I intended it to be a place for temporary experimental scripts, and the whole working folder should be deleted in the end.

There's another working folder in openpmcvl/experiment/working. That one should also get completely deleted; all the useful scripts are already moved to other folders.

@yasamanparhizkar
Copy link
Collaborator

openpmcvl/granular/utils/clean_by_format.py line 61 at r2 (raw file):

Previously, afkanpour (Arash) wrote…

These directory paths should not be hard-coded in python scripts. Please pass them via flags and apply this change to all python scripts.

This file can be deleted.

@yasamanparhizkar
Copy link
Collaborator

openpmcvl/granular/utils/dir_or_file.py line 4 at r2 (raw file):

Previously, afkanpour (Arash) wrote…

Same comment about hard-coding dir names.

File can be deleted.

@yasamanparhizkar
Copy link
Collaborator

openpmcvl/granular/utils/train_test_split.py line 123 at r2 (raw file):

Previously, afkanpour (Arash) wrote…

Everything is commented out here. Please fix.

File can be deleted.

@Adibvafa
Copy link
Collaborator Author

@yasamanparhizkar done, removed the utils directory.

@Adibvafa Adibvafa requested a review from afkanpour January 14, 2025 06:04
@Adibvafa
Copy link
Collaborator Author

@afkanpour Please review the README file and let me know if anything is ambigous!
There is a final notebook that I will commit tomorrow, its the very end of pipeline.

Copy link
Collaborator

@afkanpour afkanpour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few _original files in this PR. I assume those are the original files before your changes. Why should we keep them around? Are we keeping them because you suspect you might have made a mistake in your changes? or do you think the originals might be needed at some point in the future? If the latter, please note that this is our pipeline now. So for the sake of cleaner code and less clutter we should not keep the originals.

Reviewed 1 of 8 files at r7, 9 of 11 files at r8, all commit messages.
Reviewable status: 101 of 110 files reviewed, 12 unresolved discussions (waiting on @Adibvafa and @yasamanparhizkar)


openpmcvl/granular/pipeline/align.sh line 0 at r8 (raw file):
For this and other shell scripts add a one line description followed by a sample command on top of the script.


openpmcvl/granular/pipeline/align_original.sh line 0 at r8 (raw file):
Should this be in the PR? How's it different from align.sh? If it's needed, please add a description and a sample command.


openpmcvl/granular/pipeline/align_original.sh line 12 at r8 (raw file):

# Activate the environment
source /h/afallah/light/bin/activate

Please do not use personal paths.

Code quote:

/h/afallah/light/bin/activate

openpmcvl/granular/pipeline/align_original.sh line 15 at r8 (raw file):

# Set the working directory
cd /h/afallah/pmc-data-extraction

ditto

Code quote:

/h/afallah/pmc-data-extraction

openpmcvl/granular/pipeline/align_original.sh line 27 at r8 (raw file):

# Define the root directory
root_dir="/projects/multimodal/datasets/pmc_oa"

There shouldn't be pointers to our directories if we want to open-source this. If there are similar instances elsewhere, please change them all.

Code quote:

/projects/multimodal/datasets/pmc_oa

openpmcvl/granular/pipeline/align_original.sh line 31 at r8 (raw file):

# Define the input and output files
input_file="${root_dir}/pmc_oa.jsonl"
output_file="${root_dir}/pmc_oa_labeled/pmc_oa_aligned_${BEGIN_IDX}_${END_IDX}.jsonl"

Is this directory created in align.py if it doesn't already exist or does the python script throw an error?
If the latter, we should make sure the directory is created before running align.py.

Code quote:

${root_dir}/pmc_oa_labeled/pmc_oa_aligned_${BEGIN_IDX}_${END_IDX}.jsonl

openpmcvl/granular/README.md line 0 at r8 (raw file):
For running each step please provide a sample command.


openpmcvl/granular/pipeline/preprocess_original.sh line 0 at r8 (raw file):
Same comment as above re _original files.


openpmcvl/granular/pipeline/classify_original.sh line 0 at r8 (raw file):
What does "_original" refer to? The original script before your changes? If so, why do we need to keep the originals?

@Adibvafa
Copy link
Collaborator Author

@afkanpour oh sorry the original files werent supposed to be committed, slipped my hand, removed them.
will add sample calls for .sh files.

Copy link
Collaborator Author

@Adibvafa Adibvafa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 92 of 110 files reviewed, 12 unresolved discussions (waiting on @afkanpour and @yasamanparhizkar)


openpmcvl/granular/README.md line at r8 (raw file):

Previously, afkanpour (Arash) wrote…

For running each step please provide a sample command.

Done.


openpmcvl/granular/pipeline/align.sh line at r8 (raw file):

Previously, afkanpour (Arash) wrote…

For this and other shell scripts add a one line description followed by a sample command on top of the script.

Done.


openpmcvl/classification/classifier.sh line at r2 (raw file):

Previously, afkanpour (Arash) wrote…

Do we use this script?

Done.


openpmcvl/granular/utils/train_test_split.py line 123 at r2 (raw file):

Previously, afkanpour (Arash) wrote…

Everything is commented out here. Please fix.

Done.

Copy link
Collaborator Author

@Adibvafa Adibvafa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: 92 of 110 files reviewed, 12 unresolved discussions (waiting on @afkanpour and @yasamanparhizkar)

Copy link
Collaborator Author

@Adibvafa Adibvafa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 92 files at r1, 75 of 91 files at r2, 3 of 3 files at r3, 8 of 10 files at r5, 8 of 8 files at r7, 2 of 11 files at r8, 5 of 5 files at r9, 1 of 6 files at r10, 5 of 5 files at r11, all commit messages.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @afkanpour and @yasamanparhizkar)


openpmcvl/granular/pipeline/align_original.sh line 12 at r8 (raw file):

Previously, afkanpour (Arash) wrote…

Please do not use personal paths.

Done.


openpmcvl/granular/pipeline/align_original.sh line 15 at r8 (raw file):

Previously, afkanpour (Arash) wrote…

ditto

Done.


openpmcvl/granular/pipeline/align_original.sh line 27 at r8 (raw file):

Previously, afkanpour (Arash) wrote…

There shouldn't be pointers to our directories if we want to open-source this. If there are similar instances elsewhere, please change them all.

Done.


openpmcvl/granular/pipeline/align_original.sh line 31 at r8 (raw file):

Previously, afkanpour (Arash) wrote…

Is this directory created in align.py if it doesn't already exist or does the python script throw an error?
If the latter, we should make sure the directory is created before running align.py.

Done.


openpmcvl/granular/pipeline/align_original.sh line at r8 (raw file):

Previously, afkanpour (Arash) wrote…

Should this be in the PR? How's it different from align.sh? If it's needed, please add a description and a sample command.

Done.


openpmcvl/granular/pipeline/classify_original.sh line at r8 (raw file):

Previously, afkanpour (Arash) wrote…

What does "_original" refer to? The original script before your changes? If so, why do we need to keep the originals?

Done.


openpmcvl/granular/pipeline/preprocess_original.sh line at r8 (raw file):

Previously, afkanpour (Arash) wrote…

Same comment as above re _original files.

Done.


openpmcvl/granular/utils/clean_by_format.py line 61 at r2 (raw file):

Previously, yasamanparhizkar (Yasaman P.) wrote…

This file can be deleted.

Done.


openpmcvl/granular/utils/dir_or_file.py line 4 at r2 (raw file):

Previously, yasamanparhizkar (Yasaman P.) wrote…

File can be deleted.

Done.


openpmcvl/granular/utils/train_test_split.py line 123 at r2 (raw file):

Previously, Adibvafa (Adibvafa Fallahpour) wrote…

Done.

Done.

Copy link
Collaborator Author

@Adibvafa Adibvafa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @afkanpour and @yasamanparhizkar)

@Adibvafa Adibvafa requested a review from afkanpour January 14, 2025 16:27
@Adibvafa
Copy link
Collaborator Author

@afkanpour should be good to merge.

Copy link
Collaborator

@afkanpour afkanpour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 14 of 14 files at r12, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @yasamanparhizkar)

@Adibvafa Adibvafa merged commit ae91449 into main Jan 20, 2025
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants