Update README.md #165

Open · wants to merge 1 commit into `main`
4 changes: 2 additions & 2 deletions basic_pitch/data/README.md
@@ -1,11 +1,11 @@
# Data / Training
The code and scripts in this section deal with training Basic Pitch on your own. Scripts in the `datasets` folder let you download and process a selection of the datasets used to train the original model. Each of these download scripts takes the following keyword arguments (a sample invocation is sketched after the list):
-* **--source**: Source directory to download raw data to. It defaults to `$HOME/mir_datasets/{dataset_name}`
+* **--source**: Source directory to download raw data to. It defaults to `$HOME/mir_datasets/{dataset_name}`.
* **--destination**: Directory to write processed data to. It defaults to `$HOME/data/basic_pitch/{dataset_name}`.
* **--runner**: The method used to run the Beam pipeline for processing the dataset. Options are `DirectRunner`, which runs the pipeline in the same process that launches it; `PortableRunner`, which runs the pipeline in a local Docker container; and `DataflowRunner`, which runs the pipeline in a Docker container on Dataflow.
* **--timestamped**: If passed, the dataset will be written to a timestamped directory instead of 'splits'.
* **--batch-size**: Number of examples per tfrecord when partitioning the dataset.
-* **--sdk_container_image**: The Docker container image used to process the data if using `PortableRunner` or `DirectRunner` .
+* **--sdk_container_image**: The Docker container image used to process the data if using `PortableRunner` or `DirectRunner`.
* **--job_endpoint**: The endpoint where the job is running. It defaults to `embed`, which works for `PortableRunner`.
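As a minimal sketch, a local run with the `DirectRunner` might look like the following. The `guitarset.py` script name and the paths are illustrative assumptions; use whichever script in the `datasets` folder matches the dataset you want.

```bash
# Illustrative example only: the script name and paths are assumptions,
# not a confirmed part of the repository layout.
python basic_pitch/data/datasets/guitarset.py \
    --source="$HOME/mir_datasets/guitarset" \
    --destination="$HOME/data/basic_pitch/guitarset" \
    --runner=DirectRunner \
    --batch-size=100
```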

Additional arguments supported by Beam can be passed as well; they will be forwarded to and used by the pipeline. If using `DataflowRunner`, you are required to pass `--temp_location={Path to GCS bucket}`, `--staging_location={Path to GCS bucket}`, `--project={Name of GCP project}`, and `--region={GCP region}`.
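A `DataflowRunner` invocation would add those required Google Cloud arguments. The sketch below is only illustrative: the script name, bucket, project, and region values are placeholders, not values confirmed by the repository.

```bash
# Illustrative Dataflow sketch: script name, bucket, project, and region
# values are placeholders.
python basic_pitch/data/datasets/guitarset.py \
    --runner=DataflowRunner \
    --temp_location=gs://my-bucket/beam-temp \
    --staging_location=gs://my-bucket/beam-staging \
    --project=my-gcp-project \
    --region=us-central1
```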