Conversation

Elkrival commented Aug 28, 2023

This PR updates the script to parse CSV files and make requests to the API. It no longer depends on pymongo to read the database and make the updates.

The PR removes unused code and adds a service that collects the parsed data; once the data is collected, it makes a request to the backend Node API to insert it.

It also adds an api_url configuration option that points to the server route.
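Roughly, the new flow looks like this (an illustrative sketch only; the function name and payload shape here are assumptions, not the PR's exact code):

import csv
import json

import requests


def import_csv(api_url, csv_path):
    # Parse the CSV into a list of row dicts
    with open(csv_path, newline="") as fo:
        rows = list(csv.DictReader(fo))

    # POST the parsed rows to the backend Node API instead of writing to Mongo directly
    r = requests.post(
        api_url + "day",
        headers={"content-type": "application/json"},
        data=json.dumps(rows),
    )
    return r.status_code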

Tests and more tests.

Figma link to workflow. https://www.figma.com/file/EYWZPBDHD25Of0KUkwMksC/import-Workflow?type=whiteboard&node-id=0-1&t=DDCkCP8SeU9rSRy6-0

    headers=headers,
    data=data_file_data,
)
print(r)

I don't think it should print. Maybe a log.info?
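For example (a sketch, assuming logging is already configured in this module):

import logging

# Log the response instead of printing, so it respects the configured log level and handlers
logging.info("import API response: %s", r.status_code)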

    headers=headers,
    data=metadata_file_data,
)
print(r)

Same as above. Now that I think about it, we probably want to return the status or success of r?
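For example (a sketch; the exact return shape is up for discussion):

status = r.status_code
logging.info("metadata request returned %s", status)
# Give the caller something to act on instead of swallowing the result
return status == 200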


def diverge_files(self, path):
    basename = os.path.basename(path)
    is_data_file = self.DATAFILE.match(basename)

What do you think about calling these vars data_file and metadata_file? It's not a boolean expression; I believe it returns a match object or None.
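For example (a sketch of the rename):

# re.match returns a match object (truthy) or None, not a bool
data_file = self.DATAFILE.match(basename)
metadata_file = self.METADATA.match(basename)

if data_file:
    file_extension_to_dict = data_file.groupdict()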

is_metadata_file = self.METADATA.match(basename)

if is_data_file:
    file_extension_to_dict = self.DATAFILE.match(basename).groupdict()

You can use the var declared on line 26 here.
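i.e. something like (a sketch):

if is_data_file:
    # Reuse the match object from above instead of re-running the regex
    file_extension_to_dict = is_data_file.groupdict()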

is_metadata_file = self.METADATA.match(basename)

if is_data_file:
    file_extension_to_dict = self.DATAFILE.match(basename).groupdict()

Suggested change:
- file_extension_to_dict = self.DATAFILE.match(basename).groupdict()
+ data_file_matches = self.DATAFILE.match(basename).groupdict()


    return self.process_data_file(path, file_extension_to_dict)
if is_metadata_file:
    metadata_file_extension = self.METADATA.match(basename).groupdict()

Suggested change:
- metadata_file_extension = self.METADATA.match(basename).groupdict()
+ metadata_file_matches = self.METADATA.match(basename).groupdict()

    self.data_file_list = data_file_list
    self.metadata_file_list = metadata_file_list

def diverge_files(self, path):

Suggested change:
- def diverge_files(self, path):
+ def process_file(self, path):

@ngmaloney

@Elkrival Is there an example config containing the new URL directive pointing to the API endpoint? I couldn't find it.

Elkrival force-pushed the 381-update-import-script-to-json-payload branch from 17a12ba to 7858a39 on October 9, 2023 at 16:25

Elkrival commented Oct 9, 2023

@ngmaloney: commit 733d948 updates the example config.
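For reference, the addition looks roughly like this in the YAML config (the value is illustrative; see 733d948 for the actual example config):

# Base URL of the DPdash Node API; the script appends routes such as "day" and "metadata"
api_url: http://localhost:8000/api/v1/import/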

Comment on lines 19 to 24
parser = ap.ArgumentParser()
parser.add_argument('-c', '--config')
parser.add_argument('-d', '--dbname', default='dpdata')
parser.add_argument('-v', '--verbose', action='store_true')
parser.add_argument('expr')
args = parser.parse_args()

👍 I'm a fan of the argparse library.

👍

Comment on lines 39 to 58
# dirname = os.path.dirname(f)
# basename = os.path.basename(f)
# # probe for dpdash-compatibility and gather information
# probe = dpimport.probe(f)
# if not probe:
# logger.debug('document is unknown %s', basename)
# continue
# # nothing to be done
# if db.exists(probe):
# logger.info('document exists and is up to date %s', probe['path'])
# continue
# logger.info('document does not exist or is out of date %s', probe['path'])
# # import the file
# logger.info('importing file %s', f)
# dppylib.import_file(db.db, probe)

# logger.info('cleaning metadata')
# lastday = get_lastday(db.db)
# if lastday:
# clean_metadata(db.db, lastday)

Suggested change
# dirname = os.path.dirname(f)
# basename = os.path.basename(f)
# # probe for dpdash-compatibility and gather information
# probe = dpimport.probe(f)
# if not probe:
# logger.debug('document is unknown %s', basename)
# continue
# # nothing to be done
# if db.exists(probe):
# logger.info('document exists and is up to date %s', probe['path'])
# continue
# logger.info('document does not exist or is out of date %s', probe['path'])
# # import the file
# logger.info('importing file %s', f)
# dppylib.import_file(db.db, probe)
# logger.info('cleaning metadata')
# lastday = get_lastday(db.db)
# if lastday:
# clean_metadata(db.db, lastday)


DPimport is a command line tool for importing files into DPdash using a
-simple [`glob`](https://en.wikipedia.org/wiki/Glob_(programming)) expression.
+simple [`glob`](<https://en.wikipedia.org/wiki/Glob_(programming)>) expression.

Why does this need the <>s? Is it because of the parens in the url?

There were pushes to main, so maybe this was done for a reason; I'm not sure why.




## MongoDB

Is it worth noting something like "This used to require Mongo but doesn't anymore"?

The app still uses Mongo, but we don't write to it directly anymore.

logging.basicConfig(level=level)

with open(os.path.expanduser(args.config), 'r') as fo:
    config = yaml.load(fo, Loader=yaml.SafeLoader)

This config isn't used anywhere, is it?

Yes, it's used in import.py within the scripts directory.

Comment on lines 66 to 68
studies[subject['_id']['study']] = {}
studies[subject['_id']['study']]['subject'] = []
studies[subject['_id']['study']]['max_day'] = 0

subject['_id']['study'] is referenced a lot in this loop. Feels like it would be worth it to give it a clear name and to reduce noise on any given line.
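For example (a sketch, equivalent to the three lines above):

# Name the nested key once to cut repetition inside the loop
study = subject['_id']['study']
studies[study] = {'subject': [], 'max_day': 0}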

{
    '_id': True,
    'collection': True,
    'synced': True

Does this line request that the sync status be part of the result set?
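(For what it's worth: if this dict is used as a pymongo projection, then a truthy value does ask for that field in the result set; the collection name below is just illustrative.)

# Sketch: in a pymongo projection, True (or 1) means "include this field" in each returned document
docs = db.some_collection.find({}, {'_id': True, 'collection': True, 'synced': True})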

Not sure, but this code is not part of the work; see 655c34d.

Comment on lines 84 to 85
if doc['synced'] is False and 'collection' in doc:
    db[doc['collection']].drop()

Why does it look like you're dropping the whole collection?

It's not part of the original work; it's been removed in 655c34d.

subject_metadata['days'] = subject['days']
subject_metadata['study'] = subject['_id']['study']

studies[subject['_id']['study']]['max_day'] = studies[subject['_id']['study']]['max_day'] if (studies[subject['_id']['study']]['max_day'] >= subject['days'] ) else subject['days']

This feels like too much for one line
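For example (a sketch with equivalent behaviour):

study = studies[subject['_id']['study']]
# Keep the running maximum of days seen for this study
study['max_day'] = max(study['max_day'], subject['days'])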

This is outdated; it's been removed in 655c34d.

Comment on lines 5 to 36
def create_data_file(api_url, data_file_data):
    request_url = api_url + "day"
    headers = {"content-type": "application/json"}
    r = requests.post(
        request_url,
        headers=headers,
        data=data_file_data,
    )
    status = r.status_code
    if status != 200:
        response = r.json()["message"]
        logging.info(response)
    else:
        response = r.json()["data"]
        logging.info(response)


def create_metadata_file(api_url, metadata_file_data):
    request_url = api_url + "metadata"
    headers = {"content-type": "application/json"}
    r = requests.post(
        request_url,
        headers=headers,
        data=metadata_file_data,
    )
    status = r.status_code
    if status != 200:
        response = r.json()["message"]
        logging.info(response)
    else:
        response = r.json()["data"]
        logging.info(response)

Unless I'm missing something, these look identical aside from the URL's postfix. Could you move this functionality into a common function that both create_data_file and create_metadata_file call?
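Something like this, perhaps (a sketch; the helper name is just a suggestion):

def post_import_payload(api_url, route, payload):
    # Shared POST logic for both the data and metadata routes
    r = requests.post(
        api_url + route,
        headers={"content-type": "application/json"},
        data=payload,
    )
    if r.status_code != 200:
        logging.info(r.json()["message"])
    else:
        logging.info(r.json()["data"])
    return r.status_code


def create_data_file(api_url, data_file_data):
    return post_import_payload(api_url, "day", data_file_data)


def create_metadata_file(api_url, metadata_file_data):
    return post_import_payload(api_url, "metadata", metadata_file_data)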

* Added a service that parses the data CSV and metadata CSV
* Using the new data structures, the service parses the data and creates JSON
* Added tests to the service

Updated import script for v2

* Removed all files that parse mongodb and removed dependency
* Added service that handles incoming csv data for metadata and
  participant csv data
* Service also has a method to convert data to json
* Added api request with a configuration for url

Add requests

* Added requests dependency

Updates to script to add hash collection

* Added hash collection
Elkrival force-pushed the 381-update-import-script-to-json-payload branch from 655c34d to 9fdfcc4 on October 11, 2023 at 17:33
Load Test CSV

* This PR adds a generate_test_csv script to generate import files that stress test the API endpoint
* It uses parallelism to write the files and is configurable via the YAML file

* Added hash collection
* Converted api functions to import service
* Added api auth keys to service

* Subject ID and uppercase Study keys needed to be updated for import
* updated key in metadata
* Updated tests
* Handle infinity and nan values
* If compute error, import as string values
* String values
* Fix nan
* Updates to import script
* Add assessment, study and subject properties to data
* Update Tests
* Added route to restart metadata collection
* Updates to import
* Updates for new data migration structure
* Updated tests
* Test consent and synced
* extract variables from assessment
@mikestone14

@Elkrival what's the status of this PR? Does it need to be rebased/merged or closed?

elmantis added 2 commits April 9, 2024 08:22
* removed print
* added variables as dictionaries
elmantis and others added 7 commits April 9, 2024 09:34
* PR comments
* print structure
When connecting with local/self-signed/dev environments it is sometimes necessary to bypass SSL verification.
This commit adds a config option to do so but it is not the default nor is it considered secure/best-practice.
This should also speed up the import API.
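A rough sketch of how the SSL option could be threaded through (the option name verify_ssl is an assumption; check the commit for the real one):

# Hypothetical config key; defaults to secure behaviour (verify certificates)
verify_ssl = config.get("verify_ssl", True)

r = requests.post(
    request_url,
    headers=headers,
    data=payload,
    # Only disable verification for local/self-signed dev environments
    verify=verify_ssl,
)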