Conversation

Elkrival commented Aug 28, 2023

This PR updates the script to parse CSV files and make requests to the API. It no longer depends on pymongo to read the database and make the updates.

The PR removes unused code and adds a service that collects the parsed data; once the data is collected, it makes a request to the backend Node API to insert it.

It also adds an api_url configuration option that points to the server route.
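Roughly, the new flow looks like this (an illustrative sketch only; the function name and payload shape here are assumptions, not the PR's exact code):

import csv
import json

import requests


def import_csv(api_url, csv_path):
    # Parse the CSV into a list of row dicts
    with open(csv_path, newline="") as fo:
        rows = list(csv.DictReader(fo))

    # POST the parsed rows to the backend Node API instead of writing to Mongo directly
    r = requests.post(
        api_url + "day",
        headers={"content-type": "application/json"},
        data=json.dumps(rows),
    )
    return r.status_code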

Tests and more tests.

Figma link to workflow. https://www.figma.com/file/EYWZPBDHD25Of0KUkwMksC/import-Workflow?type=whiteboard&node-id=0-1&t=DDCkCP8SeU9rSRy6-0

    headers=headers,
    data=data_file_data,
)
print(r)

I don't think it should print. Maybe a log.info?
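For example (a sketch, assuming logging is already configured in this module):

import logging

# Log the response instead of printing, so it respects the configured log level and handlers
logging.info("import API response: %s", r.status_code)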

    headers=headers,
    data=metadata_file_data,
)
print(r)

Same as above. Now that I think about it, we probably want to return the status or success of r?
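For example (a sketch; the exact return shape is up for discussion):

status = r.status_code
logging.info("metadata request returned %s", status)
# Give the caller something to act on instead of swallowing the result
return status == 200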


def diverge_files(self, path):
    basename = os.path.basename(path)
    is_data_file = self.DATAFILE.match(basename)

What do you think about calling these vars data_file and metadata_file? It's not a boolean expression; I believe it returns a match object or None.
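For example (a sketch of the rename):

# re.match returns a match object (truthy) or None, not a bool
data_file = self.DATAFILE.match(basename)
metadata_file = self.METADATA.match(basename)

if data_file:
    file_extension_to_dict = data_file.groupdict()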

is_metadata_file = self.METADATA.match(basename)

if is_data_file:
    file_extension_to_dict = self.DATAFILE.match(basename).groupdict()

You can use the var declared on line 26 here.
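i.e. something like (a sketch):

if is_data_file:
    # Reuse the match object from above instead of re-running the regex
    file_extension_to_dict = is_data_file.groupdict()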

is_metadata_file = self.METADATA.match(basename)

if is_data_file:
    file_extension_to_dict = self.DATAFILE.match(basename).groupdict()

Suggested change:
- file_extension_to_dict = self.DATAFILE.match(basename).groupdict()
+ data_file_matches = self.DATAFILE.match(basename).groupdict()


    return self.process_data_file(path, file_extension_to_dict)
if is_metadata_file:
    metadata_file_extension = self.METADATA.match(basename).groupdict()

Suggested change:
- metadata_file_extension = self.METADATA.match(basename).groupdict()
+ metadata_file_matches = self.METADATA.match(basename).groupdict()

    self.data_file_list = data_file_list
    self.metadata_file_list = metadata_file_list

def diverge_files(self, path):

Suggested change:
- def diverge_files(self, path):
+ def process_file(self, path):

@ngmaloney

@Elkrival Is there an example config containing the new URL directive pointing to the API endpoint? I couldn't find it.

Elkrival force-pushed the 381-update-import-script-to-json-payload branch from 17a12ba to 7858a39 on October 9, 2023 at 16:25

Elkrival commented Oct 9, 2023

@ngmaloney: commit 733d948 updates the example config.
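For reference, the addition looks roughly like this in the YAML config (the value is illustrative; see 733d948 for the actual example config):

# Base URL of the DPdash Node API; the script appends routes such as "day" and "metadata"
api_url: http://localhost:8000/api/v1/import/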

Comment on lines 19 to 24
parser = ap.ArgumentParser()
parser.add_argument('-c', '--config')
parser.add_argument('-d', '--dbname', default='dpdata')
parser.add_argument('-v', '--verbose', action='store_true')
parser.add_argument('expr')
args = parser.parse_args()

👍 I'm a fan of the argparse library.

👍

Comment on lines 39 to 58
# dirname = os.path.dirname(f)
# basename = os.path.basename(f)
# # probe for dpdash-compatibility and gather information
# probe = dpimport.probe(f)
# if not probe:
# logger.debug('document is unknown %s', basename)
# continue
# # nothing to be done
# if db.exists(probe):
# logger.info('document exists and is up to date %s', probe['path'])
# continue
# logger.info('document does not exist or is out of date %s', probe['path'])
# # import the file
# logger.info('importing file %s', f)
# dppylib.import_file(db.db, probe)

# logger.info('cleaning metadata')
# lastday = get_lastday(db.db)
# if lastday:
# clean_metadata(db.db, lastday)

Suggested change
# dirname = os.path.dirname(f)
# basename = os.path.basename(f)
# # probe for dpdash-compatibility and gather information
# probe = dpimport.probe(f)
# if not probe:
# logger.debug('document is unknown %s', basename)
# continue
# # nothing to be done
# if db.exists(probe):
# logger.info('document exists and is up to date %s', probe['path'])
# continue
# logger.info('document does not exist or is out of date %s', probe['path'])
# # import the file
# logger.info('importing file %s', f)
# dppylib.import_file(db.db, probe)
# logger.info('cleaning metadata')
# lastday = get_lastday(db.db)
# if lastday:
# clean_metadata(db.db, lastday)


DPimport is a command line tool for importing files into DPdash using a
-simple [`glob`](https://en.wikipedia.org/wiki/Glob_(programming)) expression.
+simple [`glob`](<https://en.wikipedia.org/wiki/Glob_(programming)>) expression.

Why does this need the <>s? Is it because of the parens in the url?

There were pushes to main, so maybe this was done for a reason; I'm not sure why.




## MongoDB

Is it worth noting something like "This used to require Mongo but doesn't anymore"?

The app still uses Mongo, but we don't write to it directly anymore.

logging.basicConfig(level=level)

with open(os.path.expanduser(args.config), 'r') as fo:
    config = yaml.load(fo, Loader=yaml.SafeLoader)

This config isn't used anywhere, is it?

Yes, it's used in import.py within the scripts directory.

Comment on lines 66 to 68
studies[subject['_id']['study']] = {}
studies[subject['_id']['study']]['subject'] = []
studies[subject['_id']['study']]['max_day'] = 0

subject['_id']['study'] is referenced a lot in this loop. Feels like it would be worth it to give it a clear name and to reduce noise on any given line.
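For example (a sketch, equivalent to the three lines above):

# Name the nested key once to cut repetition inside the loop
study = subject['_id']['study']
studies[study] = {'subject': [], 'max_day': 0}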

{
    '_id': True,
    'collection': True,
    'synced': True

Does this line request that the sync status be part of the result set?
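(For what it's worth: if this dict is used as a pymongo projection, then a truthy value does ask for that field in the result set; the collection name below is just illustrative.)

# Sketch: in a pymongo projection, True (or 1) means "include this field" in each returned document
docs = db.some_collection.find({}, {'_id': True, 'collection': True, 'synced': True})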

Not sure, but this code is not part of the work; see 655c34d.

Comment on lines 84 to 85
if doc['synced'] is False and 'collection' in doc:
    db[doc['collection']].drop()

Why does it look like you're dropping the whole collection?

It's not part of the original work; it's been removed in 655c34d.

subject_metadata['days'] = subject['days']
subject_metadata['study'] = subject['_id']['study']

studies[subject['_id']['study']]['max_day'] = studies[subject['_id']['study']]['max_day'] if (studies[subject['_id']['study']]['max_day'] >= subject['days'] ) else subject['days']

This feels like too much for one line
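For example (a sketch with equivalent behaviour):

study = studies[subject['_id']['study']]
# Keep the running maximum of days seen for this study
study['max_day'] = max(study['max_day'], subject['days'])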

This is outdated; it's been removed in 655c34d.

Comment on lines 5 to 36
def create_data_file(api_url, data_file_data):
    request_url = api_url + "day"
    headers = {"content-type": "application/json"}
    r = requests.post(
        request_url,
        headers=headers,
        data=data_file_data,
    )
    status = r.status_code
    if status != 200:
        response = r.json()["message"]
        logging.info(response)
    else:
        response = r.json()["data"]
        logging.info(response)


def create_metadata_file(api_url, metadata_file_data):
    request_url = api_url + "metadata"
    headers = {"content-type": "application/json"}
    r = requests.post(
        request_url,
        headers=headers,
        data=metadata_file_data,
    )
    status = r.status_code
    if status != 200:
        response = r.json()["message"]
        logging.info(response)
    else:
        response = r.json()["data"]
        logging.info(response)

Unless I'm missing something, these look identical aside from the URL's postfix. Could you move this functionality into a common function that both create_data_file and create_metadata_file call?
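Something like this, perhaps (a sketch; the helper name is just a suggestion):

def post_import_payload(api_url, route, payload):
    # Shared POST logic for both the data and metadata routes
    r = requests.post(
        api_url + route,
        headers={"content-type": "application/json"},
        data=payload,
    )
    if r.status_code != 200:
        logging.info(r.json()["message"])
    else:
        logging.info(r.json()["data"])
    return r.status_code


def create_data_file(api_url, data_file_data):
    return post_import_payload(api_url, "day", data_file_data)


def create_metadata_file(api_url, metadata_file_data):
    return post_import_payload(api_url, "metadata", metadata_file_data)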

* Added a service that parses the data CSV and metadata CSV
* Using the new data structures, the service parses the data and creates JSON
* Added tests to the service

Updated import script for v2

* Removed all files that parse mongodb and removed dependency
* Added service that handles incoming csv data for metadata and
  participant csv data
* Service also has a method to convert data to json
* Added api request with a configuration for url

Add requests

* Added requests dependency

Updates to script to add hash collection

* Added hash collection
Elkrival force-pushed the 381-update-import-script-to-json-payload branch from 655c34d to 9fdfcc4 on October 11, 2023 at 17:33
Load Test CSV

* This PR adds a generate_test_csv script to generate import files that stress test the API endpoint
* It uses parallelism to write the files and is configurable via the YAML file

* Added hash collection
* Converted api functions to import service
* Added api auth keys to service

* Subject ID and uppercase Study keys needed to be updated for import
* updated key in metadata
* Updated tests
* Handle infinity and nan values
* If compute error, import as string values
* String values
* Fix nan
* Updates to import script
* Add assessment, study and subject properties to data
* Update Tests
* Added route to restart metadata collection
* Updates to import
* Updates for new data migration structure
* Updated tests
* Test consent and synced
* extract variables from assessment
@mikestone14

@Elkrival what's the status of this PR? Does it need to be rebased/merged or closed?

elmantis added 2 commits April 9, 2024 08:22
* removed print
* added variables as dictionaries
elmantis and others added 7 commits April 9, 2024 09:34
* PR comments
* print structure
When connecting with local/self-signed/dev environments it is sometimes necessary to bypass SSL verification.
This commit adds a config option to do so but it is not the default nor is it considered secure/best-practice.
This should also speed up the import API.
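A rough sketch of how the SSL option could be threaded through (the option name verify_ssl is an assumption; check the commit for the real one):

# Hypothetical config key; defaults to secure behaviour (verify certificates)
verify_ssl = config.get("verify_ssl", True)

r = requests.post(
    request_url,
    headers=headers,
    data=payload,
    # Only disable verification for local/self-signed dev environments
    verify=verify_ssl,
)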