evanbiederstedt

No description provided.

evanbiederstedt and others added 30 commits August 11, 2018 21:37
@evanbiederstedt
Author

Comments:

RE: C API, FORTRAN, and wext/setup.py

I wrote this to be both 2.7 and 3.x compatible. The C API changed a good deal in Python3.x, but it's quite slick. Hopefully these changes make sense.

I didn't touch the Fortran code, but I did revise wext/setup.py to compile both the C and Fortran code into separate modules. It appears to work on Travis, as well as the Linux server.

I also recommend that users simply install the package, so that the modules are individually importable.

RE: README

The Travis-CI config will have to be activated on your end. Otherwise, the badge will fail.

I wish the README contained a bit more motivation and detail about WExT, and why it should be used.

RE: py23 compatibility changes

--- Any '.iteritems()' became '.items()'.

--- We were previously using implicit relative imports. Python 3 no longer supports these, so they need to be explicit relative imports, e.g.

https://github.com/evanbiederstedt/wext/blob/master/wext/exclusivity_tests.py#L5-L6

from .exact import exact_test

--- This needs to explicitly be a tuple() for Python3.x

https://github.com/evanbiederstedt/wext/blob/master/compute_mutation_probabilities.py#L49

--- The most interesting change is adding 'xrange()' from future, which is now a required dependency.

https://github.com/evanbiederstedt/wext/blob/master/compute_mutation_probabilities.py#L12
https://github.com/evanbiederstedt/wext/blob/master/wext/mcmc.py#L7

Normally, I would simply convert this to range(). However, this would be a major issue for Python2.7 users in this case, as the 2.7 version would take a major hit in terms of performance:

https://github.com/raphael-group/wext/blob/master/compute_mutation_probabilities.py#L109

e.g. range(1, 2*10**9) builds the full list in memory on Python 2.7 and will take minutes

It appears the way around this is to use xrange() for both versions 2 and 3, which is what I've done.
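
To illustrate the point about lazy ranges, here is a minimal sketch using only the standard library. The fallback shim and the name lazy_range are hypothetical and are not the approach taken in this pull request (which imports xrange from the future package); the value of num_permutations is made up.

```python
import random

# Hypothetical shim, shown only for illustration: use xrange where it exists
# (Python 2) and fall back to the built-in range on Python 3.
try:
    lazy_range = xrange  # Python 2: lazy, no list is built
except NameError:
    lazy_range = range   # Python 3: range is already lazy

num_permutations = 5  # assumed value

# Sampling from a lazy range avoids materializing roughly 2 billion integers.
seeds = random.sample(lazy_range(1, 2 * 10**9), num_permutations)
print(seeds)
```

Either way, random.sample only needs a sequence with a length and indexing, which both xrange and the Python 3 range provide.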

@zmiimz

zmiimz commented Dec 3, 2018

thanks!
m.b. add future into requirements?

  • simple example works without error but pancan stops with

examples/generate_data.py", line 31, in generate_pancan_data
samples1 = [ 'd1-%s' % (i+1) for i in range(N_1)]
TypeError: 'float' object cannot be interpreted as an integer
Makefile:79: recipe for target '... examples/pancan/data/dataset%-aberrations.tsv' failed
'float' object cannot be interpreted as an integer

seems that all N/ divisions should be replaced with the int division operator // in examples/generate_data.py

  • also in process_mutations.py
    in process_maf function for arr

AttributeError: 'map' object has no attribute 'index'

possible solution?..

arr = list(arr)  # materialize the map object once; an exhausted iterator would make the later `in` tests fail
patient_index = arr.index('tumor_sample_barcode')
var_class_index = arr.index('variant_classification') if 'variant_classification' in arr else None
var_type_index = arr.index('variant_type') if 'variant_type' in arr else None
val_status_index = arr.index('validation_status') if 'validation_status' in arr else None

  • running experiments/eccb2016 in find_exclusive_sets.py

File "find_exclusive_sets.py", line 173, in run
print('* Using {} permuted matrix files'.format(len(permuted_files)))
TypeError: object of type 'zip' has no len()

fixed in def get_permuted_files(permuted_matrix_directories, num_permutations):

return list(zip(*permuted_directory_files))

etc.
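
The three Python 3 pitfalls reported above (true division, map objects, zip objects) can be reproduced and avoided in a few lines. This is a sketch under stated assumptions: the sample count, the header line, and the file names are made up, and only the variable names are borrowed from the tracebacks.

```python
# 1. True division returns a float on Python 3, so range(N / 2) fails;
#    floor division keeps the result an integer on both versions.
N = 10  # assumed sample count, for illustration only
N_1 = N // 2
samples1 = ['d1-%s' % (i + 1) for i in range(N_1)]

# 2. map() returns an iterator on Python 3; convert it to a list once,
#    then use .index() and `in` on the list.
header = 'Tumor_Sample_Barcode\tVariant_Classification'  # made-up header line
arr = list(map(str.lower, header.split('\t')))
patient_index = arr.index('tumor_sample_barcode')
var_type_index = arr.index('variant_type') if 'variant_type' in arr else None

# 3. zip() is also an iterator on Python 3 and has no len(); wrap it in list().
permuted_directory_files = [['a1.json', 'a2.json'], ['b1.json', 'b2.json']]  # made-up names
permuted_files = list(zip(*permuted_directory_files))
print('* Using {} permuted matrix files'.format(len(permuted_files)))
```

Converting the map object once matters: calling list(arr) on the same map object a second time, or testing membership after it has been consumed, sees an empty iterator.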

@evanbiederstedt
Author

evanbiederstedt commented Dec 3, 2018

Hi @zmiimz

Thanks for the comments!

m.b. add future into requirements?

It is true that we should update the README if this pull request is accepted. Currently, it looks like I have future >= 0.16.0 in the requirements file. It might be easier to clone this repo: https://github.com/evanbiederstedt/wext

simple example works without error but pancan stops with

Could you show me which commands you used to get this error? I just tried with Python 2.7.15 and Python 3.7.1, and I wasn't able to re-create this. I just followed the steps in the README, i.e.

virtualenv venv
source venv/bin/activate
pip install -r requirements.txt

cd wext
python setup.py install

For all of the issues you discuss, could you please provide the version of python you used, and the command you used? That would be helpful moving forward---it's a bit difficult to follow what you've done so far.

Thanks, Evan

@zmiimz

zmiimz commented Dec 3, 2018

Could you show me which commands you used to get this error?

I started the examples and the eccb2016 experiment calculation.

@evanbiederstedt
Author

Hi @zmiimz

Please provide as many details as possible so I can proceed to investigate the errors you report, i.e. the python version, the version of gcc, how you installed the package, and the commands you are running.

I'm running this on Mac OS 10.13.6

Versions of python and gcc

$ python --version
Python 3.7.1
$ gcc --version
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 10.0.0 (clang-1000.10.44.4)
Target: x86_64-apple-darwin17.7.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
$ 

I've attached a text file of the commands I use for installation and running examples/pancan. I haven't run into your errors, so I will need more information about the versions of tools you are using, as well as the commands you have run.

wext_example_pancan.txt

Thanks, Evan

@zmiimz

zmiimz commented Dec 7, 2018

Dear Evan,
I repeated the installation and test run of WExT on a freshly installed Ubuntu 18.04 system, and everything works as expected there. (I am still waiting for the eccb2016 run to finish, but the other tests, simple and pancan, finished, and that is enough for me for now.) The errors I mentioned were observed on another (older) Linux system with Python 2.7.13, but I still have to verify this assumption (in a few days).

@evanbiederstedt
Author

Hi @zmiimz

Thanks for the response. Please keep me updated.

I'm going to try out the eccb2016 run on my end as well. I'll post results.

Thanks, Evan

Contributor

@matthewreyna matthewreyna left a comment

Thank you for all of these changes. I tested the simple example, and the code successfully runs and returns equivalent results on Python 2 and 3. I made several comments but all for minor issues.

  1. The package future is needed for both Python 2 and 3.
  2. Most Python 2/3 functions work fine with iterators whether they are lists (like range(n) in Python 2) or generators (like range(n) in Python 3) -- len is a notable exception for some reason. I flagged some of the extra list() calls because they make the code a little harder to read and slow it down at times, but you can either leave them as-is or change them as needed.
  3. print(a, b, c) prints a tuple (a, b, c) in Python 2. Maybe replace with print(a + b + c) so that Python 2 and 3 return the same output.

this_dir = os.path.dirname(os.path.realpath(__file__))
sys.path.append(this_dir)
from wext import *
from past.builtins import xrange
Contributor

I don't think we need xrange.

Author

I think my sole motivation here was consistency :)

Above I detail the memory/performance issue associated with range() vs xrange() between Python 2.x and 3.x. So, I may have simply started using this everywhere to avoid any potential issues. (A downside to not taking the time to unit test this is that I didn't think deeply about each use of range(), unless it became obvious it was a problem.)

Contributor

That's fine, and the use of xrange in

seeds = random.sample(xrange(1, 2*10**9), args.num_permutations)

is the only really important one.

Author

Yes, I agree with this.

Contributor

We can keep xrange here and other places -- not a big issue.

#!/usr/bin/env python

import numpy as np
from past.builtins import xrange
Contributor

Not needed.

Author

You may be entirely correct, and I'm happy to accept this.

See previous comments on range() vs. xrange() and consistency.

Contributor

Not a big issue. We can leave as-is.

print '-' * 14, 'Correlation: WRE (Saddlepoint) and WRE (Recursive)', '-' * 14
print 'All: \\rho={:.5}, P={:.5}'.format(*all_correlation)
print '\Phi_WR < 10^-4: \\rho={:.5}, P={:.5}'.format(*tail_correlation)
print('-' * 14, 'Correlation: WRE (Saddlepoint) and WRE (Recursive)', '-' * 14)
Contributor

In Python 2, print(a, b, c) prints (a, b, c). Replace commas with plus signs?

Author

Replace commas with plus signs?

Good catch. Yes, let's use plus signs.
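
One caveat with the plus-sign approach is that the commas were supplying the separating spaces. A minimal sketch, with a shortened message as a stand-in for the real one: build the whole line as a single string, which prints identically on both versions. Another standard option is to put `from __future__ import print_function` at the top of each module, which makes print(a, b, c) behave like Python 3 everywhere.

```python
a, b, c = '-' * 14, 'Correlation', '-' * 14  # shortened stand-in message

# Concatenating with '+' drops the spaces that the commas used to supply,
# so they have to be added back explicitly.
line = a + ' ' + b + ' ' + c

# A single string argument prints identically on Python 2 and Python 3.
print(line)
```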

geneToLengthRank.update(zip(geneToLength.keys(), length_ranks))
threshold_gene = sorted(geneToLength.keys(), key=lambda g: geneToLengthRank[g])[args.length_threshold]
print 'Length of {} longest gene: {}'.format(args.length_threshold, geneToLength[threshold_gene])
geneToLengthRank.update(list(zip(list(geneToLength.keys()), length_ranks)))
Contributor

Both lists unneeded here.

Author

I missed this; you are correct, these are not necessary.
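
For reference, a small sketch of why the list() wrappers can go: dict.update() accepts any iterable of (key, value) pairs, including a zip object or an items() view. The gene names, lengths, and p-value below are made up, and the ranking logic is a simplified stand-in for the real script.

```python
# Made-up gene lengths, for illustration only.
geneToLength = {'GENE_A': 300, 'GENE_B': 100, 'GENE_C': 200}

genes = sorted(geneToLength.keys())
lengths_desc = sorted(geneToLength.values(), reverse=True)
# Rank 0 is the longest gene.
length_ranks = [lengths_desc.index(geneToLength[g]) for g in genes]

geneToLengthRank = {}
# The zip object can be passed directly; no list() is required.
geneToLengthRank.update(zip(genes, length_ranks))

# Updating from another dict's items() works the same way.
setToPval = {}
pval = {frozenset(['GENE_A', 'GENE_B']): 0.01}
setToPval.update(pval.items())
```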

setToPval.update(pval.items())
setToTime.update(time.items())
setToObs.update(obs.items())
setToPval.update(list(pval.items()))
Contributor

Nice use of dict.update(iterator), but no list is needed here.

Author

agree

for M, pval in setToPval.items():
if setToFDR[M]<=fdr_threshold:
X, T, Z, tbl = setToObs[M]
row = [ ', '.join(sorted(M)), pval, setToFDR[M], setToRuntime[M], T, Z ] + tbl
Contributor

Line 66 raises an error if rows = []. Between lines 65 and 66, maybe add if rows: and indent lines 66-73 so that nothing is output when there are no results.

Author

Ah, I see what you mean. Yes, I'll correct this.

Latest tested version in parentheses.
[![Build Status](https://api.travis-ci.org/raphael-group/wext.svg?branch=master)](https://travis-ci.org/raphael-group/wext?branch=master)

1. Python (2.7.9)
Contributor

future is also currently needed.

Author

See comments above on the README; I had assumed that requirements.txt addressed all Python dependencies.

Something like pip install -r requirements.txt should address concerns of all users, I think. (Could be wrong, happy to change.)

Contributor

@matthewreyna matthewreyna Jan 8, 2019

Yes, but not everyone will think to check requirements.txt or run pip install -r requirements.txt. Others may think, justifiably, that running python setup.py install successfully means that everything is ready to go. For a real-life example, it crashed for me because I tried it in a new virtual machine with the usual dependencies but hadn't needed future yet.

If we add it to the README, then we'll save a few emails, GitHub issues, and StackOverflow searches, which is worth it for everyone.

Author

If we add it to the README, then we'll save a few emails, GitHub issues, and StackOverflow searches, which is worth it for everyone.

Absolutely, and I agree with this as a philosophy whole-heartedly. If there are any GitHub issues, it's 99% the fault of the developer, even if the problem is one of presentation.

Yes, but not everyone will think to check requirements.txt or run pip install -r requirements.txt.

So, the motivation above wasn't to remove this information. Rather, I'm trying to cater to the lazy user (myself included), which reduces GitHub issues, e-mails, SO questions, etc.

When I look at new tools in bioinformatics, I want to see a succinct description of the algorithm/code within seconds in the README. This is needed for WExT, at the top of the README. In the current README, requirements are given first. The requirements should be made more comprehensive (e.g. by adding the required Python libraries, as detailed in requirements.txt), and I think they shouldn't be the first thing seen in a README. We need to aim for both succinct and comprehensive :)

The rationale behind this is presentation for new users (as an invitation to use the code), as well as preventing unnecessary Github issues/e-mails.

Contributor

Good idea. Let's move requirements lower in the README if they are too long.

@matthewreyna
Contributor

LGTM -- thanks for all of the changes! Let's fix the print functions so that the output looks right, run the simple example again to test for new errors, and merge.

3 participants