Merged
4 changes: 2 additions & 2 deletions .travis.yml
@@ -61,8 +61,8 @@ install:
- wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/$CUDA_REPO_PKG
- sudo dpkg -i $CUDA_REPO_PKG
- rm ${CUDA_REPO_PKG}
-  - sudo apt-get -y update
-  - sudo apt-get install -y --no-install-recommends cuda-core-$CUDA_PKG_VERSION cuda-cudart-dev-$CUDA_PKG_VERSION cuda-cublas-dev-$CUDA_PKG_VERSION cuda-curand-dev-$CUDA_PKG_VERSION
+  - travis_retry sudo apt-get -y update
+  - travis_retry sudo apt-get install -y --no-install-recommends cuda-core-$CUDA_PKG_VERSION cuda-cudart-dev-$CUDA_PKG_VERSION cuda-cublas-dev-$CUDA_PKG_VERSION cuda-curand-dev-$CUDA_PKG_VERSION
- sudo ln -s /usr/local/cuda-${CUDA_VERSION_MAJOR}.${CUDA_VERSION_MINOR} /usr/local/cuda

before_script:
4 changes: 2 additions & 2 deletions src/popsift/common/warp_bitonic_sort.h
@@ -66,8 +66,8 @@ class Warp32
: ( my_val < other_val );
const bool must_swap = not ( my_more ^ reverse ^ increasing );

- return ( must_swap ? popsift::shuffle_xor( my_index, 1 << shift )
-                    : my_index );
+ int lane = must_swap ? ( 1 << shift ) : 0;
Member

const int lane

Member Author

Can we make a new PR for CMake and the const?
I'm abroad and cannot test anything, and I believe getting the RTX fix merged is a good idea.
The APPEND needs a bit of testing because some of the CCs (compute capabilities) won't work. 32, 53, 62 and 72 are CCs specific to the Tegra, which is an ARM-based system-on-a-chip and cannot do everything that I've used in PopSift. CC 52, on the other hand, should work.

Member

Oh yes, I just listed the CCs based on what CUDA supports; I had no idea which ones should or should not be included. It was more a proposal for a different way to add the relevant CCs.

Just out of curiosity: the tests I ran were on an RTX 2080 without its proper CC (7.5) set. Do you think there is a significant difference when running code built for a CC other than the card's own? (i.e. should I expect it to run even faster? :-) )

I'm not in a hurry to merge it; we can keep it here and merge it later.

+ return popsift::shuffle_xor( my_index, lane );
}

__device__ inline
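Note on the hunk above: the change replaces a call to `popsift::shuffle_xor` that only swapping lanes executed with a call that every lane executes, using an XOR mask of 0 for lanes that keep their own value. The following is a minimal host-side C++ sketch of why the result is unchanged. It assumes `popsift::shuffle_xor` wraps a warp-wide XOR shuffle in which each lane `l` reads the value held by lane `l ^ mask`; the diff does not show the wrapper, so that is an assumption. The names `emulate_shuffle_xor`, `select_conditional` and `select_unconditional` are hypothetical, and the code runs on the CPU rather than on a GPU.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kWarpSize = 32;
using Lanes = std::array<int, kWarpSize>;
using Flags = std::array<bool, kWarpSize>;

// Hypothetical CPU stand-in for a warp-wide XOR shuffle: lane l receives
// the value held by lane (l ^ mask[l]). Every lane takes part in the call.
Lanes emulate_shuffle_xor(const Lanes& values, const Lanes& mask)
{
    Lanes out{};
    for (std::size_t l = 0; l < kWarpSize; ++l) {
        const std::size_t src = l ^ static_cast<std::size_t>(mask[l]);
        assert(src < kWarpSize);  // source lane must stay inside the warp
        out[l] = values[src];
    }
    return out;
}

// Shape of the removed code: only lanes with must_swap read their partner.
Lanes select_conditional(const Lanes& my_index, const Flags& must_swap, int shift)
{
    assert(shift >= 0 && shift < 5);  // 1 << shift must be below the warp size
    Lanes out{};
    for (std::size_t l = 0; l < kWarpSize; ++l)
        out[l] = must_swap[l] ? my_index[l ^ (std::size_t{1} << shift)]
                              : my_index[l];
    return out;
}

// Shape of the added code: all lanes shuffle; a mask of 0 makes a lane
// read its own value, so non-swapping lanes are unaffected.
Lanes select_unconditional(const Lanes& my_index, const Flags& must_swap, int shift)
{
    assert(shift >= 0 && shift < 5);
    Lanes mask{};
    for (std::size_t l = 0; l < kWarpSize; ++l)
        mask[l] = must_swap[l] ? (1 << shift) : 0;
    return emulate_shuffle_xor(my_index, mask);
}
```

Both functions return the same values for any `must_swap` pattern. On the GPU the difference is in who executes the shuffle: on architectures with independent thread scheduling, the `*_sync` shuffle intrinsics expect every lane named in the mask to reach the call, which is presumably what the "RTX fix" mentioned in the thread is about.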
8 changes: 5 additions & 3 deletions src/popsift/s_desc_loop.cu
@@ -76,8 +76,10 @@ void ext_desc_loop_sub( const float ang,

float dpt[9] = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };

- for( int i = threadIdx.x; i < loops; i+=blockDim.x )
+ for( int i = threadIdx.x; popsift::any(i < loops); i+=blockDim.x )
{
+ if( i >= loops ) continue;
+
const int ii = i / wx + ymin;
const int jj = i % wx + xmin;

@@ -111,14 +113,14 @@ void ext_desc_loop_sub( const float ang,
const float wgt2 = do0;

int fo = fo0 % DESC_BINS;

// maf: multiply-add
// _ru - round to positive infinity equiv to froundf since always >=0
dpt[fo] = __fmaf_ru( wgt1, wgt, dpt[fo] ); // dpt[fo] += (wgt1*wgt);
dpt[fo+1] = __fmaf_ru( wgt2, wgt, dpt[fo+1] ); // dpt[fo+1] += (wgt2*wgt);
}
__syncthreads();
}
__syncthreads();

dpt[0] += dpt[8];

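Note on the loop change in `ext_desc_loop_sub`: the new condition keeps every thread iterating until no thread has work left, and threads whose index is out of range skip the body. The following is a host-side C++ sketch of the resulting trip counts. It assumes `popsift::any` is a warp vote that is true when the predicate holds for at least one lane, and that `blockDim.x` equals the warp size of 32; neither is shown in the diff, so both are assumptions. The names `trips_per_lane_loop`, `trips_warp_uniform_loop` and `UniformResult` are hypothetical.

```cpp
#include <array>
#include <cassert>

constexpr int kWarpSize = 32;
using TripCounts = std::array<int, kWarpSize>;

// Original loop shape: each lane leaves the loop as soon as its own index
// reaches `loops`, so lanes can run a different number of iterations.
TripCounts trips_per_lane_loop(int loops)
{
    assert(loops >= 0);
    TripCounts trips{};
    for (int lane = 0; lane < kWarpSize; ++lane)
        for (int i = lane; i < loops; i += kWarpSize)
            ++trips[lane];
    return trips;
}

struct UniformResult {
    TripCounts trips;  // iterations entered per lane
    TripCounts work;   // iterations in which the lane ran the loop body
};

// Patched loop shape: lanes advance in lockstep while any lane is in range.
UniformResult trips_warp_uniform_loop(int loops)
{
    assert(loops >= 0);
    UniformResult r{};
    for (int base = 0; ; base += kWarpSize) {
        // Emulates any(i < loops) with i = base + lane for each lane.
        bool any_active = false;
        for (int lane = 0; lane < kWarpSize; ++lane)
            any_active = any_active || (base + lane < loops);
        if (!any_active) break;

        for (int lane = 0; lane < kWarpSize; ++lane) {
            ++r.trips[lane];
            if (base + lane >= loops) continue;  // mirrors `if( i >= loops ) continue;`
            ++r.work[lane];
        }
    }
    return r;
}
```

With `loops = 40`, the original shape runs two iterations on lanes 0 to 7 and one on the rest. The patched shape enters two iterations on every lane while doing the same amount of useful work per lane, so any warp-level operation inside the loop is reached by all lanes together.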