diff --git a/README.md b/README.md
index ee39093..6ab5706 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,121 @@
 **University of Pennsylvania, CIS 5650: GPU Programming and Architecture,
 Project 1 - Flocking**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Christina Qiu
+  * [LinkedIn](https://www.linkedin.com/in/christina-qiu-6094301b6/), [personal website](https://christinaqiu3.github.io/), [twitter](), etc.
+* Tested on: Windows 11, Intel Core i7-13700H @ 2.40GHz, 16GB RAM, NVIDIA GeForce RTX 4060 Laptop GPU (Personal laptop)
 
-### (TODO: Your README)
+## Overview
 
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+This is an implementation of a flocking simulation based on the Reynolds Boids algorithm. This project involved writing CUDA kernels, using them, and analyzing their performance. 
+
+1. Naive boids simulation.
+2. Scattered boids simulation on uniform grid.
+3. Coherent boids simulation on uniform grid.
+
+## Naive Boids
+
+At every timestep, a boid looks at each of its neighboring boids and computes the velocity change contribution from each of the three rules:
+
+1. cohesion - boids move towards the perceived center of mass of their neighbors
+2. separation - boids avoid getting to close to their neighbors
+3. alignment - boids generally try to move with the same direction and speed as their neighbors
+
+Thus, a brute force implementation has each boid check every other boid in the simulation.
+
+![Simulation Demo](images/hw_1_100.gif)
+
+<p align="center">
+  <img src="images/100_1.png" alt="Before" width="45%">
+  <img src="images/100_2.png" alt="After" width="45%">
+</p>
+
+boids = 50,000, blockSize = 128, no visualization, dt = 0.2
+
+Runtime O(n^2)
+
+## Scattered Uniform Grid
+
+Each boid is assigned to a grid cell based on its position. We store an array of particle indices, which are sorted by their grid cell. The actual position and velocity data remain in their original arrays so we must look up the sorted index to access a boid's data. This enables fast neighbor search but can cause memory access inefficiencies due to scattered data.
+
+![Simulation Demo](images/hw_1_110.gif)
+
+<p align="center">
+  <img src="images/110_1.png" alt="Before" width="45%">
+  <img src="images/110_2.png" alt="After" width="45%">
+</p>
+
+boids = 50,000, blockSize = 128, no visualization, dt = 0.2
+
+Runtime O(n) * neighbors
+
+## Coherent Uniform Grid
+
+The coherent uniform grid improves on the scattered version by also reordering the position and velocity arrays so that boids within the same grid cell are stored contiguously in memory. This cuts out the middle-man and noticeably improves performance, especially on large numbers of boids.
+
+![Simulation Demo](images/hw_1_111.gif)
+
+<p align="center">
+  <img src="images/111_1.png" alt="Before" width="45%">
+  <img src="images/111_2.png" alt="After" width="45%">
+</p>
+
+boids = 50,000, blockSize = 128, no visualization, dt = 0.2
+
+Runtime O(n) * neighbors
+
+## Runtime Analysis
+
+1. Hypothesis: Implementing a coherent uniform grid will significantly improve the performance of the Boids simulation compared to the scattered uniform grid and naive implementation, due to more efficient memory access and reduced unnecessary comparisons.
+
+### Average Kernel Time change with increasing # of Boids Graph
+
+blockSize = 128, no visualization, dt = 0.2
+
+![Data](images/graph1_t.png)
+![Graph](images/graph1_v.png)
+
+### Framerate change with increasing # of Boids Graph
+
+blockSize = 128, no visualization, dt = 0.2
+
+![Data](images/graph2_t.png)
+![Graph](images/graph2_v.png)
+
+Conclusion: Naive performance decreases significantly as the number of boids increases. Scattered Uniform Grid performance scales better. Coherent Uniform Grid has the theoretical complexity as scattered but is faster in practice due to coalesced memory access.
+
+##
+
+2. Hypothesis: Best performance will be observed at block sizes of 128 and 256. Very small or very large block sizes lead to underutilization of GPU available cores or lead to increased register use per threat which will reduce the number of active warps and thus lower performance.
+
+### Average Kernel Time change with increasing Blocksize Graph
+
+boids = 50,000, no visualization, dt = 0.2
+
+![Data](images/graph3_t.png)
+![Graph](images/graph3_v.png)
+
+### Framerate change with increasing Blocksize Graph
+
+boids = 50,000, no visualization, dt = 0.2
+
+![Data](images/graph4_t.png)
+![Graph](images/graph4_v.png)
+
+Conclusion: In testing, block sizes of 128 and 256 generally performed better. That said, the runtime data was somewhat noisy, making it difficult to draw firm conclusions. More controlled testing or averaging over a larger number of runs may be needed for more definitive insights.
+
+##
+
+3. Hypothesis: coherent uniform grid will have performance improvements over the scattered uniform grid. 
+
+Conclusion: Yes there was a notable improvment, which was expected because the coherent grid improves memory access patterns by storing position and velocity data contiguously in memory.
+
+##
+
+4. Hypothesis: changing cell width and checking 27 vs 8 neighboring cells will affect performance because these parameters directly control the number of comparisons made per boid and the density of boids within each cell. Checking 27 neighboring cells instead of 8 will likely increase runtime due to the greater number of memory lookups and loop iterations. However, the impact may not be as dramatic if the additional cells are mostly empty.
+
+Conclusion: When cell width was too small, more cells were needed to cover the same neighborhood radius, which increased the number of neighbor cells to check. This added overhead and reduced performance. When cell width was too large, more boids ended up in each cell, increasing the number of pairwise checks per cell and reducing the benefits of spatial partitioning. The ideal cell width was around twice the max rule distance, as suggested (covers interaction range efficiently with minimal overlap).
+
+Using 27-cell neighborhood search slightly increased runtime across all grid implementations due to the larger number of loop iterations and memory accesses. However, this increase was not always dramatic and depended heavily on boid density.
+
+Switching to an 8-cell search did reduce runtime, especially at higher boid counts.
\ No newline at end of file
diff --git a/images/100_1.png b/images/100_1.png
new file mode 100644
index 0000000..e39bd0c
Binary files /dev/null and b/images/100_1.png differ
diff --git a/images/100_2.png b/images/100_2.png
new file mode 100644
index 0000000..8a028ba
Binary files /dev/null and b/images/100_2.png differ
diff --git a/images/110_1.png b/images/110_1.png
new file mode 100644
index 0000000..0d676d6
Binary files /dev/null and b/images/110_1.png differ
diff --git a/images/110_2.png b/images/110_2.png
new file mode 100644
index 0000000..8f7518f
Binary files /dev/null and b/images/110_2.png differ
diff --git a/images/111_1.png b/images/111_1.png
new file mode 100644
index 0000000..2e18b0d
Binary files /dev/null and b/images/111_1.png differ
diff --git a/images/111_2.png b/images/111_2.png
new file mode 100644
index 0000000..864dbc5
Binary files /dev/null and b/images/111_2.png differ
diff --git a/images/graph1_t.png b/images/graph1_t.png
new file mode 100644
index 0000000..c7d4876
Binary files /dev/null and b/images/graph1_t.png differ
diff --git a/images/graph1_v.png b/images/graph1_v.png
new file mode 100644
index 0000000..2754e62
Binary files /dev/null and b/images/graph1_v.png differ
diff --git a/images/graph2_t.png b/images/graph2_t.png
new file mode 100644
index 0000000..61beaac
Binary files /dev/null and b/images/graph2_t.png differ
diff --git a/images/graph2_v.png b/images/graph2_v.png
new file mode 100644
index 0000000..79d5f5d
Binary files /dev/null and b/images/graph2_v.png differ
diff --git a/images/graph3_t.png b/images/graph3_t.png
new file mode 100644
index 0000000..2b39565
Binary files /dev/null and b/images/graph3_t.png differ
diff --git a/images/graph3_v.png b/images/graph3_v.png
new file mode 100644
index 0000000..c49e23c
Binary files /dev/null and b/images/graph3_v.png differ
diff --git a/images/graph4_t.png b/images/graph4_t.png
new file mode 100644
index 0000000..bdcc06e
Binary files /dev/null and b/images/graph4_t.png differ
diff --git a/images/graph4_v.png b/images/graph4_v.png
new file mode 100644
index 0000000..74d3513
Binary files /dev/null and b/images/graph4_v.png differ
diff --git a/images/hw_1_100.gif b/images/hw_1_100.gif
new file mode 100644
index 0000000..980a473
Binary files /dev/null and b/images/hw_1_100.gif differ
diff --git a/images/hw_1_110.gif b/images/hw_1_110.gif
new file mode 100644
index 0000000..e90cc67
Binary files /dev/null and b/images/hw_1_110.gif differ
diff --git a/images/hw_1_111.gif b/images/hw_1_111.gif
new file mode 100644
index 0000000..24ed14e
Binary files /dev/null and b/images/hw_1_111.gif differ
diff --git a/src/kernel.cu b/src/kernel.cu
index 7149917..7e6218c 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -56,7 +56,7 @@ void checkCUDAError(const char *msg, int line = -1) {
 #define rule3Distance 5.0f
 
 #define rule1Scale 0.01f
-#define rule2Scale 0.1f
+#define rule2Scale 0.1f // 0.05 test
 #define rule3Scale 0.1f
 
 #define maxSpeed 1.0f
@@ -95,6 +95,9 @@ int *dev_gridCellEndIndices;   // to this cell?
 
 // TODO-2.3 - consider what additional buffers you might need to reshuffle
 // the position and velocity data to be coherent within cells.
+glm::vec3 *dev_positionGridOrder;
+glm::vec3 *dev_velocityGridOrder;
+
 
 // LOOK-2.1 - Grid parameters based on simulation parameters.
 // These are automatically computed for you in Boids::initSimulation
@@ -179,6 +182,31 @@ void Boids::initSimulation(int N) {
   gridMinimum.z -= halfGridWidth;
 
   // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+  dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+  dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+  
+  // 2.3
+
+  cudaMalloc((void**)&dev_positionGridOrder, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_positionGridOrder failed!");
+
+  cudaMalloc((void**)&dev_velocityGridOrder, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_velocityGridOrder failed!");
+
+
+
   cudaDeviceSynchronize();
 }
 
@@ -243,7 +271,47 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po
   // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
   // Rule 2: boids try to stay a distance d away from each other
   // Rule 3: boids try to match the speed of surrounding boids
-  return glm::vec3(0.0f, 0.0f, 0.0f);
+
+    glm::vec3 perceived_center(0.0f);
+    glm::vec3 c(0.0f);
+    glm::vec3 perceived_velocity(0.0f);
+    int count1 = 0, count2 = 0, count3 = 0;
+
+    glm::vec3 boid = pos[iSelf];
+
+    for (int n = 0; n < N; n++) {
+        if (n == iSelf) continue;
+        glm::vec3 b = pos[n];
+        float dist = glm::distance(b, boid);
+        if (dist < rule1Distance) {
+            perceived_center += b;
+			count1++;
+        }
+        if (dist < rule2Distance) {
+            c -= (b - boid);
+			count2++;
+        }
+        if (dist < rule3Distance) {
+            perceived_velocity += vel[n];
+            count3++;
+        }
+	}
+
+    glm::vec3 v1(0.0f), v2(0.0f), v3(0.0f);
+
+    if (count1 > 0) {
+        perceived_center /= count1; // num of neighbors
+        v1 = (perceived_center - boid) * rule1Scale;
+    }
+    if (count2 > 0) {
+        v2 = c * rule2Scale; // rule2Scale
+    }
+    if (count3 > 0) {
+	    perceived_velocity /= count3; // num of neighbors
+        v3 = perceived_velocity * rule3Scale; // (perceived_velocity - vel[iSelf]) * rule3Scale;
+    }
+
+  return v1 + v2 + v3;
 }
 
 /**
@@ -255,6 +323,19 @@ __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
   // Compute a new velocity based on pos and vel1
   // Clamp the speed
   // Record the new velocity into vel2. Question: why NOT vel1?
+
+	int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (index >= N) return;
+    glm::vec3 velChange = computeVelocityChange(N, index, pos, vel1);
+	glm::vec3 velNew = vel1[index] + velChange;
+
+	float speed = glm::length(velNew);
+    if (speed > maxSpeed) {
+        velNew = (velNew / speed) * maxSpeed;
+	}
+
+	vel2[index] = velNew;
+    
 }
 
 /**
@@ -299,6 +380,14 @@ __global__ void kernComputeIndices(int N, int gridResolution,
     // - Label each boid with the index of its grid cell.
     // - Set up a parallel array of integer indices as pointers to the actual
     //   boid data in pos and vel1/vel2
+
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+	if (index >= N) return;
+	indices[index] = index;
+	glm::vec3 p = pos[index];
+	glm::ivec3 gridPos = glm::floor((p - gridMin) * inverseCellWidth);
+	gridIndices[index] = gridIndex3Dto1D(gridPos.x, gridPos.y, gridPos.z, gridResolution);
+
 }
 
 // LOOK-2.1 Consider how this could be useful for indicating that a cell
@@ -316,6 +405,24 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
   // Identify the start point of each cell in the gridIndices array.
   // This is basically a parallel unrolling of a loop that goes
   // "this index doesn't match the one before it, must be a new cell!"
+
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+	if (index >= N) return;
+	int gridIndex = particleGridIndices[index];
+    if (index == 0) {
+        gridCellStartIndices[gridIndex] = index;
+    }
+    else {
+        int prevGridIndex = particleGridIndices[index - 1];
+        if (gridIndex != prevGridIndex) {
+            gridCellStartIndices[gridIndex] = index;
+            gridCellEndIndices[prevGridIndex] = index;
+        }
+	}
+    if (index == N - 1) {
+        gridCellEndIndices[gridIndex] = index + 1;
+	}
+
 }
 
 __global__ void kernUpdateVelNeighborSearchScattered(
@@ -332,6 +439,85 @@ __global__ void kernUpdateVelNeighborSearchScattered(
   // - Access each boid in the cell and compute velocity change from
   //   the boids rules, if this boid is within the neighborhood distance.
   // - Clamp the speed change before putting the new speed in vel2
+    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (index >= N) return;
+    glm::vec3 p = pos[index];
+    glm::vec3 v = vel1[index];
+    glm::ivec3 gridPos = glm::floor((p - gridMin) * inverseCellWidth);
+
+    glm::vec3 center(0.0f);
+    glm::vec3 seperation(0.0f);
+    glm::vec3 velocity(0.0f);
+    int count1 = 0, count2 = 0, count3 = 0;
+
+    for (int i = -1; i <= 1; i++) {
+        for (int j = -1; j <= 1; j++) {
+            for (int z = -1; z <= 1; z++) {
+                int neighborX = gridPos.x + i;
+                int neighborY = gridPos.y + j;
+                int neighborZ = gridPos.z + z;
+
+                if (neighborX < 0 || neighborX >= gridResolution ||
+                    neighborY < 0 || neighborY >= gridResolution ||
+                    neighborZ < 0 || neighborZ >= gridResolution) {
+                    continue;
+                }
+
+                int neighborGridIndex = gridIndex3Dto1D(neighborX, neighborY, neighborZ, gridResolution);
+                int start = gridCellStartIndices[neighborGridIndex];
+                int end = gridCellEndIndices[neighborGridIndex];
+
+                for (int idx = start; idx < end; idx++) {
+                    int boidIndex = particleArrayIndices[idx];
+                    if (boidIndex == idx) continue;
+
+                    glm::vec3 neighborPos = pos[boidIndex];
+                    glm::vec3 neighborVel = vel1[boidIndex];
+                    float dist = glm::distance(p, neighborPos);
+
+                    // Rule 1
+                    if (dist < rule1Distance) {
+                        center += neighborPos;
+                        count1++;
+                    }
+                    // Rule 2
+                    if (dist < rule2Distance) {
+                        seperation -= (neighborPos - p);
+                        count2++;
+                    }
+                    // Rule 3
+                    if (dist < rule3Distance) {
+                        velocity += neighborVel;
+                        count3++;
+                    }
+                }
+            }
+        }
+    }
+
+    glm::vec3 v1(0.0f), v2(0.0f), v3(0.0f);
+
+    if (count1 > 0) {
+        center /= count1; // num of neighbors
+        v1 = (center - p) * rule1Scale;
+    }
+    if (count2 > 0) {
+        v2 = seperation * rule2Scale; // rule2Scale
+    }
+    if (count3 > 0) {
+        velocity /= count3; // num of neighbors
+        v3 = velocity * rule3Scale; // (perceived_velocity - vel[iSelf]) * rule3Scale;
+    }
+
+    glm::vec3 newVel = v + v1 + v2 + v3;
+
+    // Clamp
+    float speed = glm::length(newVel);
+    if (speed > maxSpeed) {
+        newVel = (newVel / speed) * maxSpeed;
+    }
+
+    vel2[index] = newVel;
 }
 
 __global__ void kernUpdateVelNeighborSearchCoherent(
@@ -351,6 +537,85 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
   // - Access each boid in the cell and compute velocity change from
   //   the boids rules, if this boid is within the neighborhood distance.
   // - Clamp the speed change before putting the new speed in vel2
+
+    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (index >= N) return;
+    glm::vec3 p = pos[index];
+    glm::vec3 v = vel1[index];
+    glm::ivec3 gridPos = glm::floor((p - gridMin) * inverseCellWidth);
+
+    glm::vec3 center(0.0f);
+    glm::vec3 seperation(0.0f);
+    glm::vec3 velocity(0.0f);
+    int count1 = 0, count2 = 0, count3 = 0;
+
+    for (int i = -1; i <= 1; i++) {
+        for (int j = -1; j <= 1; j++) {
+            for (int z = -1; z <= 1; z++) {
+                int neighborX = gridPos.x + i;
+                int neighborY = gridPos.y + j;
+                int neighborZ = gridPos.z + z;
+
+                if (neighborX < 0 || neighborX >= gridResolution ||
+                    neighborY < 0 || neighborY >= gridResolution ||
+                    neighborZ < 0 || neighborZ >= gridResolution) {
+                    continue;
+                }
+
+                int neighborGridIndex = gridIndex3Dto1D(neighborX, neighborY, neighborZ, gridResolution);
+                int start = gridCellStartIndices[neighborGridIndex];
+                int end = gridCellEndIndices[neighborGridIndex];
+
+                for (int idx = start; idx < end; idx++) {
+                    if (index == idx) continue;
+
+                    glm::vec3 neighborPos = pos[idx];
+                    glm::vec3 neighborVel = vel1[idx];
+                    float dist = glm::distance(p, neighborPos);
+
+                    // Rule 1
+                    if (dist < rule1Distance) {
+                        center += neighborPos;
+                        count1++;
+                    }
+                    // Rule 2
+                    if (dist < rule2Distance) {
+                        seperation -= (neighborPos - p);
+                        count2++;
+                    }
+                    // Rule 3
+                    if (dist < rule3Distance) {
+                        velocity += neighborVel;
+                        count3++;
+                    }
+                }
+            }
+        }
+    }
+
+    glm::vec3 v1(0.0f), v2(0.0f), v3(0.0f);
+
+    if (count1 > 0) {
+        center /= count1; // num of neighbors
+        v1 = (center - p) * rule1Scale;
+    }
+    if (count2 > 0) {
+        v2 = seperation * rule2Scale; // rule2Scale
+    }
+    if (count3 > 0) {
+        velocity /= count3; // num of neighbors
+        v3 = velocity * rule3Scale; // (perceived_velocity - vel[iSelf]) * rule3Scale;
+    }
+
+    glm::vec3 newVel = v + v1 + v2 + v3;
+
+    // Clamp
+    float speed = glm::length(newVel);
+    if (speed > maxSpeed) {
+        newVel = (newVel / speed) * maxSpeed;
+    }
+
+    vel2[index] = newVel;
 }
 
 /**
@@ -359,6 +624,18 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
 void Boids::stepSimulationNaive(float dt) {
   // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
   // TODO-1.2 ping-pong the velocity buffers
+
+	dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); // do this because need to round up
+
+	kernUpdateVelocityBruteForce << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_pos, dev_vel1, dev_vel2);
+	checkCUDAErrorWithLine("kernUpdateVelocityBruteForce failed!");
+	
+    kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_pos, dev_vel2);
+	checkCUDAErrorWithLine("kernUpdatePos failed!");
+    
+    std::swap(dev_vel1, dev_vel2);
+    
+	cudaDeviceSynchronize();
 }
 
 void Boids::stepSimulationScatteredGrid(float dt) {
@@ -374,6 +651,53 @@ void Boids::stepSimulationScatteredGrid(float dt) {
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed
+
+    dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); // do this because need to round up
+
+    kernComputeIndices << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+    checkCUDAErrorWithLine("kernComputeIndices failed!");
+
+    thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+    kernResetIntBuffer << <fullBlocksPerGrid, blockSize >> > (gridCellCount, dev_gridCellStartIndices, -1); // why is this good?
+    kernResetIntBuffer << <fullBlocksPerGrid, blockSize >> > (gridCellCount, dev_gridCellEndIndices, -1);
+
+    kernIdentifyCellStartEnd << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+    checkCUDAErrorWithLine("kernIdentifyCellStartEnd failed!");
+    
+    kernUpdateVelNeighborSearchScattered << <fullBlocksPerGrid, blockSize >> > (
+        numObjects,
+        gridSideCount,
+        gridMinimum,
+        gridInverseCellWidth,
+        gridCellWidth,
+        dev_gridCellStartIndices,
+        dev_gridCellEndIndices,
+        dev_particleArrayIndices, 
+        dev_pos,
+        dev_vel1, 
+        dev_vel2
+    );
+    checkCUDAErrorWithLine("kernUpdateVelNeighborSearchScattered failed!");
+
+    kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_pos, dev_vel1);
+    checkCUDAErrorWithLine("kernUpdatePos failed!");
+
+    std::swap(dev_vel1, dev_vel2);
+
+    cudaDeviceSynchronize();
+
+    // dev_thrust_particleArrayIndices;
+    // dev_thrust_particleGridIndices;
+}
+
+__global__ void kernReorderBoidData(int N, int* particleArrayIndices, glm::vec3* old_pos, glm::vec3* old_vel, glm::vec3* new_pos, glm::vec3* new_vel) {
+    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (index >= N) return;
+
+    int orderIndex = particleArrayIndices[index];
+    new_pos[index] = old_pos[orderIndex];
+    new_vel[index] = old_vel[orderIndex];
 }
 
 void Boids::stepSimulationCoherentGrid(float dt) {
@@ -392,6 +716,47 @@ void Boids::stepSimulationCoherentGrid(float dt) {
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+
+    dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); // do this because need to round up
+
+    kernComputeIndices << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+    checkCUDAErrorWithLine("kernComputeIndices failed!");
+
+    thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+    kernResetIntBuffer << <fullBlocksPerGrid, blockSize >> > (gridCellCount, dev_gridCellStartIndices, -1); // why is this good?
+    kernResetIntBuffer << <fullBlocksPerGrid, blockSize >> > (gridCellCount, dev_gridCellEndIndices, -1);
+
+    kernIdentifyCellStartEnd << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+    checkCUDAErrorWithLine("kernIdentifyCellStartEnd failed!");
+
+    // BIG DIFFERENCE : use the rearranged array index buffer to reshuffle all
+    //   the particle data in the simulation array.
+    //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
+    kernReorderBoidData << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleArrayIndices, dev_pos, dev_vel1, dev_positionGridOrder, dev_velocityGridOrder);
+    checkCUDAErrorWithLine("kernReorderBoidData failed!");
+
+    kernUpdateVelNeighborSearchCoherent << <fullBlocksPerGrid, blockSize >> > (
+        numObjects,
+        gridSideCount,
+        gridMinimum,
+        gridInverseCellWidth,
+        gridCellWidth,
+        dev_gridCellStartIndices,
+        dev_gridCellEndIndices,
+        dev_positionGridOrder,
+        dev_velocityGridOrder,
+        dev_vel2
+        );
+    checkCUDAErrorWithLine("kernUpdateVelNeighborSearchCoherent failed!");
+
+    kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_positionGridOrder, dev_vel2);
+    checkCUDAErrorWithLine("kernUpdatePos failed!");
+
+    std::swap(dev_pos, dev_positionGridOrder);
+    std::swap(dev_vel1, dev_vel2);
+
+    cudaDeviceSynchronize();
 }
 
 void Boids::endSimulation() {
@@ -400,6 +765,13 @@ void Boids::endSimulation() {
   cudaFree(dev_pos);
 
   // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+  cudaFree(dev_gridCellEndIndices);
+  cudaFree(dev_gridCellEndIndices);
+
+  cudaFree(dev_positionGridOrder);
+  cudaFree(dev_velocityGridOrder);
 }
 
 void Boids::unitTest() {
diff --git a/src/main.cpp b/src/main.cpp
index 9c917c0..73def65 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -22,12 +22,12 @@
 // ================
 
 // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
-#define VISUALIZE 1
+#define VISUALIZE 0
 #define UNIFORM_GRID 0
 #define COHERENT_GRID 0
 
 // LOOK-1.2 - change this to adjust particle count in the simulation
-const int N_FOR_VIS = 5000;
+const int N_FOR_VIS = 50000;
 const float DT = 0.2f;
 
 /**
@@ -193,6 +193,22 @@ void initShaders(GLuint * program) {
   // Main loop
   //====================================
   void runCUDA() {
+
+      static cudaEvent_t startEvent, stopEvent;
+      static bool eventsCreated = false;
+
+      static float elapsedTimeSec = 0.0f;
+      static int frameCount = 0;
+      static float totalKernelTimeMs = 0.0f;
+
+      float dt = 0.01f;
+
+      if (!eventsCreated) {
+          cudaEventCreate(&startEvent);
+          cudaEventCreate(&stopEvent);
+          eventsCreated = true;
+      }
+
     // Map OpenGL buffer object for writing from CUDA on a single GPU
     // No data is moved (Win & Linux). When mapped to CUDA, OpenGL should not
     // use this buffer
@@ -204,6 +220,9 @@ void initShaders(GLuint * program) {
     cudaGLMapBufferObject((void**)&dptrVertPositions, boidVBO_positions);
     cudaGLMapBufferObject((void**)&dptrVertVelocities, boidVBO_velocities);
 
+    // START TIME
+    cudaEventRecord(startEvent);
+
     // execute the kernel
     #if UNIFORM_GRID && COHERENT_GRID
     Boids::stepSimulationCoherentGrid(DT);
@@ -213,12 +232,38 @@ void initShaders(GLuint * program) {
     Boids::stepSimulationNaive(DT);
     #endif
 
+    // STOP TIME
+    cudaEventRecord(stopEvent);
+    cudaEventSynchronize(stopEvent);
+
+    float kernelTimeMs = 0.0f;
+    cudaEventElapsedTime(&kernelTimeMs, startEvent, stopEvent);
+
     #if VISUALIZE
     Boids::copyBoidsToVBO(dptrVertPositions, dptrVertVelocities);
     #endif
+
     // unmap buffer object
     cudaGLUnmapBufferObject(boidVBO_positions);
     cudaGLUnmapBufferObject(boidVBO_velocities);
+
+    elapsedTimeSec += dt;
+    totalKernelTimeMs += kernelTimeMs;
+    frameCount++;
+
+    if (elapsedTimeSec >= 10.0f) {
+        float avgKernelTimeMs = totalKernelTimeMs / frameCount;
+        float avgFPS = 1000.0f / avgKernelTimeMs;
+
+        std::cout << "\n[CUDA Kernel Performance - 10s]" << std::endl;
+        std::cout << "  Average Kernel Time: " << avgKernelTimeMs << " ms" << std::endl;
+        std::cout << "  Average FPS (Simulation Only): " << avgFPS << std::endl;
+
+        // Reset for next window
+        elapsedTimeSec = 0.0f;
+        totalKernelTimeMs = 0.0f;
+        frameCount = 0;
+    }
   }
 
   void mainLoop() {