Initial refactoring of edge storage #1085


Closed
litghost wants to merge 11 commits from the refactor_edges2 branch

Conversation

@litghost (Collaborator) commented Jan 27, 2020

Description

This changes edge storage from a per-node array-of-structs allocation to a struct-of-arrays for all edge data.

Several algorithms over edges that previously iterated per node, per edge, but were really just iterating over all edges, are now part of rr_node_storage.

This PR is built on top of #1084.

Related Issue

#1079
#1081
#1084

Motivation and Context

For a preallocated edge array (e.g. loading the rr graph from file), this changes the number of heap allocations from Nnodes to 3.

During rr graph building, this change reduces the number of heap allocations from Nnodes to 3 * ceil(log2(Nedges) - log2(10 * Nnodes)).

When the number of edges is known, the max memory usage for edges is now Nedges * (2*sizeof(int) + sizeof(short)). When the number of edges is not known, the max memory usage for edges is now:

min(10 * Nnodes, Nedges) * (2*sizeof(int) + sizeof(short)) + sizeof(int) * (Nedges / 2)
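
To make the layout change concrete, here is a minimal sketch of the struct-of-arrays idea (an illustration, not the PR's exact class), using std::vector and plain indices in place of VTR's vtr::vector and strong ID types. The three per-edge arrays account for the 2*sizeof(int) + sizeof(short) per-edge footprint above:

#include <cstddef>
#include <vector>

// Sketch: struct-of-arrays edge storage. One entry per node in first_edge_,
// one entry per edge in each of the three parallel edge arrays, so all edge
// data needs 3 heap allocations total instead of one array per node.
struct edge_storage_sketch {
    // first_edge_[n] is the index of node n's first out-going edge; with a
    // sentinel entry at the end, first_edge_[n + 1] marks one past the last.
    std::vector<size_t> first_edge_;

    std::vector<int> edge_src_node_;   // sizeof(int) per edge
    std::vector<int> edge_dest_node_;  // sizeof(int) per edge
    std::vector<short> edge_switch_;   // sizeof(short) per edge

    size_t num_edges(size_t node) const {
        return first_edge_[node + 1] - first_edge_[node];
    }

    int edge_sink_node(size_t node, size_t iedge) const {
        return edge_dest_node_[first_edge_[node] + iedge];
    }
};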

How Has This Been Tested?

  • Travis CI is green
  • Nightly and weekly QoR metrics are acceptable

Types of changes

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • All new and existing tests passed

@probot-autolabeler bot added the lang-cpp (C/C++ code) and VPR (VPR FPGA Placement & Routing Tool) labels Jan 27, 2020
@litghost litghost requested a review from kmurray January 27, 2020 22:15
@litghost litghost force-pushed the refactor_edges2 branch 2 times, most recently from 28c4331 to 42693d6 on January 27, 2020 22:19
@litghost litghost requested a review from vaughnbetz January 27, 2020 22:20
iterator operator--() {
    value_ -= 1;
    return *this;
}
Contributor:

Why not use the corresponding operators (also != below)? Do they not exist?

Collaborator (author):

Sorry what? This comment doesn't make a lot of sense.

t_edge_size num_edges(const RRNodeId& id) const {
    auto first_id = first_edge_[id];
    auto second_id = (&first_edge_[id])[1];
    return (size_t)second_id - (size_t)first_id;
}
Contributor:

It's dangerous to use auto followed by a cast. I'd prefer an explicit type here.

Collaborator (author):

This isn't actually a cast, it's an operator size_t.

Contributor:

We could consider using size_t(second_id) - size_t(first_id) to make this more explicit. That's how it's done elsewhere in VPR.
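
As background for "operator size_t": VTR's strong ID types wrap an integer and expose an explicit conversion operator, so both (size_t)id and size_t(id) invoke that operator rather than performing a raw integer cast. A minimal, hypothetical sketch of the pattern (simplified from VTR's actual vtr::StrongId template):

#include <cstddef>

// Hypothetical simplified strong ID: the explicit conversion operator is
// what (size_t)id and size_t(id) call, rather than a raw cast.
class EdgeIdSketch {
  public:
    explicit EdgeIdSketch(size_t value) : value_(value) {}
    explicit operator size_t() const { return value_; }

  private:
    size_t value_;
};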

@litghost litghost force-pushed the refactor_edges2 branch 2 times, most recently from e6267aa to 87d0ef8 on January 27, 2020 22:41
@HackerFoo (Contributor) commented Jan 27, 2020

There are many uses of first and second that seem to be used as placeholder names, like x and y.

@litghost litghost changed the title Initial refactoring of edge storage WIP: Initial refactoring of edge storage Jan 27, 2020
@litghost (Collaborator Author) commented

Sanity check with gsm_switch_stratixiv_arch_timing.blif (from weekly titan tests)

Before:

## Build routing resource graph took 212.56 seconds (max_rss 7950.6 MiB, delta_rss +1698.9 MiB)
  RR Graph Nodes: 20014774
  RR Graph Edges: 159049032
# Create Device took 214.84 seconds (max_rss 7950.6 MiB, delta_rss +1698.9 MiB)

After:

## Build routing resource graph took 159.88 seconds (max_rss 9165.3 MiB, delta_rss +2842.8 MiB)
  RR Graph Nodes: 20014774
  RR Graph Edges: 159049032
# Create Device took 161.22 seconds (max_rss 9165.3 MiB, delta_rss +2842.8 MiB)

This jump of ~1.2 GiB is basically the cost of storing the source rr node explicitly. I'll noodle over alternative strategies.

@litghost (Collaborator Author) commented Jan 28, 2020

In an unexpected turn: while the max_rss after Create Device is higher than before (per the previous comment), the final max_rss is much better because the heap is not thrashed:

Before:

The entire flow of VPR took 13322.02 seconds (max_rss 16152.4 MiB)

After:

The entire flow of VPR took 7068.79 seconds (max_rss 10713.4 MiB)

I'll need to do a full vtr_reg_weekly to confirm the results.

@kmurray (Contributor) left a comment

My initial thought after looking through this is that it's looking good!

It is interesting that some parts of it are looking structurally more like the proposed RRGraph in #1046 (e.g. SoA form edge storage). I think ultimately moving towards that style of interface (pass an ID to get an attribute from an element in the graph) is where we want to go.

I have a couple more detailed comments below, including a possible approach to avoid storing edge_src_node_ after edge partitioning while still allowing (moderately fast) access to an edge's source node.

vtr::vector<RRNodeId, RREdgeId> first_edge_;
vtr::vector<RRNodeId, t_edge_size> fan_in_;

vtr::vector<RREdgeId, RRNodeId> edge_src_node_;
Contributor:

As you noted, the additional storage of the source node is non-trivial (since there are more edges than nodes).

It's nice to have this information, but it isn't currently used (e.g. in the router). So if we really want to focus on memory usage, we could consider dropping it from here.

Contributor:

Thinking about this a bit more: given the sorted structure of the edges, and that we know the start/end of each node's out-going edges, it seems like we should be able to get the source node for a given edge ID in log(num_nodes) time via binary search against the (sorted) first_edge_ values.
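
A minimal sketch of that lookup (an illustration under the assumption that first_edge_ is sorted ascending, with a sentinel entry past the last node):

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch: recover an edge's source node in O(log num_nodes) via
// binary search over the sorted first_edge array, instead of permanently
// storing edge_src_node_. Requires first_edge[n] <= edge < first_edge[n + 1]
// exactly when node n owns the edge.
size_t edge_source_node(const std::vector<size_t>& first_edge, size_t edge) {
    // First entry strictly greater than edge; the owning node sits just
    // before it, which also skips over nodes with zero out-going edges.
    auto it = std::upper_bound(first_edge.begin(), first_edge.end(), edge);
    return static_cast<size_t>(std::distance(first_edge.begin(), it)) - 1;
}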

Collaborator (author):

The primary reason for the edge_src_node_ data is to sort the underlying data. After that is complete, we can discard the array.

@@ -71,17 +59,25 @@ struct t_rr_node_data {
        int16_t track_num;
        int16_t class_num;
    } ptc_;
    t_edge_size fan_in_ = 0;

    uint16_t capacity_ = 0;
};

// RR node and edge storage class.
class t_rr_node_storage {
Contributor:

At this point this struct is really now representing the RR graph itself. Perhaps it should be renamed?

Collaborator (author):

Renamed

return;
}

edges_read_ = true;
Contributor:

I'm not clear on what this variable is for, or how it differs from partitioned_. It likely needs comments.

Collaborator (author):

Comments added

Comment on lines 418 to 422
const auto& device_ctx = g_vpr_ctx.device();
std::stable_sort(
    edge_sort_iterator(this, 0),
    edge_sort_iterator(this, edge_src_node_.size()),
    edge_compare_src_node_and_configurable_first(device_ctx.rr_switch_inf));
Contributor:

We'll need some comments describing what's going on here.

I think it's accomplishing:

  • Ordering edges by src_node (so edges with a shared source node are contiguous)
  • Partitioning the edges for each src_node by configurable/non-configurable

with a single sort. Is that correct?

Collaborator (author):

Your understanding is correct.
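
For illustration, a comparator with that combined effect might look like the following sketch (hypothetical types; the PR's edge_compare_src_node_and_configurable_first works on edge IDs through edge_sort_iterator and derives configurability from rr_switch_inf):

#include <tuple>

// Hypothetical flattened edge record, for illustration only.
struct edge_record {
    int src_node;
    bool configurable; // in the PR, derived from the edge's switch type
};

// Orders by source node first, then configurable-before-non-configurable
// within each node, so one stable_sort both groups each node's edges and
// partitions them by configurability.
bool src_node_then_configurable_first(const edge_record& lhs,
                                      const edge_record& rhs) {
    return std::make_tuple(lhs.src_node, !lhs.configurable)
         < std::make_tuple(rhs.src_node, !rhs.configurable);
}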

Collaborator (author):

Comment added

t_rr_node_data& get(const RRNodeId& id) {
    return storage_[id];
}
const t_rr_node_data& get(const RRNodeId& id) const {
    return storage_[id];
}

// Take the allocated edges in edge_src_node_ / edge_dest_node_ / edge_switch_,
// sort them, and assign the first edge for each node.
void assign_edges();
Contributor:

Maybe rename to assign_first_edges()?

Collaborator (author):

Done

 * the edge data is directly usable for each node by simply slicing the arrays.
 */
struct edge_swapper {
Contributor:

This is a nice way to keep the swapping clean in Struct-of-Arrays form!
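
For contrast, the more common way to sort parallel arrays is to sort an index permutation and then gather, at the cost of an extra index array plus copies. A minimal sketch of that alternative (not the PR's in-place edge_swapper approach):

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Alternative sketch: reorder parallel edge arrays via a sorted permutation.
// Costs O(num_edges) extra memory for the index array and gather copies,
// which the proxy-swap approach (edge_swapper) avoids.
void sort_edges_by_src(std::vector<int>& src,
                       std::vector<int>& dest,
                       std::vector<short>& sw) {
    std::vector<size_t> order(src.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](size_t a, size_t b) { return src[a] < src[b]; });

    // Apply the same permutation to each parallel array.
    auto gather = [&](auto& v) {
        auto sorted = v; // temporary copy to gather into
        for (size_t i = 0; i < order.size(); ++i)
            sorted[i] = v[order[i]];
        v = std::move(sorted);
    };
    gather(src);
    gather(dest);
    gather(sw);
}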

Comment on lines 241 to 246
RREdgeId edge_id(const RRNodeId& id, t_edge_size iedge) const {
    RREdgeId first_edge = first_edge_[id];
    RREdgeId ret((size_t)first_edge + iedge);
    VTR_ASSERT_SAFE(ret < (&first_edge_[id])[1]);
    return ret;
}
Contributor:

That we can do this easily given the edge ordering is a really nice feature, which I expect will make porting/refactoring downstream code much easier.

@litghost litghost mentioned this pull request Feb 3, 2020
@litghost (Collaborator Author) commented Feb 3, 2020

I think I've identified a way to regain some of the lost CPU performance. I've started a vtr_reg_weekly QoR run, but it won't return results for a day or two.

I'll push a new branch today with a rebase onto master, along with fixes for the review comments.

@litghost (Collaborator Author) commented Feb 3, 2020

I believe all feedback has been addressed. I've kicked off a QoR run; results will be ready in a couple of days. Preliminary results show the CPU cost is gone, with the 4% memory increase remaining. #1096 resolves the memory increase and turns it into a 35-60% memory reduction.

@litghost litghost changed the title WIP: Initial refactoring of edge storage Initial refactoring of edge storage Feb 3, 2020
@litghost litghost requested a review from kmurray February 3, 2020 22:53
This should have a negligible performance impact, but it enables future changes to how rr nodes and rr edges are stored.

Signed-off-by: Keith Rothman <[email protected]>
This changes edge storage from a per-node array-of-structs allocation to a struct-of-arrays for all edge data.

Several algorithms over edges that previously iterated per node, per edge, but were really just iterating over all edges, are now part of rr_node_storage.

Signed-off-by: Keith Rothman <[email protected]>
This enables 16-byte alignment (4 nodes per cache line).

Signed-off-by: Keith Rothman <[email protected]>
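
As a compile-time guard on that property, something like the following sketch could be used (hypothetical field layout; the real t_rr_node_data has its own members, but the point is that a 16-byte record gives exactly four nodes per 64-byte cache line):

#include <cstdint>

// Hypothetical 16-byte node record standing in for t_rr_node_data.
struct node_data_sketch {
    int8_t type_;
    int8_t side_;
    int16_t ptc_;
    uint16_t fan_in_;
    uint16_t capacity_;
    int16_t xlow_, ylow_, xhigh_, yhigh_;
};
static_assert(sizeof(node_data_sketch) == 16,
              "16-byte nodes: 4 per 64-byte cache line");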
@litghost litghost closed this Mar 4, 2020
@litghost litghost deleted the refactor_edges2 branch March 4, 2020 01:02
Labels
lang-cpp (C/C++ code), libvtrutil, VPR (VPR FPGA Placement & Routing Tool)
3 participants