[WIP] Tileable Routing Resource Graph Builder #2135
Open
tangxifan wants to merge 546 commits into master from openfpga
resolve merge conflicts.
Add vib info
Add vib info
Add vib info
VIB description update
* [vpr][ap] remove redundant print_pb * Fix styling regressions * Add reset_bimap helper method to AtomPBBimap * Remove copying empty bimap from global context to cluster legalizer * Refactor is_atom_blk_in_pb function to get two t_pb* arguments * Fix minor styling issues * [vpr][pack] reomve redundant function calls * [vpr][place] fix estimated_wl var name * [APPack] Updated How APPack Adheres to Given Placement The original implementation of APPack was focused on reconstructing a given flat placement. This can cause issues if the given flat placement disagrees with the decisions of the packer. Instead, updated APPack so that it treats the flat placement as a hint to help guide how it performs clustering. Added the following new features: - APPack computes the location of clusters based on the centroid of the molecules packed within. - APPack attenuates the gain terms of candidates based on their distance from the cluster. - APPack drops candidates which are too far from the cluster being created. Remove adding molecules near to the position of the cluster. This had similar affects to unrelated clustering and should be investigated separately later. With these changes to APPack, the AP flow now improves WL of circuits by 1-3% at the expense of up to 15% runtime compared to the default VPR flow. * make format * [vpr][route] remove redundant functions from rr_graph2 * make format * [vpr][route] remove redundant functions from rr_graph2 * [libs][rr_graph] change rr_node_indices value type to RRNodeId * fix formatting issues * make format * [AP][GlobalPlacement] Improved Partial Legalizer Legality Updated the partial legalizer to now take into account block types when spreading blocks. This will create windows around overfilled bins that is aware of which block types are overfilled and how large the window needs to be to accomodate them. It also takes these block types into account when spreading to only allow blocks to spread into sub-windows that they can exist in. 
This improves quality but was detremental to performance, so some performance improvements were needed. To improve the performance of the partial legalizer, I split the problem into groups of models which must be spread together. This allows us to create tighter windows and can make some parts of the legalizer more efficient. Create a model grouper class which forms the model pack patterns into a graph and find disconnected sub-graphs to form the model groups. Also improved the window generation by pre-clustering the overfilled bins before creating the windows. This sped up the window generation code since less windows overlap. * [vpr][rr_graph] fix comment * [AP][Solver] Supporting Unfixed Blocks When no fixed blocks are provided by the user, the AP flow can still work. Currently, in the first iteration, the solver will put all blocks at 0,0 and use the legalized solution in the next iteration as fixed points. Instead of (0,0), it makes more sense to put the blocks in the center of the device. Also added a guess to the solver to help CG converge faster each iteration. Added a regression test to ensure that not describing the fixed blocks is supported. * [vpr] rename arch_opin_between_dice_switch to arch_inter_die_switch since it is used for both 3d CB and 3d SB * [arch] fix 3d sb arch delay * [arch] add ipin_cblock switch * make format * Update clang-format version to 18 This is the version that is installed by default on Ubuntu 24.04 which we currently run CI and testing on. * Fix formatting to be compliant with clang-format-18 * [APPack] Flat-Placement Informed Unrelated Clustering Used flat placement information provided by APPack to try and select better unrelated candidates. This searches for candidates as close to the flat placement position of the cluster. There are two parameters that control how this is performed: 1) max_unrelated_tile_distance decides how far the algorithm will search for unrelated candidates. 
The algorithm will check for candidates in the same tile as the cluster, and then will search farther and farther out 2) max_unrelated_clustering_attempts decides how many failing attempts the cluster will try unrelated clustering. This matches the option of the same name in the candidate selector class; but this was made separate since likely it will be different for APPack. * apply comments * make format * [vpr][rr_graph] remove flat router parameter from vpr_create_device * [vpr][stats] add print_resource_usage * [vpr][base] moove calculate_device_util to stats * [vpr][pack] include required lib * add print_device_util to stats * [vpr][base] print resource usage and device util only if clb netlist is valid * [vpr][base] remove unused param * [vpr][base] remove var from doxygen comment * [vpr][base] check whether instnace exists in netlist * apply comments * make format * [vpr][place] add skip anneal option * [vpr][place] pass skip_anneal to placer * [vpr][place] update constraint doc * [vpr][place] minor update to the doc * [vtr][script] add run dir to parse script * [script] remove get_latest_run_dir_number out of util * [script] use run dir name instead of only accepting the run dir num * [script] rename to set_global_run_dir * make format-py * fix formatting issue * [script] fix when run dir is not found * make format-py * fix python lint * add NestedNetlistRouter and custom thread pool * fix formatting issues * [script] add class methods * fix python lint * fix pylint * [place] fix the bug to skip anneal when analytic placer is enabled * [place] rename skip_anneal to quench_only * [place] add doc for place_quench_only * [AP][GlobalPlacment] Added Bound2Bound Solver The Bound2Bound net model is a method to solve for the linear HPWL objective by iteratively solving a quadratic objective function. This method does obtain a better quality post-global placement flat placement; at the expense of being more computationally expensive. 
Found that this solver also has numerical stability issues. This may cause the CG solver to never converge which will hit the iteration limit of 2 * the number of moveable blocks. This makes this algorithm quadratic with the number of blocks in the netlist. To resolve this, set a custom iteration limit. This seems to work well on our benchmarks but may need to be revisited in the future. * [AP][GlobalPlacement] Updated B2B Solver According to Feedback * [vpr][place] rename get_initial_move_lim to get_place_inner_loop_num_move * fix a typo * Bump libs/EXTERNAL/libcatch2 from `914aeec` to `76f70b1` Bumps [libs/EXTERNAL/libcatch2](https://github.com/catchorg/Catch2) from `914aeec` to `76f70b1`. - [Release notes](https://github.com/catchorg/Catch2/releases) - [Commits](catchorg/Catch2@914aeec...76f70b1) --- updated-dependencies: - dependency-name: libs/EXTERNAL/libcatch2 dependency-version: 76f70b1403dbc0781216f49e20e45b71f7eccdd8 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * fix a few typos * added a doxygen comments * use VTR_LOGV_ERROR instead of is statements * doxygen comment for load_rr_edge_overrides() * make format * only override edge delay and not electrical stuff * [script] apply comments * [script] rename get_latest_run_dir to get_active_run_dir * [AP] Tuned the AP Flow The AP flow has many tunable knobs which trade-off quality and run time. Went through each of the knobs to find a good combination. Updates to the partial legalizer: - Reversed the order that unplaced large blocks are inserted into partitions. - Increased the bin cluster gap from 1 to 2 On the largest VTR benchmarks, this decreased the number of overfilled bins after legalization by 15% and the average overfill of each of those bins by 40%. On Titan, the number of overfilled bins decreased by 32% and the average overfill decreased by 2.5%. 
Updates to the analytical solver and global placer: - Allowed the B2B solver to stop early if it seems to be converging. - Changed the anchor weights from a linearized term to a quadratic term. - Decreased the distance epsilon from 0.5 to 0.01. - Increased the max number of B2B solver iterations from 6 to 24 - Decreased the CG iteration cap from 200 to 150. - The global placer saves the best legalized placement it has seen and returns it as its final result. On the largest VTR benchmarks, this decreased the post GP HPWL by 22% and decreased the GP run time by 17%. On Titan, the post GP HPWL decreased by 25%, and the GP run time decreased by 19%. Updates to APPack: - Decreased the max candidate distance from 0.5 (W + H) to 0.1 (W + H) for logical blocks. - Decreased the max candidate distance for all other blocks to 0.35 (W + H) - Lowered the attenuation distance threshold from 2.0 to 1.75. - Decreased the attenuation value at the distance threshold to 0.35. - Increased the max unrelated clustering distance from 1 to 5. - Increased the max number of unrelated clustering attempts from 2 to 10. - Turned off all APPack optimization for RAM blocks. On the largest VTR benchmarks, this decreased the wirelength by 2% over the un-tuned AP flow, with a 2.8% decreased pack time. On Titan, the post FL wirelength decreased by 6% and the post routing wirelength decreased by 2.6%, with a 0.7% decrease in pack time. Updates to initial placement: - Fixed oversight with how the centroid was being calculated. - Increased the range limit when searching for nearby locations when the location a cluster wants is take from 15 to 60. This further improved the post routing wirelength of Titan to 4.4% better than the un-tuned AP flow. I found that there are a lot of issues with the initial placement which may be blocking a large amount of gains. Will be investigating the initial placement code soon. 
* [Prepacker] Moved the Prepacker Out of Try Pack The AP flow makes its own prepacker which it uses throughout. However, a full legalizer in the AP flow (APPack) uses the try_pack method which creates its own prepacker. This creates two independent prepacker objects when only one is needed. Move the construction of the prepacker object into vpr_api and have it get passed into the try_pack function. * [script] afix the bug with get_next_run_dir * python lint * [vpr][place] update get_place_inner_loop_num_move comment * [vpr][place] prrint number of moves per temp after getting the number * make format * add a unit test for reading edge override file * Add edge_id() method to find an edge that connects given src and sink nodes * replace for loop with edge_id() method that return an edge connecting given src/sink nodes * add doxygen comment for edge_id() method * verify overridden edge attribute in the unit test * move operator==() and hash function of t_rr_switch_inf to physical_types.cpp * add test_read_rr_edge_override.txt * make format * add InsertNewlineAtEOF: true to .clang-format * make format to add new line at EOF * init value of false for load_flat_placement * [Pack][Timing] Abstracted How Timing is Used in the Packer Timing was intermixed into the packer. It appears as though the code originally was designed to recalculate the timing information every so often in the packer, but the idea was abandoned. This left timing code in disperse locations around the Packer and the timing was being recomputed every time clustering was restarted which was unecessary. Collecting all of the timing information from the Packer into a single object called PreClusterTimingManager which abstracts all of the timing info in the Packer. The ultimate goal is to bring this Manager class into the AP flow to be used together with the Global Placer. 
By sharing this manager class, the AP flow may be able to update the timing info with flat placement information to make the timing more accurate. * [AP][Timing] Added Basic Net Weighting Added basic timing awareness to the AP flow by weighting nets in the AP solver by their criticality (the max criticality of all edges through that net). This makes the solver try to minimize the length of nets that are more critical more than nets that are less critical (according to the pre-clustering timing analyzer). Added a command-line option to tradeoff between timing and wirelength in the AP flow. * [AP][Test] Added Titan Nightly Test of WL-Driven AP Flow * enum class for graph type * use std::vector for clb_to_clb_directs * doxygen comment for t_unified_to_parallel_seg_index * doxygen comment for get_parallel_segs() * replace t_seg_details* with std::vector<t_seg_details> * get_seg_track_counts() returns std::vector<int> + doxygen comment * move local var declarations from beginning of alloc_and_load_seg_details to where they are used * pass t_chan_width by reference * remove get_ordered_seg_track_counts() * remove t_mux, t_pin_spec, and t_mux_size_distribution structs * add docs for vtr::thread_pool * add is_root_location to grid * remove unnecessary calls to clear() * [AP][InitialPlacement] Improved Initial Placement Found that the Initial Placer stage of the AP flow (after APPack, but before Detailed Placement) was not working as expected. The intention was that clusters would be placed at their centroid location accordin to the flat placement, and if that site was illegal or taken it would take a nearby point instead (falling back on the original initial placer if nothing can be found). To achieve this, I was using a method called find_centroid_neighbor which I thought would return the nearest legal location to the given location. This was not correct. This method just creates a bounding-box and tries to find a random point in that box around the given point. 
This was causing our AP flow to move clusters WAY farther than they wanted, which moved them into places other clusters wanted to go. This was also not exhaustive, so it was often falling back on the original approach which was putting clusters in practically random locations. All of this was causing the post-FL placement from the AP flow to actually have worse quality than the default AP flow! To resolve this, I wrote the actual method I was intending. It performs a BFS-style search from the src location to all legal locations and returns the closest one. By doing this BFS on the compressed grid, I found that this is actually quite efficient. With these changes, I found that the quality of the post-FL placement more than doubled and the average atom displacement from the GP solution decrease dramatically. * move t_seg_details, t_chan_seg_details, and t_chan_details to rr_types.h * fix compilation error in test_connection router and the warning in rr_graph2.cpp * move t_sblock_pattern to rr_types.h * make format * [vpr][place] remove get_net_wirelength_from_layer_bb_ from netcosthandler class * [vpr][place] make get_net_wirelength_from_layer_bb_ static function and update its parameters * [vpr][place] use appropiate wirelength est function * make format * [test] add strong 3d * fix signal 6 in stratix 10 arch strong test * apply PR comments * add the requested comments * update file_formats.rst * add --read_rr_edge_override to command_line_usage.rst * remove duplicate text in command_line_usage.rst * [vpr][place] apply review comments * make format * make format * [vpr][tileable] add include * remove unused function linear_regression_vector() * add write_channel_occupancy_to_file() * write channel coordinate and occupancy percentage to file * make columns aligns in channel utilization files * update submodule * make format * [libs][arch] return -1 if valid index is not found * make format * [libs][arch] comment unused vars * refactor the code to use the same code 
for both x and y channels * [libs][pugiutil] delete pointer * [libs][pugiutil] format issue * fix format * [libs][archfpga] comment parse_pin_name * [libs][encrypt] break the line to read file * [vpr][base] call setupvipinf if vib_infs is not empty * [libs][encrypt] initialize plaintext only if file is open * [libs][encrypt] use rdbuf to read a file to avoid gcc-13 warning * [libs][decrypt] rading a file in safe way to prevent gcc13 warning * [vpr][vib_grid] fix type name if type is nullptr * [vpr][tileable] resize if segment inf size is not zero * [vpr][tileable] use empty method instead of checking size * [vpr][tileable] set the size when defining the vector (gcc warning) * fix format * Add Github action to close stale issues The added workflow will close up to 30 old issues every day. Issues that have been inactive for more than a year will be first marked as stale, and if they remain stale after 15 days they will be automatically closed. * Add documentation for automatic issue closure * [test] fix strong constraint * [lib][arch] check num_interconnect is bigger than zero * [vpr][route] add a condition to not increment delta_seg if the segment is on the edge * [vpr][route] fix max seg idx * fix formatting * Change some internal packer APIs to not use C-style arrays This commit changes some functions that used C-style arrays to use std::vector instead. Previously we used the .data() method of std::vectors to pass a pointer to these functions. * pass by reference and typo * Clean up prepacker This commit changes two functions in the prepacker to get the specific element of the array they work with and not the entire array. 
* Change vector variable name to be more inline with the current style * remove scratch vectors from Move context * NetCostHandler is the owner of all bb-related data * remove PlacerMoveContext * define MoveGenerator::first_rlim * use #pragma once in move generator header files * make format * fix typo * get_bb_from_scratch_() accepts use_ts as its argument * [libs][librrgraph] update echo file of rr graph * [test][strong] update golden result * [test] update strong tileable golden result * explain what RR edge override feature is useful for * [test][tileable] update golden results * add comment for MoveGenerator::first_rlim * [STA] Updated SDF File Generation to Include Min Delays The SDF file generated by the post-implementation netlist writer was only using the max delays of timing connections in the timing graph. In the SDF file, it set all values of the rising and falling triples to the max delay. When using this SDF file for external timing analysis, the minimum timing (hold) paths were incorrect. Updated the netlist writer to work with triples instead of bare delays. This allows (minimum, typical, maximum) delays to be passed through the different functions and be printed cleanly. For standard delay signals in the circuit (not setup / hold times) Tatum provides the minimum delays. These are now being printed in the SDF file and the minimum timing paths are being found correctly in the external timing analyzer. Cleaned up some parts of the netlist printing code as well. 1) netlist_writer.cpp declared many functions in the global scope which may cause conflicts at link time in VTR. Put all of these methods in anonymous namespace to prevent this. 2) The code was casting the delays from seconds to picoseconds in strange places. This was tricky to work with since these are both stored as doubles. Changed all of the code to only work with delays in seconds, and only cast to picoseconds when printing. 3) General cleanup of the header file and the include files. 
* [STA] Updated How Un-Initialized Delay Triples are Handled Thank you to Fred Tombs for pointing out this issue! * [AP][InitialPlacement] Created Isolated AP Flow The old Initial Placer used in the AP flow was constructed within the initial placer of the non-AP flow. This forced the AP flow to try to place blocks one at a time with minimum displacement. This is non-ideal since blocks that were placed earlier were being getting first picks at locations, which may displace a future cluster which may be a better fit for that location. Separated out the AP initial placement code. For AP, initial placement is done in passes. The first pass will try to place clusters exactly at the tile that the centroid of all atoms within the cluster want to be placed (according to the global placement). Any clusters that could not be placed are reserved for the next pass. The second pass will allow clusters to be placed within 1 tile of their centroid. All subsequent passes will allow cluster to be placed exponentially farther from their centroid. The initial placement terminates when all clusters have been placed or if the max displacement is the size of the entire device. The clusters are sorted based on the size of the macro that contains them and the variance of the placement of the atoms within the macro. This allows large macro blocks with low variance to be placed first. * add doxygen comment for X_coord, Y_coord, and layer_coord * remove X_coord and Y_coord from feasibe_region_move_generator * add comment explaining ts and permanent data members * make format * [AP] General Fixed/Unfixed Blocks Cleanup Fixed a couple of small known issues around the AP flow related to how we handle fixed blocks. Offset the fixed block locations by 0.5 such that they are no longer on the edge. Previously, fixed blocks were placed at the root location of tiles. 
This was a problem since atoms would want to be generally close to the fixed block and may be biased to the bottom/left tiles to the fixed-block tile. This does not handle large tiles, but will help in general. If no fixed blocks are provided, the AP solver will always produce the trivial solution (all blocks placed on top of one another anywhere on the device). We were wasting time running bound2bound to solve this and the solution was probably being put on the bottom-left corner (0,0) which is not ideal. Instead of running bound2bound during the first iteration in this case, just placed all blocks in the center of the device. This greatly speeds up the first iteration when no fixed blocks are provided. * Remove atom_net global context mutation from packer * [vpr][tileable_rr_graph] fix rr_switch usage --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: Amir Poolad <[email protected]> Co-authored-by: Amir Poolad <[email protected]> Co-authored-by: AlexandreSinger <[email protected]> Co-authored-by: AlexandreSinger <[email protected]> Co-authored-by: vaughnbetz <[email protected]> Co-authored-by: Soheil Shahrouz <[email protected]> Co-authored-by: Duck Deux <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: soheilshahrouz <[email protected]>
Labels
- build: Build system
- docs: Documentation
- external_libs
- infra: Project Infrastructure
- lang-cpp: C/C++ code
- lang-hdl: Hardware Description Language (Verilog/VHDL)
- lang-make: CMake/Make code
- lang-netlist
- lang-python: Python code
- lang-shell: Shell scripts (bash etc.)
- libarchfpga: Library for handling FPGA Architecture descriptions
- liblog
- libpugiutil
- libvtrutil
- Odin: Odin II Logic Synthesis Tool: Unsorted item
- Parmys
- scripts: Utility & Infrastructure scripts
- VPR: VPR FPGA Placement & Routing Tool
Description
Bring the tileable routing resource graph builder from OpenFPGA to VPR.
Full details about the tileable routing resource graph builder can be found at
X. Tang, E. Giacomin, A. Alacchi and P. Gaillardon, "A Study on Switch Block Patterns for Tileable FPGA Routing Architectures," 2019 International Conference on Field-Programmable Technology (ICFPT), 2019, pp. 247-250, doi: 10.1109/ICFPT47387.2019.00039.
fpt2019_final.pdf
https://ieeexplore.ieee.org/document/8977869
Related Issue
Motivation and Context
The tileable routing resource graph builder is an alternative to the existing routing resource graph builder in VTR.
Being compatible with existing data structures (RRGraphView and RRGraphBuilder), this new feature enables VTR to support FPGA devices created by OpenFPGA.
User Interface
The tileable routing resource graph builder can be enabled through XML syntax in the architecture description language.
The tileable rr_graph generator also supports mixed switch block patterns: wires that start or end in a switch block follow one switch block pattern, while wires that pass through a switch block can follow another.
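The mixed-pattern rule above can be illustrated with a small sketch. This is a toy model, not the code in this PR: `SbPattern`, `MixedSbConfig`, and `pattern_for_wire` are hypothetical names, and a wire is assumed to be described only by the switch block positions where it starts and ends.

```cpp
#include <cassert>

// Hypothetical types for illustration only; these are not VPR's data structures.
enum class SbPattern { Subset, Universal, Wilton };

struct MixedSbConfig {
    SbPattern terminating; // pattern for wires that start or end in the switch block
    SbPattern passing;     // pattern for wires that pass through the switch block
};

// Assumption: a wire touches every switch block position in [wire_start, wire_end].
// It terminates at the two end positions and passes through the ones in between.
SbPattern pattern_for_wire(const MixedSbConfig& cfg, int wire_start, int wire_end, int sb_pos) {
    assert(wire_start <= sb_pos && sb_pos <= wire_end);
    const bool terminates = (sb_pos == wire_start) || (sb_pos == wire_end);
    return terminates ? cfg.terminating : cfg.passing;
}
```

With a configuration such as `{SbPattern::Wilton, SbPattern::Subset}`, a wire spanning positions 0 to 4 would use the first pattern at positions 0 and 4 and the second pattern at positions 1 to 3.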
- SIGSTKSZ in libcatch2, which is not supported in Ubuntu 21.04+
- VTR_ENABLE_VERSION (on by default), which allows developers to skip the version build when integrating VTR as a submodule
- is_real_param() in read_blif.cpp (borrowed from another feature branch of Antmicro)
Known Limitations
Checklist
Bugs/Issues found
- num_class in type_descriptor is not used. It is always set to 0 regardless of the list size of class_inf. Suggest removing it.
- resize_node(): it may mistakenly reset node_lookup() when called incrementally. Calling reserve_node to pre-allocate memory bypasses this bug.
- When --write_block_usage is enabled, the block usage is only shown in std::cout or an external file. As a result, the information is not included in vpr_stdout.log, since it does not use VTR_LOG. See vtr-verilog-to-routing/vpr/src/base/ShowSetup.cpp, lines 174 to 179 in 50b56f3.
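The resize/reserve issue in the list above can be pictured with a toy container. This is only a sketch of one plausible way such a reset could happen; `ToyNodeStore` and its methods are hypothetical stand-ins and do not reproduce the real RRGraphBuilder code.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Toy stand-in for a node store with a side lookup table (assumed design).
class ToyNodeStore {
  public:
    // Mimics the reported hazard: resizing also resets the lookup,
    // so entries registered before this call are lost.
    void resize_nodes(std::size_t n) {
        nodes_.resize(n);
        lookup_.clear();
    }

    // Pre-allocates capacity only; existing nodes and lookup entries stay intact.
    void reserve_nodes(std::size_t n) { nodes_.reserve(n); }

    // Appends a node and registers it in the lookup under `key`.
    std::size_t add_node(int key) {
        assert(lookup_.count(key) == 0); // each key is registered once
        nodes_.push_back(key);
        const std::size_t id = nodes_.size() - 1;
        lookup_[key] = id;
        return id;
    }

    bool find(int key) const { return lookup_.count(key) != 0; }

  private:
    std::vector<int> nodes_;
    std::unordered_map<int, std::size_t> lookup_;
};
```

In this model, calling `reserve_nodes` between batches of `add_node` calls keeps earlier lookup entries, while calling `resize_nodes` between batches drops them, which matches the workaround described above.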
How Has This Been Tested?
Here is a list of regression tests to be added to cover the existing features/options for customizing routing resource graphs.
Types of changes
Checklist: