
Commit 568cc23

Merge pull request #4 from msmk0/version1_fixes
Fixes for version 1
2 parents 78b71a3 + 8727bf2 commit 568cc23

File tree: 3 files changed (+57 -41 lines)

README.md (+43 -28)

@@ -1,8 +1,9 @@
-Tracking machine learning challenge (TrackML) utility library
-=============================================================
+TrackML utility library
+=======================
 
-A python library to simplify working with the dataset of the tracking machine
-learning challenge.
+A python library to simplify working with the
+[High Energy Physics Tracking Machine Learning challenge](kaggle_trackml)
+dataset.
 
 Installation
 ------------
@@ -50,9 +51,10 @@ for event_id, hits, cells, particles, truth in load_dataset('path/to/dataset'):
     ...
 ```
 
-Each event is lazily loaded during the iteration. Options are available to
-read only a subset of available events or only read selected parts, e.g. only
-hits or only particles.
+The dataset path can be the path to a directory or to a zip file containing the
+events `.csv` files. Each event is lazily loaded during the iteration. Options
+are available to read only a subset of available events or only read selected
+parts, e.g. only hits or only particles.
 
 To generate a random test submission from truth information and compute the
 expected score:
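
For reference, the loading options described in the changed text can be sketched as follows; the `nevents` and `parts` keyword names are assumptions about the library's interface, not something verified in this commit:

```python
from trackml.dataset import load_dataset

# The dataset path may be a directory or a zip file with the per-event .csv
# files. Each event is loaded lazily while iterating.
for event_id, hits, cells, particles, truth in load_dataset('path/to/train_sample.zip'):
    print(event_id, len(hits), len(particles))

# Hypothetical use of the documented options (keyword names assumed):
# read only a handful of events and only selected parts.
for event_id, hits, truth in load_dataset('path/to/dataset',
                                          nevents=5,
                                          parts=['hits', 'truth']):
    print(event_id, hits.shape, truth.shape)
```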
@@ -65,8 +67,8 @@ shuffled = shuffle_hits(truth, 0.05) # 5% probability to reassign a hit
 score = score_event(truth, shuffled)
 ```
 
-All methods either take or return `pandas.DataFrame` objects. Please have a look
-at the function docstrings for detailed documentation.
+All methods either take or return `pandas.DataFrame` objects. You can have a
+look at the function docstrings for detailed information.
 
 Authors
 -------
@@ -94,9 +96,11 @@ some hits can be left unassigned). The training dataset contains the recorded
 hits, their truth association to particles, and the initial parameters of those
 particles. The test dataset contains only the recorded hits.
 
-The dataset is provided as a set of plain `.csv` files ('.csv.gz' or '.csv.bz2'
-are also allowed)'. Each event has four associated files that contain hits,
-hit cells, particles, and the ground truth association between them. The common prefix (like `event000000000`) is fully constrained to be `event` followed by 9 digits.
+The dataset is provided as a set of plain `.csv` files (`.csv.gz` or `.csv.bz2`
+are also allowed). Each event has four associated files that contain hits, hit
+cells, particles, and the ground truth association between them. The common
+prefix (like `event000000000`) is fully constrained to be `event` followed by 9
+digits.
 
     event000000000-hits.csv
     event000000000-cells.csv
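
To make the naming constraint above concrete, here is a small standalone sketch (a hypothetical helper, not part of the library) that accepts only the `event` prefix followed by exactly nine digits, with an optional compression suffix:

```python
import re

# Matches e.g. 'event000000000-hits.csv' or 'event000000123-truth.csv.gz';
# the prefix must be 'event' followed by exactly 9 digits.
EVENT_FILE_RE = re.compile(
    r'^event(\d{9})-(hits|cells|particles|truth)\.csv(\.gz|\.bz2)?$')

def parse_event_filename(name):
    """Return (event_id, part) for a valid per-event file name, else None."""
    m = EVENT_FILE_RE.match(name)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)

print(parse_event_filename('event000000000-cells.csv'))      # (0, 'cells')
print(parse_event_filename('event000001000-truth.csv.bz2'))  # (1000, 'truth')
```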
@@ -132,15 +136,17 @@ are given here to simplify detector-specific data handling.
 ### Event hit cells
 
 The cells file contains the constituent active detector cells that comprise each
-hit. A cell is the smallest granularity inside each detector module, much like a pixel on a screen, except that depending on the volume_id a cell can be a square or a long rectangle. It is
-identified by two channel identifiers that are unique within each detector
-module and encode the position, much like row/column numbers of a matrix. A cell can provide signal information that the
-detector module has recorded in addition to the position. Depending on the
-detector type only one of the channel identifiers is valid, e.g. for the strip
-detectors, and the value might have different resolution.
+hit. A cell is the smallest granularity inside each detector module, much like a
+pixel on a screen, except that depending on the volume_id a cell can be a square
+or a long rectangle. It is identified by two channel identifiers that are unique
+within each detector module and encode the position, much like column/row
+numbers of a matrix. A cell can provide signal information that the detector
+module has recorded in addition to the position. Depending on the detector type
+only one of the channel identifiers is valid, e.g. for the strip detectors, and
+the value might have different resolution.
 
 * **hit_id**: numerical identifier of the hit as defined in the hits file.
-* **ch0, ch1**: channel identifier/coordinates unique with one module.
+* **ch0, ch1**: channel identifier/coordinates unique within one module.
 * **value**: signal value information, e.g. how much charge a particle has
   deposited.
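
As an illustration of the columns listed above (a sketch using only the documented columns; the aggregation itself is not library code), the per-hit cluster size and total signal can be summarised with pandas:

```python
import pandas

# Columns per the description above: hit_id, ch0, ch1, value.
cells = pandas.read_csv('event000000000-cells.csv')

# Summarise each hit: number of active cells and total recorded signal.
per_hit = cells.groupby('hit_id')['value'].agg(['size', 'sum'])
per_hit.columns = ['ncells', 'total_value']
print(per_hit.head())
```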

@@ -149,7 +155,8 @@ detectors, and the value might have different resolution.
 The particles files contains the following values for each particle/entry:
 
 * **particle_id**: numerical identifier of the particle inside the event.
-* **vx, vy, vz**: initial position (in millimeters) (vertex) in global coordinates.
+* **vx, vy, vz**: initial position or vertex (in millimeters) in global
+  coordinates.
 * **px, py, pz**: initial momentum (in GeV/c) along each global axis.
 * **q**: particle charge (as multiple of the absolute electron charge).
 * **nhits**: number of hits generated by this particle
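
As a worked example of the particle columns (illustrative only, not library code), the transverse momentum follows directly from px and py:

```python
import numpy
import pandas

particles = pandas.read_csv('event000000000-particles.csv')

# Transverse momentum (GeV/c) from the initial momentum components.
particles['pt'] = numpy.hypot(particles['px'], particles['py'])

# Example selection: charged particles that left at least one hit.
selected = particles[(particles['q'] != 0) & (particles['nhits'] > 0)]
print(selected[['particle_id', 'pt', 'nhits']].head())
```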
@@ -165,23 +172,31 @@ particle/track.
 * **hit_id**: numerical identifier of the hit as defined in the hits file.
 * **particle_id**: numerical identifier of the generating particle as defined
   in the particles file.
-* **tx, ty, tz** true intersection point in global coordinates (in millimeters) between
-  the particle trajectory and the sensitive surface.
-* **tpx, tpy, tpz** true particle momentum (in GeV/c) in the global coordinate system
-  at the intersection point. The corresponding unit vector is tangent to the particle trajectory.
+* **tx, ty, tz** true intersection point in global coordinates (in
+  millimeters) between the particle trajectory and the sensitive surface.
+* **tpx, tpy, tpz** true particle momentum (in GeV/c) in the global
+  coordinate system at the intersection point. The corresponding vector
+  is tangent to the particle trajectory at the intersection point.
 * **weight** per-hit weight used for the scoring metric; total sum of weights
   within one event equals to one.
 
 ### Dataset submission information
 
-The submission file must associate each hit in each event to one and only one reconstructed particle track. The reconstructed tracks must be uniquely identified only within each event. Participants are advised to compress the submission file (with zip, bzip2, gzip) before submission to Kaggle site.
+The submission file must associate each hit in each event to one and only one
+reconstructed particle track. The reconstructed tracks must be uniquely
+identified only within each event. Participants are advised to compress the
+submission file (with zip, bzip2, gzip) before submission to the
+[Kaggle site](kaggle_trackml).
 
 * **event_id**: numerical identifier of the event; corresponds to the number
   found in the per-event file name prefix.
-* **hit_id**: numerical identifier (non negative integer) of the hit inside the event as defined in the per-event hits file.
-* **track_id**: user defined numerical identifier (non negative integer) of the track
+* **hit_id**: numerical identifier of the hit inside the event as defined in
+  the per-event hits file.
+* **track_id**: user-defined numerical identifier (non-negative integer) of
+  the track
 
 
-[cern]: https://home.cern/
+[cern]: https://home.cern
 [lhc]: https://home.cern/topics/large-hadron-collider
 [mit_license]: http://www.opensource.org/licenses/MIT
+[kaggle_trackml]: https://www.kaggle.com/c/trackml-particle-identification
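
To illustrate the submission format above, here is a minimal sketch (a dummy single-track assignment, purely for illustration) that produces the three required columns and compresses the output as recommended:

```python
import pandas

event_ids = [0]  # numerical event ids covered by the submission
frames = []
for event_id in event_ids:
    hits = pandas.read_csv('event{:09d}-hits.csv'.format(event_id))
    frames.append(pandas.DataFrame({
        'event_id': event_id,
        'hit_id': hits['hit_id'],
        # Dummy assignment: every hit put on one reconstructed track per event.
        'track_id': 1,
    }))

submission = pandas.concat(frames, ignore_index=True)
# gzip compression as advised above; zip or bzip2 work as well.
submission.to_csv('submission.csv.gz', index=False, compression='gzip')
```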

setup.py (+5 -5)

@@ -11,15 +11,15 @@
 
 setup(
     name='trackml',
-    version='1b0',
+    version='1',
     description='TrackML utility library',
     long_description=long_description,
     long_description_content_type='text/markdown',
-    # url='TODO',
-    author='Moritz Kiehn', # TODO who else
-    author_email='[email protected]', # TODO or mailing list
+    url='https://github.com/LAL/trackml-library',
+    author='Moritz Kiehn',
+    author_email='[email protected]',
     classifiers=[
-        'Development Status :: 4 - Beta', # TODO update for first release
+        'Development Status :: 5 - Production/Stable',
         'Intended Audience :: Science/Research',
         'Topic :: Scientific/Engineering :: Information Analysis',
         'Topic :: Scientific/Engineering :: Physics',

trackml/weights.py (+9 -8)

@@ -87,18 +87,20 @@ def weight_hits(truth, particles):
     truth : pandas.DataFrame
         Truth information. Must have hit_id, particle_id, and tz columns.
     particles : pandas.DataFrame
-        Particle information. Must have particle_id, vz, px, and py columns.
+        Particle information. Must have particle_id, vz, px, py, and nhits
+        columns.
 
     Returns
     -------
     pandas.DataFrame
-        `truth` augmented with additional columns: ihit, nhits, weight_order,
-        weight_pt, and weight.
+        `truth` augmented with additional columns: particle_nhits, ihit,
+        weight_order, weight_pt, and weight.
     """
     # fill selected per-particle information for each hit
     selected = pandas.DataFrame({
         'particle_id': particles['particle_id'],
         'particle_vz': particles['vz'],
+        'particle_nhits': particles['nhits'],
         'weight_pt': weight_pt(numpy.hypot(particles['px'], particles['py'])),
     })
     combined = pandas.merge(truth, selected,
@@ -107,15 +109,14 @@ def weight_hits(truth, particles):
 
     # fix pt weight for hits w/o associated particle
     combined['weight_pt'].fillna(0.0, inplace=True)
-
+    # fix nhits for hits w/o associated particle
+    combined['particle_nhits'].fillna(0.0, inplace=True)
+    combined['particle_nhits'] = combined['particle_nhits'].astype('i4')
     # compute hit count and order using absolute distance from particle vertex
     combined['abs_dvz'] = numpy.absolute(combined['tz'] - combined['particle_vz'])
-    combined['nhits'] = combined.groupby('particle_id')['abs_dvz'].transform(numpy.size).astype('i4')
-    combined.loc[combined['particle_id'] == INVALID_PARTICLED_ID, 'nhits'] = 0
     combined['ihit'] = combined.groupby('particle_id')['abs_dvz'].rank().transform(lambda x: x - 1).fillna(0.0).astype('i4')
-
     # compute order-dependent weight
-    combined['weight_order'] = combined[['ihit', 'nhits']].apply(weight_order, axis=1)
+    combined['weight_order'] = combined[['ihit', 'particle_nhits']].apply(weight_order, axis=1)
 
     # compute combined weight normalized to 1
     w = combined['weight_pt'] * combined['weight_order']
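
For orientation, here is a small sketch of how the revised `weight_hits` could be exercised on toy frames; the column names follow the docstring above, while the values and the use of `particle_id` 0 for an unassigned hit are assumptions made purely for illustration:

```python
import pandas
from trackml.weights import weight_hits

# Toy truth: three hits, two belonging to particle 1, one unassigned
# (particle_id 0 assumed to denote a hit without an associated particle).
truth = pandas.DataFrame({
    'hit_id': [1, 2, 3],
    'particle_id': [1, 1, 0],
    'tz': [10.0, 20.0, 5.0],
})
# Toy particles: columns required by the docstring above.
particles = pandas.DataFrame({
    'particle_id': [1],
    'vz': [0.0],
    'px': [0.3],
    'py': [0.4],
    'nhits': [2],
})

weighted = weight_hits(truth, particles)
# New columns per the docstring: particle_nhits, ihit, weight_order,
# weight_pt, and weight (weights sum to one within the event).
print(weighted[['hit_id', 'particle_nhits', 'ihit', 'weight']])
```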
