Commit 5a9dabc

Merge pull request #416 from itamarst/410.large-numpy-arrays
Better support for large numpy arrays
2 parents 864d00d + 27b60e9 commit 5a9dabc

4 files changed, with 55 additions and 1 deletion.


docs/source/news.rst

Lines changed: 4 additions & 0 deletions
@@ -11,6 +11,10 @@ Features:
 when the program runs. Fixes #403.
 * PyPy3 is now officially supported.

+Changes:
+
+* If you log a NumPy array whose size > 10000, only a subset will be logged. This is to ensure logging giant arrays by mistake doesn't impact your software's performance. If you want to customize logging of large arrays, see :ref:`large_numpy_arrays`. Fixes #410.
+
 1.8.0
 ^^^^^

docs/source/scientific-computing.rst

Lines changed: 18 additions & 0 deletions
@@ -13,6 +13,24 @@ Eliot is an ideal logging library for these cases:

 At PyCon 2019 Itamar Turner-Trauring gave a talk about logging for scientific computing, in part using Eliot. You can `watch the video <https://pyvideo.org/pycon-us-2019/logging-for-scientific-computing-reproducibility-debugging-optimization.html>`_ or `read a prose version <https://pythonspeed.com/articles/logging-for-scientific-computing/>`_.

+.. _large_numpy_arrays:
+
+Logging large arrays
+--------------------
+
+Logging large arrays is a problem: it will take a lot of CPU, and it's no fun discovering that your batch process was slow because you mistakenly logged an array with 30 million integers every time you called a core function.
+
+So how do you deal with logging large arrays?
+
+1. **Log a summary (default behavior):** By default, if you log an array with size > 10,000, Eliot will only log the first 10,000 values, along with the shape.
+2. **Omit the array:** You can also just choose not to log the array at all.
+   With ``log_call`` you can use the ``include_args`` parameter to ensure the array isn't logged (see :ref:`log_call decorator`).
+   With ``start_action`` you can just not pass it in.
+3. **Manual transformation:** If you're using ``start_action`` you can also manually modify the array yourself before passing it in.
+   For example, you could write it to some sort of temporary storage, and then log the path to that file.
+   Or you could summarize it some other way than the default.
+
+
 .. _dask_usage:

 Using Dask
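The "manual transformation" option in the documentation hunk above could look roughly like the following. This is a minimal sketch, not part of the commit: `summarize_array` and `save_array` are hypothetical helpers, and the field names in the final comment are illustrative only.

```python
import os
import tempfile

import numpy as np


def summarize_array(arr):
    """Return a small, JSON-friendly description of an array (hypothetical helper)."""
    return {
        "shape": list(arr.shape),
        "dtype": str(arr.dtype),
        "min": float(arr.min()),
        "max": float(arr.max()),
        "mean": float(arr.mean()),
    }


def save_array(arr, directory=None):
    """Write the array to a .npy file and return its path (hypothetical helper)."""
    fd, path = tempfile.mkstemp(suffix=".npy", dir=directory)
    with os.fdopen(fd, "wb") as f:
        np.save(f, arr)
    return path


data = np.arange(30_000, dtype=np.float64).reshape(3, 10_000)
summary = summarize_array(data)
path = save_array(data)

# Either value is small enough to pass as an action field, for example:
# start_action(action_type="myapp:compute", data_summary=summary, data_path=path)
# (the action type and field names here are assumptions, chosen for illustration)
```

Both values stay small however large the array is, so logging them avoids the CPU cost the documentation warns about.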

eliot/json.py

Lines changed: 8 additions & 1 deletion
@@ -22,7 +22,14 @@ def default(self, o):
             if isinstance(o, (numpy.bool, numpy.bool_)):
                 return bool(o)
             if isinstance(o, numpy.ndarray):
-                return o.tolist()
+                if o.size > 10000:
+                    # Too big to want to log as-is, log a summary:
+                    return {
+                        "array_start": o.flat[:10000].tolist(),
+                        "original_shape": o.shape,
+                    }
+                else:
+                    return o.tolist()
         return json.JSONEncoder.default(self, o)

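To see the effect of this change without installing Eliot, the same logic can be sketched as a standalone encoder. `SummarizingEncoder` and `MAX_SIZE` are hypothetical names used only here; this is not Eliot's `EliotJSONEncoder`, which handles other NumPy types as well.

```python
import json

import numpy as np

MAX_SIZE = 10000  # same threshold as in the diff above


class SummarizingEncoder(json.JSONEncoder):
    """Hypothetical stand-in for illustration; not Eliot's EliotJSONEncoder."""

    def default(self, o):
        if isinstance(o, np.ndarray):
            if o.size > MAX_SIZE:
                # Too big to log as-is: keep the first MAX_SIZE values
                # (in flattened order) plus the original shape.
                return {
                    "array_start": o.flat[:MAX_SIZE].tolist(),
                    "original_shape": o.shape,
                }
            return o.tolist()
        return super().default(o)


small = np.zeros(10000)        # exactly at the threshold: serialized in full
large = np.zeros((2, 5001))    # 10002 elements: summarized

small_out = json.loads(json.dumps(small, cls=SummarizingEncoder))
large_out = json.loads(json.dumps(large, cls=SummarizingEncoder))
```

Note that the comparison is strict (`>`), so an array of exactly 10,000 elements is still logged in full, and that the shape tuple comes back from JSON as a list.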

eliot/tests/test_json.py

Lines changed: 25 additions & 0 deletions
@@ -6,6 +6,7 @@

 from unittest import TestCase, skipUnless, skipIf
 from json import loads, dumps
+from math import isnan

 try:
     import numpy as np
@@ -18,6 +19,13 @@
 class EliotJSONEncoderTests(TestCase):
     """Tests for L{EliotJSONEncoder}."""

+    def test_nan_inf(self):
+        """NaN, inf and -inf are round-tripped."""
+        l = [float("nan"), float("inf"), float("-inf")]
+        roundtripped = loads(dumps(l, cls=EliotJSONEncoder))
+        self.assertEqual(l[1:], roundtripped[1:])
+        self.assertTrue(isnan(roundtripped[0]))
+
     @skipUnless(np, "NumPy not installed.")
     def test_numpy(self):
         """NumPy objects get serialized to readable JSON."""
@@ -62,3 +70,20 @@ def test_numpy_not_imported(self):
             with self.assertRaises(TypeError):
                 dumps([object()], cls=EliotJSONEncoder)
             self.assertEqual(dumps(12, cls=EliotJSONEncoder), "12")
+
+    @skipUnless(np, "NumPy is not installed.")
+    def test_large_numpy_array(self):
+        """
+        Large NumPy arrays are not serialized completely, since this is (A) a
+        performance hit (B) probably a mistake on the user's part.
+        """
+        a1000 = np.array([0] * 10000)
+        self.assertEqual(loads(dumps(a1000, cls=EliotJSONEncoder)), a1000.tolist())
+        a1002 = np.zeros((2, 5001))
+        a1002[0][0] = 12
+        a1002[0][1] = 13
+        a1002[1][1] = 500
+        self.assertEqual(
+            loads(dumps(a1002, cls=EliotJSONEncoder)),
+            {"array_start": a1002.flat[:10000].tolist(), "original_shape": [2, 5001]},
+        )
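The `test_nan_inf` test above relies on how Python's standard `json` module treats non-finite floats. The sketch below shows that behavior using the plain standard-library encoder. That `EliotJSONEncoder` inherits it unchanged is an assumption here, consistent with the test but not shown in this diff.

```python
import json
from math import isnan

values = [float("nan"), float("inf"), float("-inf")]

# By default the json module emits the JavaScript-style tokens
# NaN, Infinity and -Infinity, which are not strictly valid JSON.
encoded = json.dumps(values)
decoded = json.loads(encoded)

# With allow_nan=False the encoder refuses non-finite floats instead.
try:
    json.dumps(values, allow_nan=False)
    strict_error = None
except ValueError as e:
    strict_error = e
```

NaN never compares equal to itself, which is why the test checks the first element with `isnan` rather than `assertEqual`.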
