To make the Pandas APIs jittable, Pandas should avoid choosing data types based on values and avoid the object type.
For example, Pandas should default to nullable types directly instead of NumPy/object types. Currently, if you supply a value that Pandas treats as NA (e.g. None) without using pd.array, you do not get a nullable type; instead you get a NumPy-backed dtype, often an object array.
Arrays with NAs should automatically use the correct nullable type. E.g. pd.DataFrame({'A': [1, 2, None], 'B': [True, False, None]}) should have column A of type Int64 rather than float64, and column B of type boolean rather than object. Users can still specify non-nullable data types explicitly if necessary.
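A short sketch of the gap described above: today the constructor picks float64 and object based on the presence of None, while convert_dtypes (or an explicit nullable dtype passed to pd.array) produces the types this proposal would make the default.

```python
import pandas as pd

# Current behavior: None forces a value-dependent NumPy dtype choice.
df = pd.DataFrame({'A': [1, 2, None], 'B': [True, False, None]})
print(df.dtypes)  # A: float64, B: object

# Requesting nullable dtypes explicitly yields the proposed defaults.
nullable_df = df.convert_dtypes()
print(nullable_df.dtypes)  # A: Int64, B: boolean

# Equivalently, a nullable dtype can be specified up front:
a = pd.array([1, 2, None], dtype="Int64")
print(a.dtype)  # Int64
```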
For the output of I/O calls like read_csv, the data types should always be nullable types, not determined by the values.
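Pandas 2.0 already exposes this behavior as an opt-in through the dtype_backend parameter of read_csv; the proposal would make it the default. A minimal sketch:

```python
import io
import pandas as pd

csv_text = "a,b\n1,1.5\n2,\n"

# Default: dtypes are inferred from the values (int64; float64 with NaN).
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df.dtypes)  # a: int64, b: float64

# Opting into nullable dtypes (pandas >= 2.0) gives value-independent types.
nullable_io_df = pd.read_csv(io.StringIO(csv_text), dtype_backend="numpy_nullable")
print(nullable_io_df.dtypes)  # a: Int64, b: Float64
```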