-
Notifications
You must be signed in to change notification settings - Fork 0
Somalia/ipc #22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Somalia/ipc #22
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
I think main suggestion is to fail fast when things are unexpected rather than passing on None values or empty dataframes which then later have to be checked if they are None/empty. And quite some of the complexity of the code can be reduced by validating a bit stronger/earlier.
Let me know if you have any questions about my comments :)
| load_dotenv() | ||
|
|
||
|
|
||
| class IpcPropertiesSchema(BaseModel): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many fields are optional and are allowed to be None, is this what you want? Is data were all these fields are None valid? Downside of allowing None's is that then later in the code you have to keep checking if things are None or not. This often leads to difficult to read code with many if statements. So it's advised to keep the locations where you allow None to a minimum and directly raise an error when things are not as the are expected to be.
I would suggets to think about if you really need these fields. If not remove them. If you do need them you probably need them to be something else then None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am aware of it. The data extracted from the API is pretty messy. There are nested geometry features in those features with from_date and to_date None. Only after exploding the nested features, I can extract the right data from there.
I wonder what else we could do to refine it from the tap like here?
| if geom_dict.get("type") == "GeometryCollection" and geom_dict.get("coordinates") is None: | ||
| geom_dict.pop("coordinates", None) | ||
|
|
||
| # skip invalid geometry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont know exactly what a geometry should look like, but can invalid geomeries not be filtered out by means of the pydantic schema's above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the nested features issue seems to be quite random. And there are too many features to go through (Somalia has about 3500 features after filtered). Thats why I couldn't structure it in the pydantic schema and opt for "get first, filter later"
| geom_dict = copy.deepcopy(feature["geometry"]) | ||
|
|
||
| # fix invalid GeometryCollection: remove "coordinates": None | ||
| if geom_dict.get("type") == "GeometryCollection" and geom_dict.get("coordinates") is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a sympthom of allowing None's
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is one of the issues caused by the messy nested features problem above. Happy to brainstorm about this to find another solution
|
The integration test fails on retrieving env var, when you add |
p-phung
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
many of the issue causes by lack of type check for IpcGeometrySchema. I have added this type check Literal as you suggests them and it seem those issues are gone.
still one of the major issue is the input data from IPC api is quite messy with lots of random nested features that makes it difficult to do pydantic check. If you have some advice, happy to learn about it
| load_dotenv() | ||
|
|
||
|
|
||
| class IpcPropertiesSchema(BaseModel): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am aware of it. The data extracted from the API is pretty messy. There are nested geometry features in those features with from_date and to_date None. Only after exploding the nested features, I can extract the right data from there.
I wonder what else we could do to refine it from the tap like here?
|
|
||
| id: str = Field(description="Identifier of data.", examples=["12511433"]) | ||
| title: str = Field(description="Title of data.", examples=["Somalia - Acute Food Insecurity July 2017"]) | ||
| link: str | None = Field( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not actually needed. Should I make it optional instead?
| description="Type of IPC Analysis (condition).", | ||
| examples=["A", "C"], | ||
| ) | ||
| created: str = Field(description="Creation date of IPC data.", examples=["2018-06-04"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how handy!
| geom_dict = copy.deepcopy(feature["geometry"]) | ||
|
|
||
| # fix invalid GeometryCollection: remove "coordinates": None | ||
| if geom_dict.get("type") == "GeometryCollection" and geom_dict.get("coordinates") is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is one of the issues caused by the messy nested features problem above. Happy to brainstorm about this to find another solution
| if geom_dict.get("type") == "GeometryCollection" and geom_dict.get("coordinates") is None: | ||
| geom_dict.pop("coordinates", None) | ||
|
|
||
| # skip invalid geometry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the nested features issue seems to be quite random. And there are too many features to go through (Somalia has about 3500 features after filtered). Thats why I couldn't structure it in the pydantic schema and opt for "get first, filter later"
| except Exception as e: | ||
| logger.error(f"Validation failed for {iso2}: {e}") | ||
|
|
||
| return gdf_iso2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gdf_iso2 does not exist if data is none or if an exception is caught in the try explode geometries.
Todo: