We use tap-marketo, and have noticed several performance issues with it.
The ETL Sub Process Timeline
To visualize the issue, let me first break down the various stages of the etl workflow with tap-marketo's leads stream. The extract--load process is essentially the following:
do while export_end < job_start
request marketo to create the export file and wait for it to be ready (1: dark blue)
download the generated csv file (2: orange)
parse the csv file row by row and call singer.write_record (3: red)
finish the tap process, generate output file (5: light blue)
copy file to s3 (6: green)

Description of Visualization
- time required for step 1 is dictated by marketo and we don't have any options to optimize it from the perspective of the tap.
- step 2 is governed by the network bandwidth
- notice how step 3 (parsing the csv) takes an inordinate amount of time, and that is something we should be able to optimize
- notice how the steps 1, 2 & 3 are repeated for a second time. it seems to me that the while loop is not terminating when it should
Optimization Opportunity
investigating the codebase, i found that parsing the csv has the following optimization opportunities
- reading the csv involves an extra generator expression with a call to
str.replace. this cost is increases with downloaded rows.
- for reach row, call
format_values, which in turn builds available_fields and calls format_value on each field. It stands to reason that we should optimize format_value as that is executed in the inner-most loop. this cost increases as the product of downloaded rows & fields in each row. format_value uses several inefficient constructs, described below.
- determining
available_fields in format_values seems inefficient as well, as the fields are not going to change for a given stream. this cost increases with downloaded rows
- the extra loop seems to be due to
export_end getting millisecond-truncated, export_start is set to export_end at the end of the while loop, and the comparison export_end < job_start returns true for 1 extra time because job_start is not millisecond-truncated, and thus is almost always a few milliseconds greater than export_end
Opportunity to add more user-configuration
ATTRIBUTION_WINDOW_DAYS = 1 is used to set the start of the extract back by 1 day. While this is a good default option for people ingesting data on a daily basis, it is not suitable for extractions running multiple times a day, as we are re-extracting the same record multiple time leading to additional data processing within the database.
It would be useful to be able to over-ride this hard-coded value through a parameter.
Internally at my workplace, we have improved the above mentioned issues, and have noticed our run time reduce by over 50%. Should the maintainers of tap-marketo agree, I'm happy to port the improvements over.
Opportunity to optimize format_value
isinstance is not required in this method.
csv.reader (or DictReader) will only return strings in our usage, and its possible that it might return floats if QUOTE_NONNUMERIC is specified. Quoting the csv module's documentation
Each row read from the csv file is returned as a list of strings. No automatic data type conversion is performed unless the QUOTE_NONNUMERIC format option is specified (in which case unquoted fields are transformed into floats).
the warning message feel unnecessary.
evidently the conversion to integer is correct, and it is marketo's standard practice to format percent-fields as a decimal. the warning message emitted for each percent field for each row imposes an undue linearly increasing cost
conversion to int is twice as fast with int(float(value))
this sequence of operation would still work if value was of non-string type (recall that csv.reader returns strings by default but could also return floats if QUOTE_NONNUMERIC is set), which the current method is unable to handle becasue find method only exists for strings.
str.find is slow (or at least slow for our critical loop), and the added if-else adds cost that we should try to void.
the null-value early return could leverage a set for better lookup performance.
we should just define a global constant (like below) and use that for lookup.
Even compared with a list of size 3, a lookup on a set will be more efficient. Having it be global means that the reference data is allocated only once per process.
NULL_VALUES = {None, '', 'null}
We use tap-marketo, and have noticed several performance issues with it.
The ETL Sub Process Timeline
To visualize the issue, let me first break down the various stages of the etl workflow with tap-marketo's leads stream. The extract--load process is essentially the following:
Description of Visualization
Optimization Opportunity
investigating the codebase, i found that parsing the csv has the following optimization opportunities
str.replace. this cost is increases with downloaded rows.format_values, which in turn buildsavailable_fieldsand callsformat_valueon each field. It stands to reason that we should optimizeformat_valueas that is executed in the inner-most loop. this cost increases as the product of downloaded rows & fields in each row.format_valueuses several inefficient constructs, described below.available_fieldsinformat_valuesseems inefficient as well, as the fields are not going to change for a given stream. this cost increases with downloaded rowsexport_endgetting millisecond-truncated,export_startis set toexport_endat the end of the while loop, and the comparisonexport_end < job_startreturnstruefor 1 extra time becausejob_startis not millisecond-truncated, and thus is almost always a few milliseconds greater thanexport_endOpportunity to add more user-configuration
ATTRIBUTION_WINDOW_DAYS = 1is used to set the start of the extract back by 1 day. While this is a good default option for people ingesting data on a daily basis, it is not suitable for extractions running multiple times a day, as we are re-extracting the same record multiple time leading to additional data processing within the database.It would be useful to be able to over-ride this hard-coded value through a parameter.
Internally at my workplace, we have improved the above mentioned issues, and have noticed our run time reduce by over 50%. Should the maintainers of tap-marketo agree, I'm happy to port the improvements over.
Opportunity to optimize
format_valueisinstance is not required in this method.
csv.reader (or DictReader) will only return strings in our usage, and its possible that it might return floats if
QUOTE_NONNUMERICis specified. Quoting the csv module's documentationthe warning message feel unnecessary.
evidently the conversion to integer is correct, and it is marketo's standard practice to format percent-fields as a decimal. the warning message emitted for each percent field for each row imposes an undue linearly increasing cost
conversion to int is twice as fast with
int(float(value))this sequence of operation would still work if
valuewas of non-string type (recall that csv.reader returns strings by default but could also return floats if QUOTE_NONNUMERIC is set), which the current method is unable to handle becasuefindmethod only exists for strings.str.findis slow (or at least slow for our critical loop), and the added if-else adds cost that we should try to void.the null-value early return could leverage a set for better lookup performance.
we should just define a global constant (like below) and use that for lookup.
Even compared with a list of size 3, a lookup on a set will be more efficient. Having it be global means that the reference data is allocated only once per process.