tap-marketo seems to suffer from several performance issues

We use tap-marketo, and have noticed several performance issues with it.

## The ETL Sub Process Timeline

To visualize the issue, let me first break down the various stages of the etl workflow with tap-marketo's leads stream. The extract--load process is essentially the following:

```
do while export_end < job_start
   request marketo to create the export file and wait for it to be ready (1: dark blue)
   download the generated csv file (2: orange)
   parse the csv file row by row and call singer.write_record (3: red)

finish the tap process, generate output file (5: light blue)
copy file to s3 (6: green)
```
![image](https://github.com/singer-io/tap-marketo/assets/6108231/218d663c-412b-40d8-8b9b-cf33391c02a4)

## Description of Visualization

1. time required for step 1 is dictated by marketo and we don't have any options to optimize it from the perspective of the tap.
2. step 2 is governed by the network bandwidth
3. notice how step 3 (parsing the csv) takes an inordinate amount of time, and that is something we should be able to optimize
4. notice how the steps 1, 2 & 3 are repeated for a second time. it seems to me that the while loop is not terminating when it should

## Optimization Opportunity

investigating the codebase, i found that parsing the csv has the following optimization opportunities

1. reading the csv involves an extra generator expression with a call to `str.replace`. this cost is increases with downloaded rows.
2. for reach row, call `format_values`, which in turn builds `available_fields` and calls `format_value` on each field. It stands to reason that we should optimize `format_value` as that is executed in the inner-most loop. this cost increases as the product of downloaded rows & fields in each row. `format_value` uses several inefficient constructs, described below.
3. determining `available_fields` in `format_values` seems inefficient as well, as the fields are not going to change for a given stream. this cost increases with downloaded rows
5. the extra _loop_ seems to be due to `export_end` getting millisecond-truncated, `export_start` is set to `export_end` at the end of the while loop, and the comparison `export_end < job_start` returns `true` for 1 extra time because `job_start` is not millisecond-truncated, and thus is almost always a few milliseconds greater than `export_end`

## Opportunity to add more user-configuration

`ATTRIBUTION_WINDOW_DAYS = 1` is used to set the start of the extract back by 1 day. While this is a good default option for people ingesting data on a daily basis, it is not suitable for extractions running multiple times a day, as we are re-extracting the same record multiple time leading to additional data processing within the database.

It would be useful to be able to over-ride this hard-coded value through a parameter.

Internally at my workplace, we have improved the above mentioned issues, and have noticed our run time reduce by over 50%. Should the maintainers of tap-marketo agree, I'm happy to port the improvements over.

## Opportunity to optimize `format_value`

### isinstance is not required in this method. 

csv.reader (or DictReader) will only return strings in our usage, and its possible that it might return floats if `QUOTE_NONNUMERIC` is specified. Quoting the csv module's documentation

> Each row read from the csv file is returned as a list of strings. No automatic data type conversion is performed unless the `QUOTE_NONNUMERIC` format option is specified (in which case unquoted fields are transformed into floats).

### the warning message feel unnecessary.

evidently the conversion to integer is correct, and it is marketo's standard practice to format percent-fields as a decimal. the warning message emitted for each percent field for each row imposes an undue linearly increasing cost

### conversion to int is twice as fast with `int(float(value))`

this sequence of operation would still work if `value` was of non-string type (recall that csv.reader returns strings by default but could also return floats if QUOTE_NONNUMERIC is set), which the current method is unable to handle becasue `find` method only exists for strings.

`str.find` is slow (or at least slow for our critical loop), and the added if-else adds cost that we should try to void.

### the null-value early return could leverage a set for better lookup performance.

we should just define a global constant (like below) and use that for lookup.

Even compared with a list of size 3, a lookup on a set will be more efficient. Having it be global means that the reference data is allocated only once per process.

```
NULL_VALUES = {None, '', 'null}
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tap-marketo seems to suffer from several performance issues #87

The ETL Sub Process Timeline

Description of Visualization

Optimization Opportunity

Opportunity to add more user-configuration

Opportunity to optimize `format_value`

isinstance is not required in this method.

the warning message feel unnecessary.

conversion to int is twice as fast with `int(float(value))`

the null-value early return could leverage a set for better lookup performance.

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

tap-marketo seems to suffer from several performance issues #87

Description

The ETL Sub Process Timeline

Description of Visualization

Optimization Opportunity

Opportunity to add more user-configuration

Opportunity to optimize format_value

isinstance is not required in this method.

the warning message feel unnecessary.

conversion to int is twice as fast with int(float(value))

the null-value early return could leverage a set for better lookup performance.

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Opportunity to optimize `format_value`

conversion to int is twice as fast with `int(float(value))`