Skip to content

httpHeaders Set-Cookie is single string #33

@Austinb

Description

@Austinb

Where there are multiple Set-Cookie headers in a server response from a WARC record the value of httpHeaders.Set-Cookie is always the last one in the list. This should be returned as an array of the Set-Cookie headers if that change doesnt break other things or there should be another method to get all of the cookies from the headers block. Another option would be to keep the line endings (\n) for the response so it is still a string but you can split it if you want.

Example WARC record (minus the content block):

WARC/1.0
WARC-Type: request
WARC-Date: 2019-06-15T21:54:45Z
WARC-Record-ID: <urn:uuid:1e7aaba9-c5b9-49cd-b0a8-6a4d7460c9b3>
Content-Length: 296
Content-Type: application/http; msgtype=request
WARC-Warcinfo-ID: <urn:uuid:07d8abda-2416-492c-b139-8fb526d5f792>
WARC-IP-Address: 95.216.246.36
WARC-Target-URI: https://www.bpazar.com/index.php?route=product/search&search=Sarj&page=4

GET /index.php?route=product/search&search=Sarj&page=4 HTTP/1.1
User-Agent: CCBot/2.0 (https://commoncrawl.org/faq/)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Host: www.bpazar.com
Connection: Keep-Alive
Accept-Encoding: gzip



WARC/1.0
WARC-Type: response
WARC-Date: 2019-06-15T21:54:45Z
WARC-Record-ID: <urn:uuid:3f3d6e43-9e5d-42ba-a111-43fcd90dd633>
Content-Length: 1043231
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:07d8abda-2416-492c-b139-8fb526d5f792>
WARC-Concurrent-To: <urn:uuid:1e7aaba9-c5b9-49cd-b0a8-6a4d7460c9b3>
WARC-IP-Address: 95.216.246.36
WARC-Target-URI: https://www.bpazar.com/index.php?route=product/search&search=Sarj&page=4
WARC-Payload-Digest: sha1:N2WQFAUYKKXT6MRWSCXCQC7FOZRQCLTI
WARC-Block-Digest: sha1:S3FKWWFJ7LCYFOHUZ4RBPFAMYNQSVQMH
WARC-Identified-Payload-Type: text/html

HTTP/1.1 200 OK
Server: nginx
Date: Sat, 15 Jun 2019 21:54:44 GMT
Content-Type: text/html; charset=UTF-8
X-Crawler-Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Set-Cookie: OCSESSID=d4163e3479bec29a507792acc4; path=/
Set-Cookie: OCSESSID=57bfbd42e2fe9d4d5af66485f7; path=/
Set-Cookie: language=tr-tr; expires=Mon, 15-Jul-2019 21:54:40 GMT; Max-Age=2592000; path=/; domain=tr-tr
Set-Cookie: currency=TRY; expires=Mon, 15-Jul-2019 21:54:40 GMT; Max-Age=2592000; path=/; domain=www.bpazar.com
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
X-Nginx-Cache-Status: BYPASS
X-Server-Powered-By: Engintron
X-Crawler-Content-Encoding: gzip

Response from console.log(record.httpHeaders); when used in the record callback:

{ Server: 'nginx',
  Date: 'Sat, 15 Jun 2019 21:54:44 GMT',
  'Content-Type': 'text/html; charset=UTF-8',
  'X-Crawler-Transfer-Encoding': 'chunked',
  Connection: 'keep-alive',
  Vary: 'Accept-Encoding',
  'Set-Cookie':
   'currency=TRY; expires=Mon, 15-Jul-2019 21:54:40 GMT; Max-Age=2592000; path=/; domain=www.bpazar.com',
  'X-XSS-Protection': '1; mode=block',
  'X-Content-Type-Options': 'nosniff',
  'X-Nginx-Cache-Status': 'BYPASS',
  'X-Server-Powered-By': 'Engintron',
  'X-Crawler-Content-Encoding': 'gzip' }

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions