How to crawl a site behind basic authentication (CredentialStore/HttpAuthenticationCredential ends up with 401) #662

@danijanos

Dear Heritrix3 Community,

Thank you for this great tool! I am using version 3.10.0 and would appreciate help with the following issue.

I need to crawl the previous version of a site that has undergone a major upgrade. The developers moved the old site to a domain that sits behind a basic-authentication login. (Every outgoing request includes an Authorization header field that supplies the basic-authentication credentials: the base64-encoded username and password granted by the site administrators.)
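
For reference, this is my understanding of how that header value is built, as a minimal Java sketch. The class name `BasicAuthHeader` is made up for illustration and is not part of Heritrix; the login and password are the same placeholders I use in the config below.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustrative helper (hypothetical name, not a Heritrix class).
class BasicAuthHeader {
    // Builds the Authorization header value for HTTP basic authentication:
    // the literal "Basic " followed by base64("login:password").
    static String headerValue(String login, String password) {
        String pair = login + ":" + password;
        byte[] bytes = pair.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        // Placeholder credentials, matching the ones in the bean config.
        System.out.println("Authorization: "
                + headerValue("myloginname", "passwordformyloginname"));
    }
}
```

Comparing the printed value with the Authorization header the browser sends should at least confirm that the login and password themselves are correct.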

I configured the job as described in the docs, so the crawl has these two beans for basic authentication:

<bean id="credentialStore" class="org.archive.modules.credential.CredentialStore">
   <property name="credentials">
     <map>
       <entry key="OLDSiteLoginCredential" value-ref="OLDSiteLoginCredential"/>
     </map>
   </property>
</bean>

<bean id="OLDSiteLoginCredential" class="org.archive.modules.credential.HttpAuthenticationCredential">
   <property name="domain" value="https://old.site.edu:443"/>
   <property name="realm" value="oldsiterealm"/>
   <property name="login" value="myloginname"/>
   <property name="password" value="passwordformyloginname"/>
</bean>

But every time I build and launch the job, it finishes right after the DNS lookup with two 401 responses, one for the main page URL and one for robots.txt:

401        381 https://old.site.edu/ - - text/html #001
401        381 https://old.site.edu/robots.txt P https://old.site.edu/ text/html #001
1          51  dns:old.site.edu P https://old.site.edu/ text/dns #001

Could you please help me identify what I am doing wrong here? Or would you happen to know how I should do this?
Thanks a lot!
