DRAFT: Feat/contact details #50

Closed
wants to merge 8 commits into from

Conversation

metalwarrior665
Contributor

@foxt451 Creating a draft PR for better review

@jancurn
Contributor

jancurn commented Jan 24, 2024

Great stuff, I'm curious to see how it will work. Surely the LLM will have some advantage over regexps, no?

@metalwarrior665
Contributor Author

I don't think so, all social handles have pretty clear regexes. Emails should also be fine. Phone numbers can be tricky, but I don't expect GPT to do a better job than a regex, and it will also hallucinate more. GPT is for things that require an understanding of a broader context. Let's continue in the issue.
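
To illustrate the point about handles and emails having clear regexes, here is a minimal sketch. The patterns and the `extract_contacts` helper are assumptions made for illustration, not code from this PR or from the contact-details scraper, and they ignore many real-world edge cases.

```python
import re

# Illustrative patterns only (assumptions, not the scraper's actual regexes).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
TWITTER_RE = re.compile(r"https?://(?:www\.)?twitter\.com/([A-Za-z0-9_]{1,15})")
LINKEDIN_RE = re.compile(r"https?://(?:[a-z]+\.)?linkedin\.com/in/([A-Za-z0-9_-]+)")


def extract_contacts(text):
    """Return de-duplicated emails and social handles found in `text`."""

    def unique(items):
        # dict.fromkeys keeps first-seen order while dropping duplicates.
        return list(dict.fromkeys(items))

    return {
        "emails": unique(EMAIL_RE.findall(text)),
        "twitter": unique(TWITTER_RE.findall(text)),
        "linkedin": unique(LINKEDIN_RE.findall(text)),
    }
```

Each pattern is local to the match itself, which is why no broader context (and so no LLM) is needed for this step.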

@foxt451
Contributor

foxt451 commented Jan 25, 2024

Yeah, I realised a bit later that this offers no real advantage over regexes. I guess we could set up some metamorph chain between the contact-details scraper and GPT, so that GPT gets the page's text plus the regex-extracted contacts and just matches them.
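
A rough sketch of what the GPT side of such a chain could receive: the page text together with the already-extracted contacts, with the model asked only to match them. The `build_matching_prompt` function, its wording, and the `max_chars` limit are hypothetical; the metamorph wiring and the actual model call are left out.

```python
import json


def build_matching_prompt(page_text, contacts, max_chars=4000):
    """Build a prompt asking the model to group known contacts by person.

    `contacts` is whatever the regex step produced (assumed here to be a
    JSON-serializable dict). The model is told to use only those values,
    which limits the room for hallucinated contacts.
    """
    truncated = page_text[:max_chars]  # keep the prompt bounded
    return (
        "Group the contacts below by the person they belong to. "
        "Use only contacts from the list; do not invent new ones.\n\n"
        f"Contacts (regex-extracted):\n{json.dumps(contacts, indent=2)}\n\n"
        f"Page text:\n{truncated}"
    )
```

Restricting the model to the regex output keeps extraction deterministic and leaves only the matching to GPT.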

@metalwarrior665
Contributor Author

I would try the name + contacts grouping. That was the student project. The tricky part was actually finding enough test websites, because pages that list names together with contacts are not that common.

@jancurn
Contributor

jancurn commented Jan 25, 2024

Yes, for extracting email, URLs, etc. it has no benefit, but the whole point of using LLMs was to:

  • cluster the found handles to specific people, and ideally also determine their full names
  • be able to correctly identify phone numbers, ideally filling the country prefix - regexes are pretty much useless for this
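
The phone-number point can be sketched as follows: a regex can find candidates, but when the number is written without a `+` prefix, the country calling code has to come from somewhere else (page context, or here a caller-supplied default). `PHONE_CANDIDATE_RE` and `normalize_phone` are hypothetical names, and dropping a single leading zero is a simplification that does not hold for every numbering plan.

```python
import re

# Loose candidate pattern (an assumption): optional "+", then digits mixed
# with spaces, dots, dashes, or parentheses, at least 8 digits/separators long.
PHONE_CANDIDATE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def normalize_phone(raw, default_calling_code=None):
    """Return a "+<digits>" string, or None when the country is unknown."""
    digits = re.sub(r"\D", "", raw)
    if raw.strip().startswith("+"):
        # The number already carries its country calling code.
        return "+" + digits
    if default_calling_code is None:
        # A regex alone cannot tell which country this number belongs to.
        return None
    # Simplification: drop one leading trunk zero before adding the code.
    national = digits[1:] if digits.startswith("0") else digits
    return "+" + default_calling_code + national
```

The `None` branch is the gap an LLM with page context (address, language, domain) might fill.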

@foxt451 foxt451 closed this Jan 30, 2024
3 participants