Commit 9cf3f51
refactor(scraper): replace trafilatura with markitdown for HTML to Markdown conversion
- Removed trafilatura dependency and its usage in scraper.py.
- Added markitdown as a dependency in requirements.txt and integrated it for Markdown conversion.
- Updated content extraction logic to use markitdown and extract page titles with BeautifulSoup.
- Adjusted tests to mock markitdown usage and verify new scraping workflow.
1 parent 30cee92 commit 9cf3f51
3 files changed
Lines changed: 62 additions & 25 deletions
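The workflow described in the commit message (convert HTML to Markdown, extract the page title separately, and mock the converter in tests) can be sketched roughly as follows. This is not the code from scraper.py: the names `extract_title` and `scrape_to_markdown` are hypothetical, the converter is passed in as a plain callable where the real code presumably calls markitdown, and the standard library's `HTMLParser` stands in for BeautifulSoup so the sketch runs without third-party packages.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional
from unittest.mock import Mock


class _TitleParser(HTMLParser):
    """Collects the text of the first <title> element (stand-in for BeautifulSoup)."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self) -> Optional[str]:
        text = "".join(self._parts).strip()
        return text or None


def extract_title(html: str) -> Optional[str]:
    """Return the page title, or None if the document has no usable <title>."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


def scrape_to_markdown(html: str, convert: Callable[[str], str]) -> Dict[str, Optional[str]]:
    """Hypothetical helper: `convert` is any HTML -> Markdown callable.

    Assumption: in the real scraper this role is played by markitdown;
    injecting it here keeps the sketch independent of that library's API.
    """
    return {"title": extract_title(html), "markdown": convert(html).strip()}


if __name__ == "__main__":
    # In a test, the converter can be replaced by a mock, as the commit describes.
    page = "<html><head><title>Example Page</title></head><body><h1>Hi</h1></body></html>"
    fake_convert = Mock(return_value="# Hi\n")
    print(scrape_to_markdown(page, fake_convert))
    fake_convert.assert_called_once_with(page)
```

Passing the converter in as an argument is one way to keep tests fast and offline; patching the converter where scraper.py imports it with `unittest.mock.patch` would be an equally reasonable choice.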