• Hacker News
  • letier 1 hour

    The extraction prompt would need some hardening against prompt injection, as far as I can tell.

  • vetler 13 minutes

    My instinct was also to use LLMs for this, but it was way too slow and still expensive if you want to scrape millions of pages.

  • dmos62 3 hours

    What's your experience with not getting blocked by anti-bot systems? I see you've got custom patches for that.

    andrew_zhong 3 hours

    The anti-bot patches here (via Patchright) are about preventing the browser from being detected as automated — fixing CDP leaks, removing automation flags, etc. For sites behind Cloudflare or Datadome, that alone usually isn't enough — you'll need residential proxies and proper browser fingerprints on top. The library supports connecting to remote scraping browsers via WebSocket and proxy configuration for those cases.

  • AirMax98 2 hours

    This feels like slop to me.

    It may or may not be, but if you want people to actually use this product I’d suggest improving your documentation and replies here to not look like raw Claude output.

    I also doubt the premise about malformed JSON. I have never encountered anything like what you are describing with structured outputs.

    andrew_zhong 1 hour

    In the context of e-commerce web extraction, invalid JSON can occur in edge cases, for example:

    price: z.number().optional() -> price: "n/a"

    url: z.string().url().nullable() -> url: "not found"

    A single invalid object in an array (e.g. a missing required field, or truncated input) can also cause the entire output to fail.

    The unique contribution here is that we can recover invalid nullable or optional fields, and also remove invalid nested objects from an array.
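    As a rough illustration of that recovery idea, here is a hypothetical sketch in plain TypeScript (not the library's actual code; `recoverProducts`, `coercePrice`, and the field names are made up for the example):

```typescript
// Hypothetical sketch of field-level recovery: instead of rejecting the whole
// payload when one field fails validation, null out invalid optional fields
// and drop only the array items that are beyond repair.
type Product = { title: string; price: number | null };

function coercePrice(raw: unknown): number | null {
  // LLMs sometimes emit "n/a" or "not found" where the schema expects a number
  return typeof raw === "number" && Number.isFinite(raw) ? raw : null;
}

function recoverProducts(raw: unknown[]): Product[] {
  const out: Product[] = [];
  for (const item of raw) {
    if (typeof item !== "object" || item === null) continue; // drop junk entries
    const rec = item as Record<string, unknown>;
    if (typeof rec.title !== "string") continue; // required field missing: drop item
    out.push({ title: rec.title, price: coercePrice(rec.price) });
  }
  return out;
}

// "n/a" becomes null, the item missing its required title is dropped,
// and the fully valid item survives untouched
recoverProducts([
  { title: "Widget", price: "n/a" },
  { price: 9.99 },
  { title: "Gadget", price: 19.99 },
]);
```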

  • Flux159 4 hours

    This looks pretty interesting! I haven't used it yet, but I looked through the code a bit. It looks like it uses turndown to convert the HTML to markdown first, then passes that to the LLM, so I'm assuming that's a huge reduction in tokens from the preprocessing. Do you have any data on how often this can cause issues, i.e. tables or other information being lost?

    Then langchain and structured schemas for the output, along with a specific system prompt for the LLM. Do you know which open source models work best, or do you just use Gemini in production?

    Also, looking at the docs, Gemini 2.5 Flash is getting deprecated by June 17th https://ai.google.dev/gemini-api/docs/deprecations#gemini-2.... (I keep getting emails from Google about it), so you might want to update that to Gemini 3 Flash in the examples.

    andrew_zhong 39 minutes

    HTML -> markdown -> LLM is standard practice. We strip elements like aside, embed, head, iframe, etc. The criteria are conservatively set to avoid removing too many elements (especially in extractMain mode).

    https://github.com/lightfeed/extractor/blob/main/src/convert...
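    For illustration, the stripping step looks roughly like this (a hypothetical sketch only; the real converter in the linked file works on parsed HTML, not regexes, and its tag list may differ):

```typescript
// Hypothetical sketch: remove whole non-content elements before the
// HTML -> markdown conversion so their contents never reach the LLM.
// A regex is only adequate for illustration; real code should use a DOM parser.
const STRIP_TAGS = ["aside", "embed", "head", "iframe", "script", "style"];

function stripElements(html: string): string {
  for (const tag of STRIP_TAGS) {
    const re = new RegExp(`<${tag}\\b[\\s\\S]*?</${tag}>`, "gi");
    html = html.replace(re, "");
  }
  return html;
}

stripElements("<head><title>x</title></head><p>Price: $9.99</p><aside>ads</aside>");
// -> "<p>Price: $9.99</p>"
```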

    I have used Gemma 3 and had good results.

    Once Gemini 3 Flash drops the preview suffix, we will update the examples. Thank you for the pointer.

  • sheept 3 hours

    > LLMs return malformed JSON more often than you'd expect, especially with nested arrays and complex schemas. One bad bracket and your pipeline crashes.

    This might be one reason why Claude Code uses XML for tool calling: repeating the tag name in the closing tag helps it keep track of where it is during inference, so it is less error prone.

    faangguyindia 22 minutes

    Hardly matters; this isn't a problem you'd have these days with modern LLMs.

    Also, a model provider can always use a proxy to turn your tool calls into XML and feed you back JSON right away, and you wouldn't even know whether any transformation took place.

    andrew_zhong 3 hours

    Yeah that's a good observation. XML's closing tags give the model structural anchors during generation — it knows where it is in the nesting. JSON doesn't have that, so the deeper the nesting the more likely the model loses track of brackets.

    We see this especially with arrays of objects where each object has optional nested fields. For complex nested objects, the model can get every item well formatted except one with a field of the wrong type. That's why we put effort into the repair/recovery/sanitization layer: validate field-by-field and keep what's valid rather than throwing everything out.
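    One concrete piece of such a layer, sketched hypothetically (`parseTruncatedArray` is made up for this comment; it only handles flat objects, while a real repairer would track nesting depth):

```typescript
// Hypothetical sketch: when the model's output is cut off mid-array, salvage
// the complete leading items instead of failing the whole extraction.
function parseTruncatedArray(text: string): unknown[] {
  try {
    return JSON.parse(text);
  } catch {
    // walk back to the last fully closed object and close the array there
    const end = text.lastIndexOf("}");
    if (end === -1) return [];
    try {
      return JSON.parse(text.slice(0, end + 1) + "]");
    } catch {
      return [];
    }
  }
}

// the truncated third item is dropped; the two complete items survive
parseTruncatedArray('[{"title":"A"},{"title":"B"},{"title":');
```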

  • zx8080 3 hours

    Robots.txt anyone?

    andrew_zhong 3 hours

    Good point. The anti-bot patches here (via Patchright) are about preventing the browser from being detected as automated — things like CDP leak fixes so Cloudflare doesn't block you mid-session. It's not about bypassing access restrictions.

    Our main use case is retail price monitoring — comparing publicly listed product prices across e-commerce sites, which is pretty standard in the industry. But fair point, we should make that clearer in the README.

    reyqn 3 hours

    https://news.ycombinator.com/item?id=47340079

    zendist 2 hours

    Regardless, you should still respect robots.txt.

    bilekas 6 minutes

    > comparing publicly listed product prices across e-commerce sites

    Those prices and that information are there for public viewers. One reason some people have a robots.txt, for example, is to reduce the traffic load that slop crawlers generate. The bandwidth is not free, so why would you presume to ignore their robots.txt when you're not footing the bill?

  • plastic041 4 hours

    > Avoid detection with built-in anti-bot patches and proxy configuration for reliable web scraping.

    And it doesn't care about robots.txt.

    andrew_zhong 3 hours

    Good point. The anti-bot patches here (via Patchright) are about preventing the browser from being detected as automated — things like CDP leak fixes so Cloudflare doesn't block you mid-session. It's not about bypassing access restrictions.

    Our main use case is retail price monitoring — comparing publicly listed product prices across e-commerce sites, which is pretty standard in the industry. But fair point, we should make that clearer in the README.

    plastic041 2 hours

    robots.txt is the most basic access restriction, and this doesn't even read it, while faking itself as human[0]. It is about bypassing access restrictions.

    [0]: https://github.com/lightfeed/extractor/blob/d11060269e65459e...

    messe 2 hours

    > It's not about bypassing access restrictions.

    Yes. It is. You've just made an arbitrary choice not to define it as such.

    andrew_zhong 1 hour

    I will add a PR to enforce robots.txt before the actual scraping.

    zendist 2 hours

    Regardless, you should still respect robots.txt.

    andrew_zhong 1 hour

    We do respect robots.txt in production; scraping browser providers like BrightData also enforce it.

    I will add a PR to enforce robots.txt before the actual scraping.
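    A minimal sketch of what such a gate could look like (hypothetical code, not the actual PR; `isDisallowed` is made up, and a real implementation following RFC 9309 would also handle Allow precedence, wildcards, and proper user-agent token matching):

```typescript
// Hypothetical sketch of a pre-scrape robots.txt check: parse Disallow rules
// in the matching user-agent group and refuse any path they cover.
function isDisallowed(robotsTxt: string, path: string, agent: string): boolean {
  let applies = false;
  let disallowed = false;
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.split("#")[0].trim(); // strip comments
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    const key = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (key === "user-agent") {
      applies = value === "*" || agent.toLowerCase().includes(value.toLowerCase());
    } else if (key === "disallow" && applies && value !== "") {
      if (path.startsWith(value)) disallowed = true;
    }
  }
  return disallowed;
}

const robots = "User-agent: *\nDisallow: /private/\n";
isDisallowed(robots, "/private/page", "lightfeed-bot"); // true: skip this URL
isDisallowed(robots, "/products/1", "lightfeed-bot");   // false: ok to fetch
```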

    plastic041 1 hour

    How can people believe that you are respecting robots.txt in production when your software's README says it can "Avoid detection with built-in anti-bot patches"?