Hacker News

zachperkitny
Tadpole the Language for Scraping 0.2.0 – Complex Control Flow, Stealth and More

Hello,

I posted a few weeks ago about my custom scraping language. It definitely got some traction, which was very exciting to see.

GitHub repo: https://github.com/tadpolehq/tadpole
Docs: https://tadpolehq.com/

Over the past 2 weeks, I've focused on introducing specific stealth actions, more complex control flow actions, and a variety of evaluators for cleaning data.

Here is an example that scrapes `books.toscrape.com`:

  main {
    new_page {
      goto "https://books.toscrape.com/"
      loop {
        do {
          $$ article.product_pod {
            extract "books[]" {
              title { $ "h3 a"; attr title }
              rating {
                $ ".star-rating";
                attr "class";
                extract "star-rating (One|Two|Three|Four|Five)" caseInsensitive=#true;
                func "(v) => ({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}[v.toLowerCase()] || null)"
              }
              price { $ "p.price_color"; text; as_float }
              in_stock { $ "p.availability"; text; matches "In stock" caseInsensitive=#true }
            }
          }
        }
        while { $ "li.next" }
        next {
          $ "li.next a" { click }
          wait_until
        }
      }
    }
  }
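
For readers who want to see what the field evaluators in the example boil down to, here is a rough plain-Python equivalent of the `rating`, `price` and `in_stock` chains. This is only a sketch, not Tadpole code: the function names and the sample record are made up for illustration, and the price handling assumes `as_float` ends up with the numeric part of a string like "£51.77", which the post does not spell out.

```python
import re

# Mirrors the JavaScript object passed to `func` in the example above.
RATING_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def parse_rating(class_attr):
    """Regex-extract the rating word from the class attribute, then map it to an int."""
    match = re.search(r"star-rating (One|Two|Three|Four|Five)", class_attr, re.IGNORECASE)
    if match is None:
        return None
    return RATING_WORDS.get(match.group(1).lower())


def parse_price(text):
    """Assumption: keep only the numeric part of the price text and convert it to float."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group(0)) if match else None


def parse_in_stock(text):
    """Case-insensitive search for "In stock", similar in spirit to `matches`."""
    return re.search(r"In stock", text, re.IGNORECASE) is not None


def clean_book(raw):
    """Turn one raw record (hypothetical field names) into a cleaned book dict."""
    return {
        "title": raw["title"],
        "rating": parse_rating(raw["rating_class"]),
        "price": parse_price(raw["price_text"]),
        "in_stock": parse_in_stock(raw["availability_text"]),
    }


# Sample input shaped like what the selectors in the example would pull from one product card.
sample = {
    "title": "A Light in the Attic",
    "rating_class": "star-rating Three",
    "price_text": "£51.77",
    "availability_text": "\n    In stock\n",
}
print(clean_book(sample))
```

The point of the chain is that each field goes from a raw DOM string to a typed value in one place, so the output needs no post-processing step.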

I've also introduced actions like `apply_identity` to override User-Agent headers and User-Agent metadata. Here is an example module for selectively applying different identities:

  module stealth {
    // Apple M2 Pro
    action apply_apple_m2 {
      apply_identity mac
      set_webgl_vendor "Apple Inc." "Apple M2"
      set_device_memory 16
      set_hardware_concurrency 8
      set_viewport 1440 900 deviceScaleFactor=2
    }

    // Windows Desktop
    action apply_windows_16_8 {
      apply_identity windows
      set_webgl_vendor "Google Inc. (Intel)" "ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0)"
      set_device_memory 16
      set_hardware_concurrency 8
      set_viewport 1920 1080
    }

    // Windows Budget Laptop
    action apply_windows_8_4 {
      apply_identity windows
      set_webgl_vendor "Google Inc. (Intel)" "ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0)"
      set_device_memory 8
      set_hardware_concurrency 4
      set_viewport 1366 768
    }
  }
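
To illustrate the "selectively" part, here is a small Python sketch of how a driver script might model these three profiles as data and pick one per session. Nothing here is part of Tadpole: the `Profile` class, `PROFILES` table and `pick_profile` helper are hypothetical, and treating the second argument of `set_webgl_vendor` as the renderer string is an assumption.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    identity: str
    webgl_vendor: str
    webgl_renderer: str  # assumption: second argument of set_webgl_vendor
    device_memory_gb: int
    hardware_concurrency: int
    viewport: tuple
    device_scale_factor: float = 1.0


INTEL_UHD_620 = "ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0)"

# Keys reuse the action names from the module above; values copy its settings.
PROFILES = {
    "apply_apple_m2": Profile("mac", "Apple Inc.", "Apple M2", 16, 8, (1440, 900), 2.0),
    "apply_windows_16_8": Profile("windows", "Google Inc. (Intel)", INTEL_UHD_620, 16, 8, (1920, 1080)),
    "apply_windows_8_4": Profile("windows", "Google Inc. (Intel)", INTEL_UHD_620, 8, 4, (1366, 768)),
}


def pick_profile(seed=None):
    """Pick one profile name; a fixed seed gives a repeatable choice."""
    rng = random.Random(seed)
    name = rng.choice(sorted(PROFILES))
    return name, PROFILES[name]


name, profile = pick_profile(seed=42)
print(name, profile.viewport, profile.device_memory_gb)
```

Keeping each profile as one unit means the user agent, GPU strings, memory, core count and viewport always change together, rather than being mixed across machines.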

The full release changelog is available here: https://github.com/tadpolehq/tadpole/releases/

My goal for the upcoming 0.3.0 release is to focus heavily on plugins, distributed execution through message queues, Redis support for crawling, and static parsing as an alternative to running exclusively over CDP/Chrome.

I'll keep trying to hold to a release cadence of every 2 weeks!

