Hacker News

jasonwcfan
Show HN: Finic – Open source platform for building browser automations github.com

Last year we launched a project called Psychic that did moderately well on hacker news, but was a commercial failure. We were able to find customers, but none with compelling and overlapping use cases. Everyone who was interested was too early to be a real customer.

This was our launch: https://news.ycombinator.com/item?id=36032081

We recently decided to revive and rebrand the project after seeing a sudden spike in interest from people who wanted to connect LLMs to data - but specifically through browsers. It's also a problem we've experienced firsthand, having built scraping features into Psychic and previously working on bot detection at Robinhood.

If you haven’t built a web scraper or browser automation before, you might assume it’s very straightforward. People have been building scrapers for as long as the internet has existed, so there must be many tools for the job.

The truth is that web scraping strategies need to constantly adapt as web standards change, and as companies that don’t want to be scraped adopt new technologies to try and block it. The old standards never completely go away, so the longer the internet exists, the more edge cases you’ll need to account for. This adds up to a LOT of infrastructure that needs to be set up and a lot of schlep developers have to go through to get up and running.

Scraping is no easier today than it was 10 years ago - the problems are just different.

Finic is an open source platform for building and deploying browser agents. Browser agents are bots deployed to the cloud that mimic the behaviour of humans, like web scrapers or remote process automation (RPA) jobs. Simple examples include scripts that scrape static websites like the SEC's EDGAR database. More complex use cases include integrating with legacy applications that don’t have public APIs, where the best way to automate data entry is to just manipulate HTML selectors (EHRs for example).

Our goal is to make Finic the easiest way to deploy a Playwright-based browser automation. With this launch, you can already do so in just 4 steps. Check out our docs for more info: https://docs.finic.io/quickstart


ghxst4 months ago

Cool service, but how do you plan to deal with anti-scraping and anti-bot services like Akamai, Arkose, Cloudflare, DataDome, etc.? Automation of the web isn't solved by another Playwright or Puppeteer abstraction; you need to solve more fundamental problems in order to mitigate the issues you run into at scale.

jasonwcfanop4 months ago

I mentioned this in another comment, but I know from experience that it's impossible to reliably differentiate bots from humans over a network. And since the right to automate browsers has survived repeated legal challenges, all vendors can do is make it incrementally harder to weed out the low sophistication actors.

This actually creates an evergreen problem that companies need to overcome, and our paid version will probably involve helping companies overcome these barriers.

Also I should clarify that we're explicitly not trying to build a playwright abstraction - we're trying to remain as unopinionated as possible about how developers code the bot, and just help with the network-level infrastructure they'll need to make it reliable and make it scale.

It's good feedback for us, we'll make that point more clear!

ghxst4 months ago

> but I know from experience that it's impossible to reliably differentiate bots from humans over a network

While this might be true in theory, it doesn't stop them from trying! And believe me, it's getting to a point where the WAF settings on some websites are even annoying the majority of the real users! Some of the issues I am hinting at, however, are fundamental issues you run into when automating the web using any mainstream browser that hasn't had some source code patches. I'm curious to see if a solution to that will be part of your service if you decide to tackle it.

candiddevmike4 months ago

Don't take this the wrong way, but this is the kind of unethical behavior that our industry should frown upon IMO. I view this kind of thing on the same level as DDoS-as-a-Service companies.

I wish your company the kind of success it deserves.

jasonwcfanop4 months ago

Why is it unethical when courts have repeatedly affirmed browser automation to be legal and permitted?

If anything, it's unethical for companies to dictate how their customers can access services they've already paid for. If I'm paying hundreds of thousands per year for software, shouldn't I be allowed to build automations over it? Instead, many enterprise products go to great lengths to restrict this kind of usage.

I led the team that dealt with DDoS and other network level attacks at Robinhood so I know how harmful they are. But I also got to see many developers using our services in creative ways that could have been a whole new product (example: https://github.com/sanko/Robinhood).

Instead we had to go after these people and shut them down because it wasn't aligned with the company's long term risk profile. It sucked.

That's why we're focused on authenticated agents for B2B use cases, not the kind of malicious bots you might be thinking of.

tempest_4 months ago

> they've already paid for.

That is the crux, rarely is it a service being scraped that they paid for

ayanb94404 months ago

Depends on the use case. Lots of hospitals and banks use RPA to automate routine processes on their EHRs and systems of record, because these kinds of software typically don't have APIs available. Or if they do, they're very limited.

Playwright and other browser automation scripts are a much more powerful version of RPA but they do require some knowledge of code. But there are more and more developers every year and code just gets more powerful every year. So I think it's a good bet to make that browser automation in code will replace RPA altogether some day.

rgrieselhuber4 months ago

Many times it is scraping aggregators of data that those aggregators also did not pay for.

suriya-ganesh4 months ago

I've been working on a browser agent for the last week[1], so this is very exciting. There are also browser agent implementations like Skyvern[2] (also YC backed) or Tarsier[3]. It seems like Finic is providing a way to scale/schedule these agents? If that's the case, what's the advantage over something like Airflow or Windmill?

If I remember correctly, Skyvern also has an implementation of scaling these browser tasks built in.

ps. Is it not called Robotic Process Automation? First time I'm hearing it as Remote process Automation.

[1]https://github.com/ProductLoft/arachne

[2]https://www.skyvern.com/

[3]https://github.com/reworkd/tarsier

mdaniel4 months ago

https://github.com/reworkd/tarsier/pull/115/files represents someone who does not know what git is used for

  Cloning into 'tarsier'...
  remote: Enumerating objects: 15238, done.
  remote: Counting objects: 100% (1613/1613), done.
  remote: Compressing objects: 100% (929/929), done.
  Receiving objects: 100% (15238/15238), 3.01 GiB | 14.82 MiB/s, done.

ayanb94404 months ago

Looks like somebody forgot to update the gitignore lol

ayanb94404 months ago

Yup, that's right, it's Robotic Process Automation.

Based on the feedback in this thread we're going to be releasing an updated version that focuses more around tooling for the browser agents themselves as opposed to scaling/scheduling, so stay tuned for that!

mdaniel4 months ago

And since the other two links are to GH: https://github.com/Skyvern-AI/skyvern (AGPLv3)

dataviz10004 months ago

I build browser automation systems with either Playwright or Chrome Extensions. The biggest issue with automating 3rd party websites is knowing when the 3rd party developer pushes changes which break the automation. The way I dealt with that is run a headless browser in the cloud which checks the behavior of the automated site periodically sending emails and sms messages when it breaks.

If you don't already have this feature for your system, I would recommend it.

ghxst4 months ago

IO between humans and websites can be broken down to only a few fundamental pieces (or elements, I should say). This is actually where AI has a lot of opportunity to add value, as it has the capability of significantly reducing the possibility of breakage between changes.

ayanb94404 months ago

That's a great suggestion! Essentially a cron job to check for website changes before your automation runs and possibly breaks.

What does this check look like for you? Do you just diff the html to see if there are any changes?

dataviz10004 months ago

The issue with diffing HTML is that selectors are autogenerated with any update to a website's code. Websites that combat scraping will often autogenerate different HTML. The first thing is to screen capture the website for comparison. Second, it is possible to determine all the visible elements on a page. With Playwright, inject event listeners into all elements on a page and start automated clicking. If the agent fills out forms, make sure that all fields are available to populate. There are a lot of heuristics.
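One of the heuristics above (checking that every form field the agent needs is still present) can be sketched as a small scheduled check. This is a minimal illustration, not anything from Finic or from the commenter's system: the function names are hypothetical, and it assumes you pick selectors that survive redeploys (for example `name` attributes rather than autogenerated class names).

```python
from typing import Iterable, List


def missing_selectors(page, required: Iterable[str]) -> List[str]:
    """Return the selectors from `required` that match no element on the page.

    `page` is expected to be a Playwright Page; only `page.locator(sel).count()`
    is used, so any object with that shape works.
    """
    return [sel for sel in required if page.locator(sel).count() == 0]


def check_site(url: str, required: Iterable[str]) -> List[str]:
    """Open `url` in headless Chromium and report which selectors are missing."""
    # Imported lazily so missing_selectors() can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        missing = missing_selectors(page, required)
        browser.close()
    return missing
```

Run from a cron job, something like `check_site("https://example.com/login", ["input[name=email]", "button[type=submit]"])` (placeholder URL and selectors) would return a non-empty list when the form changes, which is the point at which to send the email or SMS alert.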

thestepafter4 months ago

Are you doing screenshot comparison with Playwright? If so, how? Based on my research this looks to be a missing feature but I could be incorrect.

sahmeepee4 months ago

Playwright has screenshot comparison built in, including screenshotting a single element, blanking specific elements, and comparing the textual aspects of elements without a visual comparison. You can even apply a specific stylesheet for comparisons.

Everything I can see in this demo can be done with Playwright on its own or with some very basic infrastructure e.g. from Azure to run the tests (automations). I can't see what it is adding. Is it doing some bot-detection countermeasures?

Checking if the page behaviour has changed is pretty easy in Playwright because its primary purpose is testing, so just write some tests to assert the behaviour you expect before you use it.

We use Playwright to both automate and scrape the site of a public organisation we are obliged to use, as another public body. They do have some bot detection because we get an email when we run the scripts, asking us to confirm our account hasn't been compromised, but so far we have not been blocked. If they ever do block us we will need to hire someone to do manual data entry, but the automation has already paid for itself many times over in a couple of years.

dataviz10004 months ago

Some ideas. First, are you saving the cookies and adding them when Playwright bootstraps? [0] Second, are you using the same IP address? Or better use a server running from your office or someone's house. Those are the big ones. The first prevents you from having to continuously login.

It is a game of cat and mouse. It is impossible to stop someone determined to circumvent bot protections.

[0] https://playwright.dev/docs/api/class-browsercontext#browser...
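One way to save and restore cookies between runs is Playwright's `storage_state` on the browser context; whether that is the exact method the truncated link points to is an assumption. A minimal sketch, with a hypothetical `state.json` path and helper names:

```python
import os
from typing import Optional

STATE_FILE = "state.json"  # hypothetical location for the saved session


def existing_state(path: str = STATE_FILE) -> Optional[str]:
    """Return `path` if a saved session file exists, otherwise None."""
    return path if os.path.isfile(path) else None


def run_with_saved_session(url: str, path: str = STATE_FILE) -> None:
    """Reuse cookies from a previous run, then save them again at the end."""
    # Imported lazily so existing_state() works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        # storage_state=None starts a fresh context; a path restores cookies
        # and localStorage from the previous run.
        context = browser.new_context(storage_state=existing_state(path))
        page = context.new_page()
        page.goto(url)
        # ... log in here if the session has expired, then do the work ...
        context.storage_state(path=path)  # persist the session for next time
        browser.close()
```

The first run starts unauthenticated and writes the file; later runs pick it up, which avoids logging in every time.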

Oras4 months ago

Don't take this as a negative thing, but I'm confused. Is it a playwright? Is it a residential proxy? It's not clear from your video.

jasonwcfanop4 months ago

Proxies are definitely on our roadmap, but for now it just supports stock Playwright.

Thanks for the feedback! I just updated the repo to make it more clear that it's Playwright based. Once my cofounder wakes up I'll see if he can re-record the video as well.

ghxst4 months ago

What kind of proxies are on your road map, do you have any experience with in-house proxy networks?

mdaniel4 months ago

> Finic uses Playwright to interact with DOM elements, and recommends BeautifulSoup for HTML parsing.

I have never, ever understood anyone who goes to the trouble of booting up a browser, and then uses a python library to do static HTML parsing

Anyway, I was surfing around the repo trying to find what, exactly "Safely store and access credentials using Finic’s built-in secret manager" means

ayanb94404 months ago

We're in the middle of putting this together right now but it's going to be a wrapper around Google Secret Manager for those that don't want to set up a secrets manager themselves.

0x3444ac534 months ago

Oftentimes websites won't load the HTML without executing the JavaScript, or they use JavaScript running client side to generate the entire page.

mdaniel4 months ago

I feel that we are in agreement for the cases where one would use Playwright, and for damn sure would not involve BS4 for anything in that case

msp264 months ago

What would you recommend for parsing instead?

mdaniel4 months ago

In this specific scenario, where the project is using *automated Chrome* to even bother with the connection, redirects, and bazillions of other "browser-y" things to arrive at HTML to be parsed, the very idea that one would `soup = BeautifulSoup(playwright.content())` is crazypants to me

I am open to the fact that html5lib strives to parse correctly, and good for them, but that would be the case where one wished to use python for parsing to avoid the pitfalls of dragging a native binary around with you

xnyan4 months ago

I think there's some misunderstanding? Sometimes parsing HTML is the best way to get what you need, however there are many situations where one must use something like playwright to get the HTML in the first place (for example, the html is generated clientside by javascript). What's the better alternative?

mdaniel4 months ago

Yes, there is for sure some misunderstanding. Of course parsing HTML is the best way to get what you need in a thread about screen scraping using browser automation. And if the target site is the modern bloatware of <html><body><script src=/17gigabytes.js></script></body></html> then for sure one needs a browser (or equivalent) to solve that problem

What I'm saying is that doing the equivalent of

  chrome.exe --dump-html https://example.com/lol \
    | python -c "import bs4; print('reevaluate life choices that led you here')"

is just facepalm stupid. The first step by definition has already parsed all the html (and associated resources) into a very well formed data structure and then makes available THREE selector languages (DOM, CSS, XPath) to reach into that data structure and pull out the things which interest you. BS4 and its silly python friends implement only a small fraction of those selector languages, poorly. So it's fine if a hammer is all you have, but to launch Chrome and then revert to bs4 is just "what problem are you solving here, friend?"
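To make the alternative concrete, here is a minimal sketch of querying the already-parsed page with Playwright's own CSS and XPath support instead of re-parsing `page.content()`. The helper names and default selectors are illustrative assumptions; `page` is a Playwright `Page`.

```python
from typing import List


def link_targets(page, css: str = "a[href]") -> List[str]:
    """Collect href values with a CSS selector evaluated inside the browser."""
    return page.eval_on_selector_all(css, "els => els.map(e => e.href)")


def heading_texts(page, xpath: str = "//h1 | //h2") -> List[str]:
    """Collect heading text using an XPath expression via a Playwright locator."""
    return page.locator(f"xpath={xpath}").all_inner_texts()
```

Both calls run against the live DOM, so content generated client side is included without a second parsing step.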

ghxst4 months ago

In python specifically I like lxml (pretty sure that's what BS uses under the hood?), parse5 if you're using node is usually my go to. Ideally though you shouldn't really have to parse anything (or not much at all) when doing browser automation as you have access to the DOM which gives you an interface that accepts query selectors directly (you don't even need the Runtime domain for most of your needs).

mdaniel4 months ago

> pretty sure that's what BS uses under the hood?

it's an option[1], and my strong advice is to not use lxml for html since html5lib[2] has the explicitly stated goal of being WHATWG compliant: https://github.com/html5lib/html5lib-python#html5lib

1: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#insta...

2: https://pypi.org/project/html5lib/

ghxst4 months ago

That's good to know, will try it out. I haven't had many cases of "broken" html in projects where I use lxml but when they do happen it can definitely be a pain.

krick4 months ago

Does anyone know a solid (not SaaS, obviously) solution for scraping these days? It's getting pretty hard to get around some pretty harmless cases (like bulk-downloading MY OWN gpx tracks from some fucking fitness-watch servers), with all these js tricks, countless redirects, cloudflare and so on. Even if you already have the cookies, getting a non-403 response to any request is very much not trivial. I feel like it's time to upgrade my usual approach of python requests+libxml, but I don't know if there is a library/tool that solves some of the problems for you.

_boffin_4 months ago

- launch chrome with loading of specified data dir.

- connect to it remotely

- ghost cursor and friends

- save cookies and friends to data dir

- run from residential ip

- if get served captcha or cloudflare, direct to solver and to then route back.

- mobile ip if possible

…can’t go into anymore specifics than that

…I forget the site right now, but there's a guy that gives a good rundown of this stuff. I’ll see if I can find it.
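The first two steps in the list (launch Chrome with a persistent data dir, then connect to it remotely) might look roughly like this with Playwright. This is a sketch under assumptions: the port, the helper names, and the use of Playwright rather than another CDP client are not from the comment, and the remaining steps (cursor movement, proxies, captcha solving) are not covered.

```python
from typing import List


def chrome_args(user_data_dir: str, port: int = 9222) -> List[str]:
    """Command-line flags for a Chrome that keeps its profile and exposes CDP."""
    return [
        f"--user-data-dir={user_data_dir}",  # cookies and logins persist here
        f"--remote-debugging-port={port}",   # lets a script attach later
    ]


def attach(port: int = 9222):
    """Attach Playwright to the already-running Chrome; returns (playwright, page)."""
    # Imported lazily so chrome_args() can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    p = sync_playwright().start()
    browser = p.chromium.connect_over_cdp(f"http://localhost:{port}")
    context = browser.contexts[0]  # the existing profile's default context
    page = context.pages[0] if context.pages else context.new_page()
    return p, page
```

Start Chrome yourself with the flags from `chrome_args(...)`, call `attach()`, and call `p.stop()` on the returned Playwright object when finished.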

mhuffman4 months ago

I would be interested if you can find it.

_boffin_4 months ago

mhuffman4 months ago

Thanks!

_boffin_4 months ago

Also keep the following in mind:

If you were to use an automated browser, such as puppeteer / playwright:

- People don't move mouses in "straight" lines.

- People don't click on things that are out of viewport.

- Check the permissions you give sites.

Additional info:

- https://stackoverflow.com/questions/57987585/puppeteer-how-t...

- Look into connecting with CDP.
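The points about mouse paths and the viewport can be sketched as follows. This is an illustrative assumption about how one might do it, not a technique the commenter described: a quadratic Bezier curve with a random control point bends the path, and the element is scrolled into view before the click. All names and default values are hypothetical.

```python
import random
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def curved_path(start: Point, end: Point, steps: int = 25,
                jitter: float = 100.0,
                rng: Optional[random.Random] = None) -> List[Point]:
    """Points along a quadratic Bezier curve from `start` to `end`.

    A random control point bends the path so it is not a straight line.
    """
    rng = rng or random.Random()
    (x0, y0), (x1, y1) = start, end
    cx = (x0 + x1) / 2 + rng.uniform(-jitter, jitter)
    cy = (y0 + y1) / 2 + rng.uniform(-jitter, jitter)
    points = []
    for i in range(steps + 1):
        t = i / steps
        u = 1 - t
        points.append((u * u * x0 + 2 * u * t * cx + t * t * x1,
                       u * u * y0 + 2 * u * t * cy + t * t * y1))
    return points


def move_and_click(page, locator, start: Point = (0.0, 0.0)) -> None:
    """Scroll the element into view, follow a curved path to it, then click."""
    locator.scroll_into_view_if_needed()  # never click outside the viewport
    box = locator.bounding_box()          # None if the element is not visible
    if box is None:
        raise RuntimeError("element is not visible")
    target = (box["x"] + box["width"] / 2, box["y"] + box["height"] / 2)
    for x, y in curved_path(start, target):
        page.mouse.move(x, y)
    page.mouse.click(*target)
```

A curved path alone is unlikely to defeat a serious bot detector; it only removes one obvious signal.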

thealchemi1st4 months ago

You can give the open-source tools mentioned in this guide a look: https://scrapfly.io/blog/how-to-scrape-without-getting-block...

sebmellen4 months ago

https://browserless.io might be what you’re looking for. Open source although they do have a SaaS option.

djbusby4 months ago

I use a few things. First, I scrape from my home IP at very low rates. I drive either FF or Chrome using an extension. Sometimes I have to start the session manually (not a robot) and then engage the crawler. Sometimes, site dependent, I can run headless or Puppeteer. But the extension in a "normal" browser that goes slow has been working great for me.

It seems that some sites can determine when using headless or web-driver enabled profile.

Sometimes I'm through a VPN.

The automation is the easy part.

_boffin_4 months ago

Heads up, requests adds some extra headers on send.

One thing I’ve also been doing recently, when I find a site where I just want an API, is to use Python and execute a curl via Python. I populate the curl from Chrome’s network tab. I also have a purpose-built extension in my browser that saves cookies to a LAN Postgres DB, and the script then uses those values.

Can even probably do more by automating the browser to navigate there on failure.

bobbylarrybobby4 months ago

On a Mac, I use Keyboard Maestro, which can interact with the UI (which is usually stable enough to form an interface of sorts): wait for a graphic to appear on screen, then click it, then simulate keystrokes, run JavaScript on the current page and get a result back... It looks very human to a website in a browser, and is nearly as easy to write as Python.

iansinnott4 months ago

In short: Don't use HTML endpoints, use APIs.

This is not always possible, but if the product in question has a mobile app or a wearable talking to a server, you might be able to utilize the same API it's using:

- intercept requests from the device

- find relevant auth headers/cookies/params

- use that auth to access the API
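The last step, reusing captured auth against the API, might look like this with only the standard library. The endpoint, header values, and function names are placeholders; what you actually capture depends on the app.

```python
import json
import urllib.request
from typing import Any, Dict


def build_api_request(url: str, captured_headers: Dict[str, str]) -> urllib.request.Request:
    """Build a GET request that reuses auth headers captured from the app's traffic."""
    return urllib.request.Request(url, headers=captured_headers, method="GET")


def fetch_json(req: urllib.request.Request, timeout: float = 10.0) -> Any:
    """Send the request and decode a JSON response body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

For example, `fetch_json(build_api_request("https://api.example.com/v1/tracks", {"Authorization": "Bearer <captured token>"}))`, where both the URL and the token are stand-ins for whatever the interception step revealed.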

whilenot-dev4 months ago

If requests solves any 403 headaches for you, just pass the session cookies to a playwright instance, and you should be good to go. Just did that for scraping the SAP Software Download Center.
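Passing cookies from a `requests` session to Playwright mostly comes down to reshaping them into the dictionaries that `context.add_cookies()` accepts. A minimal sketch with a hypothetical helper name; it relies on iterating a `requests` cookie jar yielding cookie objects with `name`, `value`, `domain`, and `path` attributes.

```python
from typing import Any, Dict, Iterable, List


def to_playwright_cookies(jar: Iterable[Any]) -> List[Dict[str, Any]]:
    """Convert cookie objects (name/value/domain/path) to Playwright cookie dicts."""
    return [
        {"name": c.name, "value": c.value, "domain": c.domain, "path": c.path or "/"}
        for c in jar
    ]
```

With a logged-in `requests.Session` called `session` and a Playwright `context`, the hand-off is `context.add_cookies(to_playwright_cookies(session.cookies))` before opening the first page.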

lambdaba4 months ago

I've found selenium with undetected-chromedriver to work best.

unsupp0rted4 months ago

Doesn't get around Cloudflare's anti-bot

lambdaba4 months ago

Ah, ok, I found it worked with YouTube unlike regular chromedriver, didn't encounter Cloudflare when I used it

whatnotests24 months ago

With agents like Finic, soon the web will be built for agents, rather than humans.

I can see a few years from now almost all web traffic is agents.

jasonwcfanop4 months ago

Yep. I used to be the guy responsible for bot detection at Robinhood so I can tell you firsthand it's impossible to reliably differentiate between humans and machines over a network. So either you accept being automated, or you overcorrect and block legitimate users.

I don't think the dead internet theory is true today, but I think it will be true soon. IMO that's actually a good thing, more agents representing us online = more time spent in the real world.

candiddevmike4 months ago

That is some bizarre mental gymnastics to justify the work you've done. What about the rest of us who don't want agents representing us?

ayanb94404 months ago

If you want to use an agent for scraping/automation, you would need to supply it with auth credentials. So permission is required by default.

whatshisface4 months ago

I think they're talking about agents that click through insurance and bank forms, not bots that post on social media.

j0r0b04 months ago

Thank you for sharing!

Your sign up flow might be broken. I tried creating an account (with my own email), received the confirmation email, but couldn't get my account to be verified. I get "Email not confirmed" when I try to log in.

Also, the verification email was sent from [email protected], which is a bit confusing.

jasonwcfanop4 months ago

Oops! We tested the Oauth flow but forgot to update the email one. Thanks for the heads up, fixing this now.

ayanb94404 months ago

This should be fixed now

skeptrune4 months ago

I wonder if there are hidden observability problems with scraping, with ideal solutions of a different shape than a dashboard. It feels like a Sentry connection or other common alert monitoring solutions would combine well with the LLM-proposed changes and help teams react more quickly to pipeline problems.

ayanb94404 months ago

We do support sentry. Finic projects are poetry scripts so you can `poetry add` any observability library you need.

computershit4 months ago

First, nice work. I'm certainly glad to see such a tool in this space right now. Besides a UI, what does this provide that something like Browserless doesn't?

jasonwcfanop4 months ago

Thanks! Wasn't familiar with Browserless but took a quick look. It seems they're very focused on the scraping use case. We're more focused on the agent use case. One of our first customers turned us on to this - they wanted to build an RPA automation to push data to a cloud EHR. The problem was it ran as a single page application with no URL routing, and had an extremely complex API for their backend that was difficult to reverse engineer. So automating the browser was the best way to integrate.

If you're trying to build an agent for a long-running job like that, you run into different problems:

- Failures are magnified as a workflow has multiple upstream dependencies and most scraping jobs don't.

- You have to account for different auth schemes (Oauth, password, magic link, etc)

- You have to implement token refresh logic for when sessions expire, unless you want to manually login several times per day

We don't have most of these features yet, but it's where we plan to focus.

And finally, we've licensed Finic under Apache 2.0 whereas Browserless is only available under a commercial license.

sahmeepee4 months ago

Sounds like a problem that can be solved with a Playwright script with a bit of error checking in it.

I think this needs more elaboration on what the Finic wrapper is adding to stock Playwright that can't just be achieved through more effective use of stock Playwright.

xnyan4 months ago

I recently implemented something for a use case similar to what they described. To make something like that work robustly is actually quite a bit more effort than a Playwright script with a bit of error checking. I have not tried the product, but if it does what it claims on the back of the box it would be quite valuable, if for nothing more than the time savings of not figuring it all out on your own.

ushakov4 months ago

I do not understand what this actually is. Any difference between Browserbase and what you’re building?

Also, curious why your unstructured idea did not pan out?

ayanb94404 months ago

Looking at their docs, it seems that with Browserbase you would still have to deploy your Playwright script to a long-running job and manage the infra around that yourself.

Our approach is a bit different. With finic you just write the script. We handle the entire job deployment and scaling on our end.

ilrwbwrkhv4 months ago

Backed by YC = Not open source. Eventually pressure to exit and hyper scale will take over.

ayanb94404 months ago

There are quite a few open source YC startups at this point. Our understanding is that:

1. Developer tooling should be open source by default

2. Open source doesn't meaningfully affect revenue/scaling because developers that would use your self-hosted version would build in-house anyway.

ilrwbwrkhv4 months ago

I know there are quite a few open source by default companies. But the ethos of open source is sharing / building something by the community and getting paid in a way which does not scale the way VC funding expectations work.

So to have some respect for the open source way on top of which you are building all this please stop advertising it as "open source infrastructure" in bold and sell it like a normal software product with "source available" on the footer.

If you do plan to go open source and actually follow its ethos, remove the funded by VC label and have self hosting front and center in the docs with the hosted bit somewhere in the footer.

ilrwbwrkhv4 months ago

Like again if you are not sure, what open source means, this is open source: https://appimage.org/

Hope it is abundantly clear with this example. Docker tried its best to do the whole open source but business first and it led to disastrous results.

At best this will make your company suffer and second guess itself and at worst this is moral fraud.

Talk to your group partner about this and explain to them as well.

yard20104 months ago

I'm curious, can't do both?

slewis4 months ago

Is it stateful? Like can I do a run, read the results, and then do another run from that point?

ayanb94404 months ago

We currently don't save the browser state after the run has completed but that's something we can definitely add as a feature. Could you elaborate on your use case? In which scenarios would it be better to split a run into multiple steps?

mdaniel4 months ago

Almost any process that involves the word "workflow" (my mental model is one where the user would press alt-tab to look up something else in another window). The very, very common case would be one where they have a stupid SMS-based or "click email link" login flow: one would not wish to do that a ton, versus just leaving the session authenticated for reuse later in the day

Also, if my mental model is correct, the more browsing and mouse-movement telemetry those cloudflare/akamai/etc gizmos encounter, the more likely they are to think the browser is for real, versus encountering a "fresh" one is almost certainly red-alert. Not a panacea, for sure, but I'd guess every little bit helps

jasonwcfanop4 months ago

The way we plan to handle authenticated sessions is through a secret management service with the ability to ping an endpoint to check if the session is still valid, and if not, run a separate automation that re-authenticates and updates the secret manager with the new token. In that case, it wouldn't need to be stateful, but I can certainly see a case for statefulness being useful as workflows get even more complex.
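The check-then-reauthenticate flow described here can be sketched independently of any particular secret manager. Everything in this example is an assumption for illustration: `secrets` stands in for the secret store, `is_valid` for the endpoint ping, and `reauthenticate` for the separate login automation.

```python
from typing import Callable, MutableMapping


def ensure_token(
    secrets: MutableMapping[str, str],
    key: str,
    is_valid: Callable[[str], bool],
    reauthenticate: Callable[[], str],
) -> str:
    """Return a working session token, refreshing the stored one if it has expired."""
    token = secrets.get(key)
    if token is not None and is_valid(token):
        return token              # stored session still works
    token = reauthenticate()      # run the login automation
    secrets[key] = token          # update the store for later runs
    return token
```

Calling `ensure_token(...)` at the start of each run keeps the agent itself stateless, which matches the design described above.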

As for device telemetry, my experience has been that most companies don't rely too much on it. Any heuristic used to identify bots is likely to have a high false positive rate and include many legitimate users, who then complain about it. Captchas are much more common and effective, though if you've seen some of the newer puzzles that vendors like Arkose Labs offer, it's a tossup whether the median human intelligence can even solve them.

sebmellen4 months ago

We use https://windmill.dev which is great for this!
