Using Python and Selenium for automated visual regression testing (github.com/seleniumbase)
125 points by seleniumbase on Feb 12, 2020 | 40 comments



These examples show the system being very tightly coupled to the actual HTML. IME this leads to very brittle tests that fail due to restructuring/reorganizing/redesigning.

My two cents: I've never seen automated visual regression testing that wasn't terrible to work with, and where the most common result (by a large margin) of a test failing was that someone would update the expected file/image so it would pass with the new visuals. It's a hard problem, and one that I've personally decided isn't worth doing for the customer-facing software I've been involved with.


> the most common result (by a large margin) of a test failing was that someone would update the expected file/image so it would pass with the new visuals

Previous solutions that my team developed ended up like this. We found the problem can be split in two:

1. Figuring out if some change is actually wanted/expected

We found that doing a d-hash + Hamming distance (Levenshtein over the fixed-length hashes) is a very good way to handle tolerance, as it will ignore most text and subtle layout changes. Unlike most examples, though, the screenshot is first scaled down to 200px wide (this number is arbitrary; we're still figuring out how much tolerance it provides). A rough sketch of the idea follows after point 2.

2. How to handle these changes

This is the hard part. For now we only select the images that failed the previous step; using some Jenkins plugins, they are presented side by side (old, diff, new) during the pipeline and reviewed manually, but the versioning process is automated and the results are stored in the changelog.
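For concreteness, a minimal Python sketch of the d-hash + Hamming distance idea from point 1 (assuming Pillow; the hash size, file names, and threshold are arbitrary illustrations, not the poster's actual values):

    # d-hash: downscale, grayscale, compare adjacent pixels to get a bit string.
    # (The pipeline described above first scales the screenshot to ~200px wide.)
    from PIL import Image

    def dhash(path, hash_size=16):
        img = Image.open(path).convert("L").resize((hash_size + 1, hash_size))
        px = list(img.getdata())
        bits = [px[r * (hash_size + 1) + c] > px[r * (hash_size + 1) + c + 1]
                for r in range(hash_size) for c in range(hash_size)]
        return sum(1 << i for i, b in enumerate(bits) if b)

    def hamming(a, b):
        # Number of differing bits between the two hashes.
        return bin(a ^ b).count("1")

    TOLERANCE = 10  # arbitrary; tuned per project
    if hamming(dhash("baseline.png"), dhash("current.png")) > TOLERANCE:
        print("Visual change detected - queue for manual review")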


Being coupled to the HTML is more stable than being coupled to a screenshot. GUI testing takes a lot of effort, but it can absolutely be worth it. I work on a B2B SaaS app and we have been saved a few times by these tests. They also allow us to take on big refactorings at a reasonable cost.


That's the point of testing. The test identifies a break. A person investigates and either updates the test or fixes a problem.


I feel like you may not have read my comment very generously. I am very familiar with the point of a test. Let me try again to explain.

There is a cost to a brittle test. UI testing suffers from this more than other kinds because there are many 'plausible' UI arrangements, and as the product shifts and changes, you need to distinguish

1. "The dialog moved slightly to the left"/"We refactored the HTML, but it still looks the same" from

2. "The dialog is now underneath another element".

Suppose it takes 10 minutes to find, fix, review, push, deploy, and validate the fix. If failures of the former kind RADICALLY outnumber the latter, then it is easy enough to conclude that the test is not giving you a reasonable ROI. Perhaps the cost of releasing a latter-type failure is not that bad, if you can fix it and get it to production quickly. That might be cheaper overall than the ongoing maintenance cost of a test that would prevent the failure.

Also, and this is culture and product dependent, but we're talking about this like it's a single test, when it's usually a suite of tests (or multiple suites). If they have a 5% failure rate, and 95% of the time it's really an 'update the test, this is the new expected' situation, people will stop trusting the tests and will take shortcuts. So you may find that instead of paying 100% of the maintenance cost for the suite of brittle tests, you're paying 70%, but only getting 20% of the benefit, because people become accustomed to the failures, and once a test is failing, no one will notice that the failure changed from a 'benign failure' to a 'customer-can't-use' failure. (Note: numbers are imaginary, but not crazy.)

At one company I worked for, it was so bad that when we tried to introduce testing/check-in rigor, multiple developers pulled me over to make me explain "Why is this dumb test failing on my check-in attempt?", and we would look at the logs and other artifacts to uncover that "It's failing because you changed something without updating the relevant tests". It took quite a bit of time to re-train developers used to brittle tests to respect and maintain non-brittle ones.

And that is why I am against automated UI testing in general. :)


That's not how it should work. You should include updating the tests in the scope of any task and make sure enough time and resources are assigned.

Ideally the tests should be maintained by the developers themselves so they know how and where to quickly change them to accommodate code and behavior changes.


Have you tried Percy? There certainly can be false positives at times, but you can just approve the diffs so it's not a huge pain in the end.


I'm working on a product that focuses on doing visual regression testing at the HTML & CSS level, and the beta will be happening in one or two months. If anyone is interested in it, drop your email here! https://webdiff.io


I remember writing tests in Selenium in the past. Writing them was a horrible experience and some tests were not deterministic. The same test could fail or pass randomly.


We invested a large amount of time in them recently, and I agree they are terrible to write. We dare not even turn on IE or any other browser; we can't even keep the tests green in Firefox. I really wish we'd used Cypress instead, as it's 100x easier to debug and we're not getting cross-browser benefits anyway.

The underlying architecture is a bad design for heavy JavaScript apps, in my opinion. The round trips from the test runner to the Selenium server to the Selenium driver in a browser and back are slow, and so much can change on the page between steps in your tests. Cypress runs your test code in the JavaScript process of your browser, so I believe there's no or minimal round-trip lag.

We use `waitFor()` for the UI to stabilize, but that's been a hard mental model for devs to follow, and as a result we have tons of unnecessary waits in the tests, which slows them down. Even things like waiting for a loading modal to disappear before trying to interact with the UI are hard, since your code:

`waitFor('.loading-modal', false) // wait for it NOT to exist`

may run BEFORE it's even appeared, then fail in the next step when you try clicking on a button and the modal is there now. You can't wait for it to appear first to prevent that, as your code may run after it's already come and gone too.
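In Selenium/Python terms, the usual tool for this is an explicit wait; a minimal sketch (URL, selectors, and timeout are illustrative), though as described above it still can't distinguish "already gone" from "not yet appeared":

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    driver.get("https://example.com")  # placeholder

    wait = WebDriverWait(driver, 10)
    # Passes once the modal is invisible or absent - including if it never appeared.
    wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, ".loading-modal")))
    # Then wait for the target to become clickable before interacting with it.
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.edit"))).click()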

Tons of annoyances and strange behavior: chromedriver doesn't support emoji in inputs; ieDriver had setting select box values broken at one point; setValue('123') on a field actually does something like 'focus, clear, blur, APPEND 123', so blur logic that sets a default value of '0' on your field results in a final value of '0123' in your tests… just the worst.


> You can't wait for it to appear first to prevent that, as your code may run after it's already come and gone too.

While valid, that's not typical of many sites - what's the point of a very short-lived popup? And even if it is part of your page, you can skip the 'risky' part of the test and verify it some other way (logs? side effects?) or not at all.


A 'saving…' or 'loading…' popup, or any type of interaction-preventing mask, is a common UX pattern in JavaScript-heavy apps, in my opinion.

We didn't care about testing the popup at all, it was just breaking our other tests in the following way.

In our UI you can click a 'save' button, then a 'saving…' popup appears; meanwhile the 'save' button goes away and an 'edit' button appears (behind the popup), and when the response comes back it says 'Saved.' in the UI.

A test for `$('div=Saved.').toExist()` in wdio works; it does a waitFor under the covers and polls the UI until that text appears. It doesn't care whether there's a popup shown or not.

However, moving on to the next step in the tests, `$('button=Edit').click()` throws an 'element is not clickable' error if the popup is visible when it happens to run. Doing multi-command steps like 'check if the popup is there, if not click' generally doesn't work, as there's so much latency between commands. As a hacky workaround, you can inject JavaScript into the page that does both checks in the browser's JS process.
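A sketch of that injection workaround in Selenium/Python (URL and selectors are illustrative, not the actual app's):

    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("https://example.com")  # placeholder

    # The check and the click happen in one round trip, inside the browser's JS process.
    driver.execute_script("""
        if (!document.querySelector('.loading-modal')) {
            document.querySelector('button.edit').click();
        }
    """)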

We did upgrade our WebDriver library partly to get waitForClickable(), which based on the name at least sounded like it would handle the above, but there were no volunteers to update the 168 instances where waitForLoader() had spread through the codebase :/


I've worked with Selenium for a couple of years. There are a fair number of gotchas, but it's definitely possible to write pretty deterministic tests. For one, retries of Selenium actions are just necessary. So necessary that you build them into your test framework, instead of making them something that test writers have to interact with.
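A minimal sketch of what "build the retries into the framework" can look like in Python (the exception list, attempt count, and delay are illustrative choices):

    import time
    from selenium.common.exceptions import (
        ElementClickInterceptedException,
        StaleElementReferenceException,
    )

    def retry_action(action, attempts=3, delay=0.5):
        # Test writers call helpers built on this; they never write retries themselves.
        for attempt in range(attempts):
            try:
                return action()
            except (StaleElementReferenceException, ElementClickInterceptedException):
                if attempt == attempts - 1:
                    raise
                time.sleep(delay)

    # e.g. retry_action(lambda: driver.find_element(By.CSS_SELECTOR, "button.save").click())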

I don't know what specific problems you were having, so I can't give more specific advice than that.


SeleniumBase has those retries built-in. The browser will automatically wait for page elements to fully load before interacting with them. In addition, SeleniumBase includes all the abilities of pytest, which transforms Selenium from a library into a complete test automation framework.
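For reference, a short SeleniumBase-style test run via pytest (the URL and selectors are placeholders; the method names are BaseCase methods as I understand them, each of which waits for its element before acting):

    from seleniumbase import BaseCase

    class SaveFlowTest(BaseCase):
        def test_save_flow(self):
            self.open("https://example.com")          # placeholder URL
            self.type("#name", "Alice")               # waits for the field to be present
            self.click("button.save")                 # waits until the button is clickable
            self.assert_text("Saved.", "div.status")  # polls until the text appears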


Preface: I think there are tricks that experienced Selenium developers use to make tests less brittle, and it's annoying that there are tricks for that.

At the end of the day, if you can't get stable tests using the WebDriver W3C standard [0], you are doing something weird and overly complex with your web application. End to end testing isn't going to give you an objective right or wrong every time, but it should make you ask the question "Why does this happen?" and the answer is usually "Oh, we are doing something weird".

0: https://www.w3.org/TR/webdriver/


Yes, using bare Selenium is a write-off, especially if you are testing modern frameworks like React or Vue.js, which dynamically modify the DOM.

First, you need a framework that understands these technologies, and then on top of that you want something that offers a structure for capturing common actions and building an automatable view of a page.

For these requirements I have used Geb [1] with the Page Object Model approach with reasonable success - but you still have to approach writing your tests as actual software, not ad hoc random scripting.

[1] https://gebish.org/
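Geb is Groovy, but the Page Object idea translates directly; here is a minimal Python/Selenium sketch (class names, selectors, and URL are made up for illustration):

    from selenium.webdriver.common.by import By

    class LoginPage:
        URL = "https://example.com/login"  # placeholder

        def __init__(self, driver):
            self.driver = driver

        def open(self):
            self.driver.get(self.URL)
            return self

        def log_in(self, username, password):
            # Selectors live in one place, so a markup change means one fix.
            self.driver.find_element(By.ID, "username").send_keys(username)
            self.driver.find_element(By.ID, "password").send_keys(password)
            self.driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
            return DashboardPage(self.driver)

    class DashboardPage:
        def __init__(self, driver):
            self.driver = driver

        def greeting(self):
            return self.driver.find_element(By.CSS_SELECTOR, ".greeting").text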


Have you looked at cypress?


Cypress is a non-starter for end-to-end testing until it supports the WebDriver W3C spec.

I hear it's great for unit testing JavaScript though.


SeleniumBase (which wraps Selenium) adds reliability by improving existing methods so that the browser waits for page elements to fully load before interacting with them. It might be worth trying some of the examples from the SeleniumBase GitHub page to see if you still feel that way.


I've fixed up tests that randomly failed in the past. It's definitely possible to get them to work reliably, but it does require a paranoid approach of carefully waiting for changes.

Humorously enough though, people had never bothered to debug why the tests were failing. One reason was that a backend service crashed. It turned out that quickly deactivating users after doing things in the UI caused stuck threads that ran infinite loops. No one had ever looked at the logs.


I also remember doing so. It was such an effort that we only managed to get tests working for the registration and login flow, and only on Firefox ESR. They were frequently broken by changes to CSS classes or IDs or the structure of the HTML. Errors were very opaque, like "element not clickable at point" followed by a coordinate.

This was all before the W3C standardized browser automation, so I guess the situation may have improved a bit.


> It was such an effort that we only managed to get tests working for the registration and login flow, and only on Firefox ESR. They were frequently broken by changes to CSS classes or IDs or the structure of the HTML.

It sounds like you were approaching the end-to-end tests as an add-on and not part of your production code. If your problem is changing selectors, you need to make sure your developers know that changing selectors is going to make the tests fail and you need to equip them to handle it through tooling.
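One common piece of tooling here (my example, not something the commenter specified) is to standardize on dedicated test attributes, so styling refactors can't break selectors:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://example.com")  # placeholder

    # The markup under test carries e.g. <button data-testid="save-order">Save</button>;
    # the test never depends on CSS classes or DOM structure.
    driver.find_element(By.CSS_SELECTOR, "[data-testid='save-order']").click()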


Same experience here. Invested a lot of time rewriting tests, and each framework ended the same way:

- spent days/weeks refactoring tests to work with the new testing framework
- got tests 100% passing locally and passing in the CI env
- tests fail randomly on other people's machines or in CI
- pick a new framework, rinse, repeat

Tried Cypress, WebDriver, Jest, etc. Total waste of time.


SeleniumBase gives you more consistent results than just using WebDriver alone, as SeleniumBase wraps WebDriver methods to improve the reliability of browser actions.


We used NightwatchJS (went all the way with Page Objects) and it was a joy, because we had very real e2e coverage.


I see these kinds of tools once in a while and they bring out my jaded side. About 15 years ago, I recall considerable effort being spent at a company I worked at by the Quality Engineering team to build a visual regression tool for a video game UI I was working on. Using some Windows API magic it could detect buttons and click them to navigate through a UI, then take a screenshot and do a visual diff. It was the most broken, useless thing ever, and after 2 or 3 months of development it was scrapped.

The last place I worked had a team that built a similar system using Selenium and some image diff. It worked 90% of the time. It was even integrated into their CI pipeline, and the system would email you when it failed. You could make a change to one area of code and get an email from the system about a completely separate area that had failed for whatever random reason. I once submitted some code to that project, and when I got the email I asked one of the project maintainers what I should do to fix it. He told me to just ignore it. Their process was to check the output of the tool on those failures to see if there was any legitimate problem. When I checked the output myself, a large amount of it was garbage (about 100 application states tested and at least 10 in a garbled state).

My own team even tried to integrate Selenium, and we even found an external vendor where you can ship your Selenium tests along with your app and they will run them against a matrix of browsers of various versions. We barely got a Chrome version running - no hope for Safari, Firefox, or IE/Edge. It was a black hole of time just fighting to get the equivalent of hello world running consistently as a test in all browsers.

One day someone will prove me wrong and get this kind of UI testing working reliably. But after nearly 20 years seeing optimistic people die on this hill - I do not support wasting more time on it.


I've done a ton of Selenium, Appium and WinAppDriver (kinda like Selenium) in Java, C# and Python, as a contractor at at least 7 different companies. Every company experienced these issues.

Selenium is great. It really is. But the level of flaky tests that generally get produced is just so painful. And as you say, getting it to run in Firefox hasn't been too bad, but any other browser as well was a pipe(line) dream. The last time I tried to use IEDriver with Grid it only allowed one instance anyway, so we couldn't run multiple on one machine.

UIPath, AutomationAnywhere and these other 'no code required, but we have a place you can write code just in case' tools - I'm looking forward to seeing how they deal with these sorts of problems. When you're scraping data off a page that loads dynamically or is time-dependent, it's not trivial. Even with implicit waits, retries and the other 'tricks' someone described further down, it's frustratingly difficult at times to do just a single action in an app/page.

And then you hit shadow DOM websites, or Windows apps that create new windows (not child popups) for every screen/dialog box. The joy :)
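For the shadow DOM case, one workaround I've seen (an assumption on my part, not necessarily what this commenter does) is to query inside the shadow root via script execution, since the returned node comes back as a regular WebElement:

    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("https://example.com")  # placeholder; 'my-widget' below is a hypothetical custom element

    # Pierce the shadow root in the browser and hand the inner element back to the test.
    button = driver.execute_script(
        "return document.querySelector('my-widget').shadowRoot.querySelector('button')"
    )
    button.click()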

Still, I find it an enjoyable, if frustrating challenge.


I have had the same experience using Selenium, but not with the Chrome DevTools Protocol. However, you still need a stable environment, trustworthy selectors, and automatic waiting logic to have stable e2e web tests.

I am working on an open source library that generates Playwright tests (Playwright uses the Chrome DevTools Protocol), and I hope we can prove you wrong about getting UI tests working reliably. https://github.com/qawolf/qawolf


SeleniumBase methods wait for page elements to fully load before interacting with them, which improves test reliability. There are also built-in mobile testing and cross-browser testing features: https://github.com/seleniumbase/SeleniumBase/blob/master/hel...


People have moved on to headless Chrome


Puppeteer (JS) in particular. But since it's built on the Chrome DevTools Protocol, a JSON-based RPC over WebSocket, you can use the underlying protocol from any language.

It's by far the leading player in this space and for good reason. Selenium is unreliable and too generic.


There are puppeteer clones for many languages now. It's impressive how successful the devtools protocol has been.


WebDriver can still work with headless modes. Firefox has a headless mode too. And they're simpler than the old Xvfb approach.
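A minimal sketch of driving both headless modes through Selenium in Python (the flags shown are the commonly used ones; exact flags vary by browser version):

    from selenium import webdriver

    chrome_opts = webdriver.ChromeOptions()
    chrome_opts.add_argument("--headless")

    firefox_opts = webdriver.FirefoxOptions()
    firefox_opts.add_argument("-headless")

    # No Xvfb or display server needed; both browsers render off-screen.
    chrome = webdriver.Chrome(options=chrome_opts)
    firefox = webdriver.Firefox(options=firefox_opts)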


The unspoken subtext is that development on these tools is really aimed at building spam bots. Automated testing is just the cover story, like a torrent client putting Big Buck Bunny in the screenshot.


Very interesting to see this! From the comments it seems that a bunch of people have been burned by Selenium-style testing in the past. There's a pretty interesting paper/thesis [0] about this which talks about what kinds of breakages are most common in these scenarios.

Which is why (shameless plug alert) we at Rainforest [1] are working with crowdsourced testers - humans are still much, much better at visual diffing and judgement than machines are. Feel free to shoot me an email if you're interested in talking about it and figuring out how much to automate and how much to leave to humans.

[0] Why do Record/Replay Tests of Web Applications Break?, Mouna Hammoudi, Gregg Rothermel, Paolo Tonella

[1] https://www.rainforestqa.com/


I wrote a similar tool in Java, and it was effective at picking up minute and unintended visual changes when testing with many browsers.

My approach was to capture DOM element attributes and store them in a database, then compare snapshots.

https://github.com/gulkily/Selenium-Utilities/tree/master/sr...
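The linked project is in Java, but the idea sketches easily in Python/Selenium (the selectors, captured attributes, and JSON storage below are illustrative; the real tool stores snapshots in a database):

    import json
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://example.com")  # placeholder

    # Capture per-element layout attributes; later runs are diffed against this baseline.
    snapshot = [
        {
            "tag": el.tag_name,
            "location": el.location,  # rendered x/y
            "size": el.size,          # rendered width/height
            "class": el.get_attribute("class"),
        }
        for el in driver.find_elements(By.CSS_SELECTOR, "body *")
    ]

    with open("baseline.json", "w") as f:
        json.dump(snapshot, f, indent=2)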


I'd like to see some approval test infrastructure here as well, q.v. https://approvaltests.com/

Being able to have a customer accept what the output looks like, and then be notified when the page changes, would be great for giving non-technical people control over passing tests.


I think Facebook released a similar project a few years ago that I never got a chance to try, but I can't seem to find it within their GitHub anymore.

Anyone remember that project?



I think this might be what you are thinking of: https://github.com/facebookarchive/huxley



