I was recently contracted to go through and update a client's database values. The issue was that the URLs stored in the database were query URLs that redirect to the destination URL, rather than the destination URL itself. Example:
https://www.somedomain.com/Query/SomeProduct/Details.asp?Code=blahblah
Instead of:
https://www.somedomain.com/SomeProduct
Pfft, easy.
So what I'm thinking:
Create a script that iterates over each of the query URLs, performs an HTTP request with the query URL and, once the page has loaded, evaluates the current URL. I'll get the data in a CSV and use Python and its awesomeness to read the data, then use a library to scrape the web and run a quick JS script to evaluate the query URL into its new URL.
Pfft, easy.
So what actually happened:
Python library and dependency hell. Despite using pip to install and manage my dependencies, I found that one dependency didn't support Python 3. So I had to reinstall Python 2.7, only to find out afterwards that another dependency requires Python 3. Eventually, I hacked together some patches and libraries to make this work.
TL;DR: Use a webdriver to open each query URL, wait for the page to load, and run the JavaScript "location.href" to evaluate the destination URL and return its value.
However, the speed was horrible...at least 7-10 seconds per website, and some of the websites wouldn't finish loading under the Selenium driver, resulting in an undefined value. To further my frustration, I would have to download more sketchy dependencies just to adjust the webdriver's timeout to....holy shit, this is terrible.
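For reference, here's a minimal sketch of that webdriver approach. The original was Python; this version uses Selenium's Node bindings (selenium-webdriver) to match the rest of the code in this post, and the 10-second page-load timeout is an assumption:

```js
// Sketch of the webdriver approach, translated to Selenium's Node
// bindings (assumed: selenium-webdriver driving Chrome).
const { Builder } = require('selenium-webdriver');

async function resolveWithBrowser(queryUrl) {
  const driver = await new Builder().forBrowser('chrome').build();
  try {
    // Assumed 10s cap; pages that never finish loading throw here
    // instead of hanging forever.
    await driver.manage().setTimeouts({ pageLoad: 10000 });
    await driver.get(queryUrl);
    // After the redirect, the browser's current URL is the destination.
    return await driver.executeScript('return location.href;');
  } finally {
    await driver.quit();
  }
}
```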
I decided to change my tech stack. I initially chose Python because I didn't want issues reading the data in the CSV, but thankfully there were Node.js libraries for that. Furthermore, I decided to analyse the actual HTTP requests that were being made and received. Surely, there had to be some sign of the destination URL in the HTTP response for the query URL, since that's the nature of a redirect. There was!
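Concretely, the page served for a query URL carried a canonical link tag pointing at the clean URL, something along these lines (the exact markup is my reconstruction, based on the selector in the final script):

```html
<link rel="canonical" href="https://www.somedomain.com/SomeProduct" />
```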
So final solution:
Use a simple Node script to read the query URLs from the CSV, use that data to make an HTTP request, read the HTTP response for the destination URL ($("link[rel='canonical']").attr('href');) and record it back next to the query URL. I did have some issues with async and writing the CSV due to my order of operations, but I changed my algorithm to store all the URLs before writing them, so I could maintain the order of the data and ensure I wasn't accidentally mismatching query URLs with destination URLs just because some HTTP requests were faster than others. The script was fast too, evaluating 8-12 query URLs per second and cutting my estimated 2.5-hour Python run down to about 3 minutes with Node.
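Here's a minimal sketch of that final script. The library choices (csv-parse, axios, cheerio) and the CSV column names are my assumptions; the original script may well differ:

```js
// Sketch: read query URLs from a CSV, fetch each page, pull the canonical
// link out of the response, and write everything back in the original order.
const fs = require('fs');
const { parse } = require('csv-parse/sync');
const axios = require('axios');
const cheerio = require('cheerio');

async function resolveAll(inputCsv, outputCsv) {
  // Assumed: the CSV has a header row with a "queryUrl" column.
  const rows = parse(fs.readFileSync(inputCsv), { columns: true });

  // Fire the requests in parallel. Promise.all resolves in the order of
  // the input array, no matter which requests finish first, so each
  // destination URL stays lined up with its query URL.
  const results = await Promise.all(rows.map(async (row) => {
    try {
      const { data } = await axios.get(row.queryUrl, { timeout: 10000 });
      const $ = cheerio.load(data);
      return [row.queryUrl, $("link[rel='canonical']").attr('href') || ''];
    } catch (err) {
      return [row.queryUrl, '']; // leave the destination blank on failure
    }
  }));

  // Write the CSV once, after every request has settled.
  const lines = results.map(([q, d]) => `${q},${d}`).join('\n');
  fs.writeFileSync(outputCsv, 'queryUrl,destinationUrl\n' + lines + '\n');
}

resolveAll('urls.csv', 'urls-resolved.csv');
```

Writing once at the end, after all requests settle, mirrors the store-then-write fix described above: fast responses can't jump the queue in the output.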
PS: I'm happy to share my script with anyone who would like to use it or see how I did what I did.