Quick and dirty image searches
A quick post about how I tried to find some images with specific dimensions using Google Image Search (GIS in this post).
GIS lets you filter on predefined aspect ratios, but these are broad categories like 'tall' or 'panoramic'. I was looking for the size of images in pixels. GIS can show that size over each image, but it isn't available as an option to fine-tune searches.
So to get what I was after, I had to dive into the raw html of a page with search results. Let's say I search for 'Ramona Flowers'. The resulting page contains plenty of images. Saving the page like one normally would with ctrl+s 1) doesn't save the images in a useful format, and 2) doesn't give the html you'd expect. I think Google has a script in place to stop the most basic scrapers (I haven't looked for the answer, I was just trying to get what I was looking for).
So instead I opened the html source of the page by pressing ctrl+u (prepending 'view-source:' to the URL is another road to the same result). Then I simply selected all the code and saved it to a new local imgSearch.html file.
Looking at the html, the sections with images are easy to find. What I need is 1) the dimensions of all the images and 2) if something matches what I'm looking for, the URL of the image.
These two values aren't contained together in one div, span or other html tag, so they have to be matched up some other way, for instance by their position in the page. The divs don't match up either, by the way.
html; imgSearch.html
<div jscontroller="Q7Rsec" data-ri="2" class="rg_bx rg_di rg_el ivg-i" data-ved="0ahUKEwicu7CzidLXAhUEOxoKHUVyDXYQMwiAAigCMAI"><a jsname="hSRGPd" href="#" jsaction="fire.ivg_o;mouseover:str.hmov;mouseout:str.hmou" class="rg_l" rel="noopener" style="background:rgb(16,13,10)"><img class="rg_ic rg_i" data-sz="f" name="0MezjRyHo_PdPM:" jsaction="load:str.tbn" alt="Image result for ramona flowers" onload="typeof google==='object'&&google.aft&&google.aft(this)"><div class="_aOd rg_ilm"><div class="rg_ilmbg"><span class="rg_ilmn"> 1366 × 768 - youtube.com </span></div></div></a>
We open the html file and throw it into BeautifulSoup to efficiently get the contents we want from it.
After some digging, the two class values we need seem to be 'rg_ilmn' (the dimensions) and 'rg_meta notranslate' (the URL). I set up two objects that contain all found items matching those criteria. This is in no way a solid or reliable approach, but it works for now. There are plenty of reasons why the two objects might not line up item by item.
python; main.py
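from bs4 import BeautifulSoup

htmlFile = r"C:\x\y\imgSearch.html"

# Parse the saved page source
with open(htmlFile) as page:
    soup = BeautifulSoup(page.read(), "html.parser")

# Two parallel result lists; assumed (not guaranteed) to be in the same order
dimensions = soup.find_all('span', {'class': 'rg_ilmn'})
urls = soup.find_all('div', {'class': 'rg_meta notranslate'})

for index, dimension in enumerate(dimensions):
    # str(dimension) looks like '<span class="rg_ilmn"> 1366 × 768 - youtube.com </span>'
    snippet = str(dimension).replace(u'\xa0', u' ')
    snippetItems = snippet.split(" ")
    firstDim = int(snippetItems[2])   # width in pixels
    secondDim = int(snippetItems[4])  # height in pixels
    # Same test as secondDim / firstDim == 3/4, but in integers to avoid float comparison
    if secondDim * 4 == firstDim * 3:
        urlSnippet = str(urls[index]).replace(u'\xa0', u' ')
        urlItems = urlSnippet.split(" ")
        urlJson = urlItems[3]
        # The image URL sits between "ou":" and ","ow" in this chunk
        urlJson = urlJson.split("\"ou\":\"", 1)[1]
        urlString = urlJson.split("\",\"ow\"", 1)[0]
        print(urlString)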
We go over all the found snippets that contain dimensions. I inspected one such string and split it on spaces so that the two dimensions (width and height) could be picked out by position. Again, plenty could go wrong here.
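Picking the numbers by position breaks as soon as the markup around them changes. A slightly less fragile sketch is to pull the two numbers out of the text with a regular expression. This is not what the script above does; the function name parse_dimensions is made up for this example, and it assumes the text always has the form "width × height".

```python
import re

# Matches "<width> × <height>"; a plain "x" is accepted as separator too.
# \s also matches the non-breaking spaces Google puts around the numbers.
DIMENSION_PATTERN = re.compile(r"(\d+)\s*[×x]\s*(\d+)")

def parse_dimensions(text):
    """Return (width, height) as ints, or None if no dimensions are found."""
    match = DIMENSION_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

print(parse_dimensions(" 1366 × 768 - youtube.com "))  # (1366, 768)
print(parse_dimensions("no size here"))                # None
```

Returning None instead of raising makes it easy to skip snippets that don't look like the others.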
If the dimensions match what I'm looking for (in this case height divided by width equal to 3/4), we look up the item at the same index in the URLs object, assuming that it pertains to the same image whose dimensions we just got.
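The index-based matching and the ratio test can be sketched on their own, without any html. The sample sizes and example.com URLs below are made up, and has_ratio and matching_urls are names I chose for this sketch. It compares ratios as exact fractions and refuses to pair the two lists if their lengths differ, which catches at least the most obvious mismatch.

```python
from fractions import Fraction

def has_ratio(width, height, target=Fraction(3, 4)):
    """True if height / width equals target exactly (no float rounding)."""
    return width > 0 and Fraction(height, width) == target

def matching_urls(sizes, urls, target=Fraction(3, 4)):
    """Pair sizes and URLs by position and keep the URLs whose size matches."""
    if len(sizes) != len(urls):
        raise ValueError("lists differ in length, positions cannot be trusted")
    return [url for (w, h), url in zip(sizes, urls) if has_ratio(w, h, target)]

# Hypothetical sample data
sizes = [(1366, 768), (1024, 768), (800, 600)]
urls = [
    "http://example.com/a.jpg",
    "http://example.com/b.jpg",
    "http://example.com/c.jpg",
]
print(matching_urls(sizes, urls))
# ['http://example.com/b.jpg', 'http://example.com/c.jpg']
```

Equal lengths don't prove the items line up, of course, but unequal lengths prove they don't.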
Some more cleaning and splitting of the string results in a URL for an image. I just print them and paste them one by one into the browser to see what is actually in the images.
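The splitting on "ou" and "ow" suggests the contents of the 'rg_meta' div are JSON. Assuming that is the case (I haven't verified it for every result), the URL could be read with the json module instead of string splitting. The sample string below is made up, and original_url is a name chosen for this sketch.

```python
import json

def original_url(meta_text):
    """Return the value of the "ou" key, or None if the text is not usable JSON."""
    try:
        meta = json.loads(meta_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(meta, dict):
        return None
    return meta.get("ou")

# Hypothetical sample of what one rg_meta div might contain
sample = '{"ou":"http://example.com/ramona.jpg","ow":1024}'
print(original_url(sample))        # http://example.com/ramona.jpg
print(original_url("not json"))    # None
```

With BeautifulSoup, the input would be the text inside each div (for example via get_text()) rather than str(div), so the surrounding tag doesn't get in the way.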
Of course it would be nicer to automatically download all the images, or to at least do something with the URLs. But I got the result I was looking for. I said it would be quick and dirty. Now anybody who needs images with specific dimensions can (partly) automate the search.
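For completeness, a rough sketch of what the download step could look like using only the standard library. I didn't run this as part of the search; the function names, the 'images' target directory and the file naming scheme are all assumptions made for this example.

```python
import os
import urllib.request
from urllib.parse import urlparse

def filename_for(url, index):
    """Build a local file name from the URL path, prefixed with a counter."""
    name = os.path.basename(urlparse(url).path) or "image"
    return "{:03d}_{}".format(index, name)

def download_all(urls, target_dir="images"):
    """Download each URL into target_dir and return the saved paths."""
    os.makedirs(target_dir, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        path = os.path.join(target_dir, filename_for(url, index))
        try:
            urllib.request.urlretrieve(url, path)
        except (OSError, ValueError) as error:
            # Network errors and malformed URLs: report and move on
            print("skipped {}: {}".format(url, error))
            continue
        saved.append(path)
    return saved

print(filename_for("http://example.com/pics/ramona.jpg?x=1", 2))  # 002_ramona.jpg
```

The counter prefix keeps files apart when two URLs end in the same name. Calling download_all with the printed URLs would fetch them in one go; some sites may refuse requests that don't come from a browser.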