Dog Web Scraper V2
Purpose
To add onto the previously build dog web scraper.
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import datetime
Web Scraper & Initial Data Table
url = "https://fetchwi.org/adopt"
results = requests.get(url)
soup = BeautifulSoup(results.text, "html.parser")
names = []
tags = []
pagelinks = []
name_div = soup.find_all('div', class_="summary-content sqs-gallery-meta-container")
for container in name_div:
name = container.find('a', class_="summary-title-link").text
names.append(name)
href = container.find('a', class_="summary-title-link")['href']
pagelinks.append(href)
div1 = container('div', class_="summary-metadata-container summary-metadata-container--below-content")
for span in div1:
tag = span('div', class_="summary-metadata summary-metadata--primary")
for tag1 in tag:
tag2 = tag1.text
tags.append(tag2)
## Additional date column to potentially track over-time changes
date = datetime.datetime.now()
dates = [date] * len(tags)
doggos = pd.DataFrame({
'name': names,
'tags':tags,
'link':pagelinks,
'date': dates
})
doggos = doggos.replace(to_replace=r"\n", value="", regex=True)
doggos["link"] = doggos["link"].replace(to_replace="/doggos/", value="", regex=True)
doggos = doggos.astype({
'name': "string",
'tags': "string",
'link': "string",
'date': "object"
})
Selecting Dogs that Fit my Lifestyle
potential_doggos = doggos.loc[doggos["tags"].str.contains("Could live in an apartment")]
potential_doggos
| name | tags | link | date | |
|---|---|---|---|---|
| 15 | Sugar | Housebroken, Good in the car, Can free roam wh... | sugar3 | 2022-02-06 16:29:22.521470 |
| 22 | Pride | Good with dogs, Crate trained, Housebroken, Go... | pride | 2022-02-06 16:29:22.521470 |
| 28 | Myla | Crate trained, Good for beginner dog owner, No... | myla | 2022-02-06 16:29:22.521470 |
Second Scraper
This is a new scraper that will scrape from the specific dog page (found in the “link” column) using a for loop and appending the specific “doggo” weblink from the base URL.
dog_url = []
dog_url = potential_doggos['link'].tolist()
names2 = []
dog_infos = []
tags2 = []
description = []
url = "http://fetchwi.org/doggos/"
for dogs in dog_url:
results = requests.get(url + dogs)
soup = BeautifulSoup(results.text, "html.parser")
name = soup.find('h1', class_="entry-title entry-title--large p-name").text
names2.append(name)
dog_info = soup.find('div', class_="blog-item-content e-content").h4.text
dog_infos.append(dog_info)
tag = soup.find_all('p', class_="")[0].text
tags2.append(tag)
desc = soup.find_all('p', class_="")[1].text
description.append(desc)
Length Checking
print("names:",len(names2))
print("dog_info:",len(dog_infos))
print("tags:",len(tags2))
print("descrpt:",len(description))
names: 3
dog_info: 3
tags: 3
descrpt: 3
Potential Dog DataTable
potential_doggos_ext = pd.DataFrame({
'name': names2,
'dog_info': dog_infos,
'tags': tags2,
'description': description
})
new = potential_doggos_ext["dog_info"].str.split("|", n=3, expand=True)
potential_doggos_ext["Breed"] = new[0]
potential_doggos_ext["Sex"] = new[1]
potential_doggos_ext["Age"] = new[2]
potential_doggos_ext["Weight"] = new[3]
potential_doggos_ext = potential_doggos_ext.astype("string", errors="ignore")
potential_doggos_ext
| name | dog_info | tags | description | Breed | Sex | Age | Weight | |
|---|---|---|---|---|---|---|---|---|
| 0 | Sugar | English Bulldog | Female | 2 Years Old | 43 Lbs | QUICK FACTS: ✔️ Housebroken! ✔️ Good in car... | Sugar gets up around 6:30/7am in the morning f... | English Bulldog | Female | 2 Years Old | 43 Lbs |
| 1 | Pride | pointer/terrier Mix | Female | 11 months Old |... | QUICK FACTS: ✔️ Good with other dogs! ✔️ Cr... | PUPDATE 4 | pointer/terrier Mix | Female | 11 months Old | 46 Lbs |
| 2 | Myla | german shepherd Mix | Female | 1 year old | 48... | QUICK FACTS: ✔️ Housebroken! ✔️ Good for be... | Myla is an energetic and loving German Shepher... | german shepherd Mix | Female | 1 year old | 48 Lbs |
Final Thoughts
This was a great way to revisit my original scraper. I learned a lot about what html containers are displayed on the “Inspect” tool of the webpage vs. what actually is scraped. It is also unfortunate that certain parts of the page were not structured to fit what the web scraper is scraping. Most of the “Updates” section is in a “p” container with a null class, which is shared by other, non-update related page elements. For that reason, I chose to only scrape the first paragraph. While many dogs will have more than one paragraph of updates, it was better to grab just the first paragraph as all dogs will have this filled out. Similarly, for dogs with multiple paragraphs in their “Updates” section, displaying all paragraphs in the data table may not be the best formatting wise. If I were to want to grab all paragraphs in the future, it wouldn’t be too much more work to add another ‘for’ loop, string concatenate the results above the 0th container, and filter out the unrelated results.