Purpose

To extend the previously built dog web scraper.

import datetime

import pandas as pd
import requests
from bs4 import BeautifulSoup

Web Scraper & Initial Data Table

url = "https://fetchwi.org/adopt"
results = requests.get(url)
soup = BeautifulSoup(results.text, "html.parser")

names = []
tags = []
pagelinks = []

name_div = soup.find_all('div', class_="summary-content sqs-gallery-meta-container")

for container in name_div:
    # The title link holds both the dog's name and the link to its page
    title_link = container.find('a', class_="summary-title-link")
    names.append(title_link.text)
    pagelinks.append(title_link['href'])

    # The tags sit in a metadata div nested inside the "below-content" container
    div1 = container.find_all('div', class_="summary-metadata-container summary-metadata-container--below-content")

    for span in div1:
        tag = span.find_all('div', class_="summary-metadata summary-metadata--primary")

        for tag1 in tag:
            tags.append(tag1.text)

## Additional date column to potentially track over-time changes
date = datetime.datetime.now()

dates = [date] * len(tags)

doggos = pd.DataFrame({
    'name': names,
    'tags': tags,
    'link': pagelinks,
    'date': dates
})

doggos = doggos.replace(to_replace=r"\n", value="", regex=True)
doggos["link"] = doggos["link"].replace(to_replace="/doggos/", value="", regex=True)
doggos = doggos.astype({
    'name': "string",
    'tags': "string",
    'link': "string",
    'date': "object"
})

Selecting Dogs that Fit my Lifestyle

potential_doggos = doggos.loc[doggos["tags"].str.contains("Could live in an apartment")]

potential_doggos
    name   tags                                                link    date
15  Sugar  Housebroken, Good in the car, Can free roam wh...   sugar3  2022-02-06 16:29:22.521470
22  Pride  Good with dogs, Crate trained, Housebroken, Go...   pride   2022-02-06 16:29:22.521470
28  Myla   Crate trained, Good for beginner dog owner, No...   myla    2022-02-06 16:29:22.521470
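
The filter above matches a single phrase. As a minimal sketch, the same idea can be extended to several criteria. The sample rows and the `must_have` list below are made up for illustration and are not scraped data. Passing `regex=False` treats each phrase as literal text, and `na=False` counts a missing tag string as a non-match.

```python
import pandas as pd

# Hypothetical sample rows standing in for the scraped `doggos` table
sample = pd.DataFrame({
    "name": ["Dog A", "Dog B", "Dog C"],
    "tags": [
        "Housebroken, Could live in an apartment",
        "Good with dogs, Crate trained",
        None,  # a dog whose tags were not scraped
    ],
})

# Hypothetical list of phrases that a dog's tags must all contain
must_have = ["Could live in an apartment", "Housebroken"]

mask = pd.Series(True, index=sample.index)
for phrase in must_have:
    # regex=False: match the phrase literally; na=False: missing tags never match
    mask &= sample["tags"].str.contains(phrase, regex=False, na=False)

matches = sample.loc[mask]
print(matches["name"].tolist())
```

Only "Dog A" carries both phrases, so it is the only row kept.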

Second Scraper

This new scraper visits each selected dog's own page. A for loop appends each value in the “link” column to the base URL and scrapes the page at the resulting address.

dog_url = potential_doggos['link'].tolist()

names2 = []
dog_infos = []
tags2 = []
description = []

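# Base URL for the individual dog pages; each "link" value is appended to it
url = "https://fetchwi.org/doggos/"

for dog in dog_url:
    results = requests.get(url + dog)
    soup = BeautifulSoup(results.text, "html.parser")

    name = soup.find('h1', class_="entry-title entry-title--large p-name").text
    names2.append(name)

    # The h4 holds breed, sex, age and weight as one "|"-separated string
    dog_info = soup.find('div', class_="blog-item-content e-content").h4.text
    dog_infos.append(dog_info)

    # <p> containers with an empty class: the first holds the quick facts,
    # the second holds the first paragraph of updates
    paragraphs = soup.find_all('p', class_="")
    tags2.append(paragraphs[0].text)
    description.append(paragraphs[1].text)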
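
The loop above assumes every page contains each element it looks for; if one is missing, `find` returns `None` and the `.text` lookup raises an `AttributeError`. Below is a minimal sketch of a more forgiving version, run on made-up HTML rather than the live site. The helpers `text_or_none` and `parse_dog_page` are hypothetical names, not part of the original scraper.

```python
from bs4 import BeautifulSoup


def text_or_none(element):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return element.get_text(strip=True) if element is not None else None


def parse_dog_page(html):
    """Pull the name and info line out of one dog page (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    name = soup.find("h1", class_="entry-title")
    content = soup.find("div", class_="blog-item-content")
    # Only look for the h4 if the surrounding div exists
    info = content.find("h4") if content is not None else None
    return {"name": text_or_none(name), "dog_info": text_or_none(info)}


# Made-up HTML that imitates the page structure targeted by the scraper
complete = """
<h1 class="entry-title entry-title--large p-name">Example Dog</h1>
<div class="blog-item-content e-content"><h4>Breed | Female</h4></div>
"""
missing_info = '<h1 class="entry-title entry-title--large p-name">Other Dog</h1>'

complete_row = parse_dog_page(complete)
missing_row = parse_dog_page(missing_info)
print(complete_row)
print(missing_row)
```

A page without the info block then produces a `None` entry instead of stopping the whole loop, which also keeps the result lists the same length.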

Length Checking

print("names:", len(names2))
print("dog_info:", len(dog_infos))
print("tags:", len(tags2))
print("description:", len(description))
    names: 3
    dog_info: 3
    tags: 3
    description: 3
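
The counts above are compared by eye. Since `pd.DataFrame` raises a `ValueError` when the lists passed in a dict differ in length, the same check could be automated before building the table. This is a small sketch with a hypothetical helper, `check_equal_lengths`, and made-up lists in place of the scraped ones.

```python
def check_equal_lengths(**columns):
    """Raise ValueError if the named lists differ in length (hypothetical helper)."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


# Made-up lists standing in for the scraped columns
lengths = check_equal_lengths(names=["a", "b"], tags=["x", "y"])
print(lengths)

# A mismatch is reported with the length of every column
try:
    check_equal_lengths(names=["a", "b"], tags=["x"])
except ValueError as err:
    message = str(err)
    print(message)
```

In the notebook, the call would look like `check_equal_lengths(names=names2, dog_info=dog_infos, tags=tags2, description=description)`.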

Potential Dog Data Table

potential_doggos_ext = pd.DataFrame({
    'name': names2,
    'dog_info': dog_infos,
    'tags': tags2,
    'description': description
})

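# Split the "Breed | Sex | Age | Weight" string into four columns
new = potential_doggos_ext["dog_info"].str.split("|", n=3, expand=True)

# Strip the spaces that the " | " separator leaves around each piece
potential_doggos_ext["Breed"] = new[0].str.strip()
potential_doggos_ext["Sex"] = new[1].str.strip()
potential_doggos_ext["Age"] = new[2].str.strip()
potential_doggos_ext["Weight"] = new[3].str.strip()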

potential_doggos_ext = potential_doggos_ext.astype("string", errors="ignore")

potential_doggos_ext
   name   dog_info                                           tags                                             description                                        Breed                Sex     Age            Weight
0  Sugar  English Bulldog | Female | 2 Years Old | 43 Lbs    QUICK FACTS: ✔️ Housebroken! ✔️ Good in car...   Sugar gets up around 6:30/7am in the morning f...  English Bulldog      Female  2 Years Old    43 Lbs
1  Pride  pointer/terrier Mix | Female | 11 months Old |...  QUICK FACTS: ✔️ Good with other dogs! ✔️ Cr...   PUPDATE 4                                          pointer/terrier Mix  Female  11 months Old  46 Lbs
2  Myla   german shepherd Mix | Female | 1 year old | 48...  QUICK FACTS: ✔️ Housebroken! ✔️ Good for be...   Myla is an energetic and loving German Shepher...  german shepherd Mix  Female  1 year old     48 Lbs

Final Thoughts

This was a great way to revisit my original scraper. I learned a lot about the difference between the HTML containers shown in the browser's “Inspect” tool and what the scraper actually receives. Unfortunately, parts of the page are not structured in a way that suits scraping: most of the “Updates” section sits in “p” containers with an empty class, which other, unrelated page elements also use.

For that reason, I chose to scrape only the first paragraph of updates. Many dogs have more than one paragraph, but every dog has the first one filled out, and showing every paragraph in the data table would not format well anyway. If I wanted to grab all paragraphs in the future, it wouldn’t take much more work: add another ‘for’ loop, concatenate the text of the containers after the 0th one, and filter out the unrelated results.
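
As a rough sketch of that future change, the snippet below loops over every paragraph after the 0th container, drops unrelated ones, and joins the rest into one string. The HTML is made up to imitate a dog page, and the `UNRELATED` phrases are hypothetical; the real page would need its own filter rules. For simplicity the sketch selects every “p” inside the content div rather than matching on the empty class.

```python
from bs4 import BeautifulSoup

# Made-up HTML imitating a dog page: update paragraphs mixed with other text
html = """
<div class="blog-item-content e-content">
  <p>QUICK FACTS: Housebroken!</p>
  <p>First update paragraph.</p>
  <p>Second update paragraph.</p>
  <p>Share this page</p>
</div>
"""

# Hypothetical phrases that mark a paragraph as unrelated to the updates
UNRELATED = ("Share this page",)

soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find("div", class_="blog-item-content").find_all("p")

updates = []
for p in paragraphs[1:]:  # skip the 0th container, which holds the quick facts
    text = p.get_text(strip=True)
    # Keep non-empty paragraphs that do not start with an unrelated phrase
    if text and not text.startswith(UNRELATED):
        updates.append(text)

# Concatenate the remaining paragraphs into one description string
description = " ".join(updates)
print(description)
```

Joining with a space keeps the result in a single table cell; a newline separator would preserve the paragraph breaks instead.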