Data Scraping

Objectives

  • Identify situations where data scraping would be beneficial

  • Understand the methods and legality of data scraping

  • Use modules such as Cheerio to scrape data from the web

What is web scraping?

Scraping (also called screen scraping, web data extraction, web harvesting, etc.) refers to the process of requesting an HTML page and picking out relevant data from the document string. In other words, you can scrape content off of web pages by parsing the HTML.

Why scrape?

  • no API available

  • API is unreliable/unmaintained/etc.

  • no fee or call limit (unless a rate-limit is set up)

  • more anonymous than getting data through dev resources

Why not scrape?

  • need to log into site in order to access desired data

  • organization of data makes it hard/laborious to access

  • web page structure frequently changes (program that uses scraped data would need constant updates)

  • copyright and other legal issues

NOTE: The legality of data scraping and using a site's data may depend on a site's terms of use. Scraping a site and using the data for profit may violate a site's terms of use, so be careful before scraping a site. This is not legal advice, and we are not lawyers, but we recommend that you contact a lawyer if you want to scrape data for a for-profit application.

Getting Started: Scraping Seattle Neighborhoods

Let's try creating a program that will scrape neighborhood data from this site: https://visitseattle.org/partners/?frm=partners&ptype=visitors-guide&s=&neighborhood=Capitol+Hill

To get started, create a new folder, and initialize npm. We'll also want to install two modules:

  • request - for accessing external resources via HTTP

  • cheerio - essentially, this is server-side jQuery. We will be using this to traverse the data we get back from our request.

Step 1: Get the HTML document

There are multiple ways to get an HTML document, but we'll use the request module in this example. To scrape data from the site, we first need to request the webpage. In a getbusinesses.js file, import the request module, then make a request to the Seattle neighborhoods website.

const request = require('request')
const URL = 'https://visitseattle.org/partners/?frm=partners&ptype=visitors-guide&s=&neighborhood=Capitol+Hill'

request(URL, (error, response, body) => {
    console.log(body);
});

Run the program and take a look at your output. What did the request return?
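The request package has been deprecated since 2020, and the callback above ignores both error and the HTTP status. As one possible alternative, here is a minimal sketch that uses the fetch function available as a global in Node 18 and later. The helper name fetchHtml and its injectable fetchImpl parameter are our own inventions for illustration; they are not part of request or fetch.

```javascript
// Minimal sketch: download a page as a string with Node's built-in fetch (Node 18+).
// `fetchHtml` and `fetchImpl` are hypothetical names. fetchImpl is injectable
// so the helper can be exercised without touching the network.
async function fetchHtml(url, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url);
  // fetch only rejects on network failures, so check the HTTP status ourselves
  if (!response.ok) {
    throw new Error(`Request for ${url} failed with status ${response.status}`);
  }
  // Resolves to the whole HTML document as one string, like `body` above
  return response.text();
}

// Usage (uncomment to hit the real site):
// fetchHtml('https://visitseattle.org/partners/?frm=partners&ptype=visitors-guide&s=&neighborhood=Capitol+Hill')
//   .then(body => console.log(body))
//   .catch(err => console.error(err.message));
```

The string that fetchHtml resolves to can be passed to cheerio.load() exactly like the body argument of the request callback.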

Step 2: Parse the HTML

The request to the Seattle neighborhoods URL gave us the entire HTML document as a string. Now we need to parse it in order to pick out the specific data we're looking for. This is where Cheerio comes in! Import Cheerio into your getbusinesses.js:

const cheerio = require('cheerio')

Inside the callback function of request, we'll pass the HTML we got back into the cheerio.load() function. We store the result, which is a cheerio object, in the dollar sign variable because cheerio is designed to mimic jQuery selectors (though technically, we could store it in any variable we'd like).

request(URL, (error, response, body) => {
  let $ = cheerio.load(body);
  console.log($);
});

Run the program and take a look at the cheerio object. How might we get the HTML back out of it? Does the cheerio object contain a method for this?

Step 3: Identify the content you want to scrape.

  • First you have to identify what content you're looking to scrape and how to access it. Let's aim to scrape the names of all the businesses listed on this page of Capitol Hill businesses. Open the dev tools and inspect the page to see if you can pinpoint the elements that have the relevant information for each result.

Upon some inspection, we can see that the results live inside of a search-results div, which contains a search-results-partner section element. Inside that there is a search-result-container div that has a search-result div for each result.

Let's try to target the name of the first business. Inspect the first search-result div and identify exactly where the business name lives.

The business name can be found inside the search-result-preview div, both as the title attribute of the nested a tag and as the text of the h3 child of that anchor. First, let's grab the first search-result-preview element:

request(URL, (error, response, body) => {
    let $ = cheerio.load(body);
    let result = $('.search-result-preview').html();
    console.log(result)
});

Now let's target the title attribute:

request(URL, (error, response, body) => {
    let $ = cheerio.load(body)
    let result = $('.search-result-preview').find('a').attr('title');
    console.log(result)
});

Great! Now we know how to find the title of one business, but how do we get all of them?

Step 4: Traverse the DOM (scrape!)

Cheerio actually gives us the option of selecting the first or all of the elements that match the selector. Let's take a closer look at $('.search-result-preview') by getting its length:

request(URL, (error, response, body) => {
    let $ = cheerio.load(body)
    let result = $('.search-result-preview')
    console.log(result.length)
})

It looks like that result object actually contains all of the results on the page! Cheerio has iterators for traversing cheerio objects like this. Let's use the each iterator, which works much like JavaScript's Array.forEach(), except that the callback receives the index first and the element second:

request(URL, (error, response, body) => {
    let $ = cheerio.load(body)
    let results = $('.search-result-preview')
    results.each((index, element)=>{
        console.log($(element).find('a').attr('title'))
    })
})

Logging to the console is great, but in practice, we'll likely want to store all of these titles in an array-like cheerio object. We can use the built-in .map() iterator to pull out just the titles and store them in their own cheerio object:

request(URL, (error, response, body) => {
    let $ = cheerio.load(body);
    let results = $('.search-result-preview')
    let resultTitles = results.map((index, element)=>{
        return $(element).find('a').attr('title')
    })
    console.log(resultTitles)
})

This still gives us a lot of gobbledygook we didn't ask for. Chain the .get() function after .map() to get a plain array of exactly what we asked for (see docs -> traversing -> get):

request(URL, (error, response, body) => {
    let $ = cheerio.load(body)
    let results = $('.search-result-preview')
    let resultTitles = results.map((index, element)=>{
        return $(element).find('a').attr('title')
    })
    console.log(resultTitles.get())
})
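Once .get() has produced a plain array, ordinary JavaScript tools apply to it. As one possible next step, the sketch below removes duplicates, sorts the names, and saves them as JSON using Node's built-in fs module. The sample titles and the output file name are made up for illustration; in the real program the array would come from resultTitles.get().

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical stand-in for the array returned by resultTitles.get()
const titles = ['Example Cafe', 'Sample Books', 'Example Cafe'];

// A Set drops duplicate names; sort() puts them in alphabetical order
const uniqueTitles = [...new Set(titles)].sort();

// Write to the OS temp directory so the sketch doesn't clutter the project folder
const outputPath = path.join(os.tmpdir(), 'capitol-hill-businesses.json');
fs.writeFileSync(outputPath, JSON.stringify(uniqueTitles, null, 2));

console.log(`Saved ${uniqueTitles.length} titles to ${outputPath}`);
```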

Exercise:

What if we want more than just the name of the businesses? Let's modify our code to also store the URL for the image associated with the result:

request(URL, (error, response, body) => {
    let $ = cheerio.load(body);
    let results = $('.search-result-preview')
    let filteredResults = results.map((index, element)=>{
        return {
            title: $(element).find('a').attr('title'),
            img: [ INSERT YOUR CODE HERE ]
        }
    })
    console.log(filteredResults.get())
})

HINT

```javascript
$(element).find('.image-container').attr('style')
```

Now you need to modify the string to isolate the URL!

SOLUTION

```javascript
request(URL, (error, response, body) => {
    let $ = cheerio.load(body);
    let results = $('.search-result-preview')
    let filteredResults = results.map((index, element)=>{
        let imgurl = $(element).find('.image-container').attr('style')
        imgurl = imgurl.substring(22, imgurl.length-15)
        return {
            title: $(element).find('a').attr('title'),
            img: imgurl
        }
    })
    console.log(filteredResults.get())
})
```
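Fixed substring offsets stop working as soon as the style string changes length. A regular expression is one possible way to make the extraction less brittle. The sketch below assumes the style attribute looks like background-image: url(...), which is a guess about the page rather than something confirmed by it, and extractBackgroundUrl is a hypothetical helper name.

```javascript
// Pull the URL out of an inline style such as "background-image: url(...)".
// The exact style format on the live page is an assumption.
function extractBackgroundUrl(style) {
  // attr() returns undefined when the attribute is missing
  if (!style) return null;
  // Capture whatever sits inside url(...), ignoring optional quotes and spaces
  const match = style.match(/url\(\s*['"]?(.*?)['"]?\s*\)/);
  return match ? match[1] : null;
}

console.log(extractBackgroundUrl("background-image: url('https://example.com/cafe.jpg'); height: 200px"));
```

Inside the .map() callback, this could replace the two imgurl lines: img: extractBackgroundUrl($(element).find('.image-container').attr('style')).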

Some other resources on scraping

  • Cheerio Documentation (look this over for more info about the next steps after loading the HTML)

  • jQuery selectors (Cheerio uses these to identify elements)

  • Scraping with Node

  • Web Scraping For Fun and Profit

Last updated 4 years ago