Web Scrape Reddit



Web scraping is a task that has to be performed responsibly, so that it does not have a detrimental effect on the sites being scraped. Web crawlers can retrieve data much faster and in greater depth than humans, so bad scraping practices can impact the performance of the site.
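As a rough illustration of what “responsible” means in practice, the sketch below throttles its own requests and identifies itself. It is a minimal Python example using the requests library; the User-Agent string, the fetch_politely helper and the two-second delay are placeholder choices, not requirements.

```python
import time
import requests

# Identify your scraper honestly; the string below is just a placeholder.
HEADERS = {"User-Agent": "my-research-scraper/0.1 (contact: you@example.com)"}

def fetch_politely(urls, delay_seconds=2.0):
    """Fetch a list of URLs with a fixed pause between requests."""
    pages = []
    for url in urls:
        response = requests.get(url, headers=HEADERS, timeout=10)
        response.raise_for_status()
        pages.append(response.text)
        time.sleep(delay_seconds)  # throttle so the site isn't hammered
    return pages
```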

There’s a subreddit for everything.

No matter what your interests are, you will most likely find a subreddit with a thriving community for each of them.

This also means that the information on some subreddits can be quite valuable, whether for marketing analysis, sentiment analysis or simply for archival purposes.


Reddit and Web Scraping

Today, we will walk through the process of using a web scraper to extract all kinds of information from any subreddit. This includes links, comments, images, usernames and more.

To achieve this, we will use ParseHub, a powerful and free web scraper that can deal with any sort of dynamic website.

Want to learn more about web scraping? Check out our guide on web scraping and what it is used for.


For this example, we will scrape the r/deals subreddit. We will assume that we want to pull the posts there into a simple spreadsheet for us to analyze.

Additionally, we will scrape using the old Reddit layout, mainly because it allows for easier scraping due to how links work on the page.

Getting Started

  1. Make sure you download and open ParseHub; this will be the web scraper we will use for our project.
  2. In ParseHub, click on New Project and submit the URL of the subreddit you’d like to scrape. In this case, the r/deals subreddit. Make sure you are using the old.reddit.com version of the site.

Scraping the Subreddit’s Front Page

  1. Once submitted, the URL will render inside ParseHub and you will be able to make your first selection.
  2. Start by clicking on the title of the first post on the page. It will be highlighted in green to indicate that it has been selected.
  3. The rest of the post titles on the page will also be highlighted in yellow. Click on the second post title on the page to select them all. They should all now turn green.
  4. On the left sidebar, rename your selection to posts. We have now told ParseHub to extract both the title and link URL for every post on the page.
  5. Now, use the PLUS (+) sign next to the posts selection and select the Relative Select command.
  6. Using Relative Select, click on the title of the first post on the page and then on the timestamp for the post. An arrow will appear to show the selection. Rename this new selection to date.
  7. You will notice that this new selection is pulling the relative timestamp (“2 hours ago”) and not the actual time and date on which the post was made. To change this, go to the left sidebar, expand your date selection and click on the extract command. Here, use the drop-down menu to change the extract command to “title Attribute”.
  8. Now, repeat step 5 to create new Relative Select commands to extract each post’s username, flair, number of comments and number of votes.
  9. Your final project should now include selections for each post’s title and URL, date, username, flair, number of comments and number of votes.
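ParseHub does all of the above visually, but for readers who prefer code, here is a rough equivalent sketch in Python using requests and BeautifulSoup. The div.thing and a.title selectors are assumptions about old Reddit’s markup and may need adjusting; note how the real date comes from the time element’s datetime attribute rather than its visible “2 hours ago” text, mirroring step 7.

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "reddit-scrape-demo/0.1"}  # placeholder identifier

def scrape_front_page(subreddit="deals"):
    """Collect title, link and date for each post on a subreddit's front page."""
    url = f"https://old.reddit.com/r/{subreddit}/"
    html = requests.get(url, headers=HEADERS, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    # Assumed: each post on old Reddit is a div with class "thing".
    for thing in soup.select("div.thing"):
        title_link = thing.select_one("a.title")
        timestamp = thing.select_one("time")
        posts.append({
            "title": title_link.get_text(strip=True) if title_link else None,
            "url": title_link.get("href") if title_link else None,
            # The visible text is relative; the attribute holds the actual date.
            "date": timestamp.get("datetime") if timestamp else None,
        })
    return posts

print(scrape_front_page()[:3])  # preview the first three posts
```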

Downloading Reddit Images

You might be interested in scraping data from an image-focused subreddit. The method below will be able to extract the URL for each image post.

You can then follow the steps on our guide “How to Scrape and Download images from any Website” to download the images to your hard drive.

Adding Navigation

ParseHub is now set up to scrape the first page of posts of the subreddit we’ve chosen. But we might want to scrape more than just the first page. Next, we will tell ParseHub to navigate to the next couple of pages and scrape more posts.

  1. Click the PLUS (+) sign next to your page selection and choose the Select command.
  2. Using the Select command, click on the “next” link at the bottom of the subreddit page. Rename your selection to next.
  3. Expand the next selection and remove the 2 extract commands created by default.
  4. Now click on the PLUS (+) sign on the next selection and choose the Click command.
  5. A pop-up will appear asking you if this is a “next page” button. Click “Yes” and enter the number of times you’d like ParseHub to click on it. In this case, we will input 2, which amounts to 3 full pages of posts scraped. Finally, click on “Repeat Current Template” to confirm.
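The same pagination can be expressed in code by repeatedly following the “next” link, which is what ParseHub’s Click command does behind the scenes. In the sketch below, the span.next-button a selector is an assumption about old Reddit’s markup; max_pages=3 matches the 2 clicks (3 pages) configured above.

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "reddit-scrape-demo/0.1"}  # placeholder identifier

def iterate_pages(start_url, max_pages=3):
    """Yield the HTML of up to max_pages listing pages by following the "next" link."""
    url = start_url
    for _ in range(max_pages):
        html = requests.get(url, headers=HEADERS, timeout=10).text
        yield html
        soup = BeautifulSoup(html, "html.parser")
        next_link = soup.select_one("span.next-button a")  # assumed selector
        if next_link is None:
            break  # reached the last page
        url = next_link.get("href")
```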

Scraping Reddit Comments

Scraping Reddit comments works in a very similar way. First, we will choose a specific post whose comments we’d like to scrape.

In this case, we will pick a thread with a lot of comments: a thread on r/technology which, at the time of writing, sits at the top of the subreddit with over 1,000 comments.

  1. First, start a new project on ParseHub and enter the URL you will be scraping comments from. (Note: ParseHub will only be able to scrape comments that are actually displayed on the page.)
  2. Once the site is rendered, you can use the Select command to click on the first username in the comment thread. Rename your selection to user.
  3. Click on the PLUS (+) symbol next to the user selection and choose the Relative Select command.
  4. As in step 5 of the first section of this post, use Relative Select to extract each comment’s text, points and date.
  5. Your final project should now include selections for the username, comment text, points and date.
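Worth knowing as an alternative: Reddit serves a JSON view of any thread if you append .json to its URL, which sidesteps the “only comments displayed on the page” limitation for top-level comments. A minimal sketch follows; the response layout follows Reddit’s listing format, and heavy use is still subject to rate limits.

```python
import requests

HEADERS = {"User-Agent": "reddit-scrape-demo/0.1"}  # placeholder identifier

def fetch_comments(thread_url):
    """Fetch the comments of a Reddit thread via its .json endpoint."""
    response = requests.get(thread_url.rstrip("/") + ".json", headers=HEADERS, timeout=10)
    response.raise_for_status()
    # The response is a two-element list: [post listing, comment listing].
    comment_listing = response.json()[1]
    comments = []
    for child in comment_listing["data"]["children"]:
        data = child.get("data", {})
        if "body" not in data:
            continue  # skip "load more comments" stubs
        comments.append({
            "user": data.get("author"),
            "text": data.get("body"),
            "points": data.get("score"),
            "date": data.get("created_utc"),  # Unix timestamp
        })
    return comments
```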

Running and Exporting your Project

Once your project is fully set up, it’s time to run your scrape.

Start by clicking on the Get Data button on the left sidebar and then choose “Run”.

Pro Tip: For longer scrape jobs, we recommend running a Test Run first to verify that the data will be extracted correctly.

You will now be able to download the data you’ve scraped from Reddit as an Excel spreadsheet or a JSON file.
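From there, the export is easy to load for analysis. Below is a minimal pandas sketch; run_results.json is a placeholder filename, and the top-level “posts” key is an assumption about how the export nests records under your selection name, so adjust it to match your actual file.

```python
import json
import pandas as pd

# Load a ParseHub JSON export. The exact shape depends on your selection
# names; here we assume the records sit under a top-level "posts" key.
with open("run_results.json", encoding="utf-8") as f:
    data = json.load(f)

df = pd.DataFrame(data["posts"])
print(df.head())                        # inspect the first few scraped posts
df.to_excel("posts.xlsx", index=False)  # or continue the analysis in pandas
```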

Which subreddit will you scrape first?

Thousands of new images are uploaded to Reddit every day.

Downloading every single image from your favorite subreddit could take hours of copy-pasting links and downloading files one by one.

A web scraper can easily help you scrape and download all images on a subreddit of your choice.


Web Scraping Images

To achieve our goal, we will use ParseHub, a free and powerful web scraper that can work with any website.

We will also use the free Tab Save Chrome browser extension. Make sure to get both tools set up before starting.

If you’re looking to scrape images from a different website, check out our guide on downloading images from any website.


Scraping Images from Reddit


Now, let’s get scraping.

  1. Open ParseHub and click on “New Project”. Enter the URL of the subreddit you will be scraping. The page will now be rendered inside the app. Make sure to use the old.reddit.com URL of the page for easier scraping.

NOTE: If you’re looking to scrape a private subreddit, check our guide on how to get past a login screen when web scraping. In this case, we will scrape images from the r/photographs subreddit.

  2. You can now make the first selection of your scraping job. Start by clicking on the title of the first post on the page. It will be highlighted in green to indicate that it has been selected. The rest of the posts will be highlighted in yellow.
  3. Click on the second post on the list to select them all. They will all now be highlighted in green. On the left sidebar, rename your selection to posts.
  4. ParseHub is now scraping information about each post on the page, including the thread link and title. In this case, we do not want this information; we only want direct links to the images. As a result, we will delete these extractions from our project. Do this by deleting both extract commands under your posts selection.
  5. Now, we will instruct ParseHub to click on each post and grab the URL of the image inside it. Start by clicking on the PLUS (+) sign next to your posts selection and choose the Click command.
  6. A pop-up will appear asking you if this is a “next page” button. Click on “No” and rename your new template to posts_template.
  7. ParseHub will now open the first post on the list and let you select data to extract. In our case, the first post is a stickied post without an image, so we will open a new browser tab with a post that actually has an image in it.
  8. Now we will click on the image on the page in order to scrape its URL. This will create a new selection; rename it to image. Expand it using the icon next to its name and delete the “image” extraction, leaving only the “image_url” extraction.
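For comparison, the same idea in code: on old Reddit, each post’s listing entry carries a data-url attribute pointing at the linked media, so direct image links can often be collected from the listing page without opening each post. The div.thing selector, the data-url attribute and the extension filter below are assumptions to verify against the live markup.

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "reddit-scrape-demo/0.1"}  # placeholder identifier
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")

def collect_image_urls(subreddit="photographs"):
    """Gather direct image links from a subreddit's front page."""
    url = f"https://old.reddit.com/r/{subreddit}/"
    html = requests.get(url, headers=HEADERS, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for thing in soup.select("div.thing"):
        link = thing.get("data-url", "")  # assumed attribute with the post's target
        if link.lower().endswith(IMAGE_EXTENSIONS):
            urls.append(link)
    return urls
```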

Adding Pagination

ParseHub is now extracting the image URLs from each post on the first page of the subreddit. We will now make ParseHub scrape additional pages of posts.

  1. Using the tabs at the top and the side of ParseHub, return to the subreddit page and your main_template.
  2. Click on the PLUS (+) sign next to your page selection and choose the Select command.
  3. Scroll all the way down to the bottom of the page and click on the “next” link. Rename your selection to next.
  4. Expand your next selection and remove both extractions under it.
  5. Use the PLUS (+) sign next to your next selection and add a Click command.
  6. A pop-up will appear asking you if this is a “next page” link. Click on “Yes” and enter the number of times you’d like to repeat this process. In this case, we will scrape 4 more pages.
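Combining the two sketches above (the pagination loop from the first section and the image collection just shown), the whole job fits in a few lines; max_pages=5 matches the 4 extra pages configured above.

```python
from bs4 import BeautifulSoup

# Reuses iterate_pages() and IMAGE_EXTENSIONS from the earlier sketches.
def collect_all_image_urls(subreddit="photographs", max_pages=5):
    """Walk up to max_pages listing pages and gather direct image links."""
    start = f"https://old.reddit.com/r/{subreddit}/"
    urls = []
    for html in iterate_pages(start, max_pages=max_pages):
        soup = BeautifulSoup(html, "html.parser")
        for thing in soup.select("div.thing"):
            link = thing.get("data-url", "")
            if link.lower().endswith(IMAGE_EXTENSIONS):
                urls.append(link)
    return urls
```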

Running your Scrape

It is now time to run your scrape and download the list of image URLs from each post.

Start by clicking on the green Get Data button on the left sidebar.

Here you will be able to test, run, or schedule your web scraping project. In this case, we will run it right away.

Once your scrape is done, you will be able to download it as a CSV or JSON file.

Downloading Images from Reddit

Now it’s time to use your extracted list of URLs to download all the images you’ve selected.

For this, we will use the Tab Save Chrome browser extension. Once you’ve added it to your browser, open it and use the edit button to enter the URLs you want to download (copy-paste them from your ParseHub export).

Once you click on the download button, all images will be downloaded to your device. This might take a few minutes depending on how many images you’re downloading.
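If you’d rather script this step than use a browser extension, the sketch below reads a plain text file of image URLs (one per line, copied from your ParseHub export) and saves each file; the filenames and one-second pause are placeholder choices.

```python
import os
import time
import requests

HEADERS = {"User-Agent": "reddit-scrape-demo/0.1"}  # placeholder identifier

def download_images(url_file="image_urls.txt", out_dir="reddit_images"):
    """Download every image URL listed in url_file into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    with open(url_file, encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip()]
    for i, url in enumerate(urls):
        response = requests.get(url, headers=HEADERS, timeout=30)
        response.raise_for_status()
        # Derive a filename from the URL, falling back to an index.
        name = url.rsplit("/", 1)[-1] or f"image_{i}.jpg"
        with open(os.path.join(out_dir, name), "wb") as out:
            out.write(response.content)
        time.sleep(1)  # be gentle with the image host
```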

Closing Thoughts

You now know how to download images from Reddit directly to your device.

If you want to scrape more data, check out our guide on how to scrape more data from Reddit, including users, upvotes, links, comments and more.