Part 3: Neat Little Snippets of Code That Are Useful for WordPress
The Content Scraper (for sites that have no feed).
::now holding my ears to shut out the yelps::
Okay, okay – calm down. Here’s the thing. I debated long and hard about putting this here. I truly did. People hate content scrapers, and yes – for the most part – the ones who use them for horrid purposes are evil bastards. But being in the Christmas spirit that I always get into at this time of year, I’m going to try to believe that people are good for the most part – and although I know this might get abused here and there…well, at least I can take slight solace in the fact that this one, at the very least, will cache the content on the spammer’s server, so the hit to your bandwidth isn’t too huge.
I still have trepidation about this one – so seriously, if it does become an issue, and I get a lot of people who really don’t want this one here, I will remove it. This one was a toughie to consider sharing, but I’m hoping that people use it for the greater good. I’ve seen a lot of requests for this kind of stuff on the forums – people who would like to “scrape” the content on their own site that they have in one place and stick it on another one they own so they don’t have to write the same thing twice in two different places – and honestly, that’s what this was developed for.
How this came about was that I had a client who is a local non-profit organization for the little city I live in. Their site is meant for newcomers to the area, and is meant to provide new people with information about the town and what goes on. They wanted to have a page where certain official city stuff was posted, but they also wanted that information to appear on their site so it’s all located in one central place. However, they didn’t want to force the people who ran the city’s websites to log into their site and update it as well (and HA! Like they would anyway). They also didn’t want to eat up the city’s bandwidth, nor steal anything. The idea was to give a snippet of information that would link to the rest of the articles on the original site. Oh yes, and the BIG issue I was facing: none of the town websites had an RSS feed. You can do this a LOT easier if you have an RSS feed to work from, but in my case, I did not – this one is how to get and format the content yourself, even when there’s no feed to eat from.
I will warn you now – this code works very well. However, it can be a bear to get around. Someday, I might – MIGHT – make it into a customizable plugin (it is a half-assed plugin now, but you have to mess with it – no admin panel), but I’m still unsure as to whether or not I’ll actually release it like that to the public because of the serious nature of what it could possibly do. But it is useful, so be warned – there’s some wading to do on this one.
The great thing about this one is this: most times, when people scrape content, they don’t give a crap. They just scrape, and every time the original author’s page loads on the scraper’s site, it loads the page on the original author’s site, thus increasing the bandwidth for the original author, while the scraper feels nothing. Bad for the original author – good for the spammer (until he gets caught).
This script has a nicety. What happens is, you create a folder on your server – I prefer to keep this folder outside of the actual public_html area, but that’s me – and…how can I explain this in plain terms? Okay, it’s sort of like on a “timer” of sorts. This one is set for 24 hours – but if you know the content you’re scraping is going to be updated once a month (rather than once a day), you can change it to monthly, weekly – yearly – whatever you want. The “timer” starts when the script is first called – usually when you go to test it to make sure it works. What it does is, it’ll “scrape” the site, and cache the contents in the folder you’ve specified. From then on, for the next 24 hours (or week, month, year – whatever you’ve set the “timer” to) it will scrape that cache folder – NOT the original site in question. When the time runs out, it’ll scrape the site again, cache the content, and it all goes over and over again.
So basically, you’re giving the original site *one* hit per cache period, and not increasing their bandwidth by any huge margin.
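The “timer” idea can be sketched on its own, apart from the actual script. This is only a minimal illustration in plain PHP, not the code from the scraping class: the function name `fetch_with_cache`, the way the cache file is named, and the fallback to a stale copy are all assumptions made for the example.

```php
<?php
// Hypothetical helper, NOT part of the scraper script: illustrates the
// "timer" idea. $fetcher is any callable that takes a URL and returns the
// remote page as a string (or false on failure).
function fetch_with_cache(string $url, string $cacheDir, int $ttl, callable $fetcher)
{
    // One cache file per URL, named after a hash of the URL (an assumption).
    $cacheFile = rtrim($cacheDir, '/') . '/' . md5($url) . '.html';

    // Cache is still fresh: serve it and leave the original site alone.
    if ($ttl > 0 && is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        return file_get_contents($cacheFile);
    }

    // Cache is missing or expired: hit the original site once.
    $body = $fetcher($url);
    if ($body === false) {
        // Fetch failed: fall back to a stale copy if one exists.
        return is_file($cacheFile) ? file_get_contents($cacheFile) : false;
    }

    file_put_contents($cacheFile, $body);
    return $body;
}
```

Something like `fetch_with_cache($url, $cacheDir, 86400, 'file_get_contents')` would then touch the remote page at most once a day, assuming `allow_url_fopen` is enabled on the server. A lifetime of 0 skips the cache check entirely and fetches every time.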
But PLEASE PLEASE PLEASE if you use this, PLEASE use it legitimately. Use it to cross-reference your own sites, or “scrape” only sites that give you permission to do so. And DO NOT pass the content off as your own (unless it truly is – as in the “cross-referencing your own sites” case) – be kind and make it a simple snippet/excerpt and then provide a link to the REAL author’s site. I beg of you all to not abuse this – truly this was a hard decision for me to make, and if I do become aware of abuse issues, I will not hesitate to remove it immediately.
(I know, I can’t say that enough…truly I can’t!)
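One way to stick to the snippet-plus-link approach is to trim whatever was scraped before printing it. Here’s a rough sketch of that idea; the function name `excerpt_with_credit`, the 55-word default, and the link text are all made up for the example and are not part of the script.

```php
<?php
// Hypothetical helper: cut scraped HTML down to a plain-text excerpt and
// append a link back to the original page.
function excerpt_with_credit(string $html, string $sourceUrl, int $maxWords = 55): string
{
    // Drop the markup and collapse whitespace so words can be counted.
    $text  = trim(preg_replace('/\s+/', ' ', strip_tags($html)));
    $words = $text === '' ? [] : explode(' ', $text);

    $excerpt = implode(' ', array_slice($words, 0, $maxWords));
    if (count($words) > $maxWords) {
        $excerpt .= '...';
    }

    // Escape both pieces before putting them back into HTML.
    return '<p>' . htmlspecialchars($excerpt) . '</p>'
         . '<p><a href="' . htmlspecialchars($sourceUrl) . '">Read the rest at the original site</a></p>';
}
```

Note that `strip_tags` simply removes tags, so text from two adjacent elements with no space between them will run together; for a short teaser that is usually tolerable.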
Okay, so this is based on Troy Wolf’s Screen Scraping Class. As stated before, if RSS feeds are available, there are better methods than this. (You probably could get this to work with a feed as well – I haven’t tried that yet – but if you could, that would be much better!) But if you don’t have feed access, then you simply use this script. I won’t go into the script itself, and – as with the last part of this series – I will provide you with a link. I’ll just overview certain key points here.
Really, the only part you need to worry about is at the bottom. I’ve already commented out some of it, but I’m going to elaborate a bit.
- function Site1() { This is the function name that you place in the Page template file, where you want the content to show. For example, after you put in all the variables you need to change, open up your page.php, archive.php, sidebar.php – wherever you want this to show up – and put in "<?php site1(); ?>" and there you go. It’ll appear.
- $h->dir = "FOLDER HERE"; You need to create a folder on your server (as I said, I prefer mine to be cached *outside* the public area of the site – above “public_html” or “www” – because the folder needs to be CHMOD’d to 777 to work). Where it says “FOLDER HERE”, you put the server path to the folder you’ve created for caching the files. In other words, “http://sitename/folderhere” ain’t gonna cut it (especially if the folder is housed outside of the public area).
- $c = "URL HERE"; URL HERE should be the URL of the original author of the content you’re taking. Don’t be an ass – give credit where it’s due. (Plus, if you don’t, some of the code won’t work right anyway – so just do it.)
- $url = "FULL URL HERE"; FULL URL HERE is asking for the full path to the page you’re scraping content from. You must have the right URL to the very page you need – so it knows where to look.
- if (!$h->fetch($url, 86400)) { This is where you set the time limit on how long to cache the page before visiting the site again to scrape fresh stuff. “86400” is what you’re looking to replace – that is the equivalent of 1 day (60 seconds (1 minute) x 60 minutes (1 hour) x 24 hours (1 day) = 86400 seconds). So if you’re looking for, say, 1 week, you’d replace the “86400” with “604800”. 1 month – or rather, 30 days: 2592000.
If you cache the page (meaning, load up your page – it’ll immediately cache it) and you see “Whoops! We had a problem loading this content. Please try refreshing the page.”, then try refreshing the page. Sometimes it works. Sometimes, you have to go into the folder and delete the cached page, then reload. I have one site where I have to take away the caching altogether for it to work (set “86400” to “0”).
- $matches = http::table_into_array($h->body, "ITEM 1", 1, "ITEM 2"); This is the biggie. ITEM 1 and ITEM 2 are the spots where you need to enter the code for what surrounds the stuff you want to take. If you want it *all*, then <body>, </body> will get everything between the body tags. If you want to get really specific, then look for classes or certain tags with IDs that you can use to identify exactly what you want.
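The idea behind ITEM 1 and ITEM 2 (grab whatever sits between two known bits of markup) can be illustrated with plain string functions. This is not how `table_into_array` works internally; `extract_between` is a hypothetical stand-in written just for this example, and the constants simply spell out the cache-lifetime arithmetic from the `fetch` bullet.

```php
<?php
// Cache lifetimes in seconds, matching the arithmetic in the fetch bullet.
const ONE_DAY     = 60 * 60 * 24;   // 86400
const ONE_WEEK    = ONE_DAY * 7;    // 604800
const THIRTY_DAYS = ONE_DAY * 30;   // 2592000

// Hypothetical helper: return the markup between $start and $end
// (case-insensitive match), or null if either marker is missing.
function extract_between(string $html, string $start, string $end): ?string
{
    $from = stripos($html, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);

    // Look for the closing marker only after the opening one.
    $to = stripos($html, $end, $from);
    if ($to === false) {
        return null;
    }
    return substr($html, $from, $to - $from);
}
```

The more specific the opening marker (a tag with a unique ID, say), the less unrelated markup comes along for the ride. The closing marker is matched at its first occurrence after the opening one, so nested tags of the same kind will cut the result short.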
And that’s pretty much it. Oh yes, I *should* note that there’s nothing in this script that will delete cached pages. They don’t get overwritten either. So every now and then, you’ll have to go in and clear out the folder, or it’ll grow to be HUGE.
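If clearing the folder by hand gets old, a small cleanup routine could do it. The sketch below is not part of the script; `prune_cache` is a hypothetical helper, and it assumes the cache folder holds nothing but cached pages, since it deletes any file older than the limit you give it.

```php
<?php
// Hypothetical cleanup helper: delete files in $cacheDir that were last
// modified more than $maxAge seconds ago. Returns the number removed.
function prune_cache(string $cacheDir, int $maxAge): int
{
    $removed = 0;
    $files = glob(rtrim($cacheDir, '/') . '/*');

    // glob() can return false on error, so fall back to an empty list.
    foreach ($files ?: [] as $file) {
        if (is_file($file) && (time() - filemtime($file)) > $maxAge) {
            if (unlink($file)) {
                $removed++;
            }
        }
    }
    return $removed;
}
```

Run from a cron job or called now and then by hand, with a `$maxAge` comfortably longer than the cache lifetime, this would keep the folder from growing without throwing away pages that are still in use.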
So, after reading all that, basically you just download the file, open it in a text editor and make the necessary changes, upload it to your plugins folder, activate it and insert the function name where you want it to appear (as described in the first bullet above). You can have more than one site appear on a page, as well – just copy the entire function, rename it and edit it for the second site.