wget help, or something like that...

iroc409 · 11-07-2003, 01:04 AM

i've used wget here and there, and it's overall an awesome program. but, i'm looking to do something that wget may not be able to do (or i may have not stumbled across it yet...).

it appears to me wget only spiders a website, and finds information that way. what i'm looking to do is pull everything in a website, not necessarily those things that can be spidered.

for example, if a site has an index.html, but has unlinked html and content in the site (we'll say ass.html and ass.jpg), how can i grab everything? what if it's in a directory off root, but not linked?

this would be rather useful, i hope you get what i'm talking about. will wget grab it with a switch i'm not using, or is there something out there that can do it? so far i haven't found anything that will grab stuff unless it's linked in an html file.

KnightElite · 11-07-2003, 01:44 AM

It can't be done, I don't think. Unless the program has some way of finding all the files on a site, it can't grab them for you. So if they're unlinked, you are screwed.
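To make that a bit more concrete, here's a rough sketch of what a link-following spider has to work with. This is not wget's actual code, just an illustration using Python's standard `html.parser`; the `LinkCollector` class, the `discover` function and the sample page are all made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect every href/src value found in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def discover(html_text):
    """Return the URLs a link-following spider could learn from one page."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


# Made-up index page: it mentions two files. Anything else sitting on the
# server is never named here, so the spider has nothing to request.
INDEX_PAGE = '<html><body><a href="about.html">about</a><img src="logo.png"></body></html>'

if __name__ == "__main__":
    print(discover(INDEX_PAGE))
```

The only names the spider ever sees are the ones in the markup it has already downloaded, which is why an unlinked file stays invisible unless the server itself hands out a listing.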

#Rotor · 11-07-2003, 10:25 AM

why don't you just grep for anything with *.htm *.html etc...

and build your own index.html for that particular site, making sure all is linked, and then feed this new file to the spider...
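Something along these lines might do it if you have shell access to the box. It's only a sketch in Python rather than an actual grep one-liner, and it assumes you can read the site's document root directly; `build_index` and the extension list are names I picked for illustration.

```python
import os
from html import escape
from urllib.parse import quote


def build_index(doc_root, extensions=(".htm", ".html")):
    """Return HTML for an index page linking every matching file under doc_root."""
    links = []
    for dirpath, _dirnames, filenames in os.walk(doc_root):
        for name in filenames:
            if name.lower().endswith(extensions):
                # Path relative to the document root, with URL-style slashes.
                rel = os.path.relpath(os.path.join(dirpath, name), doc_root)
                links.append(quote(rel.replace(os.sep, "/")))
    links.sort()
    items = "\n".join(
        '<li><a href="%s">%s</a></li>' % (url, escape(url)) for url in links
    )
    return "<html><body><ul>\n%s\n</ul></body></html>\n" % items


if __name__ == "__main__":
    # Assumption: run from inside the document root; redirect the output to
    # a file of your choosing and point the spider at that file's URL.
    print(build_index("."))
```

Pass a longer tuple of extensions (say `(".htm", ".html", ".jpg")`) if you want images and other content linked too, since the spider will only fetch what the generated page names.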

airspirit · 11-07-2003, 12:53 PM

Try Mozilla Firebird and the spider plugins. Those work pretty damn good.

iroc409 · 11-07-2003, 12:55 PM

thanks for the replies, i kinda thought there wasn't a way to do it

the grep idea is a good one, i can use that on some of my stuff (except where i don't have shell access, of course). i'll have to give that a whirl, i've got one site that has a ton of crap on it, and i don't even know what half of it is... lol.

as long as stuff is linked in the file, it doesn't matter where it is, right? i can bury the index file somewhere on the server off the site's root, and it still should find everything, or only things that are in directories below it?
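thinking about it some more: relative links get resolved against wherever the page itself lives, so a buried index can point anywhere on the site. whether the spider is willing to climb above its starting directory is a separate setting; as far as i know wget's recursive mode will follow same-host links upward unless you pass `--no-parent`. here's a small python sketch of both ideas. the urls and the `is_below` check are made up for illustration, this is not how wget is implemented.

```python
from urllib.parse import urljoin, urlparse


def resolve_links(page_url, hrefs):
    """Resolve each href against the page it appeared on, as a browser would."""
    return [urljoin(page_url, href) for href in hrefs]


def is_below(start_url, candidate_url):
    """True if candidate is on the same host and at or under start_url's directory."""
    start, cand = urlparse(start_url), urlparse(candidate_url)
    # Directory part of the start page, e.g. "/hidden/deep/".
    start_dir = start.path.rsplit("/", 1)[0] + "/"
    return start.netloc == cand.netloc and cand.path.startswith(start_dir)


# Hypothetical buried index page and the links written inside it.
PAGE = "http://example.com/hidden/deep/all-files.html"
HREFS = ["notes.html", "../up-one.html", "/top.html", "sub/pic.jpg"]

if __name__ == "__main__":
    for url in resolve_links(PAGE, HREFS):
        # The second column mimics a "never go above the start directory" rule.
        print(url, is_below(PAGE, url))
```

so the links to `../up-one.html` and `/top.html` resolve fine, they just sit outside the start directory, and it's up to the spider's settings whether they get fetched.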

the only way i can think of is if you could somehow get apache on the target server to return a directory listing, like it does for a directory without an index.html, even when the directory does have an index file. i'm guessing there isn't really a way to do that. not really worried about stuff with htaccess protection on it, but that would be a plus in some cases.

iroc409 · 11-07-2003, 01:50 PM

Quote:

Originally posted by airspirit
Try Mozilla Firebird and the spider plugins. Those work pretty damn good.

firebird kicks ass. i installed a few plugins last night, including the 'popup counter'.

that's actually pretty cool, not sure what i'd do without the blocker. for work i have to go through a lot of sites with tons of popups which totally blows. it's kinda interesting to see which ones have how many

tried out the spiderzilla thing, works pretty well, kinda handy since it's built into the browser. works just like wget

another thing that i found totally spectacularly cool is the "view in IE". ie has pretty much disappeared, i used to be able to right click on the file and "open with" but it doesn't do that anymore. now i have a shortcut on the desktop for ie, and have to go through all that crap.

but with this, any page that is open, or any link in any page you just right click and view in ie. totally priceless object for a webdesigner, whoever thought up that extension is a genius.

heh... i even installed a skin. usually i totally despise skins, but this one is verra nice. with the tiny buttons, it makes the window frame very very small, and more viewing area. less BS. the skin is "breeze"... snag it off the site

airspirit · 11-07-2003, 07:17 PM

And another convert is created ... *sigh*

It is sad that Firebird is superior to IE in every way but nobody ever gives it a try. Everyone I've got to use it once never uses IE anymore.

iroc409 · 11-07-2003, 08:08 PM

yeah, i used to use netscape, although i really don't like all the netscape addons. i just want a browser that works well, nothing more.

using firebird from a design perspective is great, because 99% of the time if it works in firebird, it works everywhere else (and firebird is more strict on code).

however, i find IE has a terrible, terrible time rendering web documents. and if a page does get broken in IE, i find it much more difficult to fix the page for IE than to fix an IE page for mozilla. ugh.

my other bitch about IE is freaking png's. IE has built-in support for png, but it's very difficult to get it to run (lots of code, yuck!). everything else on the planet supports png's, and they're so much nicer. i wish M$ would pull their head out of their asses.

