Author Topic: Static Site Snapshot/mirror  (Read 21569 times)

speculatrix

  • Administrator
  • Hero Member
  • *****
  • Posts: 3709
Static Site Snapshot/mirror
« on: November 10, 2006, 05:46:03 am »
Just as you can get a snapshot of Wikipedia and store it on your PC (or even PDA), it occurred to me that a snapshot of the OESF forum would be a very useful thing if combined with free-text search.

There's a huge amount of wisdom on it (a lot of which should be in the wiki, but isn't), and so an archive would be a real asset. The low-graphics version of course would be best, but provided the archive was created without the attachments, it'd be OK as there'd be only one copy of the site.

I did try a speed-throttled wget once, but it wasn't too satisfactory.  

Any chance of doing this on the server itself and making a monthly snapshot downloadable as a .zip or .tar.bz2? Some of us could burn CDs or DVDs to send to people without broadband.

thanks
Paul
« Last Edit: December 09, 2006, 05:13:14 pm by speculatrix »
Gemini 4G/Wi-Fi owner, formerly zaurus C3100 and 860 owner; also owner of an HTC Doubleshot, a Zaurus-like phone.

daniel3000

  • Hero Member
  • *****
  • Posts: 1003
« Reply #1 on: November 10, 2006, 06:42:53 am »
This is a terrific idea, and I also wish it were possible.
I think a forum like this is based on a database. It should not be too hard to dump the database content into pure text files (or maybe HTML).

If it isn't possible on the server side, there is still the "lo-fi" version of the forums, which consists of simpler HTML pages that are probably easier and faster to collect with a tool like wget or Plucker. So this may be another option.

Another nice feature would be an NNTP gateway or a POP/IMAP server, which would make it possible to download the forum contents to client programs such as email or Usenet clients. If it were also possible to add content (replies, new topics) from these clients, that would be even better.

But this would probably be a major effort, and since we can be thankful that some nice people run this forum without profit, I totally understand if this never happens :-)

daniel
SL-C3200 with weeXpc, based on pdaXrom 1.1.0beta3
HP 200LX with MS-DOS 5.0

speculatrix

  • Administrator
  • Hero Member
  • *****
  • Posts: 3709
« Reply #2 on: November 14, 2006, 04:33:26 pm »
Quote
This is a terrific idea, and I also wish it were possible.
I think a forum like this is based on a database. It should not be too hard to dump the database content into pure text files (or maybe HTML).

If it isn't possible on the server side, there is still the "lo-fi" version of the forums, which consists of simpler HTML pages that are probably easier and faster to collect with a tool like wget or Plucker. So this may be another option.

I've now found a program which will do such a job, and am trying it out... the good thing is that it lets me download only URLs with "lofiversion" in them and throttle the bandwidth. The problem with wget is that it doesn't fix links, whereas this program does.

http://www.httrack.com/

When it's finished, I will see what the results are like and offer a download on my website.

Quote
Another nice feature would be an NNTP gateway or a POP/IMAP server, which would make it possible to download the forum contents to client programs such as email or Usenet clients. If it were also possible to add content (replies, new topics) from these clients, that would be even better.

But this would probably be a major effort, and since we can be thankful that some nice people run this forum without profit, I totally understand if this never happens :-)

daniel

I did ask about an RSS feed, but I think this forum doesn't have the facility.

Paul
Gemini 4G/Wi-Fi owner, formerly zaurus C3100 and 860 owner; also owner of an HTC Doubleshot, a Zaurus-like phone.

speculatrix

  • Administrator
  • Hero Member
  • *****
  • Posts: 3709
« Reply #3 on: November 14, 2006, 06:30:46 pm »
well, the first run seems to be complete... created 8600-ish HTML files...  tweaking the results now, such as putting in the CSS files and such. seems to work quite well.
Gemini 4G/Wi-Fi owner, formerly zaurus C3100 and 860 owner; also owner of an HTC Doubleshot, a Zaurus-like phone.

desertrat

  • Hero Member
  • *****
  • Posts: 743
« Reply #4 on: November 14, 2006, 08:22:16 pm »
Quote
The problem with wget is that it doesn't fix links, whereas this program does.
Don't the --convert-links and --html-extension options cover it?
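
For anyone who wants to try that route, here is a minimal sketch of a throttled, link-fixing wget run, built as an argument list in Python. The start URL, the 20k rate cap and the 2-second wait are assumptions for illustration, not settings anyone in this thread reported using. Newer wget releases spell --html-extension as --adjust-extension.

```python
import shlex

def build_wget_command(start_url, rate="20k", wait_seconds=2):
    """Return an argv list for a polite, recursive wget mirror run."""
    return [
        "wget",
        "--recursive",             # follow links from the start page
        "--level=inf",             # no depth limit
        "--no-parent",             # never climb above the start directory
        "--convert-links",         # rewrite links so they work offline
        "--html-extension",        # save pages with an .html suffix
        f"--limit-rate={rate}",    # cap download bandwidth
        f"--wait={wait_seconds}",  # pause between requests
        start_url,
    ]

# Hypothetical lo-fi entry point; check the real URL before running anything.
cmd = build_wget_command("https://www.oesf.org/forums/lofiversion/index.php")
print(shlex.join(cmd))
```

The list can be handed to subprocess.run(cmd, check=True) once the URL and limits have been checked against what the forum admins are happy with.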
SL-C3100 / Ambicon WL1100C-CF / pdaXrom 1.1.0beta3 / IceWM

Jon_J

  • Hero Member
  • *****
  • Posts: 1853
« Reply #5 on: November 14, 2006, 11:39:20 pm »
speculatrix,
I tried using the program you listed to make an offline copy of this wiki site:
http://www.uesp.net/wiki/Oblivion:Oblivion

It's a wiki about a computer RPG game.
"The Elder Scrolls IV - Oblivion"

It downloaded about 9,000 pages, which came to 120MB.
It took 1 hour on my ADSL connection.
This is just for a computer game...!?
Is there any way to know beforehand how much data is going to be downloaded?
I was thinking about placing this wiki on my Zaurus, but it ended up being way too big for just one game.
I decided to cancel this because it was also downloading the wikis for 8 previous Elder Scrolls games, which are linked in the left-hand frame.

I had been just saving individual web pages that interested me from this wiki previously.
C3100 Multiboot-->Angstrom 2007.12-r18 | Cacko 1.23 | ArchLinuxARM
C3200 pdaxii13v2-5.5-alpha4 Akita on NAND

Ambicom WL1100C-CF Wifi - Ambicom CF modem - Ambicom CF GPS - Belkin-F5D5050 USB LAN
Socket CF Bluetooth rev K - Iogear 4 port USB micro hub - pocket CF card reader
Targus mini USB optical mouse - 2 Targus SD card readers

speculatrix

  • Administrator
  • Hero Member
  • *****
  • Posts: 3709
« Reply #6 on: November 15, 2006, 05:21:10 am »
Quote
speculatrix,
I tried using the program you listed to make an offline copy of this wiki site:
http://www.uesp.net/wiki/Oblivion:Oblivion
It downloaded about 9,000 pages, which came to 120MB.
It took 1 hour on my ADSL connection.

Make sure you only include HTML and CSS files, don't allow it to leave the domain, ensure you specify the correct URL to match just the wiki, and so on. There are countless options!
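
To make those rules concrete, here is a rough sketch of the same filtering idea written as a Python function. It is not HTTrack's filter syntax, just an illustration of the logic; the host, the path prefix and the decision to allow any stylesheet on the same host are assumptions based on the wiki URL quoted above.

```python
from urllib.parse import urlparse

ALLOWED_HOST = "www.uesp.net"       # assumption: stay on the wiki's own host
ALLOWED_PREFIX = "/wiki/Oblivion:"  # assumption: pages for this one game only

def should_fetch(url):
    """Decide whether a crawler should download this URL."""
    parts = urlparse(url)
    if parts.hostname != ALLOWED_HOST:
        return False                # never leave the domain
    if parts.path.startswith(ALLOWED_PREFIX):
        return True                 # a page of the wiki section we want
    # Otherwise only take stylesheets, so the saved pages still render.
    return parts.path.lower().endswith(".css")

print(should_fetch("http://www.uesp.net/wiki/Oblivion:Oblivion"))
print(should_fetch("http://www.uesp.net/wiki/Morrowind:Morrowind"))
```

With a prefix rule like this, the wikis for the other games linked from the sidebar would be skipped instead of being pulled in.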

Quote
Don't the --convert-links and --html-extension options cover it?

Hmm, possibly, but I was struggling to get it to convert the query string, fix everything, and stay on the lofiversion URL. This Windows program just made it easier to do.

Paul
Gemini 4G/Wi-Fi owner, formerly zaurus C3100 and 860 owner; also owner of an HTC Doubleshot, a Zaurus-like phone.

ShiroiKuma

  • Hero Member
  • *****
  • Posts: 900
« Reply #7 on: November 15, 2006, 05:33:47 am »
I don't think it's a very smart idea to download forums with a tool like HTTrack. This forum is based on PHP scripts which interface to a database. Downloading with HTTrack makes the forum engine generate thousands of pages, resulting in high server load and high traffic.

It should not be too hard to make the database available for download regularly in a compressed format. Then one would create scripts to serve the database locally on the Z...
The whole Czech nation is a gang of malingerers.
Chief Army Doctor Bautze

speculatrix

  • Administrator
  • Hero Member
  • *****
  • Posts: 3709
« Reply #8 on: November 15, 2006, 07:27:41 am »
Quote
I don't think it's a very smart idea to download forums with a tool like HTTrack. This forum is based on PHP scripts which interface to a database. Downloading with HTTrack makes the forum engine generate thousands of pages, resulting in high server load and high traffic.

httrack has nice features to limit the load it puts on servers, and in fact the latest version has a restriction of 100kbps bandwidth use.

I made sure it restricted the number of parallel page fetches too, so it took quite a while to download.
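
For anyone scripting their own fetcher, the throttling idea can be sketched in a few lines of Python. This is not how httrack does it internally; it is only an illustration of capping average throughput for a single sequential fetcher. The class name and the injectable clock/sleep parameters are made up for this example.

```python
import time

class Throttle:
    """Sleep as needed so average throughput stays under max_bytes_per_sec."""

    def __init__(self, max_bytes_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.rate = max_bytes_per_sec
        self.clock = clock    # injectable so the logic can be tested
        self.sleep = sleep
        self.start = clock()
        self.total = 0

    def record(self, nbytes):
        """Call after each download; returns the delay that was applied."""
        self.total += nbytes
        # Earliest moment at which this many bytes fits under the cap.
        earliest = self.start + self.total / self.rate
        delay = max(earliest - self.clock(), 0.0)
        if delay > 0:
            self.sleep(delay)
        return delay
```

Fetching pages one at a time and calling record(len(body)) after each one keeps a single connection open and spreads the load out, at the cost of a long run.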

I've got building work on at home and they've cut the power again, so I can't access my fileserver and do the upload at the moment.
Gemini 4G/Wi-Fi owner, formerly zaurus C3100 and 860 owner; also owner of an HTC Doubleshot, a Zaurus-like phone.

speculatrix

  • Administrator
  • Hero Member
  • *****
  • Posts: 3709
« Reply #9 on: November 15, 2006, 07:29:00 am »
I've just had a thought... what's the limit on the number of files in the same directory on a FAT32 memory card? Would the forum archive actually be storable on flash?
Gemini 4G/Wi-Fi owner, formerly zaurus C3100 and 860 owner; also owner of an HTC Doubleshot, a Zaurus-like phone.

ShiroiKuma

  • Hero Member
  • *****
  • Posts: 900
« Reply #10 on: November 15, 2006, 07:38:59 am »
Quote
I've just had a thought... what's the limit on the number of files in the same directory on a FAT32 memory card? Would the forum archive actually be storable on flash?

I don't think this should be a problem really, since you probably should put it into a squashfs archive anyway and mount it on the Z. That way you'll save on space big time, because this is mostly text.
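
On the FAT32 question itself, a back-of-the-envelope estimate is possible. A FAT32 directory holds at most 65,536 entries of 32 bytes each, and a file whose name doesn't fit the 8.3 pattern needs one extra entry per 13 characters of its long name. The sketch below works from those figures; the sample file name is made up, and real drivers may use entries slightly differently, so treat the result as a rough ceiling.

```python
import math

MAX_DIR_ENTRIES = 65_536   # FAT32 directory: 2 MiB of 32-byte entries
CHARS_PER_LFN_ENTRY = 13   # characters held by one long-filename entry

def entries_for_name(name):
    """Directory entries used by one file with a long (non-8.3) name."""
    return 1 + math.ceil(len(name) / CHARS_PER_LFN_ENTRY)

def max_files_per_directory(name, reserved=2):
    """Rough ceiling on files per directory; reserved covers '.' and '..'."""
    return (MAX_DIR_ENTRIES - reserved) // entries_for_name(name)

# Hypothetical page name; the 4-letter extension forces a long-name entry.
print(max_files_per_directory("t12345.html"))  # 32767
```

Names of 27 to 39 characters would bring the ceiling down to 16,383 per directory, so splitting the pages across subdirectories (or using squashfs as suggested) avoids the issue.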
The whole Czech nation is a gang of malingerers.
Chief Army Doctor Bautze

matthis

  • Full Member
  • ***
  • Posts: 217
    • http://badaboum.bidibom.free.fr/mat/
« Reply #11 on: November 15, 2006, 10:37:17 am »
Actually, this kind of forum stores its data in an SQL database. The administrator has a button to dump/save all the messages, so asking for a dump could be a good start.

zmiq2

  • Sr. Member
  • ****
  • Posts: 383
« Reply #12 on: November 15, 2006, 10:57:48 am »
Or maybe a link to a file with the daily activity, with all posts modified on that date, so you don't need to download the whole db!
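
A minimal sketch of what such a daily file could look like, assuming each post is available as a record with a last-modified timestamp. The field names (id, author, modified, body) and the sample records are made up for illustration; the real forum schema isn't known here.

```python
from datetime import date, datetime

def posts_modified_on(posts, day):
    """Return the posts whose last-modified timestamp falls on `day`."""
    return [p for p in posts if p["modified"].date() == day]

def format_digest(posts, day):
    """Render one day's posts as plain text, oldest first."""
    chosen = sorted(posts_modified_on(posts, day), key=lambda p: p["modified"])
    lines = [f"Posts modified on {day.isoformat()}: {len(chosen)}"]
    for p in chosen:
        lines.append(f"#{p['id']} {p['author']} at {p['modified']:%H:%M}")
        lines.append(p["body"])
    return "\n".join(lines)

# Made-up sample records standing in for rows from the forum database.
sample = [
    {"id": 1, "author": "alice", "modified": datetime(2006, 11, 14, 16, 33), "body": "First post."},
    {"id": 2, "author": "bob", "modified": datetime(2006, 11, 15, 5, 21), "body": "Second post."},
    {"id": 3, "author": "alice", "modified": datetime(2006, 11, 15, 7, 27), "body": "Third post."},
]
digest = format_digest(sample, date(2006, 11, 15))
print(digest)
```

Anyone who already has the full snapshot would then only need to fetch the small daily files to stay current.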
sl-c750, archos av580, socket cf [bt, wifi, modem], noname cf lan, audiovox rtm800 gsm-gprs cf, rom: sharp -> oz3.5.3 -> cacko -> oz3.5.4.1

speculatrix

  • Administrator
  • Hero Member
  • *****
  • Posts: 3709
« Reply #13 on: November 21, 2006, 05:00:39 pm »
hmm, well, it's a pretty damn big file when zipped up - over 60MB! It took a long time to spider the forum with the rate throttled right down - 21000 files or so!

I'm uploading the file now as a .zip to my website at http://www.zaurus.org.uk/downloads.html , because that way people who want a cramfs/squashfs mountable archive can create one, and people with Windows can unpack it and use a local search tool like Google Desktop to index it.
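
For those without a desktop search tool, a very small free-text search over the unpacked pages can be sketched in Python using only the standard library. This is an illustrative inverted index, not a replacement for a real search engine; the function names are made up, and the index is kept in memory.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text between tags (script/style text is not filtered out)."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

def build_index(pages):
    """pages maps a file name to its HTML; returns word -> set of file names."""
    index = defaultdict(set)
    for name, html in pages.items():
        for word in re.findall(r"[a-z0-9]+", html_to_text(html).lower()):
            index[word].add(name)
    return index

def search(index, query):
    """Return the files that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))
```

In practice the pages dict would be filled by reading every .html file from the unpacked directory, for example with pathlib.Path.rglob("*.html").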

Paul
Gemini 4G/Wi-Fi owner, formerly zaurus C3100 and 860 owner; also owner of an HTC Doubleshot, a Zaurus-like phone.

speculatrix

  • Administrator
  • Hero Member
  • *****
  • Posts: 3709
« Reply #14 on: November 22, 2006, 10:45:36 am »
The snapshot is now up as a mirror... see mainstream discussion at
https://www.oesf.org/forums/index.php?showtopic=22041

for more details
Gemini 4G/Wi-Fi owner, formerly zaurus C3100 and 860 owner; also owner of an HTC Doubleshot, a Zaurus-like phone.