OESF Portables Forum
Everything Else => General Support and Discussion => Zaurus General Forums => Archived Forums => Software => Topic started by: Fromwithin on November 01, 2005, 04:08:40 pm
-
I thought I'd make this available, as someone mentioned the horror of ebook formatting in the opie-reader thread. By the way, one of the best features of opie-reader: reading gzipped text. I have everything in that format.
textbath zaurus executable (http://ftp://fromwithin.com/zaurus/textbath)
It's my Textbath program for re-formatting text files. It's getting pretty fat now. The only thing missing is that it doesn't convert HTML/XML entities fully yet (it will only replace about three of them at the moment).
At its most basic, it will remove the line breaks inside paragraphs and replace multi-line breaks with a single one, indenting new paragraphs if defined. Thus the text comes out much cleaner and easier to read in word-wrapping applications. It can decide when to add a line break or paragraph based on the length of the current line, start-of-line string matching, and capitalization. It can also re-join hyphenated lines, remove tabs and HTML/XML tags (and convert some HTML/XML entities), add <p></p> and <br> tags, convert all non-ASCII characters into pure ASCII (e.g. the © symbol into (C), that annoying binary apostrophe into an ASCII one, and all others), and display file stats.
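As a rough illustration of the basic reflow described above (this is not Textbath's actual code, which isn't shown in this thread), a minimal Python sketch might look like the following. The function name `reflow` and the `indent` parameter are made up for the example, and the hyphen handling is deliberately naive.

```python
import re

def reflow(text, indent=""):
    """Join hard-wrapped lines into paragraphs and collapse blank-line runs.

    Hypothetical helper for illustration only; not Textbath's real logic.
    """
    # Normalise line endings so the splitting below only ever sees "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A run of one or more blank (or whitespace-only) lines is a paragraph break.
    paragraphs = re.split(r"\n(?:[ \t]*\n)+", text.strip())
    out = []
    for para in paragraphs:
        lines = [ln.strip() for ln in para.split("\n") if ln.strip()]
        joined = ""
        for ln in lines:
            if joined.endswith("-"):
                # Re-join a word hyphenated across a line break.
                # Naive: this also merges genuine compounds such as "well-" + "known".
                joined = joined[:-1] + ln
            elif joined:
                joined += " " + ln
            else:
                joined = ln
        out.append(indent + joined)
    # One line per paragraph; the reader application does the word wrapping.
    return "\n".join(out) + "\n"
```

Calling `reflow(text, indent="  ")` would give one long line per paragraph, each starting with two spaces. The smarter decisions mentioned above (line length, start-of-line matching, capitalization) are left out of this sketch.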
You can also use it to convert files between Unix/DOS/OldMac text formats without any editing. I don't know what I would have done without it.
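For reference, the Unix/DOS/OldMac difference is just the line terminator: LF, CR LF, and CR respectively. A small Python sketch of such a conversion, assuming a hypothetical `convert_newlines` helper (again, not how Textbath itself does it), could be:

```python
def convert_newlines(data: bytes, target: str = "unix") -> bytes:
    """Convert line endings in raw bytes to the chosen convention.

    Hypothetical helper for illustration; target is "unix", "dos" or "oldmac".
    """
    endings = {"unix": b"\n", "dos": b"\r\n", "oldmac": b"\r"}
    eol = endings[target]
    # Collapse every convention to LF first. CR LF must be handled before
    # bare CR, otherwise each DOS line ending would turn into two newlines.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", eol)
```

Working on bytes rather than decoded text keeps the sketch independent of the file's character encoding.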
It needs to be run from the terminal. Let me know if you find it useful, or have any problems with it.
-
I've just updated it a bit. It now converts almost all numeric HTML entities, and the most common text-based ones (&quot; &lt; &gt; and all that). Use the link as above.
-
Updated again. It will now convert all HTML/XML entities that are in the CP-1252 or ISO 8859-1 range, and all non-ASCII chars will now convert to their nearest match or relevant string. Previously, it would use a space when it didn't understand a character.
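To give an idea of what this kind of conversion involves, here is a minimal Python sketch. It is an assumption-laden stand-in, not Textbath's implementation: it leans on the standard library's `html.unescape` (which also maps numeric references in the 128-159 range through CP-1252, as browsers do), and the `REPLACEMENTS` table and `to_ascii` name are invented for the example.

```python
import html
import unicodedata

# Hypothetical "relevant string" table; a real tool would cover far more.
REPLACEMENTS = {
    "\u00a9": "(C)",   # copyright sign
    "\u00ae": "(R)",   # registered sign
    "\u2018": "'",     # curly single quotes / apostrophe
    "\u2019": "'",
    "\u201c": '"',     # curly double quotes
    "\u201d": '"',
    "\u2013": "-",     # en dash
    "\u2014": "--",    # em dash
    "\u2026": "...",   # ellipsis
}

def to_ascii(text):
    """Decode HTML entities, then map every non-ASCII char to an ASCII stand-in."""
    text = html.unescape(text)
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in REPLACEMENTS:
            out.append(REPLACEMENTS[ch])
        else:
            # Nearest match: decompose (e.g. e-acute into "e" plus an accent)
            # and keep only the ASCII part.
            base = unicodedata.normalize("NFKD", ch).encode("ascii", "ignore").decode("ascii")
            # "?" marks anything with no ASCII approximation in this sketch.
            out.append(base or "?")
    return "".join(out)
```

For example, `to_ascii("&copy; 2005 caf&eacute; &#146;quoted&#146;")` gives `(C) 2005 cafe 'quoted'`.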
Also, when converting from HTML, it will look for certain tags as hints to format the text, thus retaining formatting.
There's not much else I can think of doing to it now. The only thing left on the list is to only allow line breaks on certain punctuation. After that, I don't know, so if anyone has any ideas let me know.
Give it a whirl (http://ftp://fromwithin.com/zaurus/textbath).
-
I will try it tonight. Hopefully, it is simple enough for non-programmers.