|
Note: Because this script parses HTML, and thus contains HTML, it had to be
altered in order to include it in this HTML file. Thus, if you want to make a
copy of the script, you should highlight everything below and do a
copy-and-paste. If you try to copy it from "View Source", it will not be a
valid script since (for example) the "less-than" characters had to be
replaced with &lt; symbols.
;===============================================================================
;
; Script to filter an HTML file and extract raw text.
;
; The HTML file should be properly formatted, with a <head> or at least a
; <body> tag. The script can usually cope with files that lack these tags,
; but it is best if they are included.
;
; The output is not word-wrapped -- this script makes no attempt to render
; the appearance of the original web page in plain text. It simply extracts
; the text and tries to get rid of any extraneous material, such as scripts.
; The idea here is to eliminate just about all formatting. If you wanted to
; retain the format, you could load the HTML page into a word processing
; program such as Word.
;
; Most special HTML symbols, such as &nbsp; (no-break space) are translated,
; though in that particular case it is simply translated to an ordinary
; space.
;
; You can use wildcards in the Input File box to process an entire folder.
; Only files with the .htm or .html extensions are processed.
;
; This script is designed for use with the Parse-O-Matic Power Tool,
; which is available from www.parse-o-matic.com.
;
;===============================================================================
; Configuration
;===============================================================================

Config

  ;-----------------------------------------------------------------------------
  ; Interface
  ;-----------------------------------------------------------------------------

  $CfgEnableOptionX = 'N'
  $CfgEnableOptionY = 'N'
  $CfgEnableOptionZ = 'N'

  ;-----------------------------------------------------------------------------
  ; Files
  ;-----------------------------------------------------------------------------

  $CfgDefaultIFN = 'Index.html'
  $CfgDefaultOFN = 'Output.txt'

  ;-----------------------------------------------------------------------------
  ; Documentation
  ;-----------------------------------------------------------------------------

  $CfgCopyright = 'Copyright © 2005-2008 by Pyroto, Inc.'
  $CfgVersion = '1.00.00'
  $CfgProgrammer = 'Timothy Campbell'
  AtSym = $40 ; Anti-spam gimmick
  $CfgEmail = 'info' AtSym 'parse-o-matic.com'
  $CfgLicense = 'This script may be used by anyone who has a valid ' >>
                'Advanced Scripting License from Pyroto, Inc. ' >>
                ', or is evaluating one of our ' >>
                'Parse-O-Matic products (for up to 30 days).'

End

;===============================================================================
; TaskInit
;===============================================================================

TaskInit
  BlankLineTags = >>
    ' <BR><BR>' >> ; Two breaks = a blank line
    ' <DIV' >>
    ' <H1 <H2 <H3 <H4' >>
    ' </H1 </H2 </H3 </H4' >>
    ' <HR' >>
    ' <OL' >>
    ' <P' >>
    ' <TABLE' >>
    ' <TR' >>
    ' <UL'
  CRLF = $0D$0A ; Carriage Return and Line Feed
  CRLF2 = CRLF CRLF ; Two CRLF's
  NumInpFiles = 0
  OnlyHTML = 'Only files with the .htm or .html extension are processed'
  SepLine = Padded '' 80 'Left' '='
  ValidDataStart = ' <BODY <FORM <HEAD <HTML'
End

;===============================================================================
; TaskDone
;===============================================================================

TaskDone
  ShowNote ''
End

;===============================================================================
; FileInit
;===============================================================================

FileInit
  ShortFName = Parse $ActualIFN '>*\' ''
  ShowNote ShortFName
  If $ActualIFN ^ '.htm' FileOkay = 'Y'
  Otherwise FileOkay = 'N'
  FirstLog = 'Y'
  HadLF = 'Y' ; Avoid starting with null line
  LeftOver = ''
  ValidData = 'N'
  SawOkayFile = 'N'
  Inc NumInpFiles
  If NumInpFiles #> 1 OutNull
  OutEnd SepLine
  OutEnd $ActualIFN
  OutEnd SepLine
  OutNull
End

;===============================================================================
; Main
;===============================================================================
; Skip invalid files
;-------------------------------------------------------------------------------

Begin FileOkay = 'N'
  LogMsgLF
  LogMsg OnlyHTML
  OutEnd OnlyHTML
  NextStep
Else
  SawOkayFile = 'Y'
End

;-------------------------------------------------------------------------------
; Prefix any unresolved data
;-------------------------------------------------------------------------------

Begin LeftOver <> ''
  $Data = LeftOver $Data
  LeftOver = ''
End

;-------------------------------------------------------------------------------
; Ignore null lines
;-------------------------------------------------------------------------------

If $Data = '' Done

;-------------------------------------------------------------------------------
; Look for scripts
;-------------------------------------------------------------------------------

Call ScriptCheck 'script'
Call ScriptCheck 'noscript'

;-------------------------------------------------------------------------------
; Are we seeing actual HTML?
;-------------------------------------------------------------------------------

Begin ValidData = 'N'
  ScanPosn $Ignore $Ignore $Data ValidDataStart
  If $Success = 'N' ScanPosn $Ignore $Ignore $Data BlankLineTags
  Begin $Success = 'Y'
    ValidData = 'Y'
  Else
    Call LabelLog
    LogMsg $Data
    Done
  End
End

;-------------------------------------------------------------------------------
; Parse out the HTML
;-------------------------------------------------------------------------------

Line = $Data

Begin
  ScanPosn FromPosn $Ignore Line ' <[A-Z] <[a-z] </ <! <?' 'First RegExp'
  Begin FromPosn = 0 ; Did we find an HTML tag?
    Break ; No tags found, so bale out
  Else
    ;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    ; We found a tag
    ;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Insertion = '' ; For special tag processing
    LineLeft = Parse Line 1 FromPosn 'Cut' ; Get up to the start of tag
    $Ignore = Parse LineLeft -1 -1 'Cut' ; Remove the < character
    FullTag = Parse Line '' '1*>' 'Cut Include' ; Look for the end of tag
    Begin FullTag = '' ; Didn't find the end of tag?
      ;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      ; Didn't find the end of this tag
      ;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      LeftOver = $Data ; Restore the line
      Done ; Save line for next pass
    Else
      ;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      ; Found full tag; see if it needs to be treated specially
      ;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      FullTag = '<' FullTag
      ScanPosn $Ignore $Ignore FullTag BlankLineTags
      Begin $Success = 'Y'
        Insertion = CRLF ; Embed a CRLF
      Else
        ScanPosn $Ignore $Ignore FullTag '/<LI /<LI>'
        Begin $Success = 'Y'
          Insertion = '- '
        Else
          ScanPosn $Ignore $Ignore FullTag '/<BR>'
          If $Success = 'Y' Insertion = ' '
        End
      End
    End
    ;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    ; Tag successfully removed
    ;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Line = LineLeft Insertion Line ; Stitch the line back together
  End ; We found an HTML tag
Again ; Loop while we have tags

;-------------------------------------------------------------------------------
; Output
;-------------------------------------------------------------------------------

Begin Line <> ''
  ;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  ; Translate HTML symbols. This script would probably run somewhat faster if
  ; we put these symbols in a lookup file and used the MassChange command, but
  ; that is beyond the scope of this demonstration script. Also, we only
  ; include a few of the frequently-used numeric symbols.
  ;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  Change Line '&#153;' '™'
  Change Line '&#160;' ' '
  Change Line '&#163;' '£'
  Change Line '&#164;' '¤'
  Change Line '&#165;' '¥'
  Change Line '&#169;' '©'
  Change Line '&#174;' '®'
  Change Line '&#176;' '°'
  Change Line '&#177;' '±'
  Change Line '&Aacute;' 'Á'
  Change Line '&aacute;' 'á'
  Change Line '&acirc;' 'â'
  Change Line '&Acirc;' 'Â'
  Change Line '&acute;' '´'
  Change Line '&AElig;' 'Æ'
  Change Line '&aelig;' 'æ'
  Change Line '&agrave;' 'à'
  Change Line '&Agrave;' 'À'
  Change Line '&amp;' '&'
  Change Line '&aring;' 'å'
  Change Line '&Aring;' 'Å'
  Change Line '&atilde;' 'ã'
  Change Line '&Atilde;' 'Ã'
  Change Line '&Auml;' 'Ä'
  Change Line '&auml;' 'ä'
  Change Line '&bdquo;' '„'
  Change Line '&brvbar;' '¦'
  Change Line '&bull;' '•'
  Change Line '&ccedil;' 'ç'
  Change Line '&Ccedil;' 'Ç'
  Change Line '&cedil;' '¸'
  Change Line '&cent;' '¢'
  Change Line '&circ;' 'ˆ'
  Change Line '&copy;' '©'
  Change Line '&curren;' '¤'
  Change Line '&dagger;' '†'
  Change Line '&Dagger;' '‡'
  Change Line '&deg;' '°'
  Change Line '&divide;' '÷'
  Change Line '&eacute;' 'é'
  Change Line '&Eacute;' 'É'
  Change Line '&Ecirc;' 'Ê'
  Change Line '&ecirc;' 'ê'
  Change Line '&Egrave;' 'È'
  Change Line '&egrave;' 'è'
  Change Line '&emsp;' ' '
  Change Line '&ensp;' ' '
  Change Line '&ETH;' 'Ð'
  Change Line '&eth;' 'ð'
  Change Line '&euml;' 'ë'
  Change Line '&Euml;' 'Ë'
  Change Line '&euro;' '€'
  Change Line '&fnof;' 'ƒ'
  Change Line '&frac12;' '½'
  Change Line '&frac14;' '¼'
  Change Line '&frac34;' '¾'
  Change Line '&gt;' '>'
  Change Line '&hellip;' '…'
  Change Line '&Iacute;' 'Í'
  Change Line '&iacute;' 'í'
  Change Line '&Icirc;' 'Î'
  Change Line '&icirc;' 'î'
  Change Line '&iexcl;' '¡'
  Change Line '&igrave;' 'ì'
  Change Line '&Igrave;' 'Ì'
  Change Line '&image;' 'I'
  Change Line '&iquest;' '¿'
  Change Line '&iuml;' 'ï'
  Change Line '&Iuml;' 'Ï'
  Change Line '&laquo;' '«'
  Change Line '&ldquo;' '“'
  Change Line '&lsaquo;' '‹'
  Change Line '&lsquo;' '‘'
  Change Line '&lt;' '<'
  Change Line '&macr;' '¯'
  Change Line '&mdash;' '—'
  Change Line '&micro;' 'µ'
  Change Line '&middot;' '·'
  Change Line '&minus;' '-'
  Change Line '&nbsp;' ' '
  Change Line '&ndash;' '–'
  Change Line '&not;' '¬'
  Change Line '&Ntilde;' 'Ñ'
  Change Line '&ntilde;' 'ñ'
  Change Line '&Oacute;' 'Ó'
  Change Line '&oacute;' 'ó'
  Change Line '&Ocirc;' 'Ô'
  Change Line '&ocirc;' 'ô'
  Change Line '&OElig;' 'Œ'
  Change Line '&oelig;' 'œ'
  Change Line '&Ograve;' 'Ò'
  Change Line '&ograve;' 'ò'
  Change Line '&ordf;' 'ª'
  Change Line '&ordm;' 'º'
  Change Line '&Oslash;' 'Ø'
  Change Line '&oslash;' 'ø'
  Change Line '&otilde;' 'õ'
  Change Line '&Otilde;' 'Õ'
  Change Line '&Ouml;' 'Ö'
  Change Line '&ouml;' 'ö'
  Change Line '&para;' '¶'
  Change Line '&permil;' '‰'
  Change Line '&plusmn;' '±'
  Change Line '&pound;' '£'
  Change Line '&quot;' '"'
  Change Line '&raquo;' '»'
  Change Line '&rdquo;' '”'
  Change Line '&real;' 'R'
  Change Line '&reg;' '®'
  Change Line '&rsaquo;' '›'
  Change Line '&rsquo;' '’'
  Change Line '&sbquo;' '‚'
  Change Line '&scaron;' 'š'
  Change Line '&Scaron;' 'Š'
  Change Line '&sect;' '§'
  Change Line '&shy;' ''
  Change Line '&sup1;' '¹'
  Change Line '&sup2;' '²'
  Change Line '&sup3;' '³'
  Change Line '&szlig;' 'ß'
  Change Line '&thinsp;' ' '
  Change Line '&thorn;' 'þ'
  Change Line '&THORN;' 'Þ'
  Change Line '&tilde;' '˜'
  Change Line '&times;' '×'
  Change Line '&trade;' '™'
  Change Line '&uacute;' 'ú'
  Change Line '&Uacute;' 'Ú'
  Change Line '&ucirc;' 'û'
  Change Line '&Ucirc;' 'Û'
  Change Line '&ugrave;' 'ù'
  Change Line '&Ugrave;' 'Ù'
  Change Line '&uml;' '¨'
  Change Line '&uuml;' 'ü'
  Change Line '&Uuml;' 'Ü'
  Change Line '&yacute;' 'ý'
  Change Line '&Yacute;' 'Ý'
  Change Line '&yen;' '¥'
  Change Line '&yuml;' 'ÿ'
  Change Line '&Yuml;' 'Ÿ'
  ;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  ; Handle leading CRLF's
  ;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  Change Line CRLF2 CRLF ; Remove multiple linefeeds
  NeedLFBefore = 'N'
  Begin Line[1 2] = CRLF
    $Ignore = Parse Line 1 2 'Cut'
    If HadLF = 'N' NeedLFBefore = 'Y'
  End
  ;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  ; Handle trailing CRLF's
  ;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  NeedLFAfter = 'N'
  Begin Line Len>= 2
    LastTwoChars = Parse Line -2 -1
    Begin LastTwoChars = CRLF
      $Ignore = Parse Line -2 -1 'Cut'
      NeedLFAfter = 'Y'
    End
  End
  ;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  ; Remove spaces on the edges, since many HTML coders tend to indent their text
  ; to highlight its structure.
  ;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  TrimChar Line
  ;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  ; Send the line to the output file
  ;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  Begin Line = ''
    If HadLF = 'N' OutNull
    HadLF = 'Y'
    Done
  End
  If NeedLFBefore = 'Y' OutNull
  OutEnd Line
  HadLF = 'N'
  Begin NeedLFAfter = 'Y'
    OutNull
    HadLF = 'Y'
  End
End

Done

;===============================================================================
; Subroutines
;===============================================================================

Procedure LabelLog
  Begin FirstLog = 'Y'
    FirstLog = 'N'
    LogMsgLF
    LogMsg '-----------------------------'
    LogMsg 'Header and Script Information'
    LogMsg '-----------------------------'
  End
End

Procedure ScriptCheck
  EndList = ' </no' ScriptCheck '>' ; e.g. ' </noscript>'
  ;-----------------------------------------------------------------------------
  ; Have we exited a multi-line script section?
  ;-----------------------------------------------------------------------------
  Begin ValidData = 'N'
    ScanPosn SCFrom SCTo $Data EndList
    Begin $Success = 'Y'
      ValidData = 'Y'
      SCTag = Parse $Data SCFrom SCTo 'Cut'
      Call LabelLog
      LogMsg SCTag
      TrimChar $Data
      If $Data = '' Done
    End
  End
  ;-----------------------------------------------------------------------------
  ; Look for a script starting, and maybe ending on the same line
  ;-----------------------------------------------------------------------------
  StartList = ' <' ScriptCheck ; e.g. ' <script'
  ScanPosn $Ignore $Ignore $Data StartList
  Begin $Success = 'Y'
    ;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    ; Found the start of the script
    ;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    ScanPosn $Ignore $Ignore $Data EndList
    Begin $Success = 'N'
      ;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      ; Script does not end on the same line
      ;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Call LabelLog
      LogMsg $Data
      ValidData = 'N'
      Done
    Else
      ;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      ; Script ends on the same line
      ;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      ExtFrom = '1*<' ScriptCheck
      ExtTo = '>*</no' ScriptCheck
      Extract = Parse $Data ExtFrom ExtTo
      Call LabelLog
      LogMsg Extract
    End
  End
End
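
For readers who do not have Parse-O-Matic at hand, the general idea of the script above (drop the tags, skip script sections, start a new line at block-level tags, translate HTML symbols) can be sketched in Python. This is not a translation of the script: it swaps in the standard library's html.parser instead of scanning line by line, and the names TextExtractor, BLOCK_TAGS, SKIP_TAGS and html_to_text are made up for this illustration.

```python
from html.parser import HTMLParser

# Illustrative tag sets, loosely modelled on the script's BlankLineTags list.
BLOCK_TAGS = {"div", "h1", "h2", "h3", "h4", "hr", "ol", "p", "table", "tr", "ul"}
SKIP_TAGS = {"script", "noscript"}


class TextExtractor(HTMLParser):
    """Collects the text of an HTML document and discards the markup."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode &nbsp;, &amp;, etc.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than zero while inside a skipped section

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag == "br":
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in {"h1", "h2", "h3", "h4"}:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of `html`, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Like the script, turn no-break spaces into ordinary spaces.
    text = "".join(parser.parts).replace("\xa0", " ")
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

Calling html_to_text on the contents of an .htm file returns the extracted text as a single string. Unlike the script, this sketch drops all blank lines rather than keeping one between blocks.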
===============================================================================
Editing a Text File
Text files can generally be loaded by a text editor program (such as Windows Notepad, or NoteTab from Fookes Software), and most word-processing programs can load them as well. However, when you save a text file loaded this way it may lose its original format.
For example, you might load a classic Mac text file (in which each line ends with CR), but if you edit and save it using a Windows text editor you might find that each line in the file now ends with CRLF. This might cause problems later, if the next program to use the file does not know how to deal with CRLF-delimited files. In such a case, the extra LF may appear in the program as a strange-looking character at the beginning of each line (starting with the second line).
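
If you are unsure which line-ending convention a file uses, you can count the terminators before and after editing it. The following is a minimal Python sketch; the function name count_line_endings is made up for this illustration, and it assumes you have read the file in binary mode so that no translation has taken place.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, lone CR bytes and lone LF bytes in raw file data."""
    crlf = data.count(b"\r\n")
    cr_only = data.count(b"\r") - crlf  # CR not followed by LF
    lf_only = data.count(b"\n") - crlf  # LF not preceded by CR
    return {"CRLF": crlf, "CR": cr_only, "LF": lf_only}
```

For a file on disk, read the bytes with open(path, "rb").read() and pass them to the function. A file that shows more than one non-zero count has mixed line endings.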
Worse problems can arise if you edit a text file in a word-processing program. When you save the file, you must ensure that you save it as a text file rather than a word-processing file. In Word 2002 you can select "File/Save As", and then select "Plain Text (*.txt)". If you should inadvertently save a text file in a word-processing format, it will now contain a lot of additional information it did not have before. This will probably render it useless to the next program that tries to use it, since it expected an ordinary text file. Fortunately, it will probably be easy to load the file back into the word processing program and save it again, this time making sure to specify a text file format.
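
One way to check whether a file was accidentally saved in a word-processing format is to look at its first few bytes, since those formats begin with well-known signatures. The sketch below is only a rough test covering three common cases; the names WORD_PROCESSOR_SIGNATURES and identify_non_text are made up for this illustration.

```python
# Leading byte sequences of some common word-processing formats.
WORD_PROCESSOR_SIGNATURES = {
    b"PK\x03\x04": "ZIP container (used by .docx and .odt)",
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "OLE compound file (used by older .doc)",
    b"{\\rtf": "Rich Text Format",
}


def identify_non_text(data: bytes):
    """Return a description if `data` starts with a known signature, else None."""
    for signature, description in WORD_PROCESSOR_SIGNATURES.items():
        if data.startswith(signature):
            return description
    return None
```

A result of None does not prove that the file is plain text. It only means that none of the listed signatures was found.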
Some Examples of Text File Extensions
A file whose name ends with the characters .txt is almost certainly a text file. Other extensions typical of text files include .me (as in a file named Read.Me) and .htm — which is an HTML file, as used by web pages.
Windows files with the .ini extension are also text files, so they could be loaded into a text editor program. However, just because you can do this does not mean that you should do this. An ini file typically contains the settings for a program, and if you alter the file the program might stop working.
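
If you only want to look at the settings in an ini file, a program can read it without writing anything back. Here is a small Python sketch using the standard configparser module; the section name Display and its two settings are invented sample data.

```python
import configparser

# Invented sample contents; a real ini file would be read with config.read(path).
sample = "[Display]\nwidth = 800\nheight = 600\n"

config = configparser.ConfigParser()
config.read_string(sample)  # parses the text; nothing is written to disk

sections = config.sections()
width = config.getint("Display", "width")
height = config.getint("Display", "height")
```

Because nothing is saved, the original file stays exactly as the owning program left it.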
Files with the .csv extension are comma-separated-value files. These can be loaded into a text editor, but your operating system may be configured to open them in a spreadsheet if you double-click on them.
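
Because a csv file is ordinary text, a program can read it directly. The Python sketch below parses two lines of invented sample data with the standard csv module and shows why quoting matters: a comma inside a quoted value does not start a new field.

```python
import csv
import io

# Invented sample data: a header line and one record with a quoted comma.
sample = 'name,city\n"Smith, Jane",Ottawa\n'

rows = list(csv.reader(io.StringIO(sample)))
# rows[0] is the header; rows[1] has two fields, the first being "Smith, Jane"
```

Splitting each line on commas by hand would break the quoted name into two fields, which is why a dedicated csv reader is usually the safer choice.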
To summarize the foregoing: many programs save data in text files, but not all text files are supposed to be loaded into a text editor program.
   
|