Filtering Text with Basic Formatting Only


UK_Smithy
12-17-2012, 09:24 AM
I've been working with QuarkXPress and InDesign for many years, but have not yet found a way to filter text files from MS Word so that ONLY very basic formatting is imported. For instance, I want to import Word files WITHOUT font name, size and spacing information, but I want to RETAIN the formatting data for Roman, Italic, Bold and Bold Italic. I want to remove everything else, so in essence I want a plain text file that contains the <b>, <i>, <bi> and <p> flags but no other formatting data.

Does anyone know if this is possible, and if so what are the best tools?

Thanks.

Steve Rindsberg
12-17-2012, 10:54 AM
I'm probably missing something, but it seems you could open the file in Word, save under a new name, then Ctrl+A to select all, change everything to a single font name, size and spacing. That shouldn't affect bolding/italic/etc.

UK_Smithy
12-17-2012, 11:42 AM
The problem with that solution is the fact that the Word file continues to contain font & size definitions. Even if it's all the same font/size, the information is there regardless. I want to be able to import the file into InDesign or Quark in the same way as I would a plain text file, but it would contain just the roman, italic, bold etc styles and nothing else.

I'm importing Word files for a number of academic books, and the fonts and sizes must be standardised using Style Sheets, but I don't want to have to go through manually redefining the italic and bold bits, of which there are thousands.

UK_Smithy
12-17-2012, 12:42 PM
Put another way, all I want to do is import a Word document into Quark or InDesign and then apply Style Sheets cleanly but without losing the four basic text attributes.
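Editor's sketch: one way to get the plain-text-plus-flags file described above is to read the .docx package directly. This is a minimal illustration, not a tested production tool. It assumes a .docx (not legacy .doc) file and that bold and italic are applied as direct run formatting; bold or italic inherited from a character style would be missed. The function names (`tagged_paragraphs`, `read_docx`) and the `<bi>` flag are illustrative choices taken from the wording of the question, not part of any tool.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used inside word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _is_on(run, name):
    """True if the run's own properties switch the toggle (b or i) on."""
    el = run.find(f"w:rPr/w:{name}", NS)
    if el is None:
        return False
    # A missing w:val means "on"; explicit 0/false/off means "off".
    return el.get(f"{{{W}}}val", "true") not in ("0", "false", "off")


def tagged_paragraphs(document_xml):
    """Yield each paragraph as <p>...</p> with <b>, <i> or <bi> runs."""
    root = ET.fromstring(document_xml)
    for para in root.iter(f"{{{W}}}p"):
        parts = []
        for run in para.iter(f"{{{W}}}r"):
            text = "".join(t.text or "" for t in run.findall("w:t", NS))
            if not text:
                continue
            bold, italic = _is_on(run, "b"), _is_on(run, "i")
            tag = "bi" if bold and italic else "b" if bold else "i" if italic else ""
            parts.append(f"<{tag}>{text}</{tag}>" if tag else text)
        yield "<p>" + "".join(parts) + "</p>"


def read_docx(path):
    """Hypothetical helper: read a .docx file and return its tagged paragraphs."""
    with zipfile.ZipFile(path) as z:
        return list(tagged_paragraphs(z.read("word/document.xml")))


# Hand-written sample standing in for a real document.xml.
sample = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    '<w:r><w:rPr><w:rFonts w:ascii="Arial"/><w:sz w:val="36"/></w:rPr>'
    '<w:t xml:space="preserve">See </w:t></w:r>'
    '<w:r><w:rPr><w:i/></w:rPr><w:t>Homo sapiens</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> for details.</w:t></w:r>'
    '</w:p></w:body></w:document>'
)
print(list(tagged_paragraphs(sample)))
# ['<p>See <i>Homo sapiens</i> for details.</p>']
```

Font name and size information is simply never read, so it cannot leak into the output. The sketch does not escape literal angle brackets in the text and ignores footnotes, tabs and line breaks.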

terrie
12-17-2012, 01:26 PM
uk smithy: Does anyone know if this is possible, and if so what are the best tools?

I don't use Word, so I'm not sure whether this is possible or, if it is, whether it will give you what you want, but...'-}}

Can you save the file as an RTF (rich text format) and try importing the RTF into Quark/ID?

Terrie

UK_Smithy
12-17-2012, 01:32 PM
Thanks, but no, that doesn't work. The RTF file contains just as much formatting data as the Word file. I'm currently experimenting with InDesign's ability to export as 'Tagged Text'. When I open the exported file in a text editor (TextWrangler in my case) it shows all of the formatting and style data. If I can delete everything except the tags for the aforementioned styling, I should be able to import it back, adopting the InDesign Style Sheets 'cleanly' but also retaining the basic styles. I'll let you folks know if I'm successful.
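Editor's sketch: the delete-everything-except-a-few-tags idea can be expressed as a short allowlist filter. This is only an illustration under stated assumptions. The tag names used here (`cTypeface:`, `ParaStyle:`, `cSize:`, `cFont:`) are assumptions about what a Tagged Text export contains and should be checked against a real exported file; the function name `strip_tags` is made up for the example.

```python
import re

# Tag prefixes to keep. Assumed names: verify against a real export.
KEEP_PREFIXES = ("cTypeface:", "ParaStyle:")

# Matches one <...> tag that contains no nested angle brackets.
TAG_RE = re.compile(r"<([^<>]*)>")


def strip_tags(text, keep_prefixes=KEEP_PREFIXES):
    """Remove every <...> tag whose content does not start with a kept prefix."""
    def decide(match):
        if match.group(1).startswith(keep_prefixes):
            return match.group(0)  # keep the tag unchanged
        return ""                  # drop everything else
    return TAG_RE.sub(decide, text)


# Hand-written sample line, not taken from a real export.
sample = (
    "<ParaStyle:Body><cSize:12.000000><cFont:Arial>Found in "
    "<cTypeface:Italic>Tyrannosaurus rex<cTypeface:> beds.<cSize:><cFont:>"
)
print(strip_tags(sample))
# <ParaStyle:Body>Found in <cTypeface:Italic>Tyrannosaurus rex<cTypeface:> beds.
```

The sketch treats every `<...>` as a tag, so any header line at the top of the file and any escaped angle brackets in the text would need to be handled separately before running it on a real file.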

Michael Beloved
12-17-2012, 02:00 PM
I have found that the only way to remove the hidden Word format markings completely is to move the contents into Notepad and then copy it from there and paste it in. The problem with this method is that you lose the elementary styling which you had before.

I found this out when I first began converting .docx files from Word 2007 to HTML format in preparation for making Kindle files.

After trying many solutions, I came to the conclusion that you cannot remove the Word markings in total. To have a clean file you have to begin with a txt file and style it in the desired program from day one.

terrie
12-17-2012, 02:28 PM
uk smithy: I'll let you folks know if I'm successful.

Sorry the RTF idea was a no go. Do let us know how it goes...

One of the reasons I have always liked WordPerfect is its Reveal Codes option, which allows you to see the internal codes, and you can do a lot of playing with them. I don't know if more current versions still have it, though. I'm still using WordPerfect 8...

Terrie

Howard Allen
12-17-2012, 02:54 PM
I feel your pain. I do a palaeontological abstracts volume every year, as well as a quarterly newsletter, and the submissions (almost all Word files) are liberally peppered with italicized Latin names, all in different fonts and styles. I want the italics, but not all the other junk.

I'm not sure if I've fully grasped your problem, however, because it seems to me that InDesign already does what you want if you simply "Place" using the "Remove Styles and Formatting from Text and Tables" option with the "Preserve Local Overrides" box checked.

After it's placed in the ID document, I apply my paragraph style sheet to the text, and it's done. See the attached screenshots "before" (Word document, in 12 pt Times New Roman and 18 pt Arial) and "after" (ID document in 11 pt Minion Pro). All the italic and bold comes through with no fiddling. Note that your ID style sheet should specify only the font family, not a particular face (plain, bold, italic, etc.).

Am I barking up the wrong tree? :)

UK_Smithy
12-17-2012, 03:05 PM
Howard - thanks, you're barking up the right tree, that's spot on!

I hadn't understood the 'Preserve Local Overrides' box! I believe Quark 9 has a similar feature but I only have version 8, but obviously ID does sport it, so even if I need to work in Quark I can almost certainly convert the text in ID first.

All that said, it does seem feasible to edit tagged text using a text editor, but that's still long-winded compared to InDesign's little gizmo.

Joy of joys - happy Christmas one and all!

Steve Rindsberg
12-17-2012, 03:06 PM
Hm. I wonder if that's because Word sees the text as being styled with our Select-All, One Font, One Size, One Etc changes as a style override.

What happens if you select all, then set the style to something innocuous, Normal maybe?

Howard Allen
12-17-2012, 03:20 PM
You're welcome! It's gratifying to know that among all the shots in the dark I've made, one or two have found their mark.

Best regards,

Michael Beloved
12-17-2012, 03:42 PM
Word has a deformatting command, a little icon that says Aa in the font formatting tools section. This removes all formatting and sets everything to the default font style, whatever that is set to.

However, if you lift that deformatted text, put it in an HTML text editor and then read the code, you will find that most of it is marked with Word's special HTML way of doing things.

In other words, removing formatting only removes whatever differs from the default settings behind the scenes.

I used to struggle with this when converting my books into HTML in preparation for Kindle format, but I found it was easier to deformat completely, wiping out all formatting, and then reset it as intended in an HTML editor like Expression Web or Dreamweaver.
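Editor's sketch: for the HTML route, a middle path between keeping all of Word's markup and wiping everything is an allowlist filter that keeps only paragraph, bold and italic tags and discards the rest. This is a minimal illustration using Python's standard `html.parser`; the class and function names are made up for the example, and the sample markup is hand-written rather than real Word output. It only catches bold and italic expressed as `<b>`, `<strong>`, `<i>` or `<em>` tags, so emphasis that exists only as an inline style on a `<span>` would be lost.

```python
import html
from html.parser import HTMLParser

# Tags to keep, mapped to the bare tag written to the output.
KEEP = {"b": "b", "strong": "b", "i": "i", "em": "i", "p": "p"}
# Elements whose whole content is discarded (style sheets, scripts, header).
SKIP_CONTENT = {"style", "script", "head"}


class BasicFormatFilter(HTMLParser):
    """Collect text plus attribute-free <p>, <b> and <i> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif tag in KEEP and not self.skip_depth:
            self.out.append(f"<{KEEP[tag]}>")  # attributes are dropped

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP and not self.skip_depth:
            self.out.append(f"</{KEEP[tag]}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Re-escape so characters like & and < stay valid in the output.
            self.out.append(html.escape(data, quote=False))


def keep_basic_formatting(html_text):
    parser = BasicFormatFilter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


sample = (
    "<html><head><style>p.MsoNormal{font-size:12pt}</style></head><body>"
    "<p class=MsoNormal><span style='font-family:Arial'>The genus "
    "<i style='mso-bidi-font-style:normal'>Homo</i> is <b>not</b> extinct."
    "</span></p></body></html>"
)
print(keep_basic_formatting(sample))
# <p>The genus <i>Homo</i> is <b>not</b> extinct.</p>
```

Because the filter works from an allowlist, any markup it does not recognise is dropped by default rather than having to be listed and removed one item at a time.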

Steve Rindsberg
12-18-2012, 09:12 AM
Yeah, when you bring Word's HTML into the discussion, Rationality leaves the room in a huff.