searching and form filling software

Status
Not open for further replies.

snowshoe2

Mechanical
Sep 10, 2012
58
For projects we need to manually search through documents and find answers to the same questions on a recurring basis. Essentially there is a form that is filled with technical data pulled from documents that could either be from the customer or internally created. It is time consuming because of variations in the document structures.

What I am looking for is a software tool that I could point to a directory of documents and it would find the answers to the standard questions. All the data mining software I have looked at is either geared towards marketing or legal firms, and is actually more complex than is required.

Windows advanced search is not that effective on its own (or that is my experience anyway). It seems like a simple concept considering the power of searching, text mining and semantic searching tools on the market. Just find the answers/data for 30 repeating questions per project.

thanks
Snowshoe2

 
When I was using Windows, I found Agent Ransack to be far, far superior to Windows Search.
Also, I hated that stupid dog.

Note also that Windows Search is preset to not bother searching certain kinds of files; MS does not go out of its way to make this point clear. There's a system setting for it. I forget where it is or how to change its behavior, but Google should help with that.

None of that will help with your problem.

Are the documents stored as word processor files, e.g. *.doc, or as scanned images, or what?

If they're in a form that the computer recognizes as text, you might be able to use *nix tools like grep and awk (Windows versions exist) to extract data from them. This becomes easier if the formats are standardized or the data fields are tagged somehow, but probably cannot be completely automated.
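
To illustrate the tagged-field case, here is a minimal sketch in Python rather than awk. It assumes the documents are already plain text and that each field sits on its own line as "Label: value". The function name, labels and sample text are made up for illustration.

```python
import re

# Matches lines such as "Design Pressure: 150 psig" (label, colon, value).
FIELD_RE = re.compile(r"^\s*([A-Za-z][A-Za-z /]*?)\s*:\s*(.+?)\s*$")

def extract_tagged_fields(text, wanted_labels):
    """Return {label: value} for each wanted label found in the text."""
    wanted = {label.lower(): label for label in wanted_labels}
    found = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match and match.group(1).lower() in wanted:
            # Keep only the first occurrence of each label.
            found.setdefault(wanted[match.group(1).lower()], match.group(2))
    return found

# Made-up sample document; label matching is case-insensitive.
sample = """Project: Pump Skid 12
Design Pressure: 150 psig
Notes: see attached drawing
design temperature : 200 F
"""
print(extract_tagged_fields(sample, ["Design Pressure", "Design Temperature"]))
```

Anything that does not follow the "Label: value" layout is silently skipped, which is the part that probably cannot be completely automated.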

... But wait a sec. The hospitals I have visited in recent years have equipped every one of their computers with a document scanner, and they scan _everything_. I was thinking it was odd, because of the huge amount of space required to store all those documents as scanned images.

It occurs to me now that maybe they don't keep the images forever. If instead they cached the scanned images and had them OCR'd and, er, mined, they could get whatever information they needed from each image and just store the information, which takes up much less space than an image and can be sorted, searched, etc.

A Google search on "data mining service", without the quotes, was quite productive. Clearly, multiple outfits have made a business out of data mining for other businesses, so maybe you don't have to buy the data mining software yourself.


Mike Halloran
Pembroke Pines, FL, USA
 
Thanks for the ideas Mike. Your hospital experience, with computers pulling the data and making it work, is what I would like to create here. It is such a time waster looking for information; I just want to define what I need and let the computer do the work.

all the best
Snowshoe2

 
snowshoe2,

Do you know what these questions are?

Could you write a FAQ?

Learn HTML. The language is dead simple. Composing it in NOTEPAD probably is simpler than using the commercial editors. A FAQ written in HTML allows you to answer the questions, and link to the more detailed documentation.

--
JHG
 
It should be straightforward to write a program in Python or Perl to parse your data and then output it to a text document. You could then write it to a text file with the proper spacing to fill out your forms.

You may, in fact, be able to use tools like awk, sed, and grep to do what you'd like to do without having to get into a full-fledged programming language like Python or Perl.

Not knowing how much variation you have from document to document, it's hard to say how successful an automated search will be... but if there are known keywords, etc. you can probably work out some logic to parse the files.
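
As a rough sketch of that keyword logic, assuming the documents have already been saved as .txt files in one directory: the QUESTIONS map, its patterns, and the function names below are hypothetical placeholders, not anything from the OP's actual form.

```python
import re
from pathlib import Path

# Hypothetical question -> keyword pattern map; replace with your own questions.
QUESTIONS = {
    "Design pressure": r"design\s+pressure",
    "Material": r"material",
}

def find_answers(directory, questions=QUESTIONS):
    """Return {question: (file name, matching line)} for the first hit per question."""
    answers = {}
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for line in text.splitlines():
            for question, pattern in questions.items():
                if question not in answers and re.search(pattern, line, re.IGNORECASE):
                    answers[question] = (path.name, line.strip())
    return answers

def write_form(answers, questions, out_path):
    """Write one row per question; questions with no hit are flagged NOT FOUND."""
    with open(out_path, "w", encoding="utf-8") as out:
        for question in questions:
            source, line = answers.get(question, ("-", "NOT FOUND"))
            out.write(f"{question:<20} {line}  [{source}]\n")
```

Something like `write_form(find_answers("project_docs"), QUESTIONS, "form.out")` would then produce a draft form. Recording the source file next to each answer makes it easier for a person to check the hits, since the first matching line is not always the right one.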
 
Thanks flash3780,
There are known keywords that we can work with. I was hoping to find an off-the-shelf program that would just take a bit of setup.

 
I think that grep and sed are probably the closest thing to an off-the-shelf solution that you'll come across. grep is a tool that searches text files and returns lines which match an expression (e.g. your keywords). sed can manipulate that text to output only the information that you want. grep can be piped into sed, so between the two, you should be able to grab lines containing the keywords you're looking for, trim any info you don't need from those lines, and dump the result into a text file by redirecting stdout.
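
For anyone more comfortable in a scripting language, the same filter-then-trim idea can be sketched in Python. This is only an imitation of the grep-into-sed pipeline, not the tools themselves, and the helper names and sample lines are made up.

```python
import re

def grep(lines, pattern):
    """Like `grep -i`: keep only the lines that match the pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    return (line for line in lines if regex.search(line))

def sed_substitute(lines, pattern, replacement):
    """Like `sed 's/pattern/replacement/'`: rewrite the first match in each line."""
    regex = re.compile(pattern, re.IGNORECASE)
    return (regex.sub(replacement, line, count=1) for line in lines)

# Made-up document text, one entry per line.
document = [
    "General notes",
    "Max operating temperature: 200 F (see spec 4.2)",
    "Paint colour: grey",
    "Min operating temperature: -20 F",
]

# grep for the keyword, then strip everything up to and including the colon.
hits = grep(document, r"operating temperature")
values = [line.strip() for line in sed_substitute(hits, r"^.*?:\s*", "")]
print(values)
```

Both helpers return generators, so one feeds the next a line at a time, much as a shell pipe would.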

If you're dealing with Word documents, you may need a tool like vmware or docx2txt to batch convert them to text files so that you can parse them more easily. I haven't tried those, so I can't vouch for how well they work... but the price is right.
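
If no converter is handy, a .docx file can also be read with the Python standard library alone, since it is a zip archive whose main text lives in word/document.xml. The sketch below is not how the tools above work (I have not looked). It ignores headers, footnotes and formatting, and does not handle the older binary .doc format. The function names are made up.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_to_text(docx_path):
    """Return the body text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        paragraphs.append("".join(node.text or "" for node in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)

def convert_directory(directory):
    """Write a .txt file next to every .docx file in the directory."""
    for docx_path in Path(directory).glob("*.docx"):
        docx_path.with_suffix(".txt").write_text(docx_to_text(docx_path), encoding="utf-8")
```

The resulting .txt files can then be searched like any other text file.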

At the end of the day, you should be able to put together a short shell script to do what you want, I think.
 
Textpad has a batch mode? Neat. If the OP is still following the thread, an example of the text to be parsed would be helpful.
 