Eng-Tips: Intelligent Work Forums for Engineering Professionals
AI for a library 2


shvet

Petroleum
Aug 14, 2015
Dear forum members,

I have a large library of standards, practices, articles, books, reports and the like, which I use every day. The problem is that the library has grown to 100+ GB and 40k+ files and keeps evolving. Eventually it became hard to find information on demand: I remember that something exists, and even roughly what it looks like, but cannot recall where exactly. So I have to look through all the relevant files, which takes an enormous amount of time.

The question is: are there AI-enhanced tools that can help? Some software or application that a user can index files with, or upload files to, in order to run a context search. Something like Copilot or ChatGPT, but for a private library.

Hope the core idea is clear. I have spent a couple of weekends googling and have found nothing.

Please guide me if this is the incorrect forum for such issues.
 

100+ GB and 40+k files

Is that all you got? ;-)

There are non-AI programs that can help:
> EVERYTHING is a somewhat easier program than the Windoze Indexer; if your filenames are informative you can potentially find things, particularly if your folder structure is helpful
> dnGREP, which can search both filenames and contents
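In the same spirit as dnGREP, a filename-plus-contents search is easy to script yourself. A minimal sketch (the function name, directory path and pattern are placeholders, not part of any tool mentioned above):

```python
import os
import re

def search_library(root, pattern):
    """Walk a directory tree and report files whose name or
    text contents match a regular expression (case-insensitive)."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if rx.search(name):
                hits.append((path, "filename"))
                continue  # filename hit; no need to open the file
            try:
                with open(path, "r", encoding="utf-8", errors="ignore") as f:
                    if rx.search(f.read()):
                        hits.append((path, "contents"))
            except OSError:
                pass  # unreadable file; skip it
    return hits
```

This only sees plain text; PDFs, DOCs and scans would need a text-extraction pass first, which is exactly where purpose-built tools earn their keep.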

Otherwise, it would require getting a clean-box AI and essentially feeding it your library, not unlike the way Google Search's AI summarizes its top findings and provides citations.

TTFN (ta ta for now)
I can do absolutely anything. I'm an expert!
 
I'd suggest a wiki but then there is the problem of filling it. Wikipedia currently has nearly 7 million articles/topics, so 40k files is a drop in that bucket.

Many of the wikis have development staff for hire to install and, I assume, populate the wiki to begin with.

Even if you use whatever you think an AI might do, someone needs to tell the AI what is considered to be a good job.

Natural language queries are perhaps what you are looking for, and having a database (which a wiki is) may be advantageous, as there is no need to dig through proprietary formats once the data has been extracted from the files to wiki pages.

I haven't watched the entire thing, but it appears to be in the direction you are looking.

Unlike other schemes, wikis allow the user to add whatever metadata they want in addition to the basic file contents. For example, you could have a project that depends on 1000 files. With a wiki you can create a project page that lists the 1000 related pages, and on that page note why each file/page was important to the project, both the ones that were used and the ones that were discarded. None of the original pages needs to be modified.

In some wikis you can also mark or tag pages so that an e-mail is sent if a page is altered, which no AI system by itself is likely to do; so if a page the project depends on changes, those responsible for it are informed. Since pages allow for version comparison, the e-mail can either include what changed or the user can run that comparison themselves.

Same with access control: pages can be set anywhere from anyone-can-read/write down to only a small number of people even being allowed to know the page exists. Again, no AI or folder system is easily going to do that.
 
@IRstuff
The idea is to use context/narrative search. These tools provide no benefit over traditional cataloging and tagging.

@3DDave
I am a process engineer and an experienced user, but not a software developer. Your advice would take more time and effort than it would save.

Thank you, gentlemen, but this is definitely not what I am looking for.
 
I said - hire the people to set up the wiki for you. You don't have to know any more than how to go to a web site and type notes.

How do you expect to interact with an AI if you cannot give it commands?

I would have recommended not bothering with AI, but you were so intent on it that I thought an example of what using one would be like might be helpful.
 
So - that's what I showed you, but you called it programming.

I also noted that the AI needs to be trained on your data - and that training needs to be validated.

Copilot and ChatGPT merely mimic their input and falsify outputs as required to make the mimicry convincing.
 
It's called RAG: Retrieval-Augmented Generation.

You feed in your PDFs, TXT files, DOCs, diagrams, scans of book pages, whatever. You can split large files into segments for finer granularity and, supposedly, better results. As always, it's a trade-off: data-entry time now vs. compute time, GPT token cost, and quality of results later.

Feed it to your LLM of choice. I suggest Mistral 7B, a very good optimized LLM that runs in a small footprint. The LLM digests all the info; you ask a question and it retrieves the most relevant document(s) and tells you what it thinks you want to know, citing your various docs as sources.

It is not difficult to DIY, BUT it does take time. AI "prompt engineers" are not known for their C++ skills, and you only need to know a very small amount of programming, as there are LLM tenders, for want of a better word, that will help you deal with the LLM interaction. Check out Ollama, LangChain, LlamaIndex, LM Studio, and a number of others, which all have at least some RAG capability you can use with your LLM of choice, even running on your own computer if you have enough memory or graphics card to handle it. 16 GB of RAM and another 16 GB on the graphics card is enough to get started.

I have got Mistral running on my hardware with LM Studio and Ollama, in Windows WSL and Docker, but have not yet started to do RAG with it. That's next.
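The "split large files into segments" step mentioned above can be as simple as cutting each document into fixed-size overlapping chunks before indexing. A minimal sketch; the chunk size and overlap values here are arbitrary illustrative choices, not defaults of any of the tools named above:

```python
def chunk_text(text, size=500, overlap=100):
    """Split text into overlapping chunks of at most `size`
    characters; the overlap keeps sentences that straddle a
    chunk boundary retrievable from both neighboring chunks."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks = []
    step = size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + size])
    return chunks
```

Real pipelines usually split on sentence or paragraph boundaries rather than raw character counts, but the time-vs-granularity trade-off is the same.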


--Einstein gave the same test to students every year. When asked why he would do something like that, "Because the answers had changed."
 
Does that sound like what you want to do?

 
LLM = Large Language Model.

"tenders" - probably means software that acts as an interface to help the user, as in a ship's tenders, a.k.a. attendants.

Again, the caution. Someone asked one of these models for information on college professors known to have had sexual misconduct with students. It dutifully made up an article naming at least one: a complete lie. It combined a list of well-known professors with information about sexual misconduct and fabricated a libel. Worse, once the professor pointed out this was a lie and the story ended up in the media, the same LLM used that coverage as proof of its original contention.

Go with the Wiki and when (if) they ever stop making things up, you will have a curated database that is adequately linked. Until then you will have a curated database that contains only what you put into it - no lies.
 
It dutifully made up an article naming at least one.

There's a famous case from last year where a lawyer got lazy and asked ChatGPT to write a brief for him, and ChatGPT created fake case law with fake citations: a truly monumental blunder for the lawyer, and a severe admonishment from the judge when he was caught out.

LLMs are currently very untrustworthy without a lot of backchecking. Part of the problem is the inability to properly curate the huge amounts of training data ingested by the LLM during training.

 

An LLM does no reasoning; it simply predicts the most likely response to a prompt, based on its training set. Even so, a law-specific LLM might hallucinate less than a general-use LLM, though training on bad case law could still result in occasional lies. The article was written by a lawyer, so there's zero information about how the AIs actually work or what is used to prevent or detect hallucinations.

Case in point: current chatbots cannot actually do math, so questions about which US cities are farther north than London, England, can result in gibberish answers. Moreover, they cannot realistically learn from their mistakes, even within a session, since anything newly learned can potentially mess up connections in the neural network.
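The point is that the underlying question is trivial for ordinary code, given real latitude data; the latitudes below are approximate values for illustration only:

```python
# Approximate latitudes in degrees north, for illustration.
LONDON_LAT = 51.5

US_CITIES = {
    "Seattle": 47.6,
    "Minneapolis": 45.0,
    "Anchorage": 61.2,
    "Boston": 42.4,
}

# A deterministic comparison: no hallucination possible.
north_of_london = sorted(
    city for city, lat in US_CITIES.items() if lat > LONDON_LAT
)
```

With curated data, the answer is a one-line filter; an LLM instead pattern-matches on text about the cities and can confidently get it backwards.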

 
Yes, "tenders" (yeah, not the best word I could have picked) are the LLM/user interfaces, such as Ollama, LlamaIndex and LangChain, probably the most popular for running independent LLMs on your desktop. They have collections of routines for common user functions and for interacting with several LLM-specific data formats. Some are more devoted to specific tasks, but most have considerable overlap, and the selection reduces to what is most convenient to set up on a user's particular operating system. Programming ability in the classic sense is not really required at all: line up the routines you want to run according to the results you want to obtain, and push go.

RAG is an attempt to supplement an LLM's knowledge by providing access to specific information that may not have been included in its original training data set, especially current events. The LLM would otherwise tend to hallucinate and respond with made-up answers when questioned about information outside its original training data. The cost-benefit of RAG is questionable because of the usually arduous workflow of preparing the input stream from documents that may have widely varying formats and multiple types of content, but there isn't much of an alternative, short of a more complete retraining, upping the LLM memory model and your desktop processing capability, or paying for commercial access.
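At its core, the "retrieval" half of RAG just scores stored text chunks against a query and hands the best matches to the LLM as context. A toy bag-of-words version, with cosine similarity standing in for the dense embedding model a real RAG pipeline would use:

```python
import math
from collections import Counter

def cosine_sim(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, documents, top_k=1):
    """Rank documents by word-overlap similarity to the query.
    A real RAG stack would embed chunks with a neural model and
    store them in a vector database instead."""
    q = Counter(query.lower().split())
    scored = [(cosine_sim(q, Counter(d.lower().split())), d)
              for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:top_k] if score > 0]
```

The retrieved chunks are then pasted into the prompt ahead of the user's question, which is what lets the LLM cite your documents instead of inventing sources.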

If you are not pressing an LLM for expert insight, or expecting it to substitute for expertise, i.e., simply asking it to act as an expert librarian, that would appear to be well within the capabilities of a basic low-memory (8 GB) LLM.

 