Using KNN and TF for Chinese PDF Search, Similar to How AutoGPT or ChatPDF Works!
Source: civilpy
2023-05-25 18:04:13

Continuing from the previous post: how can an AI model (such as GPT or LLaMA) be trained on a particular exam's textbooks and past exam papers?

Let's go straight to the code, which hooks this search feature up with GPT: the PDF is split into chunks, each chunk is embedded with a sentence encoder, a KNN index retrieves the chunks closest to the query, and those chunks are handed to GPT as context for the answer.

PDF Text Search

import os
import re
import shutil
import urllib.request
from pathlib import Path
from tempfile import NamedTemporaryFile

import fitz
import numpy as np
import openai
import tensorflow_hub as hub
from sklearn.neighbors import NearestNeighbors


# Preprocess each PDF page and collect the results into a text_list
def preprocess(text):
    text = text.replace('\n', ' ')
    text = re.sub(r'\s+', ' ', text)
    return text


def pdf_to_text(path, start_page=1, end_page=None):
    doc = fitz.open(path)
    total_pages = doc.page_count
    if end_page is None:
        end_page = total_pages
    text_list = []
    for i in range(start_page - 1, end_page):
        text = doc.load_page(i).get_text("text")
        text = preprocess(text)
        text_list.append(text)
    doc.close()
    return text_list


def text_to_chunks(texts, word_length=150, start_page=1):
    text_toks = [t.split(' ') for t in texts]
    page_nums = []
    chunks = []
    for idx, words in enumerate(text_toks):
        for i in range(0, len(words), word_length):
            chunk = words[i : i + word_length]
            if (
                (i + word_length) > len(words)
                and (len(chunk) < word_length)
                and (len(text_toks) != (idx + 1))
            ):
                # Carry a short trailing chunk over to the start of the next page
                text_toks[idx + 1] = chunk + text_toks[idx + 1]
                continue
            chunk = ' '.join(chunk).strip()
            chunk = f'[Page no. {idx + start_page}]' + ' ' + '"' + chunk + '"'
            # print({idx + start_page})
            chunks.append(chunk)
    return chunks


class SemanticSearch:
    def __init__(self):
        self.use = hub.load("F:/*******")  # Chinese model, see https://www.intumu.com/article/203
        self.fitted = False

    def fit(self, data, batch=100, n_neighbors=3):  # batch=1000, n_neighbors=5
        self.data = data
        self.embeddings = self.get_text_embedding(data, batch=batch)
        n_neighbors = min(n_neighbors, len(self.embeddings))
        self.nn = NearestNeighbors(n_neighbors=n_neighbors)
        self.nn.fit(self.embeddings)
        self.fitted = True

    def __call__(self, text, return_data=True):
        inp_emb = self.use([text])
        neighbors = self.nn.kneighbors(inp_emb, return_distance=False)[0]
        if return_data:
            return [self.data[i] for i in neighbors]
        else:
            return neighbors

    def get_text_embedding(self, texts, batch=1000):
        embeddings = []
        for i in range(0, len(texts), batch):
            text_batch = texts[i : (i + batch)]
            emb_batch = self.use(text_batch)
            embeddings.append(emb_batch)
        embeddings = np.vstack(embeddings)
        return embeddings


def load_recommender(path, start_page=1):
    global recommender
    texts = pdf_to_text(path, start_page=start_page)
    chunks = text_to_chunks(texts, start_page=start_page)
    recommender.fit(chunks)
    return 'Corpus Loaded.'


# Build the corpus index
pdf_path = '第3章 岩土工程勘察.pdf'  # "Chapter 3: Geotechnical Engineering Investigation"
recommender = SemanticSearch()
load_recommender(pdf_path)  # fit() builds the index, see https://www.intumu.com/article/203
question = '钻孔深度相关规定?'  # "What are the provisions on borehole depth?"
topn_chunks = recommender(question)
print(topn_chunks)
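A note on the embedding model: hub.load("F:/*******") points to a sentence-encoder model saved locally, and the path is redacted in the original post. As a minimal sketch, assuming the author used a multilingual Universal Sentence Encoder (which handles Chinese), the same kind of model can be loaded straight from TF Hub; the URL below is the public release of that model, not necessarily the exact file behind the redacted path, and it needs the tensorflow_text package installed:

import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  (registers the SentencePiece ops the multilingual USE needs)

# Assumption: a multilingual Universal Sentence Encoder stands in for the redacted local model.
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")
emb = use(["钻孔深度相关规定?"])  # a (1, 512) embedding tensor for the Chinese query
print(emb.shape)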

GPT Query Code

def generate_answer(question, openAI_key):
    topn_chunks = recommender(question)
    prompt = ""
    prompt += 'search results:\n\n'
    for c in topn_chunks:
        prompt += c + '\n\n'
    prompt += (
        "Instructions: Compose a comprehensive reply to the query using the search results given. "
        "Cite each reference using [Page number] notation (every result has this number at the beginning). "
        "Citation should be done at the end of each sentence. If the search results mention multiple subjects "
        "with the same name, create separate answers for each. Only include information found in the results and "
        "don't add any additional information. Make sure the answer is correct and don't output false content. "
        "If the text does not relate to the query, simply state 'Text Not Found in PDF'. Ignore outlier "
        "search results which have nothing to do with the question. Only answer what is asked. The "
        "answer should be short and concise. Answer step-by-step.\n\n"
    )
    prompt += f"Query: {question}\nAnswer:"
    answer = generate_text(openAI_key, prompt, "text-davinci-003")
    # answer = handle_message(prompt)
    return answer


def generate_text(openAI_key, prompt, engine="text-davinci-003"):
    openai.api_key = openAI_key
    completions = openai.Completion.create(
        engine=engine,
        prompt=prompt,
        max_tokens=512,
        n=1,
        stop=None,
        temperature=0.7,
    )
    message = completions.choices[0].text
    return message


openAI_key = 'sk-zo59kJ9gV7yx8xgsn8jrT3BlbkFJT******'  # https://www.intumu.com/article/203
generate_answer(question, openAI_key)
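If text-davinci-003 is not available on a given account, the same prompt can be sent through the chat endpoint instead. A minimal sketch, assuming the same pre-1.0 openai library used above and that gpt-3.5-turbo is accessible with the same key; generate_text_chat is a hypothetical drop-in for generate_text, not part of the original post:

import openai

def generate_text_chat(openAI_key, prompt, model="gpt-3.5-turbo"):
    # Hypothetical alternative to generate_text(): same prompt, sent via the chat endpoint.
    openai.api_key = openAI_key
    completion = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        temperature=0.7,
    )
    return completion.choices[0].message["content"]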

Conclusion

The above is roughly how AutoGPT or ChatPDF works under the hood; readers who are interested can give it a try.
