from transformers import AutoModel, AutoTokenizer
import streamlit as st
from streamlit_chat import message
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility
from search_db import GeneratePrompt
import torch
# Connect to the local Milvus instance that stores the knowledge-base
# embeddings and open the collection used for retrieval.
connections.connect("default", host="localhost", port="19530", user='root', password='Milvus')
collection_name = 'qiye_gpt'
collection = Collection(collection_name)
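# For reference, a minimal sketch of what `GeneratePrompt` from search_db
# might do; this is an assumption, not the actual implementation. The field
# names ("embedding", "text"), the metric, and the embed() helper are all
# hypothetical:
#
#     def GeneratePrompt(question, top_k):
#         vec = embed(question)  # hypothetical text-embedding helper
#         hits = collection.search(
#             data=[vec], anns_field="embedding",
#             param={"metric_type": "L2", "params": {"nprobe": 10}},
#             limit=top_k, output_fields=["text"])
#         context = "\n".join(hit.entity.get("text") for hit in hits[0])
#         return (f"Answer the question using the passages below.\n"
#                 f"{context}\nQuestion: {question}")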
st.set_page_config(
    page_title="企业gpt 演示",  # "Enterprise GPT demo"
    page_icon=":robot:",
    layout='wide'
)
def alter_format(string):
    """Convert half-width punctuation in the model output to full-width
    equivalents before rendering it with st.markdown()."""
    quanjiao = ['`', '”', '’', '“', '‘', '_', '-', '~', '=', '+', '|',
                '(', ')', '[', ']', '【', '】', '{', '}', '<', '>',
                ',', ';', '!', '^', '%', '#', '@', '$', '&', '?', '*']
    banjiao = ['`', '"', "'", '"', "'", '_', '-', '~', '=', '+', '|',
               '(', ')', '[', ']', '[', ']', '{', '}', '<', '>',
               ',', ';', '!', '^', '%', '#', '@', '$', '&', '?', '*']
    # Build a first-match-wins translation table (duplicate half-width keys
    # such as '"' keep their first full-width mapping, matching the original
    # character-by-character loop) and apply it in a single pass.
    table = {}
    for half, full in zip(banjiao, quanjiao):
        table.setdefault(ord(half), full)
    return string.translate(table)
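# Example: alter_format('a "quoted" word') returns 'a ”quoted” word'. With
# the tables above, both straight double quotes take the first full-width
# match ('”'), so Markdown-sensitive characters are rendered literally.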
@st.cache_resource
def get_model():
    """Load the ChatGLM2-6B tokenizer and INT4-quantized model once per session."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(
        "D:/qiyegpt/knowledge_augment/model/chatglm2-6b", trust_remote_code=True)
    model = AutoModel.from_pretrained(
        "D:/qiyegpt/knowledge_augment/model/chatglm2-6b", trust_remote_code=True) \
        .half().quantize(4).to(device)
    model = model.eval()
    return tokenizer, model
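# Note: .quantize(4) selects INT4 kernels that target CUDA GPUs. For a
# CPU-only machine, the ChatGLM README suggests loading in full precision
# instead (a sketch, reusing the same local path):
#     model = AutoModel.from_pretrained(
#         "D:/qiyegpt/knowledge_augment/model/chatglm2-6b",
#         trust_remote_code=True).float()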
def new_click():
    """Clear the conversation state so a fresh dialogue can start."""
    st.session_state['show_history'] = []
    st.session_state['input_history'] = []
    st.session_state['past_key_values'] = None
tokenizer, model = get_model()
st.title("企业-GPT")  # "Enterprise GPT"
# Sidebar controls. Note that top_k here is the number of knowledge-base
# passages passed to GeneratePrompt, not a sampling parameter.
max_length = st.sidebar.slider('max_length', 0, 32768, 8192, step=1)
top_k = st.sidebar.slider('top_k', 0, 5, 1, step=1)
temperature = st.sidebar.slider('temperature', 0.0, 1.0, 0.8, step=0.01)
# Conversation state: show_history holds (question, answer) pairs for display,
# input_history is the history fed back to the model, and past_key_values
# caches attention keys/values between turns.
if 'show_history' not in st.session_state:
    st.session_state.show_history = []
if 'input_history' not in st.session_state:
    st.session_state.input_history = []
if 'past_key_values' not in st.session_state:
    st.session_state.past_key_values = None
button2 = st.button("重新发起对话", key="delete")  # "Restart the conversation"
if button2:
    new_click()
# Replay the conversation so far.
for query, response in st.session_state.show_history:
    with st.chat_message(name="user", avatar="user"):
        st.markdown(query)
    with st.chat_message(name="assistant", avatar="assistant"):
        st.markdown(response)
# Placeholders for the turn currently being generated.
with st.chat_message(name="user", avatar="user"):
    input_placeholder = st.empty()
with st.chat_message(name="assistant", avatar="assistant"):
    message_placeholder = st.empty()
prompt_text = st.text_area(label="用户输入", height=100,
                           placeholder="请在这儿输入您的问题")  # "User input" / "Type your question here"
button1 = st.button("发送", key="predict")  # "Send"
if button1:
    input_placeholder.markdown(prompt_text)
    input_history = st.session_state.input_history
    show_history = st.session_state.show_history
    past_key_values = st.session_state.past_key_values
    # Only the first turn is augmented with passages retrieved from Milvus;
    # follow-up questions rely on the conversation history instead.
    if len(show_history) == 0:
        query = GeneratePrompt(prompt_text, top_k)
    else:
        query = prompt_text
    print(query)
    top_p = 0.8  # nucleus-sampling threshold, fixed for this demo
    out = model.stream_chat(tokenizer, query, input_history,
                            past_key_values=past_key_values,
                            max_length=max_length, top_p=top_p,
                            temperature=temperature,
                            return_past_key_values=True)
    # Stream the reply token by token, normalizing punctuation as it arrives.
    for response, input_history, past_key_values in out:
        response = alter_format(response)
        message_placeholder.markdown(response)
    # Persist the final state for the next turn; the visible history records
    # the original question (not the augmented prompt) with the final reply.
    st.session_state.input_history = input_history
    st.session_state.show_history = show_history + [(prompt_text, input_history[-1][1])]
    st.session_state.past_key_values = past_key_values
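# To try the demo (assuming this script is saved as web_demo.py, a hypothetical
# filename, and Milvus is listening on localhost:19530):
#     streamlit run web_demo.py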