Processing the Sentiment Data

Now that we have the data, it's time to process it.

Normalize the time series

Firstly, the data contains timestamps with hours, minutes and seconds, but we only care about the date. Thus we should normalize it:

import pandas as pd

df = pd.read_csv("text3.csv")
df['time'] = pd.to_datetime(df['time']).dt.date  # keep only the date part
df.to_csv("Savier.csv", encoding="utf-8-sig", index=False)

Mean Sentiment and Sentiment Divergence

We define the sentiment divergence as the variance of sentiment within a day. So we need the average sentiment and its variance for each day and each stock. The code is:

import ast

import pandas as pd

df = pd.read_csv("Savier.csv")

date_list = set(df["time"])   # all dates
code_list = set(df['url'])    # all stock codes
for c in code_list:           # for each stock
    senti = pd.DataFrame(columns=['code', 'time', 'senti', 'div'])
    df_acode = df[df['url'] == c]
    for i in date_list:       # for each day
        # all sentiment records for stock c on day i
        df_adate = df_acode[df_acode['time'] == i].reset_index(drop=True)
        scores = []
        for j in range(len(df_adate['sentiment'])):
            # each record stores a quoted dict literal; strip the outer
            # quotes and parse it (ast.literal_eval is safer than eval)
            sentiment_dict = ast.literal_eval(df_adate['sentiment'][j][1:-1])
            posi_prob = sentiment_dict['positive_prob']
            nega_prob = sentiment_dict['negative_prob']
            # map the 0/1/2 label to [-1, 1], weighted by its probability
            if sentiment_dict['sentiment'] - 1 > 0:
                scores.append((sentiment_dict['sentiment'] - 1) * posi_prob)
            else:
                scores.append((sentiment_dict['sentiment'] - 1) * nega_prob)
        if scores:
            avg_senti = sum(scores) / len(scores)
            # divergence: variance of the day's sentiment scores
            div = sum((s - avg_senti) ** 2 for s in scores) / len(scores)
        else:
            avg_senti = 0
            div = 0
        senti.loc[len(senti)] = [c, i, avg_senti, div]

    senti.to_csv(f"stock_div_/{c}.csv", encoding="utf-8-sig", index=False)

In general, it's a good habit to store the data for each stock in a separate file, all inside one folder.
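The output folder must exist before the first to_csv call; a one-line guard (using the stock_div_ folder name from the code above) avoids a FileNotFoundError:

```python
import os

# Create the per-stock output folder if it is missing;
# exist_ok=True makes this a no-op when it is already there.
os.makedirs("stock_div_", exist_ok=True)
```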

Stock Price Volatility and Trade Volume

This part of the data comes from the Choices terminal, so it is clean and easy to obtain. We just need to normalize the time and use pd.concat to attach it to the previous data.

import pandas as pd

df = pd.read_csv("total_i.csv")
code_list = set(df['url'])
trade_volume = pd.DataFrame()
for i in code_list:
    df_volume = pd.read_excel(f"stock_info/{i}.xls")
    df_volume = df_volume.loc[0:241, :]   # first 242 rows (2023 trading days)
    df_volume['url'] = i                  # tag rows with the stock code

    trade_volume = pd.concat([trade_volume, df_volume], axis=0)

trade_volume.to_csv("trade_vol.csv", encoding="utf-8-sig", index=False)
# the same applies for price volatility

After this, we may find that these DataFrames are not ordered by time, so we should sort them.

import datetime

import pandas as pd

total = pd.read_csv("Savier.csv")
code_list = set(total['url'])

for i in code_list:
    df = pd.read_csv(f'stock_div_/{i}.csv')     # sentiment and divergence
    df2 = pd.read_excel(f'stock_info/{i}.xls')  # trade volume
    df3 = pd.read_csv(f'stock_p/{i}.csv')       # price volatility
    df2.rename(columns={'交易日期': 'time', '成交量(股)': 'vol'}, inplace=True)
    df2 = df2.loc[0:241, :]
    df2['time'] = df2['time'].astype('datetime64[ns]')
    df3.rename(columns={'交易日期': 'time', '成交量(股)': 'vol'}, inplace=True)
    df3 = df3.loc[0:241, :]
    df3['time'] = df3['time'].astype('datetime64[ns]')

    # add Jan 1-2 (non-trading days), reusing the values from row 0
    date1 = datetime.date(2023, 1, 1)
    date2 = datetime.date(2023, 1, 2)
    df2.loc[len(df2)] = [date1, df2.loc[0, 'vol']]
    df2.loc[len(df2)] = [date2, df2.loc[0, 'vol']]
    df3.loc[len(df3)] = [date1, df3.loc[0, 'vola']]
    df3.loc[len(df3)] = [date2, df3.loc[0, 'vola']]

    td = df2.set_index(pd.DatetimeIndex(pd.to_datetime(df2.time)))
    # td = td.resample('1D').ffill()
    td['time'] = pd.to_datetime(td['time']).dt.date

    td2 = df3.set_index(pd.DatetimeIndex(pd.to_datetime(df3.time)))
    # td2 = td2.resample('1D').ffill()
    td2['time'] = pd.to_datetime(td2['time']).dt.date
    td2.drop(columns='time', inplace=True)

    general = pd.concat([td, td2], axis=1)  # align volume and volatility by date

    df = df.sort_values(by='time').reset_index(drop=True)
    td3 = df.set_index(pd.DatetimeIndex(pd.to_datetime(df.time)))
    date3 = datetime.date(2023, 12, 29)
    td3 = td3.loc[date1:date3, :]           # keep only 2023
    td3.rename(columns={"time": "date"}, inplace=True)

    final = pd.concat([td3, general], axis=1)
    final.to_csv(f'Savier_test/{i}.csv', encoding='utf-8', index=False)

We also filled empty rows with the previous sample: stocks do not trade on weekends, so we use Friday's data to fill those gaps.
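The forward fill can be sketched on a toy series; the dates and prices here are made up, but the resample('1D').ffill() pattern matches the lines commented out in the code above:

```python
import pandas as pd

# Toy series with a weekend gap: Fri 2023-01-06, then Mon 2023-01-09.
s = pd.Series(
    [10.0, 12.0],
    index=pd.to_datetime(["2023-01-06", "2023-01-09"]),
)

# Upsample to a daily grid and carry Friday's value through Sat/Sun.
filled = s.resample("1D").ffill()
print(filled)
```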

SZ50 index

The last thing is the macroeconomic index used as a control variable. We use the SZ50 index as a proxy for the domestic aggregate economic output level (not precise, because the SZ50 is also driven by expectations, which are not part of GDP).

Luckily, I found the data for all of 2023, so as in the previous steps we just need to attach it to the matrix. The prices were stored as strings with thousands separators, so I need to strip the commas and convert them to pure numbers.

import pandas as pd

total = pd.read_csv("sz50.csv")

# prices are stored as strings with thousands separators (e.g. "2,551.06");
# strip the commas before converting to float
for col in ['final', 'begin', 'high', 'low']:
    total[col] = total[col].str.replace(',', '').astype(float)

total['avg_p'] = (total['final'] + total['begin'] + total['high'] + total['low']) / 4

total.drop(columns=['high', 'low', 'begin', 'final'], inplace=True)
total.to_csv('sz50_d.csv', encoding='utf-8', index=False)

With that, the data preparation is almost done.

Final Step

At last, concatenate the SZ50 data onto our previous matrix.

import datetime

import pandas as pd

total = pd.read_csv("Savier.csv")
code_list = set(total['url'])
a = pd.read_csv('sz50_d.csv')

# add Jan 1-2 (non-trading days), reusing the values from row 241
date1 = datetime.date(2023, 1, 1)
date2 = datetime.date(2023, 1, 2)
a['date'] = a['date'].astype('datetime64[ns]')
a.loc[len(a)] = [date1, a.loc[241, 'szvol'], a.loc[241, 'szvola'], a.loc[241, 'avg_p']]
a.loc[len(a)] = [date2, a.loc[241, 'szvol'], a.loc[241, 'szvola'], a.loc[241, 'avg_p']]
a['date'] = pd.to_datetime(a['date'])
a = a.sort_values(by='date').reset_index(drop=True)
a = a.set_index(pd.DatetimeIndex(pd.to_datetime(a.date)))

# a = a.resample('1D').ffill()
# a = a.reset_index(drop=True)

for i in code_list:
    df = pd.read_csv(f'Savier_test/{i}.csv')
    df = df.set_index(pd.DatetimeIndex(pd.to_datetime(df.date)))

    df = pd.concat([df, a], axis=1)  # align on the date index
    df.to_excel(f'final__test/{i}.xlsx', index=False)

Now we have all the data we need.

Simple Regression

Running a simple regression on a randomly chosen stock gives a reasonable result:

the average sentiment has no significant effect, but the divergence does.
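The regression itself is not shown above, so here is a minimal least-squares sketch on synthetic data. The senti/div/vol names follow the files built earlier, but the data and coefficients here are fabricated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 250  # roughly one year of trading days

senti = rng.normal(size=n)     # daily mean sentiment
div = rng.normal(size=n) ** 2  # daily sentiment divergence (non-negative)
# Synthetic volume: driven by divergence, not by mean sentiment.
vol = 3.0 * div + rng.normal(scale=0.1, size=n)

# OLS via least squares: intercept, senti, div.
X = np.column_stack([np.ones(n), senti, div])
beta, *_ = np.linalg.lstsq(X, vol, rcond=None)
print(beta)  # coefficient on div should be near 3, on senti near 0
```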

Using a time-series regression, we find:

the divergence affects trade volume with a lag of up to two days.
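A lagged specification like this can be sketched with pandas' shift; again the data is synthetic, with the lag coefficients chosen purely for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"div": rng.normal(size=300) ** 2})
# Volume reacts to divergence with one- and two-day lags.
df["vol"] = (
    2.0 * df["div"].shift(1) + 1.0 * df["div"].shift(2)
).fillna(0) + rng.normal(scale=0.1, size=300)

# Build the lagged regressors and drop the rows with missing lags.
df["div_lag1"] = df["div"].shift(1)
df["div_lag2"] = df["div"].shift(2)
df = df.dropna()

X = np.column_stack([np.ones(len(df)), df["div_lag1"], df["div_lag2"]])
beta, *_ = np.linalg.lstsq(X, df["vol"].to_numpy(), rcond=None)
print(beta)  # lag-1 coefficient near 2, lag-2 near 1
```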

If we do not take the price volatility into account, we get:

The same applies for price volatility:
