KMOLab 網誌: 資料儲存的永續性

假如將時間倒轉 20 年, 看看當時的電腦程式課程在教些什麼? 大家是如何上課, 結果應該會讓現在這些初出茅廬, 剛剛成年的大一生非常驚訝. 是的, 當年並沒有人手一機, 上課是需要抄筆記的......

而且當時全球科技界正度過所謂的千禧年之禍, 利用電腦程式產生中文字仍處於 Big-5 的陰影下, 倚天中文仍然到處可見, 即使處在所謂的數位科技前緣, 某些人手上已經有小而美的易利信手機, 口袋裡也放著一個由 HTC 打造的頗有重量 HP PDA, 但所謂數位資料的永續性, 距離仍然很遠, 因此二十年多年後, 當時能留下與上課有關的數位資料非常有限.

之後就在 Google 逐漸成熟, 而 Facebook 騰空出世 7 年後的 2011年, Red Hat 推出可以免費使用的 Openshift, 不僅能夠伺服 PHP 與 Python, 還可免費存放各種數位資料, 當時以為資料終於可以永續存放的假象, 到 2016 年夢想逐漸破滅, 還好 2016 年之後有 Github 接手, Heroku 也很意外地從 2007 年活到現在, 目前, Github 與 Heroku (只能儲存 500 MB), 加上 Gitlab 的同步資料備份與 Google Drive 上的大檔案存放, 全球網友前撲後繼用隱私換取數位資料免費存放的所謂永續性, 似乎終於有了眉目.

目前教育版的 Google Drive 仍不限容量, 但也許未來的某一天這樣的所謂永續仍會畫上休止符, 大家仍必須有所因應.

資料存至 Google Drive

從 https://github.com/mdecourse/cd2020pj1/blob/master/myflaskapp.py 可以看出如何利用 Google Drive API, 在網際環境中將數位檔案送到特定伺服器之外, 還能利用 AJAX 存備份至特定 Google Drive 目錄.

@app.route('/saveToDB' , methods=['POST'])
@login_required
def saveToDB():

    """axuploader.js 將檔案上傳後, 將上傳檔案名稱數列, 以 post 回傳到 Flask server.

    截至這裡, 表示檔案已經從 client 上傳至 server, 可以再設法通過認證, 將 server 上的檔案上傳到對應的 Google Drive, 並且在上傳後的 GDrive 目錄, 設定特定擷取權限 (例如: 只允許 @gm 用戶下載.
    以下則可將 server 上傳後的擷取目錄與 GDrive 各檔案 ID 存入資料庫, 而檔案擷取則分為 server 擷取與 GDrive 擷取等兩種 url 連結設定
    """

    if request.method == "POST":
        files = request.form["files"]
        # split files string
        files = files.split(",")
        # files 為上傳檔案名稱所組成的數列
        for i in range(len(files)):
            # 逐一將已經存在 server downloads 目錄的檔案, 上傳到 GDrive uploaded 目錄
            fileName = files[i]
            fileLocation = _curdir + "/downloads/" + fileName
            mimeType = mimetypes.MimeTypes().guess_type(fileLocation)[0]
            # for GDrive v2
            #gdriveID = uploadToGdrive(fileName, mimeType)
            # for GDrive v3
            gdriveID = uploadToGdrive3(fileName, mimeType)
            fileSize = str(round(os.path.getsize(fileLocation)/(1024*1024.0), 2)) + " MB"
            date = datetime.datetime.now().strftime("%b %d, %Y - %H:%M:%S")
            user = session.get("user")
            print(user + "|" + str(fileSize) + "|" + str(mimeType) + "|"  + gdriveID)
            # 逐一將上傳檔案名稱存入資料庫, 同時存入mimeType, fileSize 與 gdriveID
            # 資料庫欄位
            #g.db.execute('insert into grouping (user , date, fileName, mimeType, fileSize, memo) values (?, ?, ?, ?, ?, ?)',(user, date, fileName, mimeType, fileSize, "memo"))
            #g.db.commit()
            #flash('已經新增一筆 upload 資料!')
    return "Uploaded fileName and gdriveID save to database"
def uploadToGdrive(fileName, mimeType):
    gauth = GoogleAuth()
    # 必須使用 desktop 版本的 client_secrets.json
    gauth.LoadClientConfigFile("./../gdrive_desktop_client_secrets.json")
    drive = GoogleDrive(gauth)

    '''
    # View all folders and file in your Google Drive
    fileList = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()
    for file in fileList:
      print('Title: %s, ID: %s' % (file['title'], file['id']))
      # Get the folder ID that you want
      # 檔案會上傳到根目錄下的 uploaded  目錄中
      if(file['title'] == "uploaded"):
          fileID = file['id']
    '''
    # GDrive 上 uploaded 目錄的 fileID
    with open("./../gdrive_uploaded_id.txt", 'r') as content_file:
        fileID = content_file.read()

    # 由上述目錄外的檔案讀取 uploaded 目錄對應 ID
    #fileID = "your_folder_file_ID"
    # 上傳檔案名稱為輸入變數
    #fileName = "DemoFile.pdf"
    filePath = _curdir + "/downloads/"
    # parents 為所在 folder, 亦即 uploaded 目錄, fileID 為 uploaded 目錄的 ID
    file1 = drive.CreateFile({"mimeType": mimeType, "parents": [{"kind": "drive#fileLink", "id": fileID}], "title":  fileName})
    file1.SetContentFile(filePath + fileName)
    file1.Upload() # Upload the file.
    # 傳回與上傳檔案對應的 GDrive ID, 將會存入資料庫 gdiveID 欄位
    return file1['id']
    #print('Created file %s with mimeType %s' % (file1['title'], file1['mimeType']))   
    #print("upload fileID:" + str(file1['id']))
    # 以下為下載檔案測試
    # file2 = drive.CreateFile({'id': file1['id']})
    #file2.GetContentFile('./test/downloaded_ModernC.pdf') # Download file as 'downloaded_ModernC.pdf under directory test'.

    '''
    file1.Trash()  # Move file to trash.
    file1.UnTrash()  # Move file out of trash.
    file1.Delete()  # Permanently delete the file.
    '''
def uploadToGdrive3(fileName, mimeType):
    # get upload folder id
    # GDrive 上 uploaded 目錄的 fileID
    with open("./../gdrive_uploaded_id.txt", 'r') as content_file:
        folderID = content_file.read()

    creds = None
    with open('./../gdrive_write_token.pickle', 'rb') as token:
        creds = pickle.load(token)
    # 讀進既有的 token, 建立 service
    driveService = build('drive', 'v3', credentials=creds)

    metadata = {
        'name': fileName,
        'mimeType': mimeType,
        # 注意: 必須提供數列格式資料
        'parents': [folderID]
        }

    filePath = _curdir + "/downloads/" + fileName
    media = MediaFileUpload(filePath,
                                            mimetype=mimeType,
                                            chunksize=1024*1024,
                                            resumable=True
                                            )

    gdFile = driveService.files().create(
        body=metadata,
        media_body=media,
        fields='id'
    ).execute()
    fileID = gdFile.get("id")

    return fileID

上述程式利用較新的 GDrive V3 上傳資料之前, 可攜系統必須安裝 google-api-python-client:

# for uploadToGDrive3
# pip install google-api-python-client
# https://github.com/googleapis/google-api-python-client
import pickle
from googleapiclient.discovery import build
from apiclient.http import MediaFileUpload

Github, Gitlab 與 Fossil SCM

針對 Github 與 Gitlab 的操作, 可以參考 https://git-scm.com/book/en/v2, 但是 Fossil SCM 的參考資料則相對較少, 以下將針對 Fossil SCM 的應用稍加說明, 為了因應未來上述各種網際免費數位儲存資料系統的更迭, 在近端配置 Fossil SCM, 並用 BD-R or BD-RE (25GB) 進行備份, 也是一個不錯的資料永續儲存方案.

Fossil SCM 的使用非常簡單, 只要配合操作系統從 https://fossil-scm.org/home/uv/download.html 下載相應版本, 並讓系統可以執行 fossil.exe (以 Windows 10 為例) 即可, 唯一要注意的是若操作過程牽涉兩個不同操作系統, 必須透過 fossil version 查驗雙方的版本是否相同.

有關 Fossil SCM 的先前參考資料, 可參閱 Fossil SCM 簡介.

Ubuntu 安裝 fossil scm

使用 sudo apt install fossil 安裝 Fossil SCM 所取得的版本可以利用 fossil version 檢查. 若版本並非最新版本或與 Windows 10 所用的版本相同, 可以至 https://fossil-scm.org/home/uv/download.html 下載最新的 fossil 後, 以 sudo cp /home/user/fossil /usr/bin/, 然後再透過 fossil version 查驗是否已經更新為最新版本.

安裝 stunnel4

由於 Fossil SCM 並無 https 啟動功能的設置, 因此在實作上必須透過 stunnel SSL 代理主機啟動 https 伺服功能.

首先安裝 stunnel4:

sudo apt install stunnel4

接下來將系統環境設為 HTTPS:

sudo vi /etc/environment

加入 HTTPS=on

並且在 /etc/default/stunnel4 中加入 ENABLED=1

然後透過 sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout localhost.key -out localhost.crt, 在 /etc/stunnel/ 目錄中建立所需的 localhost.key 與 localhost.crt

同時建立 /etc/stunnel/stunnel.conf 如下:

[https]
accept = your_IPv4_ip:443
accept = :::443
cert = /etc/stunnel/localhost.crt
key = /etc/stunnel/localhost.key
exec = /usr/bin/fossil
execargs = /usr/bin/fossil http /home/user/repository/ --https --nojail --notfound cd2021

實際配置下, 使用 :::443 並無法讓 stunnel 綁定至系統的 ipv6 網址, 必須使用:

[https]
accept = 140.xxx.xxx.xxx:443
accept = 2001:288:6004:xx::1:443
cert = /etc/stunnel/localhost.crt
key = /etc/stunnel/localhost.key
exec = /usr/bin/fossil
execargs = /usr/bin/fossil http /home/user/repository/ --https --nojail --notfound cd2021

似乎 stunnel 會自動取最後的 :443 作為 port, 而無需如 https://[ipv6 address]:443 中以 [] 隔開 ipv6 網址與埠號.

execargs = /usr/bin/fossil http /home/user/repository/ --https --nojail --notfound cd2021 設定的意思為 stunnel 代理啟動的指令為 fossil http, 指定 /home/yser/repository/ 作為倉儲目錄, 可以透過 URL 加上倉儲名稱伺服多個 repo.fossil 倉儲, 隨後的 --https 表示要使用 https 協定擷取資料, --nojail 表示不要使用 root 權限啟動, 且不進入 jail 模式, --notfound cd2021 表示內定 https URL 擷取的倉儲為 /home/user/repository/cd2021.fossil

啟動-停止-重新啟動 stunnel4.service

sudo systemctl start stunnel4.service

sudo systemctl stop stunnel4.service

sutod systemctl restart stunnel4.service

KMOLab 網誌

Wednesday, February 10, 2021

資料儲存的永續性

資料存至 Google Drive

Github, Gitlab 與 Fossil SCM

Ubuntu 安裝 fossil scm

安裝 stunnel4

啟動-停止-重新啟動 stunnel4.service

No comments:

Post a Comment

syntaxhighlight and mathjax