index

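PostgreSQL: clearing a column's data across many rows made the database take up more disk space

A database that originally occupied 25.921GB of disk space grew to 33.478GB after I cleared the contents of one column (SET foo='') across roughly 10 million rows (the update took 1703 seconds). This is expected: under PostgreSQL's MVCC, an UPDATE writes a new version of each row and leaves the old version behind as a dead tuple until it is vacuumed.

PostgreSQL has a VACUUM command, so I tried it. Sure enough, once it finished the disk usage dropped to 24.8GB (it took 1106 seconds).

VACUUM FULL reclaims the most space, because it rewrites the table and returns the freed space to the operating system, but it is slow and takes an exclusive lock on the table, so use it with care on a production site.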

VACUUM FULL
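To see where the space goes before and after a vacuum, the sizes can be queried from Python. The following is a minimal sketch, assuming a DB-API cursor that uses %s placeholders (for example one obtained from psycopg). The function names are illustrative and not part of the original notes; pg_database_size and the pg_stat_user_tables view are standard PostgreSQL.

```python
def database_size_bytes(cursor, dbname):
    """Return the on-disk size of a database in bytes."""
    cursor.execute("SELECT pg_database_size(%s)", (dbname,))
    return cursor.fetchone()[0]


def tables_with_most_dead_tuples(cursor, limit=5):
    """Return (table name, live tuples, dead tuples), most dead tuples first."""
    cursor.execute(
        "SELECT relname, n_live_tup, n_dead_tup "
        "FROM pg_stat_user_tables "
        "ORDER BY n_dead_tup DESC LIMIT %s",
        (limit,),
    )
    return cursor.fetchall()


def to_gib(size_bytes):
    """Convert a byte count to GiB for easier reading."""
    return size_bytes / 1024 ** 3
```

Calling database_size_bytes before the UPDATE, after it, and after the vacuum gives the same kind of before/after comparison as the figures above. If VACUUM itself is issued from Python, the connection needs autocommit enabled, because VACUUM cannot run inside a transaction block.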

Making thumbnails with Python (Pillow/PIL)

Pillow (PIL Fork) 10.4.0 documentation

Usage:

Read image and make thumbnails
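from PIL import Image

# img, size and target_path come from the caller; size is a (max_width, max_height) tuple
thumb = Image.open(img)
thumb.thumbnail(size, Image.LANCZOS)  # shrinks in place and keeps the aspect ratio
# thumb = thumb.convert('RGB')  # needed when the source mode (e.g. RGBA) cannot be saved as JPEG
thumb.save(target_path, "JPEG")
thumb.close()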
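The size passed to thumbnail() is a bounding box, not the final size: the image is shrunk to fit inside it with its aspect ratio kept, and it is never enlarged. The sketch below illustrates that idea in plain Python. The helper name thumbnail_size is made up for this note, and Pillow's own rounding may differ by a pixel, so treat it as an approximation rather than Pillow's implementation.

```python
def thumbnail_size(original, box):
    """Largest (width, height) that fits inside box with the aspect ratio of original.

    The result is never larger than the original (no upscaling).
    """
    width, height = original
    max_width, max_height = box
    # The tighter of the two constraints decides the scale; 1.0 prevents upscaling.
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    print(thumbnail_size((4000, 3000), (800, 800)))
    print(thumbnail_size((100, 50), (800, 800)))
```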

For the resampling algorithm, see the filter comparison figure: Compare Filters

screenshot via: Filters

Pillow-SIMD

Uploadcare provides a SIMD-accelerated build of Pillow: uploadcare/pillow-simd: The friendly PIL fork

Benchmarks: Pillow Performance

Linux

CPU: Intel Celeron N4505 2.0GHz

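pillow-simd only installs successfully after libjpeg-dev and zlib1g-dev are installed. Even then, running Python fails with an "illegal hardware instruction" error. A likely cause is that the install command below compiles with -mavx2, while the Celeron N4505 does not appear to support AVX2 instructions.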

Linux
# install requirements
sudo apt install libjpeg-dev zlib1g-dev

# install pillow-simd
CC="cc -mavx2" pip install -U --force-reinstall pillow-simd
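Before building with -mavx2 it may be worth checking whether the CPU advertises AVX2 at all. The following is a small sketch, assuming an x86 Linux machine where /proc/cpuinfo contains a "flags" line; the function names are illustrative.

```python
def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags found in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def supports_avx2(cpuinfo_text):
    """True if the 'avx2' flag is listed for at least one core."""
    return "avx2" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("AVX2 supported:", supports_avx2(f.read()))
    except OSError:
        print("/proc/cpuinfo is not readable (probably not Linux)")
```

If this reports False, an AVX2 build is expected to crash with an illegal-instruction error, and installing without the CC="cc -mavx2" override is the safer choice.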

MacOS

MacBook Pro: 3.1GHz Intel Core i7

After installing jpeg with brew, pillow-simd installed successfully and runs without problems.

MacOS
# install requirements
brew install jpeg
# install pillow-simd
pip install pillow-simd

Summary

It does feel faster. I hope to find time to run a proper benchmark.
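A benchmark could start from something as small as the timing helper below. This is only a sketch: best_time is a made-up name, and the function being timed (for example one that opens an image and calls thumbnail() on it) is left to the caller.

```python
import time


def best_time(func, repeat=5):
    """Call func() repeat times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


if __name__ == "__main__":
    # Stand-in workload; replace with the thumbnail code to be measured.
    print(f"{best_time(lambda: sum(range(100_000))):.6f} s")
```

Because pillow-simd is a drop-in replacement that installs the same PIL package, the two libraries would need separate virtual environments, with the same script run once in each.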

Collection Management Systems (TDWG 2020 Symposium)

TDWG 2020 Challenges of alignment of collection management sys. across globe & diff. domains - SYM04 - YouTube

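Casual notes on the TDWG 2020 (online) discussion of natural history collection management systems. The discussion is already four years old and some of the systems no longer seem to be maintained, but it is still a useful reference.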

RECODE

Vince Smith

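The solution proposed by the Natural History Museum (UK): a complete data model with a strong emphasis on Linked Data. Very inspiring.

Multi-user collaboration? recode::community curation

NHM data workflow recode::collection object

NHM's links to the rest of the world recode::collection object

The key to the RECODE data model: CollectionObject recode::collection object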

Kotka

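Finland's natural history collection management system. It emphasizes being simple and flexible, does not use a relational database, and works much like a startup that builds as it goes.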

Mikko Heikkinen

Collection Management System | Suomen Lajitietokeskus

  • focus on 80/20-rule, flexibility and simplicity
  • not focus on comprehensive data model (denormalized data)

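There was no system that worked well and developers were available, so they built their own. kotka::Background

For use across organizations, the system has to stay simple and flexible, because the non-technical problems are troublesome enough already. kotka::Simple and Flexible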

Symbiota

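A decentralized system is impressive, but it seems a lot of effort has to go into synchronizing the systems. I wonder whether this only works in a place like the United States, with its large population and vast territory.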

Edward Gilbert

  • decentralized data network (isolated decentralized network of mini-aggregators)
  • live-managed vs. snapshot

Specify

The new version (Specify 7) moved to the web, which is a major breaking change.

  • Community-Driven decision making

DINA

Not sure whether it has been discontinued; it does not seem very active.

(DIgital Information system for NAtural history data)

DINA: Open Source and Open Services - A Modern Approach for Sustainable Natural History Collection Management Systems

  • web-based modules; through the API, components can be modified or replaced by other components

Meeting In-Between: Moving beyond the buzz, bottlenecks, and bubble to collaboratively develop digitization tooling

A great summary, but I cannot absorb any more for now.

Matt Yoder

  • Digital Specimens in TaxonWorks

Name | Start | Maintainer | Status | Tech stack | Stats
RECODE | 2022 | NHM | good model concept | |
Kotka | 2012 | Finnish Museum of Natural History Luomus | | PHP, Zend | 2020: 2.5 million specimens, 12 institutions
Symbiota | 2008 | Arizona State University | | PHP | 2020: 50-60 public portals
Specify | | Specify Collections Consortium | | |
DINA | 2014 | RBGE? | not available (2024) | web |

Design System For Public Transportation

I came across a design system for public transportation in the Smashing Newsletter. Interesting.

Research of Digitizing Herbarium

Case1: Oklahoma State University Herbarium

Search portal (Symbiota)

Select dataset: okla-symbiota-filter1.png

Filter by taxon: okla-symbiota-filter2.png

Other Filters: okla-symbiota-filter3.png

Filter results: okla-symbiota-filter-result.png

Species page: okla-symbiota-species.png

Specimen page: okla-symbiota-specimen.png

Digitizing Process

Imaging: okla-imaging.png

Transcription of label and image information by volunteers and student workers

upload images to Notes from Nature — Zooniverse for volunteers to help transcribe.

ref: Digitizing Herbariums for Future Historians - YouTube

Case2: University of Alaska Herbarium (ALA)

ArctosDB

HUNG-YI LEE (李宏毅)

start: April 2023, 2024-05-30

https://www.youtube.com/watch?v=fegAeph9UaA&list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49

ML Lecture 0: Intro

  • AI (the goal) → Machine Learning (the means) → Deep Learning (one approach to ML)
  • AI ⇒ instinct endowed by humans
  • Machine Learning ≈ Looking for a Function from Data
  • Framework
  • Step1: define a set of functions ⇒ Model
  • Step2: goodness of function
  • Step3: pick the best function
  • Learning Map
  • Supervised Learning
    1. Regression [task]: the output of the target function f is a scalar (a numeric value)
    2. Classification [task]
    3. Linear Model [method]
    4. Non-linear Model
      • Deep Learning [method]
      • SVM, decision tree, K-NN... [method]
    5. Structured Learning
      • Beyond Classification
  • Semi-supervised Learning
    • Labelled + Unlabeled data
  • Transfer Learning
    • Labelled data + Data not related to the task considered (can be either labeled or unlabeled)
    • e.g., how can unrelated images be used to help learning?
  • Unsupervised Learning
    • learning without a teacher
  • Reinforcement Learning
    • Supervised vs. Reinforcement Learning: learning from a teacher vs. learning from critics (closer to how humans learn)
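The three framework steps in the notes above can be made concrete with a tiny regression example. This is my own minimal sketch, not code from the lecture: the model is assumed to be a straight line f(x) = w * x + b, goodness is measured by mean squared error, and the best function is picked with plain gradient descent. The data, learning rate and step count are made up for illustration.

```python
# Step 1: the model is the set of functions f(x) = w * x + b
def predict(w, b, x):
    return w * x + b


# Step 2: goodness of a function = mean squared error on the training data
def loss(w, b, xs, ys):
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Step 3: pick the best function by gradient descent on the loss
def fit(xs, ys, lr=0.02, steps=5000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [predict(w, b, x) - y for x, y in zip(xs, ys)]
        grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(2 * e for e in errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [3.0, 5.0, 7.0, 9.0, 11.0]  # made-up data generated from y = 2x + 1
    w, b = fit(xs, ys)
    print(f"w={w:.3f}, b={b:.3f}, loss={loss(w, b, xs, ys):.6f}")
```

Since the output is a single number, this is a regression task in the terms of the learning map, solved with a linear model.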

ML Lecture 1: Regression

Andrew Ng

Deep Learning Specialization [5 courses] (DeepLearning.AI) | Coursera