Reprints from my posting to SAN-Tech Mailing List and ...

2011/06/11

[san-tech][02075] OS Jitter Report on a Large-Scale System (16K Nodes) (HPC Colony Project)

Date: Wed, 10 Feb 2010 17:41:27 +0900
--------------------------------------------------
2012/05/01
"Practical experiences with OS Jitter"
 Feb 09, 2012, IBM developerWorks Wikis
  https://www.ibm.com/developerworks/wikis/display/LinuxP/Practical+experiences+with+OS+Jitter

"OS Jitter Mitigation Techniques"
 Feb 09, 2010, IBM developerWorks Wikis
  https://www.ibm.com/developerworks/wikis/display/LinuxP/OS+Jitter+Mitigation+Techniques
--------------------------------------------------
This is a report on OS jitter on a large-scale system (16,000 nodes).
(Strictly speaking, it is a report from the HPC Colony Project.)

"Linux OS Jitter Measurements at Large Node Counts using a BlueGene/L"
 Jones, Terry R [ORNL] ;
 Tauferner, Mr. Andrew [IBM T.J. Watson Research Center] ;
 Inglett, Mr. Todd [IBM T.J. Watson Research Center]
 Publication Date: 2010 Jan 01 (On Paper: November 30, 2009)
  http://www.osti.gov/bridge/product.biblio.jsp?query_id=1&page=0&osti_id=971232

Abstract
 "We present experimental results for a coordinated scheduling
  implementation of the Linux operating system. Results were collected
  on an IBM Blue Gene/L machine at scales up to 16K nodes. Our results
  indicate coordinated scheduling was able to provide a dramatic
  improvement in scaling performance for two applications characterized
  as bulk synchronous parallel programs."
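The reason coordinated scheduling matters for bulk synchronous parallel programs can be illustrated with a toy model. This is only a sketch and is not the measurement method used in the paper. It assumes every node does the same amount of work per step, that each node is independently interrupted with a small probability per step, and that a step ends only when the slowest node arrives at the synchronization point. The function name and all parameter values below are hypothetical.

```python
def expected_step_time(work_us, noise_us, p_noise, nodes, coordinated):
    """Expected duration (microseconds) of one compute-then-synchronize step.

    Toy model, not taken from the paper:
      work_us   -- useful work per step on every node
      noise_us  -- length of one OS interruption
      p_noise   -- probability that a given node is interrupted in a step
      nodes     -- number of nodes that must all reach the barrier
    """
    if coordinated:
        # All nodes take their interruption in the same steps, so the
        # fraction of delayed steps does not grow with the node count.
        p_step_delayed = p_noise
    else:
        # Independent interruptions: the step is delayed if ANY node is hit.
        p_step_delayed = 1.0 - (1.0 - p_noise) ** nodes
    return work_us + noise_us * p_step_delayed


if __name__ == "__main__":
    # Hypothetical numbers: 100 us of work, 50 us interruptions, 0.1% chance.
    for n in (1, 1024, 16384):
        plain = expected_step_time(100.0, 50.0, 0.001, n, coordinated=False)
        coord = expected_step_time(100.0, 50.0, 0.001, n, coordinated=True)
        print(f"{n:6d} nodes  uncoordinated={plain:7.2f} us  coordinated={coord:7.2f} us")
```

Under these assumptions the uncoordinated step time approaches work plus one full interruption as the node count grows, while the coordinated step time stays where it was on a single node. Real systems have many noise sources with different periods and lengths, so the model only shows the direction of the effect.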



Operating systems (kernels) compared
  Kernel 1: Blue Gene/L Compute Node Kernel (CNK)
    "One of CNK's principal design points was to avoid OS noise.
     It runs one process at a time; therefore it does not need to
     perform time-slicing or preemptive multitasking."
    "This static memory map completely avoids TLB misses ..."
  Kernel 2: Colony Linux Kernel with unmodified Scheduler
     Linux version 2.6.16
    "A console driver and RAS driver were added in addition to various
     changes to support the BlueGene/L platform. The default 4KB pages
     were replaced with 64KB pages."
  Kernel 3: Colony Linux Kernel with Coordinated Scheduler
    "Two /proc interfaces were created and the scheduler was modified
     to give priority to the HPC applications in a coordinated fashion."
Applications used in this evaluation
  Application 1: Allreduce
  Application 2: glob

Through a good deal of trial and error, they are building an OS suited to large-scale systems.
(As noted further down, the HPC-Colony project was selected for INCITE 2010.)

HPC-Colony Project
  http://www.hpc-colony.org/
  The source code does not appear to have been released yet.

"Colony Update", Terry Jones, Principal Investigator
  http://sites.google.com/site/fastos2/fastos-workshop-slides/fastos-workshop-materials/sc09-fastos-slides/Colony_SC2009_Talk.ppt?attredirects=0&d=1
  ↑ PPT file
FastOS 2, Birds-of-a-Feather at Supercomputing 2009
  http://sites.google.com/site/fastos2/supercomputing-2009-bof

Terry Jones, Application Performance Tools group, CSM, ORNL
  http://www.csm.ornl.gov/~trj/
Terry Jones, Stanford University
  http://www-cs-students.stanford.edu/~trj/


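HPC Colony was newly selected for INCITE 2010. The machine is an XT5, yet
more than half of the co-investigators are from IBM.
Allocation: 4,000,000 core hours (roughly 456 years on a single core)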
Title:
    "HPC Colony: Removing Scalability, Fault, and Performance
     Barriers in Leadership Class Systems through Adaptive System
     Software"
Principal Investigator: Terry Jones (Oak Ridge National Laboratory)
Co-Investigators
    Laxmikant Kale (University of Illinois at Urbana-Champaign)
    Jose Moreira (International Business Machines)
    Celso Mendes, Esteban Meneses (UIUC)
    Yoav Tock, Eliezer Dekel, Roie Melamed, Eli Luboshitz,
    Menachem Shtalhaim, Benjamin Mandler (IBM)
Scientific Discipline: Computer Science
INCITE Allocation: 4,000,000 processor hours
Site: Oak Ridge National Laboratory
Machine (Allocation): Cray XT (4,000,000 processor hours)
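To put the 4,000,000 processor hours in perspective, the conversion to single-core years can be checked with a few lines. This is a sketch only: it assumes a 365-day year, and the helper name and the 16,384-core job size are hypothetical illustrations, not figures from the award.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: 365-day year (8,760 hours)


def core_hours_to_years(core_hours, cores=1):
    """Wall-clock years needed to spend `core_hours` on `cores` cores."""
    return core_hours / cores / HOURS_PER_YEAR


if __name__ == "__main__":
    allocation = 4_000_000  # processor hours, from the award listing
    print(f"single core : {core_hours_to_years(allocation):.1f} years")
    # Hypothetical job size, chosen only for illustration.
    hours_16k = allocation / 16_384
    print(f"16,384 cores: {hours_16k:.0f} hours of wall-clock time")
```

With a 365-day year the single-core figure comes to a little under 457 years. Spread over a hypothetical 16,384 cores, the same allocation would last about ten days of wall-clock time.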


[san-tech][02043] Re: US DOE INCITE 2010 Awards Announced (10/01/26), 28 Jan 2010
2010 Awards Fact Sheet
  http://www.er.doe.gov/ascr/incite/2010INCITEFactSheets.pdf
