Hwansoo Han, Gabriel Rivera, Chau-Wen Tseng
Department of Computer Science
University of Maryland, College Park, MD 20742
Abstract
We describe two ongoing compiler projects for high performance architectures at the University of Maryland being developed using the Stanford SUIF compiler infrastructure. First, we are investigating the impact of compilation techniques for eliminating synchronization overhead in compiler-parallelized programs running on software distributed-shared-memory (DSM) systems. Second, we are evaluating data layout transformations to improve cache performance on uniprocessors by eliminating conflict misses through inter- and intra-variable padding. Our optimizations have been implemented in SUIF and tested on a number of programs. Preliminary results are encouraging.
1 Introduction
Our goal at the University of Maryland is to support efficient machine-independent programming of advanced architectures. Users desire the ability to write programs that can run well on a variety of computers, since such programs are portable and protect software investment. However, modern processor architectures are quite varied and complex. To achieve high performance, it is critical that the compiler generate code that can efficiently utilize the underlying hardware. To bridge the gap between applications, operating systems, and advanced architectures, the compiler must be able to perform advanced analyses and transformations. We are currently pursuing research in two areas: compiling for software distributed-shared-memory (DSM) systems and data layout optimizations for high performance architectures.
2 Reducing Synchronization Overhead
We begin by looking at compiler optimizations for reducing synchronization overhead. These optimizations are examined in the context of a prototype system [17] extending SUIF [11] to target the CVM [15] software distributed-shared-memory (DSM) system. Recent studies indicate such systems can approach the performance of current message-passing HPF compilers and explicit message-passing MPI programs [6,17]. However, load imbalance and synchronization overhead were identified as key sources of inefficiency.
Execution times for five SUIF-parallelized applications on a 16-processor IBM SP-2 are shown in Figure 1. Total execution time is split into application processing time, OS overhead, communication cost, sequential wait (idle time waiting for the next parallel task), load imbalance (idle time waiting for completion of the current parallel task), and barrier overhead (time spent executing barrier code). We found that while not much time is spent explicitly executing barriers (1.5% of execution time), idle processor time in the form of sequential wait and load imbalance comprises a large percentage of total execution time (14% and 27%, respectively).
…to apply the flush protocol for those pages at the appropriate time.

2.1.3 Compile-time Barrier Elimination
The fork-join model used by shared-memory compilers is flexible and can easily handle sequential portions of the computation; however, it imposes two synchronization events per parallel loop, as shown in Figure 2(A). First, a broadcast barrier is invoked before the parallel loop body to wake up available worker threads and provide workers with the address of the computation to be performed, along with parameters if needed. A barrier is then invoked after the loop body to ensure all worker threads have completed before the master can continue.
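The two synchronization events can be made concrete with a small threaded sketch. The pthread-based run time, the work-descriptor variable, and all names below are our illustration, not the actual SUIF/CVM run-time interface:

#include <pthread.h>
#include <stdio.h>

#define NWORKERS 3
#define N 1200

static double A[N];
static void (*work_fn)(int chunk, int nchunks);  /* work descriptor */

static pthread_barrier_t start_b;  /* broadcast barrier: wakes workers */
static pthread_barrier_t done_b;   /* join barrier: master waits here  */

/* Body of one parallel doall, parameterized by chunk. */
static void loop_body(int chunk, int nchunks) {
    for (int i = chunk * (N / nchunks); i < (chunk + 1) * (N / nchunks); i++)
        A[i] = 2.0 * i;
}

static void *worker(void *arg) {
    int id = (int)(long)arg;
    for (;;) {                             /* serve parallel tasks forever */
        pthread_barrier_wait(&start_b);    /* sync event 1: wake up        */
        work_fn(id, NWORKERS + 1);         /* run assigned chunk           */
        pthread_barrier_wait(&done_b);     /* sync event 2: report done    */
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NWORKERS];
    pthread_barrier_init(&start_b, NULL, NWORKERS + 1);
    pthread_barrier_init(&done_b, NULL, NWORKERS + 1);
    for (long w = 0; w < NWORKERS; w++)
        pthread_create(&tid[w], NULL, worker, (void *)w);

    work_fn = loop_body;                   /* master posts the work address  */
    pthread_barrier_wait(&start_b);        /* broadcast: start parallel loop */
    loop_body(NWORKERS, NWORKERS + 1);     /* master runs the last chunk     */
    pthread_barrier_wait(&done_b);         /* join: all chunks completed     */

    printf("A[N-1] = %g\n", A[N - 1]);
    return 0;                              /* workers die with the process   */
}

Removing or weakening these two barrier waits per parallel loop is exactly the opportunity the optimizations below target.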
In previous work, we developed compiler algorithms for barrier elimination based on the observation that synchronization between a pair of processors is only necessary if they communicate shared data [25]. If data dependence analysis can show two adjacent loop nests access disjoint sets of data, then the barrier separating them may be eliminated, as in Figure 2(B). If two loop nests access the same data, but communication analysis proves no remote data is accessed based on data and computation decomposition information, the barrier may also be eliminated, as in Figure 2(C). Finally, if communication analysis identifies simple interprocessor sharing patterns, the barrier may be replaced with less expensive forms of synchronization. In particular, if data is only shared between neighboring processors, the barrier may be replaced by nearest-neighbor synchronization, as shown in Figure 2(D).
2.2 Novel Techniques
2.2.1 Communication Analysis Using Local Subscripts
Previous communication analysis relied on compile-time information about the data and computation decompositions selected for a program to precisely determine whether interprocessor communication takes place [25]. We find that an alternative communication analysis technique based on local subscript analysis can also yield good results, but with less complex analysis. The local subscript analysis algorithm is shown in Figure 3.
We assume dependence analysis has already eliminated all array reference pairs between loop nests proven to access disjoint memory locations. Local subscript analysis relies on the compiler selecting a consistent chunk partitioning of parallel loop iterations. Consistent iteration partitioning of loops with identical loop bounds will then always assign the same loop iterations to each processor. If any subscript pair containing the parallel loop index is identical, each processor will only access local data. If the subscripts differ by only a constant, then each processor will access remote data a constant number of processors away. Since constant differences in subscripts are usually small, processors will end up accessing remote data on neighboring processors, allowing barriers to be replaced by nearest-neighbor synchronization.
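As a rough illustration, the sketch below classifies a single cross-loop reference pair by its constant subscript difference, assuming the consistent chunk partitioning described above. The enum, the constant-offset representation, and the threshold logic are simplifications of ours, not the compiler's actual implementation:

#include <stdio.h>
#include <stdlib.h>

/* Synchronization required for one cross-loop array reference pair. */
typedef enum { SYNC_NONE, SYNC_NEIGHBOR, SYNC_FULL_ANALYSIS } sync_kind;

/* A subscript of the simple form  i + offset  in parallel loop index i.
   Real subscripts are symbolic; a constant offset is all we model here. */
typedef struct { int offset; } subscript;

/* Under a consistent chunk partitioning of identical iteration spaces,
   identical subscripts mean each processor touches only its own chunk,
   and a small constant difference crosses at most a chunk boundary,
   i.e., reaches only a neighboring processor.                          */
static sync_kind classify(subscript src, subscript dst, int chunk) {
    int diff = abs(src.offset - dst.offset);
    if (diff == 0)    return SYNC_NONE;        /* purely local            */
    if (diff < chunk) return SYNC_NEIGHBOR;    /* nearest neighbor        */
    return SYNC_FULL_ANALYSIS;                 /* defer to comm. analysis */
}

int main(void) {
    subscript a = {0}, b = {1}, far = {5000};
    int chunk = 1024;                          /* iterations per processor */
    printf("%d %d %d\n",
           classify(a, a, chunk),     /* 0: barrier eliminated        */
           classify(a, b, chunk),     /* 1: nearest-neighbor suffices */
           classify(a, far, chunk));  /* 2: needs full analysis       */
    return 0;
}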
Local subscript analysis is designed to complement full communication analysis. It is less precise, but can be computed efficiently because it relies only on local symbolic information. Local subscript analysis can thus quickly eliminate simple array reference pairs that do not require synchronization, reducing the number of dependences that must be analyzed with communication analysis. Local subscript analysis can also be applied in some cases where standard communication analysis fails, since it relies only on local program information. For instance, in Figure 4(A), local subscript analysis does not need to fully analyze the functions f1() or f2() before eliminating the barrier. In comparison, communication analysis must calculate what data is communicated, and can be disabled by a single complex subscript.
2.2.2 Exploiting Lazy Release Consistency
A second enhancement to our compile-time barrier elimination algorithm is made possible by the underlying semantics of lazy-release-consistency software DSMs. In a traditional shared-memory model, synchronization is needed between two loop nests if two processors access shared data, with at least one of the processors performing a write to the shared data. However, in a lazy-release-consistency software DSM, if the shared reference is a read in the first loop and a write in the second loop, no synchronization is needed, because writes from the second loop will not become visible to the read in the first loop until synchronization is encountered. In other words, anti-dependences (write-after-read) do not need to be synchronized.
To see how the compiler can use this property, consider the example shown in Figure 4(b). The first loop nest reads nonlocal values of B which are defined in the second loop nest. The cross-processor dependence caused by B is thus a loop-independent anti-dependence. Normally, synchronization is needed to ensure the old values of B are read before the new values of B are written. However, with lazy release consistency the software DSM guarantees that new values of B on another processor will not be made visible until the two processors synchronize. Since there are no other loop-independent dependences between the two loop nests, synchronization between them is not required.
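Figure 4(b) is not reproduced here, but the following self-contained sketch captures the same pattern; the loop bounds, data, and the commented synchronization point are our illustration:

#include <stdio.h>

#define N 64
#define TSTEPS 4

static double A[N + 2], B[N + 2];

int main(void) {
    for (int i = 0; i < N + 2; i++) B[i] = i;

    for (int t = 0; t < TSTEPS; t++) {
        /* Loop nest 1: reads neighboring (possibly nonlocal) values of B. */
        for (int i = 1; i <= N; i++)
            A[i] = 0.5 * (B[i - 1] + B[i + 1]);

        /* No barrier needed here under lazy release consistency: the writes
           to B below remain invisible to other processors until the next
           synchronization, so this loop-independent anti-dependence on B
           cannot be violated.                                              */

        /* Loop nest 2: (re)defines B. */
        for (int i = 1; i <= N; i++)
            B[i] = A[i];

        /* The true/flow dependence carried by the time-step loop (B written
           here, read by neighbors next iteration) does need synchronization;
           local subscript analysis shows nearest-neighbor sync suffices.
           nearest_neighbor_sync();   -- hypothetical CVM run-time call     */
    }
    printf("B[1] = %g\n", B[1]);
    return 0;
}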
A cross-processor true/flow dependence (read-after-write) does exist which needs synchronization: the dependence between definitions of B in the second loop nest and reads of nonlocal values of B in the first loop nest. This dependence is carried by the outer TIME loop, since the endpoints of the dependence occur on different iterations of the TIME loop. The compiler normally inserts a barrier as the last statement of the TIME loop, but local subscript analysis can show that only nearest-neighbor synchronization is needed.
The one case where synchronization is needed for anti-dependences is when the processor performing a read does not yet possess a copy of the shared data, since it may retrieve a copy of the data with the new values. For scientific computations where iterative computations are the rule, this is rarely the case. Our implementation of nearest-neighbor synchronization solves this problem by invoking a global barrier the first time it is invoked at each location in the program. Since anti-dependences may be ignored, the algorithm for inserting barrier synchronization becomes similar to the algorithm for message vectorization [13]. The level of the deepest true/flow cross-processor dependence becomes the point where synchronization must be inserted to prevent data races. Synchronization at lower loop levels is not needed.

2.2.3 Customized Nearest-Neighbor Synchronization
At some barriers, the compiler can detect that communication only takes place between neighboring processors [25]. To take advantage of this information, we implemented a customized routine for nearest-neighbor synchronization (where each processor has either zero, one, or two neighbors) directly in CVM. The routine sends a single message to each neighboring processor upon arrival, and continues as soon as messages are received from all neighboring processors. In comparison, for normal global barriers all processors send a single message to the barrier master, which broadcasts a reply once all processors have checked in.
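CVM's internal messaging code is not shown in the paper, so the sketch below expresses the same protocol in MPI terms: each processor notifies its existing neighbors on arrival and proceeds once both (or one, at the ends of the processor line) have checked in.

#include <mpi.h>
#include <stdio.h>

#define SYNC_TAG 99

/* Arrive, notify both neighbors, and continue once both have arrived. */
void nearest_neighbor_sync(MPI_Comm comm) {
    int rank, size, out = 0, in;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int nbr[2] = { rank - 1, rank + 1 };
    MPI_Request req[2];
    int n = 0;

    for (int k = 0; k < 2; k++)            /* one message per neighbor */
        if (nbr[k] >= 0 && nbr[k] < size)
            MPI_Isend(&out, 1, MPI_INT, nbr[k], SYNC_TAG, comm, &req[n++]);

    for (int k = 0; k < 2; k++)            /* wait for each neighbor   */
        if (nbr[k] >= 0 && nbr[k] < size)
            MPI_Recv(&in, 1, MPI_INT, nbr[k], SYNC_TAG, comm,
                     MPI_STATUS_IGNORE);

    MPI_Waitall(n, req, MPI_STATUSES_IGNORE);  /* retire our sends     */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    nearest_neighbor_sync(MPI_COMM_WORLD);
    printf("past nearest-neighbor synchronization\n");
    MPI_Finalize();
    return 0;
}

Unlike a global barrier, no single master serializes the exchange; each processor blocks only on its immediate neighbors.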
The customized nearest-neighbor synchronization has several advantages over standard global barriers. The most important is that nearest-neighbor synchronization allows at least some of the induced load imbalance to smooth out before it delays all processors. However, this benefit usually occurs only if there are multiple nearest-neighbor synchronization events invoked in sequence. If nearest-neighbor synchronization and global barriers are executed in alternating sequence, opportunities to smooth out load imbalance are lessened. Second, the serial bottleneck of the barrier master is avoided. This is not a large advantage for the size of the system that we are currently evaluating, but should be significant for larger systems. Finally, common messages can be used to carry both synchronization and data, because both flow only between neighbors.

2.3 Experimental Results

2.3.1 Applications
We evaluated the performance of our compiler/software DSM interface with the five programs shown in Table 1. The "Granularity" column refers to the average length in seconds of a parallelized loop. Except where indicated, numbers below refer to the larger data set for each application. expl and redblack are dense stencil kernels typically found in iterative PDE solvers. jacobi is a stencil kernel combined with a convergence test that checks the residual value using a max reduction. swm and tomcatv are programs from the SPEC benchmark suite containing a mixture of stencils and reductions. We used the version of tomcatv from APR whose arrays have been transposed to improve data locality.
Table 1: Applications

Name       Description                             Problem Sizes   Granularity (sec)
expl       dense stencil kernel                    256², 512²      0.06, 0.34
jacobi     Jacobi Iteration w/ Convergence Test    -               -
redblack   dense stencil kernel                    512², 1024²     0.01, 0.14
swm        Shallow Water Model (SPEC)              -               -
tomcatv    SPEC (APR transposed version)           256², 512²      0.04, 0.15
All applications were originally written in Fortran, and typically contain an initialization section followed by iterations of a time-step loop. Statistics and timings are collected after the initialization section. Optimized versions of each program were automatically generated by the SUIF compiler. The compiler analyzed but was unable to apply synchronization optimizations to four other programs, indicating that not all programs may benefit from the techniques presented in this paper.

2.3.2 Experimental Environment
We evaluated our optimizations on an IBM SP-2 with 66 MHz RS/6000 Power2 processors running AIX 4.1. Nodes are connected by a 120 Mbit/sec bi-directional Omega switch capable of a sustained bandwidth of approximately 40 Mbytes per second. Simple RPCs on the SP-2 require 160 μsecs. A one-hop page miss, where the page manager is also the owner, requires two messages and 939 μsecs. Two-hop page misses require three messages and 1376 μsecs. In the best case, AIX requires 128 μsecs to call user-level handlers for page faults. However, virtual memory and mprotect primitive costs in the current system are location-dependent, occasionally increasing these costs to a millisecond or more.
In our experiments, CVM [15] applications written in Fortran 77 were automatically parallelized by the Stanford SUIF parallelizing compiler version 1.1.2 [11], with close to 100% of the computation in parallel regions. A simple chunk scheduling policy assigns contiguous iterations of equal or near-equal size to each processor, resulting in a consistent computation partition that encourages good locality. The resulting C output code was compiled by g++ version 2.7.2 with the -O2 flag, then linked with the SUIF run-time system and the CVM libraries to produce executable code on the IBM SP-2. Customized support for reductions and the flush update protocol were used to improve overall performance [17].
2.3.3 Effectiveness of Compiler Synchronization Optimizations
First, we examine the effectiveness of the compiler algorithms in eliminating synchronization. Table 2 displays the number of barriers found in each program at compile time, and the percentage eliminated by different levels of optimization. Table 3 presents the same information for barriers executed dynamically at run time for each application. The first two columns for "barriers" indicate the number of barriers executed in the original program and the percentage reduction obtained by merging doalls into the same parallel region. The remaining columns show the percentage of barriers eliminated or replaced by nearest-neighbor synchronization at different levels of optimization. The compiler optimization levels are as follows: "dep" eliminates barriers in parallel regions based on data dependences, "comm" performs communication analysis using local subscript analysis to eliminate barriers, and "lazy" eliminates barriers guarding only anti-dependences. Communication analysis may also replace barriers with nearest-neighbor synchronization. Optimizations are cumulative.

Table 2: Static Measurement of Synchronization Optimizations (barriers in program; % eliminated by dep, comm, and lazy)

Table 3: Dynamic Measurement of Synchronization Optimizations (barriers executed; % eliminated [% replaced] by merge, dep, comm, and lazy)
We see from both tables that the compiler is effective at eliminating parallel loops and barriers encountered during compilation, with roughly similar benefits for the number of parallel loops and barriers actually executed by the application at run time. Examining the run-time measurements in Table 3, we find the compiler is quite successful in discovering parallel loops which may be merged into a single parallel region, eliminating on average 30% of barriers executed, which is equivalent to a 60% elimination of doall invocations. Dependence analysis alone is only able to eliminate barriers in two programs, redblack and swm, but the improvement in redblack is significant. Communication analysis can eliminate barriers in tomcatv and replace barriers in expl, jacobi, and redblack. Detecting barriers guarding only anti-dependences, the compiler can eliminate more barriers outright in expl and jacobi. The number of replaced barriers goes down in expl and jacobi, since the compiler can prove some nearest-neighbor barriers guard only anti-dependences. Applying all optimizations, on average 53% of all barrier executions are eliminated in these five programs, with 6% of barriers replaced by nearest-neighbor synchronization.
2.3.4 Impact of Compiler Optimizations on Synchronization Overhead
In order to evaluate synchronization optimizations in more detail, we instrumented CVM to directly measure barrier overhead (time spent executing barrier code), load imbalance (idle time waiting for completion of the current parallel task), and sequential wait (idle time waiting for the next parallel task). We found the actual time spent in barrier routines to be small (refer to Figure 1). Instead, most of the overhead was caused by sequential wait and load imbalance. Table 4 shows sequential wait and load imbalance as percentages of overall execution time. The measurements are for 16-processor SP-2 runs with small and large data sets. Average sequential wait and load imbalance are 14% and 27% of execution time for large data sets, and 23% and 31% for small data sets, respectively. Load imbalance takes up a smaller percentage of execution time with large data sets, because the larger amount of data on individual processors gives a greater chance to smooth out load imbalance.
Table 4: Effect of Optimizations on Load Imbalance and Sequential Wait (16 Processors); per-program sequential wait and imbalance, small (sm) and large (lg) data sets.

Figure 5: Effect of Optimizations on Load Imbalance (16 Processors); bars show sequential wait and imbalance as a percentage of unoptimized execution time for the Unoptimized, Dep analysis, Comm analysis, and Lazy analysis versions of each program.
We also measured the impact of synchronization optimizations on reducing load imbalance. Table 4 displays the percentage reduction in idle time after applying different levels of optimization. Figure 5 graphically presents the same information for the large data sets. We see that optimizations can significantly reduce sequential wait and load imbalance for some of the applications studied (a 72% decrease for redblack with the small data set). Average sequential wait and load imbalance decrease by 28%, 32%, and 34% with large data sets, and by 34%, 37%, and 43% with small data sets, for the three optimization levels respectively.
2.3.5 Impact of Compiler Optimizations on Program Performance

Figure 6 displays the impact of synchronization optimizations on application performance on the SP-2, for both small and large data sets. For each graph, the Y-axis measures improvement over unoptimized programs, and the X-axis presents three optimized versions of each program for both 8- and 16-processor runs: "dep analysis" merges parallel regions and eliminates barriers based on data dependences, "comm analysis" uses local subscript analysis to eliminate barriers or replace them with nearest-neighbor synchronization, and "lazy analysis" eliminates barriers guarding only anti-dependences. Except for redblack and swm, performance for "dep analysis" is the same as that for simply merging adjacent parallel loops into the same parallel region. Optimizations are cumulative.
Performance for the applications covers a broad range. For 16-processor runs with large data sets, the average improvement from synchronization optimizations is 14% for dependence analysis and 17% for communication analysis.
Table 5: Application performance at successive optimization levels (unopt, dep, comm, lazy) for small (sm) and large (lg) data sets.
The number of messages tripled for some applications when using nearest-neighbor barriers. Nonetheless, the overall impact on performance is limited. The primary reason is that many of the barrier synchronizations that are prime candidates to be replaced by nearest-neighbor synchronization can be eliminated instead. However, nearest-neighbor synchronization may prove more important when using more than 16 processors.
Table 6: Impact of Nearest-Neighbor Barriers (16 Processors); % improvement (dep vs. comm) for expl, redblack, tomcatv, and on average.
3 Data Layout Transformations

In addition to synchronization optimizations, we are also investigating data layout transformations for both uniprocessor and multiprocessor architectures. Because of the increasing gap between processor and memory speeds, applications that do not use caches effectively spend most of their time waiting for memory accesses. Due to hardware constraints, caches have limited set associativity: a memory address can be mapped only to one of a small number of locations in a set-associative cache. Limited set associativity can reduce the amount of reuse programs exploit. Conflict misses may occur when too many data items map to the same set of cache locations, causing cache lines to be flushed from the cache before they may be reused, despite sufficient capacity in the overall cache. Conflict misses have been found to be a significant source of poor performance in scientific programs, particularly within loop nests [18].
3.1 Eliminating Conflict Misses

Compiler transformations can be very effective in eliminating conflict misses for scientific programs with regular access patterns. We evaluate two compiler transformations for eliminating conflict misses: 1) inter-variable padding (modifying variable base addresses), and 2) intra-variable padding (increasing array dimension sizes) [2,24]. Unlike compiler transformations that restructure the computation performed by the program, these two techniques modify the program's data layout.
Two motivating examples of these transformations are shown in Figures 7 and 8. In Figure 7, unit-stride references to B(i) and C(i) provide spatial locality, leading to cache reuse. However, if B and C are separated by a multiple of the cache size and the cache is direct-mapped, every access to C(i) will map to the same cache line as the previous B(i) access, and vice versa. As a result, every reference will produce a conflict miss. To regain the benefits of spatial reuse, inter-variable padding may be applied as shown in Figure 7.
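Figure 7 itself is not reproduced here; the following sketch recreates the situation with assumed parameters (a 16KB direct-mapped cache with 32-byte lines, as in the experiments of Section 3.5) and uses a struct to pin the relative placement of the arrays, much as the suifmem variable of Section 3.4 does:

#include <stdio.h>

#define N   2048          /* 2048 doubles = 16KB, exactly the cache size */
#define PAD (4 * 32 / 8)  /* four 32-byte lines of padding, in doubles   */

/* Placing the arrays in one struct fixes their relative addresses.
   Without pad1, B[i] and C[i] would lie exactly one cache size apart
   and evict each other on every iteration of the loop below.         */
static struct {
    double B[N];
    double pad1[PAD];   /* inter-variable pad between B and C */
    double C[N];
    double pad2[PAD];   /* keeps A clear of B and C as well   */
    double A[N];
} mem;

int main(void) {
    for (int i = 0; i < N; i++)            /* unit-stride, spatial reuse */
        mem.A[i] = mem.B[i] + mem.C[i];
    printf("A[0] = %g\n", mem.A[0]);
    return 0;
}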
In Figure 8, the stencil computation exhibits much spatial and temporal reuse between the references to B. However, if the column size of B is a multiple of the cache size, conflict misses will result, eliminating reuse between references to B. To regain spatial reuse, intra-variable padding applied to the columns of B can change its internal layout so that columns no longer map to conflicting locations in the cache, as shown in Figure 8.
Both of these examples show instances of severe conflict misses, where conflict misses occur on every loop iteration. In the remainder of this section, we evaluate several heuristics for applying inter- and intra-variable padding.
      real A(N,N), B(N,N)
      do j = 1, N
        do i = 1, N
          A(i,j) = 0.25 * (B(i+1,j) + B(i-1,j)
                         + B(i,j+1) + B(i,j-1))

-->   real A(N,N), B(N+PAD,N)

Figure 8: Intra-Variable Padding (Change Column Dimension)
3.2 Analysis-Free Padding

3.2.1 Inter-variable Padding Heuristic

Based on empirical studies [22], our implementation currently separates array base addresses by four cache lines. An alternative heuristic which separates arrays as far as possible in the cache was shown to be less effective in eliminating conflict misses [22].
3.2.2 Intra-variable Padding Heuristic

A possible intra-variable padding heuristic is to pad the dimensions of each array so that the sizes of subarrays comprising lower dimensions are far enough from any cache-size multiple. In the simplest case, array columns would be padded when neighboring columns are mapped to the same location in the cache. Empirical studies [22] indicate this heuristic misses many profitable opportunities for intra-variable padding. As a result, our implementation currently pads all array columns by four elements unconditionally when safe. It should be clear that this heuristic is not practical; however, our experiments demonstrate it is quite effective in identifying many programs which stand to benefit from intra-variable padding.
3.3Analysis-BasedPadding
Formoresophisticatedpaddingheuristics,additionalcompileranalysisofarrayreferencesisrequired.First,wefinduniformlygeneratedreferencesdescribedbyGannonetal.[9].Wethencomputetheconflictdistance,thedifferencebetweenthememory
addressesoftwoarrayreferencesmodulothecachesize
.Ifoneachloopiterationtheconflictdistanceislessthanthecache
linesize
,severeconflictmissescanresult.Observethatmakingtheconflictdistanceorgreaterisasufficient(thoughnotnecessary)conditionforeliminatingsevereconflicts;weusethistoguideourdatalayoutoptimizations.Ourtechniquesforcomputingconflictdistancesisaspecialsimplifiedcaseofcachemissequations[10].
3.3.1 Inter-variable Padding Heuristic
Central to our technique for eliminating severe conflict misses is considering pairs of array references with constant conflict distances on each loop iteration. We find that such pairs are commonly uniformly generated references [9] extended for conforming arrays: arrays that have equal dimension sizes in all but their highest dimension and possess equal-sized elements. A pair of uniformly generated references to n-dimensional conforming arrays A and B has the form A(v_1 + c_1, ..., v_n + c_n) and B(v_1 + d_1, ..., v_n + d_n), where each v_k is an integer variable or the value 0, and each c_k and d_k is some constant integer value. Uniformly generated references may include references nested in loops with indices v_k, references to 1-dimensional arrays of arbitrary size, and references with integer subscripts (in which case v_k reduces to 0).
When we calculate the conflict distance on a given loop iteration by subtracting the two memory locations accessed by uniformly generated references, we obtain an expression in which all variable terms cancel:

    ConflictDist = \Big( BaseAddr_A - BaseAddr_B + e \sum_{k=1}^{n} (c_k - d_k) \prod_{j=1}^{k-1} D_j \Big) \bmod C    (1)

In this expression, BaseAddr_A and BaseAddr_B are the base addresses of A and B, e is the common element size, and D_j is the common size of dimension j. A lower bound of 0 is assumed in each dimension, but other lower bounds may be accounted for by folding them into the constants c_k and d_k. Taking the result modulo the cache size C then yields the conflict distance.
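As a sketch (ours, not SUIF's implementation), expression (1) transcribes directly into code; we also fold the result symmetrically around the cache size so that wraparound near C is treated as a small distance:

#include <stdio.h>

/* Conflict distance per expression (1).  c[k], d[k]: constant subscript
   offsets of the two references; D[k]: common dimension sizes (innermost
   first); e: element size; C: cache size in bytes.                      */
long conflict_distance(int n, const long c[], const long d[],
                       const long D[], long base_a, long base_b,
                       long e, long C) {
    long dist = base_a - base_b, stride = e;
    for (int k = 0; k < n; k++) {          /* e * sum (c_k - d_k) * prod D_j */
        dist += (c[k] - d[k]) * stride;
        stride *= D[k];
    }
    dist %= C;                             /* wrap into the cache           */
    if (dist < 0) dist += C;
    return dist < C - dist ? dist : C - dist;  /* symmetric distance        */
}

int main(void) {
    /* B(i+1,j) vs B(i-1,j) in a 2048x2048 array of 8-byte elements:
       distance = 2 elements = 16 bytes, below a 32-byte line, so the
       pair reuses one cache line rather than conflicting.             */
    long c[2] = {1, 0}, d[2] = {-1, 0}, D[2] = {2048, 2048};
    printf("%ld\n", conflict_distance(2, c, d, D, 0, 0, 8, 16384));
    return 0;
}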
We wish to identify all pairs of uniformly generated references which may conflict. Our algorithm therefore analyzes each loop to find such references. Since the uniformly generated relation is transitive (if references a and b are uniform and references b and c are uniform, then a and c are uniform as well), pairs of uniformly generated references in a loop can be efficiently represented as groups of uniformly generated references. Each reference in a group is uniform with respect to all other references in the group. Our initial analysis is thus to compute, for every loop, the set of maximal groups of uniformly generated references.
We apply inter-variable padding as follows. The objective is to assign each array a base address so that no two uniformly generated references in a group conflict. We accomplish this in a simple, greedy fashion. All array base addresses are initially undefined. Then, each array in turn is assigned a base address as follows. First, the next available address is attempted. If this address leads to a conflict with any other array with a defined base address, the address is incremented by the cache line size L, and the process repeats. The test for a conflict between two arrays involves instantiating expression (1) with the c_k and d_k values from all pairs referencing the two arrays in each of the analyzed groups, and the e, D_j, and BaseAddr values imposed by the arrays under consideration. If the conflict distance obtained is less than L, then we assume the addresses conflict.
A base address needs to be incremented at most C/L times, since a conflict at each of the first C/L candidate addresses implies conflicts for any remaining values obtained by adding further multiples of L. If this happens, the algorithm has failed to find a satisfactory address and assigns the first, unpadded address attempted. In our experiments the compiler could always find a non-conflicting base address.

3.3.2 Intra-variable Padding Heuristic
Intra-array padding aims to modify the distances between references to the same array in order to eliminate conflicts. The notion of uniformly generated references is again useful in identifying common forms of array references accessing locations a fixed distance apart. By the definition given in the last section, two references to the same array are uniformly generated as long as they are of the form A(v_1 + c_1, ..., v_n + c_n) and A(v_1 + d_1, ..., v_n + d_n), where each v_k is an unmodified variable or 0, and each c_k and d_k is an arbitrary integer. All other conditions are always satisfied since the two references access the same array. Expression (1) in this special case reduces to

    ConflictDist = \Big( e \sum_{k=1}^{n} (c_k - d_k) \prod_{j=1}^{k-1} D_j \Big) \bmod C    (2)

since the base addresses (and any array lower bounds) cancel each other out. Stated in terms of expression (2), intra-array padding modifies the D_j values for each array as needed to produce non-conflicting distances between all uniformly generated references.

We apply intra-variable padding as follows. First, each loop is analyzed to compute maximal groups of uniformly generated references to the same array. Arrays passed to procedures are not padded, since no interprocedural analysis is performed. Second, the compiler instantiates expression (2) with the c_k and d_k values of all reference pairs from each group containing references to the array,
and the D_j values imposed by the current array layout. If the resulting distance conflicts, a pad of one element is attempted on all of the lower dimensions, and the layout is reanalyzed until no conflicts are found. If the pad size fails, it is incremented by one and the analysis repeated. An upper bound on the padding size is imposed to ensure termination. In our experiments, no pad greater than two elements was required for any program encountered.
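A sketch of this search, reduced to the single stencil pair B(i,j+1) versus B(i,j-1) from Figure 8 for brevity (parameters assumed: 16KB cache, 32-byte lines, 8-byte elements):

#include <stdio.h>

#define C 16384L   /* cache size in bytes   */
#define L 32L      /* line size in bytes    */
#define E 8L       /* element size (double) */

/* Symmetric conflict distance between B(i,j+c) and B(i,j+d) when the
   column size is col elements: expression (2) with offset difference
   (c - d) in the column dimension and 0 in the contiguous dimension. */
static long distance(long col, long c, long d) {
    long dist = ((c - d) * col * E) % C;
    if (dist < 0) dist += C;
    return dist < C - dist ? dist : C - dist;
}

int main(void) {
    long col = 2048;           /* declared column size: a cache multiple */
    long pad = 0, bound = 8;   /* small bound guarantees termination     */
    while (pad <= bound &&
           distance(col + pad, 1, -1) < L)   /* B(i,j+1) vs B(i,j-1)     */
        pad++;                               /* grow the pad and retry   */
    printf("chosen column pad: %ld elements\n", pad);
    return 0;
}

With these parameters the search stops at a pad of two elements, matching the small pads observed in our experiments.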
3.4 Prototype SUIF Implementation
To evaluate the impact of compiler analysis on padding heuristics, each of the four transformations (analysis-free and analysis-based intra- and inter-variable padding) was implemented as a pass in SUIF. When applying both intra- and inter-variable padding in combination, we perform intra-variable padding before inter-variable padding. Before performing padding, a number of transformations are applied to each program in a preprocessing phase. Preprocessing begins with constant propagation, which can have the effect of transforming array subscripts involving the addition of multiple variables into the form required of uniformly generated references. In this way, constant propagation can improve the precision of the analysis.
The remaining transformations are performed to give the optimizer compile-time control over variable base addresses. First, local array and structure variables are promoted into the global scope. Formal parameters to functions cannot and do not need to be promoted, since they represent variables declared elsewhere. Array and structure global variables are then made into fields of suifmem, a large structured variable (or common block in Fortran). Structures/common blocks which are themselves fields of suifmem are recursively split so that individual variables may be moved about as necessary. Preprocessing thus results in a single global variable containing all of the variables to be optimized, as shown in Figure 9. Optimizing passes may now modify the base addresses of variables by reordering fields in the suifmem structure and inserting pad variables.

3.5 Experimental Evaluation
Each transformation was applied to a number of scientific kernels and applications taken from the SPEC92 and NAS benchmarks.
Table 7: Static padding measurements. Programs: adi32, dgefa256, dot256, erle, expl512, irr500k, jacobi512, linpackd, mult300, rb512, shal512, appbt, applu, appsp, buk, embar, fftpde. Columns: safe, padded, inter-array padding lines skipped, and array references (num, % uniform, % not global).
int A[100];
int B[100];
foo() {
  for (i)
    A[i] = B[i];
}

-->

struct _globmem {
  int A[100];
  int B[100];
} suifmem;

foo() {
  for (i)
    suifmem.A[i] = suifmem.B[i];
}

Figure 9: Globalization in preprocessing stage
Although increments may be applied repeatedly to each candidate base address, in each test the combined total for all variables was always less than 20.
The final columns categorize the array references analyzed in each program. The four columns under array references indicate the number of array references (num), the percentage of array references which enter some group (uniform), the percentage of global array references which by their form cannot be uniformly generated (not uniform), and the percentage of array references which aren't global (not global). (Thus, uniform + not uniform + not global = 100.) Only uniform references are analyzed; not uniform and not global references, though common in a number of programs, are ignored. Results show most references were uniformly generated, though many belonged to procedure parameters which could not be transformed.

3.5.2 Cache Miss Rates
Table 8 presents the impact of optimizations on simulated cache miss rates. For each program, miss rates are given for the original version and a number of versions transformed by the analysis-free and analysis-based heuristics. The best improvement compared to the original program is presented both as the drop in cache miss rate (best improve) and as the percent decrease in miss rate (best improve %). Overall, data layout optimizations are very effective in eliminating cache misses in many programs. Cache miss rates are reduced by 5% or more in eleven of the 24 programs tested, and by 20% or more in six programs. Degradations in performance do occur. No optimized version outperforms the original in the cases of doduc, fpppp, hydro2d, and appbt. Miss rate increases of 0.1% occur in three other programs as well. However, all of these increases are relatively mild. In no program do miss rates rise by more than 1%.
Table 8: Impact of Optimizations on Cache Miss Rates (16K direct-mapped cache, 32B lines)
In comparing the analysis-free approach with the analysis-based approach, we find that while each may perform better in certain cases, over the entire range of programs tested the analysis-based approach performs somewhat better. The row labeled Average reveals the magnitude of these differences. Programs undergoing intra-variable padding without analysis average a 9.7% miss rate, versus 9.5% under the analysis-based transformation. A similar 9.3% to 9.0% ratio results in the case of composed intra- plus inter-variable padding. Since the analysis-free intra-variable transformation pads unconditionally, it inevitably degrades performance in a number of programs. The worst example is adi32, for which the miss rate increases by over 9%. The analysis-based intra-variable transformation makes no serious mistakes.
Many such weaknesses in our present implementation of analysis-based padding may stem from the simplicity of its analysis. For instance, references to arrays which are formal parameters are ignored during analysis. Interprocedural analysis will enable the compiler to map these formal parameters to one or more global arrays (or to offsets of global arrays), allowing them to be optimized.
Additional scalar analysis should improve the level of analysis as well. In particular, the analysis may stand to benefit considerably from the forward propagation of variable assignments into array subscripts. We illustrate this point with sample code from the SPEC benchmark tomcatv. The nest in Figure 10 contains 8 array references whose subscripts include 6 different variables. However, variables JP and JM contain the values J+1 and J-1 throughout the nest, and IP and IM are related to I similarly. When we propagate the assignments of these variables into the array subscripts, all 8 references can be confirmed as uniformly generated and enter the same group during analysis. Without forward propagation, analysis fails to detect this relationship, producing 4 groups with 2 references each.
We should note that in our tests we used a version of tomcatv with the forward propagation described above. When applied to the unmodified tomcatv, our analysis-based transformations obtained miss rates of 13.7%. Forward propagation reduced these values to 7.7%. These results indicate that data layout optimizations based on uniformly generated references will be most effective only in conjunction with standard forms of scalar analysis.

3.5.3 Execution Time
Versions of programs exhibiting a significant (5% or more) reduction in miss rates were timed on DEC Alphas, as shown in Table 9. Results demonstrate that the observed miss rate reductions translate into improvements in performance: all execution times decrease by 10% or more. For the programs given in this table, the analysis-based transformations again do slightly better overall. However, with respect to the intra+inter averages presented, the analysis-based heuristic owes a significant share of its advantage to the degradation experienced in the version of adi32 transformed by its analysis-free counterpart.

Table 9: Impact of Optimizations on Execution Times for DEC Alpha (in seconds); programs: adi32, dot256, expl512, shal512, tomcatv, buk; versions: original, analysis-free intra+inter, analysis-based intra+inter.
3.5.4 Discussion
In general, we found small inter-variable pads were sufficient to eliminate the worst of the severe conflict misses. Both the analysis-free and analysis-based heuristics were successful, though the analysis-based heuristic is more accurate for adi32. Only two programs (dgefa256, erle) benefit from intra-variable padding. The analysis-based heuristic identified the need for intra-variable padding for erle, but fails for dgefa256. The analysis-free heuristic of padding all array columns degrades the performance of adi32.
The advantage of analysis-based heuristics thus appears to be the ability to usually detect profitable opportunities for padding, while avoiding padding decisions which degrade performance. Previous studies indicate padding transformations become more important as the ratio of application data to cache size increases [22], so analysis-based heuristics may become more important for real-world programs. Our results show not all programs were greatly improved, but the impact is significant enough that data layout optimizations should be considered an important tool for optimizing compilers.
4 Related Work

4.1 Synchronization Optimizations

Researchers compiling for fine-grain data-parallel languages sought to eliminate barriers following each expression evaluation [12,20]. Simple data dependence analysis can be used to reduce barrier synchronization by orders of magnitude, greatly improving performance. Eliminating barriers in compiler-parallelized codes is more difficult. In previous work [25] we presented techniques to eliminate or lessen synchronization based on the communication analysis used by distributed-memory compilers to calculate explicit communication [13]. O'Boyle and Bodin [19] present techniques similar to local subscript analysis to identify data dependences that cross processor boundaries. Stohr and O'Boyle [23] extend this work, presenting an optimal algorithm for eliminating barrier synchronization in perfectly nested loops.

Recently researchers have investigated optimizations when compiling for software DSMs. Dwarkadas et al. applied compiler analysis to explicitly parallel programs to improve their performance on a software DSM [7]. Chandra and Larus evaluated combining the PGI HPF compiler and the Tempest software DSM system [4]. Cox et al. conducted an experimental study to evaluate the performance of TreadMarks as a target for the Forge SPF shared-memory compiler from APR [6]. They identify opportunities to eliminate unneeded barrier synchronization; many of their suggestions are implemented in the SUIF/CVM system.

Rajamony and Cox developed a performance debugger for detecting unnecessary synchronization at run time by instrumenting all loads and stores [21]. In the SPLASH application Water, it was able to detect barriers guarding only anti and output dependences that may be eliminated by applying odd-even renaming. In comparison, SUIF eliminates many barriers guarding only anti-dependences at compile time.

4.2 Data Locality Optimizations

Data locality has been recognized as a significant performance issue for both scalar and parallel architectures. Wolf and Lam provide a concise definition and summary of important types of data locality [26]. Gannon et al. introduce the notion of uniformly generated references as a means of discovering group reuse between references to the same array [9]. McKinley and Temam performed a study of loop-nest-oriented cache behavior for scientific programs and concluded that conflict misses cause half of all cache misses and most intra-nest misses [18]. Most researchers have concentrated on computation-reordering transformations. Loop permutation and tiling are the primary optimization techniques [3,9,26], though loop fission (distribution) and loop fusion have also been found to be helpful [3].

Researchers have previously examined changing data layout in parallel applications, but have generally not studied its effect on sequential applications. Jeremiassen and Eggers automatically eliminate false sharing in explicitly parallel programs [14]. Cierniak and Li examined combining array transpose and loop transformations to improve locality for parallel programs [5]. Amarasinghe et al. demonstrated the utility of data layout transformations for parallel applications [1]. They found it to be significant in eliminating adverse cache effects, though specialized optimizations were necessary to reduce computation overhead for modified array subscripts.

Many researchers have also examined the problem of deriving estimates of cache misses in order to help guide data locality optimizations [8,9,26]. These models typically can predict only capacity misses because they assume a fully-associative cache. In comparison, Ghosh et al. can determine conflict misses by calculating cache miss equations, linear Diophantine equations that summarize each loop's memory behavior by calculating exactly the cache slot to which each reference is mapped [10]. They demonstrate how to use cache miss equations to select array paddings to eliminate conflict misses, and block sizes for tiling. In comparison, we focus only on severe conflicts, using a simplified version of cache miss equations to calculate the conflict distance between pairs of uniformly generated array references. Our approach is less precise, but should be more efficient.
5 Conclusions

This paper describes ongoing SUIF compiler projects for high performance architectures at the University of Maryland. First, we are investigating compilation techniques for eliminating barrier synchronization overhead in compiler-parallelized programs running on software distributed-shared-memory (DSM) systems. Second, we are evaluating data layout optimizations, starting with techniques to eliminate cache conflict misses through variable padding. Our optimizations have been implemented in SUIF and tested on a number of programs. Preliminary experimental results are encouraging, but indicate more careful compiler analysis is needed to guide optimization heuristics for both parallel and sequential codes. We intend to continue work in both areas in pursuit of our goal: efficient machine-independent programming of both sequential and parallel processors.
References
[1] J. Anderson, S. Amarasinghe, and M. Lam. Data and computation transformations for multiprocessors. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Santa Barbara, CA, July 1995.
[2] W. Bolosky, R. Fitzgerald, and M. Scott. Simple but effective techniques for NUMA memory management. In Proceedings of the Twelfth Symposium on Operating Systems Principles, Litchfield Park, AZ, December 1989.
[3] S. Carr, K. S. McKinley, and C.-W. Tseng. Compiler optimizations for improving data locality. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), San Jose, CA, October 1994.
[4] S. Chandra and J. R. Larus. Optimizing communication in HPF programs for fine-grain distributed shared memory. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Las Vegas, NV, June 1997.
[5] M. Cierniak and W. Li. Unifying data and control transformations for distributed shared-memory machines. In Proceedings of the SIGPLAN '95 Conference on Programming Language Design and Implementation, La Jolla, CA, June 1995.
[6] A. Cox, S. Dwarkadas, H. Lu, and W. Zwaenepoel. Evaluating the performance of software distributed shared memory as a target for parallelizing compilers. In Proceedings of the 11th International Parallel Processing Symposium, Geneva, Switzerland, April 1997.
[7] S. Dwarkadas, A. Cox, and W. Zwaenepoel. An integrated compile-time/run-time software distributed shared memory system. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), Boston, MA, October 1996.
[8] J. Ferrante, V. Sarkar, and W. Thrash. On estimating and enhancing cache effectiveness. In U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, editors, Languages and Compilers for Parallel Computing, Fourth International Workshop, Santa Clara, CA, August 1991. Springer-Verlag.
[9] D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformation. Journal of Parallel and Distributed Computing, 5(5):587–616, October 1988.
[10] S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: An analytical representation of cache misses. In Proceedings of the 1997 ACM International Conference on Supercomputing, Vienna, Austria, July 1997.
[11] M. Hall, S. Amarasinghe, B. Murphy, S. Liao, and M. Lam. Detecting coarse-grain parallelism using an interprocedural parallelizing compiler. In Proceedings of Supercomputing '95, San Diego, CA, December 1995.
[12] P. Hatcher and M. Quinn. Data-Parallel Programming on MIMD Computers. The MIT Press, Cambridge, MA, 1991.
[13] S. Hiranandani, K. Kennedy, and C.-W. Tseng. Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, 35(8):66–80, August 1992.
[14] T. Jeremiassen and S. Eggers. Reducing false sharing on shared memory multiprocessors through compile time data transformations. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Santa Barbara, CA, July 1995.
[15] P. Keleher. The relative importance of concurrent writers and weak consistency models. In 16th International Conference on Distributed Computing Systems, Hong Kong, May 1996.
[16] P. Keleher, A. L. Cox, and W. Zwaenepoel. Lazy release consistency for software distributed shared memory. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 13–21, May 1992.
[17] P. Keleher and C.-W. Tseng. Enhancing software DSM for compiler-parallelized applications. In Proceedings of the 11th International Parallel Processing Symposium, Geneva, Switzerland, April 1997.
[18] K. S. McKinley and O. Temam. A quantitative analysis of loop nest locality. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), Boston, MA, October 1996.
[19] M. O'Boyle and F. Bodin. Compiler reduction of synchronization in shared virtual memory systems. In Proceedings of the 1995 ACM International Conference on Supercomputing, Barcelona, Spain, July 1995.
[20] M. Philippsen and E. Heinz. Automatic synchronization elimination in synchronous FORALLs. In Frontiers '95: The 5th Symposium on the Frontiers of Massively Parallel Computation, McLean, VA, February 1995.
[21] R. Rajamony and A. L. Cox. A performance debugger for eliminating excess synchronization in shared-memory parallel programs. In Proceedings of the Fourth International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), February 1996.
[22] G. Rivera and C.-W. Tseng. Compiler optimizations for eliminating cache conflict misses. Technical Report CS-TR-3819, Dept. of Computer Science, University of Maryland at College Park, July 1997.
[23] E. Stohr and M. O'Boyle. A graph based approach to barrier synchronisation minimisation. In Proceedings of the 1997 ACM International Conference on Supercomputing, Vienna, Austria, July 1997.
[24] J. Torrellas, M. Lam, and J. Hennessy. Shared data placement optimizations to reduce multiprocessor cache miss rates. In Proceedings of the 1990 International Conference on Parallel Processing, St. Charles, IL, August 1990.
[25] C.-W. Tseng. Compiler optimizations for eliminating barrier synchronization. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Santa Barbara, CA, July 1995.
[26] M. E. Wolf and M. Lam. A data locality optimizing algorithm. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation, Toronto, Canada, June 1991.