Hwansoo Han, Gabriel Rivera, Chau-Wen Tseng
Department of Computer Science
University of Maryland, College Park, MD 20742
Abstract
We describe two ongoing compiler projects for high performance architectures at the University of Maryland being developed using the Stanford SUIF compiler infrastructure. First, we are investigating the impact of compilation techniques for eliminating synchronization overhead in compiler-parallelized programs running on software distributed-shared-memory (DSM) systems. Second, we are evaluating data layout transformations to improve cache performance on uniprocessors by eliminating conflict misses through inter- and intra-variable padding. Our optimizations have been implemented in SUIF and tested on a number of programs. Preliminary results are encouraging.
1 Introduction
Our goal at the University of Maryland is to support efficient machine-independent programming of advanced architectures. Users desire the ability to write programs that can run well on a variety of computers, since such programs are portable and protect software investment. However, modern processor architectures are quite varied and complex. To achieve high performance, it is critical that the compiler generate code that can efficiently utilize the underlying hardware. To bridge the gap between applications, operating systems, and advanced architectures, the compiler must be able to perform advanced analyses and transformations. We are currently pursuing research in two areas: compiling for software distributed-shared-memory (DSM) systems and data layout optimizations for high performance architectures.
2 Reducing Synchronization Overhead
We begin by looking at compiler optimizations for reducing synchronization overhead. These optimizations are examined in the context of a prototype system [17] extending SUIF [11] to target the CVM [15] software distributed-shared-memory (DSM) system. Recent studies indicate such systems can approach the performance of current message-passing HPF compilers and explicit message-passing MPI programs [6,17]. However, load imbalance and synchronization overhead were identified as key sources of inefficiency.
Execution times for five SUIF-parallelized applications on a 16-processor IBM SP-2 are shown in Figure 1. Total execution time is split into application processing time, OS overhead, communication cost, sequential wait (idle time waiting for the next parallel task), load imbalance (idle time waiting for completion of the current parallel task), and barrier overhead (time spent executing barrier code). We found that while not much time is spent explicitly executing barriers (1.5% of execution time), idle processor time in the form of sequential wait and load imbalance comprises a large percentage of total execution time (14% and 27%, respectively).
…to apply the flush protocol for those pages at the appropriate time.

2.1.3 Compile-time Barrier Elimination
The fork-join model used by shared-memory compilers is flexible and can easily handle sequential portions of the computation; however, it imposes two synchronization events per parallel loop, as shown in Figure 2(A). First, a broadcast barrier is invoked before the parallel loop body to wake up available worker threads and provide workers with the address of the computation to be performed, along with parameters if needed. A barrier is then invoked after the loop body to ensure all worker threads have completed before the master can continue.
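The two synchronization events can be made concrete with a small threaded sketch. The pthread-based run time, the work-descriptor variable, and all names below are our illustration, not the actual SUIF/CVM run-time interface:

#include <pthread.h>
#include <stdio.h>

#define NWORKERS 3
#define N 1200

static double A[N];
static void (*work_fn)(int chunk, int nchunks);  /* work descriptor */

static pthread_barrier_t start_b;  /* broadcast barrier: wakes workers */
static pthread_barrier_t done_b;   /* join barrier: master waits here  */

/* Body of one parallel doall, parameterized by chunk. */
static void loop_body(int chunk, int nchunks) {
    for (int i = chunk * (N / nchunks); i < (chunk + 1) * (N / nchunks); i++)
        A[i] = 2.0 * i;
}

static void *worker(void *arg) {
    int id = (int)(long)arg;
    for (;;) {                             /* serve parallel tasks forever */
        pthread_barrier_wait(&start_b);    /* sync event 1: wake up        */
        work_fn(id, NWORKERS + 1);         /* run assigned chunk           */
        pthread_barrier_wait(&done_b);     /* sync event 2: report done    */
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NWORKERS];
    pthread_barrier_init(&start_b, NULL, NWORKERS + 1);
    pthread_barrier_init(&done_b, NULL, NWORKERS + 1);
    for (long w = 0; w < NWORKERS; w++)
        pthread_create(&tid[w], NULL, worker, (void *)w);

    work_fn = loop_body;                   /* master posts the work address  */
    pthread_barrier_wait(&start_b);        /* broadcast: start parallel loop */
    loop_body(NWORKERS, NWORKERS + 1);     /* master runs the last chunk     */
    pthread_barrier_wait(&done_b);         /* join: all chunks completed     */

    printf("A[N-1] = %g\n", A[N - 1]);
    return 0;                              /* workers die with the process   */
}

Removing or weakening these two barrier waits per parallel loop is exactly the opportunity the optimizations below target.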
In previous work, we developed compiler algorithms for barrier elimination based on the observation that synchronization between a pair of processors is only necessary if they communicate shared data [25]. If data dependence analysis can show two adjacent loop nests access disjoint sets of data, then the barrier separating them may be eliminated, as in Figure 2(B). If two loop nests access the same data, but communication analysis proves no remote data is accessed based on data and computation decomposition information, the barrier may also be eliminated, as in Figure 2(C). Finally, if communication analysis identifies simple interprocessor sharing patterns, the barrier may be replaced with less expensive forms of synchronization. In particular, if data is only shared between neighboring processors, the barrier may be replaced by nearest-neighbor synchronization, as shown in Figure 2(D).
2.2 Novel Techniques
2.2.1 Communication Analysis Using Local Subscripts
Previous communication analysis relied on compile-time information about the data and computation decompositions selected for a program to precisely determine whether interprocessor communication takes place [25]. We find that an alternative communication analysis technique based on local subscript analysis can also yield good results, but with less complex analysis. The local subscript analysis algorithm is shown in Figure 3.
We assume dependence analysis has already eliminated all array reference pairs between loop nests proven to access disjoint memory locations. Local subscript analysis relies on the compiler selecting a consistent chunk partitioning of parallel loop iterations. Consistent iteration partitioning of loops with identical loop bounds will then always assign the same loop iterations to each processor. If any subscript pair containing the parallel loop index is identical, each processor will only access local data. If the subscripts differ by only a constant, then each processor will access remote data a constant number of processors away. Since constant differences in subscripts are usually small, processors will end up accessing remote data on neighboring processors, allowing barriers to be replaced by nearest-neighbor synchronization.
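As a rough illustration, the sketch below classifies a single cross-loop reference pair by its constant subscript difference, assuming the consistent chunk partitioning described above. The enum, the constant-offset representation, and the threshold logic are simplifications of ours, not the compiler's actual implementation:

#include <stdio.h>
#include <stdlib.h>

/* Synchronization required for one cross-loop array reference pair. */
typedef enum { SYNC_NONE, SYNC_NEIGHBOR, SYNC_FULL_ANALYSIS } sync_kind;

/* A subscript of the simple form  i + offset  in parallel loop index i.
   Real subscripts are symbolic; a constant offset is all we model here. */
typedef struct { int offset; } subscript;

/* Under a consistent chunk partitioning of identical iteration spaces,
   identical subscripts mean each processor touches only its own chunk,
   and a small constant difference crosses at most a chunk boundary,
   i.e., reaches only a neighboring processor.                          */
static sync_kind classify(subscript src, subscript dst, int chunk) {
    int diff = abs(src.offset - dst.offset);
    if (diff == 0)    return SYNC_NONE;        /* purely local            */
    if (diff < chunk) return SYNC_NEIGHBOR;    /* nearest neighbor        */
    return SYNC_FULL_ANALYSIS;                 /* defer to comm. analysis */
}

int main(void) {
    subscript a = {0}, b = {1}, far = {5000};
    int chunk = 1024;                          /* iterations per processor */
    printf("%d %d %d\n",
           classify(a, a, chunk),     /* 0: barrier eliminated        */
           classify(a, b, chunk),     /* 1: nearest-neighbor suffices */
           classify(a, far, chunk));  /* 2: needs full analysis       */
    return 0;
}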
Local subscript analysis is designed to complement full communication analysis. It is less precise, but can be computed efficiently because it relies only on local symbolic information. Local subscript analysis can thus quickly eliminate simple array reference pairs that do not require synchronization, reducing the number of dependences that must be analyzed with communication analysis. Local subscript analysis can also be applied in some cases where standard communication analysis fails, since it relies only on local program information. For instance, in Figure 4(A), local subscript analysis does not need to fully analyze the functions f1() or f2() before eliminating the barrier. In comparison, communication analysis must calculate what data is communicated, and can be disabled by a single complex subscript.
2.2.2 Exploiting Lazy Release Consistency
A second enhancement to our compile-time barrier elimination algorithm is made possible by the underlying semantics of lazy-release-consistency software DSMs. In a traditional shared-memory model, synchronization is needed between two loop nests if two processors access shared data, with at least one of the processors performing a write to the shared data. However, in a lazy-release-consistency software DSM, if the shared reference is a read in the first loop and a write in the second loop, no synchronization is needed, because writes from the second loop will not become visible to the read in the first loop until synchronization is encountered. In other words, anti-dependences (write-after-read) do not need to be synchronized.
To see how the compiler can use this property, consider the example shown in Figure 4(b). The first loop nest reads nonlocal values of B which are defined in the second loop nest. The cross-processor dependence caused by B is thus a loop-independent anti-dependence. Normally, synchronization is needed to ensure the old values of B are read before the new values of B are written. However, with lazy release consistency the software DSM guarantees that new values of B on another processor will not be made visible until the two processors synchronize. Since there are no other loop-independent dependences between the two loop nests, synchronization between them is not required.
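Figure 4(b) is not reproduced here, but the following self-contained sketch captures the same pattern; the loop bounds, data, and the commented synchronization point are our illustration:

#include <stdio.h>

#define N 64
#define TSTEPS 4

static double A[N + 2], B[N + 2];

int main(void) {
    for (int i = 0; i < N + 2; i++) B[i] = i;

    for (int t = 0; t < TSTEPS; t++) {
        /* Loop nest 1: reads neighboring (possibly nonlocal) values of B. */
        for (int i = 1; i <= N; i++)
            A[i] = 0.5 * (B[i - 1] + B[i + 1]);

        /* No barrier needed here under lazy release consistency: the writes
           to B below remain invisible to other processors until the next
           synchronization, so this loop-independent anti-dependence on B
           cannot be violated.                                              */

        /* Loop nest 2: (re)defines B. */
        for (int i = 1; i <= N; i++)
            B[i] = A[i];

        /* The true/flow dependence carried by the time-step loop (B written
           here, read by neighbors next iteration) does need synchronization;
           local subscript analysis shows nearest-neighbor sync suffices.
           nearest_neighbor_sync();   -- hypothetical CVM run-time call     */
    }
    printf("B[1] = %g\n", B[1]);
    return 0;
}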
A cross-processor true/flow dependence (read-after-write) does exist which needs synchronization: the dependence between definitions of B in the second loop nest and reads of nonlocal values of B in the first loop nest. This dependence is carried by the outer TIME loop, since the endpoints of the dependence occur on different iterations of the TIME loop. The compiler normally inserts a barrier as the last statement of the TIME loop, but local subscript analysis can show that only nearest-neighbor synchronization is needed.
The one case where synchronization is needed for anti-dependences is when the processor performing a read does not yet possess a copy of the shared data, since it may retrieve a copy of the data with the new values. For scientific computations where iterative computations are the rule, this is rarely the case. Our implementation of nearest-neighbor synchronization solves this problem by invoking a global barrier the first time it is invoked at each location in the program. Since anti-dependences may be ignored, the algorithm for inserting barrier synchronization becomes similar to the algorithm for message vectorization [13]. The level of the deepest true/flow cross-processor dependence becomes the point where synchronization must be inserted to prevent data races. Synchronization at lower loop levels is not needed.

2.2.3 Customized Nearest-Neighbor Synchronization
At some barriers, the compiler can detect that communication only takes place between neighboring processors [25]. To take advantage of this information, we implemented a customized routine for nearest-neighbor synchronization (where each processor has either zero, one, or two neighbors) directly in CVM. The routine sends a single message to each neighboring processor upon arrival, and continues as soon as messages are received from all neighboring processors. In comparison, for normal global barriers all processors send a single message to the barrier master, which broadcasts a reply once all processors have checked in.
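CVM's internal messaging code is not shown in the paper, so the sketch below expresses the same protocol in MPI terms: each processor notifies its existing neighbors on arrival and proceeds once both (or one, at the ends of the processor line) have checked in.

#include <mpi.h>
#include <stdio.h>

#define SYNC_TAG 99

/* Arrive, notify both neighbors, and continue once both have arrived. */
void nearest_neighbor_sync(MPI_Comm comm) {
    int rank, size, out = 0, in;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int nbr[2] = { rank - 1, rank + 1 };
    MPI_Request req[2];
    int n = 0;

    for (int k = 0; k < 2; k++)            /* one message per neighbor */
        if (nbr[k] >= 0 && nbr[k] < size)
            MPI_Isend(&out, 1, MPI_INT, nbr[k], SYNC_TAG, comm, &req[n++]);

    for (int k = 0; k < 2; k++)            /* wait for each neighbor   */
        if (nbr[k] >= 0 && nbr[k] < size)
            MPI_Recv(&in, 1, MPI_INT, nbr[k], SYNC_TAG, comm,
                     MPI_STATUS_IGNORE);

    MPI_Waitall(n, req, MPI_STATUSES_IGNORE);  /* retire our sends     */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    nearest_neighbor_sync(MPI_COMM_WORLD);
    printf("past nearest-neighbor synchronization\n");
    MPI_Finalize();
    return 0;
}

Unlike a global barrier, no single master serializes the exchange; each processor blocks only on its immediate neighbors.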
The customized nearest-neighbor synchronization has several advantages over standard global barriers. The most important is that nearest-neighbor synchronization allows at least some of the induced load imbalance to smooth out before it delays all processors. However, this benefit usually occurs only if there are multiple nearest-neighbor synchronization events invoked in sequence. If nearest-neighbor synchronization and global barriers are executed in alternating sequence, opportunities to smooth out load imbalance are lessened. Second, the serial bottleneck of the barrier master is avoided. This is not a large advantage for the size of the system that we are currently evaluating, but should be significant for larger systems. Finally, common messages can be used to carry both synchronization and data, because both flow only between neighbors.

2.3 Experimental Results

2.3.1 Applications
We evaluated the performance of our compiler/software DSM interface with the five programs shown in Table 1. The "Granularity" column refers to the average length in seconds of a parallelized loop. Except where indicated, numbers below refer to the larger data set for each application. expl and redblack are dense stencil kernels typically found in iterative PDE solvers. jacobi is a stencil kernel combined with a convergence test that checks the residual value using a max reduction. swm and tomcatv are programs from the SPEC benchmark suite containing a mixture of stencils and reductions. We used the version of tomcatv from APR whose arrays have been transposed to improve data locality.
Table 1: Applications

Name       Description                             Problem Sizes   Granularity (sec)
expl       dense stencil kernel                    256², 512²      0.06, 0.34
jacobi     Jacobi Iteration w/ Convergence Test    -               -
redblack   dense stencil kernel                    512², 1024²     0.01, 0.14
swm        Shallow Water Model (SPEC)              -               -
tomcatv    SPEC (APR transposed version)           256², 512²      0.04, 0.15
All applications were originally written in Fortran, and typically contain an initialization section followed by iterations of a time-step loop. Statistics and timings are collected after the initialization section. Optimized versions of each program were automatically generated by the SUIF compiler. The compiler analyzed but was unable to apply synchronization optimizations to four other programs, indicating that not all programs may benefit from the techniques presented in this paper.

2.3.2 Experimental Environment
We evaluated our optimizations on an IBM SP-2 with 66 MHz RS/6000 Power2 processors running AIX 4.1. Nodes are connected by a 120 Mbit/sec bi-directional Omega switch capable of a sustained bandwidth of approximately 40 Mbytes per second. Simple RPCs on the SP-2 require 160 μsecs. A one-hop page miss, where the page manager is also the owner, requires two messages and 939 μsecs. Two-hop page misses require three messages and 1376 μsecs. In the best case, AIX requires 128 μsecs to call user-level handlers for page faults. However, virtual memory and mprotect primitive costs in the current system are location-dependent, occasionally increasing these costs to a millisecond or more.
In our experiments, CVM [15] applications written in Fortran 77 were automatically parallelized by the Stanford SUIF parallelizing compiler version 1.1.2 [11], with close to 100% of the computation in parallel regions. A simple chunk scheduling policy assigns contiguous iterations of equal or near-equal size to each processor, resulting in a consistent computation partition that encourages good locality. The resulting C output code was compiled by g++ version 2.7.2 with the -O2 flag, then linked with the SUIF run-time system and the CVM libraries to produce executable code on the IBM SP-2. Customized support for reductions and the flush update protocol were used to improve overall performance [17].
2.3.3 Effectiveness of Compiler Synchronization Optimizations
First, we examine the effectiveness of the compiler algorithms in eliminating synchronization. Table 2 displays the number of barriers found in each program at compile time, and the percentage eliminated by different levels of optimization. Table 3 presents the same information for barriers executed dynamically at run time for each application. The first two columns for "barriers" indicate the number of barriers executed in the original program and the percentage reduction obtained by merging doalls into the same parallel region. The remaining columns show the percentage of barriers eliminated or replaced by nearest-neighbor synchronization at different levels of optimization. The compiler optimization levels are as follows: "dep" eliminates barriers in parallel regions based on data dependences, "comm" performs communication analysis using local subscript analysis to eliminate barriers, and "lazy" eliminates barriers guarding only anti-dependences. Communication analysis may also replace barriers with nearest-neighbor synchronization. Optimizations are cumulative.

Table 2: Static Measurement of Synchronization Optimizations (barriers in program; % eliminated by dep, comm, and lazy)

Table 3: Dynamic Measurement of Synchronization Optimizations (barriers executed; % eliminated [% replaced] by merge, dep, comm, and lazy)
We see from both tables that the compiler is effective at eliminating parallel loops and barriers encountered during compilation, with roughly similar benefits for the number of parallel loops and barriers actually executed by the application at run time. Examining the run-time measurements in Table 3, we find the compiler is quite successful in discovering parallel loops which may be merged into a single parallel region, eliminating on average 30% of barriers executed, which is equivalent to a 60% elimination of doall invocations. Dependence analysis alone is only able to eliminate barriers in two programs, redblack and swm, but the improvement in redblack is significant. Communication analysis can eliminate barriers in tomcatv and replace barriers in expl, jacobi, and redblack. Detecting barriers guarding only anti-dependences, the compiler can eliminate more barriers outright in expl and jacobi. The number of replaced barriers goes down in expl and jacobi, since the compiler can prove some nearest-neighbor barriers guard only anti-dependences. Applying all optimizations, on average 53% of all barrier executions are eliminated in these five programs, with 6% of barriers replaced by nearest-neighbor synchronization.
2.3.4 Impact of Compiler Optimizations on Synchronization Overhead
In order to evaluate synchronization optimizations in more detail, we instrumented CVM to directly measure barrier overhead (time spent executing barrier code), load imbalance (idle time waiting for completion of the current parallel task), and sequential wait (idle time waiting for the next parallel task). We found the actual time spent in barrier routines to be small (refer to Figure 1). Instead, most of the overhead was caused by sequential wait and load imbalance. Table 4 shows sequential wait and load imbalance as percentages of overall execution time. The measurements are for 16-processor SP-2 runs with small and large data sets. Average sequential wait and load imbalance are 14% and 27% of execution time for large data sets, and 23% and 31% for small data sets, respectively. Load imbalance takes up a smaller percentage of execution time with large data sets, because the larger amount of data on individual processors gives a greater chance to smooth out load imbalance.
Table 4: Effect of Optimizations on Load Imbalance and Sequential Wait (16 Processors); per-program sequential wait and imbalance, small (sm) and large (lg) data sets.

Figure 5: Effect of Optimizations on Load Imbalance (16 Processors); bars show sequential wait and imbalance as a percentage of unoptimized execution time for the Unoptimized, Dep analysis, Comm analysis, and Lazy analysis versions of each program.
We also measured the impact of synchronization optimizations on reducing load imbalance. Table 4 displays the percentage reduction in idle time after applying different levels of optimization. Figure 5 graphically presents the same information for the large data sets. We see that optimizations can significantly reduce sequential wait and load imbalance for some of the applications studied (a 72% decrease for redblack with the small data set). Average sequential wait and load imbalance decrease by 28%, 32%, and 34% with large data sets, and by 34%, 37%, and 43% with small data sets, for the three optimization levels respectively.
2.3.5 Impact of Compiler Optimizations on Program Performance

Figure 6 displays the impact of synchronization optimizations on application performance on the SP-2, for both small and large data sets. For each graph, the Y-axis measures improvement over unoptimized programs, and the X-axis presents three optimized versions of each program for both 8- and 16-processor runs: "dep analysis" merges parallel regions and eliminates barriers based on data dependences, "comm analysis" uses local subscript analysis to eliminate barriers or replace them with nearest-neighbor synchronization, and "lazy analysis" eliminates barriers guarding only anti-dependences. Except for redblack and swm, performance for "dep analysis" is the same as that for simply merging adjacent parallel loops into the same parallel region. Optimizations are cumulative.
Performance for the applications covers a broad range. For 16-processor runs with large data sets, the average improvement from synchronization optimizations is 14% for dependence analysis and 17% for communication analysis.
Table 5: Application performance at successive optimization levels (unopt, dep, comm, lazy) for small (sm) and large (lg) data sets.
The number of messages tripled for some applications when using nearest-neighbor barriers. Nonetheless, the overall impact on performance is limited. The primary reason is that many of the barrier synchronizations that are prime candidates to be replaced by nearest-neighbor synchronization can be eliminated instead. However, nearest-neighbor synchronization may prove more important when using more than 16 processors.
Table 6: Impact of Nearest-Neighbor Barriers (16 Processors); % improvement (dep vs. comm) for expl, redblack, tomcatv, and on average.
3 Data Layout Transformations

In addition to synchronization optimizations, we are also investigating data layout transformations for both uniprocessor and multiprocessor architectures. Because of the increasing gap between processor and memory speeds, applications that do not use caches effectively spend most of their time waiting for memory accesses. Due to hardware constraints, caches have limited set associativity: a memory address can be mapped only to one of a small number of locations in a set-associative cache. Limited set associativity can reduce the amount of reuse programs exploit. Conflict misses may occur when too many data items map to the same set of cache locations, causing cache lines to be flushed from the cache before they may be reused, despite sufficient capacity in the overall cache. Conflict misses have been found to be a significant source of poor performance in scientific programs, particularly within loop nests [18].
3.1 Eliminating Conflict Misses

Compiler transformations can be very effective in eliminating conflict misses for scientific programs with regular access patterns. We evaluate two compiler transformations for eliminating conflict misses: 1) inter-variable padding (modifying variable base addresses), and 2) intra-variable padding (increasing array dimension sizes) [2,24]. Unlike compiler transformations that restructure the computation performed by the program, these two techniques modify the program's data layout.
Two motivating examples of these transformations are shown in Figures 7 and 8. In Figure 7, unit-stride references to B(i) and C(i) provide spatial locality, leading to cache reuse. However, if B and C are separated by a multiple of the cache size and the cache is direct-mapped, every access to C(i) will map to the same cache line as the previous B(i) access, and vice versa. As a result, every reference will produce a conflict miss. To regain the benefits of spatial reuse, inter-variable padding may be applied as shown in Figure 7.
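Figure 7 itself is not reproduced here; the following sketch recreates the situation with assumed parameters (a 16KB direct-mapped cache with 32-byte lines, as in the experiments of Section 3.5) and uses a struct to pin the relative placement of the arrays, much as the suifmem variable of Section 3.4 does:

#include <stdio.h>

#define N   2048          /* 2048 doubles = 16KB, exactly the cache size */
#define PAD (4 * 32 / 8)  /* four 32-byte lines of padding, in doubles   */

/* Placing the arrays in one struct fixes their relative addresses.
   Without pad1, B[i] and C[i] would lie exactly one cache size apart
   and evict each other on every iteration of the loop below.         */
static struct {
    double B[N];
    double pad1[PAD];   /* inter-variable pad between B and C */
    double C[N];
    double pad2[PAD];   /* keeps A clear of B and C as well   */
    double A[N];
} mem;

int main(void) {
    for (int i = 0; i < N; i++)            /* unit-stride, spatial reuse */
        mem.A[i] = mem.B[i] + mem.C[i];
    printf("A[0] = %g\n", mem.A[0]);
    return 0;
}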
In Figure 8, the stencil computation exhibits much spatial and temporal reuse between the references to B. However, if the column size of B is a multiple of the cache size, conflict misses will result, eliminating reuse between references to B. To regain spatial reuse, intra-variable padding applied to the columns of B can change its internal layout so that columns no longer map to conflicting locations in the cache, as shown in Figure 8.
Both of these examples show instances of severe conflict misses, where conflict misses occur on every loop iteration. In the remainder of this section, we evaluate several heuristics for applying inter- and intra-variable padding.
      real A(N,N), B(N,N)
      do j = 1, N
        do i = 1, N
          A(i,j) = 0.25 * (B(i+1,j) + B(i-1,j)
                         + B(i,j+1) + B(i,j-1))

-->   real A(N,N), B(N+PAD,N)

Figure 8: Intra-Variable Padding (Change Column Dimension)
3.2 Analysis-Free Padding

3.2.1 Inter-variable Padding Heuristic

Based on empirical studies [22], our implementation currently separates array base addresses by four cache lines. An alternative heuristic which separates arrays as far as possible in the cache was shown to be less effective in eliminating conflict misses [22].
3.2.2 Intra-variable Padding Heuristic

A possible intra-variable padding heuristic is to pad the dimensions of each array so that the sizes of subarrays comprising lower dimensions are far enough from any cache-size multiple. In the simplest case, array columns would be padded when neighboring columns are mapped to the same location in the cache. Empirical studies [22] indicate this heuristic misses many profitable opportunities for intra-variable padding. As a result, our implementation currently pads all array columns by four elements unconditionally when safe. It should be clear that this heuristic is not practical; however, our experiments demonstrate it is quite effective in identifying many programs which stand to benefit from intra-variable padding.
3.3Analysis-BasedPadding
Formoresophisticatedpaddingheuristics,additionalcompileranalysisofarrayreferencesisrequired.First,wefinduniformlygeneratedreferencesdescribedbyGannonetal.[9].Wethencomputetheconflictdistance,thedifferencebetweenthememory
addressesoftwoarrayreferencesmodulothecachesize
.Ifoneachloopiterationtheconflictdistanceislessthanthecache
linesize
,severeconflictmissescanresult.Observethatmakingtheconflictdistanceorgreaterisasufficient(thoughnotnecessary)conditionforeliminatingsevereconflicts;weusethistoguideourdatalayoutoptimizations.Ourtechniquesforcomputingconflictdistancesisaspecialsimplifiedcaseofcachemissequations[10].
3.3.1 Inter-variable Padding Heuristic
Central to our technique for eliminating severe conflict misses is considering pairs of array references with constant conflict distances on each loop iteration. We find that such pairs are commonly uniformly generated references [9] extended for conforming arrays: arrays that have equal dimension sizes in all but their highest dimension and possess equal-sized elements. A pair of uniformly generated references to n-dimensional conforming arrays A and B has the form A(v_1 + c_1, ..., v_n + c_n) and B(v_1 + d_1, ..., v_n + d_n), where each v_k is an integer variable or the value 0, and each c_k and d_k is some constant integer value. Uniformly generated references may include references nested in loops with indices v_k, references to 1-dimensional arrays of arbitrary size, and references with integer subscripts (in which case v_k reduces to 0).
When we calculate the conflict distance on a given loop iteration by subtracting the two memory locations accessed by uniformly generated references, we obtain an expression in which all variable terms cancel:

    ConflictDist = \Big( BaseAddr_A - BaseAddr_B + e \sum_{k=1}^{n} (c_k - d_k) \prod_{j=1}^{k-1} D_j \Big) \bmod C    (1)

In this expression, BaseAddr_A and BaseAddr_B are the base addresses of A and B, e is the common element size, and D_j is the common size of dimension j. A lower bound of 0 is assumed in each dimension, but other lower bounds may be accounted for by folding them into the constants c_k and d_k. Taking the result modulo the cache size C then yields the conflict distance.
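As a sketch (ours, not SUIF's implementation), expression (1) transcribes directly into code; we also fold the result symmetrically around the cache size so that wraparound near C is treated as a small distance:

#include <stdio.h>

/* Conflict distance per expression (1).  c[k], d[k]: constant subscript
   offsets of the two references; D[k]: common dimension sizes (innermost
   first); e: element size; C: cache size in bytes.                      */
long conflict_distance(int n, const long c[], const long d[],
                       const long D[], long base_a, long base_b,
                       long e, long C) {
    long dist = base_a - base_b, stride = e;
    for (int k = 0; k < n; k++) {          /* e * sum (c_k - d_k) * prod D_j */
        dist += (c[k] - d[k]) * stride;
        stride *= D[k];
    }
    dist %= C;                             /* wrap into the cache           */
    if (dist < 0) dist += C;
    return dist < C - dist ? dist : C - dist;  /* symmetric distance        */
}

int main(void) {
    /* B(i+1,j) vs B(i-1,j) in a 2048x2048 array of 8-byte elements:
       distance = 2 elements = 16 bytes, below a 32-byte line, so the
       pair reuses one cache line rather than conflicting.             */
    long c[2] = {1, 0}, d[2] = {-1, 0}, D[2] = {2048, 2048};
    printf("%ld\n", conflict_distance(2, c, d, D, 0, 0, 8, 16384));
    return 0;
}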
We wish to identify all pairs of uniformly generated references which may conflict. Our algorithm therefore analyzes each loop to find such references. Since the uniformly generated relation is transitive (if references a and b are uniform and references b and c are uniform, then a and c are uniform as well), pairs of uniformly generated references in a loop can be efficiently represented as groups of uniformly generated references. Each reference in a group is uniform with respect to all other references in the group. Our initial analysis is thus to compute, for every loop, the set of maximal groups of uniformly generated references.
We apply inter-variable padding as follows. The objective is to assign each array a base address so that no two uniformly generated references in a group conflict. We accomplish this in a simple, greedy fashion. All array base addresses are initially undefined. Then, each array in turn is assigned a base address as follows. First, the next available address is attempted. If this address leads to a conflict with any other array with a defined base address, the address is incremented by the cache line size L, and the process repeats. The test for a conflict between two arrays involves instantiating expression (1) with the c_k and d_k values from all pairs referencing the two arrays in each of the analyzed groups, and the e, D_j, and BaseAddr values imposed by the arrays under consideration. If the conflict distance obtained is less than L, then we assume the addresses conflict.
A base address needs to be incremented at most C/L times, since a conflict at each of the first C/L candidate addresses implies conflicts for any remaining values obtained by adding further multiples of L. If this happens, the algorithm has failed to find a satisfactory address and assigns the first, unpadded address attempted. In our experiments the compiler could always find a non-conflicting base address.

3.3.2 Intra-variable Padding Heuristic
Intra-array padding aims to modify the distances between references to the same array in order to eliminate conflicts. The notion of uniformly generated references is again useful in identifying common forms of array references accessing locations a fixed distance apart. By the definition given in the last section, two references to the same array are uniformly generated as long as they are of the form A(v_1 + c_1, ..., v_n + c_n) and A(v_1 + d_1, ..., v_n + d_n), where each v_k is an unmodified variable or 0, and each c_k and d_k is an arbitrary integer. All other conditions are always satisfied since the two references access the same array. Expression (1) in this special case reduces to

    ConflictDist = \Big( e \sum_{k=1}^{n} (c_k - d_k) \prod_{j=1}^{k-1} D_j \Big) \bmod C    (2)

since the base addresses (and any array lower bounds) cancel each other out. Stated in terms of expression (2), intra-array padding modifies the D_j values for each array as needed to produce non-conflicting distances between all uniformly generated references.

We apply intra-variable padding as follows. First, each loop is analyzed to compute maximal groups of uniformly generated references to the same array. Arrays passed to procedures are not padded, since no interprocedural analysis is performed. Second, the compiler instantiates expression (2) with the c_k and d_k values of all reference pairs from each group containing references to the array,
and the D_j values imposed by the current array layout. If the resulting distance conflicts, a pad of one element is attempted on all of the lower dimensions, and the layout is reanalyzed until no conflicts are found. If the pad size fails, it is incremented by one and the analysis repeated. An upper bound on the padding size is imposed to ensure termination. In our experiments, no pad greater than two elements was required for any program encountered.
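A sketch of this search, reduced to the single stencil pair B(i,j+1) versus B(i,j-1) from Figure 8 for brevity (parameters assumed: 16KB cache, 32-byte lines, 8-byte elements):

#include <stdio.h>

#define C 16384L   /* cache size in bytes   */
#define L 32L      /* line size in bytes    */
#define E 8L       /* element size (double) */

/* Symmetric conflict distance between B(i,j+c) and B(i,j+d) when the
   column size is col elements: expression (2) with offset difference
   (c - d) in the column dimension and 0 in the contiguous dimension. */
static long distance(long col, long c, long d) {
    long dist = ((c - d) * col * E) % C;
    if (dist < 0) dist += C;
    return dist < C - dist ? dist : C - dist;
}

int main(void) {
    long col = 2048;           /* declared column size: a cache multiple */
    long pad = 0, bound = 8;   /* small bound guarantees termination     */
    while (pad <= bound &&
           distance(col + pad, 1, -1) < L)   /* B(i,j+1) vs B(i,j-1)     */
        pad++;                               /* grow the pad and retry   */
    printf("chosen column pad: %ld elements\n", pad);
    return 0;
}

With these parameters the search stops at a pad of two elements, matching the small pads observed in our experiments.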
3.4 Prototype SUIF Implementation
To evaluate the impact of compiler analysis on padding heuristics, each of the four transformations (analysis-free and analysis-based intra- and inter-variable padding) was implemented as a pass in SUIF. When applying both intra- and inter-variable padding in combination, we perform intra-variable padding before inter-variable padding. Before performing padding, a number of transformations are applied to each program in a preprocessing phase. Preprocessing begins with constant propagation, which can have the effect of transforming array subscripts involving the addition of multiple variables into the form required of uniformly generated references. In this way, constant propagation can improve the precision of the analysis.
The remaining transformations are performed to give the optimizer compile-time control over variable base addresses. First, local array and structure variables are promoted into the global scope. Formal parameters to functions cannot and do not need to be promoted, since they represent variables declared elsewhere. Array and structure global variables are then made into fields of suifmem, a large structured variable (or common block in Fortran). Structures/common blocks which are themselves fields of suifmem are recursively split so that individual variables may be moved about as necessary. Preprocessing thus results in a single global variable containing all of the variables to be optimized, as shown in Figure 9. Optimizing passes may now modify the base addresses of variables by reordering fields in the suifmem structure and inserting pad variables.

3.5 Experimental Evaluation
Each transformation was applied to a number of scientific kernels and applications taken from the SPEC92 and NAS benchmarks.
Table 7: Static padding measurements. Programs: adi32, dgefa256, dot256, erle, expl512, irr500k, jacobi512, linpackd, mult300, rb512, shal512, appbt, applu, appsp, buk, embar, fftpde. Columns: safe, padded, inter-array padding lines skipped, and array references (num, % uniform, % not global).
int A[100];
int B[100];
foo() {
  for (i)
    A[i] = B[i];
}

-->

struct _globmem {
  int A[100];
  int B[100];
} suifmem;

foo() {
  for (i)
    suifmem.A[i] = suifmem.B[i];
}

Figure 9: Globalization in preprocessing stage
Although increments may be applied repeatedly to each candidate base address, in each test the combined total for all variables was always less than 20.
The final columns categorize the array references analyzed in each program. The four columns under array references indicate the number of array references (num), the percentage of array references which enter some group (uniform), the percentage of global array references which by their form cannot be uniformly generated (not uniform), and the percentage of array references which aren't global (not global). (Thus, uniform + not uniform + not global = 100.) Only uniform references are analyzed; not uniform and not global references, though common in a number of programs, are ignored. Results show most references were uniformly generated, though many belonged to procedure parameters which could not be transformed.

3.5.2 Cache Miss Rates
Table 8 presents the impact of optimizations on simulated cache miss rates. For each program, miss rates are given for the original version and a number of versions transformed by the analysis-free and analysis-based heuristics. The best improvement compared to the original program is presented both as the drop in cache miss rate (best improve) and as the percent decrease in miss rate (best improve %). Overall, data layout optimizations are very effective in eliminating cache misses in many programs. Cache miss rates are reduced by 5% or more in eleven of the 24 programs tested, and by 20% or more in six programs. Degradations in performance do occur. No optimized version outperforms the original in the cases of doduc, fpppp, hydro2d, and appbt. Miss rate increases of 0.1% occur in three other programs as well. However, all of these increases are relatively mild. In no program do miss rates rise by more than 1%.
Table 8: Impact of Optimizations on Cache Miss Rates (16K direct-mapped cache, 32B lines)
In comparing the analysis-free approach with the analysis-based approach, we find that while each may perform better in certain cases, over the entire range of programs tested the analysis-based approach performs somewhat better. The row labeled Average reveals the magnitude of these differences. Programs undergoing intra-variable padding without analysis average a 9.7% miss rate, versus 9.5% under the analysis-based transformation. A similar 9.3% to 9.0% ratio results in the case of composed intra- plus inter-variable padding. Since the analysis-free intra-variable transformation pads unconditionally, it inevitably degrades performance in a number of programs. The worst example is adi32, for which the miss rate increases by over 9%. The analysis-based intra-variable transformation makes no serious mistakes.
Many such weaknesses in our present implementation of analysis-based padding may stem from the simplicity of its analysis. For instance, references to arrays which are formal parameters are ignored during analysis. Interprocedural analysis will enable the compiler to map these formal parameters to one or more global arrays (or to offsets of global arrays), allowing them to be optimized.
Additional scalar analysis should improve the level of analysis as well. In particular, the analysis may stand to benefit considerably from the forward propagation of variable assignments into array subscripts. We illustrate this point with sample code from the SPEC benchmark tomcatv. The nest in Figure 10 contains 8 array references whose subscripts include 6 different variables. However, variables JP and JM contain the values J+1 and J-1 throughout the nest, and IP and IM are related to I similarly. When we propagate the assignments of these variables into the array subscripts, all 8 references can be confirmed as uniformly generated and enter the same group during analysis. Without forward propagation, analysis fails to detect this relationship, producing 4 groups with 2 references each.
We should note that in our tests we used a version of tomcatv with the forward propagation described above. When applied to the unmodified tomcatv, our analysis-based transformations obtained miss rates of 13.7%. Forward propagation reduced these values to 7.7%. These results indicate that data layout optimizations based on uniformly generated references will be most effective only in conjunction with standard forms of scalar analysis.

3.5.3 Execution Time
Versions of programs exhibiting a significant (5% or more) reduction in miss rates were timed on DEC Alphas, as shown in Table 9. Results demonstrate that the observed miss rate reductions translate into improvements in performance: all execution times decrease by 10% or more. For the programs given in this table, the analysis-based transformations again do slightly better overall. However, with respect to the intra+inter averages presented, the analysis-based heuristic owes a significant share of its advantage to the degradation experienced in the version of adi32 transformed by its analysis-free counterpart.

Table 9: Impact of Optimizations on Execution Times for DEC Alpha (in seconds); programs: adi32, dot256, expl512, shal512, tomcatv, buk; versions: original, analysis-free intra+inter, analysis-based intra+inter.
3.5.4 Discussion
In general, we found small inter-variable pads were sufficient to eliminate the worst of the severe conflict misses. Both the analysis-free and analysis-based heuristics were successful, though the analysis-based heuristic is more accurate for adi32. Only two programs (dgefa256, erle) benefit from intra-variable padding. The analysis-based heuristic identified the need for intra-variable padding for erle, but fails for dgefa256. The analysis-free heuristic of padding all array columns degrades the performance of adi32.
The advantage of analysis-based heuristics thus appears to be the ability to usually detect profitable opportunities for padding, while avoiding padding decisions which degrade performance. Previous studies indicate padding transformations become more important as the ratio of application data to cache size increases [22], so analysis-based heuristics may become more important for real-world programs. Our results show not all programs were greatly improved, but the impact is significant enough that data layout optimizations should be considered an important tool for optimizing compilers.
4 Related Work

4.1 Synchronization Optimizations

Researchers compiling for fine-grain data-parallel languages sought to eliminate barriers following each expression evaluation [12,20]. Simple data dependence analysis can be used to reduce barrier synchronization by orders of magnitude, greatly improving performance. Eliminating barriers in compiler-parallelized codes is more difficult. In previous work [25] we presented techniques to eliminate or lessen synchronization based on the communication analysis used by distributed-memory compilers to calculate explicit communication [13]. O'Boyle and Bodin [19] present techniques similar to local subscript analysis to identify data dependences that cross processor boundaries. Stohr and O'Boyle [23] extend this work, presenting an optimal algorithm for eliminating barrier synchronization in perfectly nested loops.

Recently researchers have investigated optimizations when compiling for software DSMs. Dwarkadas et al. applied compiler analysis to explicitly parallel programs to improve their performance on a software DSM [7]. Chandra and Larus evaluated combining the PGI HPF compiler and the Tempest software DSM system [4]. Cox et al. conducted an experimental study to evaluate the performance of TreadMarks as a target for the Forge SPF shared-memory compiler from APR [6]. They identify opportunities to eliminate unneeded barrier synchronization; many of their suggestions are implemented in the SUIF/CVM system.

Rajamony and Cox developed a performance debugger for detecting unnecessary synchronization at run time by instrumenting all loads and stores [21]. In the SPLASH application Water, it was able to detect barriers guarding only anti and output dependences that may be eliminated by applying odd-even renaming. In comparison, SUIF eliminates many barriers guarding only anti-dependences at compile time.

4.2 Data Locality Optimizations

Data locality has been recognized as a significant performance issue for both scalar and parallel architectures. Wolf and Lam provide a concise definition and summary of important types of data locality [26]. Gannon et al. introduce the notion of uniformly generated references as a means of discovering group reuse between references to the same array [9]. McKinley and Temam performed a study of loop-nest-oriented cache behavior for scientific programs and concluded that conflict misses cause half of all cache misses and most intra-nest misses [18]. Most researchers have concentrated on computation-reordering transformations. Loop permutation and tiling are the primary optimization techniques [3,9,26], though loop fission (distribution) and loop fusion have also been found to be helpful [3].

Researchers have previously examined changing data layout in parallel applications, but have generally not studied its effect on sequential applications. Jeremiassen and Eggers automatically eliminate false sharing in explicitly parallel programs [14]. Cierniak and Li examined combining array transpose and loop transformations to improve locality for parallel programs [5]. Amarasinghe et al. demonstrated the utility of data layout transformations for parallel applications [1]. They found it to be significant in eliminating adverse cache effects, though specialized optimizations were necessary to reduce computation overhead for modified array subscripts.

Many researchers have also examined the problem of deriving estimates of cache misses in order to help guide data locality optimizations [8,9,26]. These models typically can predict only capacity misses because they assume a fully-associative cache. In comparison, Ghosh et al. can determine conflict misses by calculating cache miss equations, linear Diophantine equations that summarize each loop's memory behavior by calculating exactly the cache slot to which each reference is mapped [10]. They demonstrate how to use cache miss equations to select array paddings to eliminate conflict misses, and block sizes for tiling. In comparison, we focus only on severe conflicts, using a simplified version of cache miss equations to calculate the conflict distance between pairs of uniformly generated array references. Our approach is less precise, but should be more efficient.
5 Conclusions

This paper describes ongoing SUIF compiler projects for high performance architectures at the University of Maryland. First, we are investigating compilation techniques for eliminating barrier synchronization overhead in compiler-parallelized programs running on software distributed-shared-memory (DSM) systems. Second, we are evaluating data layout optimizations, starting with techniques to eliminate cache conflict misses through variable padding. Our optimizations have been implemented in SUIF and tested on a number of programs. Preliminary experimental results are encouraging, but indicate more careful compiler analysis is needed to guide optimization heuristics for both parallel and sequential codes. We intend to continue work in both areas in pursuit of our goal: efficient machine-independent programming of both sequential and parallel processors.
References
[1] J. Anderson, S. Amarasinghe, and M. Lam. Data and computation transformations for multiprocessors. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Santa Barbara, CA, July 1995.
[2] W. Bolosky, R. Fitzgerald, and M. Scott. Simple but effective techniques for NUMA memory management. In Proceedings of the Twelfth Symposium on Operating Systems Principles, Litchfield Park, AZ, December 1989.
[3] S. Carr, K. S. McKinley, and C.-W. Tseng. Compiler optimizations for improving data locality. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), San Jose, CA, October 1994.
[4] S. Chandra and J. R. Larus. Optimizing communication in HPF programs for fine-grain distributed shared memory. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Las Vegas, NV, June 1997.
[5] M. Cierniak and W. Li. Unifying data and control transformations for distributed shared-memory machines. In Proceedings of the SIGPLAN '95 Conference on Programming Language Design and Implementation, La Jolla, CA, June 1995.
[6] A. Cox, S. Dwarkadas, H. Lu, and W. Zwaenepoel. Evaluating the performance of software distributed shared memory as a target for parallelizing compilers. In Proceedings of the 11th International Parallel Processing Symposium, Geneva, Switzerland, April 1997.
[7] S. Dwarkadas, A. Cox, and W. Zwaenepoel. An integrated compile-time/run-time software distributed shared memory system. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), Boston, MA, October 1996.
[8] J. Ferrante, V. Sarkar, and W. Thrash. On estimating and enhancing cache effectiveness. In U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, editors, Languages and Compilers for Parallel Computing, Fourth International Workshop, Santa Clara, CA, August 1991. Springer-Verlag.
[9] D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformation. Journal of Parallel and Distributed Computing, 5(5):587–616, October 1988.
[10] S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: An analytical representation of cache misses. In Proceedings of the 1997 ACM International Conference on Supercomputing, Vienna, Austria, July 1997.
[11] M. Hall, S. Amarasinghe, B. Murphy, S. Liao, and M. Lam. Detecting coarse-grain parallelism using an interprocedural parallelizing compiler. In Proceedings of Supercomputing '95, San Diego, CA, December 1995.
[12] P. Hatcher and M. Quinn. Data-Parallel Programming on MIMD Computers. The MIT Press, Cambridge, MA, 1991.
[13] S. Hiranandani, K. Kennedy, and C.-W. Tseng. Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, 35(8):66–80, August 1992.
[14] T. Jeremiassen and S. Eggers. Reducing false sharing on shared memory multiprocessors through compile time data transformations. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Santa Barbara, CA, July 1995.
[15] P. Keleher. The relative importance of concurrent writers and weak consistency models. In 16th International Conference on Distributed Computing Systems, Hong Kong, May 1996.
[16] P. Keleher, A. L. Cox, and W. Zwaenepoel. Lazy release consistency for software distributed shared memory. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 13–21, May 1992.
[17] P. Keleher and C.-W. Tseng. Enhancing software DSM for compiler-parallelized applications. In Proceedings of the 11th International Parallel Processing Symposium, Geneva, Switzerland, April 1997.
[18] K. S. McKinley and O. Temam. A quantitative analysis of loop nest locality. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), Boston, MA, October 1996.
[19] M. O'Boyle and F. Bodin. Compiler reduction of synchronization in shared virtual memory systems. In Proceedings of the 1995 ACM International Conference on Supercomputing, Barcelona, Spain, July 1995.
[20] M. Philippsen and E. Heinz. Automatic synchronization elimination in synchronous FORALLs. In Frontiers '95: The 5th Symposium on the Frontiers of Massively Parallel Computation, McLean, VA, February 1995.
[21] R. Rajamony and A. L. Cox. A performance debugger for eliminating excess synchronization in shared-memory parallel programs. In Proceedings of the Fourth International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), February 1996.
[22] G. Rivera and C.-W. Tseng. Compiler optimizations for eliminating cache conflict misses. Technical Report CS-TR-3819, Dept. of Computer Science, University of Maryland at College Park, July 1997.
[23] E. Stohr and M. O'Boyle. A graph based approach to barrier synchronisation minimisation. In Proceedings of the 1997 ACM International Conference on Supercomputing, Vienna, Austria, July 1997.
[24] J. Torrellas, M. Lam, and J. Hennessy. Shared data placement optimizations to reduce multiprocessor cache miss rates. In Proceedings of the 1990 International Conference on Parallel Processing, St. Charles, IL, August 1990.
[25] C.-W. Tseng. Compiler optimizations for eliminating barrier synchronization. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Santa Barbara, CA, July 1995.
[26] M. E. Wolf and M. Lam. A data locality optimizing algorithm. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation, Toronto, Canada, June 1991.