CyclicDist runs slower on multiple locales

I tried to implement matrix multiplication using the CyclicDist module.

When I test one locale against two locales, the single locale runs much faster. Is this because the communication time between the two Jetson Nano boards is really large, or is my implementation not taking advantage of how CyclicDist works?

Here is my code:

use Random, Time, CyclicDist;
var t : Timer;
t.start();

config const size = 10;
const Space = {1..size, 1..size};

const gridSpace = Space dmapped Cyclic(startIdx=Space.low);
var grid: [gridSpace] real;
fillRandom(grid);
const gridSpace2 = Space dmapped Cyclic(startIdx=Space.low);
var grid2: [gridSpace2] real;
fillRandom(grid2);
const gridSpace3 = Space dmapped Cyclic(startIdx=Space.low);
var grid3: [gridSpace] real;
forall i in 1..size do {
    forall j in 1..size do {
        forall k in 1..size do {
            grid3[i,j] += grid[i,k] * grid2[k,j];
        }
    }
}
t.stop();
writeln("Done!:");
writeln(t.elapsed(),"seconds");
writeln("Size of matrix was:", size);
t.clear();

I know that my implementation is not optimal for distributed-memory systems.

Bofo 14.12.2019

comment

Does it run on 551? - Galaxy 14.12.2019

Answers (2)


Probably the main reason this program does not scale is that the computation never uses any locale other than the initial one. Specifically, forall loops over ranges, like the ones in your code:

forall i in 1..size do

always run all of their iterations using tasks executing on the current locale. This is because ranges are not distributed values in Chapel, so their parallel iterators do not distribute work across locales. As a result, all size**3 executions of the loop body:

grid3[i,j] += grid[i,k] * grid2[k,j];

will run on locale 0, and none of them will run on locale 1. You can see that this is the case by putting the following into the body of the innermost loop:

writeln("locale ", here.id, " running ", (i,j,k));

(where here.id evaluates to the ID of the locale on which the current task is running). This will show that locale 0 runs all of the iterations:

locale 0 running (9, 1, 1)
locale 0 running (1, 1, 1)
locale 0 running (1, 1, 2)
locale 0 running (9, 1, 2)
locale 0 running (1, 1, 3)
locale 0 running (9, 1, 3)
locale 0 running (1, 1, 4)
locale 0 running (1, 1, 5)
locale 0 running (1, 1, 6)
locale 0 running (1, 1, 7)
locale 0 running (1, 1, 8)
locale 0 running (1, 1, 9)
locale 0 running (6, 1, 1)
...

Contrast this with running a forall loop over a distributed domain such as gridSpace:

forall (i,j) in gridSpace do
  writeln("locale ", here.id, " running ", (i,j));

where the iterations will be distributed between the locales:

locale 0 running (1, 1)
locale 0 running (9, 1)
locale 0 running (1, 2)
locale 0 running (9, 2)
locale 0 running (1, 3)
locale 0 running (9, 3)
locale 0 running (1, 4)
locale 1 running (8, 1)
locale 1 running (10, 1)
locale 1 running (8, 2)
locale 1 running (2, 1)
locale 1 running (8, 3)
locale 1 running (10, 2)
...
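
For larger sizes, printing every iteration gets unwieldy. As a minimal sketch (not part of the original answer), the same comparison can be made by counting iterations per locale. The names rangeCounts and distCounts are made up for this illustration, and the syntax follows the Chapel version used in the question; newer releases have renamed some of these features, so check the documentation for your version.

```chapel
use CyclicDist;

config const size = 10;
const Space = {1..size, 1..size};
const gridSpace = Space dmapped Cyclic(startIdx=Space.low);

// One atomic counter per locale (hypothetical names, for illustration only).
var rangeCounts, distCounts: [LocaleSpace] atomic int;

// forall over a plain range: every iteration runs on the current locale.
forall i in 1..size do
  rangeCounts[here.id].add(1);

// forall over the distributed domain: iterations run where the indices live.
forall (i,j) in gridSpace do
  distCounts[here.id].add(1);

for loc in LocaleSpace do
  writeln("locale ", loc, ": range iterations = ", rangeCounts[loc].read(),
          ", domain iterations = ", distCounts[loc].read());
```

When run on two locales, the range column should be non-zero only for locale 0, while the domain column should be split across both.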

Since all of the computation runs on locale 0 but half of the data is located on locale 1 (because the arrays are distributed), a lot of communication is generated to fetch remote values from locale 1's memory to locale 0's in order to compute on them.
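
A minimal sketch of one way to act on this, assuming the same declarations as in the question: drive the outer loop with the distributed domain so that each (i,j) iteration runs on the locale that owns grid3[i,j], and keep the k loop serial. The serial inner loop also avoids the race in the question's code, where many parallel k iterations update the same grid3[i,j] with +=. This is an illustration, not a tuned kernel; reads of grid[i,k] and grid2[k,j] can still be remote.

```chapel
use Random, CyclicDist;

config const size = 10;
const Space = {1..size, 1..size};

// A single distributed domain shared by all three arrays.
const gridSpace = Space dmapped Cyclic(startIdx=Space.low);

var grid, grid2, grid3: [gridSpace] real;
fillRandom(grid);
fillRandom(grid2);

// Iterating over the distributed domain runs each (i,j) on its owning locale.
forall (i,j) in gridSpace {
  var sum = 0.0;
  // Serial reduction over k: exactly one task writes each grid3[i,j].
  for k in 1..size do
    sum += grid[i,k] * grid2[k,j];
  grid3[i,j] = sum;
}
```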

Brad 16.12.2019

comment

Another tip on your code: since the values and distributions of gridSpace, gridSpace2, and gridSpace3 are identical, you can save time, space, and complexity by declaring just one domain and using it to declare all three arrays. - Brad 16.12.2019


Q : Is this because the time for communication₍₁₎ between the two Jetson nano boards is really big or is my implementation₍₂₎ not taking advantage of how CyclicDist works?

The second option is a sure bet: ~ 100 x worse performance was measured on CyclicDist-mapped data for small sizes.

The documentation explicitly warns about this, saying:

The Cyclic distribution maps indices to locales in a round-robin pattern, starting at a given index.
...
Limitations
This distribution has not been tuned for performance.
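
To see the round-robin mapping described in the quote, here is a small sketch in the style of the examples in the Chapel distribution documentation (the 8x8 size is an arbitrary choice, and the syntax follows the Chapel version used in the question). Each array element is set to the ID of the locale that stores it, so the printed array shows the ownership pattern for however many locales the program is run on.

```chapel
use CyclicDist;

const Space = {1..8, 1..8};
const D = Space dmapped Cyclic(startIdx=Space.low);
var A: [D] int;

// Each element records the ID of the locale that stores it.
forall a in A do
  a = a.locale.id;

writeln(A);
```

On a single locale every entry is 0, which matches the single-locale setup used for the measurements in this answer.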

The adverse effect on processing efficiency was demonstrated on a single-locale platform, where all data reside in local memory, so without any add-on cost for NUMA or inter-board communication. Still, ~ 100 x worse performance was measured compared to Vass's single-forall{} D3-iterated sum-product.

( So far, no performance gain has been observed from changing Vass's original forall-in-D3-do-{} into the other configured revision, the forall-in-D2-do-for{} tandem-iterated one. The small-sized tests run so far with --fast --ccflags -O3 showed almost half-length WORSE performance for the forall-in-D2-do-for{} iterator-in-iterator results, even worse than the O/P's original triple-forall{} proposal, except for sizes below 512x512 once the -O3 optimisation was in place, and even then only for the smallest size, 128x128.

The highest performance, ~ 850 [ns] per cell, was achieved by the original Vass-D3 single iterator, surprisingly without --ccflags -O3 ( which may obviously change for larger --size={ 1024 | 2048 | 4096 | 8192 } data layouts, all the more so if wider-NUMA, multi-locale, and higher-parallelism devices enter the race ). )

TiO.run platform uses 1 numLocales, having 2 physical CPU-cores accessible (numPU-s) with 2 maxTaskPar parallelism limit

Using CyclicDist affects the DATA layout in memory, doesn't it?

Validated by measurements on small sizes --size={128 | 256 | 512 | 640} with and without a minor --ccflags -O3 effect

// --------------------------------------------------------------------------------------------------------------------------------
// --fast
// ------
//
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took 255818 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product took 3075 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product took 3040 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product took 2198 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D3 orig sum-product took 1974 [us] excl. fillRandom()-ops <-- 127x SLOWER with CyclicDist dmapped DATA
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product took 2122 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took 252439 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took 2141444 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product took 27095 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product took 25339 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product took 23493 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D3 orig sum-product took 21631 [us] excl. fillRandom()-ops <-- 98x SLOWER then w/o CyclicDist dmapped data
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product took 21971 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took 2122417 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 16988685 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17448207 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product took 268111 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product took 270289 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product took 250896 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D3 orig sum-product took 239898 [us] excl. fillRandom()-ops <-- 71x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product took 257479 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17391049 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 16932503 [us] excl. fillRandom()-ops <~~ ~2e5 [us] faster without --ccflags -O3
//
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35136377 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product took 362205 [us] incl. fillRandom()-ops <-- 97x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product took 367651 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product took 345865 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the Vass-D3 orig sum-product took 337896 [us] excl. fillRandom()-ops <-- 103x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product took 351101 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35052849 [us] excl. fillRandom()-ops <~~ ~3e4 [us] faster without --ccflags -O3
//
// --------------------------------------------------------------------------------------------------------------------------------
// --fast --ccflags -O3
// --------------------
//
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took 250372 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product took 3189 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product took 2966 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product took 2284 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D3 orig sum-product took 1949 [us] excl. fillRandom()-ops <-- 126x FASTER than with dmapped CyclicDist DATA
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product took 2072 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took 246965 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took 2114615 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product took 37775 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product took 38866 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product took 32384 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D3 orig sum-product took 29264 [us] excl. fillRandom()-ops <-- 71x FASTER than with dmapped CyclicDist DATA
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product took 33973 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took 2098344 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17136826 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17081273 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product took 251786 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product took 266766 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product took 239301 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D3 orig sum-product took 233003 [us] excl. fillRandom()-ops <~~ ~6e3 [us] faster with --ccflags -O3
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product took 253642 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17025339 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17081352 [us] excl. fillRandom()-ops <~~ ~2e5 [us] slower with --ccflags -O3
//
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35164630 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product took 363060 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product took 489529 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product took 345742 [us] excl. fillRandom()-ops <-- 104x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D3 orig sum-product took 353353 [us] excl. fillRandom()-ops <-- 102x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product took 471213 [us] excl. fillRandom()-ops <~~~12e5 [us] slower with --ccflags -O3
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35075435 [us] excl. fillRandom()-ops
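
The benchmark source behind these figures is not shown here. As a rough sketch of how such a comparison could be set up, the following times the same sum-product kernel over a plain local domain and over a Cyclic-mapped domain, excluding the fillRandom() calls. The helper name timeProduct and the default size are assumptions made for this illustration, and Timer is used to match the question's Chapel version.

```chapel
use Random, Time, CyclicDist;

config const size = 128;
const Space = {1..size, 1..size};
const CyclicSpace = Space dmapped Cyclic(startIdx=Space.low);

// Hypothetical helper: times the sum-product over arrays declared on domain D.
proc timeProduct(D) {
  var a, b, c: [D] real;
  fillRandom(a);
  fillRandom(b);

  var t: Timer;
  t.start();                      // fillRandom() is excluded from the timing
  forall (i,j) in D {
    var sum = 0.0;
    for k in 1..size do
      sum += a[i,k] * b[k,j];
    c[i,j] = sum;
  }
  t.stop();
  return t.elapsed() * 1e6;       // seconds converted to microseconds
}

writeln("local domain  took ", timeProduct(Space), " [us]");
writeln("Cyclic domain took ", timeProduct(CyclicSpace), " [us]");
```

Running it with different --size values and compiler flags (for example --fast) would give numbers comparable in spirit to the log above, though not necessarily the same ratios.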

In any case, the Chapel team's insights (on both design and testing) are important. @Brad has been asked for help in providing similar test coverage and comparisons for substantially larger sizes --size={1024 | 2048 | 4096 | 8192 | ...} and for "wider" NUMA platforms, with both multi-locale and many-locale setups, which are available at Cray for the Chapel team's R&D and which will not suffer from the hardware and ~ 60 [s] limits of the public, sponsored, shared TiO.RUN platform.

user3666197 14.12.2019
