Grasshopper

algorithmic modeling for Rhino

Python parallelization: sorting overhead outweighs performance benefits?

Hi friends,

I'm messing around with using the new parallel module to speed up processing of large data sets, nothing geometry-related at all. I get multi-CPU utilization when running a function on a very large list (think 8769 x ~3000 items), and the output is correct, but performance is lower than the single-threaded version.

So either a) I've messed up the parallel call, or b) my (uneducated) hunch is that, after the parallel run, putting that large an amount of data back into the right order takes so much time that it's not worth it. Is the latter possible? I have zero experience with parallelization.
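On the re-ordering hunch: a parallel map normally hands results back in input order, so no separate sort step is needed afterwards. A minimal sketch in plain CPython (the `process` function and data are made up for illustration, not taken from the actual definition):

```python
from concurrent.futures import ThreadPoolExecutor

def process(row):
    # stand-in for the real per-item work
    return sum(row)

data = [[i, i + 1, i + 2] for i in range(1000)]

# Executor.map yields results in the same order as the inputs,
# regardless of which thread finishes first.
with ThreadPoolExecutor() as pool:
    parallel_result = list(pool.map(process, data))

serial_result = [process(row) for row in data]
assert parallel_result == serial_result  # order is preserved
```

Note that in CPython, threads won't speed up CPU-bound pure-Python work because of the GIL; IronPython (which Grasshopper scripts run on) has no GIL, which is why a thread-based parallel module can help there at all.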

Edit: the weird thing is that this changes from canvas to canvas, and run times are not always the same upon toggling, either. I'm not sure what's going on.

Thanks!

Max


Replies to This Discussion

Hi Max

concurrency is not an easy topic, and cannot always be discussed in general terms. Yes, concurrency has a cost, and there are unfortunately several cases where the cost outweighs the benefit.

In short, a checklist:

- not everything can be parallelized, logically. Sorting algorithms do not usually lend themselves easily to parallelism, as a single item can require the whole set to be redistributed (though there are "special" sorting algorithms, like radix sort, that behave better)

- if something can be parallelized, then there might be a synchronization cost at the beginning, a race to fetch resources, a bottleneck during execution, or a queue to gather results, for example

- if any of these costs outweigh the benefit of parallelism, then the work to make the code multi-threaded might be futile

- you should be careful when using third-party libraries. Check that they allow parallelism and that they do not intrinsically synchronize calls (via a global interpreter lock, for example)
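One common way to cut the synchronization and queueing costs from the checklist above is to hand each worker one large contiguous chunk instead of one item per task, so the per-task overhead is paid only a handful of times. A small illustration (hypothetical work function, plain CPython):

```python
from concurrent.futures import ThreadPoolExecutor

def work_chunk(chunk):
    # one task per chunk instead of one task per item,
    # so scheduling/queueing overhead is paid n times, not len(data) times
    return [x * x for x in chunk]

data = list(range(10000))
n = 4  # number of workers
size = (len(data) + n - 1) // n
chunks = [data[i:i + size] for i in range(0, len(data), size)]

with ThreadPoolExecutor(max_workers=n) as pool:
    # map preserves chunk order, so flattening restores the original order
    out = [y for part in pool.map(work_chunk, chunks) for y in part]

assert out == [x * x for x in data]
```

Whether this wins in practice depends on how heavy the per-item work is relative to the per-task overhead.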

>> [time] changes from canvas to canvas and run times are not always the same [...]

- yes, this is because scheduling adds another variable when using multiple threads. The operating system gives each thread a time slice, and may move on to another thread on the same CPU at any moment after it expires. This means the rotation can happen in one order on one run and in a different order on the next, giving very different results.
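The scheduling effect described above is easy to observe: with several threads writing to a shared list, the interleaving of entries differs from run to run, even though each thread's own output stays in order. A small demonstration (plain CPython, illustrative only):

```python
import threading

results = []
lock = threading.Lock()

def worker(tid):
    for i in range(100):
        with lock:
            results.append((tid, i))

threads = [threading.Thread(target=worker, args=(t,)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The interleaving between threads varies from run to run;
# only the content and the per-thread order are deterministic.
assert sorted(results) == [(t, i) for t in range(4) for i in range(100)]
```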

Concurrency and multi-threading are very well-covered topics; if you want, you can read about them in several books, on MSDN, or on general programming websites.

I hope this helps,

Giulio
--
Giulio Piacentino
for Robert McNeel & Associates
giulio@mcneel.com

Hi Max,

I'm not entirely sure how the parallel approach recombines data into the correct order, but there's almost always overhead when doing multi-threading. It's not a panacea for long computation times, even though many people seem to think so. Sometimes the overhead is so large that it negates the benefit of performing some of the computations in parallel.

You should post this on our Discourse forum with the code. I think Steve will find it quicker there.

--

David Rutten

david@mcneel.com

In order to really give any help for this specific case, a sample would help a lot.

Dear Giulio, David and Steve,

thank you very much. From your replies I take it that overhead might be the issue, little as I know about it otherwise.

I've prepped a sample pack of the definition snippet in question and uploaded it here:

http://mrcomfy.org/releases/pack/Parallel_Daylight.zip

It includes two input data files you need to reference in the path components. What it basically does is bake scheduling into hourly simulation input data and return daily sub-nested lists for each day (365 per data point).

The two different parallel/non-parallel calls are the last in the "dschedule" component; the input functions for the parallel and non-parallel calls are identical apart from the nested loops removed in the parallel one. Printed output is identical, but run times for the parallel one are longer. I'm beginning to suspect I've made an obvious mistake somewhere.
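For reference, the day-chunking step described above (hourly annual data in, 365 daily lists out) can be sketched independently of the definition; the names and dummy values here are illustrative only:

```python
hours_per_day = 24
days_per_year = 365

# one year of hourly values for a single data point (dummy numbers)
hourly = [float(h) for h in range(hours_per_day * days_per_year)]  # 8760 values

# slice the flat hourly list into one sub-list per day
daily = [hourly[d * hours_per_day:(d + 1) * hours_per_day]
         for d in range(days_per_year)]

assert len(daily) == 365
assert all(len(day) == 24 for day in daily)
```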

Thanks,

M

Hi Max,

I haven't yet figured out why the parallel implementation runs at the same speed as the serial version, but I wanted to post an updated definition. I added caching to the file parsing code so you don't have to wait a minute every time you change something. I also simplified your test Python script. I still don't quite understand what your script is doing.
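The file-parse caching mentioned above could look something like the following sketch; this is an assumed pattern (keying the cache on path and modification time), not the code from the attached definition:

```python
import os
import tempfile

_cache = {}

def parse_file(path):
    # cache keyed on (path, mtime): re-parses only when the file changes
    key = (path, os.path.getmtime(path))
    if key not in _cache:
        with open(path) as f:
            _cache[key] = [line.strip().split(",") for line in f]
    return _cache[key]

# demo with a throwaway file
fd, tmp = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w") as f:
    f.write("1,2,3\n4,5,6\n")

rows = parse_file(tmp)
assert rows == [["1", "2", "3"], ["4", "5", "6"]]
assert parse_file(tmp) is rows  # second call served from the cache
os.remove(tmp)
```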

Attachments:

Thank you, Steve! I'll have a look at it soon and learn from it- and thanks for looking into the parallel issue again.

What that thing does is bake scheduling into parsed hourly daylight data, so I get daily chunks of data only, which are then passed to another component I haven't included in that slimmed-down definition, to generate monthly averages etc. The code is practically identical to the mrcomfy pack you can find on the mrcomfy.org site and does the same thing, only for daylight sensors rather than thermal zones.

Hi Max

I'm very interested to know if you got this to work. I'm looking for a definition to look at peaks, averages, number of hours above xxx lux per year during working time, etc., without doing it in GH components (they take me 20-50 seconds per iteration with 1000 points of 8760 yearly values).

Or at least a simple definition to cull hours outside working hours (which is what yours looks like it's doing, but I don't understand the output).

Best, Mathias
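For what it's worth, the kind of cull described in the question above (keep only working hours, count hours above a lux threshold per year) can be sketched in a few lines of Python; the working hours and threshold below are assumptions for illustration, not values from this thread:

```python
WORK_START, WORK_END = 8, 18   # assumed working hours (8:00-18:00)
LUX_THRESHOLD = 300.0          # example threshold, not from the thread

def hours_above(hourly, start=WORK_START, end=WORK_END, lux=LUX_THRESHOLD):
    # hourly: 8760 annual values, index 0 = hour 0 of day 1
    count = 0
    for h, value in enumerate(hourly):
        hour_of_day = h % 24
        if start <= hour_of_day < end and value > lux:
            count += 1
    return count

# toy year: a constant 500 lux in every hour
year = [500.0] * 8760
assert hours_above(year) == 365 * (WORK_END - WORK_START)  # 3650 working hours
```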
