python追加文件特别慢啊_性能：Python的大熊猫DataFrame.to_csv追加逐渐变得更慢

Initial Question:

I’m looping through a couple of thousand pickle files with Python Pandas DataFrames in it which vary in the number of rows (between aprox. 600 and 1300) but not in the number of collumns (636 to be exact). Then I transform them (exactly the same tranformations to each) and append them to a csv file using the DataFrame.to_csv() method.

The to_csv code excerpt:

if picklefile == ‘0000.p’:

dftemp.to_csv(finalnormCSVFile)

else:

dftemp.to_csv(finalnormCSVFile, mode=’a’, header=False)

What bothers me is that it starts off pretty fast but performance decreases exponentially, I kept a processing time log:

start: 2015-03-24 03:26:36.958058

2015-03-24 03:26:36.958058

count = 0

time: 0:00:00

2015-03-24 03:30:53.254755

count = 100

time: 0:04:16.296697

2015-03-24 03:39:16.149883

count = 200

time: 0:08:22.895128

2015-03-24 03:51:12.247342

count = 300

time: 0:11:56.097459

2015-03-24 04:06:45.099034

count = 400

time: 0:15:32.851692

2015-03-24 04:26:09.411652

count = 500

time: 0:19:24.312618

2015-03-24 04:49:14.519529

count = 600

time: 0:23:05.107877

2015-03-24 05:16:30.175175

count = 700

time: 0:27:15.655646

2015-03-24 05:47:04.792289

count = 800

time: 0:30:34.617114

2015-03-24 06:21:35.137891

count = 900

time: 0:34:30.345602

2015-03-24 06:59:53.313468

count = 1000

time: 0:38:18.175577

2015-03-24 07:39:29.805270

count = 1100

time: 0:39:36.491802

2015-03-24 08:20:30.852613

count = 1200

time: 0:41:01.047343

2015-03-24 09:04:14.613948

count = 1300

time: 0:43:43.761335

2015-03-24 09:51:45.502538

count = 1400

time: 0:47:30.888590

2015-03-24 11:09:48.366950

count = 1500

time: 1:18:02.864412

2015-03-24 13:02:33.152289

count = 1600

time: 1:52:44.785339

2015-03-24 15:30:58.534493

count = 1700

time: 2:28:25.382204

2015-03-24 18:09:40.391639

count = 1800

time: 2:38:41.857146

2015-03-24 21:03:19.204587

count = 1900

time: 2:53:38.812948

2015-03-25 00:00:05.855970

count = 2000

time: 2:56:46.651383

2015-03-25 03:53:05.020944

count = 2100

time: 3:52:59.164974

2015-03-25 05:02:16.534149

count = 2200

time: 1:09:11.513205

2015-03-25 06:07:32.446801

count = 2300

time: 1:05:15.912652

2015-03-25 07:13:45.075216

count = 2400

time: 1:06:12.628415

2015-03-25 08:20:17.927286

count = 2500

time: 1:06:32.852070

2015-03-25 09:27:20.676520

count = 2600

time: 1:07:02.749234

2015-03-25 10:35:01.657199

count = 2700

time: 1:07:40.980679

2015-03-25 11:43:20.788178

count = 2800

time: 1:08:19.130979

2015-03-25 12:53:57.734390

count = 2900

time: 1:10:36.946212

2015-03-25 14:07:20.936314

count = 3000

time: 1:13:23.201924

2015-03-25 15:22:47.076786

count = 3100

time: 1:15:26.140472

2015-03-25 19:51:10.776342

count = 3200

time: 4:28:23.699556

2015-03-26 03:06:47.372698

count = 3300

time: 7:15:36.596356

count = 3324

end of cycle: 2015-03-26 03:59:54.161842

end: 2015-03-26 03:59:54.161842

total duration: 2 days, 0:33:17.203784

Update #1:

I did as you suggested @Alexander but it has certainly to do with the to_csv() mehod:

start: 2015-03-26 05:18:25.948410

2015-03-26 05:18:25.948410

count = 0

time: 0:00:00

2015-03-26 05:20:30.425041

count = 100

time: 0:02:04.476631

2015-03-26 05:22:27.680582

count = 200

time: 0:01:57.255541

2015-03-26 05:24:26.012598

count = 300

time: 0:01:58.332016

2015-03-26 05:26:16.542835

count = 400

time: 0:01:50.530237

2015-03-26 05:27:58.063196

count = 500

time: 0:01:41.520361

2015-03-26 05:29:45.769580

count = 600

time: 0:01:47.706384

2015-03-26 05:31:44.537213

count = 700

time: 0:01:58.767633

2015-03-26 05:33:41.591837

count = 800

time: 0:01:57.054624

2015-03-26 05:35:43.963843

count = 900

time: 0:02:02.372006

2015-03-26 05:37:46.171643

count = 1000

time: 0:02:02.207800

2015-03-26 05:38:36.493399

count = 1100

time: 0:00:50.321756

2015-03-26 05:39:42.123395

count = 1200

time: 0:01:05.629996

2015-03-26 05:41:13.122048

count = 1300

time: 0:01:30.998653

2015-03-26 05:42:41.885513

count = 1400

time: 0:01:28.763465

2015-03-26 05:44:20.937519

count = 1500

time: 0:01:39.052006

2015-03-26 05:46:16.012842

count = 1600

time: 0:01:55.075323

2015-03-26 05:48:14.727444

count = 1700

time: 0:01:58.714602

2015-03-26 05:50:15.792909

count = 1800

time: 0:02:01.065465

2015-03-26 05:51:48.228601

count = 1900

time: 0:01:32.435692

2015-03-26 05:52:22.755937

count = 2000

time: 0:00:34.527336

2015-03-26 05:52:58.289474

count = 2100

time: 0:00:35.533537

2015-03-26 05:53:39.406794

count = 2200

time: 0:00:41.117320

2015-03-26 05:54:11.348939

count = 2300

time: 0:00:31.942145

2015-03-26 05:54:43.057281

count = 2400

time: 0:00:31.708342

2015-03-26 05:55:19.483600

count = 2500

time: 0:00:36.426319

2015-03-26 05:55:52.216424

count = 2600

time: 0:00:32.732824

2015-03-26 05:56:27.409991

count = 2700

time: 0:00:35.193567

2015-03-26 05:57:00.810139

count = 2800

time: 0:00:33.400148

2015-03-26 05:58:17.109425

count = 2900

time: 0:01:16.299286

2015-03-26 05:59:31.021719

count = 3000

time: 0:01:13.912294

2015-03-26 06:00:49.200303

count = 3100

time: 0:01:18.178584

2015-03-26 06:02:07.732028

count = 3200

time: 0:01:18.531725

2015-03-26 06:03:28.518541

count = 3300

time: 0:01:20.786513

count = 3324

end of cycle: 2015-03-26 06:03:47.321182

end: 2015-03-26 06:03:47.321182

total duration: 0:45:21.372772

And as requested, the source code:

import pickle

import pandas as pd

import numpy as np

from os import listdir

from os.path import isfile, join

from datetime import datetime

# Defining function to deep copy pandas data frame:

def very_deep_copy(self):

return pd.DataFrame(self.values.copy(), self.index.copy(), self.columns.copy())

# Adding function to Dataframe module:

pd.DataFrame.very_deep_copy = very_deep_copy

#Define Data Frame Header:

head = [

‘ConcatIndex’, ‘Concatenated String Index’, ‘FileID’, …, ‘Attribute’, ‘Attribute’

]

exclude = [

‘ConcatIndex’, ‘Concatenated String Index’, ‘FileID’, … , ‘Real URL Array’

]

path = “./dataset_final/”

pickleFiles = [ f for f in listdir(path) if isfile(join(path,f)) ]

finalnormCSVFile = ‘finalNormalizedDataFrame2.csv’

count = 0

start_time = datetime.now()

t1 = start_time

print(“start: ” + str(start_time) + “\n”)

for picklefile in pickleFiles:

if count%100 == 0:

t2 = datetime.now()

print(str(t2))

print(‘count = ‘ + str(count))

print(‘time: ‘ + str(t2 – t1) + ‘\n’)

t1 = t2

#DataFrame Manipulation:

df = pd.read_pickle(path + picklefile)

df[‘ConcatIndex’] = 100000*df.FileID + df.ID

for i in range(0, len(df)):

df.loc[i, ‘Concatenated String Index’] = str(df[‘ConcatIndex’][i]).zfill(10)

df.index = df.ConcatIndex

#DataFrame Normalization:

dftemp = df.very_deep_copy()

for string in head:

if string in exclude:

if string != ‘ConcatIndex’:

dftemp.drop(string, axis=1, inplace=True)

else:

if ‘Real ‘ in string:

max = pd.DataFrame.max(df[string.strip(‘Real ‘)])

elif ‘child’ in string:

max = pd.DataFrame.max(df[string.strip(‘child’)+’desc’])

else:

max = pd.DataFrame.max(df[string])

if max != 0:

dftemp[string] = dftemp[string]/max

dftemp.drop(‘ConcatIndex’, axis=1, inplace=True)

#Saving DataFrame in CSV:

if picklefile == ‘0000.p’:

dftemp.to_csv(finalnormCSVFile)

else:

dftemp.to_csv(finalnormCSVFile, mode=’a’, header=False)

count += 1

print(‘count = ‘ + str(count))

cycle_end_time = datetime.now()

print(“end of cycle: ” + str(cycle_end_time) + “\n”)

end_time = datetime.now()

print(“end: ” + str(end_time))

print(‘total duration: ‘ + str(end_time – start_time) + ‘\n’)

Update #2:

As suggested I executed the command %prun %run “./DataSetNormalization.py” for the first couple of hundred picklefiles and the result is as followed:

136373640 function calls (136342619 primitive calls) in 1018.769 seconds

Ordered by: internal time

ncalls tottime percall cumtime percall filename:lineno(function)

220 667.069 3.032 667.069 3.032 {method ‘close’ of ‘_io.TextIOWrapper’ objects}

1540 42.046 0.027 46.341 0.030 {pandas.lib.write_csv_rows}

219 34.886 0.159 34.886 0.159 {built-in method collect}

3520 16.782 0.005 16.782 0.005 {pandas.algos.take_2d_axis1_object_object}

78323 9.948 0.000 9.948 0.000 {built-in method empty}

25336892 9.645 0.000 12.635 0.000 {built-in method isinstance}

1433941 9.344 0.000 9.363 0.000 generic.py:1845(__setattr__)

221051/220831 7.387 0.000 119.767 0.001 indexing.py:194(_setitem_with_indexer)

723540 7.312 0.000 7.312 0.000 {method ‘reduce’ of ‘numpy.ufunc’ objects}

273414 7.137 0.000 20.642 0.000 internals.py:2656(set)

604245 6.846 0.000 6.850 0.000 {method ‘copy’ of ‘numpy.ndarray’ objects}

1760 6.566 0.004 6.566 0.004 {pandas.lib.isnullobj}

276274 5.315 0.000 5.315 0.000 {method ‘ravel’ of ‘numpy.ndarray’ objects}

1719244 5.264 0.000 5.266 0.000 {built-in method array}

1102450 5.070 0.000 29.543 0.000 internals.py:1804(make_block)

1045687 5.056 0.000 10.209 0.000 index.py:709(__getitem__)

1 4.718 4.718 1018.727 1018.727 DataSetNormalization.py:6()

602485 4.575 0.000 15.087 0.000 internals.py:2586(iget)

441662 4.562 0.000 33.386 0.000 internals.py:2129(apply)

272754 4.550 0.000 4.550 0.000 internals.py:1291(set)

220883 4.073 0.000 4.073 0.000 {built-in method charmap_encode}

4781222 3.805 0.000 4.349 0.000 {built-in method getattr}

52143 3.673 0.000 3.673 0.000 {built-in method truediv}

1920486 3.671 0.000 3.672 0.000 {method ‘get_loc’ of ‘pandas.index.IndexEngine’ objects}

1096730 3.513 0.000 8.370 0.000 internals.py:3035(__init__)

875899 3.508 0.000 14.458 0.000 series.py:134(__init__)

334357 3.420 0.000 3.439 0.000 {pandas.lib.infer_dtype}

2581268 3.419 0.000 4.774 0.000 {pandas.lib.values_from_object}

1102450 3.036 0.000 6.110 0.000 internals.py:59(__init__)

824856 2.888 0.000 45.749 0.000 generic.py:1047(_get_item_cache)

2424185 2.657 0.000 3.870 0.000 numeric.py:1910(isscalar)

273414 2.505 0.000 9.332 0.000 frame.py:2113(_sanitize_column)

1646198 2.491 0.000 2.880 0.000 index.py:698(__contains__)

879639 2.461 0.000 2.461 0.000 generic.py:87(__init__)

552988 2.385 0.000 4.451 0.000 internals.py:3565(_get_blkno_placements)

824856 2.349 0.000 51.282 0.000 frame.py:1655(__getitem__)

220831 2.224 0.000 21.670 0.000 internals.py:460(setitem)

326437 2.183 0.000 11.352 0.000 common.py:1862(_possibly_infer_to_datetimelike)

602485 2.167 0.000 16.974 0.000 frame.py:1982(_box_item_values)

602485 2.087 0.000 23.202 0.000 internals.py:2558(get)

770739 2.036 0.000 6.471 0.000 internals.py:1238(__init__)

276494 1.966 0.000 1.966 0.000 {pandas.lib.get_blkno_indexers}

10903876/10873076 1.935 0.000 1.972 0.000 {built-in method len}

220831 1.924 0.000 76.647 0.000 indexing.py:372(setter)

220 1.893 0.009 1.995 0.009 {built-in method load}

1920486 1.855 0.000 8.198 0.000 index.py:1173(get_loc)

112860 1.828 0.000 9.607 0.000 common.py:202(_isnull_ndarraylike)

602485 1.707 0.000 8.903 0.000 series.py:238(from_array)

875899 1.688 0.000 2.493 0.000 series.py:263(_set_axis)

3300 1.661 0.001 1.661 0.001 {method ‘tolist’ of ‘numpy.ndarray’ objects}

1102670 1.609 0.000 2.024 0.000 internals.py:108(mgr_locs)

4211850 1.593 0.000 1.593 0.000 {built-in method issubclass}

1335546 1.501 0.000 2.253 0.000 generic.py:297(_get_axis_name)

273414 1.411 0.000 37.866 0.000 frame.py:1994(__setitem__)

441662 1.356 0.000 7.884 0.000 indexing.py:982(_convert_to_indexer)

220831 1.349 0.000 131.331 0.001 indexing.py:95(__setitem__)

273414 1.329 0.000 23.170 0.000 generic.py:1138(_set_item)

326437 1.276 0.000 6.203 0.000 fromnumeric.py:2259(prod)

274734 1.271 0.000 2.113 0.000 shape_base.py:60(atleast_2d)

273414 1.242 0.000 34.396 0.000 frame.py:2072(_set_item)

602485 1.183 0.000 1.979 0.000 generic.py:1061(_set_as_cached)

934422 1.175 0.000 1.894 0.000 {method ‘view’ of ‘numpy.ndarray’objects}

1540 1.144 0.001 58.217 0.038 format.py:1409(_save_chunk)

220831 1.144 0.000 9.198 0.000 indexing.py:139(_convert_tuple)

441662 1.137 0.000 3.036 0.000 indexing.py:154(_convert_scalar_indexer)

220831 1.087 0.000 1.281 0.000 arrayprint.py:343(array2string)

1332026 1.056 0.000 3.997 0.000 generic.py:310(_get_axis)

602485 1.046 0.000 9.949 0.000 frame.py:1989(_box_col_values)

220 1.029 0.005 1.644 0.007 internals.py:2429(_interleave)

824856 1.025 0.000 46.777 0.000 frame.py:1680(_getitem_column)

1491578 1.022 0.000 2.990 0.000 common.py:58(_check)

782616 1.010 0.000 3.513 0.000 numeric.py:394(asarray)

290354 0.988 0.000 1.386 0.000 internals.py:1950(shape)

220831 0.958 0.000 15.392 0.000 generic.py:2101(copy)

273414 0.940 0.000 1.796 0.000 indexing.py:1520(_convert_to_index_sliceable)

220831 0.920 0.000 1.558 0.000 common.py:1110(_possibly_downcast_to_dtype)

220611 0.914 0.000 0.914 0.000 {pandas.lib.is_bool_array}

498646 0.906 0.000 0.906 0.000 {method ‘clear’ of ‘dict’ objects}

715345 0.848 0.000 13.083 0.000 common.py:132(_isnull_new)

452882 0.824 0.000 1.653 0.000 index.py:256(__array_finalize__)

602485 0.801 0.000 0.801 0.000 internals.py:208(iget)

52583 0.748 0.000 2.038 0.000 common.py:1223(_fill_zeros)

606005 0.736 0.000 6.755 0.000 internals.py:95(make_block_same_class)

708971 0.732 0.000 2.156 0.000 internals.py:3165(values)

1760378 0.724 0.000 0.724 0.000 internals.py:2025(_get_items)

109560 0.720 0.000 6.140 0.000 nanops.py:152(_get_values)

220831 0.718 0.000 11.017 0.000 internals.py:2395(copy)

924669 0.712 0.000 1.298 0.000 common.py:2248(_get_dtype_type)

1515796 0.698 0.000 0.868 0.000 {built-in method hasattr}

220831 0.670 0.000 4.299 0.000 internals.py:435(copy)

875899 0.661 0.000 0.661 0.000 series.py:285(_set_subtyp)

220831 0.648 0.000 0.649 0.000 {method ‘get_value’ of ‘pandas.index.IndexEngine’ objects}

452882 0.640 0.000 0.640 0.000 index.py:218(_reset_identity)

715345 0.634 0.000 1.886 0.000 {pandas.lib.isscalar}

1980 0.626 0.000 1.172 0.001 internals.py:3497(_merge_blocks)

220831 0.620 0.000 2.635 0.000 common.py:1933(_is_bool_indexer)

272754 0.608 0.000 0.899 0.000 internals.py:1338(should_store)

220831 0.599 0.000 3.463 0.000 series.py:482(__getitem__)

498645 0.591 0.000 1.497 0.000 generic.py:1122(_clear_item_cache)

1119390 0.584 0.000 1.171 0.000 index.py:3936(_ensure_index)

220831 0.573 0.000 1.883 0.000 index.py:222(view)

814797 0.555 0.000 0.905 0.000 internals.py:3086(_values)

52583 0.543 0.000 15.545 0.000 ops.py:469(wrapper)

220831 0.536 0.000 3.760 0.000 internals.py:371(_try_cast_result)

228971 0.533 0.000 0.622 0.000 generic.py:1829(__getattr__)

769651 0.528 0.000 0.528 0.000 {built-in method min}

224351 0.509 0.000 2.030 0.000 generic.py:1099(_maybe_update_cacher)

…

I will rerun it for confirmation but looks like it certainly has something to do with pandas’ to_csv() method, because most of the run time is used on io and the csv writer. Why is it having this effect? Any suggestions?

Update #3:

Well, I did a full %prun test and indeed almost 90% of the time spent is used on {method ‘close’ of ‘_io.TextIOWrapper’ objects}. So I guess here’s the problem… What do you guys think?

My questions here are:

What originates here the decrease in performance?

Does pandas.DataFrames.to_csv() append mode load the whole file each time it writes to it?

Is there a way to enhance the process?

解决方案

In these kind of situation you should profile your code (to see which function calls are taking the most time), that way you can check empirically that it is indeed slow in the read_csv rather than elsewhere…

From looking at your code: Firstly there’s a lot of copying here and a lot of looping (not enough vectorization)… everytime you see looping look for a way to remove it. Secondly, when you use things like zfill, I wonder if you want to_fwf (fixed width format) rather than to_csv?

Some sanity testing: Are some files are significantly bigger than others (which could lead to you hitting swap)? Are you sure the largest files are only 1200 rows?? Have your checked this? e.g. using wc -l.

IMO I think it unlikely to be garbage collection.. (as was suggested in the other answer).

Here are a few improvements on your code, which should improve the runtime.

Columns are fixed I would extract the column calculations and vectorize the real, child and other normalizations. Use apply rather than iterating (for zfill).

columns_to_drop = set(head) & set(exclude) # maybe also – [‘ConcatIndex’]

remaining_cols = set(head) – set(exclude)

real_cols = [r for r in remaining_cols if ‘Real ‘ in r]

real_cols_suffix = [r.strip(‘Real ‘) for r in real]

remaining_cols = remaining_cols – real_cols

child_cols = [r for r in remaining_cols if ‘child’ in r]

child_cols_desc = [r.strip(‘child’+’desc’) for r in real]

remaining_cols = remaining_cols – child_cols

for count, picklefile in enumerate(pickleFiles):

if count % 100 == 0:

t2 = datetime.now()

print(str(t2))

print(‘count = ‘ + str(count))

print(‘time: ‘ + str(t2 – t1) + ‘\n’)

t1 = t2

#DataFrame Manipulation:

df = pd.read_pickle(path + picklefile)

df[‘ConcatIndex’] = 100000*df.FileID + df.ID

# use apply here rather than iterating

df[‘Concatenated String Index’] = df[‘ConcatIndex’].apply(lambda x: str(x).zfill(10))

df.index = df.ConcatIndex

#DataFrame Normalization:

dftemp = df.very_deep_copy() # don’t *think* you need this

# drop all excludes

dftemp.drop(columns_to_drop), axis=1, inplace=True)

# normalize real cols

m = dftemp[real_cols_suffix].max()

m.index = real_cols

dftemp[real_cols] = dftemp[real_cols] / m

# normalize child cols

m = dftemp[child_cols_desc].max()

m.index = child_cols

dftemp[child_cols] = dftemp[child_cols] / m

# normalize remaining

remaining = list(remaining – child)

dftemp[remaining] = dftemp[remaining] / dftemp[remaining].max()

# if this case is important then discard the rows of m with .max() is 0

#if max != 0:

# dftemp[string] = dftemp[string]/max

# this is dropped earlier, if you need it, then subtract [‘ConcatIndex’] from columns_to_drop

# dftemp.drop(‘ConcatIndex’, axis=1, inplace=True)

#Saving DataFrame in CSV:

if picklefile == ‘0000.p’:

dftemp.to_csv(finalnormCSVFile)

else:

dftemp.to_csv(finalnormCSVFile, mode=’a’, header=False)

As a point of style I would probably choose to wrap each of these parts into functions, this will also mean more things can be gc’d if that really was the issue…

Another options which would be faster is to use pytables (HDF5Store) if you didn’t need to resulting output to be csv (but I expect you do)…

The best thing to do by far is to profile your code. e.g. with %prun in ipython e.g. see http://pynash.org/2013/03/06/timing-and-profiling.html. Then you can see it definitely is read_csv and specifically where (which line of your code and which lines of pandas code).

Ah ha, I’d missed that you are appending all these to a single csv file. And in your prun it shows most of the time is spent in close, so let’s keep the file open:

# outside of the for loop (so the file is opened and closed only once)

f = open(finalnormCSVFile, ‘w’)

…

for picklefile in …

if picklefile == ‘0000.p’:

dftemp.to_csv(f)

else:

dftemp.to_csv(f, mode=’a’, header=False)

…

f.close()

Each time the file is opened before it can append to, it needs to seek to the end before writing, it could be that this is the expensive (I don’t see why this should be that bad, but keeping it open removes the need to do this).

原文链接：https://blog.csdn.net/weixin_39879674/article/details/111843290

你可能也喜欢