pept.utilities.read_csv_chunks

pept.utilities.read_csv_chunks(filepath_or_buffer, chunksize, skiprows=None, nrows=None, dtype=<class 'float'>, sep='\\s+', header=None, engine='c', na_filter=False, quoting=3, memory_map=True, **kwargs)

Read chunks of data from a file lazily, returning numpy arrays of the values.

This function returns a generator, an object that can be iterated over only once, creating data on demand. Chunks of data are therefore read only when accessed, making this a more efficient alternative to read_csv for large files (over 1,000,000 lines).

A more convenient and feature-complete alternative is pept.utilities.ChunkReader, which is reusable and can access out-of-order chunks using subscript notation (e.g. data[0]).
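As a brief sketch of the difference (the file name is illustrative, and the ChunkReader constructor is assumed to take the same filepath and chunksize arguments):

    from pept.utilities import ChunkReader, read_csv_chunks

    # Generator: chunks can only be consumed once, in order.
    chunks = read_csv_chunks("pept_data.csv", chunksize = 50_000)
    first = next(chunks)            # reads the first chunk from disk

    # ChunkReader: chunks are reusable and can be accessed out of order.
    data = ChunkReader("pept_data.csv", chunksize = 50_000)
    third = data[2]                 # reads the third chunk directly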

This is a convenience function that’s simply a proxy to pandas.read_csv, configured with default parameters for fast reading and parsing of usual PEPT data.

Most importantly, it lazily reads chunks of chunksize lines from a space-separated values file at filepath_or_buffer, optionally skipping skiprows lines and reading in at most nrows lines. It yields numpy.ndarray chunks containing float values.
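For example, a minimal sketch of reading a large PEPT data file in chunks (the file name, chunk size and number of skipped lines are illustrative):

    from pept.utilities import read_csv_chunks

    # Lazily read a space-separated file in chunks of 100,000 lines,
    # skipping the first 15 lines (e.g. a textual header).
    chunks = read_csv_chunks(
        "pept_data.csv",
        chunksize = 100_000,
        skiprows = 15,
    )

    # Each iteration reads one chunk from disk and yields a numpy.ndarray.
    for chunk in chunks:
        print(chunk.shape, chunk.dtype)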

The parameters below are passed to pandas.read_csv with no further parsing; their descriptions are taken from the pandas documentation.

Parameters
filepath_or_bufferstr, path object or file-like object

Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handler (e.g. via builtin open function) or StringIO.

chunksizeint

Number of lines to read in each chunk of data.

skiprowslist-like, int or callable, optional

Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

nrowsint, optional

Number of rows of file to read. Useful for reading pieces of large files.

dtypeType name, default float

Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’}.

sepstr, default "\s+"

Delimiter to use. Separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine.

headerint, list of int, “infer”, optional

Row number(s) to use as the column names, and the start of the data. By default, no header is assumed to be present (i.e. header = None).

engine{‘c’, ‘python’}, default “c”

Parser engine to use. The C engine is faster, while the Python engine is currently more feature-complete.

na_filterbool, default False

Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file.

quotingint or csv.QUOTE_* instance, default csv.QUOTE_NONE

Control field quoting behavior per csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).

memory_mapbool, default True

If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.

**kwargsoptional

Extra keyword arguments that will be passed to pandas.read_csv.
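For instance, a small sketch forwarding the standard pandas.read_csv keyword usecols to read only the first four columns of each chunk (the file name is illustrative):

    from pept.utilities import read_csv_chunks

    # `usecols` is forwarded to pandas.read_csv, so each yielded
    # numpy.ndarray contains only columns 0-3.
    chunks = read_csv_chunks(
        "pept_data.csv",
        chunksize = 100_000,
        usecols = [0, 1, 2, 3],
    )

    for chunk in chunks:
        print(chunk.shape)          # (n_lines_in_chunk, 4)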