pept.utilities.ChunkReader#
- class pept.utilities.ChunkReader(filepath_or_buffer, chunksize, skiprows=None, nrows=None, dtype=<class 'float'>, sep='\\s+', header=None, engine='c', na_filter=False, quoting=3, memory_map=True, **kwargs)[source]#
Bases: object
Class for fast, on-demand reading / parsing and iteration over chunks of data from CSV files.
This is an abstraction above pandas.read_csv for easy and fast iteration over chunks of data from a CSV file. The chunks can be accessed using normal iteration (for chunk in reader: …) and subscripting (reader[0]).
The chunks are read lazily, only upon access. It is therefore a more efficient alternative to read_csv for large files (> 1,000,000 lines). For convenience, this class configures some default parameters for pandas.read_csv for fast reading and parsing of usual PEPT data.
Most importantly, it reads chunks containing chunksize lines from a space-separated values file at filepath_or_buffer, optionally skipping skiprows lines and reading in at most nrows lines. It returns numpy.ndarray instances with float values.
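The lazy, on-demand chunking described above can be sketched with the standard library alone. This is a simplified illustration, not pept's actual implementation (which delegates to pandas.read_csv); the read_chunks generator below is a hypothetical name:

```python
import itertools

def read_chunks(path, chunksize, skiprows = 0, nrows = None):
    '''Lazily yield lists of at most `chunksize` parsed rows from a
    space-separated values file, skipping `skiprows` lines first and
    reading at most `nrows` lines in total.'''
    with open(path) as f:
        stop = None if nrows is None else skiprows + nrows
        lines = itertools.islice(f, skiprows, stop)
        while True:
            # Parse up to `chunksize` lines; stop when the file is exhausted.
            chunk = [
                [float(x) for x in line.split()]
                for line in itertools.islice(lines, chunksize)
            ]
            if not chunk:
                return
            yield chunk
```

Because each chunk is only parsed when the generator is advanced, memory usage stays proportional to chunksize rather than to the file size.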
- Raises
IndexError
Upon access to a non-existent chunk using subscript notation (i.e. data[100] when there are 50 chunks).
See also
pept.utilities.read_csv
Fast CSV file reading into numpy arrays.
pept.LineData
Encapsulate LoRs for ease of iteration and plotting.
pept.PointData
Encapsulate points for ease of iteration and plotting.
Examples
Say “data.csv” contains 1_000_000 lines of data. Read chunks of 10_000 lines at a time, skipping the first 100_000:
>>> from pept.utilities import ChunkReader
>>> chunks = ChunkReader("data.csv", 10_000, skiprows = 100_000)
>>> len(chunks)         # 90 chunks
>>> chunks.file_lines   # 1_000_000
Normal iteration:
>>> for chunk in chunks:
...     ...  # neat operations
Access a single chunk using subscripting:
>>> chunks[0]    # First chunk
>>> chunks[-1]   # Last chunk
>>> chunks[100]  # IndexError
- Attributes
- filepath_or_buffer : str, path object or file-like object
Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handler (e.g. via builtin open function) or StringIO.
- number_of_chunks : int
The number of chunks (also returned when using the len method), taking into account the lines skipped (skiprows), the number of lines in the file (file_lines) and the maximum number of lines to be read (nrows).
- file_lines : int
The number of lines in the file pointed at by filepath_or_buffer.
- chunksize : int
The number of lines in a chunk of data.
- skiprows : int
The number of lines to be skipped at the beginning of the file.
- nrows : int
The maximum number of lines to be read. Only has an effect if it is less than file_lines - skiprows. For example, if a file has 10 lines and skiprows = 5 and chunksize = 5, even if nrows were to be 20, the number_of_chunks should still be 1.
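The worked example above (a 10-line file with skiprows = 5, chunksize = 5 and nrows = 20 yielding a single chunk) follows from simple arithmetic. The helper below is a hypothetical sketch of that calculation, not pept's actual code:

```python
import math

def compute_number_of_chunks(file_lines, chunksize, skiprows = 0, nrows = None):
    # Lines actually available after skipping the start of the file.
    available = max(file_lines - skiprows, 0)
    # nrows only has an effect if it is less than file_lines - skiprows.
    if nrows is not None:
        available = min(available, nrows)
    # Each chunk holds at most `chunksize` lines; the last may be partial.
    return math.ceil(available / chunksize)
```

With the numbers from the Examples section, 1_000_000 file lines read in chunks of 10_000 after skipping 100_000 gives the 90 chunks quoted there.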
- __init__(filepath_or_buffer, chunksize, skiprows=None, nrows=None, dtype=<class 'float'>, sep='\\s+', header=None, engine='c', na_filter=False, quoting=3, memory_map=True, **kwargs)[source]#
ChunkReader class constructor.
- Parameters
- filepath_or_buffer : str, path object or file-like object
Any valid string path to a local file is acceptable. If you want to read in lines from an online location (i.e. using a URL), you should use pept.utilities.read_csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handler (e.g. via builtin open function) or StringIO.
- chunksize : int
Number of lines read in a chunk of data.
- skiprows : list-like, int or callable, optional
Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
- nrows : int, optional
Number of rows of file to read. Useful for reading pieces of large files.
- dtype : type name, default float
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}.
- sep : str, default '\s+'
Delimiter to use. Separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine.
- header : int, list of int, 'infer', optional
Row number(s) to use as the column names, and the start of the data. By default assume there is no header present (i.e. header = None).
- engine : {'c', 'python'}, default 'c'
Parser engine to use. The C engine is faster while the python engine is currently more feature-complete.
- na_filter : bool, default False
Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter = False can improve the performance of reading a large file.
- quoting : int or csv.QUOTE_* instance, default csv.QUOTE_NONE
Control field quoting behavior per csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
- memory_map : bool, default True
If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.
- **kwargs : optional
Extra keyword arguments that will be passed to pandas.read_csv.
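The quoting constants and the default sep regex used above come straight from Python's standard library, which can be checked directly (stdlib only; the names below are the real csv and re module APIs):

```python
import csv
import re

# The integers accepted by `quoting` are the csv.QUOTE_* constants.
quote_constants = {
    "QUOTE_MINIMAL": csv.QUOTE_MINIMAL,        # 0
    "QUOTE_ALL": csv.QUOTE_ALL,                # 1
    "QUOTE_NONNUMERIC": csv.QUOTE_NONNUMERIC,  # 2
    "QUOTE_NONE": csv.QUOTE_NONE,              # 3
}

# sep = '\s+' is a regular expression matching one or more whitespace
# characters, so runs of spaces and tabs all act as a single delimiter.
fields = re.split(r"\s+", "130.2  45.1\t2.0")
```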
- Raises
- EOFError
If skiprows >= number_of_lines.
Methods
__init__(filepath_or_buffer, chunksize[, ...])
ChunkReader class constructor.
Attributes
- property number_of_chunks#
- property file_lines#
- property chunksize#
- property skiprows#
- property nrows#